hparser document

Date: 2020-05-30

github : https://github.com/chloro-pn/...html

This article serves as the documentation for hparser and is divided into three parts.

The hparser query interface

Use the hparser class to parse an XHTML file:

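#include "hparser.h"
#include <exception>
#include <iostream>
#include <string>

int main() {
  // content is a std::string holding UTF-8 encoded XHTML text.
  // The value below is only a placeholder document.
  std::string content = "<html></html>";
  // Parsing happens in the constructor; on failure it throws parser_error.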
  try {
    hparser doc(content);
  }
  catch(const std::exception& e) {
    std::cout << e.what() << std::endl;
  }
}

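The hparser class provides the following interface functions:
string_type global_notes() const;
Returns the text that sits outside the top-level tag, such as comments and the DOCTYPE declaration.

std::shared_ptr<element_type> get_root() const;
Returns the top-level element of the XHTML text under the DOM model, i.e. the <html>...</html> element.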

std::vector<std::shared_ptr<element_type>> find_tag(std::string str) const;

std::vector<std::shared_ptr<element_type>> find_attr(std::string str) const;

std::vector<std::shared_ptr<element_type>> find_content(std::string str) const;

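These three query functions search the XHTML text by tag, attribute, and content respectively, and return the set of elements that match. The std::string argument may be a regex pattern; matching is done internally with std::regex. It is best to make sure the XHTML is ASCII, because std::regex over char works byte by byte rather than code point by code point. For more on std::regex and Unicode, see: https://www.zhihu.com/questio...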
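To make the ASCII advice concrete, here is a minimal sketch that uses only the standard library (it does not call hparser). The helper full_match and the constant kZhong are names introduced for this illustration. It shows that std::regex sees a UTF-8 string as a sequence of bytes, so a single "." cannot match one multi-byte character.

```cpp
#include <regex>
#include <string>

// Illustration only: returns true when the whole string matches the pattern,
// using std::regex over raw bytes (char), the way a std::string-based
// query would.
bool full_match(const std::string& text, const std::string& pattern) {
  return std::regex_match(text, std::regex(pattern));
}

// U+4E2D encoded as UTF-8: one code point, three bytes.
const std::string kZhong = "\xE4\xB8\xAD";
```

With these definitions, full_match("div", "d.v") succeeds, while full_match(kZhong, ".") fails and full_match(kZhong, "...") succeeds, because the pattern is applied to the three bytes rather than to the single character.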
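If you do need to handle characters outside ASCII with regex matching, or if the query functions above are not expressive enough, use the following interface: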

std::vector<std::shared_ptr<element_type>> find(std::function<bool(std::shared_ptr<element_type> each)> func) const;

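find takes a callable whose parameter is a shared pointer to element_type. Inspect the element and return true if it matches your query, false otherwise. Inside the callable you can also convert the element's text from UTF-8 to whatever encoding you need, and then decide the match with a regular expression.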

auto check_func = [](std::shared_ptr<element_type> node)->bool {
  std::u32string tmp = encode_cast(node->content());
  bool found = regex_by_encode(tmp, yourpattern);
  return found;
};
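// h is an hparser object.
auto result = h.find(check_func);

The code above is not valid C++; it is pseudocode for illustration only (encode_cast, regex_by_encode, and yourpattern are placeholders).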

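The element_type type

element_type represents the element concept of XML/HTML under the Document Object Model (DOM). Elements form a tree; each element has a parent (except the root), optional children, a tag, text content, and so on. The type provides the following interface: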

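// Public kv_type: the type of an element attribute, a key-value pair.
// Objects of this type are guaranteed to have publicly accessible data members key_ and value_.
using kv_type = inner_kv_type;

// Returns this element's tag.
string_type tag() const;

// Returns the content inside this element, excluding the content of child elements.
string_type content() const;

// Returns the number of attributes.
size_t attrs_size() const;

// Returns the number of child elements.
size_t childs_size() const;

// Returns the attribute at index; index starts at 0.
kv_type get_attr(size_t index) const;

// Returns the child element at index; index starts at 0.
std::shared_ptr<hparser::element_type> get_child(size_t index) const;

// Returns all attributes.
std::vector<kv_type> get_all_attrs() const;

// Returns all child elements.
std::vector<std::shared_ptr<hparser::element_type>> get_all_childs() const;

// operator[] overload: looks up an attribute value by key; returns the empty value "" if the key does not exist.
string_type operator[](string_type str) const;

// Tells whether this element is the root element.
bool root() const;

// Returns this element's parent; returns an empty std::shared_ptr if this element is the root.
std::shared_ptr<hparser::element_type> parent() const;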

Combining the find interface with element_type gives you flexible, powerful queries; almost any query condition can be expressed. For example:

auto check_func = [](std::shared_ptr<element_type> node)->bool {
  if(node->root() == false && node->parent()->tag() == u8"div") {
    if(node->tag() == u8"a" && (*node)[u8"href"] != u8"" && node->attrs_size() == 1) {
      return true;
    }
  }
  return false;
};
auto result = h.find(check_func);

This query returns elements whose parent's tag is div, whose own tag is a, and whose only attribute is "href" with a non-empty value.
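To show how a predicate-driven query of this kind can work, the following self-contained sketch runs the same condition over a tiny stand-in tree. Node, make_node, find_all, is_plain_link_in_div, and count_plain_links_in_div are all hypothetical names invented for this illustration; they are not hparser's classes or its actual implementation, which may traverse the tree differently.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for an element; NOT hparser's element_type.
struct Node {
  std::string tag;
  std::map<std::string, std::string> attrs;
  std::weak_ptr<Node> parent;  // empty for the root
  std::vector<std::shared_ptr<Node>> children;
};

using NodePtr = std::shared_ptr<Node>;

// Creates a node and links it under parent (if one is given).
NodePtr make_node(const std::string& tag, const NodePtr& parent = nullptr) {
  auto node = std::make_shared<Node>();
  node->tag = tag;
  if (parent) {
    node->parent = parent;
    parent->children.push_back(node);
  }
  return node;
}

// Depth-first, pre-order walk that collects every node accepted by pred.
void find_all(const NodePtr& node,
              const std::function<bool(const NodePtr&)>& pred,
              std::vector<NodePtr>& out) {
  if (!node) return;
  if (pred(node)) out.push_back(node);
  for (const auto& child : node->children) find_all(child, pred, out);
}

// Same condition as the query above: the parent is <div>, the tag is <a>,
// and a non-empty href is the only attribute.
bool is_plain_link_in_div(const NodePtr& node) {
  auto parent = node->parent.lock();
  if (!parent || parent->tag != "div") return false;
  auto href = node->attrs.find("href");
  return node->tag == "a" && node->attrs.size() == 1 &&
         href != node->attrs.end() && !href->second.empty();
}

// Builds <html><div><a href/><a href class/></div><p><a href/></p></html>
// and returns how many nodes satisfy the predicate.
std::size_t count_plain_links_in_div() {
  auto html = make_node("html");
  auto div = make_node("div", html);
  auto p = make_node("p", html);
  auto a1 = make_node("a", div);
  a1->attrs["href"] = "https://example.com";
  auto a2 = make_node("a", div);
  a2->attrs["href"] = "https://example.org";
  a2->attrs["class"] = "nav";  // second attribute, so this one is rejected
  auto a3 = make_node("a", p);  // parent is <p>, so this one is rejected
  a3->attrs["href"] = "https://example.net";
  std::vector<NodePtr> hits;
  find_all(html, is_plain_link_in_div, hits);
  return hits.size();
}
```

Only the first link qualifies: the second has an extra attribute and the third sits under a p element. The parent pointer is a std::weak_ptr here to avoid a reference cycle between parent and child; whether hparser does the same is not stated in this document.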
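See how powerful the find interface is? You can build any query condition from the information element_type exposes. Next comes part three, encodings and regex matching, which extends what find can do even further :)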

The utf8_to_utf32/utf32_to_utf8 interface

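Handling encodings in C++ is hard. std::string in the standard library is just a char array that carries no encoding information, whereas sensible regex matching should work code point by code point. So std::regex handles ASCII fine but falls short for anything beyond it. The standard library also offers wchar_t and std::wregex, but the size of wchar_t differs between platforms. That means we cannot encode text by hand and store it in wchar_t, since its size is not fixed. On top of that, the standard components for converting between std::string and std::wstring (std::wstring_convert and the <codecvt> facets) were deprecated in C++17. std::string and std::wstring are now effectively two separate worlds, leaving no recommended standard way to convert between them.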
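As a small illustration of the point that std::string knows nothing about encodings, the sketch below counts code points in UTF-8 text by hand. count_code_points is a helper written for this example, not part of hparser or the standard library, and it assumes the input is already valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Counts Unicode code points in UTF-8 text by skipping continuation bytes
// (bytes of the form 10xxxxxx). Assumes the input is valid UTF-8.
std::size_t count_code_points(const std::string& utf8) {
  std::size_t count = 0;
  for (unsigned char byte : utf8) {
    if ((byte & 0xC0) != 0x80) ++count;
  }
  return count;
}
```

For the string "h\xC3\xA9llo" (the word "héllo" in UTF-8), size() reports 6 bytes while count_code_points reports 5 characters. Any byte-oriented tool, std::regex included, works with the first number rather than the second.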

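C++11 introduced two new character types, char16_t and char32_t, whose sizes are well defined (at least 16 and 32 bits). That looked like the dawn: convert all external input to UTF-32 internally, run regex matching through std::basic_regex<char32_t>, and since every Unicode code point so far fits in a single char32_t, all would be well. However, std::basic_regex does not support char32_t out of the box; you would have to implement std::regex_traits<char32_t> yourself (char32_t is only a wider storage unit and carries no encoding information, and in my view regex handling should be built on top of a concrete encoding). See: https://stackoverflow.com/que...

Sensibly, I stopped digging into this bottomless pit and instead provide UTF-8/UTF-32 conversion functions. If you have a regex library that supports UTF-32, you can convert with these functions and then do the regex matching.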

// Converts c to native byte order; en is the current byte order of c.
// endian is an enum class whose objects can be set to
// endian::little_endian or endian::big_endian.
char32_t endian_cast(char32_t c, endian en);

// Converts the UTF-8 text in src to UTF-32; en is the byte order of the output, native by default.
size_t utf8_to_utf32(const std::string& src, std::u32string& dst, endian en = local_endian().get());

// Converts the UTF-32 text in src to UTF-8; en is the byte order of the input, native by default.
size_t utf32_to_utf8(const std::u32string& src, std::string& dst, endian en = local_endian().get());

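Both functions return the index of the first piece of input that failed to convert. On success, return_index == src.size(); otherwise src[return_index] and everything after it was not converted. In short, inside your query function you can convert UTF-8 to UTF-32 and then match with a regex library that supports char32_t.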
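The return-value convention described above can be sketched with a small standalone decoder. decode_utf8 is a hypothetical function written for this illustration; it is not hparser's utf8_to_utf32, it always produces native byte order, and the real function may treat malformed input differently.

```cpp
#include <cstddef>
#include <string>

// Hypothetical decoder, NOT hparser's implementation. Appends code points to
// dst in native byte order and returns the index of the first byte that could
// not be decoded, or src.size() when everything was converted.
std::size_t decode_utf8(const std::string& src, std::u32string& dst) {
  std::size_t i = 0;
  while (i < src.size()) {
    const unsigned char lead = static_cast<unsigned char>(src[i]);
    std::size_t len = 0;
    char32_t cp = 0;
    if (lead < 0x80)                { len = 1; cp = lead; }
    else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
    else return i;  // stray continuation byte or invalid lead byte

    if (i + len > src.size()) return i;  // truncated sequence

    for (std::size_t k = 1; k < len; ++k) {
      const unsigned char next = static_cast<unsigned char>(src[i + k]);
      if ((next & 0xC0) != 0x80) return i;  // not a continuation byte
      cp = (cp << 6) | (next & 0x3F);
    }

    // Reject overlong forms, surrogates, and values beyond U+10FFFF.
    static const char32_t kMin[] = {0, 0, 0x80, 0x800, 0x10000};
    if (cp < kMin[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
      return i;
    }

    dst.push_back(cp);
    i += len;
  }
  return i;  // equals src.size() on full success
}
```

A caller checks the result against src.size(): if they are equal, dst holds the whole text; otherwise dst holds only the code points decoded before the failing index.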
