BeautifulSoup can use lxml as its default parser, and lxml can also be used on its own. Here is a comparison of the two:
- BeautifulSoup and lxml work on different principles. BeautifulSoup is DOM-based: it loads the whole document and parses the entire DOM tree, so its time and memory overhead are considerably larger. lxml, by contrast, queries and processes HTML/XML documents with XPath and only traverses the parts it needs, so it is faster. Fortunately, BeautifulSoup can now use lxml as its default parsing library.
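For example, a minimal sketch of combining the two, passing `"lxml"` to BeautifulSoup as its underlying parser (assumes the `beautifulsoup4` and `lxml` packages are installed):

```python
from bs4 import BeautifulSoup

# "lxml" tells BeautifulSoup to use lxml as its underlying parser,
# which is faster than the built-in html.parser.
soup = BeautifulSoup("<p class='title'><b>Hello</b></p>", "lxml")
print(soup.p.b.text)  # Hello
```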
- For details on XPath syntax, see: https://www.cnblogs.com/guguobao/p/9401643.html
Example:
```python
# coding: utf-8
from lxml import etree

html_str = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

html = etree.HTML(html_str)    # parse the (incomplete) HTML
result = etree.tostring(html)  # serialize it back out
print(result)
```
Notice that the closing `</body>` and `</html>` tags are missing at the end of html_str, but `etree.tostring(html)` repairs the HTML automatically.
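To see the repair in isolation, a tiny sketch with deliberately unclosed tags:

```python
from lxml import etree

broken = "<html><body><p>an unclosed paragraph"
# etree.HTML tolerates broken markup; tostring emits well-formed HTML
fixed = etree.tostring(etree.HTML(broken))
print(fixed)  # lxml appends the missing </p>, </body> and </html>
```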
Besides parsing strings, lxml can also read an HTML file directly. Suppose html_str has been saved as index.html; it can then be parsed with the parse method. Note that `etree.parse` defaults to the strict XML parser, so an `HTMLParser` must be passed for sloppy HTML:

```python
from lxml import etree

html = etree.parse('index.html', etree.HTMLParser())
result = etree.tostring(html, pretty_print=True)
print(result)
```
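A self-contained sketch of the file-based approach (the index.html file is created on the fly purely for the demo):

```python
from lxml import etree

# create a small, deliberately incomplete HTML file for the demo
with open('index.html', 'w') as f:
    f.write("<html><body><p>hello from a file")

# HTMLParser tolerates and repairs the incomplete markup
tree = etree.parse('index.html', etree.HTMLParser())
print(etree.tostring(tree, pretty_print=True))
```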
Next, use an XPath expression to extract the URLs from the HTML:
```python
html = etree.HTML(html_str)
# select the href attribute of every element with class="sister"
urls = html.xpath(".//*[@class='sister']/@href")
print(urls)
```
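XPath can just as easily pull out other parts of the document, for instance the link text rather than the href attributes. A sketch reusing a shortened html_str:

```python
from lxml import etree

html_str = """<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>"""

html = etree.HTML(html_str)
# text() selects the text node inside each matching <a> element
names = html.xpath("//a[@class='sister']/text()")
print(names)  # ['Elsie', 'Lacie', 'Tillie']
```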