Library under test: lxml. Test URL: http://www.sxchxx.com/index-13-1075-1.html
I like to parse web pages with XPath, but quite often the result turns out to be an empty list.
    from lxml import etree
    import requests

    url = 'http://www.sxchxx.com/index-13-1075-1.html'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36',
    }
    # Download the live page and parse the response text
    response = requests.get(url, headers=headers)
    parser = etree.HTMLParser(encoding="utf-8")
    html = etree.HTML(response.text, parser=parser)
    # Note: this rule has no tbody step
    resu = html.xpath('//*[@id="large_mid"]/table[2]/tr[3]/td/p//text()')
    print(resu)
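Parsing the page with the code above successfully returns the body text.
Notice, however, that the XPath rule contains no tbody tag. On the contrary, adding a tbody step turns the result into an empty list: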
    html.xpath('//*[@id="large_mid"]/table[2]/tbody/tr[3]/td/p//text()')  # this returns an empty list
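The same behavior can be reproduced offline with a small in-memory snippet. This is a minimal sketch: `RAW_HTML` is a hypothetical stand-in for the live page (not its real markup), and it assumes the page source, like this snippet, contains no tbody tags.

```python
from lxml import etree

# Hypothetical markup standing in for the live page: no <tbody> anywhere.
RAW_HTML = b"""
<div id="large_mid">
  <table><tr><td>header</td></tr></table>
  <table>
    <tr><td>row 1</td></tr>
    <tr><td>row 2</td></tr>
    <tr><td><p>body <b>text</b></p></td></tr>
  </table>
</div>
"""

parser = etree.HTMLParser(encoding="utf-8")
root = etree.HTML(RAW_HTML, parser=parser)

# Same two rules as in the post: one without and one with a tbody step.
without_tbody = root.xpath('//*[@id="large_mid"]/table[2]/tr[3]/td/p//text()')
with_tbody = root.xpath('//*[@id="large_mid"]/table[2]/tbody/tr[3]/td/p//text()')

print(without_tbody)
print(with_tbody)
```

Here the rule without tbody should find the paragraph text, while the rule with tbody should come back empty, because the parsed tree has no tbody element to match.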
With etree.parse, the behavior is exactly the opposite of etree.HTML:
    from lxml import etree

    parser = etree.HTMLParser(encoding="utf-8")
    # Parse the locally saved copy of the page
    html = etree.parse('test.html', parser=parser)
    content = html.xpath('//*[@id="large_mid"]/table[2]/tbody/tr[3]/td/p//text()')
    print(content)
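After saving the page as test.html and loading it with etree.parse, the rule returns the expected result only when it includes the tbody tag; without tbody it returns an empty list.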
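The offline case can be sketched the same way. The example below writes a hypothetical `SAVED_HTML` snippet to a temporary test.html and loads it with `etree.parse`. It assumes, and this is only an assumption, that the saved file contains explicit tbody tags, as a copy saved from a browser often does.

```python
import os
import tempfile

from lxml import etree

# Hypothetical stand-in for the saved test.html.
# ASSUMPTION: the saved copy contains explicit <tbody> tags.
SAVED_HTML = b"""
<div id="large_mid">
  <table><tbody><tr><td>header</td></tr></tbody></table>
  <table>
    <tbody>
      <tr><td>row 1</td></tr>
      <tr><td>row 2</td></tr>
      <tr><td><p>body text</p></td></tr>
    </tbody>
  </table>
</div>
"""

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "test.html")
    with open(path, "wb") as fh:
        fh.write(SAVED_HTML)
    parser = etree.HTMLParser(encoding="utf-8")
    # The file is read completely here, so the tree outlives the temp dir.
    tree = etree.parse(path, parser=parser)

with_tbody = tree.xpath('//*[@id="large_mid"]/table[2]/tbody/tr[3]/td/p//text()')
without_tbody = tree.xpath('//*[@id="large_mid"]/table[2]/tr[3]/td/p//text()')

print(with_tbody)
print(without_tbody)
```

With tbody present in the file, only the rule that includes the tbody step should match.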
    from lxml import etree
    import requests

    # Offline: parse the saved copy, rule WITH tbody
    parser = etree.HTMLParser(encoding="utf-8")
    html = etree.parse('test.html', parser=parser)
    content = html.xpath('//*[@id="large_mid"]/table[2]/tbody/tr[3]/td/p//text()')
    print(content)

    print('---------------- separator ----------------')

    # Online: parse the downloaded page, rule WITHOUT tbody
    url = 'http://www.sxchxx.com/index-13-1075-1.html'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36',
    }
    response = requests.get(url, headers=headers)
    parser = etree.HTMLParser(encoding="utf-8")
    html = etree.HTML(response.text, parser=parser)
    content = html.xpath('//*[@id="large_mid"]/table[2]/tr[3]/td/p//text()')
    print(content)
When parsing the online page, do not add the tbody tag.
Conversely, when parsing the local (offline) copy, add the tbody tag.
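To avoid keeping two versions of the rule, the two paths can be joined with the XPath union operator `|`, so that one expression covers both layouts. A minimal sketch, assuming hypothetical `ONLINE_LIKE` and `SAVED_LIKE` snippets and a hypothetical `extract` helper:

```python
from lxml import etree

BASE = '//*[@id="large_mid"]/table[2]'
# Union of the two rules: rows directly under <table>, or under <tbody>.
TOLERANT_RULE = (
    f'{BASE}/tr[3]/td/p//text()'
    f' | {BASE}/tbody/tr[3]/td/p//text()'
)


def extract(markup: bytes) -> list:
    """Parse markup and apply the tolerant rule (hypothetical helper)."""
    parser = etree.HTMLParser(encoding="utf-8")
    root = etree.HTML(markup, parser=parser)
    return root.xpath(TOLERANT_RULE)


# Hypothetical snippets: identical rows, with and without <tbody>.
ROWS = b"<tr><td>1</td></tr><tr><td>2</td></tr><tr><td><p>body text</p></td></tr>"
ONLINE_LIKE = b'<div id="large_mid"><table></table><table>' + ROWS + b"</table></div>"
SAVED_LIKE = (
    b'<div id="large_mid"><table></table><table><tbody>'
    + ROWS
    + b"</tbody></table></div>"
)

online_result = extract(ONLINE_LIKE)
saved_result = extract(SAVED_LIKE)
print(online_result)
print(saved_result)
```

The union keeps the positional steps (`table[2]`, `tr[3]`) intact, which a looser rule such as `//tr[3]` would not.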
The analysis of the cause follows.
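Comparing the two approaches above, the only difference lies in these two lines of code:
    html = etree.parse('test.html', parser=parser)
    html = etree.HTML(response.text, parser=parser)
The parser is the same in both cases:
    parser = etree.HTMLParser(encoding="utf-8")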
So my guess is that either parse or HTML applies some kind of "formatting" adjustment to the markup.
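One way to test this guess is to feed identical markup through both entry points and compare the resulting trees. The sketch below does that with a hypothetical `MARKUP` snippet (not the real page). If the two trees match, the difference more likely lies in the input, for example a copy saved from a browser that already contains tbody. That last point is an assumption about how test.html was produced, not something confirmed by the test above.

```python
import io

from lxml import etree

# Hypothetical snippet without <tbody>; the same bytes go to both entry points.
MARKUP = b'<div id="large_mid"><table><tr><td><p>body text</p></td></tr></table></div>'

parser = etree.HTMLParser(encoding="utf-8")

# etree.HTML returns the root element directly.
root_from_string = etree.HTML(MARKUP, parser=parser)
# etree.parse returns an ElementTree; a file-like object stands in for test.html.
root_from_file = etree.parse(io.BytesIO(MARKUP), parser=parser).getroot()

tags_from_string = [el.tag for el in root_from_string.iter()]
tags_from_file = [el.tag for el in root_from_file.iter()]

same_structure = tags_from_string == tags_from_file
tbody_added = "tbody" in tags_from_string or "tbody" in tags_from_file

print(same_structure)
print(tbody_added)
```

If both entry points yield the same element structure and neither adds tbody, the "formatting" difference is unlikely to come from parse or HTML themselves.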
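lxml is implemented largely in Cython on top of the C library libxml2, so this cannot be checked by simply stepping through Python source code.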
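Even without reading lxml's source, the guess can be narrowed down by inspecting the input itself: if the saved file literally contains tbody tags and the downloaded source does not, the difference comes from the data rather than from the parser. A minimal sketch using only the standard library; the `TbodyCounter` class, the `count_tbody` helper, and both sample strings are hypothetical.

```python
from html.parser import HTMLParser


class TbodyCounter(HTMLParser):
    """Counts literal <tbody> start tags in a piece of markup."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names before calling this hook.
        if tag == "tbody":
            self.count += 1


def count_tbody(markup: str) -> int:
    """Return how many <tbody> start tags the markup contains."""
    counter = TbodyCounter()
    counter.feed(markup)
    counter.close()
    return counter.count


# Hypothetical samples: raw page source vs. a copy saved from a browser.
RAW_SOURCE = "<table><tr><td>body text</td></tr></table>"
SAVED_COPY = "<table><TBODY><tr><td>body text</td></tr></TBODY></table>"

raw_count = count_tbody(RAW_SOURCE)
saved_count = count_tbody(SAVED_COPY)
print(raw_count, saved_count)
```

In practice, `count_tbody` would be run on `response.text` and on the contents of test.html; differing counts would point to the saved file, not to etree.parse or etree.HTML.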