16. Web scraping in Python with lxml

Date: 2019-11-10

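1. lxml is a parsing library, so it has to be installed before it can be imported. Running pip3 install lxml directly on the command line usually failed in the author's experience. The approach that worked was to go to https://pypi.org/project/lxml/#files and download the lxml wheel that matches your environment.

(Pick the file according to your own Python version. The author uses Python 3.6 and therefore downloaded lxml-4.2.4-cp36-cp36m-win32.whl. Change to the directory that contains the downloaded .whl file and run the following there:

pip3 install lxml-4.2.4-cp36-cp36m-win32.whl )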
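The cp36 and win32 parts of the wheel file name encode the Python version and the platform, so it helps to check both before downloading. The snippet below is a minimal sketch using only the standard library; the helper name wheel_tags is made up for illustration, and the Windows branch assumes an x86 machine (32-bit or 64-bit).

```python
import struct
import sys


def wheel_tags():
    """Return (python_tag, platform_hint) for the running interpreter.

    wheel_tags is a hypothetical helper, not part of pip or lxml.
    """
    # e.g. Python 3.6 -> "cp36"
    python_tag = "cp{}{}".format(sys.version_info.major, sys.version_info.minor)
    # Pointer size in bits: 32 or 64
    bits = struct.calcsize("P") * 8
    if sys.platform == "win32":
        # sys.platform is "win32" on both 32-bit and 64-bit Windows,
        # so the pointer size decides which wheel to pick (x86 assumed).
        platform_hint = "win32" if bits == 32 else "win_amd64"
    else:
        platform_hint = "{} ({}-bit)".format(sys.platform, bits)
    return python_tag, platform_hint


python_tag, platform_hint = wheel_tags()
print("Look for a wheel whose name contains:", python_tag, platform_hint)
```

On the author's setup (Python 3.6, 32-bit Windows) this would point at the cp36 and win32 file mentioned above.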

2. The code is shown below:

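import requests
from lxml import etree

url = 'https://www.mafengwo.cn/gonglve/ziyouxing/2033.html'

response = requests.get(url, timeout=10)   # returns a Response object
response.raise_for_status()                # stop early on an HTTP error status
page = response.text                       # response body decoded to a str

html = etree.HTML(page)        # parses the string as HTML and returns the root Element
content = html.xpath('//h2')   # list of every <h2> element in the document

for i in content:
    # i.text is None when the <h2> starts with a child tag, so join all of its text nodes
    print(''.join(i.itertext()).strip())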

3. Explanation of the code:

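A: Define the URL, then use it to obtain a response object, e.g. url = ''

B: The response object has to be converted to a string: page = response.text

C: Use the parsing library to turn the string into an HTML document, then write the XPath expression according to the content you want to extract.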
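Steps B and C can be tried without any network access by parsing a string directly. The following is a minimal sketch: the sample markup and the helper name extract_headings are invented for illustration and do not come from the target site. It uses itertext() so that headings wrapping other tags, such as a link, still yield their text.

```python
from lxml import etree

# Hypothetical sample markup standing in for response.text
page = """
<html><body>
  <h2>First heading</h2>
  <h2><a href="/guide">Second <b>heading</b></a></h2>
  <p>Not a heading</p>
</body></html>
"""


def extract_headings(html_text):
    """Return the visible text of every <h2> in html_text."""
    root = etree.HTML(html_text)        # parse the string into an element tree
    headings = []
    for node in root.xpath('//h2'):     # select all <h2> elements
        # Join the text of the element and all of its descendants
        text = ''.join(node.itertext()).strip()
        if text:
            headings.append(text)
    return headings


print(extract_headings(page))  # ['First heading', 'Second heading']
```

Changing the XPath expression, for example to '//h2/a/@href', is how you would target different content on the same page.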