Getting to Know Web Scraping: Just How Concise Is Scraping with pyquery, an Excellent Scraping Tool?

Date: 2021-04-07

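Having covered HTML parsing with BeautifulSoup objects and with the XPath syntax of the lxml library, we now turn to pyquery. The name looks a lot like jQuery, and that is no accident: pyquery models its syntax on jQuery, and the two are used in nearly the same way. That makes it a real convenience for front-end developers who write scrapers. If you already know some jQuery syntax, pyquery will feel very easy to pick up.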

1. Install and import the pyquery library

pip install -i https://pypi.mirrors.ustc.edu.cn/simple/ pyquery

# -*- coding: UTF-8 -*-

# Import the PyQuery class from the pyquery library
from pyquery import PyQuery

2. Sending web requests with pyquery (rarely used)

'''
A PyQuery object can send the web request itself and return the parsed response.
'''
from pyquery import PyQuery

# GET request
print(PyQuery(url='http://www.baidu.com/', data={}, headers={'user-agent': 'pyquery'}, method='get'))

# POST request
print(PyQuery(url='http://httpbin.org/post', data={'name': 'Python 集中营'}, headers={'user-agent': 'pyquery'}, method='post', verify=True))

3. Parsing page source with pyquery (commonly used)

Initializing the parser object

from pyquery import PyQuery

# Start from page source that the downloader has already fetched.
# Here we simply reuse the official example document.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

# Initialize the parser object
pyquery_obj = PyQuery(html_doc)

Extracting elements and their text with CSS selectors

# Get the a elements and their text
print(pyquery_obj('a'))
print(pyquery_obj('a').text())

# Get the elements with class=story and their text
print(pyquery_obj('.story'))
print(pyquery_obj('.story').text())

# Get the element with id=link3 and its text
print(pyquery_obj('#link3'))
print(pyquery_obj('#link3').text())

# Get the p elements under body and their text
print(pyquery_obj('body p'))
print(pyquery_obj('body p').text())

# Get both the p elements and the a elements, and their text
print(pyquery_obj('p,a'))
print(pyquery_obj('p,a').text())

# Get the elements whose class attribute equals 'story', and their text
print(pyquery_obj("[class='story']"))
print(pyquery_obj("[class='story']").text())

Extracting further information from selected elements

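# Extract text and related information from the selected elements
print("...... further extraction ......")
print("Text of all a elements", pyquery_obj('a').text())
print("Inner HTML of the first a element", pyquery_obj('a').html())
print("Parent element of the a elements", pyquery_obj('a').parent())
print("Child elements of the a elements", pyquery_obj('a').children())
print("The a element whose id is link3", pyquery_obj('a').filter('#link3'))
# With several matches, attr returns the attribute of the first match
print("href attribute of the first a element", pyquery_obj('a').attr.href)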

DOM manipulation

# attr() function: get an attribute value
print(pyquery_obj('a').filter('#link3').attr('href'))
# attr.<name>: get an attribute value (class_ stands for class)
print(pyquery_obj('a').filter('#link3').attr.href)
print(pyquery_obj('a').filter('#link3').attr.class_)
# Add the class value w
pyquery_obj('a').filter('#link3').add_class('w')
print(pyquery_obj('a').filter('#link3').attr('class'))

# Remove the class value sister
pyquery_obj('a').filter('#link3').remove_class('sister')
print(pyquery_obj('a').filter('#link3').attr('class'))
# Remove all a tags from the document
pyquery_obj('html').remove('a')
print(pyquery_obj)

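For more, follow the WeChat official account 【Python 集中营】, which focuses on the Python technology stack: resources, a community for discussion, and practical write-ups. We look forward to having you join.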