The Data Road - Python Web Scraping - urllib, requests, Regular Expressions, XPath, Beautiful Soup, pyquery

Date: 2019-11-08


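Part 1: Basic library: urllib

urllib is Python's built-in HTTP request library. It contains four modules:

request: the most basic HTTP request module, used to simulate sending requests.
error: the exception-handling module. If a request fails, we can catch the exception and then retry or take other action so the program does not terminate unexpectedly.
parse: a utility module that provides many URL-handling functions, such as splitting, parsing, and joining.
robotparser: mainly used to read a site's robots.txt file and decide which URLs may be crawled and which may not. In practice it is used fairly rarely.

1. The urllib.request module

The main job of the request module is to build HTTP requests. With it we can simulate the way a browser issues a request.

The request module also handles authentication, redirection, browser cookies, and more.

- The urlopen function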

 urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

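urlopen parameters:

url: the URL to request.
data: if omitted, a GET request is sent; if provided (as bytes), a POST request is sent.
timeout: the timeout in seconds. If no response arrives within this time, an exception is raised. If it is not specified, the global default timeout is used. It applies to HTTP, HTTPS, and FTP requests.
context: must be an ssl.SSLContext instance; it specifies the SSL settings.
cafile: a CA certificate file.
capath: a directory of CA certificates. Both are useful when requesting HTTPS links; they were deprecated in Python 3.6 in favor of context.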

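- The Request class

class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

Request parameters:

url: the URL to request. This is the only required parameter; all others are optional.
data: if provided, it must be of type bytes. If the data is a dictionary, encode it first with urlencode() from the urllib.parse module.
headers: a dictionary of request headers. We can pass it through the headers parameter when building the request, or add headers afterwards by calling add_header() on the request instance. The most common use is to change User-Agent so that the request looks like it comes from a browser.
origin_req_host: the host name or IP address of the requester.
unverifiable: whether the request is unverifiable; the default is False. An unverifiable request is one that the user had no option to approve. For example, when we request an image embedded in an HTML document and the user had no chance to approve the automatic fetching of that image, unverifiable should be True.
method: a string naming the HTTP method to use, such as GET, POST, or PUT.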

from urllib import request, parse
 
url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org'
}
# Form fields to send; named form_data so the built-in dict is not shadowed
form_data = {
    'name': 'Germey'
}
data = bytes(parse.urlencode(form_data), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))
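
The add_header() route mentioned in the parameter list can be tried without sending anything. The following is a minimal sketch that builds a similar request and only inspects it; the User-Agent string and the form field are placeholder values, not anything a particular site requires.

```python
from urllib import parse, request

url = 'http://httpbin.org/post'
form_data = {'name': 'Germey'}  # placeholder form field

# urlencode() returns str, so encode it to bytes before passing it as data
data = parse.urlencode(form_data).encode('utf-8')

req = request.Request(url, data=data)
# Add a header after construction instead of passing headers=...
req.add_header('User-Agent', 'Mozilla/5.0 (compatible; example-crawler)')

# Nothing has been sent yet; these calls only inspect the Request objects
print(req.get_method())                    # POST, because data was supplied
print(request.Request(url).get_method())   # GET, because there is no data
print(req.data)
print(req.get_header('User-agent'))        # header names are stored capitalized
```

Inspecting the request this way is a cheap check that the body and headers look right before calling urlopen(req).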

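- Handlers

The BaseHandler class in urllib.request is the parent class of all other handlers.

Common handlers:

HTTPDefaultErrorHandler: handles HTTP error responses; every error is raised as an HTTPError exception.
HTTPRedirectHandler: handles redirects.
HTTPCookieProcessor: handles cookies.
ProxyHandler: sets proxies. If no proxy dictionary is passed, the proxies are read from environment variables such as http_proxy.
HTTPPasswordMgr: manages passwords; it maintains a table of user names and passwords.
HTTPBasicAuthHandler: manages authentication; if opening a link requires authentication, this handler can take care of it.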
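
As a rough illustration of how a password manager and HTTPBasicAuthHandler fit together, the sketch below builds an opener without sending a request. The URL and credentials are placeholders, and HTTPPasswordMgrWithDefaultRealm (a standard-library subclass of HTTPPasswordMgr) is used here on the assumption that the realm is not known in advance.

```python
from urllib.request import (
    HTTPBasicAuthHandler,
    HTTPPasswordMgrWithDefaultRealm,
    build_opener,
)

url = 'http://localhost:5000/'                 # hypothetical protected page
username, password = 'username', 'password'    # placeholder credentials

# The password manager maps (realm, URL) to credentials; None means "any realm"
password_mgr = HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, url, username, password)

# Wrap the manager in a handler and build an opener that uses it
auth_handler = HTTPBasicAuthHandler(password_mgr)
opener = build_opener(auth_handler)

# Look up what would be sent for this URL; no request is made here
print(password_mgr.find_user_password(None, url))

# To actually fetch the page: response = opener.open(url)
```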

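- Proxies

ProxyHandler takes a dictionary whose keys are protocol types (such as http or https) and whose values are proxy URLs; several proxies can be added.

Then use this handler with build_opener() to construct an opener, and send the request through it.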

from urllib.error import URLError
from urllib.request import ProxyHandler, build_opener
 
proxy_handler = ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'https://127.0.0.1:9743'
})
opener = build_opener(proxy_handler)
try:
    response = opener.open('https://www.baidu.com')
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)

- cookies

# Fetch cookies from a page and print them one per line
import http.cookiejar, urllib.request
 
cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name+"="+item.value)

# Fetch cookies from a page and save them to a file
filename = 'cookies.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)  # or: http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

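PS: MozillaCookieJar is a subclass of CookieJar. LWPCookieJar and MozillaCookieJar can both load and save cookies, but their file formats differ.

Call load() to read the local cookies file and recover its contents. The class used for loading must match the one that saved the file.

# cookies.txt was saved by MozillaCookieJar above, so load it with the same class
cookie = http.cookiejar.MozillaCookieJar()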
cookie.load('cookies.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))

2. The urllib.error module

from urllib import request, error
try:
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.URLError as e:
    print(e.reason)
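
HTTPError, mentioned earlier for HTTPDefaultErrorHandler, is a subclass of URLError, so the two are usually caught in that order. The sketch below is one way to arrange this; the helper name describe_failure and the deliberately invalid URL are made up for illustration.

```python
from urllib import error, request


def describe_failure(url, timeout=5):
    """Return a short description of what happened when opening url."""
    try:
        with request.urlopen(url, timeout=timeout):
            return 'ok'
    except error.HTTPError as e:
        # The server answered, but with an error status such as 404
        return f'HTTP error {e.code}: {e.reason}'
    except error.URLError as e:
        # No usable response at all: DNS failure, refused connection, bad scheme
        return f'URL error: {e.reason}'


# HTTPError must be caught first because it is a subclass of URLError
print(issubclass(error.HTTPError, error.URLError))

# An unsupported scheme fails locally, so this call needs no network access
print(describe_failure('nosuchscheme://example.invalid/'))
```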

3. The urllib.parse module

urlparse()
urlunparse()
urlsplit()
urlunsplit()
urljoin()
urlencode()
parse_qs()
parse_qsl()
quote()
unquote()
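
The functions listed above are easiest to understand side by side. The following is a short tour; the example.com URLs and query values are invented for the demonstration.

```python
from urllib.parse import (
    parse_qs, parse_qsl, quote, unquote,
    urlencode, urljoin, urlparse, urlsplit, urlunparse, urlunsplit,
)

url = 'http://www.example.com/index.html;user?id=5#comment'  # made-up URL

# urlparse() splits a URL into 6 parts; urlunparse() joins them back
parts = urlparse(url)
print(parts.scheme, parts.netloc, parts.path, parts.params, parts.query, parts.fragment)
print(urlunparse(parts) == url)

# urlsplit()/urlunsplit() do the same with 5 parts (params stays inside path)
split_parts = urlsplit(url)
print(split_parts.path)
print(urlunsplit(split_parts) == url)

# urljoin() resolves a relative link against a base URL
print(urljoin('http://www.example.com/docs/about.html', 'faq.html'))

# urlencode() turns a dict into a query string; parse_qs()/parse_qsl() reverse it
query = urlencode({'name': 'germey', 'age': 22})
print(query)
print(parse_qs(query))
print(parse_qsl(query))

# quote()/unquote() percent-encode and decode text for use in a URL
encoded = quote('web scraping')
print(encoded, unquote(encoded))
```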

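4. The urllib.robotparser module

The Robots protocol, also called the crawler protocol or robot protocol, is formally named the Robots Exclusion Protocol. It tells crawlers and search engines which pages may be fetched and which may not. It usually takes the form of a text file named robots.txt placed in the site's root directory, for example www.taobao.com/robots.txt

The robotparser module provides the RobotFileParser class, which uses a site's robots.txt file to decide whether a crawler is allowed to fetch a given page.

urllib.robotparser.RobotFileParser(url='')

# set_url(): sets the URL of the robots.txt file.
# read(): fetches the robots.txt file and feeds it to the parser.
# parse(): parses the lines of a robots.txt file.
# can_fetch(): takes two arguments, the User-agent and the URL to fetch, and returns whether that agent may fetch that URL.
# mtime(): returns the time robots.txt was last fetched and parsed.
# modified(): sets the time robots.txt was last fetched and parsed to the current time.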

from urllib.robotparser import RobotFileParser
 
rp = RobotFileParser()
rp.set_url('http://www.jianshu.com/robots.txt')
rp.read()
print(rp.can_fetch('*', 'http://www.jianshu.com/p/b67554025d7d'))
print(rp.can_fetch('*', "http://www.jianshu.com/search?q=python&page=1&type=collections"))
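
The parse() method listed above also makes it possible to try the rules without downloading anything. The sketch below feeds a made-up robots.txt to the parser; the rules and the example.com URLs are assumptions for illustration, not the contents of any real site's file.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt, given as a list of lines
robots_lines = [
    'User-agent: *',
    'Disallow: /search',
    'Allow: /p/',
]

rp = RobotFileParser()
rp.parse(robots_lines)   # parse() takes the lines directly, no download needed

allowed_article = rp.can_fetch('*', 'http://www.example.com/p/b67554025d7d')
allowed_search = rp.can_fetch('*', 'http://www.example.com/search?q=python&page=1')
print(allowed_article)   # the /p/ path matches the Allow rule
print(allowed_search)    # the /search path matches the Disallow rule
```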

Part 2: Basic library: requests

1. GET and POST requests

The get(), post(), put(), and delete() functions send GET, POST, PUT, and DELETE requests respectively.

import requests
# ========================================================
# GET request

data = {
    'name': 'germey',
    'age': 22
}
r = requests.get("http://httpbin.org/get", params=data)
print(r.text)

# ========================================================
# POST request

data = {'name': 'germey', 'age': '22'}
r = requests.post("http://httpbin.org/post", data=data)
print(r.text)
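
The difference between params and data can be seen without contacting the server by preparing the requests instead of sending them. This is a minimal sketch that assumes the requests package is installed; Request(...).prepare() is used here only as an inspection aid.

```python
import requests

data = {'name': 'germey', 'age': 22}

# Prepare the requests without sending them, to see what would go on the wire
get_req = requests.Request('GET', 'http://httpbin.org/get', params=data).prepare()
post_req = requests.Request('POST', 'http://httpbin.org/post', data=data).prepare()

print(get_req.url)                        # params end up in the query string
print(post_req.url)                       # the POST URL is unchanged
print(post_req.body)                      # data ends up form-encoded in the body
print(post_req.headers['Content-Type'])
```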

2. Responses

import requests
 
r = requests.get('http://www.jianshu.com')
print(type(r.status_code), r.status_code)    # status_code: the HTTP status code
print(type(r.headers), r.headers)    # headers: the response headers
print(type(r.cookies), r.cookies)    # cookies: the cookies
print(type(r.url), r.url)    # url: the final URL
print(type(r.history), r.history)    # history: the request history (redirects)

3. File upload

import requests

# Open the file in binary mode; the with block closes it after the upload
with open('favicon.ico', 'rb') as f:
    files = {'file': f}
    r = requests.post("http://httpbin.org/post", files=files)
print(r.text)

4. Cookies

# Get cookies
import requests
 
r = requests.get("https://www.baidu.com")
print(r.cookies)
for key, value in r.cookies.items():
    print(key + '=' + value)

5. Session persistence

import requests
 
s = requests.Session()
s.get('http://httpbin.org/cookies/set/number/123456789')
r = s.get('http://httpbin.org/cookies')
print(r.text)

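6. SSL certificate verification

requests also verifies certificates. When it sends an HTTPS request it checks the SSL certificate, and the verify parameter controls whether this check is performed. If verify is omitted it defaults to True, so the certificate is verified automatically.

# Pass verify=False to skip verification, and silence the resulting warning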
import requests
from requests.packages import urllib3
 
urllib3.disable_warnings()
response = requests.get('https://www.12306.cn', verify=False)
print(response.status_code)

# Suppress the warning by capturing warnings into the log
import logging
import requests
logging.captureWarnings(True)
response = requests.get('https://www.12306.cn', verify=False)
print(response.status_code)

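# Specify a local certificate to use as the client certificate; this can be a single file (containing both key and certificate) or a tuple of two file paths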
import requests
 
response = requests.get('https://www.12306.cn', cert=('/path/server.crt', '/path/key'))
print(response.status_code)

7. Proxies

import requests
 
proxies = {
  "http": "http://10.10.1.10:3128",
  "https": "http://10.10.1.10:1080",
}
 
requests.get("https://www.taobao.com", proxies=proxies)

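8. Timeouts

import requests

# Raise an exception if no response arrives within 1 second
r = requests.get("https://www.taobao.com", timeout=1)
print(r.status_code)

# A request has two phases, connect and read; pass a (connect, read) tuple to set them separately
r = requests.get('https://www.taobao.com', timeout=(5, 30))

# Wait indefinitely: timeout=None, which is also the default when timeout is omitted
r = requests.get('https://www.taobao.com', timeout=None)
r = requests.get('https://www.taobao.com')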

9. Authentication

# Use the authentication support built into requests
import requests
from requests.auth import HTTPBasicAuth

r = requests.get('http://localhost:5000', auth=HTTPBasicAuth('username', 'password'))
print(r.status_code)

# Pass a tuple; it is handled with HTTPBasicAuth by default
import requests
 
r = requests.get('http://localhost:5000', auth=('username', 'password'))
print(r.status_code)

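Part 3: Regular expressions

1. Common matching rules

Pattern	Description
\w	Matches letters, digits, and underscore
\W	Matches any character that is not a letter, digit, or underscore
\s	Matches any whitespace character; for ASCII text, equivalent to [ \t\n\r\f\v]
\S	Matches any non-whitespace character
\d	Matches any digit; for ASCII text, equivalent to [0-9]
\D	Matches any non-digit character
\A	Matches the start of the string
\Z	Matches the end of the string; in Perl-style engines it also matches just before a trailing newline, while in Python's re it matches only at the very end
\z	Matches the very end of the string, newline included (Perl-style; Python's re has traditionally used \Z for this)
\G	Matches the position where the previous match ended (Perl-style; not supported by Python's re)
\n	Matches a newline
\t	Matches a tab
^	Matches the start of the string, and of each line when re.M is set
$	Matches the end of the string, and of each line when re.M is set
.	Matches any character except a newline; with re.DOTALL it matches any character including a newline
[...]	A set of characters listed individually; for example, [amk] matches a, m, or k
[^...]	Any character not in the set; for example, [^abc] matches anything except a, b, and c
*	Matches 0 or more repetitions of the preceding expression
+	Matches 1 or more repetitions of the preceding expression
?	Matches 0 or 1 repetition of the preceding expression; placed after another quantifier (as in *? or +?) it makes that quantifier non-greedy
{n}	Matches exactly n repetitions of the preceding expression
{n,m}	Matches n to m repetitions of the preceding expression, greedily
a|b	Matches a or b
( )	Matches the expression inside the parentheses and also defines a group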

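2. Flags

Flag	Description
re.I	Makes matching case-insensitive
re.L	Makes matching locale-aware
re.M	Multi-line matching; affects ^ and $
re.S	Makes . match every character, including newlines
re.U	Interprets characters according to the Unicode character set; affects \w, \W, \b, and \B
re.X	Allows a more flexible layout (whitespace and comments) so the regular expression is easier to read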
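
The effect of the most commonly used flags is easiest to see on a string that contains a newline. The sample text below is made up for the demonstration.

```python
import re

content = 'Hello 1234567 World_This\nis a Regex Demo'

# Without re.S, . stops at the newline, so the pattern cannot reach "Demo"
no_flag = re.match(r'^He.*?(\d+).*?Demo$', content)
print(no_flag)

# With re.S, . also matches the newline
with_dotall = re.match(r'^He.*?(\d+).*?Demo$', content, re.S)
print(with_dotall.group(1))

# re.I ignores case
ignore_case = re.findall(r'hello', content, re.I)
print(ignore_case)

# re.M lets ^ match at the start of every line, not just of the string
line_starts = re.findall(r'^\w+', content, re.M)
print(line_starts)
```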

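3. Common regex functions

match() tries to match the regular expression at the start of the string. Its first argument is the regular expression and its second is the string to match. On the resulting match object, group() returns the matched text and span() returns the range of the match.
search() scans the whole string and returns the first successful match.
findall() searches the whole string and returns everything that matches the regular expression.
sub() replaces every match with a given string; for example, it can strip all digits from a piece of text.
compile() compiles a regex string into a regular expression object so it can be reused in later matches.
split() splits the string at every match of the regular expression and returns the pieces as a list.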
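
The six functions described above can be compared on one sample string. This is a small sketch; the sample text and patterns are invented for the demonstration.

```python
import re

content = 'Hello 123 4567 World_This is a Regex Demo'

# match() anchors at the start of the string
m = re.match(r'^Hello\s(\d+)', content)
print(m.group(), m.group(1), m.span())

# search() finds the first match anywhere in the string
first_word = re.search(r'W\w+', content).group()
print(first_word)

# findall() returns every match as a list
numbers = re.findall(r'\d+', content)
print(numbers)

# sub() replaces every match; here it removes all digits
no_digits = re.sub(r'\d+', '', content)
print(no_digits)

# compile() builds a reusable pattern object
digits = re.compile(r'\d+')
print(digits.findall('a1b22c333'))

# split() cuts the string at every match
pieces = re.split(r'\s+', 'one two   three')
print(pieces)
```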

Part 4: Parsing library: XPath

XPath, short for XML Path Language, is a language for finding information in XML documents.

To parse a web page with XPath, first import the etree module from the lxml library, then declare a piece of HTML text and pass it to the HTML class. This builds an object that can be queried with XPath. The etree module also repairs malformed HTML automatically.

from lxml import etree

# text is the HTML string defined in the basic usage section below
html = etree.HTML(text)
result = etree.tostring(html)
print(result.decode('utf-8'))

# Extract information with an XPath rule
html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//*')
print(result)

# Matching an attribute with multiple values: use contains()
html = etree.HTML(text)
result = html.xpath('//li[contains(@class, "li")]/a/text()')
print(result)

# Matching on several attributes: combine the conditions with and
html = etree.HTML(text)
result = html.xpath('//li[contains(@class, "li") and @name="item"]/a/text()')
print(result)

# Selecting nodes by position: pass an index in square brackets
html = etree.HTML(text)
result = html.xpath('//li[1]/a/text()')
print(result)
result = html.xpath('//li[last()]/a/text()')
print(result)
result = html.xpath('//li[position()<3]/a/text()')
print(result)
result = html.xpath('//li[last()-2]/a/text()')
print(result)

# Node axis selection (to be continued)

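1. Common XPath rules

Expression	Description
nodename	Selects all nodes with the given name
/	Selects direct children of the current node
//	Selects descendants of the current node
.	Selects the current node
..	Selects the parent of the current node
@	Selects attributes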
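
Each rule in the table can be tried on a small fragment. The sketch below assumes lxml is installed; the two-item list is a shortened, made-up document used only for this demonstration.

```python
from lxml import etree

text = '''
<div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
    </ul>
</div>
'''
html = etree.HTML(text)   # etree wraps the fragment in <html><body>

li_nodes = html.xpath('//li')                      # // : all li descendants
link_texts = html.xpath('//ul/li/a/text()')        # /  : direct children, step by step
hrefs = html.xpath('//a/@href')                    # @  : attribute values
parent_class = html.xpath('//a[@href="link2.html"]/../@class')  # .. : parent of the matched a
current = html.xpath('//li[1]/a/.')                # .  : the current node itself

print(len(li_nodes))
print(link_texts)
print(hrefs)
print(parent_class)
print(current[0].tag)
```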

2. Basic XPath usage

text = '''
<div>
    <ul>
         <li class="item-0"><a href="link1.html">first item</a></li>
         <li class="item-1"><a href="link2.html">second item</a></li>
         <li class="item-inactive"><a href="link3.html">third item</a></li>
         <li class="item-1"><a href="link4.html">fourth item</a></li>
         <li class="item-0"><a href="link5.html">fifth item</a>
     </ul>
 </div>
'''

- All nodes, child nodes, parent nodes

from lxml import etree
html = etree.parse('./test.html', etree.HTMLParser())

# Select all nodes
result = html.xpath('//*')
print(result)

result = html.xpath('//li')
print(result)
print(result[0])

# Select child nodes
result = html.xpath('//li/a')
print(result)

# Select the parent node
result = html.xpath('//a[@href="link4.html"]/../@class')
print(result)

- Selecting by attribute

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li[@class="item-0"]')
print(result)

- Selecting text

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
result = html.xpath('//li[@class="item-0"]/text()')
print(result)

# Matching an attribute with multiple values
html = etree.HTML(text)
result = html.xpath('//li[contains(@class, "li")]/a/text()')
print(result)

# Matching on several attributes
html = etree.HTML(text)
result = html.xpath('//li[contains(@class, "li") and @name="item"]/a/text()')
print(result)

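Part 5: Parsing library: Beautiful Soup

Beautiful Soup is a Python library for parsing HTML or XML that makes it easy to extract data from web pages. Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8.

1. Basic Beautiful Soup usage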

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

- Tag selectors: selecting elements

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.title)
print(soup.head)
print(soup.p)    # if there are several p tags, only the first is returned

- Tag selectors: getting the name

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.title.name)

- Tag selectors: getting attributes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.attrs['name'])
print(soup.p['name'])

- Child and descendant nodes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.p.contents)        # direct children, as a list

print(soup.p.children)        # direct children, as an iterator
for i, child in enumerate(soup.p.children):
    print(i, child)

print(soup.p.descendants)     # all descendants, as a generator
for i, child in enumerate(soup.p.descendants):
    print(i, child)

- Parent, ancestor, and sibling nodes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')

print(soup.a.parent)                                # the parent node
print(list(enumerate(soup.a.parents)))              # all ancestor nodes

print(list(enumerate(soup.a.next_siblings)))        # all following siblings
print(list(enumerate(soup.a.previous_siblings)))    # all preceding siblings

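2. Parsers supported by Beautiful Soup

Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, "html.parser")	Built into Python, moderate speed, tolerant of malformed documents	Poor tolerance of malformed documents in versions before Python 2.7.3 and Python 3.2.2
lxml HTML parser	BeautifulSoup(markup, "lxml")	Fast, tolerant of malformed documents	Requires a C library to be installed
lxml XML parser	BeautifulSoup(markup, "xml")	Fast, the only parser that supports XML	Requires a C library to be installed
html5lib	BeautifulSoup(markup, "html5lib")	Best tolerance, parses documents the way a browser does, produces HTML5	Slow, requires an external Python dependency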

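3. Method selectors

find_all() searches the document by tag name, attributes, or text
find_all(name, attrs, recursive, text, **kwargs)

# Query by tag name
print(soup.find_all(name='ul'))
print(type(soup.find_all(name='ul')[0]))

# Query by attributes
print(soup.find_all(attrs={'id': 'list-1'}))
print(soup.find_all(attrs={'name': 'elements'}))

# Query by text (requires import re; newer Beautiful Soup versions also accept string= for this argument)
print(soup.find_all(text=re.compile('link')))

find_all()                  # returns all matching elements
find()                      # returns the first matching element

find_parents()              # returns all matching ancestor nodes
find_parent()               # returns the closest matching parent node

find_next_siblings()        # returns all following sibling nodes
find_next_sibling()         # returns the first following sibling node

find_previous_siblings()    # returns all preceding sibling nodes
find_previous_sibling()     # returns the first preceding sibling node

find_all_next()             # returns all matching nodes after this node
find_next()                 # returns the first matching node after this node

find_all_previous()         # returns all matching nodes before this node
find_previous()             # returns the first matching node before this node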
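
A few of the methods in the list above can be tried on a small document. This is a minimal sketch that assumes Beautiful Soup 4 is installed; it uses the built-in html.parser so that lxml is not required, and the HTML is a shortened, made-up panel.

```python
import re
from bs4 import BeautifulSoup

html = '''
<div class="panel">
    <ul class="list" id="list-1">
        <li class="element">Foo</li>
        <li class="element">Bar</li>
    </ul>
    <ul class="list list-small" id="list-2">
        <li class="element">Jay</li>
    </ul>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')  # html.parser needs no extra install

all_uls = soup.find_all('ul')                       # every match
first_ul = soup.find('ul')                          # first match only
second_classes = soup.find_all(attrs={'id': 'list-2'})[0]['class']
with_a = soup.find_all(string=re.compile('a'))      # text nodes containing "a"

first_li = soup.find('li')
parent_id = first_li.find_parent('ul')['id']        # nearest ul ancestor
next_text = first_li.find_next_sibling('li').get_text()

print(len(all_uls), first_ul['id'], second_classes)
print(with_a)
print(parent_id, next_text)
```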

4. CSS selectors

Pass a CSS selector directly to select() to make the selection

html= '''
<div class='panel'>
    <div class='panel-heading'>
        <h4>Hello</h4>
    </div>    
    <div class='panel-body'>
        <ul class='list' id='list-1'>
            <li class='element'>Foo</li>
            <li class='element'>Bar</li>
            <li class='element'>Jay</li>
        </ul>
        <ul class='list list-small' id='list-2'>
            <li class='element'>Foo</li>
            <li class='element'>Bar</li>
        </ul>
    </div>
</div>
'''

- Selecting tags

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
print(soup.select('.panel .panel-heading'))    # descendant selector: note the space
print(soup.select('ul li'))
print(soup.select('#list-2 .element'))

- Selecting attributes

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])

- Selecting text

from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for li in soup.select('li'):
    print(li.get_text())

Part 6: Parsing library: pyquery

html = '''
<div>
    <ul>
        <li class="item-0">first item</li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-0 active"><a href="link3.html"><span class="bold">third item</span></a></li>
        <li class="item-1 active"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a></li>
    </ul>
</div>
'''

1. Initialization

# Initialize from a string
from pyquery import PyQuery as pq
doc = pq(html)
print(doc('li'))

# Initialize from a URL
from pyquery import PyQuery as pq
doc = pq(url='https://cuiqingcai.com')
print(doc('title'))

# Initialize from a file
from pyquery import PyQuery as pq
doc = pq(filename='demo.html')
print(doc('li'))

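2. CSS selectors

- Getting tags

from pyquery import PyQuery as pq
doc = pq(html)

# Child elements
# Note: the .list and .wrap selectors assume the <ul> carries class="list" and sits inside a <div class="wrap">
items = doc('.list')
lis = items.find('li')              # all descendant li nodes

lis = items.children()              # direct children only
lis = items.children('.active')
print(lis)

# Parent (ancestor) elements
items = doc('.list')
container = items.parents()
print(container)

parent = items.parents('.wrap')
print(parent)

# Sibling elements
li = doc('.list .item-0.active')
print(li.siblings())
print(li.siblings('.active'))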

- Getting attributes

from pyquery import PyQuery as pq
doc = pq(html)
a = doc('.item-0.active a')
print(a)
print(a.attr.href)
print(a.attr('href'))

- Getting text content

from pyquery import PyQuery as pq
doc = pq(html)
a = doc('.item-0.active a')
print(a)
print(a.text())

- Getting HTML

from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.item-0.active')
print(li)
print(li.html())