Python Web Scraping: The urllib Library

Date: 2021-08-13

1. Introduction to the urllib library

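urllib is Python's built-in HTTP request library and is enough for simple page scraping. Note that Python 2 had two libraries for sending requests, urllib and urllib2, while Python 3 has only urllib. Since Python 3 is now what almost everyone uses, knowing urllib is sufficient. Looking at the Python source, the urllib package contains five modules: request, error, parse, robotparser, and response. After going through some references I found that robotparser and response are seldom discussed, so this post covers only the other three.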

2. The request module

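As the name suggests, request is used to send requests, and by setting its parameters we can imitate a browser. Note that request here is a submodule of urllib and must not be confused with the separate third-party library requests. Before writing this post I had intended to read the request module's source closely, but on opening it I found more than 2,700 lines of code and gave up.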

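The request module sends requests mainly through urlopen() and Request(), and it also provides a set of Handler classes. The code below demonstrates them; the details of each call are in the code comments.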

Demonstrating urlopen():

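    from urllib import request
    from urllib import parse
    from urllib import error
    import socket

    if __name__ == '__main__':
        '''
        def urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
                    *, cafile=None, capath=None, cadefault=False, context=None):

        Parameters:
        url: the URL to request
        data: optional; if supplied, the dict must first be encoded to bytes, and the request method changes from GET to POST
        timeout: optional; timeout in seconds; an exception is raised if the request times out
        The remaining parameters configure certificates and SSL; the defaults are fine
        (cafile, capath and cadefault were removed in Python 3.13 in favor of context)
        '''
        # A simple request
        response_1 = request.urlopen(url="http://www.baidu.com")  # returns an http.client.HTTPResponse object
        print(response_1.read().decode("utf-8"))  # that completes a simple request
        print("Status code:", response_1.status)
        print("Response headers:", response_1.getheaders())
        print("----------------------------------separator-----------------------------------------------")
        # A more complex request
        form = {"name": "Tom"}  # renamed from dict so the built-in is not shadowed
        data = bytes(parse.urlencode(form), encoding="utf-8")
        try:
            response_2 = request.urlopen(url="http://www.httpbin.org/post", data=data, timeout=10)
        except error.URLError as e:
            if isinstance(e.reason, socket.timeout):
                print("The request timed out")
            else:
                print("The request failed:", e.reason)
        else:
            # Read the body only when the request succeeded
            print(response_2.read().decode('utf-8'))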
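To make the data parameter more concrete, here is a minimal sketch of the encoding step on its own, with no network access. The form fields are hypothetical and chosen only so that the escaping is visible.

```python
from urllib import parse

# Hypothetical form fields, used only for illustration.
form = {"name": "Tom", "city": "New York"}

# urlencode() joins the pairs into a query string and escapes special characters.
query = parse.urlencode(form)

# urlopen() expects bytes for data, so the string is encoded before it is passed on.
data = query.encode("utf-8")

print(query)
print(data)
```

Calling bytes(query, encoding="utf-8"), as in the code above, gives the same result as query.encode("utf-8").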

Building a request with Request:

    from urllib import request, parse

    if __name__ == '__main__':
        """
        Request is a class; its constructor takes the values below and builds a more capable request object.
        def __init__(self, url,
                     data=None, headers={},
                     origin_req_host=None,
                     unverifiable=False,
                     method=None):

        url: the URL to request
        data: optional; if supplied, the dict must first be encoded to bytes
        headers: optional; a dict. Setting User-Agent lets the request pose as a browser, which helps get past simple anti-scraping checks
        origin_req_host: optional; host name or IP address of the origin of the request
        unverifiable: optional; whether the request is unverifiable
        method: optional; the HTTP method, such as GET, POST or PUT
        """
        form = {"name": "Tom"}  # renamed from dict so the built-in is not shadowed
        data = bytes(parse.urlencode(form), encoding="utf-8")
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36"
        }  # pose as the Chrome browser
        req = request.Request(url="http://www.httpbin.org/post", data=data, headers=headers, method="POST")
        response = request.urlopen(req)
        print(response.read().decode("utf-8"))
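A Request object can be inspected before it is sent, which is a convenient way to check what urlopen() will actually transmit. The following is a minimal sketch that runs without network access. The helper build_post_request and the User-Agent string demo-agent/0.1 are hypothetical names introduced for this illustration.

```python
from urllib import parse, request


def build_post_request(url, form, user_agent):
    """Return a Request carrying url-encoded form data and a custom User-Agent."""
    data = parse.urlencode(form).encode("utf-8")
    return request.Request(url=url, data=data, headers={"User-Agent": user_agent})


req = build_post_request("http://www.httpbin.org/post", {"name": "Tom"}, "demo-agent/0.1")

# No method was passed, so it is inferred: POST when data is set, GET otherwise.
print(req.get_method())
# Request stores header names capitalized, so the lookup key is "User-agent".
print(req.get_header("User-agent"))
print(req.data)
print(req.full_url)
```

Passing method="POST" explicitly, as in the code above, has the same effect here and makes the intent easier to read.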

3. The error module

The error module provides two main exception classes: URLError and its subclass HTTPError.

    from urllib import request, error

    if __name__ == '__main__':
        try:
            # Try to open a website that does not exist
            response_1 = request.urlopen(
        except error.URLError as e:
            print(e.reason)
        try:
            # A request that comes back with an error status
            response_2 = request.urlopen("http://www.baidu.com/aaa.html")
        except error.HTTPError as e:
            print(e.reason)
            # 404 means the page does not exist; 500 means a server-side error
            print(e.code)
            print(e.headers)
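The two exception types can also be tried out without touching a real website. The sketch below is only an illustration under stated assumptions: it starts a throwaway local server that answers every GET with 404, then requests the same address again after the server has been closed. The class AlwaysNotFound and the function describe are hypothetical names, and the empty ProxyHandler is there only so that system proxy settings do not interfere with the local request.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import error, request


class AlwaysNotFound(BaseHTTPRequestHandler):
    """Hypothetical handler for this demo: answers every GET with 404."""

    def do_GET(self):
        self.send_error(404)

    def log_message(self, format, *args):
        pass  # silence the per-request log lines


# An empty ProxyHandler keeps any system proxy settings out of the local test.
opener = request.build_opener(request.ProxyHandler({}))


def describe(url):
    """Return a short description of what happened when opening url."""
    try:
        with opener.open(url, timeout=5) as resp:
            return f"ok {resp.status}"
    except error.HTTPError as e:  # listed first because HTTPError subclasses URLError
        return f"HTTPError {e.code}"
    except error.URLError as e:
        return f"URLError {e.reason}"


server = HTTPServer(("127.0.0.1", 0), AlwaysNotFound)  # port 0: let the OS pick a free port
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# The server responds, but with an error status: HTTPError.
not_found = describe(f"http://127.0.0.1:{port}/aaa.html")

server.shutdown()
server.server_close()

# Nothing is listening on that port any more, so the connection itself fails: URLError.
refused = describe(f"http://127.0.0.1:{port}/aaa.html")

print(not_found)
print(refused)
```

Because HTTPError is a subclass of URLError, the except clause for HTTPError has to come before the one for URLError; in the opposite order the URLError clause would catch both.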

4. The parse module

urlparse(): parses a URL string into its components

    from urllib import parse

    if __name__ == '__main__':
        # The fragment 锚点 means "anchor"
        url = "https://www.baidu.com/s;param1?ie=UTF-8&wd=python#锚点"
        result = parse.urlparse(url=url)
        print(result)
        # Output:
        # ParseResult(scheme='https', netloc='www.baidu.com', path='/s', params='param1', query='ie=UTF-8&wd=python', fragment='锚点')

urlunparse(): the inverse of urlparse(). Pass it a list of length 6 whose items are in the same order as the fields of the urlparse() result.
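As a minimal sketch of that round trip, the six components below are put together with urlunparse() and then parsed again. The component values mirror the URL used earlier, with a hypothetical ASCII fragment, anchor, substituted for readability.

```python
from urllib import parse

# The six components, in the same order as the fields of ParseResult:
# scheme, netloc, path, params, query, fragment.
parts = ["https", "www.baidu.com", "/s", "param1", "ie=UTF-8&wd=python", "anchor"]

url = parse.urlunparse(parts)
print(url)

# Parsing the rebuilt URL gives the same six components back.
round_trip = list(parse.urlparse(url))
print(round_trip == parts)
```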

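urlsplit() and urlunsplit(): essentially the same as the two functions above, except that params is not split out and stays inside path, so the result has five components instead of six.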

    from urllib import parse

    if __name__ == '__main__':
        # The fragment 锚点 means "anchor"
        url = "https://www.baidu.com/s;param1?ie=UTF-8&wd=python#锚点"
        result = parse.urlsplit(url=url)
        print(result)
        # Output:
        # SplitResult(scheme='https', netloc='www.baidu.com', path='/s;param1', query='ie=UTF-8&wd=python', fragment='锚点')

The remaining functions in the module serve similar purposes: they all parse or build URLs.
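A few of those other functions come up often when scraping, so here is a brief sketch of three of them: urljoin(), parse_qs(), and the quote()/unquote() pair. The URLs and strings are illustrative values, not taken from a real crawl.

```python
from urllib import parse

# urljoin(): resolve a relative link against the URL of the page it was found on.
joined = parse.urljoin("https://www.baidu.com/a/b.html", "c.html")
print(joined)

# parse_qs(): turn a query string into a dict that maps each key to a list of values.
params = parse.parse_qs("ie=UTF-8&wd=python")
print(params)

# quote() / unquote(): percent-encode non-ASCII text for use in a URL, and decode it again.
encoded = parse.quote("锚点")
decoded = parse.unquote(encoded)
print(encoded, decoded)
```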