A Detailed Guide to the urllib Library

Date: 2019-11-20

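urllib is Python's built-in HTTP request library.

Frankly, urllib is awkward to use and much less convenient than the requests library, which I will cover in the next post. Note that requests is built on urllib3, a third-party package that is separate from the standard-library urllib.

Still, as the most basic request library, it is well worth understanding how urllib works and how to use it.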

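It contains four modules:

urllib.request: the request module (like typing a URL into a browser and pressing Enter)

urllib.error: the exception-handling module (lets you catch errors raised by a failed request)

urllib.parse: the URL-parsing module

urllib.robotparser: the robots.txt parsing module, which determines which URLs on a site may be crawled and which may not; it is used less often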
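Since urllib.robotparser gets little attention elsewhere in this post, here is a minimal sketch of how it might be used. The robots.txt rules and the example.com URLs are made up for illustration, and the rules are fed in with parse() so that no network access is needed.

```python
from urllib import robotparser

# Hypothetical robots.txt content, supplied as a list of lines
rules = [
    'User-agent: *',
    'Disallow: /private/',
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

# can_fetch(useragent, url) reports whether the rules allow crawling the URL
blocked = rp.can_fetch('*', 'http://example.com/private/page.html')
allowed = rp.can_fetch('*', 'http://example.com/public.html')
print(blocked, allowed)
```

For a real site you would typically call set_url() with the site's robots.txt address and then read() instead of parse().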

Usage differs between Python 2 and Python 3.

In Python 2:

import urllib2

response = urllib2.urlopen('http://www.baidu.com')

In Python 3:

import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')

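Usage:

The urlopen function

urllib.request.urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT, *, cafile=None, capath=None, cadefault=False, context=None)

(This is the signature from the Python 3.5 era. The cafile, capath, and cadefault parameters were deprecated in Python 3.6 and removed in Python 3.13 in favor of context.)

The first three parameters are explained below.

The url parameter: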

from urllib import request
response = request.urlopen('http://www.baidu.com')
print(response.read().decode('utf-8'))

The data parameter:

Without the data parameter, urlopen sends a GET request. Once data is supplied, the request becomes a POST.

from urllib import parse, request

data1 = bytes(parse.urlencode({'word': 'hello'}), encoding='utf-8')
data2 = bytes(str({'word': 'hello'}), encoding='utf-8')

response1 = request.urlopen('http://httpbin.org/post', data=data1)
response2 = request.urlopen('http://httpbin.org/post', data=data2)
print(response1.read())
print(response2.read())

b'{"args":{},"data":"","files":{},"form":{"word":"hello"},"headers":{"Accept-Encoding":"identity","Connection":"close","Content-Length":"10","Content-Type":"application/x-www-form-urlencoded","Host":"httpbin.org","User-Agent":"Python-urllib/3.5"},"json":null,"origin":"113.71.243.133","url":"http://httpbin.org/post"}\n'
b'{"args":{},"data":"","files":{},"form":{"{\'word\': \'hello\'}":""},"headers":{"Accept-Encoding":"identity","Connection":"close","Content-Length":"17","Content-Type":"application/x-www-form-urlencoded","Host":"httpbin.org","User-Agent":"Python-urllib/3.5"},"json":null,"origin":"113.71.243.133","url":"http://httpbin.org/post"}\n'

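The data parameter must be of type bytes, so the payload is encoded with bytes(). When an encoding is given, the first argument of bytes() must be a str, which is why urllib.parse.urlencode is used to turn the dictionary into a string.

The requests are sent to httpbin.org, a site that provides HTTP request testing. http://httpbin.org/post is the endpoint for testing POST requests, and it echoes back information about the request it received.

Testing shows that converting the dictionary with str({'word': 'hello'}) is also accepted by urlopen, but it is not equivalent: as the second output above shows, the server treats the whole string "{'word': 'hello'}" as a single form key with an empty value. Use urlencode when the server should see word=hello.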
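The difference between the two encodings can be seen without sending anything. The sketch below only builds the two request bodies; the variable names are illustrative.

```python
from urllib import parse

payload = {'word': 'hello'}

# Proper form encoding: key=value pairs joined by '&'
form_body = parse.urlencode(payload).encode('utf-8')

# str() just produces the Python repr of the dictionary
repr_body = str(payload).encode('utf-8')

print(form_body)   # b'word=hello'
print(repr_body)   # b"{'word': 'hello'}"
print(len(form_body), len(repr_body))  # 10 17
```

The lengths 10 and 17 match the Content-Length values in the two httpbin responses shown earlier.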

The timeout parameter:

This sets a timeout in seconds. If no response arrives within that time, an exception is raised.

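import socket
import urllib.error
import urllib.request

try:
    response = urllib.request.urlopen('http://www.baidu.com', timeout=0.001)
    print(response.read())
except (urllib.error.URLError, socket.timeout):
    # A connect timeout arrives wrapped in URLError; a read timeout is raised as socket.timeout
    print('error')

With the timeout set to 0.001 seconds, no response arrives in time, so the exception is caught and error is printed.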

After sending a request with urlopen(), you get back a response object. Let's take a closer look at it:

The response object

Response type:

import urllib
from urllib import request

response = urllib.request.urlopen('http://www.baidu.com')
print(type(response))
Output:
<class 'http.client.HTTPResponse'>

Status code and response headers:

import urllib
from urllib import request

response = urllib.request.urlopen('http://www.baidu.com')
print(response.status)
print(response.getheaders())
print(response.getheader('Server'))
# getheader('Server') returns one specific response header

200
[('Bdpagetype', '1'), ('Bdqid', '0xf6ba47940001da56'), ('Cache-Control', 'private'), ('Content-Type', 'text/html'), ('Cxy_all', 'baidu+a77af89c048e9272d9feda1e4fd31907'), ('Date', 'Sat, 09 Jun 2018 04:04:51 GMT'), ('Expires', 'Sat, 09 Jun 2018 04:03:53 GMT'), ('P3p', 'CP=" OTI DSP COR IVA OUR IND COM "'), ('Server', 'BWS/1.1'), ('Set-Cookie', 'BAIDUID=B6DF381069A580546B5E16B4BE9FF3AE:FG=1; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com'), ('Set-Cookie', 'BIDUPSID=B6DF381069A580546B5E16B4BE9FF3AE; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com'), ('Set-Cookie', 'PSTM=1528517091; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com'), ('Set-Cookie', 'BDSVRTM=0; path=/'), ('Set-Cookie', 'BD_HOME=0; path=/'), ('Set-Cookie', 'H_PS_PSSID=26600_1451_21078_18559_26350_26577_20927; path=/; domain=.baidu.com'), ('Vary', 'Accept-Encoding'), ('X-Ua-Compatible', 'IE=Edge,chrome=1'), ('Connection', 'close'), ('Transfer-Encoding', 'chunked')]
BWS/1.1

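The read method:

import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')
# read() consumes the body, so call it once and reuse the result
body = response.read()
print(type(body))
print(body.decode('utf-8'))

response.read() returns the data as bytes, so it has to be decoded with decode('utf-8'). A second call to read() on the same response returns empty bytes because the body has already been consumed.

urlopen handles simple requests. What if we need to send something more complex? In urllib, that is done with a Request object.

The Request object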

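from urllib import request

# Create a Request object, passing the URL as its argument
# (named req so that it does not shadow the imported request module)
req = request.Request('http://www.baidu.com')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

Here a Request object is created with the URL as its argument, and that object is then passed to urlopen.

With a Request object you can build more complex requests, for example by adding headers.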

# Use a Request object to send a POST request
from urllib import parse, request

url = 'http://www.baidu.com'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36'
}
# Form data must be URL-encoded and then converted to bytes
data = parse.urlencode({'word': 'hello'}).encode('utf-8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

This request spells out the method, the URL, the request headers, and the request body, so the logic is easy to follow.

The Request object also has an add_header method, which is another way to add a header.
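As a rough illustration of add_header, the sketch below builds a Request without sending it and then inspects what was stored. The User-Agent string is a shortened placeholder, not a real browser string.

```python
from urllib import parse, request

data = parse.urlencode({'word': 'hello'}).encode('utf-8')
req = request.Request('http://httpbin.org/post', data=data)

# add_header(key, value) attaches one header at a time
req.add_header('User-Agent', 'Mozilla/5.0 (example)')

# Header names are stored capitalized, so look them up as 'User-agent'
print(req.get_header('User-agent'))
# With a body and no explicit method, the request defaults to POST
print(req.get_method())
```

Calling add_header is handy when headers are decided one by one, for instance inside a loop or a conditional, rather than known up front as a dictionary.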

-------------------------------------------------------------------------------------------

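Some advanced request techniques

Operations such as setting a proxy or handling cookies are implemented with handlers.

A handler is a helper tool that takes care of one such operation for us.

For example, ProxyHandler (the handler for configuring a proxy) lets requests go out through a different IP address.

Cookies make it possible to stay logged in.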
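A minimal sketch of how such handlers might be combined is shown below. The proxy address 127.0.0.1:8080 is a placeholder assumption, and nothing is actually sent; the code only assembles an opener.

```python
import http.cookiejar
import urllib.request

# Placeholder proxy address; replace it with a proxy you actually control
proxy_handler = urllib.request.ProxyHandler({'http': 'http://127.0.0.1:8080'})

# The CookieJar stores cookies from responses and sends them back on later requests
cookie_jar = http.cookiejar.CookieJar()
cookie_handler = urllib.request.HTTPCookieProcessor(cookie_jar)

# build_opener chains the handlers into a single opener object
opener = urllib.request.build_opener(proxy_handler, cookie_handler)

# opener.open('http://httpbin.org/get') would now go through the proxy
# and record any cookies the server sets in cookie_jar.
print(type(opener).__name__, len(cookie_jar))
```

The opener is used in place of urlopen: call opener.open(url) wherever you would otherwise call urllib.request.urlopen(url).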

-------------------------------------------------------------------------------------------

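Exception handling: urllib.error

Three exceptions can be caught: URLError, HTTPError (a subclass of URLError), and ContentTooShortError (raised by urlretrieve when less data is downloaded than expected).

URLError has a single attribute, reason.

HTTPError has three attributes: code, reason, and headers.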

import urllib
from urllib import request
from urllib import error

try:
    response = urllib.request.urlopen('http://123.com')
except error.URLError as e:
    print(e.reason)

import urllib
from urllib import request
from urllib import error

try:
    response = urllib.request.urlopen('http://123.com')
except error.HTTPError as e:
    print(e.reason,e.code,e.headers)
except error.URLError as e:
    print(e.reason)
else:
    print('Request succeeded!')
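Because HTTPError is a subclass of URLError, it has to be checked first, which is why the except clauses above appear in that order. The sketch below illustrates this offline by constructing the exceptions by hand; describe_error is a hypothetical helper and the example.com URL is made up.

```python
import urllib.error
from email.message import Message


def describe_error(exc):
    """Summarize a urllib exception, checking the subclass first."""
    if isinstance(exc, urllib.error.HTTPError):
        return f'HTTP {exc.code}: {exc.reason}'
    if isinstance(exc, urllib.error.URLError):
        return f'URL error: {exc.reason}'
    return f'other error: {exc}'


headers = Message()
headers['Content-Type'] = 'text/html'

# Exceptions built by hand purely for illustration; normally urlopen raises them
http_err = urllib.error.HTTPError('http://example.com/missing', 404, 'Not Found', headers, None)
url_err = urllib.error.URLError('connection refused')

print(describe_error(http_err))  # HTTP 404: Not Found
print(describe_error(url_err))   # URL error: connection refused
```

If the URLError check came first, it would also match every HTTPError and the status code would never be reported.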

URL parsing: urllib.parse

The urlparse function

This function splits the given URL into several components and assigns each one to a named field.

from urllib import parse

result = parse.urlparse('http://www.baidu.com')
print(type(result))
print(result)
Output:

<class 'urllib.parse.ParseResult'>
ParseResult(scheme='http', netloc='www.baidu.com', path='', params='', query='', fragment='')

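As the output shows, the components are: scheme (protocol), netloc (domain), path, params, query, and fragment.

urlparse takes three parameters: urlstring, scheme, and allow_fragments.

When calling urlparse, you can pass scheme='http' to specify a default scheme. If the URL already includes a scheme, the scheme parameter has no effect.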
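The behavior of scheme and allow_fragments can be checked with a short sketch like the one below. The URLs are illustrative only.

```python
from urllib import parse

# scheme is only a fallback: it applies when the URL has no scheme of its own
no_scheme = parse.urlparse('//www.baidu.com/index.html', scheme='https')
has_scheme = parse.urlparse('http://www.baidu.com/index.html', scheme='https')
print(no_scheme.scheme, no_scheme.netloc)   # https www.baidu.com
print(has_scheme.scheme)                    # http

# Without a leading '//', the host is not recognized and ends up in path
bare = parse.urlparse('www.baidu.com/index.html', scheme='https')
print(repr(bare.netloc), bare.path)         # '' www.baidu.com/index.html

# allow_fragments=False leaves '#...' attached to the preceding component
frag = parse.urlparse('http://www.baidu.com/index.html#comment', allow_fragments=False)
print(frag.path, repr(frag.fragment))       # /index.html#comment ''
```

The result is a named tuple, so fields can be read either by name (result.netloc) or by index (result[1]).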

import urllib
from urllib import parse

result = urllib.parse.urlparse('https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=1&tn=baidu&wd=python.org&oq=%25E6%2596%25B0%25E6%25B5%25AA&rsv_pq=d28ff08c000024df&rsv_t=d3d8kj5yW7d89rZNhlyrAw%2FRXjh8%2FrDWinUOKVobUbk6BVzP5U8UMplpW1w&rqlang=cn&rsv_enter=1&inputT=1213&rsv_sug3=69&rsv_sug1=63&rsv_sug7=100&bs=%E6%96%B0%E6%B5%AA')
print(type(result))
print(result)
Output:

<class 'urllib.parse.ParseResult'>
ParseResult(scheme='https', netloc='www.baidu.com', path='/s', params='', query='ie=utf-8&f=8&rsv_bp=1&rsv_idx=1&tn=baidu&wd=python.org&oq=%25E6%2596%25B0%25E6%25B5%25AA&rsv_pq=d28ff08c000024df&rsv_t=d3d8kj5yW7d89rZNhlyrAw%2FRXjh8%2FrDWinUOKVobUbk6BVzP5U8UMplpW1w&rqlang=cn&rsv_enter=1&inputT=1213&rsv_sug3=69&rsv_sug1=63&rsv_sug7=100&bs=%E6%96%B0%E6%B5%AA', fragment='')

The urlunparse function

This is the inverse of urlparse: it assembles a URL from its six components.
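A small sketch of urlunparse, assuming a made-up search URL, might look like this:

```python
from urllib import parse

# The six components, in order: scheme, netloc, path, params, query, fragment
parts = ('https', 'www.baidu.com', '/s', '', 'wd=python', '')
url = parse.urlunparse(parts)
print(url)  # https://www.baidu.com/s?wd=python

# Parsing the result and unparsing it again gives the same URL back
round_trip = parse.urlunparse(parse.urlparse(url))
print(round_trip == url)  # True
```

The argument must contain exactly six items; empty strings stand in for components that are not used.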

The urljoin function

This joins a base URL with another, possibly relative, URL.
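The following sketch shows three typical cases of urljoin. The page paths are invented for illustration.

```python
from urllib import parse

base = 'http://www.baidu.com/a/b.html'

# A relative path replaces the last segment of the base path
relative = parse.urljoin(base, 'c.html')
print(relative)   # http://www.baidu.com/a/c.html

# A path starting with '/' replaces the whole base path
rooted = parse.urljoin(base, '/c.html')
print(rooted)     # http://www.baidu.com/c.html

# A complete URL overrides the base entirely
absolute = parse.urljoin(base, 'https://www.python.org/doc')
print(absolute)   # https://www.python.org/doc
```

This is the usual way to resolve links found in a page against the URL of that page.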

The urlencode function

This converts a dictionary into a GET query string.
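As a sketch, a query string can be built with urlencode and appended to a base URL after a '?'. The parameter names below are borrowed from the Baidu search URL shown earlier, and the way they are combined is only an illustration.

```python
from urllib import parse

params = {'wd': 'python', 'ie': 'utf-8'}
query = parse.urlencode(params)
url = 'https://www.baidu.com/s?' + query
print(url)  # https://www.baidu.com/s?wd=python&ie=utf-8

# Non-ASCII values are percent-encoded as UTF-8 by default
encoded = parse.urlencode({'bs': '新浪'})
print(encoded)  # bs=%E6%96%B0%E6%B5%AA
```

The second result matches the bs parameter at the end of the long search URL parsed above.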