Accessing Network Resources in Python with the Built-in urllib Module or the Third-Party requests Library

Date: 2019-11-16


Preface

For more content, please visit my personal blog.

Python offers many ways to access network resources: urllib, urllib2, urllib3, httplib, httplib2, and requests. This article covers two of them:

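The built-in urllib module
- Pros: ships with Python, so no third-party library has to be downloaded
- Cons: verbose to use and lacks higher-level features
The third-party requests library
- Pros: makes working with URL resources very convenient
- Cons: has to be downloaded and installed separately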

The built-in `urllib` module

Making a GET request

Requests are mainly made with the urlopen() method, as follows:

from urllib import request

resp = request.urlopen('http://www.baidu.com')
print(resp.read().decode())

The result is an http.client.HTTPResponse object, and its read() method returns the data fetched from the page. Note that the data comes back as bytes, so call decode() (which defaults to UTF-8) to convert it to a string.
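The bare decode() call assumes the page is UTF-8. Below is a minimal sketch, assuming you would rather use the charset the server declares in its Content-Type header. The helper name read_text and its fallback default are illustrative and not part of urllib. The demo uses a data: URL, which urlopen() accepts, so it runs without network access.

```python
from urllib import request


def read_text(resp, default_charset='utf-8'):
    """Decode a response body using the charset from its Content-Type header.

    read_text is a hypothetical helper; it falls back to default_charset
    when the response does not declare a charset.
    """
    charset = resp.headers.get_content_charset() or default_charset
    return resp.read().decode(charset, errors='replace')


# A data: URL keeps the demo offline; an http:// URL is read the same way.
with request.urlopen('data:text/plain;charset=utf-8,hello') as resp:
    text = read_text(resp)
print(text)
```

Passing errors='replace' trades strictness for robustness: a wrongly declared charset yields replacement characters instead of a UnicodeDecodeError.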

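Making a POST request

urlopen() issues a GET request by default. When a data argument is passed to urlopen(), it issues a POST request instead. Note: data must be a bytes object.

The timeout argument sets a timeout in seconds; if the request takes longer than that, an exception is raised. For example: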

from urllib import request

resp = request.urlopen('http://www.baidu.com', data=b'word=hello', timeout=10)
print(resp.read().decode())
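Writing the POST body by hand as a bytes literal gets awkward once there are several fields or special characters. Here is a small sketch of building it with urllib.parse.urlencode() instead. The form fields and the httpbin.org endpoint are assumptions chosen for illustration, and nothing is sent until urlopen() is called.

```python
from urllib import parse, request

# Hypothetical form fields; httpbin.org/post is only an illustrative endpoint.
form = {'word': 'hello', 'lang': 'en'}

# urlencode() produces a str such as 'word=hello&lang=en';
# encode() converts it to the bytes that urlopen() requires.
body = parse.urlencode(form).encode('utf-8')

# Building the Request does not touch the network.
req = request.Request('http://httpbin.org/post', data=body)
print(req.get_method())  # reports POST because data is not None
print(req.data)

# To actually send it (needs network access):
# with request.urlopen(req, timeout=10) as resp:
#     print(resp.read().decode())
```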

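Adding headers

Requests sent through urllib carry a default header such as "User-Agent": "Python-urllib/3.6" (the version number follows the running Python version), which identifies urllib as the sender. For sites that check the User-Agent, we therefore need custom headers, and that requires the Request object from urllib.request.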

from urllib import request

url = 'http://httpbin.org/get'
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36'}

# Build a Request object from the url and headers, then pass it to urlopen()
req = request.Request(url, headers=headers)
resp = request.urlopen(req)
print(resp.read().decode())

The Request object

As shown above, urlopen() accepts not only a URL string but also a Request object, which extends what a request can do. Its signature is:

class urllib.request.Request(url, data=None, headers={},
                                origin_req_host=None,
                                unverifiable=False, 
                                method=None)

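Only the url argument is required when constructing a Request; data and headers are optional.

Finally, the method argument of Request selects the HTTP method, such as PUT or DELETE. When method is omitted, the request is a GET if data is None and a POST otherwise.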
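The way data and method interact can be checked without sending anything, since get_method() reports what a Request would use. This is a short sketch; the httpbin.org URL and the demo/0.1 User-Agent are placeholders for illustration only.

```python
from urllib import request

url = 'http://httpbin.org/anything'  # illustrative endpoint, never contacted here

# No data and no method: the request would be a GET.
get_req = request.Request(url)
print(get_req.get_method())

# With data but no method: the request would be a POST.
post_req = request.Request(url, data=b'k=v')
print(post_req.get_method())

# An explicit method takes precedence over the data-based default.
put_req = request.Request(url, data=b'k=v',
                          headers={'user-agent': 'demo/0.1'}, method='PUT')
print(put_req.get_method())

# Request stores header names capitalized, so this one is found as 'User-agent'.
print(put_req.get_header('User-agent'))
```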

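Adding cookies

To send cookie information with a request, we need to build a custom opener.

Use request.build_opener() to construct the opener, configure it with the cookies we want to send, and then call the opener's open() method to issue the request, as follows: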

from http import cookiejar
from urllib import request

url = 'https://www.baidu.com'
# Create a CookieJar object
cookie = cookiejar.CookieJar()
# Create a cookie handler with HTTPCookieProcessor
cookies = request.HTTPCookieProcessor(cookie)
# Build an opener with that handler
opener = request.build_opener(cookies)
# Use the opener to send the request
resp = opener.open(url)

# The CookieJar now holds the cookies received from Baidu
for i in cookie:
    print(i)

Alternatively, install_opener() can make this opener the global default.

From then on, every request issued with urlopen() carries these cookies.

# Install this opener as the global opener
request.install_opener(opener)
resp = request.urlopen(url)

Setting a proxy

When scraping data with a crawler, we often need a proxy to hide our real IP address, as follows:

from urllib import request

url = 'http://www.baidu.com'
proxy = {'http':'222.222.222.222:80','https':'222.222.222.222:80'}
# Create the proxy handler
proxies = request.ProxyHandler(proxy)
# Create the opener
opener = request.build_opener(proxies)

resp = opener.open(url)
print(resp.read().decode())

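Downloading data to a local file

Network requests often need to save images, audio, or other data locally. One way is to use Python's file operations and write the data returned by read() to a file.

urllib also provides urlretrieve(), which saves the fetched data straight to a file, as follows: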

from urllib import request

url = 'http://python.org/'
request.urlretrieve(url, 'python.html')

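The second argument to urlretrieve() is the path, including the file name, where the file is saved.

Note: urlretrieve() was ported directly from Python 2 and may be deprecated in a future release.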
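Given that urlretrieve() may eventually be deprecated, one possible alternative is to stream the urlopen() response into a file with shutil.copyfileobj(). This is a sketch rather than an official replacement; the download helper and the demo file name are hypothetical, and the data: URL only exists so the example runs offline.

```python
import os
import shutil
import tempfile
from urllib import request


def download(url, path, timeout=10):
    """Stream the response body at url into the file at path (hypothetical helper)."""
    with request.urlopen(url, timeout=timeout) as resp, open(path, 'wb') as out:
        # copyfileobj copies in chunks, so large files are not held in memory.
        shutil.copyfileobj(resp, out)
    return path


# Offline demo with a data: URL; pass 'http://python.org/' for a real download.
target = os.path.join(tempfile.mkdtemp(), 'demo.txt')
download('data:text/plain,hello', target)
with open(target, 'rb') as f:
    saved = f.read()
print(saved)
```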

The third-party `requests` library

Installation

requests is a third-party library, so install it first:

pip install requests

Making a GET request

Call the get() method directly, as follows:

import requests

r = requests.get('http://www.baidu.com/')
print(r.status_code)    # status code
print(r.text)   # response body

For a URL with query parameters, pass a dict as the params argument, as follows:

import requests

r = requests.get('http://www.baidu.com/', params={'q': 'python', 'cat': '1001'})
print(r.url)    # the URL actually requested
print(r.text)
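To see how params ends up in the URL without sending a request, a Request object can be prepared first. This is a small sketch using requests.Request(...).prepare(), which builds the final request locally; it assumes requests is installed.

```python
import requests

# prepare() builds the final request without sending it.
prepared = requests.Request(
    'GET', 'http://www.baidu.com/', params={'q': 'python', 'cat': '1001'}
).prepare()

# The dict is encoded into the query string of the final URL.
print(prepared.url)
```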

Another convenience of requests is that some response types, such as JSON, can be parsed directly, as follows:

r = requests.get('https://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20weather.forecast%20where%20woeid%20%3D%202151330&format=json')
r.json()

# {'query': {'count': 1, 'created': '2017-11-17T07:14:12Z', ...

Adding headers

To send HTTP headers, pass a dict as the headers argument, as follows:

r = requests.get('https://www.baidu.com/', headers={'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit'})

To read the response headers:

r.headers
# {'Content-Type': 'text/html; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Content-Encoding': 'gzip', ...}

r.headers['Content-Type']
# 'text/html; charset=utf-8'

Making a POST request

To send a POST request, just replace get() with post() and pass the POST payload as the data argument, as follows:

r = requests.post('https://accounts.baidu.com/login', data={'form_email': 'abc@example.com', 'form_password': '123456'})

By default, requests encodes POST data as application/x-www-form-urlencoded. To send JSON instead, pass the json argument directly, as follows:

params = {'key': 'value'}
r = requests.post(url, json=params)  # serialized to JSON internally
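The difference between data and json shows up in the prepared request's Content-Type header and body. The sketch below compares the two without sending anything; the httpbin.org URL is an illustrative placeholder, and requests is assumed to be installed.

```python
import json

import requests

url = 'http://httpbin.org/post'  # illustrative endpoint, never contacted here
payload = {'key': 'value'}

# data= form-encodes the dict; json= serializes it to a JSON document.
form_req = requests.Request('POST', url, data=payload).prepare()
json_req = requests.Request('POST', url, json=payload).prepare()

print(form_req.headers['Content-Type'])
print(form_req.body)
print(json_req.headers['Content-Type'])
print(json.loads(json_req.body))
```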

Uploading files

File uploads need a more complex encoding, but requests reduces this to the files argument, as follows:

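# Open the file in a with block so it is closed once the upload finishes
with open('report.xls', 'rb') as f:
    r = requests.post(url, files={'file': f})

When reading the file, be sure to open it in 'rb' (binary) mode so that the length of the bytes read equals the length of the file.

Replacing post() with put(), delete(), and so on sends the request as a PUT or DELETE.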

Adding cookies

To send cookies with a request, pass a dict as the cookies argument, as follows:

cs = {'token': '12345', 'status': 'working'}
r = requests.get(url, cookies=cs)

requests handles cookies specially, so a given cookie can be read from the response without parsing the cookie headers ourselves, as follows:

r.cookies['token']
# '12345'

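Setting a timeout

To set a timeout, pass the timeout argument in seconds. The timeout can be split into a connect timeout and a read timeout, as follows: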

try:
    # Connect timeout of 3.1 seconds, read timeout of 27 seconds
    r = requests.get(url, timeout=(3.1, 27))
except requests.exceptions.RequestException as e:
    print(e)

Retrying on timeout

import requests


def gethtml(url):
    # Try up to 3 times; return None if every attempt fails
    i = 0
    while i < 3:
        try:
            html = requests.get(url, timeout=5).text
            return html
        except requests.exceptions.RequestException:
            i += 1
    return None
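A different technique from the manual loop is to let the transport layer retry. The sketch below mounts an HTTPAdapter configured with urllib3's Retry class on a Session. The make_session helper, the retry count, the backoff factor, and the status code list are all assumptions chosen for illustration; requests and its urllib3 dependency are assumed to be installed.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_session(total=3, backoff_factor=0.5):
    """Return a Session that retries failed requests (hypothetical helper)."""
    retry = Retry(
        total=total,                     # maximum number of retries
        backoff_factor=backoff_factor,   # growing pause between attempts
        status_forcelist=[500, 502, 503, 504],  # also retry on these statuses
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    # Apply the retrying adapter to both URL schemes.
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    return session


session = make_session()
retries = session.get_adapter('http://example.com').max_retries
print(retries.total)

# Usage (needs network access):
# html = session.get('http://www.baidu.com/', timeout=5).text
```

A timeout still has to be passed on each call; the adapter only decides whether a failed attempt is repeated.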

Adding a proxy

As with headers, the proxies argument must also be a dict, as follows:

heads = {
    'User-Agent': 'Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit'
}
proxy = {
    'http': 'http://120.25.253.234:812',
    'https': 'https://163.125.222.244:8123'
}
r = requests.get('https://www.baidu.com/', headers=heads, proxies=proxy)

For more programming tutorials, follow the WeChat official account: 潘高陪你学编程
