Python HTTP handling modules: urllib and urllib2

Date: 2020-01-28


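Python 2 mainly uses two modules to handle HTTP requests: urllib and urllib2.

The urllib module:

urllib.urlopen(url[,data[,proxies]]) opens a URL and returns a file-like object, which can then be used in much the same way as a file object.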

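The object returned by urlopen provides these methods:

read() , readline() , readlines() , fileno() , close() : these are used exactly as they are on a file object

info(): returns an httplib.HTTPMessage object holding the header information returned by the remote server

getcode(): returns the HTTP status code. For an HTTP request, 200 means the request succeeded and 404 means the address was not found

geturl(): returns the URL of the resource actually retrieved, which can differ from the requested URL after a redirect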
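
The methods listed above can be tried without a network connection. The following is a minimal sketch, assuming a temporary local file opened through a file: URL; the file contents and the fallback import from urllib.request (so the snippet also runs on Python 3) are assumptions added for illustration and are not part of the original article. getcode() is left out because a local file has no HTTP status.

```python
import os
import tempfile

try:
    from urllib import urlopen, pathname2url          # Python 2, as in this article
except ImportError:
    from urllib.request import urlopen, pathname2url  # Python 3 fallback (assumption)

# Create a small local file to stand in for a remote resource.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(b"first line\nsecond line\n")

resp = urlopen("file:" + pathname2url(path))
first = resp.readline()   # reads one line, like a file object
rest = resp.read()        # reads everything that is left
headers = resp.info()     # header information for the response
final_url = resp.geturl() # URL of the resource that was opened
resp.close()
os.remove(path)

print(first)
print(headers["Content-Length"])
print(final_url)
```

With a real http:// URL the same calls apply, and getcode() then returns the status code.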

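urllib.urlencode() converts key/value pairs into a URL query string, joining the pairs with &. There is no matching urldecode() function. Note: the argument to urlencode must be a mapping such as a dictionary, or a sequence of two-element tuples.

For example: urllib.urlencode({'spam':1,'eggs':2,'bacon':0})

Result: eggs=2&bacon=0&spam=1 (the order follows the dictionary's iteration order, which is arbitrary in Python 2)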
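
When the order of the parameters matters, a list of pairs can be passed instead of a dictionary. A minimal sketch follows; the second call and its parameter names (q, lang) are hypothetical, and the fallback import from urllib.parse is an assumption added so the snippet also runs on Python 3.

```python
try:
    from urllib import urlencode          # Python 2, as in this article
except ImportError:
    from urllib.parse import urlencode    # Python 3 fallback (assumption)

# A list of (key, value) pairs is encoded in the order given.
pairs = [('spam', 1), ('eggs', 2), ('bacon', 0)]
query = urlencode(pairs)
print(query)      # spam=1&eggs=2&bacon=0

# Values are percent-encoded where needed, and spaces become '+'.
encoded = urlencode([('q', 'hello world'), ('lang', 'zh-CN')])
print(encoded)    # q=hello+world&lang=zh-CN
```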

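urllib.quote(url) and urllib.quote_plus(url) percent-encode a string so that it is safe to use inside a URL, can be printed, and is accepted by web servers. quote leaves / unescaped by default, while quote_plus also escapes / and turns spaces into +.

For example:

print urllib.quote('http://www.baidu.com')

print urllib.quote_plus('http://www.baidu.com')

The results are, respectively: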

http%3A//www.baidu.com

http%3A%2F%2Fwww.baidu.com

urllib.unquote(url) and urllib.unquote_plus(url) do exactly the reverse of the above
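
The difference between the plain and the _plus variants shows up once the input contains a space or a slash. A minimal sketch of a round trip; the sample string is made up for illustration, and the fallback import from urllib.parse is an assumption added so the snippet also runs on Python 3.

```python
try:
    from urllib import quote, quote_plus, unquote, unquote_plus        # Python 2
except ImportError:
    from urllib.parse import quote, quote_plus, unquote, unquote_plus  # Python 3 fallback (assumption)

raw = 'a b&c/d'

q = quote(raw)         # space -> %20, '/' is left alone by default
qp = quote_plus(raw)   # space -> '+', '/' is escaped as well
print(q)               # a%20b%26c/d
print(qp)              # a+b%26c%2Fd

# Each decoder reverses its own encoder.
back_q = unquote(q)
back_qp = unquote_plus(qp)

# unquote does not treat '+' as a space; only unquote_plus does.
plus_kept = unquote('a+b')
plus_decoded = unquote_plus('a+b')
```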

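The urllib2 module:

Requesting a URL directly:

urllib2.urlopen(url, data=None) fetches data by sending a request to the given URL

Building a Request object and then sending it:

urllib2.Request(url, data=None, headers={}, origin_req_host=None) builds the request; the returned req is a fully constructed request

urllib2.urlopen(req) sends the req just built and returns a file-like object, response, which contains everything the server returned

response.read() reads the HTML in the response

response.info() reads the response headers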
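
A Request object can be built and inspected before anything is sent. The sketch below assumes a placeholder URL (http://example.com/form) and a made-up User-Agent string, and does not call urlopen, so it runs offline. The fallback imports from urllib.parse and urllib.request are an assumption added so the snippet also runs on Python 3.

```python
try:
    from urllib import urlencode              # Python 2, as in this article
    from urllib2 import Request
except ImportError:
    from urllib.parse import urlencode        # Python 3 fallback (assumption)
    from urllib.request import Request

url = 'http://example.com/form'               # placeholder URL
data = urlencode([('value1', 'tkk'), ('value2', 'abcd')]).encode('ascii')
headers = {'User-Agent': 'demo-client/0.1'}   # hypothetical client name

# Supplying data turns the request into a POST.
req = Request(url, data, headers)
print(req.get_method())                       # POST
print(req.get_full_url())
print(req.get_header('User-agent'))           # header names are stored capitalized

# Without data the same URL is requested with GET.
plain = Request(url)
print(plain.get_method())                     # GET
```

Passing req to urlopen would then send it and return the response object described above.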

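Main differences:

urllib2 can accept an instance of the Request class to set the headers of a URL request, whereas urllib.urlopen accepts only a URL. This means urllib gives you no direct way to spoof your User-Agent string (to pose as a browser).
urllib provides the urlencode method for generating GET query strings, and urllib2 does not. This is why urllib is often used together with urllib2.
The advantage of urllib2 is that urllib2.urlopen accepts a Request object as its argument, which gives control over the headers of the HTTP request.
However, urllib.urlretrieve and the quote/unquote family such as urllib.quote were not added to urllib2, so urllib is still needed at times.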
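
The division of labour described above can be sketched as follows: urlencode builds the GET query string and Request carries the custom header. The base URL, the parameter names, and the User-Agent value are placeholders, and the fallback imports are an assumption added so the snippet also runs on Python 3. Nothing is sent over the network here.

```python
try:
    from urllib import urlencode              # Python 2: query string comes from urllib
    from urllib2 import Request               # Python 2: headers are set through urllib2
except ImportError:
    from urllib.parse import urlencode        # Python 3 fallback (assumption)
    from urllib.request import Request

base = 'http://example.com/search'            # placeholder URL
query = urlencode([('q', 'urllib2'), ('page', 2)])

# No data argument, so this stays a GET request.
get_req = Request(base + '?' + query,
                  headers={'User-Agent': 'demo-client/0.1'})

print(get_req.get_full_url())
print(get_req.get_method())
```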

Exception handling:

From the official documentation:

The following exceptions are raised as appropriate:

exception urllib2.URLError
The handlers raise this exception (or derived exceptions) when they run into a problem. It is a subclass of IOError.
- reason
- The reason for this error. It can be a message string or another exception instance (socket.error for remote URLs, OSError for local URLs).

exception urllib2.HTTPError
Though being an exception (a subclass of URLError), an HTTPError can also function as a non-exceptional file-like return value (the same thing that urlopen() returns). This is useful when handling exotic HTTP errors, such as requests for authentication.
- reason
- The reason for this error. It can be a message string or another exception instance.
- code
- An HTTP status code as defined in RFC 2616. This numeric value corresponds to a value found in the dictionary of codes as found in BaseHTTPServer.BaseHTTPRequestHandler.responses.

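URLError:

Has a single error attribute, reason.

URLError is raised when there is no network connection (no route to the given server) or when the server does not exist. In this case the exception carries a "reason" attribute, typically a tuple containing an error number and an error message

HTTPError:

Has two error attributes, code and reason.

Every HTTP response from the server includes a numeric "status code". Sometimes the status code indicates that the server could not fulfil the request. The default handlers deal with some of these responses for you (for example, if the response is a "redirect" that asks the client to fetch the document from another address, urllib2 handles it). For those it cannot handle, urlopen raises an HTTPError. Typical errors include "404" (page not found), "403" (request forbidden), and "401" (authentication required)

Note: except HTTPError must come first, because HTTPError is a subclass of URLError; otherwise except URLError will also catch the HTTPError.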
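
The ordering rule can be checked offline. In this sketch, fetch_status is a hypothetical helper and the two fake openers stand in for urlopen so that both exception paths can be triggered without a network; the fallback imports from urllib.request and urllib.error are an assumption added so the snippet also runs on Python 3.

```python
try:
    from urllib2 import urlopen, HTTPError, URLError   # Python 2, as in this article
except ImportError:
    from urllib.request import urlopen                 # Python 3 fallback (assumption)
    from urllib.error import HTTPError, URLError

def fetch_status(url, opener=urlopen):
    """Hypothetical helper: open url and report what happened."""
    try:
        resp = opener(url)
    except HTTPError as e:       # more specific class first
        return ('http_error', e.code)
    except URLError as e:        # everything else that urlopen raises
        return ('url_error', e.reason)
    return ('ok', resp.getcode())

# Fake openers used only to exercise both branches offline.
def raise_http(url):
    raise HTTPError(url, 404, 'Not Found', {}, None)

def raise_url(url):
    raise URLError('no route to host')

http_result = fetch_status('http://example.invalid/', raise_http)
url_result = fetch_status('http://example.invalid/', raise_url)
print(http_result)
print(url_result)
```

If the two except clauses were swapped, the 404 case would be reported as 'url_error', since an HTTPError is also a URLError.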

Example:

import urllib
import urllib2
from sys import exit

murl = "http://zhpfbk.blog.51cto.com/"
UserAgent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2896.3 Safari/537.36"

### Parameters to send, as a dict
value = {'value1':'tkk','value2':'abcd'}

### URL-encode value
data = urllib.urlencode(value)

### HTTP headers, as a dict
header = {'User-Agent':UserAgent}

### Build the request object
req = urllib2.Request(murl,data,header)
print req.get_method()

### The code below sends the request and handles errors
try:
    ### Send the request
    resp = urllib2.urlopen(req)
### Catch HTTPError; this must come before URLError. It carries two attributes, code and reason