urllib2 is a Python module for fetching URLs (Uniform Resource Locators). It offers a very simple interface, in the form of the urlopen function. This is capable of fetching URLs using a variety of different protocols. It also offers a slightly more complex interface for handling common situations - like basic authentication, cookies, proxies and so on. These are provided by objects called handlers and openers.
urllib2 supports fetching URLs for many "URL schemes" (identified by the string before the ":" in URL - for example "ftp" is the URL scheme of "ftp://python.org/") using their associated network protocols (e.g. FTP, HTTP). This tutorial focuses on the most common case, HTTP.
For straightforward situations urlopen is very easy to use. But as soon as you encounter errors or non-trivial cases when opening HTTP URLs, you will need some understanding of the HyperText Transfer Protocol. The most comprehensive and authoritative reference to HTTP is RFC 2616. This is a technical document and not intended to be easy to read. This HOWTO aims to illustrate using urllib2, with enough detail about HTTP to help you through. It is not intended to replace the urllib2 docs, but is supplementary to them.
The simplest way to use urllib2 is as follows :
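The code sample for this step appears to have been lost from this copy. A minimal sketch follows; it fetches a local file: URL so it runs without network access (with a real page you would simply call urlopen('http://python.org/')), and the Python 3 import fallback is an added convenience — the original HOWTO targets Python 2 only:

```python
import os
import tempfile

# This HOWTO targets Python 2's urllib2; the fallback below is an added
# convenience so the sketch also runs under Python 3's urllib.request.
try:
    import urllib2 as urlreq
    from urllib import pathname2url
except ImportError:
    import urllib.request as urlreq
    from urllib.request import pathname2url

# Create a small local page so the example needs no network access;
# with a real URL this would just be urlreq.urlopen('http://python.org/').
path = os.path.join(tempfile.mkdtemp(), 'page.html')
with open(path, 'w') as f:
    f.write('<html>hello</html>')

response = urlreq.urlopen('file://' + pathname2url(path))
html = response.read()
```

The object returned by urlopen behaves like a file, so .read() gives you the body of the page.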
Many uses of urllib2 will be that simple (note that instead of an 'http:' URL we could have used an URL starting with 'ftp:', 'file:', etc.). However, it's the purpose of this tutorial to explain the more complicated cases, concentrating on HTTP.
HTTP is based on requests and responses - the client makes requests and servers send responses. urllib2 mirrors this with a Request object which represents the HTTP request you are making. In its simplest form you create a Request object that specifies the URL you want to fetch. Calling urlopen with this Request object returns a response object for the URL requested. This response is a file-like object, which means you can for example call .read() on the response :
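A sketch of this Request/urlopen round-trip; as above it uses a local file: URL so it runs offline, and the Python 3 import fallback is an addition to the Python 2 original:

```python
import os
import tempfile

try:
    import urllib2 as urlreq                  # Python 2, as in this HOWTO
    from urllib import pathname2url
except ImportError:
    import urllib.request as urlreq           # Python 3 location (added)
    from urllib.request import pathname2url

# Local stand-in for a web page, so the sketch needs no network access.
path = os.path.join(tempfile.mkdtemp(), 'page.html')
with open(path, 'w') as f:
    f.write('<html>fish</html>')

req = urlreq.Request('file://' + pathname2url(path))  # a Request specifies the URL
response = urlreq.urlopen(req)                        # returns a file-like object
the_page = response.read()                            # so .read() works on it
```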
Note that urllib2 makes use of the same Request interface to handle all URL schemes. For example, you can make an FTP request like so :
In the case of HTTP, there are two extra things that Request objects allow you to do: First, you can pass data to be sent to the server. Second, you can pass extra information ("metadata") about the data or about the request itself, to the server - this information is sent as HTTP "headers". Let's look at each of these in turn.
Sometimes you want to send data to a URL (often the URL will refer to a CGI (Common Gateway Interface) script [1] or other web application). With HTTP, this is often done using what's known as a POST request. This is often what your browser does when you submit a HTML form that you filled in on the web. Not all POSTs have to come from forms: you can use a POST to transmit arbitrary data to your own application. In the common case of HTML forms, the data needs to be encoded in a standard way, and then passed to the Request object as the data argument. The encoding is done using a function from the urllib library not from urllib2.
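A sketch of encoding form data for a POST; the URL and form fields are illustrative only, and the Python 3 import fallback is an addition to the Python 2 original:

```python
try:
    import urllib2 as urlreq
    from urllib import urlencode              # Python 2: the encoder lives in urllib
except ImportError:
    import urllib.request as urlreq           # Python 3 locations (added)
    from urllib.parse import urlencode

# Hypothetical form fields and URL, for illustration only.
values = {'name': 'Somebody Here',
          'location': 'Northampton',
          'language': 'Python'}
data = urlencode(values)                      # standard form encoding
req = urlreq.Request('http://www.example.com/cgi-bin/register.cgi',
                     data.encode('ascii'))    # passing data makes this a POST
```

Calling urlopen(req) would then submit the encoded form; because the Request carries data, it is sent as a POST rather than a GET.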
Note that other encodings are sometimes required (e.g. for file upload from HTML forms - see HTML Specification, Form Submission for more details).
If you do not pass the data argument, urllib2 uses a GET request. One way in which GET and POST requests differ is that POST requests often have "side-effects": they change the state of the system in some way (for example by placing an order with the website for a hundredweight of tinned spam to be delivered to your door). Though the HTTP standard makes it clear that POSTs are intended to always cause side-effects, and GET requests never to cause side-effects, nothing prevents a GET request from having side-effects, nor a POST request from having no side-effects. Data can also be passed in an HTTP GET request by encoding it in the URL itself.
This is done as follows.
>>> import urllib2
>>> import urllib
>>> data = {}
>>> data['name'] = 'Somebody Here'
>>> data['location'] = 'Northampton'
>>> data['language'] = 'Python'
>>> url_values = urllib.urlencode(data)
>>> print url_values
name=Somebody+Here&language=Python&location=Northampton
>>> url = 'http://www.example.com/example.cgi'
>>> full_url = url + '?' + url_values
>>> data = urllib2.urlopen(full_url)
Notice that the full URL is created by adding a ? to the URL, followed by the encoded values.
We'll discuss here one particular HTTP header, to illustrate how to add headers to your HTTP request.
Some websites [2] dislike being browsed by programs, or send different versions to different browsers [3]. By default urllib2 identifies itself as Python-urllib/x.y (where x and y are the major and minor version numbers of the Python release, e.g. Python-urllib/2.5), which may confuse the site, or just plain not work. The way a browser identifies itself is through the User-Agent header [4]. When you create a Request object you can pass a dictionary of headers in. The following example makes the same request as above, but identifies itself as a version of Internet Explorer [5].
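The example itself appears to be missing from this copy; a sketch follows. The URL, form fields and User-Agent string are illustrative, and the Python 3 import fallback is an addition to the Python 2 original:

```python
try:
    import urllib2 as urlreq
    from urllib import urlencode
except ImportError:
    import urllib.request as urlreq           # Python 3 locations (added)
    from urllib.parse import urlencode

url = 'http://www.example.com/cgi-bin/register.cgi'   # illustrative URL
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {'name': 'Somebody Here',
          'location': 'Northampton',
          'language': 'Python'}
headers = {'User-Agent': user_agent}          # headers passed as a dictionary

data = urlencode(values)
req = urlreq.Request(url, data.encode('ascii'), headers)
```

Calling urlopen(req) would send the POST with the given User-Agent header instead of the default Python-urllib/x.y.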
The response also has two useful methods. See the section on info and geturl which comes after we have a look at what happens when things go wrong.
urlopen raises URLError when it cannot handle a response (though as usual with Python APIs, built-in exceptions such as ValueError, TypeError etc. may also be raised). HTTPError is the subclass of URLError raised in the specific case of HTTP URLs.
Often, URLError is raised because there is no network connection (no route to the specified server), or the specified server doesn't exist. In this case, the exception raised will have a 'reason' attribute, which is a tuple containing an error code and a text error message.
>>> req = urllib2.Request('http://www.pretend_server.org')
>>> try: urllib2.urlopen(req)
>>> except URLError, e:
>>>     print e.reason
>>>
(4, 'getaddrinfo failed')
Every HTTP response from the server contains a numeric "status code". Sometimes the status code indicates that the server is unable to fulfil the request. The default handlers will handle some of these responses for you (for example, if the response is a "redirection" that requests the client fetch the document from a different URL, urllib2 will handle that for you). For those it can't handle, urlopen will raise an HTTPError. Typical errors include '404' (page not found), '403' (request forbidden), and '401' (authentication required).
See section 10 of RFC 2616 for a reference on all the HTTP error codes. The HTTPError instance raised will have an integer 'code' attribute, which corresponds to the error sent by the server.
Because the default handlers handle redirects (codes in the 300 range), and codes in the 100-299 range indicate success, you will usually only see error codes in the 400-599 range.
BaseHTTPServer.BaseHTTPRequestHandler.responses is a useful dictionary of response codes that shows all the response codes used by RFC 2616. The dictionary is reproduced here for convenience :
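That dictionary can be inspected directly; a small sketch (the Python 3 location of the class, http.server, is an added fallback — the original is Python 2):

```python
try:
    from BaseHTTPServer import BaseHTTPRequestHandler   # Python 2
except ImportError:
    from http.server import BaseHTTPRequestHandler      # Python 3 location (added)

# Maps each numeric status code to a (short message, long explanation) pair.
responses = BaseHTTPRequestHandler.responses
short_message, long_message = responses[404]
```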
When an error is raised the server responds by returning an HTTP error code and an error page. You can use the HTTPError instance as a response on the page returned. This means that as well as the code attribute, it also has read, geturl, and info methods.
>>> req = urllib2.Request('http://www.python.org/fish.html')
>>> try:
>>>     urllib2.urlopen(req)
>>> except URLError, e:
>>>     print e.code
>>>     print e.read()
>>>
404
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
 "http://www.w3.org/TR/html4/loose.dtd">
<?xml-stylesheet href="./css/ht2html.css" type="text/css"?>
<html><head><title>Error 404: File Not Found</title>
...... etc...
So if you want to be prepared for HTTPError or URLError there are two basic approaches. I prefer the second approach.
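The code for the first approach appears to have been dropped from this copy. A compact offline sketch of its central point — catch HTTPError before URLError, since it is a subclass — follows; the describe helper and its messages are illustrative, and except ... as syntax is used (an addition) so the sketch also parses under Python 3:

```python
try:
    from urllib2 import URLError, HTTPError          # Python 2
except ImportError:
    from urllib.error import URLError, HTTPError     # Python 3 location (added)

def describe(exc):
    # Hypothetical helper: re-raise and handle in "approach number 1" style.
    # HTTPError must come first, otherwise the URLError clause
    # would also catch every HTTPError.
    try:
        raise exc
    except HTTPError as e:
        return "The server couldn't fulfill the request (code %d)." % e.code
    except URLError as e:
        return 'We failed to reach a server: %s' % e.reason
```

For example, describe(HTTPError('http://example.com/', 404, 'Not Found', None, None)) reports the code, while a plain URLError falls through to the second clause.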
Note
The except HTTPError must come first, otherwise except URLError will also catch an HTTPError.
Note
URLError is a subclass of the built-in exception IOError.
This means that you can avoid importing URLError and use :
from urllib2 import Request, urlopen

req = Request(someurl)
try:
    response = urlopen(req)
except IOError, e:
    if hasattr(e, 'reason'):
        print 'We failed to reach a server.'
        print 'Reason: ', e.reason
    elif hasattr(e, 'code'):
        print 'The server couldn\'t fulfill the request.'
        print 'Error code: ', e.code
else:
    # everything is fine
    pass
Under rare circumstances urllib2 can raise socket.error.
The response returned by urlopen (or the HTTPError instance) has two useful methods, info and geturl.
geturl - this returns the real URL of the page fetched. This is useful because urlopen (or the opener object used) may have followed a redirect. The URL of the page fetched may not be the same as the URL requested.
info - this returns a dictionary-like object that describes the page fetched, particularly the headers sent by the server. It is currently an httplib.HTTPMessage instance.
Typical headers include 'Content-length', 'Content-type', and so on. See the Quick Reference to HTTP Headers for a useful listing of HTTP headers with brief explanations of their meaning and use.
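A sketch of both methods, again against a local file: URL so it runs offline (the Python 3 import fallback is an addition to the Python 2 original):

```python
import os
import tempfile

try:
    import urllib2 as urlreq                  # Python 2
    from urllib import pathname2url
except ImportError:
    import urllib.request as urlreq           # Python 3 locations (added)
    from urllib.request import pathname2url

# Local stand-in for a web page, so the sketch needs no network access.
body = '<html>hello</html>'
path = os.path.join(tempfile.mkdtemp(), 'page.html')
with open(path, 'w') as f:
    f.write(body)

response = urlreq.urlopen('file://' + pathname2url(path))
final_url = response.geturl()     # the URL actually fetched (after any redirect)
headers = response.info()         # dictionary-like headers, e.g. Content-length
```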
When you fetch a URL you use an opener (an instance of the perhaps confusingly-named urllib2.OpenerDirector). Normally we have been using the default opener - via urlopen - but you can create custom openers. Openers use handlers. All the "heavy lifting" is done by the handlers. Each handler knows how to open URLs for a particular URL scheme (http, ftp, etc.), or how to handle an aspect of URL opening, for example HTTP redirections or HTTP cookies.
You will want to create openers if you want to fetch URLs with specific handlers installed, for example to get an opener that handles cookies, or to get an opener that does not handle redirections.
To create an opener, instantiate an OpenerDirector, and then call .add_handler(some_handler_instance) repeatedly.
Alternatively, you can use build_opener, which is a convenience function for creating opener objects with a single function call. build_opener adds several handlers by default, but provides a quick way to add more and/or override the default handlers.
Other sorts of handlers you might want can handle proxies, authentication, and other common but slightly specialised situations.
install_opener can be used to make an opener object the (global) default opener. This means that calls to urlopen will use the opener you have installed.
Opener objects have an open method, which can be called directly to fetch urls in the same way as the urlopen function: there's no need to call install_opener, except as a convenience.
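A short sketch of build_opener and install_opener; the handler passed in is just the standard HTTPHandler, and the Python 3 import fallback is an addition to the Python 2 original:

```python
try:
    import urllib2 as urlreq                  # Python 2
except ImportError:
    import urllib.request as urlreq           # Python 3 location (added)

# build_opener sets up the default handler chain and lets you
# add or override handlers in a single call.
opener = urlreq.build_opener(urlreq.HTTPHandler())

# Make it the global default: urlopen now routes through this opener.
urlreq.install_opener(opener)
```

You could equally skip install_opener and call opener.open(url) directly.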
To illustrate creating and installing a handler we will use the HTTPBasicAuthHandler. For a more detailed discussion of this subject - including an explanation of how Basic Authentication works - see the Basic Authentication Tutorial.
When authentication is required, the server sends a header (as well as the 401 error code) requesting authentication. This specifies the authentication scheme and a 'realm'. The header looks like: Www-authenticate: SCHEME realm="REALM". e.g.
Www-authenticate: Basic realm="cPanel Users"
The client should then retry the request with the appropriate name and password for the realm included as a header in the request. This is 'basic authentication'. In order to simplify this process we can create an instance of HTTPBasicAuthHandler and an opener to use this handler.
The HTTPBasicAuthHandler uses an object called a password manager to handle the mapping of URLs and realms to passwords and usernames. If you know what the realm is (from the authentication header sent by the server), then you can use a HTTPPasswordMgr. Frequently one doesn't care what the realm is. In that case, it is convenient to use HTTPPasswordMgrWithDefaultRealm. This allows you to specify a default username and password for a URL. This will be supplied in the absence of you providing an alternative combination for a specific realm. We indicate this by providing None as the realm argument to the add_password method.
The top-level URL is the first URL that requires authentication. URLs "deeper" than the URL you pass to .add_password() will also match.
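A sketch of the password-manager setup; the credentials and top-level URL are hypothetical, and the Python 3 import fallback is an addition to the Python 2 original:

```python
try:
    import urllib2 as urlreq                  # Python 2
except ImportError:
    import urllib.request as urlreq           # Python 3 location (added)

# Hypothetical credentials and top-level URL, for illustration only.
password_mgr = urlreq.HTTPPasswordMgrWithDefaultRealm()
top_level_url = 'http://example.com/foo/'
# None as the realm argument means "use these as a default for any realm".
password_mgr.add_password(None, top_level_url, 'alice', 'secret')

handler = urlreq.HTTPBasicAuthHandler(password_mgr)
opener = urlreq.build_opener(handler)   # an opener that answers 401 challenges
```

Note that URLs "deeper" than the one registered also match, so a request for http://example.com/foo/bar/ would use the same credentials.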
Note
In the above example we only supplied our HTTPBasicAuthHandler to build_opener. By default openers have the handlers for normal situations - ProxyHandler, UnknownHandler, HTTPHandler, HTTPDefaultErrorHandler, HTTPRedirectHandler, FTPHandler, FileHandler, HTTPErrorProcessor.
top_level_url is in fact either a full URL (including the 'http:' scheme component and the hostname and optionally the port number) e.g. "http://example.com/" or an "authority" (i.e. the hostname, optionally including the port number) e.g. "example.com" or "example.com:8080" (the latter example includes a port number). The authority, if present, must NOT contain the "userinfo" component - for example "joe:password@example.com" is not correct.
urllib2 will auto-detect your proxy settings and use those. This is through the ProxyHandler which is part of the normal handler chain. Normally that's a good thing, but there are occasions when it may not be helpful [6]. One way to disable the proxy is to set up our own ProxyHandler, with no proxies defined. This is done using similar steps to setting up a Basic Authentication handler :
>>> proxy_support = urllib2.ProxyHandler({})
>>> opener = urllib2.build_opener(proxy_support)
>>> urllib2.install_opener(opener)
Note
Currently urllib2 does not support fetching of https locations through a proxy. This can be a problem.
The Python support for fetching resources from the web is layered. urllib2 uses the httplib library, which in turn uses the socket library.
As of Python 2.3 you can specify how long a socket should wait for a response before timing out. This can be useful in applications which have to fetch web pages. By default the socket module has no timeout and can hang. Currently, the socket timeout is not exposed at the httplib or urllib2 levels. However, you can set the default timeout globally for all sockets using :
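The code sample for this step appears to be missing from this copy; a minimal sketch (the timeout value of 10 seconds is illustrative):

```python
import socket

# Set a global default timeout (in seconds) for all new sockets;
# a subsequent urllib2.urlopen call will then give up after this
# long instead of hanging indefinitely.
timeout = 10
socket.setdefaulttimeout(timeout)
```

Any urlopen call made after this will inherit the timeout, since urllib2 ultimately opens sockets via the socket module.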