Python Web Scraping from Scratch (Part 2): The urllib Library

Date: 2019-11-24


Picking up where the previous post left off, this time we look at the urllib library.

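1. What is urllib?

urllib is Python's built-in HTTP request library. It is made up of four modules:

    urllib.request      request module
    urllib.error        exception handling module
    urllib.parse        URL parsing module
    urllib.robotparser  robots.txt parsing module

It ships with Python, so nothing extra needs to be installed.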

Note:

Python 2:

    import urllib2
    response = urllib2.urlopen('http://baidu.com')

Python 3:

    import urllib.request
    response = urllib.request.urlopen('http://www.baidu.com')

Python 2 and Python 3 use the library somewhat differently.

2. Methods and modules

1) request

Basic usage (a GET request):

    import urllib.request

    response = urllib.request.urlopen('http://www.baidu.com')
    print(response.read().decode('utf-8'))

Running this prints a long block of text, the HTML of the page. The data we need will later be extracted from this text.

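Sending request parameters (a POST request):

    import urllib.parse
    import urllib.request

    # urlencode() returns a str; the request body has to be bytes
    data = bytes(urllib.parse.urlencode({'username': 'cainiao'}), encoding='utf8')
    response = urllib.request.urlopen('http://httpbin.org/post', data=data)
    print(response.read())

The response shows that the username parameter we put in data was received by the server.

Note that data must be of type bytes.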
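To make the bytes requirement concrete, here is a minimal offline sketch: it builds the request object but never sends it. The variable names form, encoded and req are illustrative and not part of the original post.

```python
from urllib import parse, request

form = {'username': 'cainiao'}

# urlencode() turns the dict into a str query string ...
encoded = parse.urlencode(form)
# ... which must be encoded to bytes before it can be used as a request body.
data = encoded.encode('utf-8')

# Building the Request object does not touch the network.
req = request.Request('http://httpbin.org/post', data=data)

print(type(encoded).__name__, encoded)
print(type(data).__name__, data)
# With a body attached and no explicit method, urllib sends a POST.
print(req.get_method())
```

Calling encoded.encode('utf-8') is equivalent to the bytes(..., encoding='utf8') form used above; which one to use is a matter of taste.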

Setting a request timeout:

    import urllib.request

    response = urllib.request.urlopen('http://httpbin.org/get', timeout=1)
    print(response.read())

If the server does not answer within the timeout, the call fails with a "timed out" error. We can catch it with the urllib.error module:

    import urllib.error
    import urllib.request

    try:
        response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
        print(response.read())
    except urllib.error.URLError as e:
        # The error type is not checked here; inspect e.reason if you need
        # to tell a timeout apart from other failures before printing.
        print('Connection timed out!')
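The error-type check mentioned in the comment could look roughly like the sketch below. The helper name is_timeout is hypothetical, and the exceptions are built by hand so the example runs without a network connection.

```python
import socket
import urllib.error


def is_timeout(exc):
    """Return True if exc looks like a timeout raised by urlopen()."""
    # A timeout while connecting is usually wrapped in URLError.reason.
    if isinstance(exc, urllib.error.URLError):
        return isinstance(exc.reason, socket.timeout)
    # A timeout while reading the body can surface as a bare socket.timeout.
    return isinstance(exc, socket.timeout)


# Offline demonstration with hand-built exceptions (no request is sent).
wrapped = urllib.error.URLError(socket.timeout('timed out'))
other = urllib.error.URLError('connection refused')
print(is_timeout(wrapped))
print(is_timeout(other))
```

Inside the except block above, `if is_timeout(e):` would then decide whether to print the timeout message or report e.reason instead.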

That covers the simplest, most basic use of urlopen. Readers who want to go further can read its source code for a deeper understanding of the function.

2) Response

    import urllib.request

    response = urllib.request.urlopen('http://www.baidu.com')
    print(type(response))
    # prints <class 'http.client.HTTPResponse'>

Status code and headers:

    import urllib.request

    response = urllib.request.urlopen('http://www.baidu.com')
    print(type(response))                # response type
    print(response.status)               # status code, covered in the previous post
    print(response.getheaders())         # all response headers
    print(response.getheader('Server'))  # value of a single response header

Response body:

    import urllib.request

    response = urllib.request.urlopen('http://www.baidu.com')
    print(response.read().decode('utf-8'))  # response body

The response body arrives as a byte stream, so we call decode('utf-8') to turn it into text.
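Not every page is UTF-8, so one option is to read the charset from the Content-Type header instead of hard-coding it. The sketch below assumes a helper named decode_body (hypothetical) and uses a hand-built email.message.Message as a stand-in for response.headers, which is a subclass of that class; no request is sent.

```python
from email.message import Message


def decode_body(body, headers, default='utf-8'):
    """Decode raw response bytes using the charset from Content-Type."""
    charset = headers.get_content_charset() or default
    # errors='replace' keeps going if a few bytes do not match the charset.
    return body.decode(charset, errors='replace')


# Offline stand-ins for response.read() and response.headers.
fake_headers = Message()
fake_headers['Content-Type'] = 'text/html; charset=gbk'
fake_body = '你好'.encode('gbk')

print(decode_body(fake_body, fake_headers))
```

With a real response this would be called as decode_body(response.read(), response.headers).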

A common way to write a POST request:

    from urllib import request, parse

    url = 'http://httpbin.org/post'
    headers = {
        'User-Agent': 'Mozilla/4.0(compatible;MSIE 5.5;Windows NT)',
        'Host': 'httpbin.org'
    }
    # named form rather than dict, to avoid shadowing the built-in
    form = {
        'name': 'cxiaocai'
    }
    data = bytes(parse.urlencode(form), encoding='utf8')
    req = request.Request(url=url, data=data, headers=headers, method='POST')
    response = request.urlopen(req)
    print(response.read().decode('utf-8'))

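It can also be written like this, adding the header after the Request has been created:

    from urllib import request, parse

    url = 'http://httpbin.org/post'
    form = {
        'name': 'cxiaocai'
    }
    data = bytes(parse.urlencode(form), encoding='utf8')
    # no headers argument here; the header is added with add_header() below
    req = request.Request(url=url, data=data, method='POST')
    # add_header() takes the name and the value as two separate arguments
    req.add_header('User-Agent', 'Mozilla/4.0(compatible;MSIE 5.5;Windows NT)')
    response = request.urlopen(req)
    print(response.read().decode('utf-8'))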

With this, the basic urllib request workflow is covered, and a large share of websites can already be scraped.
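A Request object can be inspected before it is sent, which is a handy way to check what urllib is about to transmit. The following is a small offline sketch; nothing here is sent over the network, and the lookups use the capitalised form of the header name because that is how Request stores it.

```python
from urllib import parse, request

url = 'http://httpbin.org/post'
data = parse.urlencode({'name': 'cxiaocai'}).encode('utf-8')

req = request.Request(url, data=data, method='POST')
req.add_header('User-Agent', 'Mozilla/4.0(compatible;MSIE 5.5;Windows NT)')

# Request stores header names via str.capitalize(), so 'User-Agent'
# is kept as 'User-agent' and must be looked up under that spelling.
print(req.get_method())
print(req.has_header('User-agent'))
print(req.header_items())
```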

3. Proxy settings

Proxies are only covered briefly here; a later post will show them in a real scraper.

Proxy via a handler:

    import urllib.request

    proxy_handler = urllib.request.ProxyHandler({
        'http': 'http://127.0.0.1:1111',
        'https': 'https://127.0.0.1:2222'
    })
    opener = urllib.request.build_opener(proxy_handler)
    response = opener.open('http://www.baidu.com')
    print(response.read())  # I have no proxy available, so this code was not tested.
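Since the proxy code above could not be tested against a live proxy, the sketch below only checks that the opener is wired up as intended. The helper name make_proxy_opener is hypothetical and the addresses are the placeholders from the example; nothing is contacted until opener.open() is called.

```python
import urllib.request


def make_proxy_opener(http_proxy, https_proxy):
    """Build an opener that routes http and https traffic through proxies."""
    handler = urllib.request.ProxyHandler({
        'http': http_proxy,
        'https': https_proxy,
    })
    return urllib.request.build_opener(handler)


# Placeholder addresses; no connection is made by building the opener.
opener = make_proxy_opener('http://127.0.0.1:1111', 'https://127.0.0.1:2222')
proxy_handlers = [h for h in opener.handlers
                  if isinstance(h, urllib.request.ProxyHandler)]
print(proxy_handlers[0].proxies)
```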

Cookie handling

    import http.cookiejar
    import urllib.request

    cookie = http.cookiejar.CookieJar()
    handler = urllib.request.HTTPCookieProcessor(cookie)
    opener = urllib.request.build_opener(handler)
    response = opener.open('http://www.baidu.com')
    for item in cookie:
        print(item.name + '=' + item.value)

Some websites require a login, for example, which is why we need to handle cookies here.

Cookies can also be saved to a text file so they can be read back on later runs.

    import http.cookiejar
    import urllib.request

    filename = 'cookie.txt'
    cookie = http.cookiejar.MozillaCookieJar(filename)
    handler = urllib.request.HTTPCookieProcessor(cookie)
    opener = urllib.request.build_opener(handler)
    response = opener.open('http://www.baidu.com')
    cookie.save(ignore_discard=True, ignore_expires=True)

After the code runs, a cookie.txt file is created in the project directory.

Another cookie file format:

    import http.cookiejar
    import urllib.request

    filename = 'cookie.txt'
    cookie = http.cookiejar.LWPCookieJar(filename)
    handler = urllib.request.HTTPCookieProcessor(cookie)
    opener = urllib.request.build_opener(handler)
    response = opener.open('http://www.baidu.com')
    cookie.save(ignore_discard=True, ignore_expires=True)

Running this code also produces a text file, this time in the libwww-perl (Set-Cookie3) format.

　
Next, we read back the cookie file we just saved:

    import http.cookiejar
    import urllib.request

    # use the same jar class that wrote the file
    cookie = http.cookiejar.LWPCookieJar()
    cookie.load('cookie.txt', ignore_expires=True, ignore_discard=True)
    handler = urllib.request.HTTPCookieProcessor(cookie)
    opener = urllib.request.build_opener(handler)
    response = opener.open('http://www.baidu.com')
    print(response.read().decode('utf-8'))
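The save/load round trip can be tried without visiting any site. The sketch below builds one session cookie by hand (the helper make_cookie, the cookie name and value, and the domain example.com are all made up for illustration), writes it to a temporary file, and loads it back twice to show what ignore_discard changes.

```python
import http.cookiejar
import os
import tempfile


def make_cookie(name, value, domain):
    """Build a simple session cookie (no expiry, so discard=True)."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path='/', path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


path = os.path.join(tempfile.mkdtemp(), 'cookie.txt')

jar = http.cookiejar.LWPCookieJar(path)
jar.set_cookie(make_cookie('session', 'abc123', 'example.com'))
jar.save(ignore_discard=True, ignore_expires=True)

# With ignore_discard=True the session cookie is read back.
loaded = http.cookiejar.LWPCookieJar()
loaded.load(path, ignore_discard=True, ignore_expires=True)
names = [c.name for c in loaded]

# Without it, cookies marked as session-only are skipped on load.
strict = http.cookiejar.LWPCookieJar()
strict.load(path)
strict_names = [c.name for c in strict]

print(names, strict_names)
```

This is why the examples above pass ignore_discard=True and ignore_expires=True: login cookies are often session cookies, which would otherwise not survive the round trip.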

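4. Exception handling

A simple example: here we request a page that does not exist.

    from urllib import request, error

    try:
        response = request.urlopen('https://www.cnblogs.com/cxiaocai/articles/index123.html')
    except error.URLError as e:
        print(e.reason)

We know this page does not exist, so the request raises an error. Catching the exception lets the program keep running, and we can then retry the request if we want to.
The official documentation is here: https://docs.python.org/3/library/urllib.error.html#module-urllib.error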
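The retry idea mentioned above could be sketched as follows. This is one possible shape, not the author's code: the function name fetch_with_retry, its parameters, and the injectable fetch callable are assumptions made so the example can be run offline. HTTPError is a subclass of URLError, so it has to be caught first.

```python
import time
from urllib import error, request


def fetch_with_retry(url, retries=3, delay=1.0, fetch=None):
    """Return the body of url, retrying on URLError up to `retries` times."""
    if fetch is None:
        # Default fetcher: a plain urlopen() call with a timeout.
        def fetch(u):
            with request.urlopen(u, timeout=5) as response:
                return response.read()
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch(url)
        except error.HTTPError:
            # The server answered with an error status (e.g. 404);
            # this sketch does not retry those and re-raises at once.
            raise
        except error.URLError as e:
            # No usable answer (DNS failure, refused connection, timeout).
            last_error = e
            if attempt < retries:
                time.sleep(delay)
    raise last_error


# Offline demonstration: a stand-in fetcher that fails twice, then succeeds.
calls = []


def flaky_fetch(u):
    calls.append(u)
    if len(calls) < 3:
        raise error.URLError('temporarily unreachable')
    return b'ok'


print(fetch_with_retry('http://example.invalid/', delay=0, fetch=flaky_fetch))
```

In real use the fetch argument is simply left out, and the default urlopen-based fetcher is used.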

5. URL parsing

urlparse

This function is mainly used to parse URLs. Let's start with a simple example:

    from urllib.parse import urlparse

    result = urlparse('https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=1')
    print(type(result), result)

The result is a ParseResult whose fields hold the individual components: urlparse splits a URL into its parts.
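As a small sketch of what the split looks like, the fields of the result can be read by name. The use of parse_qs to break the query string down further is an addition for illustration and is not part of the original example.

```python
from urllib.parse import parse_qs, urlparse

result = urlparse('https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=1')

# ParseResult is a named tuple: fields are available by name or by index.
print(result.scheme)
print(result.netloc)
print(result.path)
print(result.query)

# parse_qs() turns the query string into a dict of lists.
print(parse_qs(result.query))
```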
The scheme argument supplies a default scheme (http or https) for URLs that do not include one:

    from urllib.parse import urlparse

    result = urlparse('www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=1', scheme='https')
    print(result)

In the output, the scheme has been set to https.
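Two details of the scheme argument are easy to miss: it is only a default, and a URL without a leading '//' has its host name parsed as part of the path. A short sketch comparing the three cases (the shortened query string is just for readability):

```python
from urllib.parse import urlparse

# No scheme and no leading '//': the default scheme is applied, but the
# host name is treated as part of the path and netloc stays empty.
bare = urlparse('www.baidu.com/s?ie=utf-8', scheme='https')

# A leading '//' marks the host, so netloc is filled in.
with_slashes = urlparse('//www.baidu.com/s?ie=utf-8', scheme='https')

# A scheme written in the URL itself wins over the default.
explicit = urlparse('http://www.baidu.com/s?ie=utf-8', scheme='https')

for r in (bare, with_slashes, explicit):
    print(r.scheme, repr(r.netloc), r.path)
```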
The result also contains a fragment field. To keep the fragment attached to the query instead of splitting it out, pass allow_fragments=False:

    from urllib.parse import urlparse

    result = urlparse('http://www.baidu.com/index.html;user?id=5#commont', allow_fragments=False)
    print(result)

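In the output, the fragment field is empty and the '#...' part stays attached to the query. If the URL has no query, it stays attached to the preceding component instead, for example the path.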
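Both cases can be compared side by side. The two URLs below are simplified variants of the one above, chosen for illustration:

```python
from urllib.parse import urlparse

with_query = 'http://www.baidu.com/index.html?id=5#comment'
without_query = 'http://www.baidu.com/index.html#comment'

for url in (with_query, without_query):
    default = urlparse(url)
    kept = urlparse(url, allow_fragments=False)
    # Default parsing: the fragment gets its own field.
    print(default.path, default.query, default.fragment)
    # allow_fragments=False: '#comment' stays on the query or the path.
    print(kept.path, kept.query, repr(kept.fragment))
```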

　　
Now we know how to split a URL. Going the other way, if we have the parts of a URL as a list, for example data = ['http','www.baidu.com','index.html','user','a=6','comment'], we can assemble them with urlunparse.
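A minimal sketch of that call, using the list from the text:

```python
from urllib.parse import urlparse, urlunparse

# Six parts, in order: scheme, netloc, path, params, query, fragment.
data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
url = urlunparse(data)
print(url)

# Parsing the result gives the components back (the path gains a leading '/').
print(urlparse(url))
```

The sequence must contain exactly six items; use an empty string for any part that is not needed.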


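There is also urljoin, which is mainly used to join URLs.

The second URL takes precedence: any component it has is kept, and any component it lacks is taken from the first (base) URL.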
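The rule can be seen in a few cases. The base URL and the second arguments below are made up for illustration:

```python
from urllib.parse import urljoin

base = 'http://www.baidu.com/a/b.html'

# Relative path: resolved against the directory of the base URL.
print(urljoin(base, 'c.html'))
# Absolute path: replaces the base path, keeps scheme and host.
print(urljoin(base, '/FAQ.html'))
# Full URL: nothing is taken from the base.
print(urljoin(base, 'https://www.python.org/doc/'))
# Query only: keeps the base path and adds the query.
print(urljoin(base, '?page=2'))
```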
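If we have a dict of parameters and a URL and want to send a GET request (passing parameters in a GET request was covered in the previous post), urlencode can turn the dict into a query string that is appended to the URL.

Note that you have to add the '?' after the URL yourself.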
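A minimal sketch of building such a GET URL. The parameter names and values are illustrative:

```python
from urllib.parse import urlencode

params = {'name': 'cxiaocai', 'age': 18}
base_url = 'http://www.baidu.com?'  # note the trailing '?'

url = base_url + urlencode(params)
print(url)

# Values are percent-encoded as needed; spaces become '+'.
print(urlencode({'q': 'python urllib'}))
```

The resulting string can be passed straight to urllib.request.urlopen().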
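Finally, there is urllib.robotparser, which is mainly used to parse robots.txt files. The official documentation has some examples; since this module is rarely used, I won't go into detail here.
Official documentation: https://docs.python.org/3/library/urllib.robotparser.html#module-urllib.robotparser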
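For readers who want a quick taste anyway, here is a small offline sketch. The rules, the crawler name MyCrawler and the example.com URLs are all made up; a real scraper would call set_url() and read() to fetch the site's actual robots.txt instead of passing lines to parse().

```python
from urllib import robotparser

# Hypothetical robots.txt content, supplied as a list of lines.
rules = [
    'User-agent: *',
    'Disallow: /private/',
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

blocked = rp.can_fetch('MyCrawler', 'http://example.com/private/page.html')
allowed = rp.can_fetch('MyCrawler', 'http://example.com/index.html')
print(blocked, allowed)
```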

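That covers the basic usage of urllib, and you can now try writing some scrapers of your own (parse the pages with regular expressions for now; simpler methods will come in later posts).

To study urllib in more depth, go to the official site and read the documentation and source code directly: https://docs.python.org/3/library/urllib.html

Note: many readers copy the code from this post and find that it raises errors when pasted as-is, until the extra blank lines are removed by hand. I don't recommend copy-pasting; a GitHub repository with the code will be put together later for everyone to use.

The next post will cover the Requests package, which I personally find nicer to use than urllib. Stay tuned.

Thanks for reading. If anything here is wrong, corrections are very welcome. 🙏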