Web Crawler Fundamentals and Basic Usage of the urllib Library

Date: 2019-11-16


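Crawler fundamentals

Definition: a crawler is an automated program that requests websites and extracts data from them.

1. It downloads the data or content that its author asks for.

2. It roams the web automatically.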

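Types of crawlers

1. General-purpose crawlers (not restricted to a particular topic)

2. Special-purpose crawlers (focused crawlers), which are the main subject here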

Basic workflow

1. Send the request: an HTTP Request, using the GET or POST method

2. Receive the response: an HTTP Response

3. Parse the content

4. Save the data
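The four steps above can be sketched as a small pipeline. This is only a minimal illustration, not a complete crawler: the function names (fetch, parse_title, save, crawl) and the choice of extracting the page title are assumptions made for the example.

```python
import re
import urllib.request


def fetch(url, timeout=10):
    """Steps 1 and 2: send a GET request and return the decoded response body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode('utf-8')


def parse_title(html):
    """Step 3: pull the page title out of the HTML with a regular expression."""
    match = re.search(r'<title>(.*?)</title>', html, re.S | re.I)
    return match.group(1).strip() if match else None


def save(path, text):
    """Step 4: write the extracted data to a file."""
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)


def crawl(url, path, fetcher=fetch):
    """Run the four steps in order; fetcher can be replaced, e.g. for testing."""
    title = parse_title(fetcher(url))
    save(path, title or '')
    return title
```

A call such as crawl('http://www.baidu.com', 'title.txt') would then fetch the page and store its title, assuming the page is served as UTF-8.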

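Request

Request method: GET or POST

Request URL: a Uniform Resource Locator; any web page, image, or video can be uniquely identified by its URL

Request headers: header information such as User-Agent, Host, and Cookie

Request body: additional data carried with the request (for example, form data in a POST)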
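These four parts map onto the attributes of a urllib.request.Request object, which can be inspected without sending anything over the network. The sketch below uses a made-up User-Agent value and form field purely for illustration.

```python
import urllib.parse
import urllib.request

# Hypothetical form data and header value, chosen only for illustration.
body = urllib.parse.urlencode({'name': 'kobe'}).encode('utf-8')
req = urllib.request.Request(
    'http://httpbin.org/post',
    data=body,
    headers={'User-Agent': 'demo-crawler/0.1'},
)

print(req.get_method())              # request method: POST, because data is set
print(req.full_url)                  # request URL
print(req.get_header('User-agent'))  # request header (Request capitalizes the stored name)
print(req.data)                      # request body, as bytes
```

When no data is passed and no method is given, get_method() falls back to GET.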

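Response

Status code: 200 (success), 301 (redirect), 404 (page not found), 502 (server error)

Response headers: content type, content length, server information, Set-Cookie, and so on

Response body: the most important part, containing the content of the requested resource, such as the HTML of a web page or the binary data of an image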
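The status codes listed above can be looked up in the standard library's http.HTTPStatus enum. The helper status_class below is a hypothetical name introduced for this sketch; it simply groups a code by its first digit.

```python
from http import HTTPStatus


def status_class(code):
    """Group a status code by its first digit (hypothetical helper)."""
    classes = {2: 'success', 3: 'redirect', 4: 'client error', 5: 'server error'}
    return classes.get(code // 100, 'other')


for code in (200, 301, 404, 502):
    # Print the numeric code, its standard reason phrase, and its class.
    print(code, HTTPStatus(code).phrase, status_class(code))
```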

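What kinds of data can be scraped

1. Web page text: HTML documents and JSON-formatted text

2. Images: retrieved as binary data and saved in an image format

3. Videos: also binary data, saved in a video format

4. Other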
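The practical difference between these data types shows up when saving: text is decoded and written as text, while images and videos are written as raw bytes. A minimal sketch, with hypothetical helper names and the 8-byte PNG signature standing in for real image data:

```python
from pathlib import Path


def save_text(path, text):
    """Save HTML or JSON text, encoded as UTF-8."""
    Path(path).write_text(text, encoding='utf-8')


def save_binary(path, data):
    """Save image or video bytes exactly as received, without decoding."""
    Path(path).write_bytes(data)


# Stand-in for downloaded image bytes: the fixed signature that starts every PNG file.
PNG_SIGNATURE = b'\x89PNG\r\n\x1a\n'
```

With a real response, the bytes returned by response.read() would be passed straight to save_binary.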

Parsing methods

1. Direct processing

2. JSON parsing (for JSON-formatted data)

3. Regular expressions

4. BeautifulSoup

5. XPath

6. PyQuery
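Two of these methods need nothing beyond the standard library. The sketch below applies JSON parsing and a regular expression to small hand-written strings; the sample data is invented for illustration, and BeautifulSoup, XPath, and PyQuery are left out because they require third-party packages.

```python
import json
import re

# JSON parsing: turn JSON text into Python objects.
json_text = '{"name": "kobe", "age": 22}'
data = json.loads(json_text)
print(data['name'], data['age'])

# Regular expressions: extract (href, link text) pairs from an HTML fragment.
html = '<a href="/index.html">Home</a><a href="/about.html">About</a>'
links = re.findall(r'<a href="(.*?)">(.*?)</a>', html)
print(links)
```

Regular expressions work for simple, predictable markup like this; for real pages an HTML parser is usually the more robust choice.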

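The urllib library

urllib is an HTTP request library that ships with Python. It contains the following modules:

urllib.request: the request module

urllib.error: the exception-handling module

urllib.parse: the URL-parsing module

urllib.robotparser: the robots.txt-parsing module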
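Of the four modules, urllib.robotparser is the only one not demonstrated elsewhere in this post, so here is a brief sketch. It feeds hand-written robots.txt rules to the parser instead of downloading a real file; the rules and the example.com URLs are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, supplied as a list of lines.
rules = [
    'User-agent: *',
    'Disallow: /private/',
]

parser = RobotFileParser()
parser.parse(rules)

# can_fetch(useragent, url) reports whether the rules allow fetching the URL.
print(parser.can_fetch('*', 'http://example.com/index.html'))
print(parser.can_fetch('*', 'http://example.com/private/data.html'))
```

For a live site, set_url() followed by read() would load the real robots.txt before calling can_fetch().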

#urllib.request

# GET request
import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')
# read() returns bytes; decode() converts them into readable page source using UTF-8
print(response.read().decode('utf-8'))
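Decoding with a hard-coded 'utf-8' works only when the page is actually served in UTF-8. One possible refinement, sketched here with a hypothetical helper named decode_body, is to read the charset from the Content-Type header and fall back to UTF-8 when none is declared. The header object is built by hand so the example runs offline.

```python
from email.message import Message


def decode_body(raw, headers, default='utf-8'):
    """Decode bytes using the charset from Content-Type, else the default."""
    charset = headers.get_content_charset() or default
    return raw.decode(charset, errors='replace')


# Hand-built headers standing in for response.headers.
gbk_headers = Message()
gbk_headers['Content-Type'] = 'text/html; charset=gbk'
print(decode_body('中文'.encode('gbk'), gbk_headers))
```

With a live response the call would be decode_body(response.read(), response.headers), since response.headers offers the same get_content_charset() method.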



# POST request
import urllib.parse
import urllib.request

# Passing data makes urlopen() send a POST; data must be bytes
d = bytes(urllib.parse.urlencode({'name': 'kobe'}), encoding='utf-8')
response = urllib.request.urlopen('http://www.baidu.com', data=d)
print(response.read().decode('utf-8'))
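To see what the POST body actually contains, the encoding step can be run on its own and then reversed with parse_qs, which is roughly what a server does on receipt. The second form field is an invented value for illustration.

```python
import urllib.parse

form = {'name': 'kobe', 'team': 'LA Lakers'}  # 'team' is a made-up field

query = urllib.parse.urlencode(form)  # dict -> form-encoded str
body = query.encode('utf-8')          # str -> bytes, the type urlopen() expects
print(body)

# Decoding the body again recovers the original fields (each value becomes a list).
print(urllib.parse.parse_qs(body.decode('utf-8')))
```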



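# Set a timeout for the request
import socket
import urllib.error
import urllib.request

try:
    # timeout is given in seconds; 0.01 is short enough that the request usually times out
    response = urllib.request.urlopen('http://www.baidu.com', timeout=0.01)
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print('Time Out')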
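A timeout while connecting is wrapped in URLError, but a timeout that occurs later, for instance while reading the body, may surface as a bare socket.timeout. The sketch below handles both cases; is_timeout and fetch_or_none are hypothetical helper names.

```python
import socket
import urllib.error
import urllib.request


def is_timeout(exc):
    """Return True if exc represents a timed-out request."""
    if isinstance(exc, urllib.error.URLError):
        return isinstance(exc.reason, socket.timeout)
    return isinstance(exc, socket.timeout)


def fetch_or_none(url, timeout):
    """Return the response body, or None if the request timed out."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except (urllib.error.URLError, socket.timeout) as exc:
        if is_timeout(exc):
            print('Time Out')
            return None
        raise  # any other failure is passed on to the caller
```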



import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')
print(response.status)               # status code
print(response.getheaders())         # all response headers, as a list of (name, value) tuples
print(response.getheader('Server'))  # value of a single header, here Server



from urllib import parse, request

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': '......',
    'Host': '......'
}
form = {'name': 'kobe'}  # form fields to send

data = bytes(parse.urlencode(form), encoding='utf-8')
req = request.Request(url=url, data=data, headers=headers, method='POST')  # build a Request object
response = request.urlopen(req)
print(response.read().decode('utf-8'))

# urlopen() called with a bare URL string cannot carry request headers, so a Request object is needed.

# Here Request is used to send a POST with both form data and request headers.
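Headers do not have to be passed to the constructor; they can also be attached afterwards with add_header(). A short offline sketch, using a made-up User-Agent value:

```python
import urllib.request

req = urllib.request.Request('http://httpbin.org/post', data=b'name=kobe', method='POST')
req.add_header('User-Agent', 'demo-crawler/0.1')  # hypothetical value

print(req.has_header('User-agent'))  # names are stored capitalized: 'User-agent'
print(req.header_items())            # all headers set so far, as (name, value) pairs
```

This form is convenient when headers are decided one at a time, for example inside a loop or a conditional.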





#urllib.error
from urllib import error, request

try:
    response = request.urlopen('http://www.baidu.com')
except error.HTTPError as e:
    # HTTPError also carries the status code and the response headers
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request successful!')

# HTTPError is a subclass of URLError, so it has to be caught first.
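The subclass relationship can be checked directly, and it explains the ordering of the except clauses: a URLError handler placed first would also swallow every HTTPError. The sketch below builds the exceptions by hand so it runs offline; describe is a hypothetical helper and the URL is only an example.

```python
import io
from email.message import Message
from urllib import error


def describe(exc):
    """Check the more specific class first, mirroring the except order."""
    if isinstance(exc, error.HTTPError):
        return 'HTTP error {}: {}'.format(exc.code, exc.reason)
    if isinstance(exc, error.URLError):
        return 'URL error: {}'.format(exc.reason)
    return 'other error'


print(issubclass(error.HTTPError, error.URLError))

# Hand-built exceptions standing in for real failures.
http_exc = error.HTTPError('http://www.baidu.com/missing', 404, 'Not Found',
                           Message(), io.BytesIO(b''))
url_exc = error.URLError('no route to host')
print(describe(http_exc))
print(describe(url_exc))
```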





#urllib.parse
from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(result)

# Output:

# ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')

# urlparse splits the URL into a six-field named tuple.

# The scheme argument is only a default: it is used when the URL carries no scheme of its own, which is why the result above still shows 'http'.
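The effect of the scheme default is easier to see side by side. In this sketch the same address is parsed with a scheme, without one, and with a leading '//'.

```python
from urllib.parse import urlparse

with_scheme = urlparse('http://www.baidu.com/index.html', scheme='https')
without_scheme = urlparse('www.baidu.com/index.html', scheme='https')
with_slashes = urlparse('//www.baidu.com/index.html', scheme='https')

print(with_scheme.scheme)     # the URL's own scheme wins
print(without_scheme.scheme)  # the default is applied
print(without_scheme.netloc, without_scheme.path)  # host is not recognized without '//'
print(with_slashes.netloc, with_slashes.path)      # '//' marks the start of the host
```

Note that without a leading '//' the host name ends up in path rather than netloc, so a bare address like 'www.baidu.com/index.html' is not split the way one might expect.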

from urllib.parse import urlunparse

# urlunparse takes the six components in order: scheme, netloc, path, params, query, fragment
data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))

# http://www.baidu.com/index.html;user?a=6#comment
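urlparse and urlunparse are inverses, which makes it straightforward to change one component of a URL and reassemble the rest. A small sketch, relying on the fact that the parse result is a named tuple and therefore supports _replace():

```python
from urllib.parse import urlparse, urlunparse

original = 'http://www.baidu.com/index.html;user?id=5#comment'
parts = urlparse(original)

# Round trip: reassembling the unmodified parts gives back the same URL.
print(urlunparse(parts) == original)

# Change the query and drop the fragment, leaving the other components alone.
updated = parts._replace(query='id=6', fragment='')
print(urlunparse(updated))
```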

from urllib.parse import urljoin

print(urljoin('http://www.baidu.com', 'index.html'))
print(urljoin('http://www.baidu.com#comment', '?username="zhangsan"'))
print(urljoin('http://www.baidu.com', 'www.sohu.com'))

# http://www.baidu.com/index.html

# http://www.baidu.com?username="zhangsan"

# http://www.baidu.com/www.sohu.com

# Components present in the second argument are used; those it lacks, such as the scheme and host, are taken from the first. Without a scheme, 'www.sohu.com' is treated as a relative path.
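A few more cases make the rule concrete. In this sketch the base URL has a directory in its path, so the difference between relative, root-relative, and absolute references becomes visible; the paths themselves are invented for illustration.

```python
from urllib.parse import urljoin

base = 'http://www.baidu.com/a/b.html'

relative = urljoin(base, 'c.html')                        # replaces the last path segment
root_relative = urljoin(base, '/c.html')                  # replaces the whole path
other_host = urljoin(base, '//www.sohu.com/x')            # keeps only the scheme
absolute = urljoin(base, 'https://www.sohu.com/faq.html') # overrides everything

for url in (relative, root_relative, other_host, absolute):
    print(url)
```

This is the behavior a crawler relies on when turning the href values found on a page into full URLs.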



from urllib.parse import urlencode

# urlencode turns a dict into a query string
params = {
    'name': 'zhangsan',
    'age': 22
}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)

# http://www.baidu.com?name=zhangsan&age=22
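urlencode also takes care of characters that are not allowed in a query string, which is the main reason to prefer it over manual string concatenation. The sketch below uses invented parameter names to show spaces, non-ASCII text, and repeated keys.

```python
from urllib.parse import urlencode

# Spaces become '+', and non-ASCII text is percent-encoded as UTF-8 bytes.
encoded = urlencode({'q': 'web crawler', 'lang': '中文'})
print(encoded)

# doseq=True expands a list value into one key=value pair per item.
repeated = urlencode({'tag': ['a', 'b']}, doseq=True)
print(repeated)
```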