Implementing a Simple Web Crawler in Python 3

Date: 2019-11-25


This post follows 虫师's article on implementing a simple crawler in Python 2, and adds some of my own observations.

# coding=utf-8
import re
import urllib.request

def getHtml(url):
    # Fetch the page and decode the raw bytes into a str
    page = urllib.request.urlopen(url)
    html = page.read()
    html = html.decode('UTF-8')
    return html

def getImg(html):
    # Capture the src of every image tagged class="BDE_Image" that ends in .jpg
    reg = r'img class="BDE_Image" src="(.+?\.jpg)"'
    imgre = re.compile(reg)
    imglist = re.findall(imgre, html)
    num = 0
    for imgurl in imglist:
        # Raw string so the backslashes in the Windows path stay literal
        urllib.request.urlretrieve(imgurl, r'D:\img\hardaway%s.jpg' % num)
        num += 1
    # Return the URLs so the print() below shows something other than None
    return imglist

html = getHtml("http://tieba.baidu.com/p/1569069059")
print(getImg(html))

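re: a module that ships with Python, used for regular-expression operations
https://docs.python.org/3/library/re.html
urllib.request: part of the urllib package in the standard library (no installation needed), used for opening URLs
https://docs.python.org/3/library/urllib.request.html
First, a getHtml() function is defined:
urllib.request.urlopen() opens the URL
read() reads the data at that URL, returning bytes
decode() decodes those bytes into a string using the specified encoding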

I specified UTF-8 here because that is the charset declared in the page source.
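
To make the difference between bytes and str concrete, here is a minimal sketch that needs no network access. The variable names and the sample text are made up for illustration; `raw` simply stands in for whatever read() would return.

```python
# Stand-in for the bytes that page.read() would return (made-up sample).
raw = '<title>贴吧</title>'.encode('utf-8')

print(type(raw))            # <class 'bytes'>
text = raw.decode('utf-8')  # bytes -> str, using the page's declared charset
print(type(text))           # <class 'str'>

# With the wrong charset, decode() raises UnicodeDecodeError by default;
# errors='replace' substitutes U+FFFD for undecodable bytes instead.
wrong = raw.decode('ascii', errors='replace')
```

This is why the charset passed to decode() has to match the one the page was actually written in.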

Next, a getImg() function is defined to pick out the image URLs we need from the page data.
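
The filtering step can be tried offline first. The sketch below runs the same pattern against a made-up HTML fragment; `sample_html`, `find_image_urls` and the example.com URLs are hypothetical, chosen only to imitate the structure the pattern expects.

```python
import re

# Made-up HTML fragment imitating the structure the pattern expects.
sample_html = (
    '<img class="BDE_Image" src="http://example.com/a.jpg" width="560">'
    '<img class="other" src="http://example.com/skip.jpg">'
    '<img class="BDE_Image" src="http://example.com/b.jpg">'
)

def find_image_urls(html):
    """Return the src of every img with class="BDE_Image" ending in .jpg."""
    pattern = re.compile(r'img class="BDE_Image" src="(.+?\.jpg)"')
    # With one capture group, findall() returns just the captured text.
    return pattern.findall(html)

print(find_image_urls(sample_html))
# ['http://example.com/a.jpg', 'http://example.com/b.jpg']
```

The image with class="other" is skipped, because the pattern requires the class attribute to match exactly.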

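The encoding in the example above was found by reading the page source by hand. Later I tried using a regular expression to match the charset declared in the page, and then decoding with whatever encoding was matched.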

def getHtml(url):
    page = urllib.request.urlopen(url)
    html = page.read()  # bytes
    # str() on bytes gives its repr ("b'...'"), which is enough to search
    # for the ASCII charset declaration with a string pattern
    rehtml = str(html)
    reg = r'content="text/html; charset=(.+?)"'
    matches = re.findall(reg, rehtml)
    # findall() returns a list; take the first match, or fall back to UTF-8
    code = matches[0] if matches else 'utf-8'
    html = html.decode(code)
    return html

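Here is the reasoning. page.read() returns a bytes object, and re.findall() cannot apply a string pattern to a bytes object (it raises TypeError).
So I defined a new variable, rehtml, and used str() to turn the value of html into a string that re.findall() can search. (str() on bytes gives its repr, b'...', which still contains the ASCII charset declaration.)
I also defined a new variable, code, to hold the encoding name, because re.findall() returns a list and what I need is a string.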
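
As an alternative to converting with str(), a bytes pattern can search the raw bytes directly. The following is only a sketch under my own assumptions: the helper name `detect_charset`, its `default` parameter, the looser pattern and the sample bytes are all made up for illustration, not taken from the original code.

```python
import re

def detect_charset(raw, default='utf-8'):
    """Look for charset=... in raw bytes; return it as a str, or default."""
    # A bytes pattern (rb'...') can search bytes directly, so no str() is needed.
    match = re.search(rb'charset=["\']?([\w-]+)', raw)
    if match is None:
        return default
    return match.group(1).decode('ascii')

# Made-up sample standing in for the bytes returned by page.read().
sample = b'<meta http-equiv="Content-Type" content="text/html; charset=gbk">'
charset = detect_charset(sample)
text = sample.decode(charset)
```

The server may also declare the charset in the response headers. With urllib.request, page.headers.get_content_charset() returns that value, or None when the header does not state one.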
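In getImg(), the regular expression is written for the images we want: reg = r'img class="BDE_Image" src="(.+?\.jpg)"'
re.compile() compiles the regular expression into a pattern object, which is more efficient when the pattern is used several times in a program.
re.findall() returns all non-overlapping matches of the pattern in the page data, as a list of strings.
urllib.request.urlretrieve() downloads each image to the local disk, here into the img folder on drive D.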
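
One practical detail: urlretrieve() does not create the target folder, so the download fails if D:\img does not exist yet. Below is a minimal sketch of a slightly more defensive download loop. The function name `download_all` and its parameters are hypothetical, and the folder is passed in rather than hard-coded.

```python
import os
import urllib.request

def download_all(urls, folder, prefix='hardaway'):
    """Save each URL as <folder>/<prefix><n>.jpg and return the saved paths."""
    os.makedirs(folder, exist_ok=True)  # create the folder if it is missing
    saved = []
    for num, url in enumerate(urls):
        # os.path.join avoids hand-written backslashes in the path
        path = os.path.join(folder, '%s%d.jpg' % (prefix, num))
        urllib.request.urlretrieve(url, path)
        saved.append(path)
    return saved
```

With the functions above it could be called as download_all(getImg(html), r'D:\img'), provided getImg() returns the list of image URLs.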