Crawler Review

Crawler types: general-purpose crawlers, focused crawlers, and incremental crawlers.

A note on capturing packets with Fiddler: because Fiddler installs its own certificate, SSL will demand a valid certificate when the project requests an HTTPS page, and the request may be rejected.

You can turn off SSL verification when sending the requests call, or temporarily disable the Fiddler proxy. This pitfall comes up again at the end…
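A minimal sketch of both workarounds, assuming a placeholder URL and User-Agent (not from the original post):

import requests
import urllib3

# Silence the InsecureRequestWarning urllib3 emits once verification is off;
# only reasonable while debugging behind Fiddler's self-signed certificate.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

url = "https://example.com/"                        # placeholder URL
headers = {"User-Agent": "Mozilla/5.0"}

# Workaround 1: keep Fiddler running but skip the SSL certificate check.
page_text = requests.get(url, headers=headers, verify=False).text

# Workaround 2: keep verification on and instead stop requests from picking up
# the system (Fiddler) proxy for this session.
session = requests.Session()
session.trust_env = False                           # ignore system proxy settings
page_text = session.get(url, headers=headers).text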

Parsing data out of HTML tags with BeautifulSoup:

import requests
from bs4 import BeautifulSoup

url = 'https://www.yangguiweihuo.com/16/16089/'
ua = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36"}
page_text = requests.get(url=url, headers=ua).text          # fetch the full chapter-list HTML
soup = BeautifulSoup(page_text, 'lxml')
a_list = soup.select('.listmain > dl > dd > a')             # parse out every chapter link

with open("秦吏.txt", 'w', encoding='utf-8') as f:
    for a in a_list:
        title = a.string                                    # text of the <a> tag, used as the chapter title
        detail_url = 'https://www.yangguiweihuo.com' + a['href']    # build the chapter detail URL
        detail_page = requests.get(url=detail_url, headers=ua).text
        dsp = BeautifulSoup(detail_page, 'lxml')            # chapter detail page
        content = dsp.find('div', id='content').text        # chapter body text
        f.write(title + '\n' + content)                     # persist the data
        print(title + ":下载完成")
print('The end')

Notes on using XPath

div[@class="song"]          selects div elements whose class attribute is "song"

div[@class="song"]/li/a/@href   extracts the URL (href) inside them

div[@class="song"]/li/a/text()   extracts the text inside them

div[contains(@class,'ng')]    finds div elements whose class attribute contains "ng"

div[starts-with(@class,'ta')]    finds div elements whose class attribute starts with "ta"
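A quick sketch of these selectors with lxml, run against a made-up HTML fragment (the fragment and values are assumptions, not from the original post):

from lxml import etree

# Made-up fragment just to exercise the selectors listed above.
html = """
<div class="song"><li><a href="http://example.com/a">song A</a></li></div>
<div class="tang"><li><a href="http://example.com/b">tang B</a></li></div>
"""
tree = etree.HTML(html)

print(tree.xpath('//div[@class="song"]/li/a/@href'))              # ['http://example.com/a']
print(tree.xpath('//div[@class="song"]/li/a/text()'))             # ['song A']
print(tree.xpath('//div[contains(@class, "ng")]/li/a/text()'))    # "song" and "tang" both contain "ng"
print(tree.xpath('//div[starts-with(@class, "ta")]/li/a/text()')) # only "tang" starts with "ta"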

A small XPath example:

import requests
from lxml import etree

url = "https://gz.58.com/ershoufang/?PGTID=0d100000-0000-335c-5dda-1cebcdf9ae5f&ClickID=2"
user_agent = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36"}
page_text = requests.get(url, headers=user_agent).text
tree = etree.HTML(page_text)                                 # parsed page data
li_list = tree.xpath('//ul[@class="house-list-wrap"]/li')    # records come back as a list

fp = open("58.csv", 'w', encoding='utf-8')
for li in li_list:
    title = li.xpath("./div[@class='list-info']/h2/a/text()")[0]
    price = li.xpath("./div[@class='price']//text()")
    sum_price = ''.join(price)
    fp.write("home:" + title + "price:" + sum_price + '\n')
fp.close()
print("数据获取完成!")

Fixing garbled (mis-encoded) text scraped from a site:

import requests, os
from lxml import etree

url = 'http://pic.netbian.com/4kmeinv/'
user_agent = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36"}
page_text = requests.get(url, headers=user_agent).text
tree = etree.HTML(page_text)
li_list = tree.xpath('//div[@class="slist"]/ul/li')

def getpic(title, photo):
    if not os.path.exists('./photo'):               # create the folder if it does not exist yet
        os.mkdir('./photo')
    fp = open('photo/' + title, 'wb')
    fp.write(photo)
    fp.close()
    return "当前资源下载完成"

for li in li_list:
    title = li.xpath('./a/b/text()')[0] + ".jpg"
    title = title.encode('iso-8859-1').decode('gbk')    # garbled text: re-encode to raw bytes, then decode with the real charset
    print(title)
    p_url = li.xpath('./a/img/@src')[0]
    picture_url = 'http://pic.netbian.com' + p_url
    photo = requests.get(url=picture_url, headers=user_agent).content
    ret = getpic(title, photo)
    print(ret)

Batch-downloading resume templates, with the same character-encoding fix:

import requests, random, os, time
from lxml import etree

url = 'http://sc.chinaz.com/jianli/free.html'
user_agent = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36"}
response = requests.get(url, headers=user_agent)    # resume list page
response.encoding = 'utf-8'                         # fix the garbled encoding
page_text = response.text
tree = etree.HTML(page_text)
div_list = tree.xpath('//div[@id="container"]/div')

if not os.path.exists('./jl'):
    os.mkdir('./jl')

for div in div_list:
    title = div.xpath('./a/img/@alt')[0]            # resume name
    link = div.xpath('./a/@href')[0]                # resume detail URL
    fp = open('./jl/' + title + '.zip', 'wb')
    detail_page = requests.get(url=link, headers=user_agent).text   # resume detail page
    dpage = etree.HTML(detail_page)
    down_list = dpage.xpath('//div[@class="clearfix mt20 downlist"]/ul/li/a/@href')
    down_url = random.choice(down_list)             # pick one download mirror at random
    word = requests.get(url=down_url, headers=user_agent).content
    print("准备开始下载>>" + title)
    fp.write(word)
    fp.close()                                      # close each archive after writing it
    time.sleep(1)

Testing a proxy IP: if the page data gets written, the proxy IP is usable.

import requests

url = 'https://www.baidu.com/s?wd=ip'
user_agent = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.109 Safari/537.36"}
proxy = {"https": '112.85.170.79:9999'}
page = requests.get(url, headers=user_agent, proxies=proxy).text
with open('./ip.html', 'w', encoding='utf-8') as f:
    f.write(page)

And here is the pitfall mentioned earlier: when the User-Agent header is specified, the server redirects to the HTTPS address, which raises an SSL verification failure. To keep the redirect from breaking certificate validation, disable verification directly:

page_text = requests.get(url=url, headers=user_agent, verify=False).text
