移动端数据爬取和Scrapy框架

时间 2019-12-07

标签移动数据 scrapy 框架栏目无线繁體版

原文原文链接

移动端数据爬取

注:抓包工具:青花瓷python

1.配置fiddler
2.移动端安装fiddler证书
3.配置手机的网络
    - 给手机设置一个代理IP:port

a. Fiddler设置

打开Fiddler软件，打开工具的设置。(Fiddler软件菜单栏：Tools->Options)mysql

在HTTPS中设置以下：linux

在Connections中设置以下，这里使用默认8888端口，固然也能够本身更改，可是注意不要与已经使用的端口冲突：redis

Allow remote computers to connect:容许别的机器把请求发送到fiddler上来sql

b. 安全证书下载

在电脑浏览器中输入地址：http://localhost:8888/，点击FiddlerRoot certificate，下载安全证书：json

也能够windows

c. 安全证书安装(证书必定要安装且信任)

证书是须要在手机上进行安装的，这样在电脑Fiddler软件抓包的时候，手机使用电脑的网卡上网才不会报错。api

Android手机安装：把证书放入手机的内置或外置存储卡上，而后经过手机的"系统安全-》从存储设备安装"菜单安装证书。浏览器

而后找到拷贝的FiddlerRoot.cer进行安装便可。安装好以后，能够在信任的凭证中找到咱们已经安装好的安全证书。安全

苹果手机安装：

- 保证手机网络和fiddler所在机器网络是同一个网段下的

- 在safari中访问http://fiddle机器ip：fiddler端口，进行证书下载。而后进行安装证书操做。

- 在手机中的设置-》通用-》关于本机-》证书信任设置-》开启fiddler证书信任

d. 局域网设置

想要使用Fiddler进行手机抓包，首先要确保手机和电脑的网络在一个内网中，可使用让电脑和手机都链接同一个路由器。固然，也可让电脑开放WIFI热点，手机连入。这里，我使用的方法是，让手机和电脑同时连入一个路由器中。最后，让手机使用电脑的代理IP进行上网。在手机上，点击链接的WIFI进行网络修改，添加代理。进行手动设置，ip和端口号都是fiddler机器的ip和fiddler上设置的端口号。

e. Fiddler手机抓包测试

上述步骤都设置完成以后，用手机浏览器打开百度首页，咱们就能够顺利抓包了

scrapy框架

(1)环境安装

linux:
    pip install scrapy
windows:
    有whell才能下载下载twisted框架: pips install whell 
    下载twisted框架(处理并发相关操做) : http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
    进入下载目录,执行: pip3 install Twisted-17.1.0-cp35-cp35m-win_amd64.whl
    pip3 install pywin32
    pip3 install scrapy

(2)新建一个项目

建立一个工程

scrapy startproject 工程名称

此时生成的工程目录

建立一个爬虫文件

注:要保证配置文件建立在当前工程目录下

cd firstblood #在工程目录下建立爬虫文件 会在apiders中出现
scrapy genspider first www.xxx.com #first表明爬虫文件的名称,www.xxx.com表明起始url

在first.py爬虫文件中

执行爬虫文件

scrapy crawl first
scrapy crawl first --nolog (不打印日志)

robots反爬机制的处理

被robots反爬机制拦截的反应

处理方式(在配置文件中修改此条为False)

User-Agent假装反爬机制的处理

处理方式(在配置文件中修改此条)

数据解析

xpath表达式的使用以及获取Selector对象中的date中的值的两种方式

class FirstSpider(scrapy.Spider):
    name = 'first'
    # allowed_domains = ['www.xxx.com'] #在该文件中,只能爬取这个域名之下的url,一般注释掉这一句
    start_urls = ['https://www.qiushibaike.com/text/']
    def parse(self, response):
        div_list=response.xpath('//div[@id="content-left"]/div')
        for div in div_list:
            #获得的div是Selector类型的
            title=div.xpath('./div/a[2]/h2/text()')[0].extract() #将Selector对象中的date中的值
            title = div.xpath('./div/a[2]/h2/text()').extract_first() #直接获得列表中第0个元素
            print(title)

持久化存储

1.基于终端指令的持久化存储(只能将path的返回值存到本地的文本中)

能够经过将终端指令的形式将parse方法的返回值中存储的数据进行本地磁盘的持久化存储

class FirstSpider(scrapy.Spider):
    name = 'first'
    start_urls = ['https://www.qiushibaike.com/text/']
    def parse(self, response)
        div_list=response.xpath('//div[@id="content-left"]/div')
        all_data=[]
        dic={}
        for div in div_list:
            #获得的div是Selector类型的
            content=div.xpath('./a/div/span/text()').extract_first()
            title=div.xpath('./div/a[2]/h2/text()').extract_first()
            dic={
                "title":title,
                "content":content
            }
            all_data.append(dic)
        print(all_data)
        return all_data

终端指令

scrapy crawl first -o qiubai.csv

注:限制文件存储的格式,只能是json,jsonlines,jl,csv,xml,marshal,pickle格式的文件

2.基于管道的持久化存储

新建工程

scrapy startproject bosspro 
cd bosspro

建立爬虫文件

scrapy genspider  boss www.baidu.com

修改两种反爬机制

爬取Boss直聘的数据

class BossSpider(scrapy.Spider):
    name = 'boss'
    #allowed_domains = ['www.baidu.com']
    start_urls = ['https://www.zhipin.com/job_detail/?query=python爬虫&scity=101010100&industry=&position=']

    def parse(self, response):
        li_list=response.xpath('//*[@id="main"]/div/div[3]/ul/li')
        for li in li_list:
            name=li.xpath('./div/div/h3/a/div/text()').extract_first()
            salary=li.xpath('./div/div/h3/a/span/text()').extract_first()

管道的使用

在items.py中

import scrapy
class BossproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field() # 1.将爬虫文件中每个要提交给管道的数据封装成这个类下的一个属性
    salary = scrapy.Field(

在boss.py爬虫文件中

import scrapy
from bosspro.items import BossproItem #2.导入item中的类

class BossSpider(scrapy.Spider):
    name = 'boss'
    #allowed_domains = ['www.baidu.com']
    start_urls = ['https://www.zhipin.com/job_detail/?query=python爬虫&scity=101010100&industry=&position=']

    def parse(self, response):
        li_list=response.xpath('//*[@id="main"]/div/div[3]/ul/li')
        for li in li_list:
            name=li.xpath('./div/div/h3/a/div/text()').extract_first()
            salary=li.xpath('./div/div/h3/a/span/text()').extract_first()
            item=BossproItem() #3.实例化一盒item对象
            item['name']=name #4.将解析到的数据存储到item对象中
            item['salary']=salary
            yield item #5.将item对象提交给管道

在pipelines.py管道文件中

#此文件须要接受爬虫文件提交过来的数据,并对数据进行持久化存储(IO)
class BossproPipeline(object):
    #爬虫文件每提交一次,该方法执行一次
    def process_item(self, item, spider):
        print(item['name'])
        print(item['salary']) #6.在process_item函数中进行数据的持久化存储
        return item

在settings.py配置文件中

#7.在配置文件中开启管道
ITEM_PIPELINES = {
   'bosspro.pipelines.BossproPipeline': 300, #300表示的是优先级,数值越小优先级越高
}
#pipelines中声明的管道类都要在此处开启

数据的持久化存储(本地,mysql,redis)

#将数据保存到本地
class BossproPipeline(object):
    f=None
    # open_sipder只会在开始爬虫时执行一次
    def open_spider(self,spider): #此处的spider是BossSpider类的一个实例对象
        print("开始爬虫!")
        self.f= open('./job.txt','w',encoding='utf-8')
    def process_item(self, item, spider):
        self.f.write(item['name']+':'+item['salary']+"\n")
        return item #若是有其余的管道,process_item函数必定要有返回值
    #close_sipder只会在结束爬虫时执行一次
    def close_spider(self,spider):
        print("结束爬虫!")
        self.f.close()

#将数据保存到mysql
import pymysql
class Bosspro_mysql_Pipeline(object):
    conn=None #链接对象
    cursor=None #游标对象
    def open_spider(self,spider):
        self.conn=pymysql.Connect(host='127.0.0.1',port=3306,user='root',password='',db='pachong')
        print("开始爬虫")
        print(self.conn)
    def process_item(self, item, spider):
        self.cursor=self.conn.cursor()
        sql='insert into boss values("%s","%s")' % (item["name"],item["salary"])
        try:
            self.cursor.execute(sql)
            self.conn.commit()
            print(sql)
        except Exception as e:
            print(e)
            self.conn.rollback() #事务回滚
        return item
    def close_spider(self,spider):
        self.cursor.close()
        self.conn.close()
        print('结束爬虫')

#将数据保存到redis
from redis import Redis
class Bosspro_redis_Pipeline(object):
    conn=None 
    def open_spider(self,spider):
        self.conn=Redis(host='127.0.0.1',port=6380)
        print("开始爬虫了")
        print(self.conn)
    def process_item(self, item, spider):
        dic={
            "name":item["name"],
            "title":item["salary"]
        }
        self.conn.lpush('Info',dic)
    def close_spider(self,spider):
        print('结束爬虫了')

注:开启线程

在配置文件中修改CONCURRENT_REQUESTS = 32便可