Concept: an application framework written for crawling website data; it integrates the relevant functionality and provides a highly reusable project template.
Features: the Scrapy framework provides high-performance asynchronous downloading, parsing, persistent storage, and more.
Scrapy's main components:

- Scrapy Engine: handles the data flow of the whole system and triggers events (the core of the framework).
- Scheduler: accepts requests sent over by the engine, pushes them into a queue, and hands them back when the engine asks again. You can think of it as a priority queue of URLs (the addresses, or links, of the pages to crawl): it decides which URL to crawl next and also removes duplicate URLs.
- Downloader: downloads the page content and returns it to the spiders (the Scrapy downloader is built on the efficient asynchronous Twisted model).
- Spiders: do the main work, extracting the information you need, i.e. the so-called items, from specific pages. Links can also be extracted from them to let Scrapy go on and crawl the next page.
- Item Pipeline: processes the items the spiders extract from pages. Its main jobs are persisting items, validating them, and cleaning out unwanted data. After the spider has parsed a page, the items are sent to the pipeline and processed in several fixed stages.
Installation: install Scrapy with pip (`pip install scrapy`).
After a successful installation, typing `scrapy` in a terminal/command prompt checks whether it installed correctly.
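The installation can also be checked from inside Python; a minimal check:

```python
# Import Scrapy and print the installed version
import scrapy

print(scrapy.__version__)
```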
Create a new project:

scrapy startproject project_name
Directory structure:
- scrapy.cfg: configuration file
- items.py: defines the data-storage template, used for structured data
- pipelines.py: persists the data
- settings.py: configuration file, e.g. recursion depth, concurrency, download delay, etc.
- spiders: the spider directory, e.g. create files here and write the spiders' parsing rules
Contents of the first spider file we create (such a file is typically generated with the `scrapy genspider` command):
```python
# -*- coding: utf-8 -*-
import scrapy


class FirstSpider(scrapy.Spider):
    # Name of the spider: used to locate this specific spider file
    name = 'first'
    # Allowed domains: only pages under these domains may be crawled
    allowed_domains = ['https://www.qiushibaike.com']
    # Start URLs: the URLs of the pages this project is going to crawl
    start_urls = ['https://www.qiushibaike.com/']

    # Parse method: parses the specified content out of the fetched page data
    # response: the response object returned when a request made from the start_urls list succeeds
    # The return value of parse must be an iterable or None
    def parse(self, response):
        pass
```
Write the crawling logic in the generated spider file to carry out the crawl. Before running it, adjust the following options in settings.py:
```python
# Identify the request sender as a browser (replace the default user agent)
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'

# Obey robots.txt rules
# Set to False so the crawler does not strictly follow the site's robots protocol
ROBOTSTXT_OBEY = False
```
When parsing the crawled content, it is recommended to use XPath to pick out the specified parts of the page:
```python
# -*- coding: utf-8 -*-
import scrapy


class FirstSpider(scrapy.Spider):
    name = 'first'
    # allowed_domains = ['www.qiushibaike.com/text']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        div_list = response.xpath('//div[@id="content-left"]/div')
        f = open('段子.txt', 'w', encoding='utf-8')
        count = 0
        for div in div_list:
            # Content matched by xpath is stored in Selector objects
            # extract() pulls the stored data values out of the Selector objects
            # extract_first() takes the first value, equivalent to extract()[0]
            author = div.xpath('./div/a[2]/h2/text()').extract_first().strip()
            # author = div.xpath('./div/a[2]/h2/text()').extract()[0]
            content = div.xpath('.//div[@class="content"]/span/text()').extract_first().strip()
            count += 1
            f.write(author + ':\n' + content + '\n---------------分割线--------------\n\n\n')
        f.close()
        print('共抓取到:', count)
```
Alternatively, for terminal-command-based storage, have parse return a list of dicts (the return value must be an iterable), which can then be exported with a terminal command:

```python
import scrapy


class FirstSpider(scrapy.Spider):
    name = 'first'
    # allowed_domains = ['www.qiushibaike.com/text']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        div_list = response.xpath('//div[@id="content-left"]/div')
        data_list = []
        for div in div_list:
            # Content matched by xpath is stored in Selector objects
            # extract() pulls the stored data values out of the Selector objects
            # extract_first() takes the first value, equivalent to extract()[0]
            author = div.xpath('./div/a[2]/h2/text()').extract_first().strip()
            # author = div.xpath('./div/a[2]/h2/text()').extract()[0]
            content = div.xpath('.//div[@class="content"]/span/text()').extract_first().strip()
            dic = {
                "author": author,
                "content": content
            }
            data_list.append(dic)
        return data_list
```
Command:
scrapy crawl first -o qiubai.csv --nolog
Pipeline-based storage

Workflow for implementing pipeline-based data storage:
In items.py:
```python
import scrapy


class QiushiItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    author = scrapy.Field()
    content = scrapy.Field()
```
In pipelines.py:
```python
class QiushiPipeline(object):
    file = None

    def open_spider(self, spider):
        # Called only once during the whole crawl, so the file can be opened here
        self.file = open('qiubai.txt', 'w', encoding='utf-8')
        print('开始爬虫')

    def process_item(self, item, spider):
        # Receives the item objects submitted by the spider file and persists the page data they hold
        # The item parameter is the received item object
        # This method runs once every time the spider submits an item to the pipeline
        author = item['author']
        content = item['content']
        self.file.write(author + ':\n' + content + '\n-------------------\n\n\n')
        return item

    def close_spider(self, spider):
        # Called only once, when the crawl ends
        print('爬虫结束')
        self.file.close()
```
In the spider file:
```python
import scrapy

from qiushi.items import QiushiItem


class FirstSpider(scrapy.Spider):
    name = 'first'
    # allowed_domains = ['www.qiushibaike.com/text']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        div_list = response.xpath('//div[@id="content-left"]/div')
        for div in div_list:
            author = div.xpath('./div/a[2]/h2/text()').extract_first().strip()
            # author = div.xpath('./div/a[2]/h2/text()').extract()[0]
            content = div.xpath('.//div[@class="content"]/span/text()').extract_first().strip()
            # 1. Store the parsed page data in an item object
            item = QiushiItem()
            item['author'] = author
            item['content'] = content
            # 2. Submit the item object to the pipeline
            yield item
```
In the settings file, uncomment the ITEM_PIPELINES block (around line 67):
```python
ITEM_PIPELINES = {
    'qiushi.pipelines.QiushiPipeline': 300,
}
```
Persisting to a MySQL database is not much different from the pipeline-based storage above; you just need to write the pymysql connection and database operations in pipelines.py:
```python
import pymysql


class QiubaiPipeline(object):
    conn = None
    cursor = None

    def open_spider(self, spider):
        print('爬虫开始')
        self.conn = pymysql.connect(host='127.0.0.1', port=3306, user='root',
                                    password='123456', db='qiubai')

    def process_item(self, item, spider):
        sql = 'insert into qiubai(author,content) values ("%s","%s")' % (item['author'], item['content'])
        self.cursor = self.conn.cursor()
        try:
            self.cursor.execute(sql)
            self.conn.commit()
        except Exception as e:
            print(e)
            self.conn.rollback()
        return item

    def close_spider(self, spider):
        print('爬虫结束')
        self.cursor.close()
        self.conn.close()
```
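The pipeline above assumes that a `qiubai` database with a `qiubai` table (columns `author` and `content`) already exists. A minimal one-off setup sketch with pymysql, reusing the same connection parameters (the column types are assumptions):

```python
import pymysql

# Create the table the pipeline writes to; credentials mirror the pipeline above
conn = pymysql.connect(host='127.0.0.1', port=3306, user='root',
                       password='123456', db='qiubai')
try:
    with conn.cursor() as cursor:
        cursor.execute(
            'CREATE TABLE IF NOT EXISTS qiubai ('
            'id INT PRIMARY KEY AUTO_INCREMENT, '
            'author VARCHAR(100), '
            'content TEXT)'
        )
    conn.commit()
finally:
    conn.close()
```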
Install the Redis database:
cd redis-5.0.3
make
./redis-server ../redis.conf
Basic Redis usage:
```
127.0.0.1:6379> set name 'hahha'
OK
127.0.0.1:6379> get name
"hahha"
```
Storing to Redis works the same way; in pipelines.py, use the redis client to write each item:

```python
import redis


class QiubaiPipeline(object):
    conn = None

    def open_spider(self, spider):
        print('爬虫开始')
        self.conn = redis.Redis(host='127.0.0.1', port=6379)

    def process_item(self, item, spider):
        dic = {
            'author': item['author'],
            'content': item['content']
        }
        self.conn.lpush('data', dic)
        return item

    def close_spider(self, spider):
        print('爬虫结束')
```
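Note that whether lpush accepts a plain dict depends on the redis-py version: recent versions only accept bytes, strings and numbers. A safer sketch (the class name `QiubaiRedisJsonPipeline` is just illustrative) serializes the item to JSON first:

```python
import json

import redis


class QiubaiRedisJsonPipeline(object):
    """Variant of the Redis pipeline that stores each item as a JSON string."""
    conn = None

    def open_spider(self, spider):
        self.conn = redis.Redis(host='127.0.0.1', port=6379)

    def process_item(self, item, spider):
        dic = {
            'author': item['author'],
            'content': item['content'],
        }
        # json.dumps returns a string, which every redis-py version accepts
        self.conn.lpush('data', json.dumps(dic, ensure_ascii=False))
        return item
```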
Requirement: store the crawled data to disk, MySQL and Redis at the same time.
pipelines.py
```python
import redis


class QiubaiPipeline(object):
    conn = None

    def open_spider(self, spider):
        print('爬虫开始')
        self.conn = redis.Redis(host='127.0.0.1', port=6379)

    def process_item(self, item, spider):
        dic = {
            'author': item['author'],
            'content': item['content']
        }
        # self.conn.lpush('data', dic)
        print('数据写入到redis数据库中')
        return item

    def close_spider(self, spider):
        print('爬虫结束')


class QiubaiFiles(object):
    def process_item(self, item, spider):
        print('数据写入到磁盘文件中')
        return item


class QiubaiMySQL(object):
    def process_item(self, item, spider):
        print('数据写入到mysql数据库中')
        return item
```
settings.py
Register all three pipeline classes; the number is the priority, and pipelines with lower values run first:

```python
ITEM_PIPELINES = {
    'qiubai.pipelines.QiubaiPipeline': 300,
    'qiubai.pipelines.QiubaiFiles': 400,
    'qiubai.pipelines.QiubaiMySQL': 500
}
```
By sending requests manually, data can be crawled from multiple URLs:
```python
import scrapy

from qiushiPage.items import QiushipageItem


class QiushiSpider(scrapy.Spider):
    name = 'qiushi'
    # allowed_domains = ['https://www.qiushibaike.com/text/']
    start_urls = ['https://www.qiushibaike.com/text/']
    pageNum = 1
    url = 'https://www.qiushibaike.com/text/page/%d/'

    def parse(self, response):
        div_list = response.xpath('//div[@id="content-left"]/div')
        for div in div_list:
            author = div.xpath('./div/a[2]/h2/text()').extract_first().strip()
            content = div.xpath('.//div[@class="content"]/span/text()').extract_first()
            item = QiushipageItem()
            item['author'] = author
            item['content'] = content
            yield item

        # Crawl the remaining pages by sending requests manually
        # Check whether the page number is still within the first 13 pages
        if self.pageNum <= 13:
            self.pageNum += 1
            new_url = format(self.url % self.pageNum)
            # callback is the callback function; the second page is parsed the same way as the
            # first, so parse can be reused, or a separate parsing function can be defined
            yield scrapy.Request(url=new_url, callback=self.parse)
```
To send a POST request with Scrapy, you must override the parent class's start_requests method:
```python
import scrapy


class PostRequestSpider(scrapy.Spider):
    name = 'post_request'
    # allowed_domains = ['www.baidu.com']
    start_urls = ['https://fanyi.baidu.com/sug']

    # Sending a POST request requires overriding start_requests of the parent class
    def start_requests(self):
        data = {
            'kw': 'dog'
        }
        for url in self.start_urls:
            # Option 1
            # yield scrapy.Request(url=url, callback=self.parse, method='post')
            # Option 2
            yield scrapy.FormRequest(url=url, callback=self.parse, formdata=data)

    def parse(self, response):
        print(response.text)
```
There is no need to extract or store cookies by hand: scrapy.Request stores cookies automatically, and the stored cookies are sent along with subsequent requests.
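For example, a log-in-then-crawl flow can rely on this behaviour: the cookie set by the login response is stored automatically and sent with the follow-up request. A minimal sketch (the URLs and form fields below are placeholders, not a real site):

```python
import scrapy


class LoginSpider(scrapy.Spider):
    """Sketch: log in first, then crawl a page that requires the session cookie."""
    name = 'login_demo'
    start_urls = ['https://www.example.com/login']

    def start_requests(self):
        # Placeholder form data; replace with the site's real login fields
        data = {'username': 'xxx', 'password': 'xxx'}
        for url in self.start_urls:
            yield scrapy.FormRequest(url=url, formdata=data, callback=self.after_login)

    def after_login(self, response):
        # The cookie from the login response is reused automatically here;
        # no manual cookie extraction or storage is needed
        yield scrapy.Request(url='https://www.example.com/profile', callback=self.parse_profile)

    def parse_profile(self, response):
        print(response.text)
```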
Scrapy changes the request IP through a downloader middleware. In middlewares.py you can define a class that implements a process_request method with three parameters (self, request, spider).
The IP is changed by setting the request.meta['proxy'] attribute; after that, enable the downloader middleware in settings.py.
middlewares.py
```python
class MyPro(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = 'http://61.166.153.167:8080'
```
settings.py
```python
DOWNLOADER_MIDDLEWARES = {
    'postDemo.middlewares.MyPro': 543,
}
```
With this in place, the request IP is changed automatically whenever requests are made.
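A common extension is to rotate over a small pool of proxies instead of hard-coding a single one; a sketch (the proxy addresses below are placeholders):

```python
import random


class RandomProxyMiddleware(object):
    """Downloader middleware that picks a proxy at random for each request."""

    # Placeholder proxy list; replace with working proxy servers
    PROXIES = [
        'http://61.166.153.167:8080',
        'http://111.29.3.220:8080',
    ]

    def process_request(self, request, spider):
        request.meta['proxy'] = random.choice(self.PROXIES)
```

Register it in DOWNLOADER_MIDDLEWARES just like the middleware above.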
Log levels (kinds of log output): ERROR, WARNING, INFO, DEBUG.
```python
# Have the terminal output only log messages of the specified level
LOG_LEVEL = 'ERROR'
```
You can also direct the log output to a file instead of the screen; likewise, this is done by adding a LOG_FILE setting to settings.py:
```python
LOG_FILE = 'log.txt'
```
Request parameter passing (handing an item to a callback through meta):

```python
import scrapy

from moviePro.items import MovieproItem


class MovieSpider(scrapy.Spider):
    name = 'movie'
    # allowed_domains = ['http://www.55xia.com']
    start_urls = ['http://www.55xia.com/movie/']

    def parseMoviePage(self, response):
        # Take the item out of meta
        item = response.meta['item']
        direct = response.xpath('//html/body/div[1]/div/div/div[1]/div[1]/div[2]/table/tbody/tr[1]/td[2]//text()').extract_first()
        country = response.xpath('//html/body/div[1]/div/div/div[1]/div[1]/div[2]/table/tbody/tr[4]/td[2]/a/text()').extract_first()
        movie_referral = response.xpath('/html/body/div[1]/div/div/div[1]/div[2]/div[2]/p/text()').extract_first()
        download_url = response.xpath('//td[@class="text-break"]/div/a[@rel="nofollow"]/@href').extract_first()
        password = response.xpath('//td[@class="text-break"]/div/strong/text()').extract_first()
        download = '连接:%s密码:%s' % (download_url, password)
        item['download'] = download
        item['country'] = country
        item['direct'] = direct
        item['movie_referral'] = movie_referral
        yield item

    def parse(self, response):
        div_list = response.xpath('//html/body/div[1]/div[1]/div[2]/div')
        for div in div_list:
            name = div.xpath('.//div[@class="meta"]/h1/a/text()').extract_first()
            parse_url = div.xpath('.//div[@class="meta"]/h1/a/@href').extract_first()
            genre = div.xpath('.//div[@class="otherinfo"]//text()').extract()
            genre = '|'.join(genre)
            url = 'http:%s' % parse_url
            item = MovieproItem()
            item['name'] = name
            item['genre'] = genre
            # Request parameter passing: the two callbacks parse different pages but the data must
            # be saved together, so the item is passed to the callback through meta and retrieved
            # there with response.meta['item']
            # The meta parameter must be a dict
            yield scrapy.Request(url=url, callback=self.parseMoviePage, meta={'item': item})
```
Problem: what if we want to crawl all of a site's data?
Solution: send requests manually page by page (as shown earlier), or use a CrawlSpider.
CrawlSpider concept: a CrawlSpider is really just a subclass of Spider with stronger functionality (link extractors and rule parsers). Create one with:
scrapy genspider -t crawl chouti dig.chouti.com
Link extractor: as the name implies, it is used to extract the specified links (URLs).
```python
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ChoutiSpider(CrawlSpider):
    name = 'chouti'
    # allowed_domains = ['dig.chouti.com']
    start_urls = ['https://dig.chouti.com/']

    # Instantiate a link extractor object
    # Link extractor: as the name implies, it extracts the specified links (URLs)
    # allow parameter: takes a regular expression
    # The link extractor extracts the links in the page that match the regular expression
    # All extracted links are handed to the rule parser
    link = LinkExtractor(allow=r'/all/hot/recent/\d+')

    rules = (
        # Instantiate a rule parser object
        # After receiving the links sent by the link extractor, the rule parser requests those links,
        # fetches the page content and parses it according to the specified rule
        # callback: the parsing rule to apply (a method/function)
        # follow: whether to keep applying the link extractor to the pages behind the extracted links
        Rule(link, callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        print(response)  # the response data can be parsed here
```
Distributed crawling:
Install the scrapy-redis component:

pip install scrapy-redis
In the spider file, import RedisCrawlSpider and make the spider inherit from it:

```python
from scrapy_redis.spiders import RedisCrawlSpider


class QiubaiSpider(RedisCrawlSpider):
    pass
```
```python
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from scrapy_redis.spiders import RedisCrawlSpider

from redisPro.items import RedisproItem


class QiubaiSpider(RedisCrawlSpider):
    name = 'qiubai'
    # allowed_domains = ['https://www.qiushibaike.com/pic/']
    # start_urls = ['https://www.qiushibaike.com/pic/']

    # Name of the scheduler's queue; plays the same role as start_urls
    redis_key = 'qiubaispider'

    rules = (
        Rule(LinkExtractor(allow=r'/pic/page/\d+'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        div_list = response.xpath('//div[@id="content-left"]/div')
        for div in div_list:
            img_url = 'https:' + div.xpath('.//div[@class="thumb"]/a/img/@src').extract_first()
            item = RedisproItem()
            item['img_url'] = img_url
            yield item
```
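The RedisproItem referenced above is assumed to be defined in items.py roughly like this:

```python
import scrapy


class RedisproItem(scrapy.Item):
    # URL of the scraped picture
    img_url = scrapy.Field()
```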
In settings.py, register the shared pipeline provided by the scrapy-redis component:

```python
ITEM_PIPELINES = {
    # Native (project-local) pipeline
    # 'redisPro.pipelines.RedisproPipeline': 300,
    # Shared pipeline provided by the distributed (scrapy-redis) component
    'scrapy_redis.pipelines.RedisPipeline': 400,
}
```
Also in settings.py, configure the scrapy-redis de-duplication and scheduler:

```python
# Use the scrapy-redis de-duplication queue
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Use the scheduler that comes with the scrapy-redis component
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'
# Whether the crawl may be paused (and resumed)
SCHEDULER_PERSIST = True
```
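The workers also need to know where the shared Redis instance is; assuming Redis runs on the default port (on a real cluster this would be the master's address), settings.py would additionally contain something like:

```python
# Location of the shared Redis server used by the scheduler and pipeline
REDIS_HOST = '127.0.0.1'
REDIS_PORT = 6379
```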
Run the spider file:

scrapy runspider qiubai.py
Put the start URL into the scheduler's queue (in redis-cli):
lpush <queue name (the redis_key)> <start url>
lpush qiubaispider https://www.qiushibaike.com/pic/
Example of an img_url value scraped by the spider:
https://pic.qiushibaike.com/system/pictures/12140/121401684/medium/59AUGYJ1J0ZAPSOL.jpg