The Scrapy Framework: CrawlSpider

Date: 2019-12-11


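Problem: how do you crawl the data of an entire website?
Solutions:

Sending requests manually: recurse with Scrapy's base Spider class, yielding Request objects whose callback is the parse method itself.
CrawlSpider: let CrawlSpider extract and follow links automatically (more concise and efficient).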

1. Introduction to CrawlSpider

CrawlSpider is a subclass of Spider.

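1.1 CrawlSpider features

CrawlSpider is more powerful than Spider: besides inheriting Spider's behavior, it adds capabilities of its own.
The most notable of these are the link extractor (LinkExtractor) and the rule parser (Rule).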

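1.2 When to use Spider and when to use CrawlSpider

Spider is the base class of all spiders. It is designed simply to crawl the pages listed in start_urls; when you need to keep crawling the URLs extracted from those pages, CrawlSpider is the better fit.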

2. Using CrawlSpider

2.1 Creating the project and the CrawlSpider file

# Create the Scrapy project:
$ scrapy startproject crawlSpiderPro
$ cd crawlSpiderPro/

# Create a spider file based on CrawlSpider
$ scrapy genspider -t crawl chouti dig.chouti.com
Created spider 'chouti' using template 'crawl' in module:
  crawlSpiderPro.spiders.chouti

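Note: compared with the usual genspider command, this one adds "-t crawl", which means the generated spider file is based on the CrawlSpider class rather than on the base Spider class.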

2.2 Examining the generated spider file: chouti.py

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor   # the link extractor class
from scrapy.spiders import CrawlSpider, Rule   # Rule is the rule parser class

class ChoutiSpider(CrawlSpider):   # the parent class here is CrawlSpider
    name = 'chouti'
    # allowed_domains = ['dig.chouti.com']
    start_urls = ['https://dig.chouti.com/']

    rules = (
        # rules is a tuple of Rule (rule parser) objects
        # the first argument of a Rule is a link extractor object
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):   # parsing method
        i = {}
        #i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        #i['name'] = response.xpath('//div[@id="name"]').extract()
        #i['description'] = response.xpath('//div[@id="description"]').extract()
        return i

2.3 LinkExtractor: the link extractor

A link extractor pulls out of a page the links (URLs) that match a regular expression.

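LinkExtractor(
    allow=r'Items/',     # URLs matching this regex are extracted; if empty, every link matches.
    deny=(),             # URLs matching this regex are never extracted (takes precedence over allow).
    restrict_xpaths=(),  # Only links inside the regions selected by this XPath are extracted.
    restrict_css=(),     # Only links inside the regions selected by this CSS selector are extracted.
    deny_domains=(),     # Links pointing to these domains are not extracted.
)
# An empty tuple is the library default for each optional argument and means "no restriction".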

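The allow parameter takes a regular expression.
Once allow is set, the link extractor uses that regular expression to pick out the matching links in the page. All extracted links are handed to the rule parser.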

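2.4 Rule: the rule parser

Once the rule parser receives the links from the link extractor, requests are sent for those links to fetch the corresponding pages.
The fetched page content is then parsed according to the specified rule (the callback) to pull out the desired data.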

(1) Rule format

Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True)

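(2) Parameters

Parameter 1: the link extractor.
Parameter 2: callback, the callback function that parses the fetched pages.
Parameter 3: follow, whether the link extractor is applied again to the pages reached through the links it extracted.

When callback is None, follow defaults to True; when a callback is given, it defaults to False.
With follow=False, the link extractor only extracts the page-number URLs shown on the current page.
With follow=True, it keeps extracting from every page it reaches until all page links have been found, and duplicate requests are filtered out automatically.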

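2.5 The overall CrawlSpider crawl flow

a. The spider first fetches the page content of the start URL.
b. The link extractor extracts the links from the page content of step a according to its extraction rule.
c. The rule parser requests the extracted links and parses the returned pages according to its parsing rule.
d. The parsed data is packed into an item and submitted to the pipeline for persistent storage.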
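The flow above, together with the difference between follow=False and follow=True, can be mimicked with a small pure-Python simulation. This is not how Scrapy is implemented internally; the `SITE` dictionary, the `crawl` function, and the breadth-first order are all assumptions chosen to make the behavior easy to trace.

```python
import re
from collections import deque

# Hypothetical in-memory "site": URL -> links found on that page.
SITE = {
    "/": ["/all/hot/recent/1", "/all/hot/recent/2", "/about"],
    "/all/hot/recent/1": ["/all/hot/recent/2", "/all/hot/recent/3"],
    "/all/hot/recent/2": ["/all/hot/recent/1", "/all/hot/recent/3"],
    "/all/hot/recent/3": ["/all/hot/recent/4"],
    "/all/hot/recent/4": [],
    "/about": [],
}
ALLOW = re.compile(r"/all/hot/recent/\d+")


def crawl(start, follow):
    """Return the pages handed to the callback, in visiting order."""
    seen = {start}                   # de-duplication: never request a URL twice
    queue = deque([(start, True)])   # (url, whether to extract links from it)
    parsed = []
    while queue:
        url, extract = queue.popleft()
        if url != start:
            parsed.append(url)       # the callback only runs on extracted links
        if not extract:
            continue                 # follow=False: stop at this page
        for link in SITE[url]:
            if ALLOW.search(link) and link not in seen:
                seen.add(link)
                queue.append((link, follow))
    return parsed


print(crawl("/", follow=False))
print(crawl("/", follow=True))
```

With follow=False only pages 1 and 2 (the links visible on the start page) are parsed; with follow=True pages 3 and 4 are reached as well, and each page is still visited only once even though several pages link to it.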

3. Hands-on project: Chouti

(1) chouti.py

import scrapy
from scrapy.linkextractors import LinkExtractor   # the link extractor class
from scrapy.spiders import CrawlSpider, Rule   # Rule is the rule parser class
from crawlSpiderPro.items import CrawlspiderproItem

class ChoutiSpider(CrawlSpider):
    name = 'chouti'
    # allowed_domains = ['dig.chouti.com']
    start_urls = ['https://dig.chouti.com/']
    # Define the link extractor and its extraction rule
    Link = LinkExtractor(allow=r'/all/hot/recent/\d+')    # matches the href of the pagination <a> tags

    rules = (
        # Define the rule parser; the parsing rule is given by the callback
        Rule(Link, callback='parse_item', follow=True),
    )

    def parse_item(self, response):   # parsing method
        """Callback that implements the rule parser's parsing rule."""
        div_list = response.xpath('//div[@id="content-list"]/div')

        for div in div_list:
            # Create the item
            item = CrawlspiderproItem()
            # Extract the news content with an XPath expression
            # (default='' avoids calling strip() on None when nothing matches)
            item['content'] = div.xpath('.//div[@class="part1"]/a/text()').extract_first(default='').strip('\n')
            # Extract the news author with an XPath expression
            item['author'] = div.xpath('.//div[@class="part2"]/a[4]/b/text()').extract_first(default='').strip('\n')
            yield item  # submit the item to the pipeline

(2) items.py

import scrapy

class CrawlspiderproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    author = scrapy.Field()
    content = scrapy.Field()

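(3) pipelines.py

class CrawlspiderproPipeline(object):
    def __init__(self):
        self.fp = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the output file
        print('Spider started')
        self.fp = open('./data.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Write each item submitted by the spider to the file for persistent storage
        self.fp.write(item['author'] + ':' + item['content'] + '\n')
        return item

    def close_spider(self, spider):
        # Called once when the spider closes: release the file handle
        print('Spider finished')
        self.fp.close()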

(4) settings.py

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36' # disguise the identity of the request sender

# Obey robots.txt rules
ROBOTSTXT_OBEY = False   # ignore the site's robots.txt so that pages it disallows are still crawled

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'crawlSpiderPro.pipelines.CrawlspiderproPipeline': 300,
}

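(5) Running the spider

$ scrapy crawl chouti --nolog

As the example shows, crawling a whole site with CrawlSpider takes far less code than the manual-request approach, because link extraction, follow-up requests, and de-duplication are handled by the framework rather than by hand-written callbacks.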