Python Notes 26 (Advanced Web Scraping)

Date: 2019-11-16


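1. Introduction to the Scrapy framework

I. What is Scrapy?

　　Scrapy is a widely used, powerful application framework written for crawling websites and extracting structured data. A framework here means a highly reusable project template that already integrates many features (high-performance asynchronous downloading, queues, distributed crawling, parsing, persistence, and so on). When learning a framework, the point is to learn its characteristics and how to use each of its features.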

II. Installation

　　Linux:

pip3 install scrapy

　　Windows:

a. pip3 install wheel

b. Download Twisted from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted (download the precompiled wheel)

c. Change to the download directory and run pip3 install Twisted-17.1.0-cp35-cp35m-win_amd64.whl

d. pip3 install pywin32

e. pip3 install scrapy

Note: run scrapy in a terminal; if it prints version information, the installation succeeded.

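III. Scrapy core components

Engine (Scrapy Engine)
Controls the data flow of the whole system and triggers events (the core of the framework).
Scheduler
Accepts requests from the engine, pushes them onto a queue, and hands them back when the engine asks for the next one. Think of it as a priority queue of URLs (the addresses of the pages to crawl): it decides which URL is crawled next and removes duplicate URLs.
Downloader
Downloads page content and returns it to the spiders (the Scrapy downloader is built on Twisted, an efficient asynchronous model).
Spiders
Spiders do the main work: they extract the information you need, the so-called items, from specific pages. They can also extract links so that Scrapy goes on to crawl the next page.
Item Pipeline
Processes the items that spiders extract from pages. Its main jobs are persisting items, validating them, and removing unneeded information. After a page is parsed by a spider, its items are sent to the item pipeline and processed by several components in a fixed order.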
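
The scheduler's two jobs, ordering URLs and dropping duplicates, can be sketched with the standard library alone. This is a toy model, not Scrapy's scheduler: the class name ToyScheduler, the example.com URLs, and the convention that a lower number is served first are all assumptions made for illustration.

```python
import heapq
import itertools


class ToyScheduler:
    """Toy stand-in for a crawl scheduler: priority queue plus duplicate filter."""

    def __init__(self):
        self._heap = []                    # entries are (priority, order, url)
        self._seen = set()                 # URLs that were already enqueued
        self._order = itertools.count()    # tie-breaker keeps insertion order

    def enqueue(self, url, priority=0):
        """Queue a URL unless it was seen before; return True if it was queued."""
        if url in self._seen:
            return False
        self._seen.add(url)
        heapq.heappush(self._heap, (priority, next(self._order), url))
        return True

    def next_url(self):
        """Return the next URL to crawl, or None when the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


scheduler = ToyScheduler()
scheduler.enqueue("https://example.com/page/2", priority=1)
scheduler.enqueue("https://example.com/", priority=0)
duplicate_queued = scheduler.enqueue("https://example.com/")  # rejected as a duplicate
crawl_order = [scheduler.next_url(), scheduler.next_url(), scheduler.next_url()]
```

The duplicate is rejected at enqueue time, so the queue only ever holds URLs that still need a download.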

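2. Basic use of the Scrapy framework

　　1) Create a project: scrapy startproject project_name

　　　　Project layout:

project_name/
   scrapy.cfg
   project_name/
       __init__.py
       items.py
       pipelines.py
       settings.py
       spiders/
           __init__.py


scrapy.cfg    Main project configuration (the crawler-related settings actually live in settings.py)
items.py      Data templates that give scraped data a structure, similar to a Django Model
pipelines.py  Data persistence
settings.py   Settings such as crawl depth, concurrency, and download delay
spiders/      Spider directory: create spider files here and write the parsing rules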

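　　2) Create a spider:

　　　　　　cd project_name (enter the project directory)

　　　　　　scrapy genspider spider_name start_domain (for example: scrapy genspider qiubai www.qiushibaike.com)

　　3) Write the spider file: after step 2, a spider file named after the spider is generated in the project's spiders directory. Its source looks like this: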

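# -*- coding: utf-8 -*-
import scrapy


class QiubaiSpider(scrapy.Spider):
    # Name of the spider: used to select this spider on the command line
    name = 'qiubai'
    # Domains the spider may crawl (requests to other domains are filtered out).
    # List bare domain names here, not full URLs.
    allowed_domains = ['www.qiushibaike.com']
    # Start URLs: when the project runs, the pages at these URLs are fetched first
    start_urls = ['https://www.qiushibaike.com/']

    # Callback invoked with the response for each start URL.
    # response: the response object for the request sent to the start URL
    # The return value must be an iterable or None
    def parse(self, response):
        print(response.text)  # response content as a string
        print(response.body)  # response content as bytes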
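
Why allowed_domains should hold bare domain names rather than full URLs can be seen from a simplified host check. The helper is_allowed below is hypothetical and only approximates how an off-site filter compares hostnames; it is not Scrapy's own code.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


good = ["www.qiushibaike.com"]
bad = ["https://www.qiushibaike.com/"]   # a full URL never equals a hostname

on_site = is_allowed("https://www.qiushibaike.com/text/", good)
off_site = is_allowed("https://www.example.com/", good)
with_url_entry = is_allowed("https://www.qiushibaike.com/text/", bad)
```

With a full URL in the list, even pages on the intended site fail the comparison, which is why follow-up requests can end up filtered.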

　　4) Adjust the relevant settings in settings.py

Changes (disguising the identity of the request sender):
Line 19: USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'

Line 22: ROBOTSTXT_OBEY = False  # do not obey the robots.txt protocol

　　5) Run the spider: scrapy crawl spider_name

　　To suppress log output, run: scrapy crawl spider_name --nolog

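3. Persistent storage with Scrapy

I. Crawl the content and author of the jokes on the Qiubai (qiushibaike.com) text page and save the parsed data to a file on disk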

# -*- coding: utf-8 -*-
import scrapy


class QiubaiSpider(scrapy.Spider):
    name = 'qiubai'
    # allowed_domains = ['www.qiushibaike.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        # response.xpath() applies an XPath expression to the response;
        # it returns a list of Selector objects
        odiv = response.xpath('//div[@id="content-left"]/div')
        content_list = []  # holds the parsed data
        for div in odiv:
            # The parsed content is wrapped in Selector objects; call extract()
            # to get the text out, or extract_first() for the first match only.
            # author = div.xpath('./div[1]/a[2]/h2/text()')[0].extract()
            author = div.xpath('./div[1]/a[2]/h2/text()').extract_first()
            content = div.xpath('.//div[@class="content"]/span//text()').extract()
            # Join the list into a single string
            content = "".join(content)
            # Wrap the parsed data in a dict
            dic = {
                'author': author,
                'content': content,
            }

            # Append the record to content_list
            content_list.append(dic)

        return content_list

Run the spider:

Export in a chosen format: write the scraped data to files of different formats
    scrapy crawl qiubai -o qiubai.json
    scrapy crawl qiubai -o qiubai.xml
    scrapy crawl qiubai -o qiubai.csv

II. Scrapy persistence: write the scraped Qiubai data to a text file

# -*- coding: utf-8 -*-
import scrapy

class QiubaiSpider(scrapy.Spider):
    name = 'qiubai'
    allowed_domains = ['www.qiushibaike.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        # response.xpath() applies an XPath expression to the response
        odiv = response.xpath('//div[@id="content-left"]/div')
        with open('./data.txt', 'w', encoding='utf-8') as fp:
            for div in odiv:
                # xpath() returns a list of Selector objects; extract() and
                # extract_first() pull the text out of them
                author = div.xpath('./div[1]/a[2]/h2/text()').extract_first() or ''
                content = div.xpath('.//div[@class="content"]/span//text()').extract()
                # extract() returns a list, so join it into one string
                content = "".join(content)

                # Persist the scraped content
                fp.write(author + ':' + content + '\n')

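Note: the code above persists data through file I/O that we wrote ourselves. Scrapy already ships with an efficient, convenient persistence mechanism that we can use directly. To use it, first get to know these two files:

    items.py: data structure template file. Defines the data fields.
    pipelines.py: pipeline file. Receives the data (items) and persists it.

Persistence workflow:
    1. After the spider scrapes the data, wrap the data in an item object.
    2. Use the yield keyword to submit the item object to the pipeline for persistence.
    3. Enable the pipeline in settings.py.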
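
The workflow above can be simulated without Scrapy to see when each pipeline hook runs. In this sketch the class TextFilePipeline, the generator fake_spider, and the sample items are hypothetical, and a plain loop stands in for the engine.

```python
import os
import tempfile


class TextFilePipeline:
    """Same three hooks as a Scrapy item pipeline, written as a plain class."""

    def __init__(self, path):
        self.path = path
        self.fp = None

    def open_spider(self, spider):
        # Called once when the crawl starts
        self.fp = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Called once for every item the spider yields
        self.fp.write(f"{item['author']}:{item['content']}\n")
        return item

    def close_spider(self, spider):
        # Called once when the crawl ends
        self.fp.close()


def fake_spider():
    """Hypothetical spider: yields two hard-coded items instead of parsing pages."""
    yield {"author": "alice", "content": "first joke"}
    yield {"author": "bob", "content": "second joke"}


out_path = os.path.join(tempfile.mkdtemp(), "qiubai_data.txt")
pipeline = TextFilePipeline(out_path)

# What the engine does, reduced to a loop: open, feed items, close
pipeline.open_spider(spider=None)
for item in fake_spider():
    pipeline.process_item(item, spider=None)
pipeline.close_spider(spider=None)

with open(out_path, encoding="utf-8") as fp:
    lines = fp.read().splitlines()
```

Opening the file in open_spider and closing it in close_spider keeps process_item cheap, since that method runs once per item.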

Quick exercise: crawl the jokes and their authors from the Qiushibaike text page, then persist them

Spider file: qiubaiDemo.py

# -*- coding: utf-8 -*-
import scrapy
from pipLinePro.items import PiplineproItem

class QiubaidemoSpider(scrapy.Spider):
    name = 'qiubaiDemo'
    allowed_domains = ['www.qiushibaike.com']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        odiv = response.xpath('//div[@id="content-left"]/div')
        for div in odiv:
            # xpath() returns a list of Selector objects; call extract() to get
            # all matched text, or extract_first() to get the first match
            # author = div.xpath('./div[1]/a[2]/h2/text()')[0].extract()
            author = div.xpath('./div[1]/a[2]/h2/text()').extract_first()
            content = div.xpath('.//div[@class="content"]/span//text()').extract()
            # Join the list into a single string
            content = "".join(content)
            # 1. PiplineproItem is imported at the top of the file
            # 2. Instantiate the item
            item = PiplineproItem()
            # 3. Store the parsed values in the item
            item['author'] = author
            item['content'] = content
            # 4. Submit the item to the pipeline
            yield item

Items file: items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class PiplineproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # scrapy.Field() can hold data of any type
    author = scrapy.Field()  # author
    content = scrapy.Field()  # joke content

Pipeline file: pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class PiplineproPipeline(object):
    def __init__(self):
        self.fp = None  # file object, opened in open_spider

    # Called once, when the spider starts
    def open_spider(self, spider):
        self.fp = open('./qiubai_data.txt', 'w', encoding='utf-8')
        print('spider started')

    # Called every time the spider submits an item to the pipeline.
    # item: the item object submitted by the spider file.
    # Because this method runs many times, the file is opened and closed
    # in the two methods that each run only once.
    def process_item(self, item, spider):
        author = item['author'] or ''
        content = item['content']
        # Persist the item submitted by the spider, one record per line
        self.fp.write(author + ':' + content + '\n')
        return item

    # Called once, when the spider finishes
    def close_spider(self, spider):
        print('spider finished')
        self.fp.close()

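Settings file: settings.py

# Item pipelines are not enabled by default; to persist data through a pipeline, enable it here in ITEM_PIPELINES
# PiplineproPipeline is the class defined in pipelines.py; 300 is its priority (the lower the number, the earlier it runs)
# A project can define several pipeline classes; each one has to be enabled here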
ITEM_PIPELINES = {
   'pipLinePro.pipelines.PiplineproPipeline': 300,
}

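III. Crawl the content and author of the jokes on the Qiubai text page and store the data in a MySQL database.

1) Create the table

# Open a terminal (cmd) and enter the following statements
mysql -uroot -p
create database scrapyDB;
use scrapyDB;
create table qiubai(author varchar(100),content varchar(9999));

SQL statements

2) pipelines.py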

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymysql


class PiplineproPipeline(object):
    fp = None  # file object, opened in open_spider

    # Called once, when the spider starts
    def open_spider(self, spider):
        self.fp = open('./qiubai_data.txt', 'w', encoding='utf-8')
        print('spider started')

    # Called every time the spider submits an item to the pipeline.
    # Because this method runs many times, the file is opened and closed
    # in the two methods that each run only once.
    def process_item(self, item, spider):
        author = item['author'] or ''
        content = item['content']
        # Persist the item submitted by the spider, one record per line
        self.fp.write(author + ':' + content + '\n')
        return item

    # Called once, when the spider finishes
    def close_spider(self, spider):
        print('spider finished')
        self.fp.close()


class MyPipline(object):
    conn = None
    cursor = None

    def open_spider(self, spider):
        self.conn = pymysql.Connect(host="192.168.12.65", port=3306, db="scrapyDB", charset="utf8", user="root")
        self.cursor = self.conn.cursor()
        print('connected to MySQL')

    def process_item(self, item, spider):
        author = item['author']
        content = item['content']
        # Pass the values as query parameters so the driver escapes quotes
        sql = "insert into qiubai values(%s, %s)"
        try:
            self.cursor.execute(sql, (author, content))
            self.conn.commit()
        except Exception as e:
            print(e)
            self.conn.rollback()
        return item

    # Close the cursor and the connection when the spider finishes
    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()

pipelines.py
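
Scraped text often contains quotes, and an insert statement assembled by string formatting breaks on them. A minimal sketch of the placeholder approach follows, using an in-memory SQLite database as a stand-in for MySQL so it runs without a server; the sample item is made up. Note the placeholder syntax differs between drivers: sqlite3 uses "?", while pymysql uses "%s".

```python
import sqlite3

# In-memory SQLite database as a stand-in for the MySQL table above
conn = sqlite3.connect(":memory:")
conn.execute("create table qiubai(author varchar(100), content varchar(9999))")

item = {"author": "O'Brien", "content": "He said 'hi' and left"}

# Placeholders keep the quotes in the data from breaking the statement
conn.execute("insert into qiubai values(?, ?)", (item["author"], item["content"]))
conn.commit()

rows = conn.execute("select author, content from qiubai").fetchall()
conn.close()
```

The values travel separately from the SQL text, so they are stored exactly as scraped.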

3) settings

ITEM_PIPELINES = {
    'pipLinePro.pipelines.PiplineproPipeline': 300,
    'pipLinePro.pipelines.MyPipline': 300,
}

settings
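
The numbers in ITEM_PIPELINES decide the order in which an item passes through the pipelines, lowest first. The sketch below only sorts the dictionary to show that order; the value 301 for MyPipline is an assumption chosen here so that the order is unambiguous.

```python
ITEM_PIPELINES = {
    'pipLinePro.pipelines.PiplineproPipeline': 300,
    'pipLinePro.pipelines.MyPipline': 301,   # assumed value, chosen to run second
}

# Items pass through the pipelines in ascending order of these numbers
run_order = [path for path, priority in sorted(ITEM_PIPELINES.items(), key=lambda kv: kv[1])]
first_pipeline = run_order[0].rsplit('.', 1)[1]
```

Each process_item has to return the item, otherwise the next pipeline in the order receives nothing to store.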

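IV. Recursively crawling multiple pages with Scrapy

Requirement: crawl the data on every page of a site and persist it (the example below crawls post titles from dig.chouti.com)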

# -*- coding: utf-8 -*-
import scrapy
from choutiAllPro.items import ChoutiallproItem


class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    # allowed_domains = ['www.chouti.com']
    start_urls = ['https://dig.chouti.com/r/pic/hot/1']
    # A URL template shared by all pages (pageNum is the page number)
    pageNum = 1  # starting page number
    url = 'https://dig.chouti.com/r/pic/hot/%d'  # URL of each page

    def parse(self, response):
        div_list = response.xpath('//div[@class="content-list"]/div')
        for div in div_list:
            title = div.xpath('./div[3]/div[1]/a/text()').extract_first()
            item = ChoutiallproItem()
            item['title'] = title
            yield item  # submit the item to the pipeline for persistence

        # Request the URLs of the remaining pages to crawl every page
        if self.pageNum <= 10:
            self.pageNum += 1
            url = self.url % self.pageNum
            print(url)
            # Recursive crawl: callback is the function that parses the response
            # to this request; passing self.parse makes parse call itself
            yield scrapy.Request(url=url, callback=self.parse)

chouti.py
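
The page URLs come from filling a page number into a template. A small sketch of that step, with a hypothetical helper page_urls, makes the page range explicit. In the spider above the counter is incremented before the request is built, so with the condition pageNum <= 10 the last page requested is page 11.

```python
URL_TEMPLATE = 'https://dig.chouti.com/r/pic/hot/%d'


def page_urls(first_page, last_page):
    """Yield the URL of every page from first_page to last_page inclusive."""
    for page_num in range(first_page, last_page + 1):
        yield URL_TEMPLATE % page_num


urls = list(page_urls(1, 3))
```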

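4. Cookies and proxies in Scrapy

I. Cookies

With the requests module you can use requests.Session to keep cookies across requests. Scrapy has no such object, but none is needed: the framework handles cookies itself and stores them automatically. The earlier examples all sent GET requests; the examples below use Scrapy to send POST requests.
By default, requests are sent as GET: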

# -*- coding: utf-8 -*-
import scrapy


class PostSpider(scrapy.Spider):
    name = 'post'
    allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.xxx.com/']

    # start_requests in Spider sends a GET request to each URL in start_urls by default
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        pass

start_requests method

Override start_requests in the spider to send a POST request to each URL in start_urls:

# -*- coding: utf-8 -*-
import scrapy

# Goal: send a POST request to each URL in start_urls
class PostSpider(scrapy.Spider):
    name = 'post'
    #allowed_domains = ['www.xxx.com']
    start_urls = ['https://fanyi.baidu.com/sug']

    # Override start_requests to send a POST request to each URL in start_urls
    def start_requests(self):
        for url in self.start_urls:
            # FormRequest sends a POST request with form-encoded data
            yield scrapy.FormRequest(url=url, callback=self.parse, formdata={'kw': 'dog'})

    def parse(self, response):
        print(response.text)

Overriding start_requests
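
A form POST carries its fields in the request body, encoded like a query string. The sketch below shows that encoding with the standard library for the same formdata dictionary; it illustrates the wire format only and is not how FormRequest is implemented.

```python
from urllib.parse import urlencode, parse_qs

formdata = {'kw': 'dog'}

# The body of a form POST is the fields joined as key=value pairs
body = urlencode(formdata)
headers = {'Content-Type': 'application/x-www-form-urlencoded'}

# The server decodes the body back into fields
decoded = parse_qs(body)
```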

Cookie-based login:

# -*- coding: utf-8 -*-
import scrapy


class LoginSpider(scrapy.Spider):
    name = 'login'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://accounts.douban.com/login']

    def start_requests(self):
        data = {
            'source': 'movie',
            'redir': 'https://movie.douban.com/',
            'form_email': 'xxx',
            'form_password': 'xxx',
            'login': '登陆'
        }
        for url in self.start_urls:
            yield scrapy.FormRequest(url=url, callback=self.parse, formdata=data)

    # A separate callback that saves the page content to a file
    def getPageText(self, response):
        page_text = response.text
        with open('./douban.html', 'w', encoding='utf-8') as fp:
            fp.write(page_text)
            print('over')

    def parse(self, response):
        # Fetch the logged-in user's profile page; Scrapy sends the login cookies automatically
        url = 'https://www.douban.com/people/xxx/'
        yield scrapy.Request(url=url, callback=self.getPageText)

II. Proxies

DOWNLOADER_MIDDLEWARES = {
   'dailiPro.middlewares.DailiproDownloaderMiddleware': 543,
}

settings.py

# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals

# Spider middleware
# class DailiproSpiderMiddleware(object):
#     # Not all methods need to be defined. If a method is not defined,
#     # scrapy acts as if the spider middleware does not modify the
#     # passed objects.
#
#     @classmethod
#     def from_crawler(cls, crawler):
#         # This method is used by Scrapy to create your spiders.
#         s = cls()
#         crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
#         return s
#
#     def process_spider_input(self, response, spider):
#         # Called for each response that goes through the spider
#         # middleware and into the spider.
#
#         # Should return None or raise an exception.
#         return None
#
#     def process_spider_output(self, response, result, spider):
#         # Called with the results returned from the Spider, after
#         # it has processed the response.
#
#         # Must return an iterable of Request, dict or Item objects.
#         for i in result:
#             yield i
#
#     def process_spider_exception(self, response, exception, spider):
#         # Called when a spider or process_spider_input() method
#         # (from other spider middleware) raises an exception.
#
#         # Should return either None or an iterable of Response, dict
#         # or Item objects.
#         pass
#
#     def process_start_requests(self, start_requests, spider):
#         # Called with the start requests of the spider, and works
#         # similarly to the process_spider_output() method, except
#         # that it doesn’t have a response associated.
#
#         # Must return only requests (not items).
#         for r in start_requests:
#             yield r
#
#     def spider_opened(self, spider):
#         spider.logger.info('Spider opened: %s' % spider.name)

# Downloader middleware
class DailiproDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    # Handles outgoing requests
    def process_request(self, request, spider):
        # request is the intercepted request object
        # request.meta = {'https':'151.106.15.3:1080'}
        # Route the request through a proxy by setting meta['proxy']
        request.meta['proxy'] = "https://151.106.15.3:1080"
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

middlewares.py

# -*- coding: utf-8 -*-
import scrapy


class DailiSpider(scrapy.Spider):
    name = 'daili'
    allowed_domains = ['www.daili.com']
    start_urls = ['http://www.baidu.com/s?wd=ip']

    def parse(self, response):
        page_text = response.text
        with open('daili.html', 'w', encoding='utf-8') as fp:
            fp.write(page_text)

daili.py
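
A single hard-coded proxy is easy to block, so a common variation is to pick one from a pool for each request. The sketch below shows only the selection step: the helper pick_proxy and the addresses in PROXY_POOL are hypothetical placeholders, not working proxies. Inside process_request the result would be assigned to request.meta['proxy'].

```python
import random
from urllib.parse import urlparse

# Hypothetical proxy pool; these addresses are placeholders, not working proxies
PROXY_POOL = {
    'http': ['http://10.0.0.1:8080', 'http://10.0.0.2:8080'],
    'https': ['https://10.0.0.3:8443'],
}


def pick_proxy(url, pool=PROXY_POOL, rng=random):
    """Pick a random proxy whose scheme matches the scheme of the target URL."""
    scheme = urlparse(url).scheme
    candidates = pool.get(scheme)
    if not candidates:
        return None          # no proxy for this scheme: connect directly
    return rng.choice(candidates)


proxy_for_http = pick_proxy('http://www.baidu.com/s?wd=ip')
proxy_for_https = pick_proxy('https://www.example.com/')
proxy_for_ftp = pick_proxy('ftp://example.com/file')
```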

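5. Log levels and passing data between requests in Scrapy

I. Scrapy log levels

- When you run a spider with scrapy crawl spiderFileName, what is printed in the terminal is Scrapy's log output.

　　- Log levels:

　　　　　　　　ERROR: regular errors

　　　　　　　　WARNING: warnings

　　　　　　　　INFO: informational messages

　　　　　　　　DEBUG: debugging messages

　　　　　　　　The default level is DEBUG

　　- Controlling log output:

　　　　Add LOG_LEVEL = 'level name' to settings.py to print only messages at or above that level. LOG_FILE = 'log.txt' writes the log output to the given file instead.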
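
Scrapy logs through Python's standard logging module, so the effect of a level threshold can be tried without running a crawl. In this sketch the logger name and the ListHandler class are hypothetical; setLevel plays the role that LOG_LEVEL plays in settings.py.

```python
import logging


class ListHandler(logging.Handler):
    """Collects log messages in a list so they can be inspected."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(f"{record.levelname}: {record.getMessage()}")


logger = logging.getLogger("log_level_demo")   # hypothetical logger name
logger.propagate = False
handler = ListHandler()
logger.addHandler(handler)

logger.setLevel(logging.WARNING)   # plays the role of LOG_LEVEL = 'WARNING'
logger.debug("crawled a page")
logger.info("spider opened")
logger.warning("retrying request")
logger.error("download failed")

shown = handler.messages
```

Only the messages at WARNING or above reach the handler; the DEBUG and INFO lines are dropped.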

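II. Passing data between requests

- Sometimes the data we want is not all on one page. For example, when crawling a movie site, the title and rating are on the listing (first-level) page while the remaining details are on each movie's detail (second-level) page. In that case we need to pass data along with the request.

　　- Example: crawl the movie site https://www.dy2018.com/html/gndy/dyzz/, taking the movie title from the listing page and the director from the detail page.

　　Spider file: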

# -*- coding: utf-8 -*-
import scrapy
from moviePro.items import MovieproItem


class MovieSpider(scrapy.Spider):
    name = 'movie'
    # allowed_domains = ['www.movie.com']
    start_urls = ['https://www.dy2018.com/html/gndy/dyzz/']

    # Parses the data on the movie detail page; the item started in parse arrives through meta
    def getSencodPageText(self, reponse):
        item = reponse.meta['item']
        actor = reponse.xpath('//*[@id="Zoom"]/p[16]/text()').extract_first()
        item['actor'] = actor
        yield item

    def parse(self, response):
        table_list = response.xpath('//div[@class="co_content8"]/ul/table')
        for table in table_list:
            url = 'https://www.dy2018.com'+table.xpath('./tbody/tr[2]/td[2]/b/a/@href').extract_first()
            name = table.xpath('./tbody/tr[2]/td[2]/b/a/text()').extract_first()
            item = MovieproItem()
            item['name'] = name
            yield scrapy.Request(url=url, callback=self.getSencodPageText, meta={'item': item})

movie.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class MovieproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    actor = scrapy.Field()

items.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class MovieproPipeline(object):
    def process_item(self, item, spider):
        print(item['name']+":"+item['actor'])
        return item

pipelines.py
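
The idea of carrying a half-filled item from one callback to the next can be shown without any network access. This is a simplified model with hypothetical names (FakeRequest, LISTING, DETAILS): in Scrapy the second callback receives a response and reads response.meta, while here the callback receives the request object directly.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class FakeRequest:
    """Hypothetical stand-in for a request: a URL, a callback, and a meta dict."""
    url: str
    callback: Callable
    meta: dict = field(default_factory=dict)


# Hard-coded "site": listing page data and detail page data
LISTING = [("Movie A", "/a.html"), ("Movie B", "/b.html")]
DETAILS = {"/a.html": "Director A", "/b.html": "Director B"}


def parse_listing():
    """First-level callback: start an item, then hand it to the detail request."""
    for name, href in LISTING:
        item = {"name": name}
        yield FakeRequest(url=href, callback=parse_detail, meta={"item": item})


def parse_detail(request):
    """Second-level callback: pick the item up from meta and complete it."""
    item = request.meta["item"]
    item["actor"] = DETAILS[request.url]
    return item


# A tiny loop in place of the engine: run each request's callback
items = [req.callback(req) for req in parse_listing()]
```

Each request carries its own item, so the fields from the two pages end up in the same record even when many detail pages are in flight.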

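6. CrawlSpider in Scrapy

Question: if you want to crawl the data of the whole Qiubai site, how many ways are there to do it?

Method 1: recursive crawling with Scrapy's Spider class (Request objects whose callback is the parse method).

Method 2: automatic crawling with CrawlSpider (more concise and efficient).

I. Overview

　　CrawlSpider is a subclass of Spider. Besides inheriting Spider's features, it adds more powerful features of its own, the most notable being link extractors (LinkExtractor). Spider is the base class of all spiders and is designed only to crawl the pages listed in start_urls; when the crawl should continue with URLs extracted from the pages already fetched, CrawlSpider is the better fit.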

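II. Usage

　　1. Create a Scrapy project: scrapy startproject projectName

　　2. Create a spider file: scrapy genspider -t crawl spiderName www.xxx.com

　　　　Compared with the earlier command, this one adds "-t crawl", which means the generated spider is based on the CrawlSpider class rather than the Spider base class.

　　3. Look at the generated spider file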

# -*- coding: utf-8 -*-
import scrapy
# Import the CrawlSpider-related classes
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# This spider is based on the CrawlSpider class
class ChoutidemoSpider(CrawlSpider):

    name = 'choutiDemo'
    #allowed_domains = ['www.chouti.com']
    start_urls = ['http://www.chouti.com/']
    # Link extraction rules
    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )
    # Parsing callback
    def parse_item(self, response):
        i = {}
        #i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        #i['name'] = response.xpath('//div[@id="name"]').extract()
        #i['description'] = response.xpath('//div[@id="description"]').extract()
        return i

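　　　The biggest difference between CrawlSpider and Spider is the extra rules attribute, which defines the extraction actions. rules can contain one or more Rule objects, and each Rule object contains a LinkExtractor object.

3.1 LinkExtractor: as the name says, a link extractor.
　　　　LinkExtractor(
　　　　　　　　 allow=r'Items/',  # links matching this regular expression are extracted; if empty, every link matches
　　　　　　　　 deny=xxx,  # links matching this regular expression are not extracted
　　　　　　　　 restrict_xpaths=xxx,  # only links inside regions matching this XPath expression are extracted
　　　　　　　　 restrict_css=xxx,  # only links inside regions matching this CSS selector are extracted
　　　　　　　　 deny_domains=xxx,  # domains whose links are not extracted
　　　　)
　　　　- Purpose: extract the links in a response that match the rules.

3.2 Rule: a rule parser. It fetches the pages behind the links produced by the link extractor and parses their content with the given rule.
　　　　 Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True)
　　　　- Parameters:
　　　　　　Parameter 1: the link extractor
　　　　　　Parameter 2: the rule used to parse the data (a callback function)
　　　　　　Parameter 3: whether to keep applying the link extractor to the pages behind the extracted links. When callback is None, parameter 3 defaults to True.

3.3 rules = ( ): holds the different rule parsers. One Rule object represents one extraction rule.

3.4 Overall CrawlSpider workflow:
　　　　a) The spider first fetches the page content of the start URL
　　　　b) The link extractor extracts links from the page content of step a according to its rules
　　　　c) The rule parser parses the content of the pages behind the extracted links according to its parsing rule
　　　　d) The parsed data is wrapped in an item and submitted to the pipeline for persistence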
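
Step b, picking out the links that match an allow pattern, can be approximated with the standard library. This is a rough sketch, not LinkExtractor itself: the class HrefCollector, the function extract_links, and the sample page source are hypothetical.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class HrefCollector(HTMLParser):
    """Collects the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def extract_links(html, base_url, allow):
    """Return absolute links from html whose URL matches the allow regex."""
    collector = HrefCollector()
    collector.feed(html)
    pattern = re.compile(allow)
    links = [urljoin(base_url, href) for href in collector.hrefs]
    return [link for link in links if pattern.search(link)]


# Hypothetical page source
PAGE = """
<a href="/8hr/page/2/">2</a>
<a href="/8hr/page/3/">3</a>
<a href="/about/">about</a>
"""

page_links = extract_links(PAGE, "http://www.qiushibaike.com/", r"/8hr/page/\d+/")
```

The pattern is searched anywhere in the absolute URL, so a short fragment such as /8hr/page/\d+/ is enough to select the pagination links and skip the rest.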

III. A simple worked example

Spider file:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from qiubaiBycrawl.items import QiubaibycrawlItem
import re

class QiubaitestSpider(CrawlSpider):
    name = 'qiubaiTest'
    # Start URL
    start_urls = ['http://www.qiushibaike.com/']

    # Link extractor (for pagination links): extracts matching links from the source of the start page
    # allow: a regular expression; links in the page source that match it are extracted
    page_link = LinkExtractor(allow=r'/8hr/page/\d+/')

    rules = (
        # Rule parser: parses the pages behind the extracted links with the given callback
        # follow=True: keep applying the link extractor to the pages it extracted
        Rule(page_link, callback='parse_item', follow=True),
    )

    # Parsing callback used by the rule
    def parse_item(self, response):
        div_list = response.xpath('//div[@id="content-left"]/div')

        for div in div_list:
            # Create the item
            item = QiubaibycrawlItem()
            # Extract the joke's author with an XPath expression
            item['author'] = div.xpath('./div/a[2]/h2/text()').extract_first().strip('\n')
            # Extract the joke's content with an XPath expression
            item['content'] = div.xpath('.//div[@class="content"]/span/text()').extract_first().strip('\n')

            yield item  # submit the item to the pipeline

Items file:

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class QiubaibycrawlItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    author = scrapy.Field()  # author
    content = scrapy.Field()  # content

Pipeline file:

# -*- coding: utf-8 -*-
# Define your item pipelines here
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

class QiubaibycrawlPipeline(object):
    def __init__(self):
        self.fp = None

    def open_spider(self,spider):
        print('spider started')
        self.fp = open('./data.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Write the item submitted by the spider to the file
        self.fp.write(item['author']+':'+item['content']+'\n')
        return item

    def close_spider(self,spider):
        print('spider finished')
        self.fp.close()

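7. Distributed crawling with Scrapy

I. A quick Redis refresher

　　1. Start Redis:

　　　　mac/linux: redis-server redis.conf

　　　　windows: redis-server.exe redis-windows.conf

　　2. Edit the Redis configuration file:

　　　　- Comment out the line bind 127.0.0.1 so that other IPs can reach Redis

　　　　- Change yes to no: protected-mode no, so that other IPs can operate on Redis

II. Persisting Scrapy data in Redis

　　1. Install the scrapy-redis component:

　　　　- pip install scrapy-redis

　　　　- scrapy-redis is a set of components built on Scrapy that lets Scrapy run as a distributed crawler.

　　2. Write the spider file:

　　　　- Same as writing a Spider- or CrawlSpider-based spider in plain Scrapy.

　　3. Write the pipeline file:

　　　　- scrapy-redis already ships a pipeline (RedisPipeline) that stores items in a Redis database, so we can use it directly instead of writing our own.

　　4. Write the settings file:

　　　　- Enable the pipeline in settings.py, pointing at the one shipped with scrapy-redis.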

ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 400
}

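　　　　- By default this pipeline connects to the Redis server on the local machine and stores the data there. To use a different Redis server, add the following to settings.py:

REDIS_HOST = 'IP address of the Redis server'
REDIS_PORT = 6379
REDIS_ENCODING = 'utf-8'
REDIS_PARAMS = {'password': '123456'}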

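III. Distributed deployment with Redis

　　1. Can plain Scrapy run as a distributed crawler by itself?

　　　　- No, for two reasons.

　　　　　　First: each machine running Scrapy has its own scheduler, so the machines cannot divide the URLs in start_urls among themselves. (Multiple machines cannot share one scheduler.)

　　　　　　Second: the data crawled by different machines cannot go through one pipeline for unified persistence. (Multiple machines cannot share one pipeline.)

　　2. Basic workflow for distributed crawling with Redis:

　　　　- Use a spider based on the scrapy-redis component.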

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from movieproject.items import MovieprojectItem
# Import the spider class from scrapy-redis
from scrapy_redis.spiders import RedisCrawlSpider

class NnSpider(RedisCrawlSpider):
    name = 'nn'
    allowed_domains = ['www.id97.com']
    # redis_key names the queue in Redis that start URLs are read from
    # (every URL to be crawled has to be placed in the scheduler queue)
    redis_key = 'nnspider:start_urls'

    # Extract all pagination links according to the rule
    page_link = LinkExtractor(allow=r'/movie/\?page=\d')
    detail_link = LinkExtractor(restrict_xpaths='//div[contains(@class,"col-xs-1-5")]/div/a')
    # detail_link = LinkExtractor(allow=r'/movie/\d+\.html$')
    # follow: whether to keep following links
    rules = (
        # Pagination pages need no parsing, only following
        Rule(page_link, follow=True),
        # Detail pages are parsed and not followed
        Rule(detail_link, callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        # Create an item object
        item = MovieprojectItem()
        # Movie poster
        item['post'] = response.xpath('//a[@class="movie-post"]/img/@src').extract_first()
        # Movie title
        item['name'] = response.xpath('//h1').xpath('string(.)').extract_first()
        yield item

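- Use the scheduler shipped with scrapy-redis and store all URLs in it, so that every machine shares the same scheduler.

# Use the scrapy-redis duplicate filter
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Use the scrapy-redis scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Keep the request queue in Redis so a crawl can be paused and resumed
SCHEDULER_PERSIST = True

- Use the pipeline shipped with scrapy-redis so that the data crawled by every machine is stored in Redis through it; this is how the machines share one pipeline.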

ITEM_PIPELINES = {
   'scrapy_redis.pipelines.RedisPipeline': 400,
}

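- Run: scrapy runspider xxx.py, then push the start URL onto the scheduler queue: lpush nnspider:start_urls "http://www.xxx.com/"

Summary:

# 1. Import RedisCrawlSpider and make it the parent class of the spider
# 2. Replace start_urls with the redis_key attribute
# 3. Write the parsing code
# 4. Submit items to the pipeline shipped with scrapy-redis
# 5. Submit all request objects for URLs produced by the spider to the scheduler shipped with scrapy-redis
# 6. In the settings file, specify which Redis database the crawled data is stored in
# 7. In redis.windows.conf, comment out "bind 127.0.0.1" (otherwise only the local machine can access Redis) and set "protected-mode no" (with yes, Redis refuses connections from other hosts unless a bind address or password is configured)
# 8. Start the Redis server: redis-server ./redis.windows.conf, then start the Redis client: redis-cli
# 9. Change to the directory of the spider file and run it: scrapy runspider spider_file_name
# 10. Push a start URL to the scheduler: in the Redis client, run lpush queue_name start_url [value ...]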
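
Why a shared queue and a shared duplicate filter are enough to split work between machines can be seen in a small in-memory model. Everything here is a stand-in: the deque, set, and list play the roles of the structures that scrapy-redis keeps in Redis, the LINKS graph is made up, and the two "machines" simply take turns inside one process.

```python
from collections import deque

# In-memory stand-ins for the structures scrapy-redis keeps in Redis
shared_queue = deque(["http://www.xxx.com/"])   # plays the role of the request queue
shared_seen = {"http://www.xxx.com/"}           # plays the role of the duplicate filter
shared_items = []                                # plays the role of the stored items

# Hypothetical link graph: page URL -> links found on that page
LINKS = {
    "http://www.xxx.com/": ["http://www.xxx.com/a", "http://www.xxx.com/b"],
    "http://www.xxx.com/a": ["http://www.xxx.com/b", "http://www.xxx.com/c"],
    "http://www.xxx.com/b": [],
    "http://www.xxx.com/c": ["http://www.xxx.com/"],
}


def work_once(worker_name):
    """Take one URL from the shared queue, 'crawl' it, and queue unseen links."""
    if not shared_queue:
        return False
    url = shared_queue.popleft()
    shared_items.append((worker_name, url))
    for link in LINKS[url]:
        if link not in shared_seen:
            shared_seen.add(link)
            shared_queue.append(link)
    return True


# Two workers take turns; each URL is still crawled exactly once
workers = ["machine-1", "machine-2"]
turn = 0
while work_once(workers[turn % 2]):
    turn += 1

crawled_urls = [url for _, url in shared_items]
```

Because both workers consult the same seen set before queuing a link, neither can schedule a page the other has already queued.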

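Review questions:

1) Which crawler modules have you used? (requests, urllib)
2) What is the robots protocol? (an anti-crawling convention commonly used by portal sites; it only deters those who choose to honor it)
3) How do you handle CAPTCHAs? (use a CAPTCHA-solving service or solve them manually)

4) Which data parsing methods do you know? (XPath, regular expressions, bs4)

5) How do you crawl dynamically loaded page data? (selenium + PhantomJS, or selenium + headless Chrome)

6) Which anti-crawling mechanisms have you encountered, and how do you deal with them? (robots protocol, User-Agent checks, IP bans, CAPTCHAs, data encryption, dynamically loaded data, tokens (dynamic parameters can be read from the page source and passed along))

7) The workflow of Scrapy's core components (five core components: spiders, engine, scheduler, pipeline, downloader)

8) Which spider classes have you used? (Spider, CrawlSpider, RedisCrawlSpider)

9) How do you implement a distributed crawl?

10) How do you analyze the crawled data? (not covered)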