Crawler Notes: A Brief Introduction to Crawling with Scrapy

1. Installation: when it comes to crawlers, we have to mention a big, all-in-one crawling component/framework: Scrapy. Scrapy is an application framework written for crawling websites and extracting structured data, and it can be used in a wide range of programs such as data mining, information processing, and archiving historical data. Let's get straight to the point and look at the two ways to install it:
Option 1: installing on Windows takes the following steps
1. Download Twisted: http://www.lfd.uci.edu/~gohlke/pythonlibs/
2. pip3 install wheel
3. pip3 install Twisted-18.4.0-cp36-cp36m-win_amd64.whl # pick the wheel that matches your Python version and architecture
4. pip3 install pywin32
5. pip3 install scrapy
Option 2: on Linux (and macOS, which installs the same way)
pip3 install scrapy

 

2. Basic usage of Scrapy: comparing the Django and Scrapy workflows

Django:

# create a Django project
django-admin startproject mysite 

cd mysite

# create apps
python manage.py startapp app01 
python manage.py startapp app02 

# run the project
python manage.py runserver

Scrapy:

# create a Scrapy project
  scrapy startproject cjk

  cd cjk

# create spiders
  scrapy genspider chouti chouti.com
  scrapy genspider cnblogs cnblogs.com 

# run a spider
  scrapy crawl chouti
After installing Scrapy, open a terminal and run scrapy to check that the installation worked; output like the following means it is installed correctly:
Last login: Sat Jan  5 18:14:13 on ttys000
chenjunkandeMBP:~ chenjunkan$ scrapy
Scrapy 1.5.1 - no active project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

  [ more ]      More commands available when run from project directory

Use "scrapy <command> -h" to see more info about a command
The Scrapy project layout:
Create a project:
  scrapy startproject <project name>
    <project name>
    <project name>/
      - spiders        # spider files
        - chouti.py
        - cnblogs.py
      - items.py       # persistence (item definitions)
      - pipelines.py   # persistence (pipelines)
      - middlewares.py # middlewares
      - settings.py    # settings file (for the crawler)
    scrapy.cfg         # config file (for deployment)
How do we start a spider? Let's look at this simple example:

# -*- coding: utf-8 -*-
import scrapy


class ChoutiSpider(scrapy.Spider):
    # name of the spider
    name = 'chouti'
    # restrict crawling to the dig.chouti.com domain
    allowed_domains = ['dig.chouti.com']
    # start URLs
    start_urls = ['http://dig.chouti.com/']

    # callback invoked automatically once each start URL has been downloaded
    def parse(self, response):
        # <200 https://dig.chouti.com/> <class 'scrapy.http.response.html.HtmlResponse'>
        print(response, type(response))
Reading the source shows that HtmlResponse (in scrapy.http.response.html) inherits from TextResponse, which in turn inherits from Response, so HtmlResponse also has the parent classes' methods and properties (text, xpath, and so on).
Note: the example above simply prints the returned response and its type, but there are a few things to be aware of:
  • chouti.com publishes a robots.txt policy, so in settings.py we need to set ROBOTSTXT_OBEY to False (it defaults to True); with False the crawler no longer obeys the robots protocol (see the snippet after this list)
  • the response passed to the callback is actually an HtmlResponse object, which wraps all of the response data for that request
  • what is the difference between scrapy crawl chouti and scrapy crawl chouti --nolog? The former prints the log output, the latter suppresses it
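As a quick illustration of the first bullet, the relevant line in settings.py is simply (everything else left at its default):

# settings.py
ROBOTSTXT_OBEY = False  # do not obey robots.txt; the default is True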

 

3. How Scrapy works, in brief (covered in more detail later)

Scrapy is an asynchronous, non-blocking framework built around an event loop: it is implemented on top of Twisted, and internally it uses the event-loop mechanism to crawl concurrently.

Previously I would crawl several URLs like this, sending the requests one at a time:
import requests

url_list = ['http://www.baidu.com','http://www.baidu.com','http://www.baidu.com',]

for item in url_list:
    response = requests.get(item)
    print(response.text)
Now I can do it this way instead:
from twisted.web.client import getPage, defer
from twisted.internet import reactor


# part 1: the agent starts taking on tasks
def callback(contents):
    print(contents)


deferred_list = []
# the list of URLs I want to request
url_list = ['http://www.bing.com', 'https://segmentfault.com/', 'https://stackoverflow.com/']
for url in url_list:
    # does not send the request right away; it just creates a deferred object
    deferred = getPage(bytes(url, encoding='utf8'))
    # callback invoked when the request finishes
    deferred.addCallback(callback)
    # collect all the deferred objects in a list
    deferred_list.append(deferred)

# part 2: once the agent has finished all tasks, stop
dlist = defer.DeferredList(deferred_list)


def all_done(arg):
    reactor.stop()


dlist.addBoth(all_done)

# part 3: let the agent start working
reactor.run()

 

4. Persistence

Traditionally we persist scraped data by writing it to a file, which has several drawbacks, for example:
  • there is no clean way to open a connection when the crawl starts and close it when the crawl ends
  • responsibilities are not clearly separated
The traditional approach, writing to a file from the spider itself:

# -*- coding: utf-8 -*-
import scrapy


class ChoutiSpider(scrapy.Spider):
    # name of the spider
    name = 'chouti'
    # restrict crawling to the dig.chouti.com domain
    allowed_domains = ['dig.chouti.com']
    # start URLs
    start_urls = ['http://dig.chouti.com/']

    # callback invoked automatically once each start URL has been downloaded
    def parse(self, response):
        f = open('news.log', mode='a+')
        item_list = response.xpath('//div[@id="content-list"]/div[@class="item"]')
        for item in item_list:
            text = item.xpath('.//a/text()').extract_first()
            href = item.xpath('.//a/@href').extract_first()
            print(href, text.strip())
            f.write(href + '\n')
        f.close()
This is where two important Scrapy pieces come in: Item and Pipelines.

Steps to implement a pipeline in Scrapy:

a. Configure ITEM_PIPELINES in settings.py; you can register several pipelines here, and the number after each one is its priority: the smaller the number, the earlier it runs.

ITEM_PIPELINES = {
    'cjk.pipelines.CjkPipeline': 300,
}
b. Once ITEM_PIPELINES is configured, the def process_item(self, item, spider) method in pipelines.py is triggered automatically; however, if this is the only change we make, running the spider does not print the "chenjunkan" message we expect, because the spider is not yielding any items yet;
class CjkscrapyPipeline(object):
    def process_item(self, item, spider):
        print("chenjunkan")
        return item

c. So we first add two fields in items.py, which constrain the item to carrying exactly these two fields:

import scrapy


class CjkscrapyItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    href = scrapy.Field()
    title = scrapy.Field()
Then in the spider we add: yield CjkscrapyItem(href=href, title=text). This instantiates the CjkscrapyItem class with two keyword arguments, and the two fields defined on the item class are what receive them.
# -*- coding: utf-8 -*-
import scrapy
from cjkscrapy.items import CjkscrapyItem


class ChoutiSpider(scrapy.Spider):
    # name of the spider
    name = 'chouti'
    # restrict crawling to the dig.chouti.com domain
    allowed_domains = ['dig.chouti.com']
    # start URLs
    start_urls = ['http://dig.chouti.com/']

    # callback invoked automatically once each start URL has been downloaded
    def parse(self, response):
        item_list = response.xpath('//div[@id="content-list"]/div[@class="item"]')
        for item in item_list:
            text = item.xpath('.//a/text()').extract_first()
            href = item.xpath('.//a/@href').extract_first()
            yield CjkscrapyItem(href=href, title=text)

d. Next, process_item in the pipeline is triggered automatically, once for every yield. The item argument is the object we created from CjkscrapyItem. Why use an item class at all? Because it lets us constrain which data CjkscrapyPipeline persists: whatever fields the item defines are the fields we read. And what is the spider argument of def process_item(self, item, spider)? It is the instantiated spider itself; ChoutiSpider has to be instantiated before it can run, so spider is simply the current spider object, carrying attributes such as name and allowed_domains.

e. So when a spider yields an item, the item is handed to the pipelines for processing; when it yields a Request instead, that schedules another download.

Summary:

a. Write the pipeline class


class XXXPipeline(object):
    def process_item(self, item, spider):
        return item


b. Write the Item class


class XdbItem(scrapy.Item):
    href = scrapy.Field()
    title = scrapy.Field()


c. Configure it in settings.py
ITEM_PIPELINES = {
    'xdb.pipelines.XdbPipeline': 300,
}

d. In the spider: every yield of an Item object triggers one call to process_item.

yield an Item object
We now have a basic grasp of persistence, but there is still a problem: process_item is triggered once per yield, so if we keep opening and closing the connection inside process_item, performance takes a noticeable hit. Example:
def process_item(self, item, spider):
        f = open("xx.log", "a+")
        f.write(item["href"]+"\n")
        f.close()
        return item
The CjkscrapyPipeline class can, however, define two more methods, open_spider and close_spider; we can open the connection in open_spider and close it in close_spider, avoiding the repeated open/close:
class CjkscrapyPipeline(object):

    def open_spider(self, spider):
        print("爬虫开始了")
        self.f = open("new.log", "a+")

    def process_item(self, item, spider):
        self.f.write(item["href"] + "\n")

        return item

    def close_spider(self, spider):
        self.f.close()
        print("爬虫结束了")
This does the job, but looking closely the code above is a bit unpolished (the file handle attribute is never declared up front), so:

class CjkscrapyPipeline(object):
    def __init__(self):
        self.f = None

    def open_spider(self, spider):
        print("爬虫开始了")
        self.f = open("new.log", "a+")

    def process_item(self, item, spider):
        self.f.write(item["href"] + "\n")

        return item

    def close_spider(self, spider):
        self.f.close()
        print("爬虫结束了")
Looking at that code again, the output file path is hard-coded in the program; can we move it into the settings instead? Yes: the pipeline class can define a from_crawler classmethod:

@classmethod
    def from_crawler(cls, crawler):
        print('File.from_crawler')
        path = crawler.settings.get('HREF_FILE_PATH')
        return cls(path)
Note: crawler.settings.get('HREF_FILE_PATH') looks HREF_FILE_PATH up across the settings; cls inside this method refers to the current class CjkscrapyPipeline, and the method returns cls(path), so path is passed into __init__ (a sample setting is shown right after the full class below):

class CjkscrapyPipeline(object):
    def __init__(self, path):
        self.f = None
        self.path = path

    @classmethod
    def from_crawler(cls, crawler):
        print('File.from_crawler')
        path = crawler.settings.get('HREF_FILE_PATH')
        return cls(path)

    def open_spider(self, spider):
        print("爬虫开始了")
        self.f = open(self.path, "a+")

    def process_item(self, item, spider):
        self.f.write(item["href"] + "\n")

        return item

    def close_spider(self, spider):
        self.f.close()
        print("爬虫结束了")
So the pipeline class now has five methods; in what order do they run?

"""
源码内容:
    1. 判断当前CjkPipeline类中是否有from_crawler
        有:
            obj = CjkPipeline.from_crawler(....)
        否:
            obj = CjkPipeline()
    2. obj.open_spider()
    
    3. obj.process_item()/obj.process_item()/obj.process_item()/obj.process_item()/obj.process_item()
    
    4. obj.close_spider()
"""
Note: Scrapy first checks for from_crawler. If it is missing, the pipeline class is instantiated directly; if it is present, from_crawler runs first and reads the settings, and because its return value is an instance of the class, __init__ is called at that point.
 
What is the return item at the end of process_item for?
from scrapy.exceptions import DropItem
# return item        # hand the item on to the next pipeline's process_item
# raise DropItem()   # later pipelines' process_item methods will not run for this item
 
Note: pipelines are shared by all spiders; to customize behaviour for one particular spider, use the spider argument and handle it yourself, as in the sketch after the next line.

# if spider.name == 'chouti':
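A minimal sketch combining both ideas (DropItem plus the spider-name check); the pipeline name and the "missing href" rule are only illustrative:

from scrapy.exceptions import DropItem


class FilterPipeline(object):
    def process_item(self, item, spider):
        # only apply this filtering to the 'chouti' spider
        if spider.name != 'chouti':
            return item
        # drop items without an href; later pipelines never see them
        if not item.get('href'):
            raise DropItem('missing href')
        return item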

 

5. Deduplication rules

Using the earlier code as an example:

# -*- coding: utf-8 -*-
import scrapy
from cjk.items import CjkItem


class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['dig.chouti.com']
    start_urls = ['http://dig.chouti.com/']

    def parse(self, response):
        print(response.request.url)

        # item_list = response.xpath('//div[@id="content-list"]/div[@class="item"]')
        # for item in item_list:
        #     text = item.xpath('.//a/text()').extract_first()
        #     href = item.xpath('.//a/@href').extract_first()

        page_list = response.xpath('//div[@id="dig_lcpage"]//a/@href').extract()
        for page in page_list:
            from scrapy.http import Request
            page = "https://dig.chouti.com" + page
            yield Request(url=page, callback=self.parse)  # https://dig.chouti.com/all/hot/recent/2
Note: print(response.request.url) prints the URL currently being handled. Running the code above, pages 1 through 120 come back with no repeats; Scrapy does the deduplication internally, effectively keeping a set in memory, and a set cannot hold duplicates.

Looking at the source:
Import: from scrapy.dupefilter import RFPDupeFilter
RFPDupeFilter is a class:

class RFPDupeFilter(BaseDupeFilter):
    """Request Fingerprint duplicates filter"""

    def __init__(self, path=None, debug=False):
        self.file = None
        self.fingerprints = set()
        self.logdupes = True
        self.debug = debug
        self.logger = logging.getLogger(__name__)
        if path:
            self.file = open(os.path.join(path, 'requests.seen'), 'a+')
            self.file.seek(0)
            self.fingerprints.update(x.rstrip() for x in self.file)

    @classmethod
    def from_settings(cls, settings):
        debug = settings.getbool('DUPEFILTER_DEBUG')
        return cls(job_dir(settings), debug)

    def request_seen(self, request):
        fp = self.request_fingerprint(request)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        if self.file:
            self.file.write(fp + os.linesep)

    def request_fingerprint(self, request):
        return request_fingerprint(request)

    def close(self, reason):
        if self.file:
            self.file.close()

    def log(self, request, spider):
        if self.debug:
            msg = "Filtered duplicate request: %(request)s"
            self.logger.debug(msg, {'request': request}, extra={'spider': spider})
        elif self.logdupes:
            msg = ("Filtered duplicate request: %(request)s"
                   " - no more duplicates will be shown"
                   " (see DUPEFILTER_DEBUG to show all duplicates)")
            self.logger.debug(msg, {'request': request}, extra={'spider': spider})
            self.logdupes = False

        spider.crawler.stats.inc_value('dupefilter/filtered', spider=spider)
The key method in this class is request_seen, which is what decides whether a request has already been visited; every time we yield a Request, Scrapy first calls request_seen internally to check. Looking at that method on its own:

def request_seen(self, request):
        fp = self.request_fingerprint(request)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        if self.file:
            self.file.write(fp + os.linesep)
Notes:
a. First fp = self.request_fingerprint(request) runs; request is the Request we passed in, and it carries the URL. URLs vary in length, so think of fp as a fixed-length fingerprint (hash) of the request computed by request_fingerprint, which makes it easy to store and compare;
b. Then if fp in self.fingerprints: runs; self.fingerprints was initialized in __init__ as a set (self.fingerprints = set()). If fp is already in the set the method returns True, meaning the URL has been visited before, so it will not be requested again;
c. If it has not been visited, self.fingerprints.add(fp) adds the fingerprint to the set;
d. The visited fingerprints live in memory, but they can also be written to a file: if self.file: self.file.write(fp + os.linesep)
 
How do the fingerprints get written to a file?
1. First, from_settings reads the settings to find a path; to have duplicates logged, also set DUPEFILTER_DEBUG to True in the settings:
@classmethod
    def from_settings(cls, settings):
        debug = settings.getbool('DUPEFILTER_DEBUG')
        return cls(job_dir(settings), debug)
2. It returns cls(job_dir(settings), debug), where job_dir is:
import os

def job_dir(settings):
    path = settings['JOBDIR']
    if path and not os.path.exists(path):
        os.makedirs(path)
    return path
If JOBDIR is set in the settings its value is used as the path (the directory is created if it does not exist); what job_dir returns is that path, i.e. the JOBDIR value from the settings.
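Putting that together, a minimal settings.py sketch for persisting the fingerprints to disk (the directory name is only an example):

# settings.py
JOBDIR = 'job_state'       # requests.seen is written inside this directory
DUPEFILTER_DEBUG = True    # log every filtered duplicate request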

That said, writing them to a file is usually of limited value; later I will show how to keep them in Redis instead.

Custom deduplication: since the built-in RFPDupeFilter inherits from BaseDupeFilter, our own filter can inherit from the same base class:

from scrapy.dupefilter import BaseDupeFilter
from scrapy.utils.request import request_fingerprint


class CjkDupeFilter(BaseDupeFilter):

    def __init__(self):
        self.visited_fd = set()

    @classmethod
    def from_settings(cls, settings):
        return cls()

    def request_seen(self, request):
        fd = request_fingerprint(request=request)
        if fd in self.visited_fd:
            return True
        self.visited_fd.add(fd)

    def open(self):  # can return deferred
        print('start')

    def close(self, reason):  # can return a deferred
        print('end')

    # def log(self, request, spider):  # log that a request has been filtered
    #     print('log')

The default dedup setting is: DUPEFILTER_CLASS = 'scrapy.dupefilter.RFPDupeFilter'

To use our own filter, point the setting at it in settings.py: DUPEFILTER_CLASS = 'cjkscrapy.dupefilters.CjkDupeFilter'

e.g. later we will move the dedup store into Redis.

Note: the spider can also opt out of deduplication per request: Request has a dont_filter parameter that defaults to False (follow the dedup rules), and setting dont_filter=True makes the request bypass the dedup filter (see the sketch after the signature below).

class Request(object_ref):

    def __init__(self, url, callback=None, method='GET', headers=None, body=None,
                 cookies=None, meta=None, encoding='utf-8', priority=0,
                 dont_filter=False, errback=None, flags=None):
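For example, a sketch of a parse method that deliberately allows re-crawling a page by bypassing the filter:

from scrapy.http import Request

# inside the spider class
def parse(self, response):
    # dont_filter=True: this request is never dropped as a duplicate
    yield Request(url='https://dig.chouti.com/all/hot/recent/1',
                  callback=self.parse,
                  dont_filter=True)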

 

6. Depth control (explained in more detail later)

# -*- coding: utf-8 -*-
import scrapy
from scrapy.dupefilter import RFPDupeFilter


class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['dig.chouti.com']
    start_urls = ['http://dig.chouti.com/']

    def parse(self, response):
        print(response.request.url, response.meta.get("depth", 0))

        # item_list = response.xpath('//div[@id="content-list"]/div[@class="item"]')
        # for item in item_list:
        #     text = item.xpath('.//a/text()').extract_first()
        #     href = item.xpath('.//a/@href').extract_first()

        page_list = response.xpath('//div[@id="dig_lcpage"]//a/@href').extract()
        for page in page_list:
            from scrapy.http import Request
            page = "https://dig.chouti.com" + page
            yield Request(url=page, callback=self.parse, dont_filter=False)  # https://dig.chouti.com/all/hot/recent/2

Depth refers to how many link levels deep the spider crawls; to cap it, set DEPTH_LIMIT = 3 in settings.py.

 

7. Handling cookies manually

The Request object's parameters:

class Request(object_ref):

    def __init__(self, url, callback=None, method='GET', headers=None, body=None,
                 cookies=None, meta=None, encoding='utf-8', priority=0,
                 dont_filter=False, errback=None, flags=None):

For now we only care about these Request parameters: url, callback=None, method='GET', headers=None, body=None, cookies=None; sending a request essentially boils down to request headers + request body + cookies.

Example: logging in to chouti.com automatically.

Our first request to chouti.com hands us a cookie; how do we pull that cookie out of the response?

First import the cookie-handling class: from scrapy.http.cookies import CookieJar

# -*- coding: utf-8 -*-
import scrapy
from scrapy.dupefilter import RFPDupeFilter
from scrapy.http.cookies import CookieJar


class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['dig.chouti.com']
    start_urls = ['http://dig.chouti.com/']

    def parse(self, response):
        cookie_dict = {}

        # pull the cookies out of the response headers; they are stored on the cookie_jar object
        cookie_jar = CookieJar()
        # extract cookies using the response and the request that produced it
        cookie_jar.extract_cookies(response, response.request)

        # unpack the cookies from the jar into a plain dict
        for k, v in cookie_jar._cookies.items():
            for i, j in v.items():
                for m, n in j.items():
                    cookie_dict[m] = n.value
        print(cookie_dict)

Running the code above, we do get the cookie:

chenjunkandeMBP:cjkscrapy chenjunkan$ scrapy crawl chouti --nolog
/Users/chenjunkan/Desktop/scrapytest/cjkscrapy/cjkscrapy/spiders/chouti.py:3: ScrapyDeprecationWarning: Module `scrapy.dupefilter` is deprecated, use `scrapy.dupefilters` instead
  from scrapy.dupefilter import RFPDupeFilter
{'gpsd': 'd052f4974404d8c431f3c7c1615694c4', 'JSESSIONID': 'aaaUxbbxMYOWh4T7S7rGw'}

A full example that logs in to chouti.com automatically and upvotes posts:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.dupefilter import RFPDupeFilter
from scrapy.http.cookies import CookieJar
from scrapy.http import Request


class ChoutiSpider(scrapy.Spider):
    # name of the spider
    name = 'chouti'
    # restrict crawling to the dig.chouti.com domain
    allowed_domains = ['dig.chouti.com']
    # start URLs
    start_urls = ['http://dig.chouti.com/']
    cookie_dict = {}

    def parse(self, response):

        # pull the cookies out of the response headers; they are stored on the cookie_jar object
        cookie_jar = CookieJar()
        # extract cookies using the response and the request that produced it
        cookie_jar.extract_cookies(response, response.request)

        # unpack the cookies from the jar into a plain dict
        for k, v in cookie_jar._cookies.items():
            for i, j in v.items():
                for m, n in j.items():
                    self.cookie_dict[m] = n.value
        print(self.cookie_dict)

        yield Request(
            url='https://dig.chouti.com/login',
            method='POST',
            body="phone=8618357186730&password=cjk139511&oneMonth=1",
            cookies=self.cookie_dict,
            headers={
                'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'
            },
            callback=self.check_login
        )

    def check_login(self, response):
        print(response.text)
        yield Request(
            url='https://dig.chouti.com/all/hot/recent/1',
            cookies=self.cookie_dict,
            callback=self.index
        )

    def index(self, response):
        news_list = response.xpath('//div[@id="content-list"]/div[@class="item"]')
        for new in news_list:
            link_id = new.xpath('.//div[@class="part2"]/@share-linkid').extract_first()
            yield Request(
                url='http://dig.chouti.com/link/vote?linksId=%s' % (link_id,),
                method='POST',
                cookies=self.cookie_dict,
                callback=self.check_result
            )
        # upvote on every page
        page_list = response.xpath('//div[@id="dig_lcpage"]//a/@href').extract()
        for page in page_list:
            page = "https://dig.chouti.com" + page
            yield Request(url=page, callback=self.index)  # https://dig.chouti.com/all/hot/recent/2

    def check_result(self, response):
        print(response.text)

Output:

chenjunkandeMBP:cjkscrapy chenjunkan$ scrapy crawl chouti --nolog
/Users/chenjunkan/Desktop/scrapytest/cjkscrapy/cjkscrapy/spiders/chouti.py:3: ScrapyDeprecationWarning: Module `scrapy.dupefilter` is deprecated, use `scrapy.dupefilters` instead
  from scrapy.dupefilter import RFPDupeFilter
{'gpsd': '78613f08c985435d5d0eedc08b0ed812', 'JSESSIONID': 'aaaTmxnFAJJGn9Muf8rGw'}
{"result":{"code":"9999", "message":"", "data":{"complateReg":"0","destJid":"cdu_53587312848"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692112533000","lvCount":"6","nick":"chenjunkan","uvCount":"26","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692112720000","lvCount":"28","nick":"chenjunkan","uvCount":"27","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692112862000","lvCount":"24","nick":"chenjunkan","uvCount":"33","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692112849000","lvCount":"29","nick":"chenjunkan","uvCount":"33","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692112872000","lvCount":"48","nick":"chenjunkan","uvCount":"33","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692112877000","lvCount":"23","nick":"chenjunkan","uvCount":"33","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692112877000","lvCount":"69","nick":"chenjunkan","uvCount":"33","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692112877000","lvCount":"189","nick":"chenjunkan","uvCount":"33","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692112926000","lvCount":"98","nick":"chenjunkan","uvCount":"35","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692112951000","lvCount":"61","nick":"chenjunkan","uvCount":"35","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692113086000","lvCount":"13","nick":"chenjunkan","uvCount":"37","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692113097000","lvCount":"17","nick":"chenjunkan","uvCount":"38","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692113118000","lvCount":"21","nick":"chenjunkan","uvCount":"41","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692113155000","lvCount":"86","nick":"chenjunkan","uvCount":"41","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692113140000","lvCount":"22","nick":"chenjunkan","uvCount":"41","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692113148000","lvCount":"25","nick":"chenjunkan","uvCount":"41","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692113416000","lvCount":"22","nick":"chenjunkan","uvCount":"47","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692113386000","lvCount":"13","nick":"chenjunkan","uvCount":"46","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692113381000","lvCount":"70","nick":"chenjunkan","uvCount":"46","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692113408000","lvCount":"27","nick":"chenjunkan","uvCount":"47","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692113402000","lvCount":"17","nick":"chenjunkan","uvCount":"47","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692113428000","lvCount":"41","nick":"chenjunkan","uvCount":"47","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692113528000","lvCount":"55","nick":"chenjunkan","uvCount":"49","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692113544000","lvCount":"16","nick":"chenjunkan","uvCount":"49","voteTime":"小于1分钟前"}}}
{"result":{"code":"9999", "message":"推荐成功", "data":{"jid":"cdu_53587312848","likedTime":"1546692113643000","lvCount":"22","nick":"chenjunkan","uvCount":"50","voteTime":"小于1分钟前"}}}

 

8. Customizing the start requests

Before talking about the start URLs, let's quickly review iterables and iterators. What is an iterable? A list, for example:

# an iterable
# li = [11,22,33]

# convert the iterable to an iterator
# iter(li)

A generator is a special kind of iterator, and it too can be converted with iter():

def func():
    yield 11
    yield 22
    yield 33

# generator
li = func()

# convert the generator to an iterator
iter(li)


v = li.__next__()
print(v)

We know that the scheduler only accepts Request objects, which it then hands to the downloader. So when a spider starts, the framework does not call parse first; it first wraps every URL in start_urls in a Request object and puts those Requests into the scheduler. The flow is roughly:

"""
scrapy引擎来爬虫中取起始URL:
    1. 调用start_requests并获取返回值
    2. v = iter(返回值) 返回值是生成器或者迭代器都不要紧,由于会转换为迭代器
    3. 
        req1 = 执行 v.__next__()
        req2 = 执行 v.__next__()
        req3 = 执行 v.__next__()
        ...
    4. req所有放到调度器中

"""

Note: when the engine needs the start Requests it does not read start_urls directly; it calls the spider's start_requests method, falling back to the parent class's implementation if the spider does not define one.

Two ways to write start_requests:

Option 1: start_requests is a generator function (the generator is converted to an iterator internally):

def start_requests(self):
    # option 1:
    for url in self.start_urls:
        yield Request(url=url)

Option 2: start_requests returns an iterable object:

def start_requests(self):
    # option 2:
    req_list = []
    for url in self.start_urls:
        req_list.append(Request(url=url))
    return req_list

When we do not define start_requests ourselves, the base class scrapy.Spider already provides one.

From the source:

    def start_requests(self):
        cls = self.__class__
        if method_is_overridden(cls, Spider, 'make_requests_from_url'):
            warnings.warn(
                "Spider.make_requests_from_url method is deprecated; it "
                "won't be called in future Scrapy releases. Please "
                "override Spider.start_requests method instead (see %s.%s)." % (
                    cls.__module__, cls.__name__
                ),
            )
            for url in self.start_urls:
                yield self.make_requests_from_url(url)
        else:
            for url in self.start_urls:
                yield Request(url, dont_filter=True)

If we define start_requests ourselves, ours takes precedence. By default GET requests are sent, so overriding it lets us send, say, POST requests instead, as in the sketch below.
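A sketch of a spider whose start requests are POSTs; the spider name, URL, and form body are placeholders:

import scrapy
from scrapy.http import Request


class LoginSpider(scrapy.Spider):
    name = 'login'
    start_urls = ['https://example.com/login']

    def start_requests(self):
        # override the default GET with a POST carrying a form-encoded body
        for url in self.start_urls:
            yield Request(url=url,
                          method='POST',
                          body='user=foo&pwd=bar',
                          headers={'Content-Type': 'application/x-www-form-urlencoded'},
                          callback=self.parse)

    def parse(self, response):
        print(response.status)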

e.g. the start URLs could also be fetched from Redis, as sketched after this line.
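A sketch of pulling the seed URLs from a Redis list instead of hard-coding start_urls; the key name 'start_urls' and the connection settings are assumptions:

import redis
import scrapy
from scrapy.http import Request


class RedisSeedSpider(scrapy.Spider):
    name = 'redis_seed'

    def start_requests(self):
        conn = redis.Redis(host='127.0.0.1', port=6379)
        # pop seed URLs one by one from the Redis list 'start_urls'
        url = conn.lpop('start_urls')
        while url:
            yield Request(url=url.decode('utf-8'), callback=self.parse)
            url = conn.lpop('start_urls')

    def parse(self, response):
        print(response.url)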

 

9. Depth and priority

# -*- coding: utf-8 -*-
import scrapy
from scrapy.dupefilter import RFPDupeFilter
from scrapy.http.cookies import CookieJar
from scrapy.http import Request


class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['dig.chouti.com']
    start_urls = ['http://dig.chouti.com/']

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url=url, callback=self.parse)

    def parse(self, response):
        print(response.meta.get("depth", 0))

        page_list = response.xpath('//div[@id="dig_lcpage"]//a/@href').extract()
        for page in page_list:
            page = "https://dig.chouti.com" + page
            yield Request(url=page, callback=self.parse)  # https://dig.chouti.com/all/hot/recent/2

When we yield a Request, the Request is placed into the scheduler and later handed to the downloader, passing through a number of middlewares along the way, which is where extra processing such as depth tracking happens. Import: from scrapy.spidermiddlewares.depth import DepthMiddleware

DepthMiddleware:

class DepthMiddleware(object):

    def __init__(self, maxdepth, stats=None, verbose_stats=False, prio=1):
        self.maxdepth = maxdepth
        self.stats = stats
        self.verbose_stats = verbose_stats
        self.prio = prio

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        maxdepth = settings.getint('DEPTH_LIMIT')
        verbose = settings.getbool('DEPTH_STATS_VERBOSE')
        prio = settings.getint('DEPTH_PRIORITY')
        return cls(maxdepth, crawler.stats, verbose, prio)

    def process_spider_output(self, response, result, spider):
        def _filter(request):
            if isinstance(request, Request):
                depth = response.meta['depth'] + 1
                request.meta['depth'] = depth
                if self.prio:
                    request.priority -= depth * self.prio
                if self.maxdepth and depth > self.maxdepth:
                    logger.debug(
                        "Ignoring link (depth > %(maxdepth)d): %(requrl)s ",
                        {'maxdepth': self.maxdepth, 'requrl': request.url},
                        extra={'spider': spider}
                    )
                    return False
                elif self.stats:
                    if self.verbose_stats:
                        self.stats.inc_value('request_depth_count/%s' % depth,
                                             spider=spider)
                    self.stats.max_value('request_depth_max', depth,
                                         spider=spider)
            return True

        # base case (depth=0)
        if self.stats and 'depth' not in response.meta:
            response.meta['depth'] = 0
            if self.verbose_stats:
                self.stats.inc_value('request_depth_count/0', spider=spider)

        return (r for r in result or () if _filter(r))

The method that spider output passes through in this middleware is process_spider_output:

Notes:

1. if self.stats and 'depth' not in response.meta:
When the very first request is sent, Request.meta is None, so the returned response's meta is empty as well; response.request is the Request we built at the start. Looking at scrapy.http.Response, we find that response.meta is a property:
@property
    def meta(self):
        try:
            return self.request.meta
        except AttributeError:
            raise AttributeError(
                "Response.meta not available, this response "
                "is not tied to any request"
            )
So response.meta is just response.request.meta.
2. When we yield a Request, process_spider_output is triggered, and on the first pass the base depth is set: response.meta['depth'] = 0.
3. return (r for r in result or () if _filter(r)): result is the sequence of objects yielded by the spider; each one is run through _filter.
4. If _filter returns True the request is allowed and goes to the scheduler; if it returns False it is dropped:
            if isinstance(request, Request):
                depth = response.meta['depth'] + 1
                request.meta['depth'] = depth
If the yielded object is a Request, its depth becomes the parent response's depth + 1.
5. if self.maxdepth and depth > self.maxdepth: with maxdepth = settings.getint('DEPTH_LIMIT') — if the depth exceeds the limit we configured, the request is dropped.

Priority strategy: the deeper a request, the lower its priority; the default priority is 0:

if self.prio:
    request.priority -= depth * self.prio
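The related settings, with illustrative values (these are exactly the settings read in from_crawler above):

# settings.py
DEPTH_LIMIT = 3             # drop requests deeper than 3 levels
DEPTH_PRIORITY = 1          # priority -= depth * DEPTH_PRIORITY, so deeper requests run later
DEPTH_STATS_VERBOSE = True  # collect per-depth request counts in the stats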

 

10. Proxies

Scrapy's built-in proxy support: when the scheduler hands a request to the downloader, it passes through the downloader middlewares, and the built-in proxy handling lives in one of them.

Path: /Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy/downloadermiddlewares/httpproxy.py

The HttpProxyMiddleware class:

import base64
from six.moves.urllib.request import getproxies, proxy_bypass
from six.moves.urllib.parse import unquote
try:
    from urllib2 import _parse_proxy
except ImportError:
    from urllib.request import _parse_proxy
from six.moves.urllib.parse import urlunparse

from scrapy.utils.httpobj import urlparse_cached
from scrapy.exceptions import NotConfigured
from scrapy.utils.python import to_bytes


class HttpProxyMiddleware(object):

    def __init__(self, auth_encoding='latin-1'):
        self.auth_encoding = auth_encoding
        self.proxies = {}
        for type, url in getproxies().items():
            self.proxies[type] = self._get_proxy(url, type)

    @classmethod
    def from_crawler(cls, crawler):
        if not crawler.settings.getbool('HTTPPROXY_ENABLED'):
            raise NotConfigured
        auth_encoding = crawler.settings.get('HTTPPROXY_AUTH_ENCODING')
        return cls(auth_encoding)

    def _basic_auth_header(self, username, password):
        user_pass = to_bytes(
            '%s:%s' % (unquote(username), unquote(password)),
            encoding=self.auth_encoding)
        return base64.b64encode(user_pass).strip()

    def _get_proxy(self, url, orig_type):
        proxy_type, user, password, hostport = _parse_proxy(url)
        proxy_url = urlunparse((proxy_type or orig_type, hostport, '', '', '', ''))

        if user:
            creds = self._basic_auth_header(user, password)
        else:
            creds = None

        return creds, proxy_url

    def process_request(self, request, spider):
        # ignore if proxy is already set
        if 'proxy' in request.meta:
            if request.meta['proxy'] is None:
                return
            # extract credentials if present
            creds, proxy_url = self._get_proxy(request.meta['proxy'], '')
            request.meta['proxy'] = proxy_url
            if creds and not request.headers.get('Proxy-Authorization'):
                request.headers['Proxy-Authorization'] = b'Basic ' + creds
            return
        elif not self.proxies:
            return

        parsed = urlparse_cached(request)
        scheme = parsed.scheme

        # 'no_proxy' is only supported by http schemes
        if scheme in ('http', 'https') and proxy_bypass(parsed.hostname):
            return

        if scheme in self.proxies:
            self._set_proxy(request, scheme)

    def _set_proxy(self, request, scheme):
        creds, proxy = self.proxies[scheme]
        request.meta['proxy'] = proxy
        if creds:
            request.headers['Proxy-Authorization'] = b'Basic ' + creds
    def process_request(self, request, spider):
        # ignore if proxy is already set
        if 'proxy' in request.meta:
            if request.meta['proxy'] is None:
                return
            # extract credentials if present
            creds, proxy_url = self._get_proxy(request.meta['proxy'], '')
            request.meta['proxy'] = proxy_url
            if creds and not request.headers.get('Proxy-Authorization'):
                request.headers['Proxy-Authorization'] = b'Basic ' + creds
            return
        elif not self.proxies:
            return

        parsed = urlparse_cached(request)
        scheme = parsed.scheme

        # 'no_proxy' is only supported by http schemes
        if scheme in ('http', 'https') and proxy_bypass(parsed.hostname):
            return

        if scheme in self.proxies:
            self._set_proxy(request, scheme)

Notes:

        if scheme in self.proxies:
            self._set_proxy(request, scheme)

What is self.proxies? It starts out as an empty dict (self.proxies = {}); at its core, adding a proxy just means putting a bit of extra data on the request (its meta and headers):

        for type, url in getproxies().items():
            self.proxies[type] = self._get_proxy(url, type)

Next, getproxies is just getproxies_environment:

def getproxies_environment():
    """Return a dictionary of scheme -> proxy server URL mappings.

    Scan the environment for variables named <scheme>_proxy;
    this seems to be the standard convention.  If you need a
    different way, you can pass a proxies dictionary to the
    [Fancy]URLopener constructor.

    """
    proxies = {}
    # in order to prefer lowercase variables, process environment in
    # two passes: first matches any, second pass matches lowercase only
    for name, value in os.environ.items():
        name = name.lower()
        if value and name[-6:] == '_proxy':
            proxies[name[:-6]] = value
    # CVE-2016-1000110 - If we are running as CGI script, forget HTTP_PROXY
    # (non-all-lowercase) as it may be set from the web server by a "Proxy:"
    # header from the client
    # If "proxy" is lowercase, it will still be used thanks to the next block
    if 'REQUEST_METHOD' in os.environ:
        proxies.pop('http', None)
    for name, value in os.environ.items():
        if name[-6:] == '_proxy':
            name = name.lower()
            if value:
                proxies[name[:-6]] = value
            else:
                proxies.pop(name[:-6], None)
    return proxies

Notes: a. Proxies are read from the environment variables via os.environ; any variable whose name ends in _proxy is picked up, and os.environ.get lets you inspect them:

import os

v = os.environ.get('XXX')
print(v)

b. From this we end up with a dict along the lines of:

proxies = {
    'http': '192.168.3.3',
    'https': '192.168.3.4',
}

c. Next, self.proxies[type] = self._get_proxy(url, type):

    def _get_proxy(self, url, orig_type):
        proxy_type, user, password, hostport = _parse_proxy(url)
        proxy_url = urlunparse((proxy_type or orig_type, hostport, '', '', '', ''))

        if user:
            creds = self._basic_auth_header(user, password)
        else:
            creds = None

        return creds, proxy_url

Using the built-in proxy support:

Set the proxies in os.environ before the spider starts issuing requests.
class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['https://dig.chouti.com/']
    cookie_dict = {}

    def start_requests(self):
        import os
        os.environ['HTTPS_PROXY'] = "http://root:root@192.168.11.11:9999/"
        os.environ['HTTP_PROXY'] = '19.11.2.32'
        for url in self.start_urls:
            yield Request(url=url,callback=self.parse)

There is also another way to use the built-in proxy middleware: set meta on the Request inside the spider.

From the source:

        if 'proxy' in request.meta:
            if request.meta['proxy'] is None:
                return
            # extract credentials if present
            creds, proxy_url = self._get_proxy(request.meta['proxy'], '')
            request.meta['proxy'] = proxy_url
            if creds and not request.headers.get('Proxy-Authorization'):
                request.headers['Proxy-Authorization'] = b'Basic ' + creds
            return
        elif not self.proxies:
            return

The meta approach:

class ChoutiSpider(scrapy.Spider):
    name = 'chouti'
    allowed_domains = ['chouti.com']
    start_urls = ['https://dig.chouti.com/']
    cookie_dict = {}

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url=url, callback=self.parse, meta={'proxy': 'http://root:root@192.168.11.11:9999/'})

 

Writing a custom proxy middleware:

Example code:

import base64
import random
from six.moves.urllib.parse import unquote

try:
    from urllib2 import _parse_proxy
except ImportError:
    from urllib.request import _parse_proxy
from six.moves.urllib.parse import urlunparse
from scrapy.utils.python import to_bytes


class CjkProxyMiddleware(object):

    def _basic_auth_header(self, username, password):
        user_pass = to_bytes(
            '%s:%s' % (unquote(username), unquote(password)),
            encoding='latin-1')
        return base64.b64encode(user_pass).strip()

    def process_request(self, request, spider):
        PROXIES = [
            "http://root:root@192.168.11.11:9999/",
            "http://root:root@192.168.11.12:9999/",
            "http://root:root@192.168.11.13:9999/",
            "http://root:root@192.168.11.14:9999/",
            "http://root:root@192.168.11.15:9999/",
            "http://root:root@192.168.11.16:9999/",
        ]
        url = random.choice(PROXIES)

        orig_type = ""
        proxy_type, user, password, hostport = _parse_proxy(url)
        proxy_url = urlunparse((proxy_type or orig_type, hostport, '', '', '', ''))

        if user:
            creds = self._basic_auth_header(user, password)
        else:
            creds = None
        request.meta['proxy'] = proxy_url
        if creds:
            request.headers['Proxy-Authorization'] = b'Basic ' + creds
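To enable it, register the middleware in settings.py and disable the built-in one; a sketch assuming the class lives in a module called cjkscrapy/proxymd.py (the module path and priority number are placeholders):

# settings.py
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': None,  # turn off the built-in proxy middleware
    'cjkscrapy.proxymd.CjkProxyMiddleware': 751,
}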

 

11. Selectors / parsing (covered in more detail later)

Example code:

html = """<!DOCTYPE html>
<html>
    <head lang="en">
        <meta charset="UTF-8">
        <title></title>
    </head>
    <body>
        <ul>
            <li class="item-"><a id='i1' href="link.html">first item</a></li>
            <li class="item-0"><a id='i2' href="llink.html">first item</a></li>
            <li class="item-1"><a href="llink2.html">second item<span>vv</span></a></li>
        </ul>
        <div><a href="llink2.html">second item</a></div>
    </body>
</html>
"""

from scrapy.http import HtmlResponse
from scrapy.selector import Selector

response = HtmlResponse(url='http://example.com', body=html,encoding='utf-8')


# hxs = Selector(response)
# hxs.xpath()
response.xpath('')
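A few example queries against the HTML above (a sketch; the expressions follow the same style used earlier in this post):

# all href attributes of the <a> tags
hrefs = response.xpath('//a/@href').extract()
# the text of the <a> whose id is "i1"
first = response.xpath('//a[@id="i1"]/text()').extract_first()
print(hrefs, first)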

 

12. Downloader middleware

The generated template in middlewares.py:

class CjkscrapyDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

Custom downloader middleware: first create a file, md.py, then register it in the settings:

DOWNLOADER_MIDDLEWARES = {
    # 'cjkscrapy.middlewares.CjkscrapyDownloaderMiddleware': 543,
    'cjkscrapy.md.Md1': 666,
    'cjkscrapy.md.Md2': 667,
}

md.py

class Md1(object):
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        return s

    def process_request(self, request, spider):
        print('md1.process_request', request)
        pass

    def process_response(self, request, response, spider):
        print('m1.process_response', request, response)
        return response

    def process_exception(self, request, exception, spider):
        pass


class Md2(object):
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        return s

    def process_request(self, request, spider):
        print('md2.process_request', request)

    def process_response(self, request, response, spider):
        print('m2.process_response', request, response)
        return response

    def process_exception(self, request, exception, spider):
        pass

Output:

chenjunkandeMBP:cjkscrapy chenjunkan$ scrapy crawl chouti --nolog
md1.process_request <GET http://dig.chouti.com/>
md2.process_request <GET http://dig.chouti.com/>
m2.process_response <GET http://dig.chouti.com/> <301 http://dig.chouti.com/>
m1.process_response <GET http://dig.chouti.com/> <301 http://dig.chouti.com/>
md1.process_request <GET https://dig.chouti.com/>
md2.process_request <GET https://dig.chouti.com/>
m2.process_response <GET https://dig.chouti.com/> <200 https://dig.chouti.com/>
m1.process_response <GET https://dig.chouti.com/> <200 https://dig.chouti.com/>
response <200 https://dig.chouti.com/>

a. Returning a Response object

from scrapy.http import HtmlResponse
from scrapy.http import Request


class Md1(object):
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        return s

    def process_request(self, request, spider):
        print('md1.process_request', request)
        # 1. return a Response
        import requests
        result = requests.get(request.url)
        return HtmlResponse(url=request.url, status=200, headers=None, body=result.content)

    def process_response(self, request, response, spider):
        print('m1.process_response', request, response)
        return response

    def process_exception(self, request, exception, spider):
        pass


class Md2(object):
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        return s

    def process_request(self, request, spider):
        print('md2.process_request', request)

    def process_response(self, request, response, spider):
        print('m2.process_response', request, response)
        return response

    def process_exception(self, request, exception, spider):
        pass

Output:

chenjunkandeMBP:cjkscrapy chenjunkan$ scrapy crawl chouti --nolog
md1.process_request <GET http://dig.chouti.com/>
m2.process_response <GET http://dig.chouti.com/> <200 http://dig.chouti.com/>
m1.process_response <GET http://dig.chouti.com/> <200 http://dig.chouti.com/>
response <200 http://dig.chouti.com/>

Note: as the output shows, at this point we can fake the download result and hand it straight to the rest of the chain as if it were the real response.

b. Returning a Request

from scrapy.http import HtmlResponse
from scrapy.http import Request


class Md1(object):
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        return s

    def process_request(self, request, spider):
        print('md1.process_request', request)
        # 2. return a Request
        return Request('https://dig.chouti.com/r/tec/hot/1')

    def process_response(self, request, response, spider):
        print('m1.process_response', request, response)
        return response

    def process_exception(self, request, exception, spider):
        pass


class Md2(object):
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        return s

    def process_request(self, request, spider):
        print('md2.process_request', request)

    def process_response(self, request, response, spider):
        print('m2.process_response', request, response)
        return response

    def process_exception(self, request, exception, spider):
        pass

Output:

chenjunkandeMBP:cjkscrapy chenjunkan$ scrapy crawl chouti --nolog
md1.process_request <GET http://dig.chouti.com/>
md1.process_request <GET https://dig.chouti.com/r/tec/hot/1>

c. Raising an exception: process_exception must be defined

from scrapy.http import HtmlResponse
from scrapy.http import Request


class Md1(object):
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        return s

    def process_request(self, request, spider):
        print('md1.process_request', request)
        # 3. raise an exception
        from scrapy.exceptions import IgnoreRequest
        raise IgnoreRequest

    def process_response(self, request, response, spider):
        print('m1.process_response', request, response)
        return response

    def process_exception(self, request, exception, spider):
        pass


class Md2(object):
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        return s

    def process_request(self, request, spider):
        print('md2.process_request', request)

    def process_response(self, request, response, spider):
        print('m2.process_response', request, response)
        return response

    def process_exception(self, request, exception, spider):
        pass

Output:

chenjunkandeMBP:cjkscrapy chenjunkan$ scrapy crawl chouti --nolog
md1.process_request <GET http://dig.chouti.com/>

d. The cases above are rarely used in practice; what we usually do instead is modify the request itself (headers, cookies, user agent):

from scrapy.http import HtmlResponse
from scrapy.http import Request


class Md1(object):
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        return s

    def process_request(self, request, spider):
        print('md1.process_request', request)
        request.headers[
            'user-agent'] = "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"

    def process_response(self, request, response, spider):
        print('m1.process_response', request, response)
        return response

    def process_exception(self, request, exception, spider):
        pass


class Md2(object):
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        return s

    def process_request(self, request, spider):
        print('md2.process_request', request)

    def process_response(self, request, response, spider):
        print('m2.process_response', request, response)
        return response

    def process_exception(self, request, exception, spider):
        pass

Output:

chenjunkandeMBP:cjkscrapy chenjunkan$ scrapy crawl chouti --nolog
md1.process_request <GET http://dig.chouti.com/>
md2.process_request <GET http://dig.chouti.com/>
m2.process_response <GET http://dig.chouti.com/> <301 http://dig.chouti.com/>
m1.process_response <GET http://dig.chouti.com/> <301 http://dig.chouti.com/>
md1.process_request <GET https://dig.chouti.com/>
md2.process_request <GET https://dig.chouti.com/>
m2.process_response <GET https://dig.chouti.com/> <200 https://dig.chouti.com/>
m1.process_response <GET https://dig.chouti.com/> <200 https://dig.chouti.com/>
response <200 https://dig.chouti.com/>

We do not need to write this ourselves, though; Scrapy already ships with a middleware for it:

Path: /Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy/downloadermiddlewares/useragent.py

The source:

"""Set User-Agent header per spider or use a default value from settings"""

from scrapy import signals


class UserAgentMiddleware(object):
    """This middleware allows spiders to override the user_agent"""

    def __init__(self, user_agent='Scrapy'):
        self.user_agent = user_agent

    @classmethod
    def from_crawler(cls, crawler):
        o = cls(crawler.settings['USER_AGENT'])
        crawler.signals.connect(o.spider_opened, signal=signals.spider_opened)
        return o

    def spider_opened(self, spider):
        self.user_agent = getattr(spider, 'user_agent', self.user_agent)

    def process_request(self, request, spider):
        if self.user_agent:
            request.headers.setdefault(b'User-Agent', self.user_agent)
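In practice you just set USER_AGENT in settings.py and let this middleware apply it to every request; for example (the UA string here is the same one used in the md.py example above):

# settings.py
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36'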

 

13. Spider middleware

The generated template:

class CjkscrapySpiderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

Custom spider middleware:

Register it in the settings:

SPIDER_MIDDLEWARES = {
    # 'cjkscrapy.middlewares.CjkscrapySpiderMiddleware': 543,
    'cjkscrapy.sd.Sd1': 667,
    'cjkscrapy.sd.Sd2': 668,
}

sd.py:

class Sd1(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    # runs only once, when the spider starts.
    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r


class Sd2(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, dict or Item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Response, dict
        # or Item objects.
        pass

    # runs only once, when the spider starts.
    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

Normally we do not modify this and just use the built-in ones. In the spider middleware chain, all of the process_spider_input methods run first, and then all of the process_spider_output methods run.

 

14. Custom commands

So far we have always launched spiders from the command line, which is tedious; instead we can write a small script that runs those commands for us. In the project root, create

start.py:

import sys
from scrapy.cmdline import execute

if __name__ == '__main__':
    execute(["scrapy", "crawl", "chouti", "--nolog"])  # run a single spider

Defining a custom command:

  • create a directory (any name, e.g. commands) at the same level as spiders
  • inside it create a crawlall.py file (the file name becomes the command name)
from scrapy.commands import ScrapyCommand
from scrapy.utils.project import get_project_settings


class Command(ScrapyCommand):

    requires_project = True

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Runs all of the spiders'

    def run(self, args, opts):
        spider_list = self.crawler_process.spiders.list()
        for name in spider_list:
            self.crawler_process.crawl(name, **opts.__dict__)
        self.crawler_process.start()

crawlall.py
  • In settings.py add COMMANDS_MODULE = '<project name>.<directory name>'  # e.g. COMMANDS_MODULE = "cjkscrapy.commands"
  • From the project directory run: scrapy crawlall

Run scrapy --help and the command we just added shows up in the list:

 1 chenjunkandeMBP:cjkscrapy chenjunkan$ scrapy --help
 2 Scrapy 1.5.1 - project: cjkscrapy
 3 
 4 Usage:
 5   scrapy <command> [options] [args]
 6 
 7 Available commands:
 8   bench         Run quick benchmark test
 9   check         Check spider contracts
10   crawl         Run a spider
11   crawlall      Runs all of the spiders
12   edit          Edit spider
13   fetch         Fetch a URL using the Scrapy downloader
14   genspider     Generate new spider using pre-defined templates
15   list          List available spiders
16   parse         Parse URL (using its spider) and print the results
17   runspider     Run a self-contained spider (without creating a project)
18   settings      Get settings values
19   shell         Interactive scraping console
20   startproject  Create new project
21   version       Print Scrapy version
22   view          Open URL in browser, as seen by Scrapy

 

15. Scrapy signals

Create a new file, ext.py:

from scrapy import signals


class MyExtend(object):
    def __init__(self):
        pass

    @classmethod
    def from_crawler(cls, crawler):
        self = cls()

        crawler.signals.connect(self.x1, signal=signals.spider_opened)
        crawler.signals.connect(self.x2, signal=signals.spider_closed)

        return self

    def x1(self, spider):
        print('open')

    def x2(self, spider):
        print('close')

Register it in the settings file:

EXTENSIONS = {
    'scrapy.extensions.telnet.TelnetConsole': None,
    'cjkscrapy.ext.MyExtend': 666
}

Note: crawler.signals.connect(self.x1, signal=signals.spider_opened) registers the x1 method on the spider_opened signal. The built-in signals are:

# engine signals
engine_started = object()
engine_stopped = object()
# spider signals
spider_opened = object()
spider_idle = object()  # fired when the spider has no pending requests
spider_closed = object()
spider_error = object()
# request signals
request_scheduled = object()  # fired when a request is handed to the scheduler
request_dropped = object()    # fired when a request is dropped
# response signals
response_received = object()
response_downloaded = object()
# item signals
item_scraped = object()
item_dropped = object()

Run output:

chenjunkandeMBP:cjkscrapy chenjunkan$ scrapy crawl chouti --nolog
open
response <200 https://dig.chouti.com/>
close
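For illustration, any other built-in signal can be hooked the same way; a small sketch (the class and method names here are made up, and it would be registered in EXTENSIONS just like MyExtend above):

from scrapy import signals


class MyExtend2(object):
    @classmethod
    def from_crawler(cls, crawler):
        self = cls()
        # hook a couple more of the built-in signals listed above
        crawler.signals.connect(self.on_item, signal=signals.item_scraped)
        crawler.signals.connect(self.on_error, signal=signals.spider_error)
        return self

    def on_item(self, item, response, spider):
        print('item scraped:', item)

    def on_error(self, failure, response, spider):
        print('spider error:', failure)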

 

16. scrapy_redis

What is scrapy_redis for? It is the component that lets us build distributed crawlers on top of Scrapy.

Basic redis operations:

import redis

conn = redis.Redis(host='140.143.227.206', port=8888, password='beta')
keys = conn.keys()
print(keys)
print(conn.smembers('dupefilter:xiaodongbei'))
# print(conn.smembers('visited_urls'))
# v1 = conn.sadd('urls','http://www.baidu.com')
# v2 = conn.sadd('urls','http://www.cnblogs.com')
# print(v1)
# print(v2)
# v3 = conn.sadd('urls','http://www.bing.com')
# print(v3)

# result = conn.sadd('urls','http://www.bing.com')
# if result == 1:
#     print('not visited before')
# else:
#     print('visited before')
# print(conn.smembers('urls'))

(1) scrapy_redis de-duplication

Code example:

Approach 1: fully custom

Write the dedup rule yourself in an arbitrary module, e.g. xxx.py:

from scrapy.dupefilter import BaseDupeFilter
import redis
from scrapy.utils.request import request_fingerprint
import scrapy_redis


class DupFilter(BaseDupeFilter):
    def __init__(self):
        self.conn = redis.Redis(host='140.143.227.206', port=8888, password='beta')

    def request_seen(self, request):
        """
        检测当前请求是否已经被访问过
        :param request: 
        :return: True表示已经访问过;False表示未访问过
        """
        fid = request_fingerprint(request)
        result = self.conn.sadd('visited_urls', fid)
        if result == 1:
            return False
        return True

In the settings file:

DUPEFILTER_CLASS = 'dbd.xxx.DupFilter'
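The heart of Approach 1 is request_fingerprint, which turns a request into a stable unique identifier. A standalone sketch of what it does (the URLs are just illustrative):

from scrapy.http import Request
from scrapy.utils.request import request_fingerprint

# The URL is canonicalized before hashing, so two requests whose query
# parameters only differ in order produce the same fingerprint.
r1 = Request('http://www.example.com/?a=1&b=2')
r2 = Request('http://www.example.com/?b=2&a=1')
print(request_fingerprint(r1) == request_fingerprint(r2))  # True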

 

Approach 2: customize on top of scrapy_redis

Let's read the scrapy_redis source (import scrapy_redis):

Path: /Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy_redis/dupefilter.py

The RFPDupeFilter class: it also inherits from BaseDupeFilter, much like our Approach 1:

  1 import logging
  2 import time
  3 
  4 from scrapy.dupefilters import BaseDupeFilter
  5 from scrapy.utils.request import request_fingerprint
  6 
  7 from . import defaults
  8 from .connection import get_redis_from_settings
  9 
 10 
 11 logger = logging.getLogger(__name__)
 12 
 13 
 14 # TODO: Rename class to RedisDupeFilter.
 15 class RFPDupeFilter(BaseDupeFilter):
 16     """Redis-based request duplicates filter.
 17 
 18     This class can also be used with default Scrapy's scheduler.
 19 
 20     """
 21 
 22     logger = logger
 23 
 24     def __init__(self, server, key, debug=False):
 25         """Initialize the duplicates filter.
 26 
 27         Parameters
 28         ----------
 29         server : redis.StrictRedis
 30             The redis server instance.
 31         key : str
 32             Redis key Where to store fingerprints.
 33         debug : bool, optional
 34             Whether to log filtered requests.
 35 
 36         """
 37         self.server = server
 38         self.key = key
 39         self.debug = debug
 40         self.logdupes = True
 41 
 42     @classmethod
 43     def from_settings(cls, settings):
 44         """Returns an instance from given settings.
 45 
 46         This uses by default the key ``dupefilter:<timestamp>``. When using the
 47         ``scrapy_redis.scheduler.Scheduler`` class, this method is not used as
 48         it needs to pass the spider name in the key.
 49 
 50         Parameters
 51         ----------
 52         settings : scrapy.settings.Settings
 53 
 54         Returns
 55         -------
 56         RFPDupeFilter
 57             A RFPDupeFilter instance.
 58 
 59 
 60         """
 61         server = get_redis_from_settings(settings)
 62         # XXX: This creates one-time key. needed to support to use this
 63         # class as standalone dupefilter with scrapy's default scheduler
 64         # if scrapy passes spider on open() method this wouldn't be needed
 65         # TODO: Use SCRAPY_JOB env as default and fallback to timestamp.
 66         key = defaults.DUPEFILTER_KEY % {'timestamp': int(time.time())}
 67         debug = settings.getbool('DUPEFILTER_DEBUG')
 68         return cls(server, key=key, debug=debug)
 69 
 70     @classmethod
 71     def from_crawler(cls, crawler):
 72         """Returns instance from crawler.
 73 
 74         Parameters
 75         ----------
 76         crawler : scrapy.crawler.Crawler
 77 
 78         Returns
 79         -------
 80         RFPDupeFilter
 81             Instance of RFPDupeFilter.
 82 
 83         """
 84         return cls.from_settings(crawler.settings)
 85 
 86     def request_seen(self, request):
 87         """Returns True if request was already seen.
 88 
 89         Parameters
 90         ----------
 91         request : scrapy.http.Request
 92 
 93         Returns
 94         -------
 95         bool
 96 
 97         """
 98         fp = self.request_fingerprint(request)
 99         # This returns the number of values added, zero if already exists.
100         added = self.server.sadd(self.key, fp)
101         return added == 0
102 
103     def request_fingerprint(self, request):
104         """Returns a fingerprint for a given request.
105 
106         Parameters
107         ----------
108         request : scrapy.http.Request
109 
110         Returns
111         -------
112         str
113 
114         """
115         return request_fingerprint(request)
116 
117     def close(self, reason=''):
118         """Delete data on close. Called by Scrapy's scheduler.
119 
120         Parameters
121         ----------
122         reason : str, optional
123 
124         """
125         self.clear()
126 
127     def clear(self):
128         """Clears fingerprints data."""
129         self.server.delete(self.key)
130 
131     def log(self, request, spider):
132         """Logs given request.
133 
134         Parameters
135         ----------
136         request : scrapy.http.Request
137         spider : scrapy.spiders.Spider
138 
139         """
140         if self.debug:
141             msg = "Filtered duplicate request: %(request)s"
142             self.logger.debug(msg, {'request': request}, extra={'spider': spider})
143         elif self.logdupes:
144             msg = ("Filtered duplicate request %(request)s"
145                    " - no more duplicates will be shown"
146                    " (see DUPEFILTER_DEBUG to show all duplicates)")
147             self.logger.debug(msg, {'request': request}, extra={'spider': spider})
148             self.logdupes = False

This class also de-duplicates through request_seen:

    def request_seen(self, request):
        """Returns True if request was already seen.

        Parameters
        ----------
        request : scrapy.http.Request

        Returns
        -------
        bool

        """
        fp = self.request_fingerprint(request)
        # This returns the number of values added, zero if already exists.
        added = self.server.sadd(self.key, fp)
        return added == 0

Notes:

  1 1. fp = self.request_fingerprint(request): build the unique fingerprint for the request
  2 2. added = self.server.sadd(self.key, fp):
  3 self.server = server: the corresponding redis connection
  4 (1).@classmethod
  5     def from_settings(cls, settings):
  6         """Returns an instance from given settings.
  7 
  8         This uses by default the key ``dupefilter:<timestamp>``. When using the
  9         ``scrapy_redis.scheduler.Scheduler`` class, this method is not used as
 10         it needs to pass the spider name in the key.
 11 
 12         Parameters
 13         ----------
 14         settings : scrapy.settings.Settings
 15 
 16         Returns
 17         -------
 18         RFPDupeFilter
 19             A RFPDupeFilter instance.
 20 
 21 
 22         """
 23         server = get_redis_from_settings(settings)
 24         # XXX: This creates one-time key. needed to support to use this
 25         # class as standalone dupefilter with scrapy's default scheduler
 26         # if scrapy passes spider on open() method this wouldn't be needed
 27         # TODO: Use SCRAPY_JOB env as default and fallback to timestamp.
 28         key = defaults.DUPEFILTER_KEY % {'timestamp': int(time.time())}
 29         debug = settings.getbool('DUPEFILTER_DEBUG')
 30         return cls(server, key=key, debug=debug)
 31 
 32 (2).def get_redis_from_settings(settings):
 33     """Returns a redis client instance from given Scrapy settings object.
 34 
 35     This function uses ``get_client`` to instantiate the client and uses
 36     ``defaults.REDIS_PARAMS`` global as defaults values for the parameters. You
 37     can override them using the ``REDIS_PARAMS`` setting.
 38 
 39     Parameters
 40     ----------
 41     settings : Settings
 42         A scrapy settings object. See the supported settings below.
 43 
 44     Returns
 45     -------
 46     server
 47         Redis client instance.
 48 
 49     Other Parameters
 50     ----------------
 51     REDIS_URL : str, optional
 52         Server connection URL.
 53     REDIS_HOST : str, optional
 54         Server host.
 55     REDIS_PORT : str, optional
 56         Server port.
 57     REDIS_ENCODING : str, optional
 58         Data encoding.
 59     REDIS_PARAMS : dict, optional
 60         Additional client parameters.
 61 
 62     """
 63     params = defaults.REDIS_PARAMS.copy()
 64     params.update(settings.getdict('REDIS_PARAMS'))
 65     # XXX: Deprecate REDIS_* settings.
 66     for source, dest in SETTINGS_PARAMS_MAP.items():
 67         val = settings.get(source)
 68         if val:
 69             params[dest] = val
 70 
 71     # Allow ``redis_cls`` to be a path to a class.
 72     if isinstance(params.get('redis_cls'), six.string_types):
 73         params['redis_cls'] = load_object(params['redis_cls'])
 74 
 75     return get_redis(**params)
 76 (3).def get_redis(**kwargs):
 77     """Returns a redis client instance.
 78 
 79     Parameters
 80     ----------
 81     redis_cls : class, optional
 82         Defaults to ``redis.StrictRedis``.
 83     url : str, optional
 84         If given, ``redis_cls.from_url`` is used to instantiate the class.
 85     **kwargs
 86         Extra parameters to be passed to the ``redis_cls`` class.
 87 
 88     Returns
 89     -------
 90     server
 91         Redis client instance.
 92 
 93     """
 94     redis_cls = kwargs.pop('redis_cls', defaults.REDIS_CLS)
 95     url = kwargs.pop('url', None)
 96     if url:
 97         return redis_cls.from_url(url, **kwargs)
 98     else:
 99         return redis_cls(**kwargs)
100 (4).redis_cls = kwargs.pop('redis_cls', defaults.REDIS_CLS)
101 REDIS_CLS = redis.StrictRedis
102 
103 Therefore the following needs to be added to the settings file:
104 # ############### scrapy redis connection ####################
105 
106 REDIS_HOST = '140.143.227.206'                            # host
107 REDIS_PORT = 8888                                   # port
108 REDIS_PARAMS  = {'password':'beta'}                                  # redis connection parameters             default: REDIS_PARAMS = {'socket_timeout': 30,'socket_connect_timeout': 30,'retry_on_timeout': True,'encoding': REDIS_ENCODING,}
109 REDIS_ENCODING = "utf-8"                            # redis encoding             default: 'utf-8'
110 
111 # REDIS_URL = 'redis://user:pass@hostname:9001'       # connection URL (takes precedence over the settings above)
112 DUPEFILTER_KEY = 'dupefilter:%(timestamp)s'
113 
114 self.key = key: key = defaults.DUPEFILTER_KEY % {'timestamp': int(time.time())}
115 3. return added == 0
116 If sadd returns 0 the fingerprint already existed, i.e. the request was seen before, so request_seen returns True; if it returns 1 the request had not been seen, so it returns False.

Code example:

from scrapy_redis.dupefilter import RFPDupeFilter
from scrapy_redis.connection import get_redis_from_settings
from scrapy_redis import defaults


class RedisDupeFilter(RFPDupeFilter):
    @classmethod
    def from_settings(cls, settings):
        """Returns an instance from given settings.

        This uses by default the key ``dupefilter:<timestamp>``. When using the
        ``scrapy_redis.scheduler.Scheduler`` class, this method is not used as
        it needs to pass the spider name in the key.

        Parameters
        ----------
        settings : scrapy.settings.Settings

        Returns
        -------
        RFPDupeFilter
            A RFPDupeFilter instance.


        """
        server = get_redis_from_settings(settings)
        # XXX: This creates one-time key. needed to support to use this
        # class as standalone dupefilter with scrapy's default scheduler
        # if scrapy passes spider on open() method this wouldn't be needed
        # TODO: Use SCRAPY_JOB env as default and fallback to timestamp.
        key = defaults.DUPEFILTER_KEY % {'timestamp': 'chenjunkan'}
        debug = settings.getbool('DUPEFILTER_DEBUG')
        return cls(server, key=key, debug=debug)

By default the key is built from a timestamp; here we hard-code it instead: key = defaults.DUPEFILTER_KEY % {'timestamp': 'chenjunkan'}

Finally, in the settings file:

# ############### scrapy redis connection ####################

REDIS_HOST = '140.143.227.206'                      # host
REDIS_PORT = 8888                                   # port
REDIS_PARAMS = {'password':'beta'}                  # redis connection parameters
# default: REDIS_PARAMS = {'socket_timeout': 30,'socket_connect_timeout': 30,'retry_on_timeout': True,'encoding': REDIS_ENCODING,}
REDIS_ENCODING = "utf-8"                            # redis encoding, default: 'utf-8'

# REDIS_URL = 'redis://user:pass@hostname:9001'     # connection URL (takes precedence over the settings above)
DUPEFILTER_KEY = 'dupefilter:%(timestamp)s'
# DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'
DUPEFILTER_CLASS = 'cjkscrapy.xxx.RedisDupeFilter'

 

(2) scrapy_redis queues

Implementing a queue and a stack with redis:

Look at the source (import scrapy_redis):

Path: /Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy_redis/queue.py

queue.py:

  1 from scrapy.utils.reqser import request_to_dict, request_from_dict
  2 
  3 from . import picklecompat
  4 
  5 
  6 class Base(object):
  7     """Per-spider base queue class"""
  8 
  9     def __init__(self, server, spider, key, serializer=None):
 10         """Initialize per-spider redis queue.
 11 
 12         Parameters
 13         ----------
 14         server : StrictRedis
 15             Redis client instance.
 16         spider : Spider
 17             Scrapy spider instance.
 18         key: str
 19             Redis key where to put and get messages.
 20         serializer : object
 21             Serializer object with ``loads`` and ``dumps`` methods.
 22 
 23         """
 24         if serializer is None:
 25             # Backward compatibility.
 26             # TODO: deprecate pickle.
 27             serializer = picklecompat
 28         if not hasattr(serializer, 'loads'):
 29             raise TypeError("serializer does not implement 'loads' function: %r"
 30                             % serializer)
 31         if not hasattr(serializer, 'dumps'):
 32             raise TypeError("serializer '%s' does not implement 'dumps' function: %r"
 33                             % serializer)
 34 
 35         self.server = server
 36         self.spider = spider
 37         self.key = key % {'spider': spider.name}
 38         self.serializer = serializer
 39 
 40     def _encode_request(self, request):
 41         """Encode a request object"""
 42         obj = request_to_dict(request, self.spider)
 43         return self.serializer.dumps(obj)
 44 
 45     def _decode_request(self, encoded_request):
 46         """Decode an request previously encoded"""
 47         obj = self.serializer.loads(encoded_request)
 48         return request_from_dict(obj, self.spider)
 49 
 50     def __len__(self):
 51         """Return the length of the queue"""
 52         raise NotImplementedError
 53 
 54     def push(self, request):
 55         """Push a request"""
 56         raise NotImplementedError
 57 
 58     def pop(self, timeout=0):
 59         """Pop a request"""
 60         raise NotImplementedError
 61 
 62     def clear(self):
 63         """Clear queue/stack"""
 64         self.server.delete(self.key)
 65 
 66 
 67 class FifoQueue(Base):
 68     """Per-spider FIFO queue"""
 69 
 70     def __len__(self):
 71         """Return the length of the queue"""
 72         return self.server.llen(self.key)
 73 
 74     def push(self, request):
 75         """Push a request"""
 76         self.server.lpush(self.key, self._encode_request(request))
 77 
 78     def pop(self, timeout=0):
 79         """Pop a request"""
 80         if timeout > 0:
 81             data = self.server.brpop(self.key, timeout)
 82             if isinstance(data, tuple):
 83                 data = data[1]
 84         else:
 85             data = self.server.rpop(self.key)
 86         if data:
 87             return self._decode_request(data)
 88 
 89 
 90 class PriorityQueue(Base):
 91     """Per-spider priority queue abstraction using redis' sorted set"""
 92 
 93     def __len__(self):
 94         """Return the length of the queue"""
 95         return self.server.zcard(self.key)
 96 
 97     def push(self, request):
 98         """Push a request"""
 99         data = self._encode_request(request)
100         score = -request.priority
101         # We don't use zadd method as the order of arguments change depending on
102         # whether the class is Redis or StrictRedis, and the option of using
103         # kwargs only accepts strings, not bytes.
104         self.server.execute_command('ZADD', self.key, score, data)
105 
106     def pop(self, timeout=0):
107         """
108         Pop a request
109         timeout not support in this queue class
110         """
111         # use atomic range/remove using multi/exec
112         pipe = self.server.pipeline()
113         pipe.multi()
114         pipe.zrange(self.key, 0, 0).zremrangebyrank(self.key, 0, 0)
115         results, count = pipe.execute()
116         if results:
117             return self._decode_request(results[0])
118 
119 
120 class LifoQueue(Base):
121     """Per-spider LIFO queue."""
122 
123     def __len__(self):
124         """Return the length of the stack"""
125         return self.server.llen(self.key)
126 
127     def push(self, request):
128         """Push a request"""
129         self.server.lpush(self.key, self._encode_request(request))
130 
131     def pop(self, timeout=0):
132         """Pop a request"""
133         if timeout > 0:
134             data = self.server.blpop(self.key, timeout)
135             if isinstance(data, tuple):
136                 data = data[1]
137         else:
138             data = self.server.lpop(self.key)
139 
140         if data:
141             return self._decode_request(data)
142 
143 
144 # TODO: Deprecate the use of these names.
145 SpiderQueue = FifoQueue
146 SpiderStack = LifoQueue
147 SpiderPriorityQueue = PriorityQueue

Queue: first in, first out (FIFO)

Code example:

import scrapy_redis
import redis

class FifoQueue(object):
    def __init__(self):
        self.server = redis.Redis(host='140.143.227.206',port=8888,password='beta')

    def push(self, request):
        """Push a request"""
        self.server.lpush('USERS', request)

    def pop(self, timeout=0):
        """Pop a request"""
        data = self.server.rpop('USERS')
        return data
# [33,22,11]
q = FifoQueue()
q.push(11)
q.push(22)
q.push(33)

print(q.pop())
print(q.pop())
print(q.pop())

Note: lpush inserts on the left and rpop removes from the right, i.e. first in, first out — breadth-first.
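To see that order directly, a short sketch against the same redis server (reusing the USERS key from above):

import redis

conn = redis.Redis(host='140.143.227.206', port=8888, password='beta')
conn.delete('USERS')
for item in (11, 22, 33):
    conn.lpush('USERS', item)        # the list now looks like [33, 22, 11]
print(conn.lrange('USERS', 0, -1))    # [b'33', b'22', b'11']
print(conn.rpop('USERS'))             # b'11' -- the element pushed first comes out first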

Stack: last in, first out (LIFO):

import redis

class LifoQueue(object):
    """Per-spider LIFO queue."""
    def __init__(self):
        self.server = redis.Redis(host='140.143.227.206',port=8888,password='beta')

    def push(self, request):
        """Push a request"""
        self.server.lpush("USERS", request)

    def pop(self, timeout=0):
        """Pop a request"""
        data = self.server.lpop('USERS')
        return data

# [33,22,11]

Note: lpush inserts on the left and lpop also removes from the left, i.e. last in, first out — depth-first.

The zadd and zrange commands:

import redis

conn = redis.Redis(host='140.143.227.206', port=8888, password='beta')
conn.zadd('score', cjk=79, pyy=33, cc=73)

print(conn.keys())

v = conn.zrange('score', 0, 8, desc=True)
print(v)

pipe = conn.pipeline()  # batch several commands and send them in one round trip
pipe.multi()
pipe.zrange("score", 0, 0).zremrangebyrank('score', 0, 0)  # ascending order by default: take the first element, then remove it by rank
results, count = pipe.execute()
print(results, count)

zadd: store the three members cjk=79, pyy=33, cc=73 in redis under the sorted-set key score.

zrange: read a range of score, sorted from smallest to largest by default; with v = conn.zrange('score', 0, 8, desc=True), setting desc=True returns the members ordered from the highest score to the lowest.
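One caveat: the keyword-argument form of zadd used above is the redis-py 2.x API. If you happen to be on redis-py 3.x (an assumption about your environment), the same call takes a mapping instead; a minimal sketch:

import redis

conn = redis.Redis(host='140.143.227.206', port=8888, password='beta')
# redis-py 3.x style: members and their scores are passed as one dict
conn.zadd('score', {'cjk': 79, 'pyy': 33, 'cc': 73})
print(conn.zrange('score', 0, -1, withscores=True))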

Priority queue:

import redis


class PriorityQueue(object):
    """Per-spider priority queue abstraction using redis' sorted set"""

    def __init__(self):
        self.server = redis.Redis(host='140.143.227.206', port=8888, password='beta')

    def push(self, request, score):
        """Push a request"""
        # data = self._encode_request(request)
        # score = -request.priority
        # We don't use zadd method as the order of arguments change depending on
        # whether the class is Redis or StrictRedis, and the option of using
        # kwargs only accepts strings, not bytes.
        self.server.execute_command('ZADD', 'xxxxxx', score, request)

    def pop(self, timeout=0):
        """
        Pop a request
        timeout not support in this queue class
        """
        # use atomic range/remove using multi/exec
        pipe = self.server.pipeline()
        pipe.multi()
        pipe.zrange('xxxxxx', 0, 0).zremrangebyrank('xxxxxx', 0, 0)
        results, count = pipe.execute()
        if results:
            return results[0]


q = PriorityQueue()

q.push('alex', 99)
q.push('oldboy', 56)
q.push('eric', 77)

v1 = q.pop()
print(v1)
v2 = q.pop()
print(v2)
v3 = q.pop()
print(v3)

Note: if two members share the same score, redis orders them lexicographically by member. Popping from the low end of the score range gives breadth-first behaviour; popping from the high end gives depth-first.
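A quick way to convince yourself of that tie-breaking rule (same server; 'ties' is just a throw-away key name):

import redis

conn = redis.Redis(host='140.143.227.206', port=8888, password='beta')
conn.delete('ties')
conn.execute_command('ZADD', 'ties', 10, 'bbb')
conn.execute_command('ZADD', 'ties', 10, 'aaa')
# equal scores fall back to lexicographic ordering of the members
print(conn.zrange('ties', 0, -1))  # [b'aaa', b'bbb']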

(3) The scrapy_redis scheduler

Scheduler settings:

# ###################### scheduler ######################
from scrapy_redis.scheduler import Scheduler
# scheduling is handled by the scrapy_redis scheduler
# enqueue_request: add a task to the scheduler
# next_request: fetch one task from the scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# order in which tasks are stored
# priority
DEPTH_PRIORITY = 1  # breadth-first
# DEPTH_PRIORITY = -1 # depth-first
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'  # priority queue (the default); alternatives: PriorityQueue (sorted set), FifoQueue (list), LifoQueue (list)

# breadth-first
# SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.FifoQueue'  # alternatives: PriorityQueue (sorted set), FifoQueue (list), LifoQueue (list)
# depth-first
# SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.LifoQueue'  # alternatives: PriorityQueue (sorted set), FifoQueue (list), LifoQueue (list)

"""
redis = {
    chouti:requests:[
        pickle.dumps(Request(url='Http://wwwww',callback=self.parse)),
        pickle.dumps(Request(url='Http://wwwww',callback=self.parse)),
        pickle.dumps(Request(url='Http://wwwww',callback=self.parse)),
    ],
    cnblogs:requests:[

    ]
}
"""
SCHEDULER_QUEUE_KEY = '%(spider)s:requests'  # redis key under which the scheduler keeps pending requests

SCHEDULER_SERIALIZER = "scrapy_redis.picklecompat"  # serializer for the data saved in redis, pickle by default

SCHEDULER_PERSIST = False  # keep the scheduler queue and dedup records on close? True = keep, False = clear
SCHEDULER_FLUSH_ON_START = True  # clear the scheduler queue and dedup records on start? True = clear, False = keep
# SCHEDULER_IDLE_BEFORE_CLOSE = 10  # when fetching from the scheduler and it is empty, how long to wait at most before giving up


SCHEDULER_DUPEFILTER_KEY = '%(spider)s:dupefilter'  # redis key under which the dedup records are stored
SCHEDULER_DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'  # class that implements the dedup rule

Notes:

a. from scrapy_redis.scheduler import Scheduler: scheduling is now handled by the scrapy_redis Scheduler:

  1 import importlib
  2 import six
  3 
  4 from scrapy.utils.misc import load_object
  5 
  6 from . import connection, defaults
  7 
  8 
  9 # TODO: add SCRAPY_JOB support.
 10 class Scheduler(object):
 11     """Redis-based scheduler
 12 
 13     Settings
 14     --------
 15     SCHEDULER_PERSIST : bool (default: False)
 16         Whether to persist or clear redis queue.
 17     SCHEDULER_FLUSH_ON_START : bool (default: False)
 18         Whether to flush redis queue on start.
 19     SCHEDULER_IDLE_BEFORE_CLOSE : int (default: 0)
 20         How many seconds to wait before closing if no message is received.
 21     SCHEDULER_QUEUE_KEY : str
 22         Scheduler redis key.
 23     SCHEDULER_QUEUE_CLASS : str
 24         Scheduler queue class.
 25     SCHEDULER_DUPEFILTER_KEY : str
 26         Scheduler dupefilter redis key.
 27     SCHEDULER_DUPEFILTER_CLASS : str
 28         Scheduler dupefilter class.
 29     SCHEDULER_SERIALIZER : str
 30         Scheduler serializer.
 31 
 32     """
 33 
 34     def __init__(self, server,
 35                  persist=False,
 36                  flush_on_start=False,
 37                  queue_key=defaults.SCHEDULER_QUEUE_KEY,
 38                  queue_cls=defaults.SCHEDULER_QUEUE_CLASS,
 39                  dupefilter_key=defaults.SCHEDULER_DUPEFILTER_KEY,
 40                  dupefilter_cls=defaults.SCHEDULER_DUPEFILTER_CLASS,
 41                  idle_before_close=0,
 42                  serializer=None):
 43         """Initialize scheduler.
 44 
 45         Parameters
 46         ----------
 47         server : Redis
 48             The redis server instance.
 49         persist : bool
 50             Whether to flush requests when closing. Default is False.
 51         flush_on_start : bool
 52             Whether to flush requests on start. Default is False.
 53         queue_key : str
 54             Requests queue key.
 55         queue_cls : str
 56             Importable path to the queue class.
 57         dupefilter_key : str
 58             Duplicates filter key.
 59         dupefilter_cls : str
 60             Importable path to the dupefilter class.
 61         idle_before_close : int
 62             Timeout before giving up.
 63 
 64         """
 65         if idle_before_close < 0:
 66             raise TypeError("idle_before_close cannot be negative")
 67 
 68         self.server = server
 69         self.persist = persist
 70         self.flush_on_start = flush_on_start
 71         self.queue_key = queue_key
 72         self.queue_cls = queue_cls
 73         self.dupefilter_cls = dupefilter_cls
 74         self.dupefilter_key = dupefilter_key
 75         self.idle_before_close = idle_before_close
 76         self.serializer = serializer
 77         self.stats = None
 78 
 79     def __len__(self):
 80         return len(self.queue)
 81 
 82     @classmethod
 83     def from_settings(cls, settings):
 84         kwargs = {
 85             'persist': settings.getbool('SCHEDULER_PERSIST'),
 86             'flush_on_start': settings.getbool('SCHEDULER_FLUSH_ON_START'),
 87             'idle_before_close': settings.getint('SCHEDULER_IDLE_BEFORE_CLOSE'),
 88         }
 89 
 90         # If these values are missing, it means we want to use the defaults.
 91         optional = {
 92             # TODO: Use custom prefixes for this settings to note that are
 93             # specific to scrapy-redis.
 94             'queue_key': 'SCHEDULER_QUEUE_KEY',
 95             'queue_cls': 'SCHEDULER_QUEUE_CLASS',
 96             'dupefilter_key': 'SCHEDULER_DUPEFILTER_KEY',
 97             # We use the default setting name to keep compatibility.
 98             'dupefilter_cls': 'DUPEFILTER_CLASS',
 99             'serializer': 'SCHEDULER_SERIALIZER',
100         }
101         for name, setting_name in optional.items():
102             val = settings.get(setting_name)
103             if val:
104                 kwargs[name] = val
105 
106         # Support serializer as a path to a module.
107         if isinstance(kwargs.get('serializer'), six.string_types):
108             kwargs['serializer'] = importlib.import_module(kwargs['serializer'])
109 
110         server = connection.from_settings(settings)
111         # Ensure the connection is working.
112         server.ping()
113 
114         return cls(server=server, **kwargs)
115 
116     @classmethod
117     def from_crawler(cls, crawler):
118         instance = cls.from_settings(crawler.settings)
119         # FIXME: for now, stats are only supported from this constructor
120         instance.stats = crawler.stats
121         return instance
122 
123     def open(self, spider):
124         self.spider = spider
125 
126         try:
127             self.queue = load_object(self.queue_cls)(
128                 server=self.server,
129                 spider=spider,
130                 key=self.queue_key % {'spider': spider.name},
131                 serializer=self.serializer,
132             )
133         except TypeError as e:
134             raise ValueError("Failed to instantiate queue class '%s': %s",
135                              self.queue_cls, e)
136 
137         try:
138             self.df = load_object(self.dupefilter_cls)(
139                 server=self.server,
140                 key=self.dupefilter_key % {'spider': spider.name},
141                 debug=spider.settings.getbool('DUPEFILTER_DEBUG'),
142             )
143         except TypeError as e:
144             raise ValueError("Failed to instantiate dupefilter class '%s': %s",
145                              self.dupefilter_cls, e)
146 
147         if self.flush_on_start:
148             self.flush()
149         # notice if there are requests already in the queue to resume the crawl
150         if len(self.queue):
151             spider.log("Resuming crawl (%d requests scheduled)" % len(self.queue))
152 
153     def close(self, reason):
154         if not self.persist:
155             self.flush()
156 
157     def flush(self):
158         self.df.clear()
159         self.queue.clear()
160 
161     def enqueue_request(self, request):
162         if not request.dont_filter and self.df.request_seen(request):
163             self.df.log(request, self.spider)
164             return False
165         if self.stats:
166             self.stats.inc_value('scheduler/enqueued/redis', spider=self.spider)
167         self.queue.push(request)
168         return True
169 
170     def next_request(self):
171         block_pop_timeout = self.idle_before_close
172         request = self.queue.pop(block_pop_timeout)
173         if request and self.stats:
174             self.stats.inc_value('scheduler/dequeued/redis', spider=self.spider)
175         return request
176 
177     def has_pending_requests(self):
178         return len(self) > 0

The two most important methods in this class are enqueue_request and next_request:

While the crawler runs there is a single scheduler. Adding a task means calling enqueue_request, which pushes the request onto the queue; when the downloader wants a task, next_request is called to pop one off. Either way, the queue is what actually gets used.

b. SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue': the priority queue is used by default.

The queue is really just a structure in redis, stored under a key: SCHEDULER_QUEUE_KEY = '%(spider)s:requests' (the key the scheduler stores requests under, built from the current spider's name). Since redis can only hold strings, requests are serialized with SCHEDULER_SERIALIZER = "scrapy_redis.picklecompat", which uses pickle by default:

 1 """A pickle wrapper module with protocol=-1 by default."""
 2 
 3 try:
 4     import cPickle as pickle  # PY2
 5 except ImportError:
 6     import pickle
 7 
 8 
 9 def loads(s):
10     return pickle.loads(s)
11 
12 
13 def dumps(obj):
14     return pickle.dumps(obj, protocol=-1)
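So what actually sits in the redis list is a pickled dict representation of each Request. A minimal round-trip sketch (assuming Scrapy 1.x, where scrapy.utils.reqser is still available; the URL is illustrative):

from scrapy.http import Request
from scrapy.utils.reqser import request_to_dict, request_from_dict
from scrapy_redis import picklecompat

req = Request('https://dig.chouti.com/')
blob = picklecompat.dumps(request_to_dict(req))         # the bytes that get pushed into redis
restored = request_from_dict(picklecompat.loads(blob))  # what the scheduler rebuilds on pop
print(restored.url)                                     # https://dig.chouti.com/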

c. SCHEDULER_IDLE_BEFORE_CLOSE = 10  # when fetching from the scheduler and it is empty, the maximum time to wait (give up if nothing turns up). In other words, tasks are popped in a blocking way:

import redis


conn = redis.Redis(host='140.143.227.206',port=8888,password='beta')

# conn.flushall()
print(conn.keys())
# chouti:dupefilter/chouti:request

# conn.lpush('xxx:request','http://wwww.xxx.com')
# conn.lpush('xxx:request','http://wwww.xxx1.com')

# print(conn.lpop('xxx:request'))
# print(conn.blpop('xxx:request',timeout=10))

 

The overall execution flow:

1. Run scrapy crawl chouti --nolog
    
2. The SCHEDULER = "scrapy_redis.scheduler.Scheduler" setting is found and the scheduler object is instantiated
    - Scheduler.from_crawler is executed
 @classmethod
    def from_crawler(cls, crawler):
        instance = cls.from_settings(crawler.settings)
        # FIXME: for now, stats are only supported from this constructor
        instance.stats = crawler.stats
        return instance
    
    - Scheduler.from_settings is executed
@classmethod
    def from_settings(cls, settings):
        kwargs = {
            'persist': settings.getbool('SCHEDULER_PERSIST'),
            'flush_on_start': settings.getbool('SCHEDULER_FLUSH_ON_START'),
            'idle_before_close': settings.getint('SCHEDULER_IDLE_BEFORE_CLOSE'),
        }

        # If these values are missing, it means we want to use the defaults.
        optional = {
            # TODO: Use custom prefixes for this settings to note that are
            # specific to scrapy-redis.
            'queue_key': 'SCHEDULER_QUEUE_KEY',
            'queue_cls': 'SCHEDULER_QUEUE_CLASS',
            'dupefilter_key': 'SCHEDULER_DUPEFILTER_KEY',
            # We use the default setting name to keep compatibility.
            'dupefilter_cls': 'DUPEFILTER_CLASS',
            'serializer': 'SCHEDULER_SERIALIZER',
        }
        for name, setting_name in optional.items():
            val = settings.get(setting_name)
            if val:
                kwargs[name] = val

        # Support serializer as a path to a module.
        if isinstance(kwargs.get('serializer'), six.string_types):
            kwargs['serializer'] = importlib.import_module(kwargs['serializer'])

        server = connection.from_settings(settings)
        # Ensure the connection is working.
        server.ping()

        return cls(server=server, **kwargs)

        - read from the settings:
            SCHEDULER_PERSIST            # keep the scheduler queue and dedup records on close? True = keep, False = clear
            SCHEDULER_FLUSH_ON_START     # clear the scheduler queue and dedup records on start? True = clear, False = keep
            SCHEDULER_IDLE_BEFORE_CLOSE  # when fetching from the scheduler and it is empty, how long to wait at most
        - read from the settings:
            SCHEDULER_QUEUE_KEY          # %(spider)s:requests
            SCHEDULER_QUEUE_CLASS        # scrapy_redis.queue.FifoQueue
            SCHEDULER_DUPEFILTER_KEY     # '%(spider)s:dupefilter'
            DUPEFILTER_CLASS             # 'scrapy_redis.dupefilter.RFPDupeFilter'
            SCHEDULER_SERIALIZER         # "scrapy_redis.picklecompat"

        - read from the settings:
            REDIS_HOST = '140.143.227.206'                      # host
            REDIS_PORT = 8888                                   # port
            REDIS_PARAMS  = {'password':'beta'}                 # redis connection parameters (default: REDIS_PARAMS = {'socket_timeout': 30,'socket_connect_timeout': 30,'retry_on_timeout': True,'encoding': REDIS_ENCODING,})
            REDIS_ENCODING = "utf-8"
    - the Scheduler object is instantiated
    
3. The spider starts crawling its start URLs
    - scheduler.enqueue_request() is called
        def enqueue_request(self, request):
            # does this request need to be filtered?
            # is it already in the dedup records? (i.e. already visited; if not, add it to the records)
            if not request.dont_filter and self.df.request_seen(request):
                self.df.log(request, self.spider)
                # already visited, do not visit it again
                return False
            
            if self.stats:
                self.stats.inc_value('scheduler/enqueued/redis', spider=self.spider)
            # print('not visited yet, pushed to the scheduler', request)
            self.queue.push(request)
            return True
    
4. The downloader asks the scheduler for a task to download
    
    - scheduler.next_request() is called
        def next_request(self):
            block_pop_timeout = self.idle_before_close
            request = self.queue.pop(block_pop_timeout)
            if request and self.stats:
                self.stats.inc_value('scheduler/dequeued/redis', spider=self.spider)
            return request
    
    

A second way to choose between breadth-first and depth-first:

# order in which tasks are stored
# priority
DEPTH_PRIORITY = 1  # breadth-first
# DEPTH_PRIORITY = -1 # depth-first

Note: while crawling, the priority is lowered the deeper a request sits (request.priority -= depth * self.prio). With DEPTH_PRIORITY = 1 the priority keeps shrinking with depth, and because scrapy_redis stores score = -request.priority, the scores keep growing; popping from the small end therefore serves shallow requests first, i.e. breadth-first. With DEPTH_PRIORITY = -1 the effect reverses and you get depth-first.
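A minimal sketch of that arithmetic in plain Python (no Scrapy objects; the depth values are just illustrative):

DEPTH_PRIORITY = 1  # 1 = breadth-first, -1 = depth-first

def score_for(depth, priority=0, prio=DEPTH_PRIORITY):
    # DepthMiddleware lowers the priority as the request gets deeper
    priority -= depth * prio
    # scrapy_redis PriorityQueue stores score = -request.priority
    return -priority

for depth in (0, 1, 2, 3):
    print(depth, score_for(depth))
# With DEPTH_PRIORITY = 1 the scores grow with depth (0, 1, 2, 3), and since the queue
# pops the smallest score first, shallow requests are served first: breadth-first.
# With DEPTH_PRIORITY = -1 the scores shrink with depth, giving depth-first.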

How do the scheduler, the queue and the dupefilter relate to each other in scrapy?
        
    The scheduler decides when a request is added and which one is fetched next.
    The queue stores the requests.
    The dupefilter keeps the record of what has been visited.

Note:

# DUPEFILTER_CLASS takes precedence; if it is not set, SCHEDULER_DUPEFILTER_CLASS is used
  SCHEDULER_DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'  # class that implements the dedup rule
        

(4) RedisSpider

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
import scrapy_redis
from scrapy_redis.spiders import RedisSpider


class ChoutiSpider(RedisSpider):
    name = 'chouti'
    allowed_domains = ['chouti.com']

    def parse(self, response):
        print(response)

Note: here we import RedisSpider from scrapy_redis.spiders and make our spider inherit from it:

 1 class RedisSpider(RedisMixin, Spider):
 2     """Spider that reads urls from redis queue when idle.
 3 
 4     Attributes
 5     ----------
 6     redis_key : str (default: REDIS_START_URLS_KEY)
 7         Redis key where to fetch start URLs from..
 8     redis_batch_size : int (default: CONCURRENT_REQUESTS)
 9         Number of messages to fetch from redis on each attempt.
10     redis_encoding : str (default: REDIS_ENCODING)
11         Encoding to use when decoding messages from redis queue.
12 
13     Settings
14     --------
15     REDIS_START_URLS_KEY : str (default: "<spider.name>:start_urls")
16         Default Redis key where to fetch start URLs from..
17     REDIS_START_URLS_BATCH_SIZE : int (deprecated by CONCURRENT_REQUESTS)
18         Default number of messages to fetch from redis on each attempt.
19     REDIS_START_URLS_AS_SET : bool (default: False)
20         Use SET operations to retrieve messages from the redis queue. If False,
21         the messages are retrieve using the LPOP command.
22     REDIS_ENCODING : str (default: "utf-8")
23         Default encoding to use when decoding messages from redis queue.
24 
25     """
26 
27     @classmethod
28     def from_crawler(self, crawler, *args, **kwargs):
29         obj = super(RedisSpider, self).from_crawler(crawler, *args, **kwargs)
30         obj.setup_redis(crawler)
31         return obj

You can control whether the start URLs are fetched from a redis list or from a set:

fetch_one = self.server.spop if use_set else self.server.lpop

In settings.py:

REDIS_START_URLS_AS_SET = False
import redis

conn = redis.Redis(host='140.143.227.206',port=8888,password='beta')


conn.lpush('chouti:start_urls','https://dig.chouti.com/r/pic/hot/1') # feed start URLs on demand: every URL pushed here gets picked up and crawled, so requests can keep coming in indefinitely
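For the set-based variant, the feeding side would use sadd instead of lpush (assuming REDIS_START_URLS_AS_SET = True in settings.py, so the spider fetches with spop rather than lpop); a minimal sketch:

import redis

conn = redis.Redis(host='140.143.227.206', port=8888, password='beta')
conn.sadd('chouti:start_urls', 'https://dig.chouti.com/r/pic/hot/1')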

 

(5) The settings file

1. BOT_NAME = 'cjk'  # the crawler (bot) name
2. SPIDER_MODULES = ['cjk.spiders']  # where the spiders live
3. NEWSPIDER_MODULE = 'cjk.spiders'  # where newly generated spiders are created
4. USER_AGENT = 'dbd (+http://www.yourdomain.com)'  # request header
5. ROBOTSTXT_OBEY = False  # True means honor the target site's robots.txt, False means ignore it
6. CONCURRENT_REQUESTS = 32  # concurrent requests across all spiders; with two spiders one might get 15 and the other 17
7. DOWNLOAD_DELAY = 3  # wait 3 seconds before each page download
8. CONCURRENT_REQUESTS_PER_DOMAIN = 16  # 16 concurrent requests per domain, or in other words 16 per spider
9. CONCURRENT_REQUESTS_PER_IP = 16  # a single domain may map to several IPs; with 2 IPs the total concurrency becomes 16*2=32
10. COOKIES_ENABLED = False  # whether scrapy manages cookies for us internally
11. TELNETCONSOLE_ENABLED = True  # lets you pause a running crawler and resume it again
# TELNETCONSOLE_HOST = '127.0.0.1'
# TELNETCONSOLE_PORT = [6023,]
12. DEFAULT_REQUEST_HEADERS  # default request headers
13. # AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
from scrapy.contrib.throttle import AutoThrottle
"""
Auto-throttle algorithm
    from scrapy.contrib.throttle import AutoThrottle
    auto-throttle settings
    1. read the minimum delay DOWNLOAD_DELAY
    2. read the maximum delay AUTOTHROTTLE_MAX_DELAY
    3. set the initial download delay AUTOTHROTTLE_START_DELAY
    4. when a request finishes downloading, take its "latency", i.e. the time from opening the connection to receiving the response headers
    5. used in the calculation: AUTOTHROTTLE_TARGET_CONCURRENCY
    target_delay = latency / self.target_concurrency
    new_delay = (slot.delay + target_delay) / 2.0  # slot.delay is the previous delay
    new_delay = max(target_delay, new_delay)
    new_delay = min(max(self.mindelay, new_delay), self.maxdelay)
    slot.delay = new_delay
"""
14. # HTTPCACHE_ENABLED = True  # enable the HTTP cache
# HTTPCACHE_EXPIRATION_SECS = 0
# HTTPCACHE_DIR = 'httpcache'
# HTTPCACHE_IGNORE_HTTP_CODES = []
# HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
# For example, when you know you won't have network access (say on the subway), you can download the pages beforehand and read them from the local cache; a directory called httpcache then appears locally
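To make the auto-throttle formulas above concrete, one update step can be sketched in plain Python (the numbers are made up; this only mirrors the formulas quoted above, not Scrapy's actual extension code):

def update_delay(slot_delay, latency, target_concurrency, mindelay, maxdelay):
    # one step of the delay update described above
    target_delay = latency / target_concurrency
    new_delay = (slot_delay + target_delay) / 2.0      # average with the previous delay
    new_delay = max(target_delay, new_delay)
    return min(max(mindelay, new_delay), maxdelay)

print(update_delay(slot_delay=5.0, latency=0.8, target_concurrency=1.0,
                   mindelay=3.0, maxdelay=60.0))       # -> 3.0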