Scrapy Crawler Study Series, Part 1: Tutorial

Date: 2019-12-07

Preface

I plan to write a series of articles recording what I learn while studying and using Scrapy. I will use Python 3.6 as the base runtime for Scrapy.

This article is my original work; please credit the source when reposting. It is reposted from my own blog, 伤神的博客: http://www.shangyang.me/2017/...html

Installing Scrapy

I have two versions of Python installed locally, 2.7 and 3.6. As mentioned in the preface, I will use the Python 3.6 environment to set up Scrapy.

$ pip install Scrapy

On my machine, this installs Scrapy for the default Python 2.7.

$ pip3 install Scrapy

This installs Scrapy for Python 3.x. I ran into a problem during installation, though: HTTPSConnectionPool(host='pypi.python.org', port=443): Read timed out. The fix is to extend the timeout during installation:

$ pip3 install -U --timeout 1000 Scrapy

Scrapy Tutorial

Creating the tutorial project

Run the following:

$ scrapy startproject tutorial
New Scrapy project 'tutorial', using template directory '/usr/local/lib/python2.7/site-packages/scrapy/templates/project', created in:
    /Users/mac/workspace/scrapy/tutorial

As you can see, Python 2.7 is used by default. What if you need to create a tutorial project for Python 3.x? Use python3 -m, as shown below:

$ python3 -m scrapy startproject tutorial
New Scrapy project 'tutorial', using template directory '/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/scrapy/templates/project', created in:
    /Users/mac/workspace/scrapy/tutorial
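
If you would rather not depend on which interpreter the scrapy script on your PATH happens to use, the same idea as python3 -m can be expressed from Python. This is only a minimal sketch: the helper name scrapy_command is hypothetical, and nothing is actually executed here.

```python
import sys


def scrapy_command(*args):
    """Build an argv list that runs Scrapy under the interpreter executing this script."""
    # sys.executable is the absolute path of the current Python interpreter
    return [sys.executable, "-m", "scrapy", *args]


cmd = scrapy_command("startproject", "tutorial")
print(cmd)
```

Passing cmd to subprocess.run would then start the project creation with exactly that interpreter.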

Importing into PyCharm

Simply open the project at /Users/mac/workspace/scrapy/tutorial. Note that PyCharm's default interpreter is my local Python 2.7, so it needs to be changed to Python 3.6. The steps are:

Click PyCharm Community Edition in the top-left corner and open Preferences.
Click Project: tutorial, select Project Interpreter, and set the interpreter version.

Project structure

The command generates the project skeleton (shown in a figure in the original post).

The first Spider

Let's create a new Spider in a file named quotes_spider.py and place it in the tutorial/spiders directory:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

As you can see, the new QuotesSpider class inherits from scrapy.Spider. Here is what its attributes and methods mean:

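name
The Spider's identifier; it must be unique within the project.
start_requests()
Must return an iterable of Requests for the Spider to crawl (a list, or a generator as here); each Request here is built from a URL and a callback.
parse()
The callback of the Request objects, called to handle the Response downloaded for each Request. It parses the returned content, and URLs found during parsing can be turned into new Requests to continue crawling.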

Now let's look at the implementation:

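start_requests(self) creates two Request objects to be crawled, one for http://quotes.toscrape.com/pa... and one for http://quotes.toscrape.com/pa..., and returns them one at a time via yield. Note: using yield makes the method a generator function, so calling it returns a generator.
parse(self, response) parses the Response returned for each Request. The logic here is simple: it creates one local file per page and writes response.body into it.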
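
The two mechanics described above (yielding requests lazily, and deriving the file name from the URL) can be tried without running a crawl. The following is a minimal sketch in plain Python; make_requests and filename_for are hypothetical helper names, and the tuples merely stand in for scrapy.Request objects.

```python
def make_requests(urls):
    """Yield one (url, callback_name) pair at a time, the way start_requests yields Requests."""
    for url in urls:
        yield (url, "parse")


def filename_for(url):
    """Derive the output file name with the same expression used in parse()."""
    # For '.../page/1/' the split ends in [..., 'page', '1', ''], so [-2] is the page number
    page = url.split("/")[-2]
    return 'quotes-%s.html' % page


urls = [
    'http://quotes.toscrape.com/page/1/',
    'http://quotes.toscrape.com/page/2/',
]
requests = make_requests(urls)  # no loop body has run yet; this is a generator object
first = next(requests)          # the first pair is produced only when asked for
names = [filename_for(u) for u in urls]
print(first, names)
```

Because the generator produces requests on demand, the consumer decides when each one is created, which is what allows a scheduler to pull requests at its own pace.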

How to run it

Running the spider requires the command line; note that it must be launched with the scrapy command:

$ cd /Users/mac/workspace/scrapy/tutorial
$ python3 -m scrapy crawl quotes

The output looks roughly like this:

... 
2016-12-16 21:24:05 [scrapy.core.engine] INFO: Spider opened
2016-12-16 21:24:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-12-16 21:24:05 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
2016-12-16 21:24:05 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)
2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-1.html
2016-12-16 21:24:05 [quotes] DEBUG: Saved file quotes-2.html
2016-12-16 21:24:05 [scrapy.core.engine] INFO: Closing spider (finished)
...

As you can see, the crawl produced two local HTML files, quotes-1.html and quotes-2.html.

How to extract data

Extracting from the command line

Scrapy provides a command-line tool for debugging extraction efficiently: start the Scrapy shell and quickly try out extraction against the page you want to crawl.

Entering the Scrapy shell

Let's try extracting data from "http://quotes.toscrape.com/page/1/" in the Scrapy shell. Run the following command to enter the shell:

$ scrapy shell "http://quotes.toscrape.com/page/1/"

Output

[ ... Scrapy log here ... ]
2016-09-19 12:09:27 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7fa91d888c90>
[s]   item       {}
[s]   request    <GET http://quotes.toscrape.com/page/1/>
[s]   response   <200 http://quotes.toscrape.com/page/1/>
[s]   settings   <scrapy.settings.Settings object at 0x7fa91d888c10>
[s]   spider     <DefaultSpider 'default' at 0x7fa91c8af990>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
>>>

We are now in the Scrapy shell. The output above shows information about the request and the response; the response status code 200 means the page was returned successfully.

Extracting with CSS selectors

Here we follow the CSS selectors standard, https://www.w3.org/TR/selectors/, to extract elements from the page.

Use css() to select the elements to extract. The following shows how to extract the <title/> element:
```
>>> response.css('title')
[<Selector xpath=u'descendant-or-self::title' data=u'<title>Quotes to Scrape</title>'>]
```
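As you can see, it returns a list-like SelectorList object containing the <title/> of the http://quotes.toscrape.com/pa... page; the matched markup is shown in the data attribute of the Selector object.
There are generally two ways to extract the text content of a Selector:
- Use the extract() or extract_first() method. The following extracts the text of the <title/> element returned above: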
```
>>> response.css('title::text').extract_first()
'Quotes to Scrape'
```
  extract_first() extracts the first Selector in the returned list. The following is equivalent:
```
>>> response.css('title::text')[0].extract()
'Quotes to Scrape'
```
  However, extract_first() returns None when nothing on the page matches, which avoids the IndexError that indexing with [0] would raise.
- Use the re() method to extract the text content with a regular expression:
```
>>> response.css('title::text').re(r'Quotes.*')
['Quotes to Scrape']
>>> response.css('title::text').re(r'Q\w+')
['Quotes']
>>> response.css('title::text').re(r'(\w+) to (\w+)')
['Quotes', 'Scrape']
```
  Note that the last regular expression returns two matched groups.
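
The behaviour of the two extraction styles above can be imitated with the standard library alone. This is a rough sketch, not Scrapy's implementation: first_or_none and flat_matches are hypothetical helpers that mimic what extract_first() and re() appear to do in the session above.

```python
import re


def first_or_none(values):
    """Return the first element, or None for an empty list (no IndexError)."""
    return values[0] if values else None


def flat_matches(pattern, text):
    """Return every match; when the pattern has groups, flatten the groups into one list."""
    out = []
    for m in re.finditer(pattern, text):
        # m.groups() is an empty tuple when the pattern defines no groups
        out.extend(m.groups() if m.groups() else [m.group(0)])
    return out


title = 'Quotes to Scrape'
whole = flat_matches(r'Quotes.*', title)
word = flat_matches(r'Q\w+', title)
groups = flat_matches(r'(\w+) to (\w+)', title)
missing = first_or_none([])  # an empty result list yields None instead of raising
print(whole, word, groups, missing)
```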

Using XPath

Besides CSS selectors, we can also use XPath expressions to extract elements, for example:

>>> response.xpath('//title')
[<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
>>> response.xpath('//title/text()').extract_first()
'Quotes to Scrape'

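XPath is more powerful than CSS for scraping, because it can not only navigate the HTML structure but also look at text content. It therefore handles cases that CSS selectors cannot, such as finding a link element that contains the text "Next Page". This makes it easier to build crawlers with XPath.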
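
The kind of text-based lookup described above can be tried without Scrapy, using the limited XPath subset in Python's standard xml.etree.ElementTree. This is only a sketch under assumptions: the markup below is a made-up pager, and ElementTree supports exact text matches ([.='...'], Python 3.7+) rather than the full XPath that response.xpath() accepts.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed pager markup used only for this illustration
markup = """<ul class="pager">
  <li class="prev"><a href="/page/1/">Previous Page</a></li>
  <li class="next"><a href="/page/3/">Next Page</a></li>
</ul>"""

root = ET.fromstring(markup)
# Select <a> elements by their text content rather than by class or position
links = root.findall(".//a[.='Next Page']")
hrefs = [a.get('href') for a in links]
print(hrefs)
```

In the Scrapy shell, a comparable query could be written with a full XPath expression such as response.xpath('//a[contains(text(), "Next Page")]').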

Extracting quotes and authors

Next we show how to extract the content of the http://quotes.toscrape.com home page. First, look at the structure of the home page.

Each block contains a quote from a famous person. How do we extract all of the related information?

We start with the first quote, which is by Einstein, and try to extract the quote text, the author, and the associated tags:

<div class="quote">
    <span class="text">“The world as we have created it is a process of our
    thinking. It cannot be changed without changing our thinking.”</span>
    <span>
        by <small class="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
    </span>
    <div class="tags">
        Tags:
        <a class="tag" href="/tag/change/page/1/">change</a>
        <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
        <a class="tag" href="/tag/thinking/page/1/">thinking</a>
        <a class="tag" href="/tag/world/page/1/">world</a>
    </div>
</div>

Now let's extract the information step by step.

First, enter the Scrapy shell:

$ scrapy shell 'http://quotes.toscrape.com'

Then get the list of <div class="quote" /> elements:

>>> response.css("div.quote")

This returns a list of Selectors. Since we are only parsing the first quote for now, we take the first element and store it in the quote variable:

>>> quote = response.css("div.quote")[0]

Then we extract the title, author, and tags in turn.

Extract the title:

>>> title = quote.css("span.text::text").extract_first()
>>> title
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'

Extract the author:

>>> author = quote.css("small.author::text").extract_first()
>>> author
'Albert Einstein'

Extract the tags; note that tags is a list of strings:

>>> tags = quote.css("div.tags a.tag::text").extract()
>>> tags
['change', 'deep-thoughts', 'thinking', 'world']

OK, that completes the extraction for one quote. How do we extract all of the quotes?

>>> for quote in response.css("div.quote"):
...     text = quote.css("span.text::text").extract_first()
...     author = quote.css("small.author::text").extract_first()
...     tags = quote.css("div.tags a.tag::text").extract()
...     print(dict(text=text, author=author, tags=tags))
{'tags': ['change', 'deep-thoughts', 'thinking', 'world'], 'author': 'Albert Einstein', 'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'}
{'tags': ['abilities', 'choices'], 'author': 'J.K. Rowling', 'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”'}
    ... a few more of these, omitted for brevity

This is just a loop that puts each quote's information into a Python dictionary.

Extracting with a Python program

This section continues the example from the "Extracting quotes and authors" section and shows how to do the same scraping with a Python program.

Extracting data

Modify the earlier quotes_spider.py as follows:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

Run the spider named quotes:

$ scrapy crawl quotes

The result looks like this:

2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'tags': ['life', 'love'], 'author': 'André Gide', 'text': '“It is better to be hated for what you are than to be loved for what you are not.”'}
2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'tags': ['edison', 'failure', 'inspirational', 'paraphrased'], 'author': 'Thomas A. Edison', 'text': "“I have not failed. I've just found 10,000 ways that won't work.”"}

As you can see, the quotes spider we wrote in Python returns the scraped items one by one.

Saving data

The simplest way to save the scraped data is Feed exports, using the following commands.

Using the JSON format

$ scrapy crawl quotes -o quotes.json

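This command generates a file named quotes.json containing all of the scraped items. For historical reasons, however, Scrapy appends to the file rather than overwriting it, so if you run the command twice you end up with a broken JSON file.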

Using the JSON Lines format

$ scrapy crawl quotes -o quotes.jl

The saved file is now in JSON Lines format; note that the only change is the file extension, .jl.

As a supplement: JSON Lines is another JSON-based format in which every line is a valid JSON value. For example, it can represent tabular data in a friendlier way than CSV:

["Name", "Session", "Score", "Completed"]
["Gilbert", "2013", 24, true]
["Alexa", "2013", 29, true]
["May", "2012B", 14, false]
["Deloise", "2012A", 19, true]

It also supports nested data:

{"name": "Gilbert", "wins": [["straight", "7♣"], ["one pair", "10♥"]]}
{"name": "Alexa", "wins": [["two pair", "4♠"], ["two pair", "9♠"]]}
{"name": "May", "wins": []}
{"name": "Deloise", "wins": [["three of a kind", "5♣"]]}

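The JSON Lines format is well suited to files with large amounts of data, because you can iterate and process one record per line. Note that with JSON Lines Scrapy still appends to the file, but because each record is written on its own line, appending does not corrupt the file the way it does with the JSON format.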
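
The difference between appending to a JSON file and appending to a JSON Lines file can be reproduced with the standard json module. This is a minimal sketch that uses in-memory strings instead of real files; the two sample items are made up for illustration.

```python
import json

items = [{"author": "Albert Einstein"}, {"author": "J.K. Rowling"}]

# Two runs appended to one .json file: two arrays written back to back
json_file = json.dumps(items) + json.dumps(items)
try:
    json.loads(json_file)
    json_ok = True
except json.JSONDecodeError:
    # The second array is "extra data" after a complete JSON document
    json_ok = False

# Two runs appended to one .jl file: one object per line, repeated
jl_file = "".join(json.dumps(item) + "\n" for item in items) * 2
rows = [json.loads(line) for line in jl_file.splitlines()]

print(json_ok, len(rows))
```

Each line of the JSON Lines text is parsed on its own, so the appended second run simply adds more records.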

For a small project, JSON Lines is enough. If you face a more complex project with more complex data to scrape, you can use an Item Pipeline; a demo pipelines file has already been created for you at tutorial/pipelines.py.
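
To give a feel for what an Item Pipeline looks like, here is a minimal sketch. The class name StripQuotesPipeline and its cleaning rule are hypothetical; process_item(self, item, spider) is the method Scrapy calls on a pipeline for each scraped item. The example calls it by hand on a plain dict so it runs without Scrapy.

```python
class StripQuotesPipeline:
    """Hypothetical pipeline: removes the curly quotation marks around each quote text."""

    def process_item(self, item, spider):
        text = item.get('text')
        if text is not None:
            # \u201c and \u201d are the left and right curly double quotes
            item['text'] = text.strip('\u201c\u201d')
        # Returning the item hands it on to the next pipeline stage
        return item


pipeline = StripQuotesPipeline()
cleaned = pipeline.process_item(
    {'text': '\u201cHello\u201d', 'author': 'A'},
    spider=None,  # no real spider is needed for this manual call
)
print(cleaned)
```

In a real project such a class would live in tutorial/pipelines.py and be enabled through the ITEM_PIPELINES setting in settings.py, for example {'tutorial.pipelines.StripQuotesPipeline': 300}.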

Extracting the next page (extracting links)

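The "How to extract data" section described in detail how to scrape the content of a page. How do we scrape the whole site? We have to extract and follow its links. Still using http://quotes.toscrape.com as the example, let's see how to extract link information.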

Here is the HTML of the next-page link:

<ul class="pager">
    <li class="next">
        <a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a>
    </li>
</ul>

We can grab it in the shell:

>>> response.css('li.next a').extract_first()
'<a href="/page/2/">Next <span aria-hidden="true">→</span></a>'

This gives us the anchor element, but what we want is its href attribute. Scrapy supports a CSS extension for this, so we can extract the attribute value directly:

>>> response.css('li.next a::attr(href)').extract_first()
'/page/2/'

Good, we now know how to get the relative URL of the next page. How do we modify our Python program so that it automatically crawls every page?

Using scrapy.Request

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

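Briefly, the program works as follows: the for loop finishes scraping the current page, then the code looks for the next page. It first reads the relative path of the next page into the variable next_page, then calls response.urljoin(next_page) to get the absolute URL. Finally it yields a new scrapy.Request for that URL, which is added to the crawl queue to await its turn. This lets you crawl all of the related pages dynamically.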
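
The relative-to-absolute step and the "keep following until there is no next page" loop can be sketched with urllib.parse.urljoin from the standard library, which resolves a link against a base URL in the same spirit as response.urljoin. The FAKE_SITE mapping and the crawl function are hypothetical stand-ins for the real site and for Scrapy's scheduler.

```python
from urllib.parse import urljoin

# Hypothetical stand-in for the site: each page URL maps to its "next" href (or None)
FAKE_SITE = {
    'http://quotes.toscrape.com/page/1/': '/page/2/',
    'http://quotes.toscrape.com/page/2/': '/page/3/',
    'http://quotes.toscrape.com/page/3/': None,
}


def crawl(start_url):
    """Visit pages in order, resolving each relative next link against the current URL."""
    visited = []
    url = start_url
    while url is not None:
        visited.append(url)
        next_page = FAKE_SITE[url]
        # A missing next link ends the crawl, like the `is not None` check in parse()
        url = urljoin(url, next_page) if next_page is not None else None
    return visited


visited = crawl('http://quotes.toscrape.com/page/1/')
print(visited)
```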

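On this basis you can build very complex crawlers; likewise, you can write different parsers for different kinds of links, so that each type of returned page is handled separately.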

Using response.follow

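Unlike scrapy.Request, which needs an absolute URL built from the relative path, response.follow accepts a relative path directly, so there is no need to call urljoin. Note that response.follow simply returns a Request instance, which you still have to yield. The code above can therefore be simplified to: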

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('span small::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

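In addition, when response.follow is given a selector for an <a> element, it uses the element's href attribute automatically, so the pagination code can be shortened further to: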

        # Pass the <a> selector itself; response.follow reads its href attribute
        for a in response.css('li.next a'):
            yield response.follow(a, callback=self.parse)

The selector therefore no longer needs to name the <a> element's attribute explicitly.

Defining more parsers

import scrapy

class AuthorSpider(scrapy.Spider):

    name = 'author'

    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # follow links to author pages
        for href in response.css('.author + a::attr(href)'):
            yield response.follow(href, self.parse_author)

        # follow pagination links
        for href in response.css('li.next a::attr(href)'):
            yield response.follow(href, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).extract_first().strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }

This example defines two parsing methods, parse() and parse_author(): one drives the overall crawl and the other parses author information. Let's walk through the flow:

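1. In parse(), extract the href attribute of every author link on the current page. For each link, response.follow creates a new Request, and its callback parse_author() parses the downloaded page to extract the author's information.
2. parse() then yields a Request for the next page, again with parse() as the callback.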

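The point to note from this example is that while parsing the current page we can create child Requests for its sub-links. Scrapy schedules these Requests and processes them asynchronously, so the author pages are not guaranteed to finish before the next listing page is fetched.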
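
The way two callbacks cooperate through a queue of pending requests can be sketched in plain Python. Everything here is a simplified assumption: LISTING is a made-up two-page site, requests are (url, callback) tuples, and run is a toy first-in-first-out scheduler, whereas Scrapy's real engine is asynchronous and more elaborate.

```python
from collections import deque

# Hypothetical two-page site: listing pages link to author pages and to a next page
LISTING = {
    '/page/1/': {'authors': ['/author/A', '/author/B'], 'next': '/page/2/'},
    '/page/2/': {'authors': ['/author/A'], 'next': None},
}


def parse(url):
    page = LISTING[url]
    for href in page['authors']:
        yield (href, parse_author)   # child request, handled by another callback
    if page['next'] is not None:
        yield (page['next'], parse)  # pagination request, handled by parse again


def parse_author(url):
    yield {'author_page': url}       # an item rather than a request


def run(start):
    """Toy scheduler: pop a request, call its callback, queue any new requests."""
    queue, items = deque([(start, parse)]), []
    while queue:
        url, callback = queue.popleft()
        for result in callback(url):
            if isinstance(result, dict):
                items.append(result)
            else:
                queue.append(result)
    return items


items = run('/page/1/')
print(items)
```

Note that /author/A is processed twice in this toy version, because it is linked from both listing pages and nothing filters repeated requests.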

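Also note that when crawling the whole site, the same author will have many quotes, so the same author links would be requested many times. That would add crawl load and produce a lot of redundant data. For this reason Scrapy by default filters out requests for URLs it has already visited; the DUPEFILTER_CLASS setting controls which filter is used.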
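
The idea behind duplicate filtering can be shown with a set of already-seen URLs. This is a deliberately simplified sketch: SeenFilter is a hypothetical class, and it compares raw URL strings, whereas Scrapy's default filter works on request fingerprints rather than on the URL text alone.

```python
class SeenFilter:
    """Hypothetical duplicate filter that remembers every URL it has been asked about."""

    def __init__(self):
        self.seen = set()

    def request_seen(self, url):
        """Return True if the URL was requested before; otherwise record it and return False."""
        if url in self.seen:
            return True
        self.seen.add(url)
        return False


dupefilter = SeenFilter()
candidates = [
    '/author/Albert-Einstein',
    '/author/J-K-Rowling',
    '/author/Albert-Einstein',  # the same author linked from another quote
]
# Keep only the requests that have not been seen yet
to_fetch = [url for url in candidates if not dupefilter.request_seen(url)]
print(to_fetch)
```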

Using Spider arguments

You can pass arguments to your Spider from the command line:

$ scrapy crawl quotes -o quotes-humor.json -a tag=humor

The argument is passed to the Spider's __init__ method and by default becomes an attribute of the quotes Spider; in the Spider's Python code you can read it as self.tag.

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        url = 'http://quotes.toscrape.com/'
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + 'tag/' + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

getattr(self, 'tag', None) retrieves the tag argument passed on the command line, and the spider builds the URL to crawl from it: http://quotes.toscrape.com/ta...
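
How a command-line argument ends up as an attribute, and why getattr with a default is used, can be seen in a small stand-alone sketch. MiniSpider is a hypothetical stand-in for scrapy.Spider that simply stores its keyword arguments as attributes; the URL logic is copied from start_requests above.

```python
class MiniSpider:
    """Hypothetical stand-in for scrapy.Spider: keyword arguments become attributes."""

    def __init__(self, **kwargs):
        # Roughly what happens to the values given with -a on the command line
        self.__dict__.update(kwargs)

    def start_url(self):
        url = 'http://quotes.toscrape.com/'
        # getattr with a default avoids an AttributeError when no tag was passed
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + 'tag/' + tag
        return url


with_tag = MiniSpider(tag='humor').start_url()
without_tag = MiniSpider().start_url()
print(with_tag, without_tag)
```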
