Scraping Data with Scrapy

Date: 2019-12-10

Originally published at: http://blog.javachen.com/2014/05/24/using-scrapy-to-cralw-data.html

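Scrapy is a fast, high-level screen-scraping and web-crawling framework written in Python, used to crawl websites and extract structured data from their pages. It is widely applicable and can be used for data mining, monitoring, and automated testing.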

Official site: http://www.scrapy.org/
Chinese documentation: Scrapy 0.22 documentation
GitHub project page: https://github.com/scrapy/scrapy

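Scrapy uses the Twisted asynchronous networking library to handle network communication. The overall architecture is roughly as follows (note: the image comes from the internet):

Scrapy consists mainly of the following components: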

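Engine: controls the data flow across the whole system and triggers events.
Scheduler: accepts requests sent by the engine, pushes them onto a queue, and returns them when the engine asks for the next request.
Downloader: downloads page content and returns it to the spiders.
Spiders: do the main work; you use them to define the parsing rules for specific domains or pages.
Item pipeline: processes the items that spiders extract from pages; its main tasks are cleaning, validating, and storing the data. After a page is parsed by a spider, the resulting items are sent to the item pipeline and processed by several components in a specific order.
Downloader middlewares: a hook framework sitting between the Scrapy engine and the downloader, mainly handling the requests and responses that pass between them.
Spider middlewares: a hook framework sitting between the Scrapy engine and the spiders, whose main job is to process the spiders' response input and request output.
Scheduler middlewares: middleware sitting between the Scrapy engine and the scheduler, handling the requests and responses sent from the engine to the scheduler.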
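
The data flow among these components can be sketched in plain Python. This is a toy model under stated assumptions, not how Scrapy is implemented: the names FAKE_WEB, download, parse, pipeline, and run_engine are made up for illustration, and the "downloader" reads from an in-memory dict instead of the network.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, links found on the page).
FAKE_WEB = {
    "http://example.com/": ("home", ["http://example.com/a", "http://example.com/b"]),
    "http://example.com/a": ("page a", []),
    "http://example.com/b": ("page b", ["http://example.com/a"]),
}


def download(url):
    """Downloader: fetch a page and hand the response back."""
    text, links = FAKE_WEB[url]
    return {"url": url, "text": text, "links": links}


def parse(response):
    """Spider: turn a response into one item plus follow-up URLs."""
    item = {"url": response["url"], "title": response["text"].upper()}
    return [item], response["links"]


def pipeline(item):
    """Item pipeline: clean the item, then 'store' it by returning it."""
    item["title"] = item["title"].strip()
    return item


def run_engine(start_urls):
    """Engine: move requests and items between the other components."""
    scheduler = deque(start_urls)   # scheduler: queue of pending requests
    seen = set(start_urls)
    stored = []
    while scheduler:
        url = scheduler.popleft()
        items, links = parse(download(url))
        stored.extend(pipeline(item) for item in items)
        for link in links:
            if link not in seen:    # never schedule the same URL twice
                seen.add(link)
                scheduler.append(link)
    return stored


results = run_engine(["http://example.com/"])
print([item["url"] for item in results])
```

In the real framework these steps run asynchronously on top of Twisted; the sequential loop here only shows who hands what to whom.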

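Scrapy makes collecting data from the web very convenient: it does a large amount of the work for us, so we do not have to spend great effort developing it ourselves.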

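1. Installation

Installing Python

At the time of writing, the latest Scrapy release is 0.22.2, which requires Python 2.7, so Python 2.7 must be installed first. I am testing on a CentOS server here; since the system ships with its own Python, check its version first.

Check the Python version: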

$ python -V
Python 2.6.6

Upgrade to 2.7:

# Python 2.7.6:
$ wget http://python.org/ftp/python/2.7.6/Python-2.7.6.tar.xz
$ tar xf Python-2.7.6.tar.xz
$ cd Python-2.7.6
$ ./configure --prefix=/usr/local --enable-unicode=ucs4 --enable-shared LDFLAGS="-Wl,-rpath /usr/local/lib"
$ make && make altinstall

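Create a symlink so that the system default python points to python2.7. Note that yum on CentOS is itself a Python script written for the system Python, so after this change the first line of /usr/bin/yum may need to point at /usr/bin/python2.6.6 for the yum commands below to keep working.

$ mv /usr/bin/python /usr/bin/python2.6.6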
$ ln -s /usr/local/bin/python2.7 /usr/bin/python

Check the Python version again:

$ python -V
Python 2.7.6
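
The same check can be made from inside the interpreter, which is handy in a setup script. This is a minimal sketch; meets_minimum is a hypothetical helper name, not part of Python or Scrapy.

```python
import sys


def meets_minimum(version_info, minimum=(2, 7)):
    """Return True if the (major, minor) pair is at least `minimum`."""
    return tuple(version_info[:2]) >= minimum


# Report the running interpreter's version and whether it is new enough.
print("%s %s" % (sys.version.split()[0], meets_minimum(sys.version_info)))
```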

Installing setuptools

Here setuptools is installed via wget:

$ wget https://bootstrap.pypa.io/ez_setup.py -O - | python

Installing zope.interface

$ easy_install zope.interface

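Installing Twisted

Scrapy uses the Twisted asynchronous networking library to handle network communication, so Twisted must be installed.

Before installing Twisted, install gcc:

$ yum install gcc -y

Then install Twisted with easy_install:

$ easy_install twisted

If you see the following error:

$ easy_install twisted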
Searching for twisted
Reading https://pypi.python.org/simple/twisted/
Best match: Twisted 14.0.0
Downloading https://pypi.python.org/packages/source/T/Twisted/Twisted-14.0.0.tar.bz2#md5=9625c094e0a18da77faa4627b98c9815
Processing Twisted-14.0.0.tar.bz2
Writing /tmp/easy_install-kYHKjn/Twisted-14.0.0/setup.cfg
Running Twisted-14.0.0/setup.py -q bdist_egg --dist-dir /tmp/easy_install-kYHKjn/Twisted-14.0.0/egg-dist-tmp-vu1n6Y
twisted/runner/portmap.c:10:20: error: Python.h: No such file or directory
twisted/runner/portmap.c:14: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘*’ token
twisted/runner/portmap.c:31: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘*’ token
twisted/runner/portmap.c:45: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘PortmapMethods’
twisted/runner/portmap.c: In function ‘initportmap’:
twisted/runner/portmap.c:55: warning: implicit declaration of function ‘Py_InitModule’
twisted/runner/portmap.c:55: error: ‘PortmapMethods’ undeclared (first use in this function)
twisted/runner/portmap.c:55: error: (Each undeclared identifier is reported only once
twisted/runner/portmap.c:55: error: for each function it appears in.)

install python-devel and then run the command again:

$ yum install python-devel -y
$ easy_install twisted

If you see the following error:

error: Not a recognized archive type: /tmp/easy_install-tVwC5O/Twisted-14.0.0.tar.bz2

download the package and install it manually:

$ wget https://pypi.python.org/packages/source/T/Twisted/Twisted-14.0.0.tar.bz2#md5=9625c094e0a18da77faa4627b98c9815
$ tar -vxjf Twisted-14.0.0.tar.bz2
$ cd Twisted-14.0.0
$ python setup.py install

Installing pyOpenSSL

Install some dependencies first:

$ yum install libffi libffi-devel openssl-devel -y

Then install pyOpenSSL with easy_install:

$ easy_install pyOpenSSL

Installing Scrapy

Install some dependencies first:

$ yum install libxml2 libxslt libxslt-devel -y

Finally, install Scrapy:

$ easy_install scrapy

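2. Using Scrapy

Once the installation succeeds, you can learn some of Scrapy's basic concepts and usage, and study dirbot, the Scrapy example project.

The dirbot project lives at https://github.com/scrapy/dirbot. It contains a README file that describes the project in detail. If you are familiar with git, you can check out its source code; otherwise you can click Downloads to get a tarball or zip archive.

The following uses that example to describe how to create a crawler project with Scrapy.

Creating a project

Before crawling, you need to create a new Scrapy project. Change into a directory where you want to keep the code, then run:

$ scrapy startproject tutorial

This command creates a new directory named tutorial under the current directory, with the following structure: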

.
├── scrapy.cfg
└── tutorial
    ├── __init__.py
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        └── __init__.py

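These files are:

scrapy.cfg: the project configuration file
tutorial/: the project's Python module; your code will be imported from here
tutorial/items.py: the project's items file
tutorial/pipelines.py: the project's pipelines file
tutorial/settings.py: the project's settings file
tutorial/spiders: the directory where spiders are placed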

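Defining an Item

Items are containers that will hold the scraped data. They work like Python dictionaries but provide additional protection, such as guarding against populating undeclared fields, which catches typos.

An item is declared by creating a scrapy.item.Item subclass and defining its attributes as scrapy.item.Field objects, much like an object-relational mapping (ORM).
We model the item on the site data we want to capture from dmoz.org. For example, to capture each site's name, URL, and description, we define fields for those three attributes. To do so, edit the items.py file in the tutorial directory; our Item class looks like this:

from scrapy.item import Item, Field
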
class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()

This may seem confusing at first, but defining these items lets the other Scrapy components know exactly what your items are.
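
The protection against undeclared fields can be illustrated without Scrapy installed. The sketch below is a stand-in written for this article, not Scrapy's Item class: StrictItem and DmozLikeItem are hypothetical names, and the real class declares fields with Field() objects rather than a tuple of strings.

```python
class StrictItem(dict):
    """Dict-like container that only accepts declared field names."""
    fields = ()

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError("%s does not support field: %s"
                           % (self.__class__.__name__, key))
        dict.__setitem__(self, key, value)


class DmozLikeItem(StrictItem):
    # Same three fields as the DmozItem defined above.
    fields = ("title", "link", "desc")


item = DmozLikeItem()
item["title"] = "Example site"
try:
    item["titel"] = "typo"      # misspelled field name is rejected
except KeyError as exc:
    print("rejected: %s" % exc)
```

With a plain dict, the misspelled key would be stored silently and the mistake would only show up later as missing data.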

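Writing a Spider

A Spider is a user-written class used to scrape information from a domain (or group of domains). It defines an initial list of URLs to download, how to follow links, and how to parse the contents of those pages to extract items.

To create a Spider, subclass scrapy.spider.BaseSpider and define three main, mandatory attributes:

name: the spider's identifier. It must be unique; different spiders must have different names.
start_urls: the list of URLs where the spider begins crawling. The first pages downloaded are these URLs; subsequent URLs are generated from the data contained in them.
parse(): a method of the spider that is called with the Response object downloaded from each URL; the response is the method's only argument.

This method is responsible for parsing the response data, extracting the scraped data (as items), and finding more URLs to follow.

Create DmozSpider.py in the tutorial/spiders directory:

from scrapy.spider import BaseSpider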

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        # Save each page body to a file named after the last URL path segment.
        with open(filename, 'wb') as f:
            f.write(response.body)

Running the project

$ scrapy crawl dmoz

This command starts the spider for the dmoz.org domain. The last argument, dmoz, is the value of the name attribute in DmozSpider.py.
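
The parse method above names each output file with response.url.split("/")[-2]. The snippet below works through that expression for the two start URLs; filename_for is a hypothetical helper that only wraps the same expression.

```python
start_urls = [
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
]


def filename_for(url):
    # The URL ends with "/", so the last piece after splitting is empty
    # and the second-to-last piece is the final path segment.
    return url.split("/")[-2]


for url in start_urls:
    print(filename_for(url))
```

So a successful crawl should leave two files, Books and Resources, in the directory where the command was run.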

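XPath selectors

Scrapy uses a mechanism called XPath selectors, which is based on XPath expressions. To learn more about selectors and other extraction mechanisms, consult further references.

Here are some example XPath expressions and their meanings:

/html/head/title: selects the <title> element inside the <head> element of the HTML document.
/html/head/title/text(): selects the text content of the <title> element above.
//td: selects all <td> elements.
//div[@class="mine"]: selects all div elements that have the attribute class="mine".

These are only a few simple examples; XPath is actually far more powerful. To learn more about XPath, we recommend working through an XPath tutorial.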
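
These expressions can be tried out with nothing but the standard library. The sketch below uses xml.etree.ElementTree, which supports only a subset of XPath and requires well-formed markup; it is not what Scrapy uses internally, and the sample document is made up for illustration.

```python
import xml.etree.ElementTree as ET

# A small, well-formed sample document (hypothetical content).
doc = ET.fromstring(
    "<html>"
    "<head><title>Sample page</title></head>"
    "<body>"
    "<div class='mine'>first</div>"
    "<div class='other'>second</div>"
    "<table><tr><td>a</td><td>b</td></tr></table>"
    "</body>"
    "</html>"
)

# /html/head/title/text(): ElementTree paths are relative to the root
# element (<html>), and the text is read from the .text attribute.
title_text = doc.find("head/title").text
print(title_text)

# //td: every <td> element anywhere in the document.
cells = [td.text for td in doc.findall(".//td")]
print(cells)

# //div[@class="mine"]: every <div> whose class attribute equals "mine".
mine = [div.text for div in doc.findall(".//div[@class='mine']")]
print(mine)
```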

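To make working with XPath convenient, Scrapy provides the Selector class, which has four methods:

xpath(): returns a list of selectors, each representing a node selected by the XPath expression given as argument.
extract(): returns a unicode string containing the data selected by the XPath selector.
re(): returns a list of unicode strings extracted by applying the regular expression given as argument.
css(): returns a list of selectors, each representing a node selected by the CSS expression given as argument.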

Extracting data

We can select every <li> element on the site with:

sel.xpath('//ul/li')

The site descriptions:

sel.xpath('//ul/li/text()').extract()

The site titles:

sel.xpath('//ul/li/a/text()').extract()

The site links:

sel.xpath('//ul/li/a/@href').extract()

As mentioned above, each xpath() call returns a list of selectors, so we can chain further xpath() calls to dig into deeper nodes. We will make use of that here:

sites = sel.xpath('//ul/li')
for site in sites:
    title = site.xpath('a/text()').extract()
    link = site.xpath('a/@href').extract()
    desc = site.xpath('text()').extract()
    print title, link, desc

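Using Items

The interface of scrapy.item.Item is similar to Python's dict, and an Item contains multiple scrapy.item.Field objects. This is similar to Django's Model.

An Item is typically used in the Spider's parse method to hold the parsed data.

Finally, modify the spider class to store the data in an Item. The code below follows the dirbot project, so it imports dirbot's Website item (with name, url, and description fields):

from scrapy.spider import Spider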
from scrapy.selector import Selector

from dirbot.items import Website


class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        """
        The lines below is a spider contract. For more info see:
        http://doc.scrapy.org/en/latest/topics/contracts.html

        @url http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/
        @scrapes name
        """
        sel = Selector(response)
        sites = sel.xpath('//ul[@class="directory-url"]/li')
        items = []

        for site in sites:
            item = Website()
            item['name'] = site.xpath('a/text()').extract()
            item['url'] = site.xpath('a/@href').extract()
            item['description'] = site.xpath('text()').re(r'-\s([^\n]*?)\n')
            items.append(item)

        return items
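
The regular expression passed to re() in the spider is easier to follow when run on its own. This sketch applies the same pattern with the standard re module; the text nodes are hypothetical samples shaped like the text that follows each link, and findall only approximates what the selector's re() method returns.

```python
import re

# Same pattern as in the spider, written as a raw string.
DESC_PATTERN = re.compile(r'-\s([^\n]*?)\n')

# Hypothetical text nodes for illustration.
text_nodes = [
    "\n",
    " - A book about Python programming.\n",
    " - Tutorials, no trailing newline",
]

for text in text_nodes:
    # findall returns the captured group: the text between "- " and the newline.
    print(DESC_PATTERN.findall(text))
```

Only the second node yields a description: the pattern needs a dash, one whitespace character, and a terminating newline.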

Now run the project again to see the results:

$ scrapy crawl dmoz

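Using an Item Pipeline

Set ITEM_PIPELINES in settings.py; it defaults to [] and works much like Django's MIDDLEWARE_CLASSES.
Item data returned from the Spider's parse method is processed in turn by each Pipeline class listed in ITEM_PIPELINES.

An Item Pipeline class must implement the following method:

process_item(item, spider): called for every item pipeline component. It must either return a scrapy.item.Item instance or raise a scrapy.exceptions.DropItem exception; once the exception is raised, the item is not processed by any later pipeline. Parameters:
- item (Item object): the Item returned by the parse method
- spider (BaseSpider object): the spider that scraped the Item

It may additionally implement the following two methods:

open_spider(spider): called when the spider is opened. Parameter: spider (BaseSpider object), the spider that was opened
close_spider(spider): called when the spider is closed. Parameter: spider (BaseSpider object), the spider that was closed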
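
The process_item contract can be sketched without Scrapy. Everything below is illustrative: DropItem is a local stand-in for scrapy.exceptions.DropItem, the two pipeline classes and run_pipelines are hypothetical names, and plain dicts take the place of Item objects.

```python
class DropItem(Exception):
    """Local stand-in for scrapy.exceptions.DropItem."""


class RequireNamePipeline(object):
    """Drop items that have no name."""

    def process_item(self, item, spider):
        if not item.get("name"):
            raise DropItem("missing name")
        return item


class StripWhitespacePipeline(object):
    """Trim surrounding whitespace from every string value."""

    def process_item(self, item, spider):
        return dict((key, value.strip()) for key, value in item.items())


def run_pipelines(items, pipelines, spider=None):
    """Pass each item through the pipelines in order."""
    kept = []
    for item in items:
        try:
            for pipeline in pipelines:
                item = pipeline.process_item(item, spider)
        except DropItem:
            continue            # later pipelines never see a dropped item
        kept.append(item)
    return kept


scraped = [
    {"name": "  Python Books ", "url": "http://example.com/books"},
    {"name": "", "url": "http://example.com/empty"},
]
cleaned = run_pipelines(scraped, [RequireNamePipeline(), StripWhitespacePipeline()])
print(cleaned)
```

In a real project the framework plays the role of run_pipelines, and the classes only need to be listed in ITEM_PIPELINES.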

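Saving the scraped data

The simplest way to save the data is with Feed exports, using this command:

$ scrapy crawl dmoz -o items.json -t json

Besides JSON, the JSON lines, CSV, and XML formats are supported, and you can add other formats through the extension interface.

For small projects this approach is sufficient. For more complex data you may need to write an Item Pipeline to process it.

All scraped items are saved in JSON format in the newly created items.json file.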
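
Once the crawl has finished, the exported file can be read back with the standard json module. The sketch below parses an in-memory sample instead of the real file; the sample contents are hypothetical and assume the export is a single JSON array whose fields are lists, as produced by extract() and re() in the spider above.

```python
import json

# Hypothetical contents of items.json: one JSON array of item objects.
sample = """[
  {"name": ["Example Book"], "url": ["http://example.com/book"],
   "description": ["About Python."]},
  {"name": ["Another Book"], "url": ["http://example.com/other"],
   "description": []}
]"""

items = json.loads(sample)
for item in items:
    # Each field holds a list of strings, which may be empty.
    name = item["name"][0] if item["name"] else ""
    print("%s: %d description(s)" % (name, len(item["description"])))
```

For the real file, json.load on the opened items.json would take the place of json.loads(sample).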

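Recap

The above describes the process of creating a crawler project; you can follow it to practice the steps yourself. As a learning example, you can also refer to this article: scrapy 中文教程（爬cnbeta实例） (a Chinese-language Scrapy tutorial that crawls cnbeta).

The spider class from that article is as follows:

from scrapy.contrib.spiders import CrawlSpider, Rule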
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector

from cnbeta.items import CnbetaItem

class CBSpider(CrawlSpider):
    name = 'cnbeta'
    allowed_domains = ['cnbeta.com']
    start_urls = ['http://www.cnbeta.com']

    rules = (
        Rule(SgmlLinkExtractor(allow=('/articles/.*\.htm', )),
             callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        item = CnbetaItem()
        sel = Selector(response)
        item['title'] = sel.xpath('//title/text()').extract()
        item['url'] = response.url
        return item

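Points to note:

The spider class inherits from CrawlSpider and defines rules; rules specifies that any link whose URL matches the regular expression /articles/.*\.htm is followed.
The class does not implement a parse method; instead, the rule defines the callback function parse_page. Consult further references to learn more about using CrawlSpider.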
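
To see which links such a rule would pick up, the allow pattern can be tested with the standard re module. This is a sketch under the assumption that the link extractor searches for the pattern anywhere in the full URL; the URLs are made up for illustration.

```python
import re

# Same pattern as the allow argument in the rule above.
ALLOW = re.compile(r'/articles/.*\.htm')

# Hypothetical URLs for illustration.
urls = [
    "http://www.cnbeta.com/articles/123456.htm",
    "http://www.cnbeta.com/articles/tech/123456.html",
    "http://www.cnbeta.com/topics/9.htm",
    "http://www.cnbeta.com/articles/",
]

matched = [url for url in urls if ALLOW.search(url)]
print(matched)
```

Because the pattern is not anchored at the end, URLs ending in .html match as well as those ending in .htm.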

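3. Learning resources

I came to Scrapy because I wanted to crawl some data from Zhihu. At the start I searched for related material and for other people's implementations.

Several people on GitHub have already implemented Zhihu crawling to varying degrees. The repositories I found include:

https://github.com/KeithYue/Zhihu_Spider logs in with a username and password before crawling; see zhihu_spider.py for the code.
https://github.com/immzz/zhihu-scrapy uses selenium to download pages and execute JavaScript code.
https://github.com/tangerinewhite32/zhihu-stat-py
https://github.com/Zcc/zhihu mainly crawls the topanswers of given topics as well as user profiles; includes login code.
https://github.com/pelick/VerticleSearchEngine builds on crawled academic resources to provide search, recommendation, visualization, and sharing. Uses Scrapy, MongoDB, Apache Lucene/Solr, Apache Tika, and other technologies.
https://github.com/geekan/scrapy-examples a collection of Scrapy examples, including scraping Douban, LinkedIn, and Tencent recruitment data.
https://github.com/owengbs/deeplearning implements paginated fetching of topics.
https://github.com/gnemoug/distribute_crawler a distributed web crawler built with Scrapy, Redis, MongoDB, and Graphite: a MongoDB cluster for storage, Redis for distribution, and Graphite for displaying crawler status.
https://github.com/weizetao/spider-roach a simple implementation of a distributed, targeted crawling cluster.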

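Other resources:

http://www.52ml.net/tags/Scrapy collects many articles about Scrapy; recommended reading
Scraping Zhihu user information with Python Requests
Crawling your own blog posts with the Scrapy framework
Scrapy, a little deeper
Experience and assorted resources on writing (custom) crawlers with Python and Scrapy
Easily customizing web crawlers with Scrapy
How to make a Scrapy Spider automatically crawl Douban group pages

Examples of Scrapy interacting with JavaScript:

Some topics still to be written up:

How to log in before crawling data
How to filter using rules
How to crawl data recursively
Scrapy settings and tuning
How to implement distributed crawling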

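4. Summary

These are my notes and collected knowledge from studying Scrapy over the past few days. I drew on a number of articles found online to write this post, and I am grateful to their authors. I hope this article is helpful to you. If you have any thoughts, feel free to leave a comment; if you liked it, please share it. Thank you!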