Building a Website Crawler with Scrapy

Date: 2020-01-11

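Scrapy is an application framework for crawling websites and extracting structured data. It is used in a wide range of applications, including data mining, information processing, and historical archiving, and it is widely used in industry.
In this article we will build a spider that crawls Hacker News and stores the scraped data in a database in the form we need.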

Installation

We will need Scrapy and BeautifulSoup for screen scraping, and SQLAlchemy for storing the data.
If you are on Ubuntu or another Unix-like distribution, you can install Scrapy with pip.

pip install Scrapy

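If you are on Windows, you need to install some of Scrapy's dependencies by hand.
Windows users need pywin32, pyOpenSSL, Twisted, lxml, and zope.interface. You can download precompiled builds of these packages for an easy installation.
See the official documentation for detailed instructions.
Once everything is installed, verify the installation by entering the following at the Python prompt: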

>>> import scrapy
>>>

If nothing is returned, your installation is ready.
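
The same check can be scripted. What follows is a minimal sketch that assumes Python 3 and uses only the standard library; the helper name is_installed is made up for this illustration and is not part of Scrapy.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Report on the two third-party packages this tutorial imports by name.
    for name in ("scrapy", "sqlalchemy"):
        print(name, "ok" if is_installed(name) else "missing")
```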

Setting Up HNScrapy

To create a new project, enter the following command in a terminal:

$ scrapy startproject hn

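This creates a set of files to help you get started. cd into the hn directory and open your favorite text editor.
In items.py, Scrapy needs us to define a container for the data the spider scrapes. If you have worked through the Django tutorial, you will find items.py similar to Django's models.py.
You will see that a class HnItem already exists. It inherits from Item, a predefined object that Scrapy provides for us.
Let's add the fields we actually want to scrape. We assign each of them Field() because that is how we specify metadata to Scrapy.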

from scrapy.item import Item, Field

class HnItem(Item):
    title = Field()
    link = Field()

Nothing hard here, and yes, that is all. Unlike Django, Scrapy has no other field types, so we are stuck with Field().
Scrapy's Item class behaves much like a Python dictionary: you can get keys and values from it.
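
To make the dictionary-like behavior concrete without installing anything, here is a small stand-in written in plain Python 3. It is not Scrapy's Item class; SketchItem and HnSketchItem are hypothetical names, and the only behavior it imitates is that values are read and written by key and that only declared field names are accepted.

```python
class SketchItem(dict):
    """Dict-like container that only accepts declared field names.

    A stand-in for illustration, not Scrapy's actual Item implementation.
    """

    fields = ()

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError("%s does not support field: %s"
                           % (type(self).__name__, key))
        super().__setitem__(key, value)


class HnSketchItem(SketchItem):
    # Same two fields as HnItem above.
    fields = ("title", "link")


item = HnSketchItem()
item["title"] = "PostgreSQL Exercises"
item["link"] = "http://pgexercises.com/"
print(sorted(item.keys()))
print(item["title"])
```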

Writing the Spider

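Create a file named hn_spider.py in the spiders folder. This is where the magic happens: it is where we tell Scrapy how to find the exact data we are looking for. As you might expect, a spider is specific to one particular website and will probably not work on any other site.
In hn_spider.py we will define a class, HnSpider, along with some common attributes such as name and start_urls.
First, we set up the HnSpider class and its attributes (variables defined inside the class, also called fields). We inherit from Scrapy's BaseSpider: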

from scrapy.spider import BaseSpider
from scrapy.selector import Selector

class HnSpider(BaseSpider):
    name = 'hn'
    allowed_domains = []
    start_urls = ['http://news.ycombinator.com']

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//td[@class="title"]')

        for site in sites:
            title = site.xpath('a/text()').extract()
            link = site.xpath('a/@href').extract()

            print title, link

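The first few variables are self-explanatory: name defines the spider's name, allowed_domains lists the base domains the spider is allowed to crawl, and start_urls lists the URLs where the spider starts crawling. Subsequent URLs are generated from the data the spider downloads from the URLs in start_urls.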
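
The effect of allowed_domains can be pictured as a simple host check. The sketch below assumes Python 3 and is not Scrapy's own code; is_allowed is a hypothetical helper that only illustrates the idea that a listed domain also covers its subdomains and that an empty list, as in the spider above, restricts nothing.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host is a listed domain or a subdomain of one.

    An empty allowed_domains list means no restriction at all.
    """
    if not allowed_domains:
        return True
    host = urlparse(url).hostname or ""
    return any(host == domain or host.endswith("." + domain)
               for domain in allowed_domains)


print(is_allowed("https://news.ycombinator.com/", ["ycombinator.com"]))
print(is_allowed("http://pgexercises.com/", ["ycombinator.com"]))
```
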
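Next, Scrapy uses XPath selectors to extract data from the website: a given XPath selects a specific part of the HTML data. As the Scrapy documentation puts it, "XPath is a language for selecting nodes in XML documents, which can also be used with HTML." You can read more about XPath selectors in the documentation.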

Note

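When scraping your own site and trying to work out an XPath, Chrome's developer tools let you inspect HTML elements and copy the XPath of any element you want. They also let you test an XPath: just use $x in the JavaScript console, for example $x("//img"). This tutorial does not go deeper into that. Firefox has an add-on, FirePath, that can also edit, inspect, and generate XPaths.
We generally tell Scrapy where to start looking for data based on a defined XPath. Let's browse to the Hacker News site, right-click, and choose "View page source".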

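You will see that sel.xpath('//td[@class="title"]') mirrors something in the HTML source we just looked at. The Scrapy documentation explains how to construct XPaths and how to use relative XPaths. In essence, '//td[@class="title"]' selects every <td> element whose class attribute is "title"; the relative XPaths a/text() and a/@href then look inside each of those <td> elements for the <a> element and extract its text and its link.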
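
The same selection can be tried offline with Python's standard library. This is a rough sketch under stated assumptions: the HTML fragment is a made-up, simplified stand-in for a Hacker News row, and xml.etree.ElementTree supports only a subset of XPath and needs well-formed markup, whereas Scrapy's selectors cope with real pages.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment shaped like a Hacker News table.
SNIPPET = """
<table>
  <tr>
    <td class="title">1.</td>
    <td class="title"><a href="http://pgexercises.com/">PostgreSQL Exercises</a></td>
  </tr>
  <tr>
    <td class="subtext">42 points</td>
  </tr>
</table>
""".strip()

root = ET.fromstring(SNIPPET)

rows = []
# Step 1: every <td> whose class attribute is "title".
for td in root.findall(".//td[@class='title']"):
    # Step 2: relative lookup of the <a> children inside that <td>.
    for a in td.findall("a"):
        rows.append((a.text, a.get("href")))

print(rows)
```

The first matching cell holds only the rank number and has no <a> child, so it contributes nothing; that is why the relative step matters.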

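The parse() method takes one argument: response. Hey, wait a moment, what is self doing there? It looks like there are two arguments!
Every instance method (and parse() is an instance method here) receives a reference to the instance itself as its first argument. By convention it is called "self".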
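
A tiny illustration of that rule, using a toy class that has nothing to do with Scrapy (the name ToySpider is made up here):

```python
class ToySpider:
    """Toy class used only to show how self is passed."""

    def __init__(self, name):
        self.name = name

    def parse(self, response):
        # self is the instance the method was called on.
        return "%s parsed %s" % (self.name, response)


spider = ToySpider("hn")
via_instance = spider.parse("page")          # Python passes spider as self
via_class = ToySpider.parse(spider, "page")  # the same call, spelled out
print(via_instance == via_class)
```
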
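The response argument is what the spider gets back after making a request to Hacker News. We parse that response with our XPaths.
Now we will use BeautifulSoup for the parsing. Beautiful Soup will parse anything you give it.
Download BeautifulSoup, create a soup.py file in the crawler directory, and copy the code into it.
In hn_spider.py, import BeautifulSoup and HnItem from items.py, and modify the parse method as shown below.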

from soup import BeautifulSoup as bs
from scrapy.http import Request
from scrapy.spider import BaseSpider
from hn.items import HnItem

class HnSpider(BaseSpider):
    name = 'hn'
    allowed_domains = []
    start_urls = ['http://news.ycombinator.com']

    def parse(self, response):
        if 'news.ycombinator.com' in response.url:
            soup = bs(response.body)
            items = [(x[0].text, x[0].get('href')) for x in
                     filter(None, [
                         x.findChildren() for x in
                         soup.findAll('td', {'class': 'title'})
                     ])]

            for item in items:
                print item
                hn_item = HnItem()
                hn_item['title'] = item[0]
                hn_item['link'] = item[1]
                try:
                    yield Request(item[1], callback=self.parse)
                except ValueError:
                    yield Request('http://news.ycombinator.com/' + item[1], callback=self.parse)

                yield hn_item

We iterate over items and assign the scraped data to the title and link fields.
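
The try/except in parse() exists because some links on the page are relative: building a Request from one raises ValueError, so the code falls back to prefixing the site URL. As an alternative sketch, assuming Python 3, the standard library's urljoin resolves both cases in one call. The relative link shown is a made-up example, and absolute_link is a hypothetical helper name.

```python
from urllib.parse import urljoin

BASE_URL = "http://news.ycombinator.com/"


def absolute_link(href, base=BASE_URL):
    """Resolve href against base; absolute URLs come back unchanged."""
    return urljoin(base, href)


print(absolute_link("item?id=1"))                # hypothetical relative link
print(absolute_link("http://pgexercises.com/"))  # already absolute
```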

Now try crawling Hacker News, and you will see the links and titles printed to your console.

scrapy crawl hn

2013-12-12 16:57:06+0530 [scrapy] INFO: Scrapy 0.20.2 started (bot: hn)
2013-12-12 16:57:06+0530 [scrapy] DEBUG: Optional features available: ssl, http11, django
2013-12-12 16:57:06+0530 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'hn.spiders', 'SPIDER_MODULES': ['hn.spiders'], 'BOT_NAME': 'hn'}
2013-12-12 16:57:06+0530 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-12-12 16:57:06+0530 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware
, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-12-12 16:57:06+0530 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-12-12 16:57:06+0530 [scrapy] DEBUG: Enabled item pipelines:
2013-12-12 16:57:06+0530 [hn] INFO: Spider opened
2013-12-12 16:57:06+0530 [hn] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-12-12 16:57:06+0530 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-12-12 16:57:06+0530 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-12-12 16:57:07+0530 [hn] DEBUG: Redirecting (301) to <GET https://news.ycombinator.com/> from <GET http://news.ycombinator.com>
2013-12-12 16:57:08+0530 [hn] DEBUG: Crawled (200) <GET https://news.ycombinator.com/> (referer: None)
(u'Caltech Announces Open Access Policy | Caltech', u'http://www.caltech.edu/content/caltech-announces-open-access-policy')
2013-12-12 16:57:08+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': u'http://www.caltech.edu/content/caltech-announces-open-access-policy',
         'title': u'Caltech Announces Open Access Policy | Caltech'}
(u'Coinbase Raises $25 Million From Andreessen Horowitz', u'http://blog.coinbase.com/post/69775463031/coinbase-raises-25-million-from-andreessen-horowitz')
2013-12-12 16:57:08+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': u'http://blog.coinbase.com/post/69775463031/coinbase-raises-25-million-from-andreessen-horowitz',
         'title': u'Coinbase Raises $25 Million From Andreessen Horowitz'}
(u'Backpacker stripped of tech gear at Auckland Airport', u'http://www.nzherald.co.nz/nz/news/article.cfm?c_id=1&objectid=11171475')
2013-12-12 16:57:08+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': u'http://www.nzherald.co.nz/nz/news/article.cfm?c_id=1&objectid=11171475',
         'title': u'Backpacker stripped of tech gear at Auckland Airport'}
(u'How I introduced a 27-year-old computer to the web', u'http://www.keacher.com/1216/how-i-introduced-a-27-year-old-computer-to-the-web/')
2013-12-12 16:57:08+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': u'http://www.keacher.com/1216/how-i-introduced-a-27-year-old-computer-to-the-web/',
         'title': u'How I introduced a 27-year-old computer to the web'}
(u'Show HN: Bitcoin Pulse - Tracking Bitcoin Adoption', u'http://www.bitcoinpulse.com')
2013-12-12 16:57:08+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': u'http://www.bitcoinpulse.com',
         'title': u'Show HN: Bitcoin Pulse - Tracking Bitcoin Adoption'}
(u'Why was this secret?', u'http://sivers.org/ws')
2013-12-12 16:57:08+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': u'http://sivers.org/ws', 'title': u'Why was this secret?'}
(u'PostgreSQL Exercises', u'http://pgexercises.com/')
2013-12-12 16:57:08+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': u'http://pgexercises.com/', 'title': u'PostgreSQL Exercises'}
(u'What it feels like being an ipad on a stick on wheels', u'http://labs.spotify.com/2013/12/12/what-it-feels-like-being-an-ipad-on-a-stick-on-wheels/')
2013-12-12 16:57:08+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': u'http://labs.spotify.com/2013/12/12/what-it-feels-like-being-an-ipad-on-a-stick-on-wheels/',
         'title': u'What it feels like being an ipad on a stick on wheels'}
(u'Prototype ergonomic mechanical keyboards', u'http://blog.fsck.com/2013/12/better-and-better-keyboards.html')
2013-12-12 16:57:08+0530 [hn] DEBUG: Scraped from <200 https://news.ycombinator.com/>
        {'link': u'http://blog.fsck.com/2013/12/better-and-better-keyboards.html',
         'title': u'Prototype ergonomic mechanical keyboards'}
(u'H5N1', u'http://blog.samaltman.com/h5n1')
.............
.............
.............
2013-12-12 16:58:41+0530 [hn] INFO: Closing spider (finished)
2013-12-12 16:58:41+0530 [hn] INFO: Dumping Scrapy stats:
        {'downloader/exception_count': 2,
         'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 2,
         'downloader/request_bytes': 22401,
         'downloader/request_count': 71,
         'downloader/request_method_count/GET': 71,
         'downloader/response_bytes': 1482842,
         'downloader/response_count': 69,
         'downloader/response_status_count/200': 61,
         'downloader/response_status_count/301': 4,
         'downloader/response_status_count/302': 3,
         'downloader/response_status_count/404': 1,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2013, 12, 12, 11, 28, 41, 289000),
         'item_scraped_count': 63,
         'log_count/DEBUG': 141,
         'log_count/INFO': 4,
         'request_depth_max': 2,
         'response_received_count': 62,
         'scheduler/dequeued': 71,
         'scheduler/dequeued/memory': 71,
         'scheduler/enqueued': 71,
         'scheduler/enqueued/memory': 71,
         'start_time': datetime.datetime(2013, 12, 12, 11, 27, 6, 843000)}
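2013-12-12 16:58:41+0530 [hn] INFO: Spider closed (finished)

You will see roughly 400 lines of output in your terminal (the output above is shortened for readability).
You can export the output as JSON with this small command:

$ scrapy crawl hn -o items.json -t json

We have now implemented our crawler based on the items we are looking for.

Saving the Scraped Data

Our first step is to create a database to hold the scraped data. Open settings.py and define the database configuration as shown below.

BOT_NAME = 'hn'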

SPIDER_MODULES = ['hn.spiders']
NEWSPIDER_MODULE = 'hn.spiders'

DATABASE = {'drivername': 'xxx',
            'username': 'yyy',
            'password': 'zzz',
            'database': 'vvv'}

Next, create a models.py file in the hn directory. We will use SQLAlchemy as the ORM framework to build the database model.
First, we need to define a function that connects directly to the database. For this we import SQLAlchemy and the settings.py file.

from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.engine.url import URL

import settings

DeclarativeBase = declarative_base()

def db_connect():
    return create_engine(URL(**settings.DATABASE))

def create_hn_table(engine):
    DeclarativeBase.metadata.create_all(engine)

class Hn(DeclarativeBase):
    __tablename__ = "hn"

    id = Column(Integer, primary_key=True)
    title = Column('title', String(200))
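    link = Column('link', String(200))

Before moving on, I want to explain the two asterisks in the URL() call: **settings.DATABASE. First, we access the database settings through the variable in settings.py. The ** unpacks all the key/value pairs of the DATABASE dictionary and passes them as keyword arguments. URL, a constructor defined in SQLAlchemy, maps those keys and values to a URL that SQLAlchemy understands for connecting to our database.
URL() then assembles the elements into a URL of the form drivername://username:password@host/database, like the one below, which is read by create_engine().

'postgresql://xxx:yyy@zzz/vvv'

Next, we create a table for our ORM. In the models.py code above we import declarative_base() from SQLAlchemy to map the class that defines our table structure onto Postgres, and we define a function that creates the required table from the table metadata, together with the table and columns that will store the data.

Pipeline Management

We have built the crawler that fetches and parses the HTML, and we have set up the database that stores the data. Now we need a pipeline to connect the two.
Open pipelines.py and import SQLAlchemy's sessionmaker, which binds to the database (creating the connection), as well as our models.

from sqlalchemy.orm import sessionmaker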
from models import Hn, db_connect, create_hn_table

class HnPipeline(object):
    def __init__(self):
        engine = db_connect()
        create_hn_table(engine)
        self.Session = sessionmaker(bind=engine)

    def process_item(self, item, spider):
        session = self.Session()
        hn = Hn(**item)
        session.add(hn)
        session.commit()
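        return item

Here we created a class, HnPipeline. Its constructor, __init__(self), initializes the class by defining the engine, creating the hn table, and binding a session factory to the database through that engine.
Then we define process_item(), which takes two parameters, item and spider. We open a session with the database and unpack the item into our Hn() model. We then call session.add() to add the Hn object to the session. At this point it has not been saved to the database yet; it still lives at the SQLAlchemy level. Calling session.commit() then writes it to the database and commits the transaction.

We are almost done. We still need to add a variable to settings.py that tells the crawler where to find our pipeline when it processes data.
So add one more variable to settings.py, ITEM_PIPELINES:

ITEM_PIPELINES = {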
    'hn.pipelines.HnPipeline':300
}

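This is the directory/module path of the pipeline we just defined.
Now we can put all the scraped data into our database. Let's see what we get:
run the crawl command again and wait until all processing has finished.
Hooray! We have now successfully stored our scraped data in the database.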
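
To try the pipeline logic without a PostgreSQL server, the unpack, add, and commit flow from this section can be imitated with Python's built-in sqlite3 module. This is a stand-in under stated assumptions, not the SQLAlchemy code above: HnRow is a hypothetical plain class playing the role of the Hn model, the database lives in memory, and Python 3 is assumed.

```python
import sqlite3


class HnRow:
    """Plain stand-in for the Hn model: one attribute per column."""

    def __init__(self, title, link):
        self.title = title
        self.link = link


def process_item(conn, item):
    # ** turns the item's keys into keyword arguments, as in Hn(**item).
    row = HnRow(**item)
    conn.execute("INSERT INTO hn (title, link) VALUES (?, ?)",
                 (row.title, row.link))
    conn.commit()  # end the transaction so the row is saved
    return item


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hn (id INTEGER PRIMARY KEY, title TEXT, link TEXT)")

process_item(conn, {"title": "PostgreSQL Exercises",
                    "link": "http://pgexercises.com/"})

stored = conn.execute("SELECT title, link FROM hn").fetchall()
print(stored)
```

Returning the item from process_item mirrors the pipeline above, where the item is passed on after it has been stored.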

Scheduled Tasks

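It would be tedious to have to run this script by hand on a regular basis, so this is where a scheduled task comes in.
A scheduled task runs automatically at whatever time you specify. But! It only runs while your computer is on (not asleep or shut down), and this particular script also needs an Internet connection. To run the scheduled task regardless of the state of your own computer, you should host the hn code, the bash script, and the cron job on a separate server that is always running.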
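
The article does not show the cron entry itself, so the following is only a sketch of what one might look like. It composes a standard five-field crontab line that changes into the project directory and runs the crawl once a day. The path /home/user/hn and the helper name cron_line are assumptions made for this example, and the scrapy command must be on cron's PATH for the entry to work.

```python
import shlex


def cron_line(minute, hour, project_dir, spider="hn"):
    """Build a crontab entry that runs the spider once a day at hour:minute."""
    command = "cd %s && scrapy crawl %s" % (shlex.quote(project_dir),
                                            shlex.quote(spider))
    # Fields: minute, hour, day of month, month, day of week, command.
    return "%d %d * * * %s" % (minute, hour, command)


# Hypothetical project location; adjust to wherever the hn project lives.
print(cron_line(0, 6, "/home/user/hn"))
```

The printed line can be pasted into the file opened by crontab -e on the server.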

Summary

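This is about as short and simple as a scraping tutorial gets; Scrapy has many more features that provide advanced functionality and usability.
Download the complete source code from GitHub.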

Original article: Build a Website Crawler based upon Scrapy
Reposted from: 开源中国社区 - LeoXu, BoydWang, Garfielt