Scrapy Crawler Usage Guide: A Complete Tutorial

Date: 2019-11-21


scrapy note

command

Global commands:

startproject: creates a Scrapy project named project_name in the project_name folder.

scrapy startproject myproject

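settings: when run inside a project, prints the project's setting values; otherwise prints the Scrapy default settings.
runspider: runs a spider written in a standalone Python file, without having to create a project.
shell: starts the Scrapy shell for the given URL (if one is given), or empty if no URL is given.
fetch: downloads the given URL using the Scrapy downloader and writes the content to standard output.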

scrapy fetch --nolog --headers http://www.example.com/

view: opens the given URL in a browser, as your Scrapy spider would "see" it.

scrapy view http://www.example.com/some/page.html

version: prints the Scrapy version.

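Project-only commands:

crawl: starts crawling with a spider.
scrapy crawl myspider
check: runs contract checks.
scrapy check -l
list: lists all available spiders in the current project, one spider per line.
edit: opens the given spider in the editor defined by the EDITOR setting.
parse: fetches the given URL and parses it with the spider that handles it. If you pass the --callback option, that spider method is used; otherwise parse is used.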

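--spider=SPIDER: bypass spider autodetection and force a specific spider
-a NAME=VALUE: set a spider argument (may be repeated)
--callback or -c: spider method to use as the callback for parsing the response
--pipelines: process items through pipelines
--rules or -r: use CrawlSpider rules to discover the callback used for parsing the response
--noitems: don't show scraped items
--nolinks: don't show extracted links
--nocolour: avoid using pygments to colorize the output
--depth or -d: depth level to which requests should be followed (default: 1)
--verbose or -v: display information for each depth level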
scrapy parse http://www.example.com/ -c parse_item

genspider: creates a new spider in the current project.

scrapy genspider [-t template] <name> <domain>
scrapy genspider -t basic example example.com

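deploy: deploys the project to a Scrapyd service. (Recent Scrapy releases no longer ship this command; deployment is handled by scrapyd-deploy from the scrapyd-client package.)
bench: runs a quick benchmark test.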

Using selectors

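from scrapy.http import HtmlResponse
from scrapy.selector import Selector

# Build a selector directly from a string of HTML
body = '<html><body><span>good</span></body></html>'
Selector(text=body).xpath('//span/text()').extract()

# Build a selector from a response; a str body needs an explicit encoding
response = HtmlResponse(url='http://example.com', body=body, encoding='utf-8')
Selector(response=response).xpath('//span/text()').extract()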

Scrapy provides two convenient shortcuts: response.xpath() and response.css()

>>> response.xpath('//base/@href').extract()
>>> response.css('base::attr(href)').extract()
>>> response.xpath('//a[contains(@href, "image")]/@href').extract()
>>> response.css('a[href*=image]::attr(href)').extract()
>>> response.xpath('//a[contains(@href, "image")]/img/@src').extract()
>>> response.css('a[href*=image] img::attr(src)').extract()

Nesting selectors

The selection methods (.xpath() or .css()) return a list of selectors of the same type, so you can call the selection methods on those selectors too. Here is an example:

links = response.xpath('//a[contains(@href, "image")]')
for index, link in enumerate(links):
    # Each link is itself a selector, so it can be queried again
    args = (index, link.xpath('@href').extract(), link.xpath('img/@src').extract())
    print('Link number %d points to url %s and image %s' % args)

Using selectors with regular expressions

Selector also has a .re() method for extracting data with regular expressions. However, unlike .xpath() or .css(), .re() returns a list of unicode strings, so you cannot chain nested .re() calls.

>>> response.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')

Working with relative XPaths

>>> divs = response.xpath('//div')
>>> for p in divs.xpath('//p'):  # this is wrong - gets all <p> from the whole document
...     print(p.extract())
>>> for p in divs.xpath('.//p'):  # extracts all <p> inside the <div> elements
...     print(p.extract())
>>> for p in divs.xpath('p'):  # gets only the <p> that are direct children
...     print(p.extract())

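For example, when XPath's starts-with() or contains() are not sufficient, the test() function (called as re:test() in the EXSLT regular-expressions namespace) can be very useful.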

>>> sel.xpath('//li//@href').extract()
>>> sel.xpath(r'//li[re:test(@class, "item-\d$")]//@href').extract()

XPATH TIPS

Avoid using contains(.//text(), 'search text') in your XPath conditions. Use contains(., 'search text') instead.
Beware of the difference between //node[1] and (//node)[1]
When selecting by class, be as specific as necessary
When querying by class, consider using CSS
Learn to use all the different axes
Useful trick to get text content

Item Loaders

populate items

def parse(self, response):
    l = ItemLoader(item=Product(), response=response)
    l.add_xpath('name', '//div[@class="product_name"]')
    l.add_xpath('name', '//div[@class="product_title"]')
    l.add_xpath('price', '//p[@id="price"]')
    l.add_css('stock', 'p#stock')
    l.add_value('last_updated', 'today') # you can also use literal values
    return l.load_item()

Item Pipeline

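Cleaning HTML data
Validating scraped data (checking that items contain certain fields)
Checking for duplicates (and dropping them)
Storing the scraped items in a database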

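Writing your own item pipeline

Each item pipeline component must implement process_item(self, item, spider). Scrapy calls this method for every item; it must return an Item (or any subclass) object or raise a DropItem exception. Dropped items are not processed by later pipeline components.
Parameters:

item (Item object) – the scraped item
spider (Spider object) – the spider that scraped the item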

Write items to MongoDB

import pymongo

class MongoPipeline(object):

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        collection_name = item.__class__.__name__
        # insert_one() replaces Collection.insert(), which PyMongo 4 removed
        self.db[collection_name].insert_one(dict(item))
        return item

To activate an Item Pipeline component, you must add its class to the ITEM_PIPELINES setting, as in the following example:

ITEM_PIPELINES = {
    'myproject.pipelines.PricePipeline': 300,
    'myproject.pipelines.JsonWriterPipeline': 800,
}

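The integer values assigned to the classes determine the order in which they run: items pass through the pipelines from the lowest value to the highest. These numbers are customarily defined in the 0-1000 range.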
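
The ordering rule can be illustrated in plain Python. This sketch only mimics the rule by sorting the dictionary by value; it is not Scrapy's internal code.

```python
# Same shape as the ITEM_PIPELINES setting, deliberately listed out of order
ITEM_PIPELINES = {
    "myproject.pipelines.JsonWriterPipeline": 800,
    "myproject.pipelines.PricePipeline": 300,
}

# Lower numbers run first, regardless of the order of the dictionary keys
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)

for position, path in enumerate(run_order, start=1):
    print(position, path, ITEM_PIPELINES[path])
```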

Common practices

Running multiple spiders in the same process

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings

runner = CrawlerRunner(get_project_settings())
dfs = set()
for domain in ['scrapinghub.com', 'insophia.com']:
    d = runner.crawl('followall', domain=domain)
    dfs.add(d)

defer.DeferredList(dfs).addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until all crawling jobs are finished

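Avoiding getting banned

Use a pool of user agents and rotate through them. Fill the pool with the user agent strings of common browsers (a web search turns up plenty).
Disable cookies (see COOKIES_ENABLED); some sites use cookies to spot crawler behaviour.
Set a download delay (2 or higher). See the DOWNLOAD_DELAY setting.
If possible, use Google cache to fetch pages instead of hitting the site directly.
Use a pool of rotating IPs, for example the free Tor project or a paid service such as ProxyMesh.
Use a highly distributed downloader that circumvents bans internally, so you only need to focus on parsing pages. One example is Crawlera.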
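
The remaining settings tune throughput for large (broad) crawls rather than help avoid bans:

Increase concurrency: CONCURRENT_REQUESTS = 100
Disable cookies: COOKIES_ENABLED = False
Disable retries: RETRY_ENABLED = False
Reduce the download timeout: DOWNLOAD_TIMEOUT = 15
Disable redirects: REDIRECT_ENABLED = False
Enable crawling of "Ajax Crawlable Pages": AJAXCRAWL_ENABLED = True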
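
The user agent pool suggested in the ban-avoidance list above can be sketched as a small downloader middleware. The class name RandomUserAgentMiddleware and the setting name USER_AGENT_POOL are hypothetical, introduced only for this illustration; Scrapy does not define them.

```python
import random


class RandomUserAgentMiddleware:
    """Pick a random User-Agent for every outgoing request (illustrative sketch)."""

    def __init__(self, user_agents):
        self.user_agents = list(user_agents)
        if not self.user_agents:
            raise ValueError("the user agent pool must not be empty")

    @classmethod
    def from_crawler(cls, crawler):
        # USER_AGENT_POOL is a hypothetical custom setting holding a list of strings
        return cls(crawler.settings.getlist("USER_AGENT_POOL"))

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.user_agents)
        # Returning None lets Scrapy continue processing the request normally
        return None
```

To try it in a project, the class would be registered in DOWNLOADER_MIDDLEWARES (for example under a path such as myproject.middlewares.RandomUserAgentMiddleware, also an assumed name) with a priority below 500, so that it runs before the built-in user agent middleware, which only sets the header when none is present.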

Useful Firefox add-ons for scraping

Firebug
XPather
XPath Checker
Tamper Data
Firecookie
Auto-throttling: AUTOTHROTTLE_ENABLED = True
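
The auto-throttling switch is usually combined with a few related settings. The fragment below is a sketch of how they might appear in a project's settings.py; the specific values are illustrative assumptions, not recommendations from the original note.

```python
# settings.py (fragment)

# Let Scrapy adjust the delay between requests based on observed latency
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5           # initial download delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 60            # upper bound for the delay under high latency
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average parallel requests per remote site

# With AutoThrottle enabled, DOWNLOAD_DELAY acts as the minimum delay
DOWNLOAD_DELAY = 2
```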

other

Scrapyd
Spider middleware
Downloader middleware
Built-in settings reference
Requests and Responses
Scrapy tutorial
