Following on from the previous post, where we used a basic spider to collect articles from the Q&A channel, let's now try out a few of the nicer features in the Scrapy toolbox.

Because most data-crawling jobs are similar and follow consistent patterns, Scrapy provides several generic spider classes with a higher level of encapsulation to help us develop spiders faster and more efficiently.
```bash
# List the generic spiders (templates) that Scrapy provides
scrapy genspider -l
```
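On a typical Scrapy installation this command lists the built-in spider templates; the exact output below may vary slightly between Scrapy versions:

```
Available templates:
  basic
  crawl
  csvfeed
  xmlfeed
```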
CrawlSpider is the most commonly used of the generic spiders.

Through a set of rules (a small rule engine), it automatically discovers and follows links on a page, which covers tasks such as collecting detail pages and following category/pagination URLs, among others. In the end, we only need to implement the "detail page parser" logic to finish the spider.
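To make the rule engine concrete, here is a minimal sketch of what a single rule looks like; the regular expression and callback name are illustrative placeholders rather than part of the Mafengwo example yet:

```python
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

rule = Rule(
    LinkExtractor(allow=r'/schedule/\d+\.html'),  # which links to pick out of each page
    callback='parse_item',                        # the parser method to run on each matched page
    follow=False,                                 # whether to keep applying the rules to the pages it finds
)
```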
Here we take crawling Mafengwo's Beijing itinerary listings ( http://www.mafengwo.cn/xc/10065/ ) as an example:
```bash
# Create a spider based on the generic crawl template
scrapy genspider --template crawl xinchen www.mafengwo.cn/xc/10065
```
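For reference, the crawl template generates roughly the following skeleton (the exact content depends on your Scrapy version), which we will then flesh out:

```python
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class XinchenSpider(CrawlSpider):
    name = 'xinchen'
    allowed_domains = ['www.mafengwo.cn/xc/10065']
    start_urls = ['http://www.mafengwo.cn/xc/10065/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        return item
```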
Then we implement the concrete spider logic below by editing the file mafengwo/mafengwo/spiders/xinchen.py.

For ease of demonstration, this example bundles the spider's main logic, the persistence logic, the data-modelling logic and so on all in this single spider file.
```python
# -*- coding: utf-8 -*-
import scrapy
import pymongo
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class XinchenSpider(CrawlSpider):
    name = 'xinchen'
    allowed_domains = ['www.mafengwo.cn']
    start_urls = ['http://www.mafengwo.cn/xc/10065/']

    rules = (
        # Extract the "next page" link and follow it
        Rule(
            LinkExtractor(allow=r'/xc/10065/(\d+)\.html',
                          restrict_xpaths='//div[@class="page-hotel"]/a[@class="ti next"]'),
            callback=None,
            follow=True
        ),
        # Extract detail-page links and scrape them with the parse_item parser
        Rule(
            LinkExtractor(allow=r'/schedule/(\d+)\.html',
                          restrict_xpaths='//div[@class="post-list"]/ul/li/dl/dt/a'),
            callback='parse_item',
            follow=False
        ),
    )

    def __init__(self, *args, **kwargs):
        super(XinchenSpider, self).__init__(*args, **kwargs)  # call the parent constructor
        # MongoDB configuration
        self.client = pymongo.MongoClient('localhost')
        self.collection = self.client['mafengwo']['xinchen_pages']

    def closed(self, reason):
        self.client.close()

    def parse_item(self, response):
        item = {}
        item['url'] = response.url
        item['author'] = response.xpath('//dl[@class="flt1 show_from clearfix"]/dd/p/a[@class="name"]/text()').extract_first()
        item['title'] = response.xpath('//p[@class="dd_top"]/a/text()').extract_first()
        item['content'] = response.xpath('//div[@class="guide"]').extract_first()
        # Upsert into MongoDB keyed by URL, so re-crawls do not create duplicates
        self.collection.update_one({'url': item['url']}, {'$set': item}, upsert=True)
        yield item
```
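As noted above, the MongoDB persistence lives inside the spider purely for demonstration. In a real project it would more commonly sit in an Item Pipeline; the sketch below is an assumed alternative (the MongoPipeline class name and the settings snippet are illustrative, not from the original project):

```python
# mafengwo/mafengwo/pipelines.py (hypothetical) -- persistence moved out of the spider
import pymongo


class MongoPipeline(object):
    def open_spider(self, spider):
        self.client = pymongo.MongoClient('localhost')
        self.collection = self.client['mafengwo']['xinchen_pages']

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Upsert keyed by URL so repeated crawls do not create duplicates
        self.collection.update_one({'url': item['url']}, {'$set': dict(item)}, upsert=True)
        return item
```

It would then be enabled in settings.py with something like ITEM_PIPELINES = {'mafengwo.pipelines.MongoPipeline': 300}.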
Run the spider:
```bash
scrapy crawl --nolog xinchen
```
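Since --nolog suppresses Scrapy's console output, a quick way to confirm the crawl worked is to query MongoDB directly. This small check assumes the same local instance and the mafengwo / xinchen_pages database and collection used in the spider above:

```python
import pymongo

client = pymongo.MongoClient('localhost')
collection = client['mafengwo']['xinchen_pages']

print(collection.count_documents({}))  # how many itinerary pages were stored
for doc in collection.find({}, {'title': 1, 'url': 1}).limit(3):
    print(doc.get('title'), doc.get('url'))

client.close()
```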