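I keep appreciating how convenient Scrapy is, so here are some more notes on it.
Scrapy is built on the Twisted framework (http://twistedmatrix.com/trac/). Once I have finished with PyBrain, I would like to dig deeper into Twisted when I get the chance.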
Twisted is an event-driven networking engine written in Python and licensed under the open source
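1. The previous post left out quite a few details, so I am filling them in here.

* `scrapy startproject xxx` creates a new project named xxx
* `scrapy crawl xxx` starts crawling; it must be run from inside a project
* `scrapy shell url` opens url in the Scrapy shell, which is very handy
* `scrapy runspider <spider_file.py>` runs a spider without needing a project

`scrapy crawl xxx -a category=xxx` passes an argument to the spider (had I known this earlier, my JD.com crawler would not have turned out such a mess, sigh). The spider receives the argument in its init function: `def __init__(self, category=None):`.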
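To see how an `-a` value ends up in `__init__`, here is a minimal sketch that deliberately does not import Scrapy. `parse_spider_args` is a hypothetical helper that only mimics turning `-a name=value` pairs into keyword arguments, `CategorySpider` is a plain class standing in for a real spider, and the URL layout is an assumption made up for illustration.

```python
def parse_spider_args(argv):
    """Collect `-a name=value` pairs from a command line into a dict (hypothetical helper)."""
    args = {}
    i = 0
    while i < len(argv):
        if argv[i] == "-a" and i + 1 < len(argv):
            # Split on the first '=' only, so values may themselves contain '='.
            name, value = argv[i + 1].split("=", 1)
            args[name] = value  # values always arrive as strings
            i += 2
        else:
            i += 1
    return args


class CategorySpider:
    """Plain stand-in for a spider; a real one would subclass a Scrapy spider class."""

    name = "xxx"

    def __init__(self, category=None, **kwargs):
        self.category = category
        # Hypothetical URL layout, used only for illustration.
        self.start_urls = ["http://www.example.com/categories/%s" % category]


argv = ["scrapy", "crawl", "xxx", "-a", "category=books"]
spider = CategorySpider(**parse_spider_args(argv))
print(spider.start_urls)
```

Since everything passed this way is a string, a spider that needs a number or a list has to convert the value itself inside `__init__`.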
The first Request objects are generated by the `make_requests_from_url` function, with `callback=self.parse`.
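The flow from `start_urls` to the first `parse` call can be modelled roughly as below. This is a simplified sketch, not Scrapy's source: `Request` here is a tiny stand-in class, `MiniSpider` and the toy "engine" loop are made up for illustration, and the assumption that `start_requests` calls `make_requests_from_url` once per start URL is labelled as such.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Request:
    """Stand-in for Scrapy's Request: just a URL plus the callback for its response."""
    url: str
    callback: Callable


class MiniSpider:
    start_urls = ["http://www.example.com/a", "http://www.example.com/b"]

    def start_requests(self):
        # Assumed behaviour: one initial request per start URL.
        return [self.make_requests_from_url(url) for url in self.start_urls]

    def make_requests_from_url(self, url):
        # The first requests are wired to self.parse.
        return Request(url, callback=self.parse)

    def parse(self, response):
        return "parsed %s" % response


spider = MiniSpider()
requests = spider.start_requests()
# Toy "engine": hand each fake response (here just the URL) to the request's callback.
results = [req.callback(req.url) for req in requests]
print(results)
```

Overriding `make_requests_from_url` (or `start_requests`) in this model is what would let a spider send its first responses to a different callback.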
Besides BaseSpider, there are plenty of other spiders that can be subclassed directly, for example `class scrapy.contrib.spiders.CrawlSpider`:
This is the most commonly used spider for crawling regular websites, as it provides a convenient mechanism for following links by defining a set of rules. It may not be the best suited for your particular web sites or project, but it’s generic enough for several cases, so you can start from it and override it as needed for more custom functionality, or just implement your own spider.
Compared with BaseSpider, it adds a `rules` attribute. Through these rules we can choose which URL patterns get crawled. Sample code:
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector
    from scrapy.item import Item, Field


    class ExampleItem(Item):
        # A bare Item() declares no fields, so assigning item['id'] would raise KeyError.
        id = Field()
        name = Field()
        description = Field()


    class MySpider(CrawlSpider):
        name = 'example.com'
        allowed_domains = ['example.com']
        start_urls = ['http://www.example.com']

        rules = (
            # Extract links matching 'category.php' (but not matching 'subsection.php')
            # and follow links from them (since no callback means follow=True by default).
            Rule(SgmlLinkExtractor(allow=(r'category\.php', ), deny=(r'subsection\.php', ))),

            # Extract links matching 'item.php' and parse them with the spider's method parse_item
            Rule(SgmlLinkExtractor(allow=(r'item\.php', )), callback='parse_item'),
        )

        def parse_item(self, response):
            self.log('Hi, this is an item page! %s' % response.url)

            hxs = HtmlXPathSelector(response)
            item = ExampleItem()
            item['id'] = hxs.select('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
            item['name'] = hxs.select('//td[@id="item_name"]/text()').extract()
            item['description'] = hxs.select('//td[@id="item_description"]/text()').extract()
            return item
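To make the `allow`/`deny` semantics of the first rule concrete, here is a small sketch using only the standard `re` module. `should_follow` is a hypothetical function that models just the pattern filtering (deny wins, an empty allow list accepts everything); it ignores domain restrictions and everything else a real link extractor does, and the sample URLs are invented.

```python
import re


def should_follow(url, allow=(), deny=()):
    """Return True if url passes the allow/deny regex filters (illustrative model only)."""
    # A URL matching any deny pattern is rejected, even if it also matches allow.
    if any(re.search(pattern, url) for pattern in deny):
        return False
    # With allow patterns given, the URL must match at least one of them.
    if allow and not any(re.search(pattern, url) for pattern in allow):
        return False
    return True


urls = [
    "http://www.example.com/category.php?id=1",
    "http://www.example.com/category.php?next=subsection.php",
    "http://www.example.com/item.php?id=7",
]
followed = [
    u for u in urls
    if should_follow(u, allow=(r"category\.php",), deny=(r"subsection\.php",))
]
print(followed)
```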
XMLFeedSpider:

    from scrapy import log
    from scrapy.contrib.spiders import XMLFeedSpider
    from myproject.items import TestItem


    class MySpider(XMLFeedSpider):
        name = 'example.com'
        allowed_domains = ['example.com']
        start_urls = ['http://www.example.com/feed.xml']
        iterator = 'iternodes'  # This is actually unnecessary, since it's the default value
        itertag = 'item'

        def parse_node(self, response, node):
            log.msg('Hi, this is a <%s> node!: %s' % (self.itertag, ''.join(node.extract())))

            # Use the imported TestItem; the name Item is never imported in this snippet.
            item = TestItem()
            item['id'] = node.select('@id').extract()
            item['name'] = node.select('name').extract()
            item['description'] = node.select('description').extract()
            return item
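The idea of being handed one `itertag` node at a time can be tried out without Scrapy, using the standard library's `xml.etree.ElementTree`. This is only a sketch under assumptions: the feed content is invented, `parse_node` here is a plain function returning a dict, and it extracts text values, whereas the selector calls in the spider return lists of extracted nodes.

```python
import xml.etree.ElementTree as ET

# Invented sample feed, standing in for http://www.example.com/feed.xml
FEED = """<rss><channel>
<item id="1"><name>First</name><description>First item</description></item>
<item id="2"><name>Second</name><description>Second item</description></item>
</channel></rss>"""


def parse_node(node):
    """Build a plain dict from one <item> element (stand-in for a spider's parse_node)."""
    return {
        "id": node.get("id"),                       # the @id attribute
        "name": node.findtext("name"),              # text of the <name> child
        "description": node.findtext("description"),
    }


root = ET.fromstring(FEED)
# Visit every element whose tag equals the chosen itertag, here 'item'.
items = [parse_node(node) for node in root.iter("item")]
print(items)
```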
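There are also CSVFeedSpider, SitemapSpider, and various other spiders aimed at different needs, all under scrapy.contrib.spiders.
Scrapy also comes with a server version, scrapyd, which makes it easy to upload and manage crawl jobs.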
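As a rough idea of what managing jobs through scrapyd looks like, the sketch below builds (but does not send) a request to scrapyd's `schedule.json` endpoint. The host and port assume a local scrapyd on its usual port 6800, `build_schedule_request` is a hypothetical helper, and the project and spider names are placeholders.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_schedule_request(project, spider, host="http://localhost:6800"):
    """Prepare a POST to scrapyd's schedule.json without sending it (hypothetical helper)."""
    data = urlencode({"project": project, "spider": spider}).encode("ascii")
    # Supplying a body makes urllib treat this as a POST request.
    return Request(host + "/schedule.json", data=data)


req = build_schedule_request("myproject", "example.com")
print(req.get_method(), req.full_url, req.data)
```

Passing the prepared request to `urllib.request.urlopen` would actually submit the job, provided a scrapyd instance is running and the project has been deployed to it.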