Official documentation: http://doc.scrapy.org/en/latest/
GitHub examples: https://github.com/search?utf8=%E2%9C%93&q=scrapy
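I'll organize the rest later... off to buy dinner... -- 2014-08-20 19:29:20
Ugh... the Sogou input method just broke, so I logged out and back in, and everything I had written was gone. Apparently the drafts folder isn't all that reliable either!!!
Let me write it all up again.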
-- 2014-08-21 04:02:37
(I) Basics -- scrapy.spider.Spider
(1) Using the interactive shell
dizzy@dizzy-pc:~$ scrapy shell "http://www.baidu.com/"
2014-08-21 04:09:11+0800 [scrapy] INFO: Scrapy 0.24.4 started (bot: scrapybot)
2014-08-21 04:09:11+0800 [scrapy] INFO: Optional features available: ssl, http11, django
2014-08-21 04:09:11+0800 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0}
2014-08-21 04:09:11+0800 [scrapy] INFO: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-08-21 04:09:11+0800 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-08-21 04:09:11+0800 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-08-21 04:09:11+0800 [scrapy] INFO: Enabled item pipelines:
2014-08-21 04:09:11+0800 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6024
2014-08-21 04:09:11+0800 [scrapy] DEBUG: Web service listening on 127.0.0.1:6081
2014-08-21 04:09:11+0800 [default] INFO: Spider opened
2014-08-21 04:09:12+0800 [default] DEBUG: Crawled (200) <GET http://www.baidu.com/> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0xa483cec>
[s]   item       {}
[s]   request    <GET http://www.baidu.com/>
[s]   response   <200 http://www.baidu.com/>
[s]   settings   <scrapy.settings.Settings object at 0xa0de78c>
[s]   spider     <Spider 'default' at 0xa78086c>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
>>>
# response.body               returns the full content of the response
# response.xpath('//ul/li')   lets you test any XPath expression
More important, if you type response.selector you will access a selector object you can use to
query the response, and convenient shortcuts like response.xpath() and response.css() mapping to
response.selector.xpath() and response.selector.css().
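In other words, the shell makes it easy to check interactively whether an XPath expression selects the right content. I used to pick selectors with Firefox's F12 developer tools, but that doesn't guarantee the right content is selected every time.
You can also use:
scrapy shell 'http://scrapy.org' --nolog    # the --nolog option suppresses log output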
(2) Example
from scrapy import Spider
from scrapy_test.items import DmozItem


class DmozSpider(Spider):
    name = 'dmoz'
    allowed_domains = ['dmoz.org']
    start_urls = [
        'http://www.dmoz.org/Computers/Programming/Languages/Python/Books/',
        'http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/',
    ]

    def parse(self, response):
        # Build one item per <li> found under a <ul>.
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item
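(3) Saving to a file
You can save the scraped items to a file with the command below. The format can be json, xml, or csv.
scrapy crawl dmoz -o 'a.json' -t 'json'    # dmoz is the spider name from the example above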
(4) Creating a spider from a template
scrapy genspider baidu baidu.com

# -*- coding: utf-8 -*-
import scrapy


class BaiduSpider(scrapy.Spider):
    name = "baidu"
    allowed_domains = ["baidu.com"]
    start_urls = (
        'http://www.baidu.com/',
    )

    def parse(self, response):
        pass
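That's it for this part. I remember there being 5 items before, but now I can only recall 4. :-(
Always remember to hit the save button as you go. Losing your work really ruins your mood (⊙o⊙)!
(II) Advanced -- scrapy.contrib.spiders.CrawlSpider
(1) CrawlSpider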
class scrapy.contrib.spiders.CrawlSpider

This is the most commonly used spider for crawling regular websites, as it provides a convenient mechanism for following links by defining a set of rules. It may not be the best suited for your particular web sites or project, but it's generic enough for several cases, so you can start from it and override it as needed for more custom functionality, or just implement your own spider.

Apart from the attributes inherited from Spider (that you must specify), this class supports a new attribute:

rules
    Which is a list of one (or more) Rule objects. Each Rule defines a certain behaviour for crawling the site. Rule objects are described below. If multiple rules match the same link, the first one will be used, according to the order they're defined in this attribute.

This spider also exposes an overrideable method:

parse_start_url(response)
    This method is called for the start_urls responses. It allows parsing the initial responses and must return either an Item object, a Request object, or an iterable containing any of them.
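To make the "first matching rule is used" point concrete, here is a rough sketch in plain Python. It is not how Scrapy implements rules; RULES, pick_callback and the callback name parse_any_php are hypothetical, and each rule is reduced to an allow pattern, an optional deny pattern and a callback name.

```python
import re

# Hypothetical stand-ins for Rule objects, in definition order:
# (allow pattern, deny pattern or None, callback name or None).
RULES = [
    (r'category\.php', r'subsection\.php', None),
    (r'item\.php', None, 'parse_item'),
    (r'\.php', None, 'parse_any_php'),
]


def pick_callback(url):
    """Return (rule index, callback name) for the first rule that accepts url, else None."""
    for index, (allow, deny, callback) in enumerate(RULES):
        # A rule rejects the link if its deny pattern matches.
        if deny is not None and re.search(deny, url):
            continue
        # The first rule whose allow pattern matches wins; later rules are ignored.
        if re.search(allow, url):
            return index, callback
    return None


print(pick_callback('http://www.example.com/item.php?id=1'))
print(pick_callback('http://www.example.com/category.php'))
print(pick_callback('http://www.example.com/about.html'))
```

Note that item.php also matches the catch-all third pattern, but the second rule is listed earlier, so that one is chosen. The order of the rules matters.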
(2) Example
# coding=utf-8
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor


class TestItem(scrapy.Item):
    # Fields must be declared; a bare scrapy.Item() rejects unknown keys.
    id = scrapy.Field()
    name = scrapy.Field()
    description = scrapy.Field()


class TestSpider(CrawlSpider):
    name = 'test'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/']

    rules = (  # a tuple
        # Follow category pages, but skip subsection pages (no callback).
        Rule(LinkExtractor(allow=(r'category\.php', ), deny=(r'subsection\.php', ))),
        # Hand item pages to parse_item.
        Rule(LinkExtractor(allow=(r'item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.log('item page : %s' % response.url)
        item = TestItem()
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID:(\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
        return item
(3) Others
Besides these there is XMLFeedSpider and a few others. I'll look into them when I have time.
class scrapy.contrib.spiders.XMLFeedSpider
class scrapy.contrib.spiders.CSVFeedSpider
class scrapy.contrib.spiders.SitemapSpider
(III) Selectors
>>> from scrapy.selector import Selector
>>> from scrapy.http import HtmlResponse
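You can flexibly combine .css() and .xpath() to quickly select the target data.
!!! Selectors deserve careful study: xpath() and css(), plus more practice with regular expressions.
When selecting by class, prefer css() for the selection, then use xpath() to pick out the element's attributes.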
(IV) Item Pipeline
After an item has been scraped by a spider, it is sent to the Item Pipeline, which processes it through several components that are executed sequentially.
Typical uses of item pipelines are:
• cleansing HTML data
• validating scraped data (checking that the items contain certain fields)
• checking for duplicates (and dropping them)
• storing the scraped item in a database
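Pipeline components only run once they are enabled in the project's settings.py through the ITEM_PIPELINES setting. In the version I'm reading about this is a dict whose integer values decide the order (lower runs first). A sketch, assuming the pipeline classes from the subsections that follow live in a hypothetical scrapy_test/pipelines.py module:

```python
# settings.py (fragment)
# The module path scrapy_test.pipelines is an assumption; adjust it to your project.
ITEM_PIPELINES = {
    'scrapy_test.pipelines.PricePipeline': 300,       # validate first
    'scrapy_test.pipelines.Duplicates': 500,          # then drop duplicates
    'scrapy_test.pipelines.JsonWriterPipeline': 800,  # write what is left
}
```

Items dropped by an earlier component (via DropItem) never reach the later ones, so the validating and de-duplicating pipelines are placed before the one that writes output.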
(1) Validating data
from scrapy.exceptions import DropItem


class PricePipeline(object):
    vat_factor = 1.5

    def process_item(self, item, spider):
        if item['price']:
            # Add VAT when the scraped price does not include it.
            if item['price_excludes_vat']:
                item['price'] *= self.vat_factor
            # Pass the item on to the next pipeline component.
            return item
        else:
            raise DropItem('Missing price in %s' % item)
(2) Writing a JSON file
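import json


class JsonWriterPipeline(object):

    def __init__(self):
        # Text mode, so writing str works on both Python 2 and Python 3.
        self.file = open('json.jl', 'w')

    def process_item(self, item, spider):
        # One JSON object per line (JSON Lines).
        line = json.dumps(dict(item)) + '\n'
        self.file.write(line)
        return item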
(3) Checking for duplicates
from scrapy.exceptions import DropItem


class Duplicates(object):

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem('Duplicate item found : %s' % item)
        else:
            self.ids_seen.add(item['id'])
            return item
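As for writing the data to a database, that should be simple too: just store the item inside the process_item function.
I read all night and got to page 85. That more or less covers the basics.
-- 2014-08-21 06:39:41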
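I haven't actually tried this yet, but a minimal sketch with the standard library's sqlite3 might look like the following. The class name SQLitePipeline, the file name items.db and the table layout are all my own assumptions, and it assumes each item has plain string 'id' and 'name' fields (lists returned by extract() would need to be joined or serialized first).

```python
import sqlite3


class SQLitePipeline(object):
    """Hypothetical pipeline that stores each item's id and name in SQLite."""

    def __init__(self, db_path='items.db'):
        self.db_path = db_path  # assumed file name
        self.conn = None

    def open_spider(self, spider):
        # Called when the spider starts: open the database and create the table.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS items (id TEXT PRIMARY KEY, name TEXT)')

    def close_spider(self, spider):
        # Called when the spider finishes: flush pending writes and close.
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        # Insert the item, overwriting any existing row with the same id.
        self.conn.execute(
            'INSERT OR REPLACE INTO items (id, name) VALUES (?, ?)',
            (item['id'], item['name']))
        return item
```

Opening the connection in open_spider and closing it in close_spider keeps a single connection for the whole crawl instead of reconnecting for every item.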
(V)