A project is created with the scrapy command-line tool:

scrapy startproject tutorial
scrapy.cfg          # Scrapy deployment configuration file
tutorial            # the project's module; code is imported from here
    __init__.py
    items.py        # definition of Items, i.e. the data structures to scrape
    middlewares.py  # definition of Middlewares used during crawling
    pipelines.py    # definition of Pipelines, the data-processing pipelines
    settings.py     # project settings
    spiders         # folder holding the Spiders
        __init__.py
Next, create a Spider. A Spider is a class that subclasses scrapy.Spider, and Scrapy can generate one for us. Enter the project directory and run the genspider command:

cd tutorial
scrapy genspider quotes quotes.toscrape.com

The genspider command takes the Spider's name and the domain to crawl, and creates a quotes.py under the spiders folder with the following content:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        pass
Four attributes deserve attention here:

name: the unique name that identifies the Spider within the project.
allowed_domains: the domains this Spider is allowed to crawl; links outside them are filtered out.
start_urls: the URLs the Spider starts from; the initial requests are generated from this list.
parse: the default callback. When the requests built from start_urls finish downloading, parse() is called with each response; it parses the response, extracts data, and generates further requests.
Next we define the Item, the container that holds the scraped data. Items are created by subclassing scrapy.Item and declaring each field with scrapy.Field. For every quote we want three fields, text, author, and tags, so the Item is defined as follows:
import scrapy

class QuoteItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
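An Item is then used much like a dict; a quick sketch with hypothetical values:

item = QuoteItem()
item['text'] = 'Some quote'   # fields use dict-style access
item['author'] = 'Somebody'
item['tags'] = ['example']
print(item['text'])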
The parse() method is where extraction happens. When the requests built from start_urls finish downloading, parse() is called with each response. Looking at the page source, every quote is wrapped in an element whose class is quote, and inside it sit the text, author, and tags we want. So the plan is: select all the quote nodes first, then pull the three fields out of each quote. A first version of parse() looks like this:
def parse(self, response):
    quotes = response.css('.quote')
    for quote in quotes:
        text = quote.css('.text::text').extract_first()
        author = quote.css('.author::text').extract_first()
        tags = quote.css('.tags .tag::text').extract()
Here quotes holds all the quote nodes on the page, and the for loop visits each quote in turn. The text sits in an element whose class is text, so the CSS selector .text matches it; appending ::text makes the selector return the element's text content instead of the whole node, and extract_first() takes the first match as a string. tags may hold several values, so extract() is used there to get the whole list. For reference, this is the source of a single quote node:
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
    <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
    <span>by <small class="author" itemprop="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
    </span>
    <div class="tags">
        Tags:
        <meta class="keywords" itemprop="keywords" content="change,deep-thoughts,thinking,world">
        <a class="tag" href="/tag/change/page/1/">change</a>
        <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
        <a class="tag" href="/tag/thinking/page/1/">thinking</a>
        <a class="tag" href="/tag/world/page/1/">world</a>
    </div>
</div>
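These selectors are easy to try out interactively before wiring them into the Spider. A minimal sketch of a scrapy shell session (assuming the site is reachable); the five outputs shown below correspond to these five expressions, in order:

scrapy shell http://quotes.toscrape.com/
>>> response.css('.text')
>>> response.css('.text::text')
>>> response.css('.text').extract()
>>> response.css('.text::text').extract()
>>> response.css('.text::text').extract_first()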
[<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' text ')]" data='<span class="text" itemprop="text">“The'>]

[<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' text ')]/text()" data='“The world as we have created it is a pr'>]

['<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>']

['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”']

“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
So the rule of thumb: use extract_first() when a single result is wanted, as with text and author, and extract() when all results are wanted, as with tags; extract_first() also returns None rather than raising an error when nothing matches. With the extraction logic settled, QuotesSpider can be rewritten as follows:
import scrapy
from tutorial.items import QuoteItem

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            item = QuoteItem()
            item['text'] = quote.css('.text::text').extract_first()
            item['author'] = quote.css('.author::text').extract_first()
            item['tags'] = quote.css('.tags .tag::text').extract()
            yield item
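For the first quote on the page, the yielded item would look roughly like this (a sketch matching the HTML shown earlier):

{'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”',
 'author': 'Albert Einstein',
 'tags': ['change', 'deep-thoughts', 'thinking', 'world']}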
Each QuoteItem yielded this way carries one complete quote. This only covers the first page, though; for the rest of the site we must generate follow-up requests. A scrapy.Request needs two essential arguments: url, the page to fetch, and callback, the function invoked with that page's response. Every page of the site shares the same structure, and the extraction of text, author, and tags already lives in parse(), so parse() can serve as the callback again. Appending the following to the end of parse() handles pagination:
next = response.css('.pager .next a::attr(href)').extract_first()
url = response.urljoin(next)
yield scrapy.Request(url=url, callback=self.parse)
The first line selects the next-page link: the ::attr(href) pseudo-selector pulls out the href attribute, and extract_first() takes the first match. Because that href is a relative link, urljoin() converts it into an absolute URL against the current page, so joining http://quotes.toscrape.com/ with /page/2/ gives http://quotes.toscrape.com/page/2/. The last line builds the new request from url and callback, and since the callback is parse() itself, the crawl keeps looping page after page until no next link is left. The complete Spider now reads:
import scrapy
from tutorial.items import QuoteItem

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            item = QuoteItem()
            item['text'] = quote.css('.text::text').extract_first()
            item['author'] = quote.css('.author::text').extract_first()
            item['tags'] = quote.css('.tags .tag::text').extract()
            yield item

        next = response.css('.pager .next a::attr(href)').extract_first()
        url = response.urljoin(next)
        yield scrapy.Request(url=url, callback=self.parse)
Now run the Spider:

scrapy crawl quotes
2017-02-19 13:37:20 [scrapy.utils.log] INFO: Scrapy 1.3.0 started (bot: tutorial)
2017-02-19 13:37:20 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'SPIDER_MODULES': ['tutorial.spiders'], 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'tutorial'}
2017-02-19 13:37:20 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2017-02-19 13:37:20 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-02-19 13:37:20 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-02-19 13:37:20 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-02-19 13:37:20 [scrapy.core.engine] INFO: Spider opened
2017-02-19 13:37:20 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-02-19 13:37:20 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-02-19 13:37:21 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2017-02-19 13:37:21 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/> (referer: None)
2017-02-19 13:37:21 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'author': u'Albert Einstein',
 'tags': [u'change', u'deep-thoughts', u'thinking', u'world'],
 'text': u'\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d'}
2017-02-19 13:37:21 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'author': u'J.K. Rowling',
 'tags': [u'abilities', u'choices'],
 'text': u'\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d'}
...
2017-02-19 13:37:27 [scrapy.core.engine] INFO: Closing spider (finished)
2017-02-19 13:37:27 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2859,
 'downloader/request_count': 11,
 'downloader/request_method_count/GET': 11,
 'downloader/response_bytes': 24871,
 'downloader/response_count': 11,
 'downloader/response_status_count/200': 10,
 'downloader/response_status_count/404': 1,
 'dupefilter/filtered': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 2, 19, 5, 37, 27, 227438),
 'item_scraped_count': 100,
 'log_count/DEBUG': 113,
 'log_count/INFO': 7,
 'request_depth_max': 10,
 'response_received_count': 11,
 'scheduler/dequeued': 10,
 'scheduler/dequeued/memory': 10,
 'scheduler/enqueued': 10,
 'scheduler/enqueued/memory': 10,
 'start_time': datetime.datetime(2017, 2, 19, 5, 37, 20, 321557)}
2017-02-19 13:37:27 [scrapy.core.engine] INFO: Spider closed (finished)
scrapy crawl quotes -o quotes.json

scrapy crawl quotes -o quotes.jl

scrapy crawl quotes -o quotes.jsonlines

scrapy crawl quotes -o quotes.csv
scrapy crawl quotes -o quotes.xml
scrapy crawl quotes -o quotes.pickle
scrapy crawl quotes -o quotes.marshal
scrapy crawl quotes -o ftp://user:pass@ftp.example.com/path/to/quotes.csv
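The -o flag is not the only way to export; the same result can come from Scrapy's regular feed-export settings. A minimal sketch for settings.py, shown only as an alternative:

# settings.py
FEED_FORMAT = 'json'      # output format
FEED_URI = 'quotes.json'  # output location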
Writing to a file is fine for quick jobs, but for cleaning the data or saving it into a database, an Item Pipeline is the right tool. An Item Pipeline is a class that implements process_item(); every pipeline we define must implement this method, and Scrapy calls it once per item, passing two arguments, the item and the spider that produced it. We start with a TextPipeline that trims overly long text fields:
from scrapy.exceptions import DropItem

class TextPipeline(object):
    def __init__(self):
        self.limit = 50

    def process_item(self, item, spider):
        if item['text']:
            if len(item['text']) > self.limit:
                item['text'] = item['text'][0:self.limit].rstrip() + '...'
            return item
        else:
            # DropItem must be raised, not returned, for Scrapy to discard the item
            raise DropItem('Missing Text')
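A quick sanity check of the trimming logic outside Scrapy; a sketch with hypothetical values, using a plain dict in place of a real Item:

pipeline = TextPipeline()
item = {'text': 'x' * 80, 'author': 'A', 'tags': []}
item = pipeline.process_item(item, spider=None)
print(item['text'])   # 50 characters of 'x' followed by '...'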
The logic of process_item() is simple: it receives the item and the spider, and must either return the (possibly modified) item or raise a DropItem exception to discard it. Here, an item with no text is dropped; otherwise its text is cut down to 50 characters and the item is returned. Next we write the results into MongoDB with a MongoPipeline:
import pymongo

class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # read the connection parameters from the global settings
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        # use the item's class name as the collection name
        name = item.__class__.__name__
        self.db[name].insert(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()
Besides process_item(), MongoPipeline uses three other methods from the Item Pipeline API. from_crawler is a @classmethod through which the pipeline receives the crawler and, via it, the global settings, so MONGO_URI and MONGO_DB can be read from settings.py. open_spider is called when the spider starts and opens the MongoDB connection; close_spider is called when the spider finishes and closes it. The main work again happens in process_item(), which inserts each item into the database. To activate TextPipeline and MongoPipeline, register them in ITEM_PIPELINES in settings.py, together with the MongoDB connection settings:
ITEM_PIPELINES = {
    'tutorial.pipelines.TextPipeline': 300,
    'tutorial.pipelines.MongoPipeline': 400,
}
MONGO_URI = 'localhost'
MONGO_DB = 'tutorial'
The keys in ITEM_PIPELINES are the pipeline class paths and the values are priorities; pipelines with smaller numbers run first. Now run the crawl again:

scrapy crawl quotes
Once it finishes, a tutorial database shows up in MongoDB holding all the items: every text longer than 50 characters has been trimmed, and author and tags are stored alongside it.
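To verify, peek into MongoDB directly; a minimal sketch assuming the settings above (the collection is named QuoteItem because MongoPipeline uses the item's class name):

import pymongo

client = pymongo.MongoClient('localhost')
db = client['tutorial']
for doc in db['QuoteItem'].find().limit(3):
    print(doc['text'], doc['author'], doc['tags'])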
This tutorial was first published on Cui Qingcai's personal blog 静觅: Python3网络爬虫开发实战教程 | 静觅