Crawling all of a site's images with Scrapy

Scrapy is a simple, easy-to-use crawler framework written in Python. See the official site for details: http://scrapy.org/python

A while ago I wanted to collect some images to build a photomosaic (see the post on the photomosaic algorithm), but I never found a crawler tool I liked, so I rolled my own. Crawling with Scrapy turned out to be very simple, because it even ships with a built-in image-downloading pipeline, ImagesPipeline. A few lines of code are enough for a working image crawler.

Using Scrapy feels a bit like Ruby (on Rails): create a project, the framework scaffolding is in place, and you only need to fill in your own code in the corresponding files.
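As a sketch, the scaffolding step looks roughly like this (the project name is chosen to match the code below; the exact file layout may vary slightly between Scrapy versions):

```shell
# Generate the project skeleton; Scrapy creates the files we edit below.
scrapy startproject image_downloader
# image_downloader/
# ├── scrapy.cfg
# └── image_downloader/
#     ├── __init__.py
#     ├── items.py
#     ├── pipelines.py
#     ├── settings.py
#     └── spiders/
```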

Add the crawling code in the spider file:

```python
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

from image_downloader.items import ImageDownloaderItem


class ImageDownloaderSpider(CrawlSpider):
    name = "image_downloader"
    allowed_domains = ["sina.com.cn"]
    start_urls = [
        "http://www.sina.com.cn/"
    ]
    # Follow every in-domain link and hand each page to parse_item.
    rules = [Rule(SgmlLinkExtractor(allow=[]), 'parse_item')]

    def parse_item(self, response):
        self.log('page: %s' % response.url)
        hxs = HtmlXPathSelector(response)
        images = hxs.select('//img/@src').extract()
        items = []
        for image in images:
            item = ImageDownloaderItem()
            item['image_urls'] = [image]
            items.append(item)
        return items
```
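The heart of parse_item is the `'//img/@src'` XPath, which collects the src attribute of every image tag on the page. As a stdlib-only illustration of what that expression gathers (ImgSrcExtractor is a made-up name for this sketch, not part of Scrapy):

```python
from html.parser import HTMLParser


class ImgSrcExtractor(HTMLParser):
    """Collects the src attribute of every <img> tag, mirroring
    what the spider's '//img/@src' XPath selects."""

    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            for name, value in attrs:
                if name == 'src':
                    self.srcs.append(value)


parser = ImgSrcExtractor()
parser.feed('<html><body><img src="a.jpg"><p><img src="b.png"></p></body></html>')
print(parser.srcs)  # → ['a.jpg', 'b.png']
```

Note that src values scraped this way can be relative URLs; in practice you may want to resolve them against response.url before downloading.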

Add the fields in the item file:

```python
from scrapy.item import Item, Field


class ImageDownloaderItem(Item):
    image_urls = Field()  # URLs to fetch, consumed by ImagesPipeline
    images = Field()      # filled in by the pipeline with download results
```

Filter and save the images in the pipeline file:

```python
from scrapy.contrib.pipeline.images import ImagesPipeline
from scrapy.exceptions import DropItem
from scrapy.http import Request


class ImageDownloaderPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield Request(image_url)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        return item
```
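item_completed receives a list of (success, info) tuples, one per requested URL; the list comprehension keeps only the storage paths of successful downloads, and the item is dropped when nothing survived. A toy run of that filtering logic with made-up results data:

```python
# Simulated `results` in the shape item_completed receives:
# (success_flag, info) pairs; on success, info carries the stored 'path'.
results = [
    (True, {'url': 'http://example.com/a.jpg', 'path': 'full/a.jpg'}),
    (False, Exception('download failed')),
    (True, {'url': 'http://example.com/b.png', 'path': 'full/b.png'}),
]

# Same comprehension as in the pipeline: keep paths of successful downloads.
image_paths = [x['path'] for ok, x in results if ok]
print(image_paths)  # → ['full/a.jpg', 'full/b.png']
```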

Add the pipelines and the image-filter settings in the settings file:

```python
IMAGES_MIN_HEIGHT = 50
IMAGES_MIN_WIDTH = 50
IMAGES_STORE = 'image-downloaded/'
DOWNLOAD_TIMEOUT = 1200
ITEM_PIPELINES = ['scrapy.contrib.pipeline.images.ImagesPipeline',
                  'image_downloader.pipelines.ImageDownloaderPipeline']
```
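IMAGES_MIN_WIDTH and IMAGES_MIN_HEIGHT tell ImagesPipeline to discard anything smaller than 50×50, which weeds out icons, buttons, and spacer gifs. The check amounts to this predicate (a sketch of the idea, not Scrapy's actual code):

```python
IMAGES_MIN_WIDTH, IMAGES_MIN_HEIGHT = 50, 50


def passes_size_filter(width, height):
    # Images below either threshold are dropped before being stored.
    return width >= IMAGES_MIN_WIDTH and height >= IMAGES_MIN_HEIGHT


print(passes_size_filter(16, 16))    # → False (a 16x16 favicon is dropped)
print(passes_size_filter(640, 480))  # → True
```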

Code download: @github

Scrapy's elegant data flow:

[Figure: scrapy_architecture — the Scrapy architecture diagram]