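Scrapy provides an item pipeline for downloading the images attached to a particular item, for example when you scrape products and also want their images saved locally.
This pipeline, called the Images Pipeline and implemented in the ImagesPipeline class, offers a convenient way to download and store images locally, with some additional features: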
- Converts all downloaded images to a common format (JPG) and mode (RGB)
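Pillow is used to generate thumbnails and to normalize images to JPEG/RGB format, so you need to install this library in order to use the images pipeline. The Python Imaging Library (PIL) works in most cases, but it is known to cause problems in some setups, so we recommend Pillow instead of PIL.
On Windows, pip could not find an installable version of PIL, so I used Pillow instead, and it ran without problems.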
Below is a small demo that scrapes Baidu Tieba.
The spider, baidu.py, in the spiders folder:
    import scrapy

    from BaiduTieba.items import BaidutiebaItem


    class BaiduTieBaSpider(scrapy.spiders.Spider):
        name = 'baidutieba'
        # Pages 1 to 37 of the thread.
        start_urls = ['http://tieba.baidu.com/p/2235516502?see_lz=1&pn=%d' % i for i in range(1, 38)]
        # Maps each image URL to its target file name; read by the pipeline's file_path().
        image_names = {}

        def parse(self, response):
            item = BaidutiebaItem()
            item['image_urls'] = response.xpath("//img[@class='BDE_Image']/@src").extract()
            for index, value in enumerate(item['image_urls']):
                # Number images by page position. Names are unique only if
                # every page holds the same number of images.
                number = self.start_urls.index(response.url) * len(item['image_urls']) + index
                self.image_names[value] = 'full/%04d.jpg' % number
            yield item
Note the import path used for the Item class.
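The file-numbering arithmetic in the spider's parse method can be checked in isolation. The sketch below is a standalone illustration, not part of the project: the helper name image_name and the sample page sizes are hypothetical. It also shows that two pages with different image counts can end up with the same file name.

```python
def image_name(page_index, images_on_page, image_index):
    """Mirror the spider's formula: page_index * images_on_page + image_index."""
    number = page_index * images_on_page + image_index
    return 'full/%04d.jpg' % number


# Third image (index 2) on the first page (index 0), which holds 10 images.
first = image_name(0, 10, 2)
# First image on the second page, which also holds 10 images.
second = image_name(1, 10, 0)
# A second page with only 2 images: its first image also gets number 2.
clash = image_name(1, 2, 0)

print(first, second, clash)
```

If the pages of a thread differ in image count, a name derived from the image URL itself would be a safer choice than this positional number.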
items.py
    import scrapy


    class BaidutiebaItem(scrapy.Item):
        image_urls = scrapy.Field()   # URLs to download, filled in by the spider
        images = scrapy.Field()
        image_paths = scrapy.Field()  # local paths, filled in by the pipeline
ImagePipeline.py
    import scrapy
    # scrapy.contrib.pipeline.images is the old location of this class.
    from scrapy.pipelines.images import ImagesPipeline
    from scrapy.exceptions import DropItem

    from BaiduTieba.spiders.baidu import BaiduTieBaSpider


    class MyImagesPipeline(ImagesPipeline):

        def file_path(self, request, response=None, info=None, *, item=None):
            # Look up the name the spider assigned to this URL (relative to IMAGES_STORE).
            return BaiduTieBaSpider.image_names[request.url]

        def get_media_requests(self, item, info):
            # One download request per image URL.
            for image_url in item['image_urls']:
                yield scrapy.Request(image_url)

        def item_completed(self, results, item, info):
            # Keep only the paths of images that downloaded successfully.
            image_paths = [x['path'] for ok, x in results if ok]
            if not image_paths:
                raise DropItem("Item contains no images")
            item['image_paths'] = image_paths
            return item
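The filtering step in item_completed is easier to follow with sample data. In Scrapy, results is a list of (success, info) tuples, where info is a dict containing a 'path' key when the download succeeded. The sketch below uses made-up URLs and checksums, and a plain Exception as a stand-in for the failure object Scrapy passes for a failed download.

```python
# Hypothetical results list, shaped like the one item_completed receives.
results = [
    (True, {'url': 'http://example.com/a.jpg', 'path': 'full/0000.jpg', 'checksum': 'aaa'}),
    (False, Exception('download failed')),  # stand-in for a failed download
    (True, {'url': 'http://example.com/c.jpg', 'path': 'full/0002.jpg', 'checksum': 'ccc'}),
]

# Same comprehension as in the pipeline: keep the paths of successful downloads only.
image_paths = [x['path'] for ok, x in results if ok]

print(image_paths)
```

Because the `if ok` filter runs before `x['path']` is evaluated, the failed entry is skipped without ever being indexed.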
settings.py
    BOT_NAME = 'BaiduTieba'

    SPIDER_MODULES = ['BaiduTieba.spiders']
    NEWSPIDER_MODULE = 'BaiduTieba.spiders'

    ROBOTSTXT_OBEY = False

    ITEM_PIPELINES = {
        'BaiduTieba.ImagePipeline.MyImagesPipeline': 300,
    }

    IMAGES_STORE = '/baidutieba.01'
IMAGES_STORE is the directory where the downloaded images are saved.
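The names returned by file_path are relative, so each image ends up underneath IMAGES_STORE. A rough sketch of how the two combine, assuming a POSIX-style path and a sample file name chosen for illustration (Scrapy performs the join internally; this only mimics the result):

```python
import posixpath

IMAGES_STORE = '/baidutieba.01'   # value from settings.py
relative_path = 'full/0001.jpg'   # example of a name returned by file_path()

# Join the store directory and the relative name to get the final location.
saved_to = posixpath.join(IMAGES_STORE, relative_path)

print(saved_to)
```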