Learning the Python crawler framework Scrapy: downloading images

Documentation: http://scrapy-chs.readthedocs.io/zh_CN/latest/topics/images.html

Practical example. Goal: scrape the product images from the page http://www.hlhua.com/

  1. Following the documentation, first define an item to hold the image data. For ImagesPipeline to take effect, the item needs a field named image_urls. items.py:

    import scrapy

    class MyItem(scrapy.Item):
        image_urls = scrapy.Field()
        image_paths = scrapy.Field()
        images = scrapy.Field()
  2. Subclass ImagesPipeline to write your own image pipeline. pipelines.py:

    import scrapy
    from scrapy.exceptions import DropItem
    from scrapy.pipelines.images import ImagesPipeline

    class MyImageDownloadPipeLine(ImagesPipeline):

        def get_media_requests(self, item, info):
            for image_url in item['image_urls']:
                yield scrapy.Request(image_url)

        def item_completed(self, results, item, info):
            image_paths = [x['path'] for ok, x in results if ok]
            if not image_paths:
                raise DropItem("Item contains no images")
            item['image_paths'] = image_paths
            return item
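
To make the list comprehension in item_completed concrete: according to the Scrapy documentation, results is a list of (success, info) two-tuples, where info is a dict with keys such as url, path and checksum when the download succeeded. The sketch below uses hand-made sample data (the URL, path and checksum are hypothetical, and a plain Exception stands in for the Twisted Failure that Scrapy would pass), so it runs without Scrapy installed.

```python
# Hand-made stand-in for the `results` argument of item_completed.
# All values are hypothetical; in Scrapy the failed entry would carry a
# twisted Failure object rather than a plain Exception.
results = [
    (True, {"url": "http://example.com/a.jpg",
            "path": "full/aaa.jpg",
            "checksum": "0123abcd"}),
    (False, Exception("download failed")),
]


def collect_image_paths(results):
    """Keep the storage path of every successfully downloaded image."""
    return [info["path"] for ok, info in results if ok]


image_paths = collect_image_paths(results)
print(image_paths)
```

When every download fails, the collected list is empty, which is the case where the pipeline above raises DropItem.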

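The overridden item_completed stores the paths of the downloaded files in the item's image_paths field once the downloads finish; an item with no successfully downloaded image is dropped.

  3. Edit settings.py to enable MyImageDownloadPipeLine. settings.py:

    # coding=utf-8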
    BOT_NAME = 'imagedemo'

    SPIDER_MODULES = ['imagedemo.spiders']
    NEWSPIDER_MODULE = 'imagedemo.spiders'

    # Enable the custom image pipeline
    ITEM_PIPELINES = {'imagedemo.pipelines.MyImageDownloadPipeLine': 1}
    # Directory where the downloaded images are stored
    IMAGES_STORE = 'image'

    ROBOTSTXT_OBEY = True
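
The image pipeline also reads a few optional settings, described on the documentation page linked at the top. The fragment below is only a sketch of lines that could be appended to settings.py; the values are illustrative and are not part of the original project.

```python
# Optional ImagesPipeline settings (illustrative values, not from the original project).

# Skip re-downloading images that were fetched within the last 90 days.
IMAGES_EXPIRES = 90

# Also generate thumbnails; they are stored under thumbs/<name>/ inside IMAGES_STORE.
IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (270, 270),
}

# Ignore images smaller than this size, in pixels.
IMAGES_MIN_WIDTH = 110
IMAGES_MIN_HEIGHT = 110
```

Note that the image pipeline depends on the Pillow library for image processing, so it needs to be installed alongside Scrapy.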
  4. Write the spider that implements the crawling logic. spider.py:

    # coding=utf-8
    from scrapy.spiders import Spider
    from imagedemo.items import MyItem

    class ImageSpider(Spider):
        name = 'hlhua'
        start_urls = ['http://www.hlhua.com/']

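        def parse(self, response):
            # Each product image on the page carries the class "goodsimg".
            for src in response.xpath("//img[@class='goodsimg']/@src").extract():
                item = MyItem()
                # urljoin leaves absolute URLs unchanged and resolves relative ones,
                # since scrapy.Request only accepts absolute URLs.
                item['image_urls'] = [response.urljoin(src)]
                yield item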
  5. Run scrapy crawl hlhua -o images.json. The images are downloaded into image/full/, and images.json is generated to record the image information.
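
The files under image/full/ do not keep their original names. In the Scrapy versions I am aware of, the default ImagesPipeline names each file after the SHA-1 hash of its URL and gives it a .jpg extension. The sketch below mirrors that rule using only the standard library; the function name expected_image_path and the sample URL are hypothetical, and the naming rule itself should be treated as an assumption to check against your installed Scrapy version.

```python
import hashlib


def expected_image_path(url):
    """Mimic the assumed default naming rule: full/<sha1 of the URL>.jpg."""
    image_guid = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return "full/{}.jpg".format(image_guid)


# Hypothetical image URL, used only for illustration.
path = expected_image_path("http://www.example.com/images/goods_1.jpg")
print(path)
```

Joined with IMAGES_STORE = 'image', this gives the location on disk; the same relative path is what item_completed stores in image_paths.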

GitHub: https://github.com/chenglp1215/scrapy_demo/tree/master/imagedemo
