Crawling Yihaodian (yhd.com) product information with Scrapy

Date: 2019-11-12


This article is the author's original work; please credit the source when reposting (silvasong: http://my.oschina.net/sojie/admin/edit-blog?blog=653199)

The previous articles gave a brief analysis of the Scrapy source code. Here I use a simple example to show how to use Scrapy.

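Once you have decided to crawl a site, the first task is to analyse its hierarchy and pick an entry URL. In general, the site's home page is chosen as the starting link.

While analysing Yihaodian I found that it provides a product category page (http://www.yhd.com/marketing/allproduct.html) from which every product category can be obtained. Following each category's link then leads to the products in that category.

Development environment:

Ubuntu, Python 2.7, Scrapy

Scrapy runs on Windows, Mac, and Linux; I chose Ubuntu for development convenience. Scrapy is written in Python, so Python must be installed as well. The last step is installing Scrapy itself.

With the environment set up, the following steps walk through the implementation: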

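1. First, create a crawler project with scrapy startproject yhd.

Running this command generates the same layout as the Scrapy tutorial project, with tutorial replaced by yhd. The main files are:

scrapy.cfg: the Scrapy configuration file; the defaults can be left unchanged.

items.py: defines the data structures used to store scraped data.

pipelines.py: Scrapy pipelines, used to persist the data.

spiders/: the folder that holds the spiders you write yourself.

settings.py: the project settings file.

2. Write items.py. Here I define YhdItem, a subclass of scrapy.Item, which declares the fields to be scraped.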

import scrapy


class YhdItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()       # product name
    price = scrapy.Field()       # product price
    link = scrapy.Field()        # product page URL
    category = scrapy.Field()    # product category
    product_id = scrapy.Field()  # product ID
    img_link = scrapy.Field()    # image URL

3. Write pipelines.py.

The data is persisted to MongoDB, so I wrote a MongoPipeline.

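# The imports were omitted in the original listing; these paths assume the
# standard layout created by "scrapy startproject yhd".
import pymongo

from yhd.items import YhdItem


class MongoPipeline(object):

    collection_name = 'product'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # read the Mongo connection settings from settings.py
        return cls(mongo_uri=crawler.settings.get('MONGO_URI'),
                   mongo_db=crawler.settings.get('MONGO_DB'))

    def open_spider(self, spider):  # connect via the URI and select the db
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):  # write each item to Mongo
        # insert_one replaces the insert() call of older pymongo releases
        if isinstance(item, YhdItem):
            self.db[self.collection_name].insert_one(dict(item))
        else:
            self.db['product_price'].insert_one(dict(item))
        return item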
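The order in which Scrapy calls these hooks can be seen without a running MongoDB. The following is a minimal sketch, not the author's code: FakeSettings, FakeCrawler, InMemoryPipeline and run_lifecycle are hypothetical stand-ins invented for illustration, and a plain dict takes the place of the Mongo database.

```python
class FakeSettings(object):
    """Hypothetical stand-in for crawler.settings; only get() is modelled."""

    def __init__(self, values):
        self._values = values

    def get(self, name, default=None):
        return self._values.get(name, default)


class FakeCrawler(object):
    """Hypothetical stand-in for the crawler object passed to from_crawler."""

    def __init__(self, settings):
        self.settings = settings


class InMemoryPipeline(object):
    """Same hooks as MongoPipeline, but stores documents in a dict of lists."""

    collection_name = 'product'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
        self.db = None
        self.closed = False

    @classmethod
    def from_crawler(cls, crawler):
        # called once, when the pipeline object is created
        return cls(mongo_uri=crawler.settings.get('MONGO_URI'),
                   mongo_db=crawler.settings.get('MONGO_DB'))

    def open_spider(self, spider):
        # called when the spider starts: "connect" to the store
        self.db = {}

    def close_spider(self, spider):
        # called when the spider finishes: release the store
        self.closed = True

    def process_item(self, item, spider):
        # called for every item the spider produces
        self.db.setdefault(self.collection_name, []).append(dict(item))
        return item


def run_lifecycle(items):
    """Call the hooks in the order Scrapy would for a single spider run."""
    settings = FakeSettings({'MONGO_URI': '127.0.0.1', 'MONGO_DB': 'yhd'})
    pipeline = InMemoryPipeline.from_crawler(FakeCrawler(settings))
    pipeline.open_spider(spider=None)
    for item in items:
        pipeline.process_item(item, spider=None)
    pipeline.close_spider(spider=None)
    return pipeline


if __name__ == '__main__':
    result = run_lifecycle([{'title': 'sample product'}])
    print(result.mongo_db, result.db)
```

Returning the item from process_item matters: Scrapy hands the return value to the next pipeline in ITEM_PIPELINES, if there is one.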

4. Write spider.py in the spiders folder.

spider.py uses regular expressions to match the URLs that should be crawled, and XPath to extract data from the HTML.
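The URL-matching half can be tried on its own with the standard re module. The sketch below uses the four patterns from this step's spider rules; the labels, the classify helper and the sample URLs are hypothetical and exist only for illustration. It uses re.search, on the assumption that this mirrors how a link extractor applies its allow patterns.

```python
import re

# (label, pattern) pairs; the patterns are the spider's, the labels are made up
URL_RULES = [
    ('category index', r'http://www.yhd.com/marketing/allproduct.html'),
    ('category', r'^http://list.yhd.com/c.*//$'),
    ('listing page',
     r'^http://list.yhd.com/c.*/b/a\d*-s1-v4-p\d+-price-d0-f0d-m1-rt0-pid-mid0-k/$'),
    ('product', r'^http://item.yhd.com/item/\d+$'),
]


def classify(url):
    """Return the label of the first rule whose pattern matches, else None."""
    for label, pattern in URL_RULES:
        if re.search(pattern, url):
            return label
    return None


if __name__ == '__main__':
    # hypothetical sample URLs shaped like the patterns above
    samples = [
        'http://www.yhd.com/marketing/allproduct.html',
        'http://list.yhd.com/c21306-0//',
        'http://list.yhd.com/c21306-0/b/a-s1-v4-p2-price-d0-f0d-m1-rt0-pid-mid0-k/',
        'http://item.yhd.com/item/12345',
        'http://www.yhd.com/about',
    ]
    for url in samples:
        print(url, '->', classify(url))
```

Only the product pattern is tied to a callback in the spider; the other three exist purely to let the crawler follow links down to the product pages.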

# The imports were omitted in the original listing; these paths assume
# Scrapy 1.0 or later and the standard "yhd" project layout.
from scrapy import Request
from scrapy.linkextractors import LinkExtractor as le
from scrapy.spiders import CrawlSpider, Rule

from yhd.items import YhdItem


class YHDSpider(CrawlSpider):
    name = 'yhd'
    allowed_domains = ['yhd.com']
    start_urls = [
        # seed URL (the value itself is missing from the source text)
    ]

    # regular expressions that select the URLs to crawl
    rules = [
        Rule(le(allow=(r'http://www.yhd.com/marketing/allproduct.html')), follow=True),
        Rule(le(allow=(r'^http://list.yhd.com/c.*//$')), follow=True),
        Rule(le(allow=(r'^http://list.yhd.com/c.*/b/a\d*-s1-v4-p\d+-price-d0-f0d-m1-rt0-pid-mid0-k/$')), follow=True),
        Rule(le(allow=(r'^http://item.yhd.com/item/\d+$')), callback='parse_product'),
    ]

    def parse_product(self, response):
        item = YhdItem()  # create a YhdItem object
        # parse the HTML with XPath
        item['title'] = response.xpath('//h1[@id="productMainName"]/text()').extract()
        price_str = response.xpath('//a[@class="ico_sina"]/@href').extract()[0]
        item['price'] = price_str  # overwritten later in parse_price
        item['link'] = response.url
        pmld = response.url.split('/')[-1]  # product ID: last segment of the URL
        price_url = 'http://gps.yhd.com/restful/detail?mcsite=1&provinceId=12&pmId=' + pmld
        item['category'] = response.xpath('//div[@class="crumb clearfix"]/a[contains(@onclick,"detail_BreadcrumbNav_cat")]/text()').extract()
        item['product_id'] = response.xpath('//p[@id="pro_code"]/text()').extract()
        item['img_link'] = response.xpath('//img[@id="J_prodImg"]/@src').extract()[0]
        # the price is loaded asynchronously, so fetch it by product ID
        request = Request(price_url, callback=self.parse_price)
        request.meta['item'] = item  # carry the partly filled item to parse_price
        yield request

    def parse_price(self, response):
        item = response.meta['item']
        item['price'] = response.body  # raw body of the price response
        return item

    def _process_request(self, request):
        return request
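The price lookup depends on turning a product page URL into a price request URL. As a small sketch of that single step, the helper below reproduces the string handling from parse_product with no Scrapy dependency. The name price_url_for, the constant PRICE_ENDPOINT and the sample product URL are hypothetical; the endpoint string is the one used in the spider.

```python
# endpoint taken from the spider; the product ID is appended as pmId
PRICE_ENDPOINT = 'http://gps.yhd.com/restful/detail?mcsite=1&provinceId=12&pmId='


def price_url_for(product_url):
    """Build the price request URL for a product page URL."""
    pm_id = product_url.split('/')[-1]  # last path segment is the product ID
    return PRICE_ENDPOINT + pm_id


if __name__ == '__main__':
    # hypothetical product URL shaped like the spider's product pattern
    print(price_url_for('http://item.yhd.com/item/12345'))
```

A URL with a trailing slash would yield an empty ID here. The spider's product rule only matches URLs that end in digits, so that case should not reach parse_product.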

5. Write the settings file, settings.py.

# -*- coding: utf-8 -*-

# Scrapy settings for yhd project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'yhd'

SPIDER_MODULES = ['yhd.spiders']  # modules where Scrapy looks for spiders
NEWSPIDER_MODULE = 'yhd.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'yhd (+http://www.yourdomain.com)'

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS=32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY=3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN=16
#CONCURRENT_REQUESTS_PER_IP=16

# Disable cookies (enabled by default)
#COOKIES_ENABLED=False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED=False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'yhd.middlewares.MyCustomSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'yhd.middlewares.MyCustomDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'yhd.pipelines.MongoPipeline': 300,
}   # enable the pipeline
MONGO_URI = '127.0.0.1'  # Mongo connection settings
MONGO_DB = 'yhd'

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
# NOTE: AutoThrottle will honour the standard settings for concurrency and delay
#AUTOTHROTTLE_ENABLED=True
# The initial download delay
#AUTOTHROTTLE_START_DELAY=5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY=60
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG=False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED=True
#HTTPCACHE_EXPIRATION_SECS=0
#HTTPCACHE_DIR='httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES=[]
#HTTPCACHE_STORAGE='scrapy.extensions.httpcache.FilesystemCacheStorage'
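The generated template leaves every throttling option commented out. As a hedged sketch rather than the author's configuration, the fragment below shows what switching a few of them on could look like in settings.py. The setting names come from the template above; the values are simply the template's own sample values, taken here as an assumption and not as tuned recommendations.

```python
# settings.py fragment (sketch): sample values from the template, not tuned
DOWNLOAD_DELAY = 3             # seconds to wait between requests to the same site
AUTOTHROTTLE_ENABLED = True    # let Scrapy adapt the delay to server latency
AUTOTHROTTLE_START_DELAY = 5   # initial download delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 60    # upper bound on the delay under high latency
```

These lines would replace the matching commented-out entries instead of being added a second time.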

Once the code is written, start the crawler with the scrapy crawl yhd command.

The complete source code can be downloaded from my GitHub: https://github.com/silvasong/yhd_scrapy