Scrapy is a step up in crawler development: it crawls target content concurrently via asynchronous I/O, simplifies code logic, and improves development efficiency, which makes it popular among crawler developers. This article walks through building a crawler with Scrapy, using a stock-information website as the example. It is intended for learning and reference only; corrections are welcome.
Scrapy is an application framework written in Python for crawling websites and extracting structured data. It uses the Twisted asynchronous networking framework to handle network communication efficiently. Scrapy architecture:
The main components of the Scrapy architecture are:
- Engine: drives the data flow between all components and triggers events.
- Scheduler: receives requests from the engine and queues them for later download.
- Downloader: fetches pages and hands the responses back to the engine.
- Spiders: user-written classes that parse responses and extract items and follow-up requests.
- Item Pipeline: post-processes the items extracted by spiders (cleaning, validation, persistence).
- Downloader middlewares and spider middlewares: hooks for customizing requests, responses, and spider input/output.
Scrapy data flow: the engine pulls the start requests from the spider and hands them to the scheduler; scheduled requests are sent through the downloader, and each response is routed back to the spider for parsing. Items yielded by the spider go to the item pipeline, while new requests are returned to the scheduler.
From the command line, install Scrapy with pip install scrapy, as shown below:
When a message like the following appears, the installation has succeeded.
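For reference, the install plus a quick sanity check look roughly like this; the exact version printed depends on what pip resolves in your environment:

pip install scrapy
scrapy version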
From the command line, change to the directory where the project should live and create the crawler project with scrapy startproject stockstar, as shown below:
Following the prompt, create a spider from the provided template (command format: scrapy genspider <spider-name> <domain>), as shown below:
Note: the spider name must not be the same as the project name, otherwise an error is raised, as shown below; the full command sequence follows this note.
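Put together, the commands used for this example are (the spider name stock and the domain quote.stockstar.com both come from the spider code shown later):

scrapy startproject stockstar
cd stockstar
scrapy genspider stock quote.stockstar.com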
Open the newly created Scrapy project in PyCharm, as shown below:
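The generated project follows the standard Scrapy layout:

stockstar/
    scrapy.cfg              # deploy configuration
    stockstar/
        __init__.py
        items.py            # item definitions
        middlewares.py      # spider / downloader middlewares
        pipelines.py        # item pipelines
        settings.py         # project settings
        spiders/
            __init__.py
            stock.py        # the spider generated by genspider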
This example crawls the stock IDs and names from the quotes center of a securities website, as shown below:
Once the project has been created from the command line, the basic Scrapy crawler skeleton is in place; what remains is to fill in the business code.
Define the fields that need to be crawled, as shown below:
import scrapy


class StockstarItem(scrapy.Item):
    """
    Define the names of the fields to crawl
    """
    # define the fields for your item here like:
    # name = scrapy.Field()
    stock_type = scrapy.Field()  # stock type
    stock_id = scrapy.Field()    # stock ID
    stock_name = scrapy.Field()  # stock name
The structure of a Scrapy spider is fixed: define a class that inherits from scrapy.Spider, set its attributes (spider name, allowed domains, start URLs), and override the parent's parse method. The crawling logic specific to the target pages goes into parse, as shown below:
import scrapy

from stockstar.items import StockstarItem


class StockSpider(scrapy.Spider):
    name = 'stock'
    allowed_domains = ['quote.stockstar.com']  # allowed domain
    start_urls = ['http://quote.stockstar.com/stock/stock_index.htm']  # starting URL

    def parse(self, response):
        """
        Parse the response
        :param response:
        :return:
        """
        styles = ['沪A', '沪B', '深A', '深B']
        # each board sits in its own <ul id="index_data_N">, so walk the lists by index
        for index, style in enumerate(styles):
            print('******************** crawling ' + style + ' stocks ********************')
            ids = response.xpath(
                '//div[@class="w"]/div[@class="main clearfix"]/div[@class="seo_area"]/div['
                '@class="seo_keywordsCon"]/ul[@id="index_data_' + str(index) + '"]/li/span/a/text()').getall()
            names = response.xpath(
                '//div[@class="w"]/div[@class="main clearfix"]/div[@class="seo_area"]/div['
                '@class="seo_keywordsCon"]/ul[@id="index_data_' + str(index) + '"]/li/a/text()').getall()
            for i in range(len(ids)):
                # create a fresh item per record so earlier yields are not overwritten
                item = StockstarItem()
                item['stock_type'] = style
                item['stock_id'] = str(ids[i])
                item['stock_name'] = str(names[i])
                yield item
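When tuning XPath expressions like these, Scrapy's interactive shell is handy. For example (the shortened selector below assumes the ul id index_data_0 is unique on the page, so it matches the same nodes as the full path):

scrapy shell http://quote.stockstar.com/stock/stock_index.htm
>>> response.xpath('//ul[@id="index_data_0"]/li/span/a/text()').getall()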
The pipeline processes the scraped data; to keep this example simple, it just prints to the console, as shown below:
class StockstarPipeline:
    def process_item(self, item, spider):
        print('stock type>>>>' + item['stock_type']
              + ' stock code>>>>' + item['stock_id']
              + ' stock name>>>>' + item['stock_name'])
        return item
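In a real project the pipeline would typically persist the items instead of printing them. A minimal sketch of a pipeline that appends each item to a JSON Lines file (the file name stocks.jl is an arbitrary choice for illustration):

import json


class JsonWriterPipeline:
    """Hypothetical alternative pipeline: write each item as one JSON line."""

    def open_spider(self, spider):
        self.file = open('stocks.jl', 'w', encoding='utf-8')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

Like the existing pipeline, it would be registered under ITEM_PIPELINES in settings.py. Alternatively, Scrapy's built-in feed exports can dump items without any pipeline code, e.g. scrapy crawl stock -o stocks.csv.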
Note: item fields can only be assigned dict-style, i.e. item['key'] = value; attribute-style assignment (item.key = value) is not supported and raises an error.
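Concretely, with the StockstarItem defined above (the value is illustrative):

item = StockstarItem()
item['stock_id'] = '600000'  # OK: dict-style assignment
item.stock_id = '600000'     # raises AttributeError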
Configure the project through settings.py, including request headers, pipelines, the robots protocol, and so on, as shown below:
# Scrapy settings for stockstar project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'stockstar'

SPIDER_MODULES = ['stockstar.spiders']
NEWSPIDER_MODULE = 'stockstar.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'stockstar (+http://www.yourdomain.com)'

# Obey robots.txt rules (whether to honor the robots protocol)
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    # 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Mobile Safari/537.36',
    # 'Accept-Language': 'en,zh-CN,zh;q=0.9'
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'stockstar.middlewares.StockstarSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'stockstar.middlewares.StockstarDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'stockstar.pipelines.StockstarPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
Because Scrapy spiders run inside the framework rather than as standalone scripts, they are normally launched from the terminal, in the format scrapy crawl <spider-name>, as shown below:
scrapy crawl stock
The output of the run looks like this:
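If you prefer launching from an IDE such as PyCharm, a small run script using Scrapy's CrawlerProcess does the same thing. A minimal sketch, assuming it is saved as main.py at the project root next to scrapy.cfg (both the file name and location are illustrative choices):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

if __name__ == '__main__':
    # load settings.py so the pipeline and request headers apply as in a normal run
    process = CrawlerProcess(get_project_settings())
    process.crawl('stock')  # spider name, as passed to scrapy crawl
    process.start()         # blocks until crawling finishes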
This example is intentionally simple and only illustrates common Scrapy usage; everything crawled is present in the source returned by the first request, i.e. what you see is what you get.
Two small issues are left open:
These two issues will be analyzed further when they come up in later work. To close, a poem by Tao Yuanming, Returning to Dwell in Gardens and Fields, to share with you.
Returning to Dwell in Gardens and Fields (No. 1)
In youth I had no taste for the common world; by nature I loved the hills and mountains. By mistake I fell into the dusty net, and once gone, stayed thirty years.
The caged bird pines for its old forest; the pond fish longs for its former deep. I break ground at the edge of the southern wilds; keeping my simple ways, I return to garden and field.
My square plot is ten-odd mu; my thatched house has eight or nine rooms. Elms and willows shade the rear eaves; peach and plum trees line the front hall.
Hazy, hazy, the distant villages; soft and lingering, the smoke above the lanes. A dog barks deep in the alley; a cock crows atop a mulberry tree.
No dust or clutter inside my gate; in the empty rooms there is leisure to spare. Too long was I shut inside the cage; now I have come back to nature again.