Scrapy (Part 5)
Which Downloader Middleware functions are commonly used?
Set DOWNLOADER_MIDDLEWARES in settings.py and register your own downloader middleware class there.
For details, see https://scrapy-chs.readthedocs.io/zh_CN/1.0/topics/settings.html#concurrent-items
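Registering a custom downloader middleware can be sketched as follows. Note that `myproject` and `AreaMiddleware` are placeholder names, not something this tutorial's project necessarily uses; the integer value orders the middleware among the built-in ones:

```python
# settings.py (sketch): enable a custom downloader middleware.
# 'myproject.middlewares.AreaMiddleware' is a placeholder import path;
# the number (0-1000) controls where it runs in the middleware chain.
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.AreaMiddleware': 543,
}
```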
CONCURRENT_REQUESTS = 16
CONCURRENT_ITEMS = 100
DOWNLOAD_TIMEOUT = 180
DOWNLOAD_DELAY = 0
DOWNLOAD_DELAY = 0.25 # 250 ms of delay
RANDOMIZE_DOWNLOAD_DELAY (default: True). By default, Scrapy does not wait a fixed interval between two requests; instead it waits a random value between 0.5 and 1.5 multiplied by DOWNLOAD_DELAY. When CONCURRENT_REQUESTS_PER_IP is non-zero, the delay is enforced per IP address rather than per website. A spider can also override this setting through its own download_delay attribute.
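The randomized wait described above can be sketched like this (a minimal imitation of the behavior, not Scrapy's actual implementation):

```python
import random

DOWNLOAD_DELAY = 0.25  # base delay of 250 ms

def effective_delay(base=DOWNLOAD_DELAY):
    # With RANDOMIZE_DOWNLOAD_DELAY = True (the default), the wait
    # between two requests is a random value in
    # [0.5 * DOWNLOAD_DELAY, 1.5 * DOWNLOAD_DELAY].
    return random.uniform(0.5, 1.5) * base

delay = effective_delay()
```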
Hands-on Selenium integration: PM2.5 Historical Data_Air Quality Index Historical Data_China Online Air Quality Monitoring and Analysis Pla...
This site's data is transmitted encrypted. By hooking in Selenium we can skip the decryption step and grab the page source after JavaScript rendering, extracting the data directly from it.
The only drawback is that this approach is slow.
This exercise focuses on the Selenium integration, so saving the data is out of scope and the items file does not need to be configured.
# -*- coding: utf-8 -*-
import scrapy


class AqistudySpider(scrapy.Spider):
    name = 'aqistudy'
    # allowed_domains = ['aqistudy.cn']
    start_urls = ['https://www.aqistudy.cn/historydata/']

    def parse(self, response):
        print('Fetching the main city URLs...')
        city_list = response.xpath("//ul[@class='unstyled']/li/a/@href").extract()
        for city_url in city_list[1:3]:
            yield scrapy.Request(url=self.start_urls[0] + city_url,
                                 callback=self.parse_month)

    def parse_month(self, response):
        print('Fetching the month URLs for the current city...')
        month_urls = response.xpath('//ul[@class="unstyled1"]/li/a/@href').extract()
        for month_url in month_urls:
            yield scrapy.Request(url=self.start_urls[0] + month_url,
                                 callback=self.parse_day)

    def parse_day(self, response):
        print('Fetching air quality data...')
        print(response.xpath('//h2[@id="title"]/text()').extract_first() + '\n')
        item_list = response.xpath('//tr')[1:]
        for item in item_list:
            print('day: ' + item.xpath('./td[1]/text()').extract_first() + '\t' +
                  'AQI: ' + item.xpath('./td[2]/text()').extract_first() + '\t' +
                  'quality: ' + item.xpath('./td[3]/span/text()').extract_first() + '\t' +
                  'PM2.5: ' + item.xpath('./td[4]/text()').extract_first() + '\t' +
                  'PM10: ' + item.xpath('./td[5]/text()').extract_first() + '\t' +
                  'SO2: ' + item.xpath('./td[6]/text()').extract_first() + '\t' +
                  'CO: ' + item.xpath('./td[7]/text()').extract_first() + '\t' +
                  'NO2: ' + item.xpath('./td[8]/text()').extract_first() + '\t' +
                  'O3_8h: ' + item.xpath('./td[9]/text()').extract_first())
# -*- coding: utf-8 -*-
# Define here the models for your spider middleware
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
import time

import scrapy
from scrapy import signals
from selenium import webdriver


class AreaSpiderMiddleware(object):
    ......


class AreaDownloaderMiddleware(object):
    ......


class AreaMiddleware(object):

    def process_request(self, request, spider):
        self.driver = webdriver.PhantomJS()
        if 'month' in request.url:
            # Render the page with Selenium, then hand the rendered HTML
            # back to Scrapy instead of letting it download the URL itself.
            self.driver.get(request.url)
            time.sleep(2)
            html = self.driver.page_source
            self.driver.quit()
            return scrapy.http.HtmlResponse(url=request.url, body=html,
                                            request=request, encoding='utf-8')
# -*- coding: utf-8 -*-
# @Time   : 2018/11/12 16:56
# @Author : wjh
# @File   : main.py
from scrapy.cmdline import execute

execute(['scrapy', 'crawl', 'aqistudy'])
The output is as follows: