Scrapy framework: automatic full-site crawling with CrawlSpider

Date: 2019-11-17


Ways to crawl an entire site

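    1. Crawl the whole site recursively, depth-first or breadth-first, issuing each request by hand with scrapy.Request (see the related post on full-site image crawling).

    2. For sites whose URLs follow a regular pattern, use CrawlSpider to crawl the whole site automatically.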
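The first approach is, at its core, a graph traversal. A minimal sketch of the idea without scrapy or any network access, assuming a hypothetical in-memory site map SITE (the URLs and the crawl function are made up for illustration): taking pending URLs from the front of the queue gives breadth-first order, taking them from the back gives depth-first order.

```python
from collections import deque

# Hypothetical in-memory "site": each URL maps to the links found on that page.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/a1", "/a2"],
    "/b": ["/b1"],
    "/a1": [],
    "/a2": [],
    "/b1": ["/"],  # links back to the start page
}


def crawl(start, breadth_first=True):
    """Return the URLs in the order they are visited."""
    seen = {start}            # never request the same page twice
    pending = deque([start])
    order = []
    while pending:
        # Front of the queue: breadth-first. Back of the queue: depth-first.
        url = pending.popleft() if breadth_first else pending.pop()
        order.append(url)
        for link in SITE.get(url, []):
            if link not in seen:
                seen.add(link)
                pending.append(link)
    return order


bfs_order = crawl("/")
dfs_order = crawl("/", breadth_first=False)
```

In a real spider the loop body would be a callback that yields a new scrapy.Request for every link it finds; the seen set stands in for duplicate filtering.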

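CrawlSpider is a subclass of Spider. Like Spider, it is one of scrapy's spider classes, but as a subclass it adds a feature of its own: link extraction through LinkExtractor. Spider is the base class of every spider and is designed only to crawl the pages listed in start_urls; when the crawl has to continue with URLs extracted from the pages already fetched, CrawlSpider is the better fit.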
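To make the link-extraction step concrete, here is a small sketch that only imitates the filtering part with the standard re module. It assumes, as I understand LinkExtractor's allow option, that a pattern matches when it is found anywhere in the absolute URL; the URLs and the filter_links helper are hypothetical and this is not scrapy's implementation.

```python
import re

# Hypothetical absolute URLs, as they might be collected from one page.
urls = [
    "http://www.xxx.com/Items/1.html",
    "http://www.xxx.com/about.html",
    "http://www.xxx.com/list?page=2",
]


def filter_links(urls, allow):
    """Keep the URLs in which the pattern is found anywhere (search, not full match)."""
    pattern = re.compile(allow)
    return [u for u in urls if pattern.search(u)]


items = filter_links(urls, r"Items/")
pages = filter_links(urls, r"page=\d+")

# An unescaped dot matches any character, so escape it when a literal dot is meant.
loose = filter_links(["http://www.xxx.com/q/1xshtml"], r"\d+.shtml")
strict = filter_links(["http://www.xxx.com/q/1xshtml"], r"\d+\.shtml")
```

Because the pattern is searched rather than anchored, a short fragment such as Items/ is enough to select a whole family of URLs.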

Creating the project

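# Create the project; the name CrawlSpiderPro is arbitrary
scrapy startproject CrawlSpiderPro
# Change into the project directory
cd CrawlSpiderPro
# Generate the spider file; unlike an ordinary spider, pass the option "-t crawl"
scrapy genspider -t crawl crawlSpiderTest www.xxx.com
# Run the spider
scrapy crawl crawlSpiderTest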

The generated spider file, explained

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class CrawlspidertestSpider(CrawlSpider):
    name = 'crawlSpiderTest'
    allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.xxx.com/']
    # rules holds the crawling rules; each Rule is one rule, and several can be defined
    rules = (
        # Rule ties a link extractor to a callback;
        # LinkExtractor extracts the full URLs that match the allow regex;
        # callback names the method that parses the responses for those URLs;
        # follow controls whether the rules are applied again to the pages those URLs lead to;
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        #item['domain_id'] = response.xpath('//input[@id="sid"]/@value').get()
        #item['name'] = response.xpath('//div[@id="name"]').get()
        #item['description'] = response.xpath('//div[@id="description"]').get()
        return item
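The follow flag is the least obvious part of a rule, so here is a toy model of it. This is a sketch under stated assumptions, not how scrapy is implemented: PAGES is a hypothetical paginated site in which each listing page links only to its neighbours, and crawl returns the URLs that would reach the callback.

```python
import re
from collections import deque

# Hypothetical site: every listing page links only to the pages next to it.
PAGES = {
    "list?page=1": ["list?page=2", "about"],
    "list?page=2": ["list?page=1", "list?page=3"],
    "list?page=3": ["list?page=2", "list?page=4"],
    "list?page=4": ["list?page=3"],
    "about": [],
}


def crawl(start, allow, follow):
    """Return the URLs handed to the callback, in discovery order."""
    pattern = re.compile(allow)
    seen = {start}
    matched = []
    queue = deque([start])  # pages whose links still have to be extracted
    while queue:
        url = queue.popleft()
        for link in PAGES.get(url, []):
            if link in seen or not pattern.search(link):
                continue
            seen.add(link)
            matched.append(link)
            if follow:
                # follow=True: extract links from the matched page as well
                queue.append(link)
    return matched


only_first_hop = crawl("list?page=1", r"page=\d+", follow=False)
whole_listing = crawl("list?page=1", r"page=\d+", follow=True)
```

With follow=False only the links visible on the start page are collected; with follow=True extraction continues on each matched page until no new links turn up.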

Full-site crawl example: the Dongguan Sunshine site (http://wz.sun0769.com/index.php/question/report?page=)

    1. The spider script, crawlSpiderTest.py

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from CrawlSpiderPro.items import CrawlspiderproItem


class CrawlspidertestSpider(CrawlSpider):
    name = 'crawlSpiderTest'
    # allowed_domains = ['www.xxx.com']

    start_urls = ['http://wz.sun0769.com/index.php/question/report?page=']
    # rules holds the crawling rules; each Rule is one rule, and several can be defined
    rules = (
        # Rule ties a link extractor to a callback;
        # LinkExtractor extracts the full URLs that match the allow regex;
        # callback names the method that parses the responses for those URLs;
        # follow controls whether the rules are applied again to the pages those URLs lead to;
        # when a callback is given and follow is omitted, follow defaults to False;
        Rule(LinkExtractor(allow=r'page=\d+'), callback='parse_item', follow=False),  # pagination links
        Rule(LinkExtractor(allow=r'question/\d+/\d+\.shtml'), callback='parse_detail'),  # detail pages
    )

    def parse_item(self, response):
        print(response)
        # This path still contains tbody; drop those steps if the raw HTML has no tbody
        tr_list = response.xpath('//*[@id="morelist"]/div/table[2]/tbody/tr/td/table/tbody/tr')

        for tr in tr_list:
            item = CrawlspiderproItem()  # a fresh item for every row
            item['identifier'] = tr.xpath('./td[1]/text()').extract_first()  # report number
            item['title'] = tr.xpath('./td[2]/a[2]/text()').extract_first()  # title, relative to the row
            yield item

    def parse_detail(self, response):
        print(12345678765)
        item = CrawlspiderproItem()
        # tbody is left out of these paths: XPath on the raw response does not see it
        item['identifier'] = response.xpath('/html/body/div[9]/table[1]/tr/td[2]/span[2]/text()').extract_first().split(':')[-1]
        item['content'] = "".join(response.xpath('/html/body/div[9]/table[2]//text()').extract())

        yield item

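The remark in parse_detail about tbody deserves a short illustration. Browsers add a tbody element to tables when they build the DOM, so an XPath copied from the developer tools can contain a step that is missing from the HTML the server actually sends. The sketch below uses the standard library's xml.etree.ElementTree rather than scrapy's selectors, and the markup in RAW is hypothetical; whether this particular site sends tbody has not been checked here.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup as a server might send it: no <tbody> inside <table>.
RAW = (
    "<html><body><table>"
    "<tr><td>1001</td><td>Title A</td></tr>"
    "</table></body></html>"
)
root = ET.fromstring(RAW)  # root is the <html> element

# Path copied from a browser, with the tbody step the browser added.
with_tbody = root.findall("./body/table/tbody/tr")

# Same path without tbody, matching the raw markup.
without_tbody = root.findall("./body/table/tr")

# Paths starting with "./" are evaluated relative to the current row.
cells = [td.text for td in without_tbody[0].findall("./td")]
```

The same reasoning applies to the row loop in parse_item: a path that starts with "/" is evaluated from the document root, while "./" keeps it relative to the current row.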

    2. Field definitions in items.py

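import scrapy


# Two separate classes would also work: store each kind of record on its own,
# match them up in the pipeline by the identifier field, then persist them
class CrawlspiderproItem(scrapy.Item):
    # report number
    identifier = scrapy.Field()
    # title
    title = scrapy.Field()
    # content
    content = scrapy.Field()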


    3. The pipeline in pipelines.py

# Custom persistence step; here the item is only printed
class CrawlspiderproPipeline(object):
    def process_item(self, item, spider):
        print(item)
        return item

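The comment in items.py mentions matching list-page and detail-page records by their identifier inside the pipeline. One possible sketch of that idea follows; the class name MergeByIdentifierPipeline and the sample records are hypothetical, and it assumes both callbacks produce exactly the same identifier string for the same report.

```python
class MergeByIdentifierPipeline:
    """Hypothetical pipeline that merges records sharing an identifier."""

    def __init__(self):
        self.records = {}  # identifier -> merged fields

    def process_item(self, item, spider):
        data = dict(item)  # works for scrapy items and plain dicts alike
        key = data.get("identifier")
        if key is not None:
            # Fields from the list page and the detail page land in one record.
            self.records.setdefault(key, {}).update(data)
        return item


# Usage with plain dicts standing in for scraped items.
pipeline = MergeByIdentifierPipeline()
pipeline.process_item({"identifier": "1001", "title": "Title A"}, spider=None)
pipeline.process_item({"identifier": "1001", "content": "Body text"}, spider=None)
merged = pipeline.records["1001"]
```

Like any pipeline, it only runs once it is registered in ITEM_PIPELINES; writing the merged records to storage would fit naturally in a close_spider method.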

    4. settings.py configuration

# Spoof the User-Agent
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
# Do not obey robots.txt
ROBOTSTXT_OBEY = False
# Log level
LOG_LEVEL = 'ERROR'

# Enable the item pipeline
ITEM_PIPELINES = {
   'CrawlSpiderPro.pipelines.CrawlspiderproPipeline': 300,
}