Concept: an application framework written for crawling website data; it integrates the relevant functionality and provides a highly reusable project template.
Features: the Scrapy framework provides high-performance asynchronous downloading, parsing, persistent storage, and more.
Scrapy's main components:

- Scrapy Engine: handles the data flow of the whole system and triggers events (the core of the framework).
- Scheduler: accepts requests sent over by the engine, pushes them into a queue, and hands them back when the engine asks again. You can think of it as a priority queue of URLs (the addresses, or links, of the pages to crawl): it decides which URL to crawl next and also removes duplicate URLs.
- Downloader: downloads the page content and returns it to the spiders (the Scrapy downloader is built on the efficient asynchronous Twisted model).
- Spiders: do the main work, extracting the information you need, i.e. the so-called items, from specific pages. Links can also be extracted from them to let Scrapy go on and crawl the next page.
- Item Pipeline: processes the items the spiders extract from pages. Its main jobs are persisting items, validating them, and cleaning out unwanted data. After the spider has parsed a page, the items are sent to the pipeline and processed in several fixed stages.
Installation: install Scrapy with pip (`pip install scrapy`).
After a successful installation, typing `scrapy` in a terminal/command prompt checks whether it installed correctly.
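The installation can also be checked from inside Python; a minimal check:

```python
# Import Scrapy and print the installed version
import scrapy

print(scrapy.__version__)
```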
Create a new project:

scrapy startproject project_name
Directory structure:
- scrapy.cfg: configuration file
- items.py: defines the data-storage template, used for structured data
- pipelines.py: persists the data
- settings.py: configuration file, e.g. recursion depth, concurrency, download delay, etc.
- spiders: the spider directory, e.g. create files here and write the spiders' parsing rules
Contents of the first spider file we create (such a file is typically generated with the `scrapy genspider` command):
```python
# -*- coding: utf-8 -*-
import scrapy


class FirstSpider(scrapy.Spider):
    # Name of the spider: used to locate this specific spider file
    name = 'first'
    # Allowed domains: only pages under these domains may be crawled
    allowed_domains = ['https://www.qiushibaike.com']
    # Start URLs: the URLs of the pages this project is going to crawl
    start_urls = ['https://www.qiushibaike.com/']

    # Parse method: parses the specified content out of the fetched page data
    # response: the response object returned when a request made from the start_urls list succeeds
    # The return value of parse must be an iterable or None
    def parse(self, response):
        pass
```
Write the crawling logic in the generated spider file to carry out the crawl. Before running it, adjust the following options in settings.py:
```python
# Identify the request sender as a browser (replace the default user agent)
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'

# Obey robots.txt rules
# Set to False so the crawler does not strictly follow the site's robots protocol
ROBOTSTXT_OBEY = False
```
When parsing the crawled content, it is recommended to use XPath to pick out the specified parts of the page:
```python
# -*- coding: utf-8 -*-
import scrapy


class FirstSpider(scrapy.Spider):
    name = 'first'
    # allowed_domains = ['www.qiushibaike.com/text']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        div_list = response.xpath('//div[@id="content-left"]/div')
        f = open('段子.txt', 'w', encoding='utf-8')
        count = 0
        for div in div_list:
            # Content matched by xpath is stored in Selector objects
            # extract() pulls the stored data values out of the Selector objects
            # extract_first() takes the first value, equivalent to extract()[0]
            author = div.xpath('./div/a[2]/h2/text()').extract_first().strip()
            # author = div.xpath('./div/a[2]/h2/text()').extract()[0]
            content = div.xpath('.//div[@class="content"]/span/text()').extract_first().strip()
            count += 1
            f.write(author + ':\n' + content + '\n---------------分割线--------------\n\n\n')
        f.close()
        print('共抓取到:', count)
```
Alternatively, for terminal-command-based storage, have parse return a list of dicts (the return value must be an iterable), which can then be exported with a terminal command:

```python
import scrapy


class FirstSpider(scrapy.Spider):
    name = 'first'
    # allowed_domains = ['www.qiushibaike.com/text']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        div_list = response.xpath('//div[@id="content-left"]/div')
        data_list = []
        for div in div_list:
            # Content matched by xpath is stored in Selector objects
            # extract() pulls the stored data values out of the Selector objects
            # extract_first() takes the first value, equivalent to extract()[0]
            author = div.xpath('./div/a[2]/h2/text()').extract_first().strip()
            # author = div.xpath('./div/a[2]/h2/text()').extract()[0]
            content = div.xpath('.//div[@class="content"]/span/text()').extract_first().strip()
            dic = {
                "author": author,
                "content": content
            }
            data_list.append(dic)
        return data_list
```
Command:
scrapy crawl first -o qiubai.csv --nolog
Pipeline-based storage

Workflow for implementing pipeline-based data storage:
In items.py:
```python
import scrapy


class QiushiItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    author = scrapy.Field()
    content = scrapy.Field()
```
In pipelines.py:
```python
class QiushiPipeline(object):
    file = None

    def open_spider(self, spider):
        # Called only once during the whole crawl, so the file can be opened here
        self.file = open('qiubai.txt', 'w', encoding='utf-8')
        print('开始爬虫')

    def process_item(self, item, spider):
        # Receives the item objects submitted by the spider file and persists the page data they hold
        # The item parameter is the received item object
        # This method runs once every time the spider submits an item to the pipeline
        author = item['author']
        content = item['content']
        self.file.write(author + ':\n' + content + '\n-------------------\n\n\n')
        return item

    def close_spider(self, spider):
        # Called only once, when the crawl ends
        print('爬虫结束')
        self.file.close()
```
In the spider file:
```python
import scrapy

from qiushi.items import QiushiItem


class FirstSpider(scrapy.Spider):
    name = 'first'
    # allowed_domains = ['www.qiushibaike.com/text']
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        div_list = response.xpath('//div[@id="content-left"]/div')
        for div in div_list:
            author = div.xpath('./div/a[2]/h2/text()').extract_first().strip()
            # author = div.xpath('./div/a[2]/h2/text()').extract()[0]
            content = div.xpath('.//div[@class="content"]/span/text()').extract_first().strip()
            # 1. Store the parsed page data in an item object
            item = QiushiItem()
            item['author'] = author
            item['content'] = content
            # 2. Submit the item object to the pipeline
            yield item
```
In the settings file, uncomment the ITEM_PIPELINES block (around line 67):
```python
ITEM_PIPELINES = {
    'qiushi.pipelines.QiushiPipeline': 300,
}
```
Persisting to a MySQL database is not much different from the pipeline-based storage above; you just need to write the pymysql connection and database operations in pipelines.py:
```python
import pymysql


class QiubaiPipeline(object):
    conn = None
    cursor = None

    def open_spider(self, spider):
        print('爬虫开始')
        self.conn = pymysql.connect(host='127.0.0.1', port=3306, user='root',
                                    password='123456', db='qiubai')

    def process_item(self, item, spider):
        sql = 'insert into qiubai(author,content) values ("%s","%s")' % (item['author'], item['content'])
        self.cursor = self.conn.cursor()
        try:
            self.cursor.execute(sql)
            self.conn.commit()
        except Exception as e:
            print(e)
            self.conn.rollback()
        return item

    def close_spider(self, spider):
        print('爬虫结束')
        self.cursor.close()
        self.conn.close()
```
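The pipeline above assumes that a `qiubai` database with a `qiubai` table (columns `author` and `content`) already exists. A minimal one-off setup sketch with pymysql, reusing the same connection parameters (the column types are assumptions):

```python
import pymysql

# Create the table the pipeline writes to; credentials mirror the pipeline above
conn = pymysql.connect(host='127.0.0.1', port=3306, user='root',
                       password='123456', db='qiubai')
try:
    with conn.cursor() as cursor:
        cursor.execute(
            'CREATE TABLE IF NOT EXISTS qiubai ('
            'id INT PRIMARY KEY AUTO_INCREMENT, '
            'author VARCHAR(100), '
            'content TEXT)'
        )
    conn.commit()
finally:
    conn.close()
```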
Install the Redis database:
cd redis-5.0.3
make
./redis-server ../redis.conf
Basic Redis usage:
```
127.0.0.1:6379> set name 'hahha'
OK
127.0.0.1:6379> get name
"hahha"
```
Storing to Redis works the same way; in pipelines.py, use the redis client to write each item:

```python
import redis


class QiubaiPipeline(object):
    conn = None

    def open_spider(self, spider):
        print('爬虫开始')
        self.conn = redis.Redis(host='127.0.0.1', port=6379)

    def process_item(self, item, spider):
        dic = {
            'author': item['author'],
            'content': item['content']
        }
        self.conn.lpush('data', dic)
        return item

    def close_spider(self, spider):
        print('爬虫结束')
```
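Note that whether lpush accepts a plain dict depends on the redis-py version: recent versions only accept bytes, strings and numbers. A safer sketch (the class name `QiubaiRedisJsonPipeline` is just illustrative) serializes the item to JSON first:

```python
import json

import redis


class QiubaiRedisJsonPipeline(object):
    """Variant of the Redis pipeline that stores each item as a JSON string."""
    conn = None

    def open_spider(self, spider):
        self.conn = redis.Redis(host='127.0.0.1', port=6379)

    def process_item(self, item, spider):
        dic = {
            'author': item['author'],
            'content': item['content'],
        }
        # json.dumps returns a string, which every redis-py version accepts
        self.conn.lpush('data', json.dumps(dic, ensure_ascii=False))
        return item
```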
Requirement: store the crawled data to disk, MySQL and Redis at the same time.
pipelines.py
```python
import redis


class QiubaiPipeline(object):
    conn = None

    def open_spider(self, spider):
        print('爬虫开始')
        self.conn = redis.Redis(host='127.0.0.1', port=6379)

    def process_item(self, item, spider):
        dic = {
            'author': item['author'],
            'content': item['content']
        }
        # self.conn.lpush('data', dic)
        print('数据写入到redis数据库中')
        return item

    def close_spider(self, spider):
        print('爬虫结束')


class QiubaiFiles(object):
    def process_item(self, item, spider):
        print('数据写入到磁盘文件中')
        return item


class QiubaiMySQL(object):
    def process_item(self, item, spider):
        print('数据写入到mysql数据库中')
        return item
```
settings.py
Register all three pipeline classes; the number is the priority, and pipelines with lower values run first:

```python
ITEM_PIPELINES = {
    'qiubai.pipelines.QiubaiPipeline': 300,
    'qiubai.pipelines.QiubaiFiles': 400,
    'qiubai.pipelines.QiubaiMySQL': 500
}
```
By sending requests manually, data can be crawled from multiple URLs:
```python
import scrapy

from qiushiPage.items import QiushipageItem


class QiushiSpider(scrapy.Spider):
    name = 'qiushi'
    # allowed_domains = ['https://www.qiushibaike.com/text/']
    start_urls = ['https://www.qiushibaike.com/text/']
    pageNum = 1
    url = 'https://www.qiushibaike.com/text/page/%d/'

    def parse(self, response):
        div_list = response.xpath('//div[@id="content-left"]/div')
        for div in div_list:
            author = div.xpath('./div/a[2]/h2/text()').extract_first().strip()
            content = div.xpath('.//div[@class="content"]/span/text()').extract_first()
            item = QiushipageItem()
            item['author'] = author
            item['content'] = content
            yield item

        # Crawl the remaining pages by sending requests manually
        # Check whether the page number is still within the first 13 pages
        if self.pageNum <= 13:
            self.pageNum += 1
            new_url = format(self.url % self.pageNum)
            # callback is the callback function; the second page is parsed the same way as the
            # first, so parse can be reused, or a separate parsing function can be defined
            yield scrapy.Request(url=new_url, callback=self.parse)
```
To send a POST request with Scrapy, you must override the parent class's start_requests method:
```python
import scrapy


class PostRequestSpider(scrapy.Spider):
    name = 'post_request'
    # allowed_domains = ['www.baidu.com']
    start_urls = ['https://fanyi.baidu.com/sug']

    # Sending a POST request requires overriding start_requests of the parent class
    def start_requests(self):
        data = {
            'kw': 'dog'
        }
        for url in self.start_urls:
            # Option 1
            # yield scrapy.Request(url=url, callback=self.parse, method='post')
            # Option 2
            yield scrapy.FormRequest(url=url, callback=self.parse, formdata=data)

    def parse(self, response):
        print(response.text)
```
There is no need to extract or store cookies by hand: scrapy.Request stores cookies automatically, and the stored cookies are sent along with subsequent requests.
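For example, a log-in-then-crawl flow can rely on this behaviour: the cookie set by the login response is stored automatically and sent with the follow-up request. A minimal sketch (the URLs and form fields below are placeholders, not a real site):

```python
import scrapy


class LoginSpider(scrapy.Spider):
    """Sketch: log in first, then crawl a page that requires the session cookie."""
    name = 'login_demo'
    start_urls = ['https://www.example.com/login']

    def start_requests(self):
        # Placeholder form data; replace with the site's real login fields
        data = {'username': 'xxx', 'password': 'xxx'}
        for url in self.start_urls:
            yield scrapy.FormRequest(url=url, formdata=data, callback=self.after_login)

    def after_login(self, response):
        # The cookie from the login response is reused automatically here;
        # no manual cookie extraction or storage is needed
        yield scrapy.Request(url='https://www.example.com/profile', callback=self.parse_profile)

    def parse_profile(self, response):
        print(response.text)
```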
Scrapy changes the request IP through a downloader middleware. In middlewares.py you can define a class that implements a process_request method with three parameters (self, request, spider).
The IP is changed by setting the request.meta['proxy'] attribute; after that, enable the downloader middleware in settings.py.
middlewares.py
```python
class MyPro(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = 'http://61.166.153.167:8080'
```
settings.py
```python
DOWNLOADER_MIDDLEWARES = {
    'postDemo.middlewares.MyPro': 543,
}
```
With this in place, the request IP is changed automatically whenever requests are made.
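A common extension is to rotate over a small pool of proxies instead of hard-coding a single one; a sketch (the proxy addresses below are placeholders):

```python
import random


class RandomProxyMiddleware(object):
    """Downloader middleware that picks a proxy at random for each request."""

    # Placeholder proxy list; replace with working proxy servers
    PROXIES = [
        'http://61.166.153.167:8080',
        'http://111.29.3.220:8080',
    ]

    def process_request(self, request, spider):
        request.meta['proxy'] = random.choice(self.PROXIES)
```

Register it in DOWNLOADER_MIDDLEWARES just like the middleware above.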
Log levels (kinds of log output): ERROR, WARNING, INFO, DEBUG.
```python
# Have the terminal output only log messages of the specified level
LOG_LEVEL = 'ERROR'
```
You can also direct the log output to a file instead of the screen; likewise, this is done by adding a LOG_FILE setting to settings.py:
```python
LOG_FILE = 'log.txt'
```
Request parameter passing (handing an item to a callback through meta):

```python
import scrapy

from moviePro.items import MovieproItem


class MovieSpider(scrapy.Spider):
    name = 'movie'
    # allowed_domains = ['http://www.55xia.com']
    start_urls = ['http://www.55xia.com/movie/']

    def parseMoviePage(self, response):
        # Take the item out of meta
        item = response.meta['item']
        direct = response.xpath('//html/body/div[1]/div/div/div[1]/div[1]/div[2]/table/tbody/tr[1]/td[2]//text()').extract_first()
        country = response.xpath('//html/body/div[1]/div/div/div[1]/div[1]/div[2]/table/tbody/tr[4]/td[2]/a/text()').extract_first()
        movie_referral = response.xpath('/html/body/div[1]/div/div/div[1]/div[2]/div[2]/p/text()').extract_first()
        download_url = response.xpath('//td[@class="text-break"]/div/a[@rel="nofollow"]/@href').extract_first()
        password = response.xpath('//td[@class="text-break"]/div/strong/text()').extract_first()
        download = '连接:%s密码:%s' % (download_url, password)
        item['download'] = download
        item['country'] = country
        item['direct'] = direct
        item['movie_referral'] = movie_referral
        yield item

    def parse(self, response):
        div_list = response.xpath('//html/body/div[1]/div[1]/div[2]/div')
        for div in div_list:
            name = div.xpath('.//div[@class="meta"]/h1/a/text()').extract_first()
            parse_url = div.xpath('.//div[@class="meta"]/h1/a/@href').extract_first()
            genre = div.xpath('.//div[@class="otherinfo"]//text()').extract()
            genre = '|'.join(genre)
            url = 'http:%s' % parse_url
            item = MovieproItem()
            item['name'] = name
            item['genre'] = genre
            # Request parameter passing: the two callbacks parse different pages but the data must
            # be saved together, so the item is passed to the callback through meta and retrieved
            # there with response.meta['item']
            # The meta parameter must be a dict
            yield scrapy.Request(url=url, callback=self.parseMoviePage, meta={'item': item})
```
Problem: what if we want to crawl all of a site's data?
Solution: send requests manually page by page (as shown earlier), or use a CrawlSpider.
CrawlSpider concept: a CrawlSpider is really just a subclass of Spider with stronger functionality (link extractors and rule parsers). Create one with:
scrapy genspider -t crawl chouti dig.chouti.com
Link extractor: as the name implies, it is used to extract the specified links (URLs).
```python
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ChoutiSpider(CrawlSpider):
    name = 'chouti'
    # allowed_domains = ['dig.chouti.com']
    start_urls = ['https://dig.chouti.com/']

    # Instantiate a link extractor object
    # Link extractor: as the name implies, it extracts the specified links (URLs)
    # allow parameter: takes a regular expression
    # The link extractor extracts the links in the page that match the regular expression
    # All extracted links are handed to the rule parser
    link = LinkExtractor(allow=r'/all/hot/recent/\d+')

    rules = (
        # Instantiate a rule parser object
        # After receiving the links sent by the link extractor, the rule parser requests those links,
        # fetches the page content and parses it according to the specified rule
        # callback: the parsing rule to apply (a method/function)
        # follow: whether to keep applying the link extractor to the pages behind the extracted links
        Rule(link, callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        print(response)  # the response data can be parsed here
```
Distributed crawling:
Install the scrapy-redis component:

pip install scrapy-redis
In the spider file, import RedisCrawlSpider and make the spider inherit from it:

```python
from scrapy_redis.spiders import RedisCrawlSpider


class QiubaiSpider(RedisCrawlSpider):
    pass
```
```python
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from scrapy_redis.spiders import RedisCrawlSpider

from redisPro.items import RedisproItem


class QiubaiSpider(RedisCrawlSpider):
    name = 'qiubai'
    # allowed_domains = ['https://www.qiushibaike.com/pic/']
    # start_urls = ['https://www.qiushibaike.com/pic/']

    # Name of the scheduler's queue; plays the same role as start_urls
    redis_key = 'qiubaispider'

    rules = (
        Rule(LinkExtractor(allow=r'/pic/page/\d+'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        div_list = response.xpath('//div[@id="content-left"]/div')
        for div in div_list:
            img_url = 'https:' + div.xpath('.//div[@class="thumb"]/a/img/@src').extract_first()
            item = RedisproItem()
            item['img_url'] = img_url
            yield item
```
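The RedisproItem referenced above is assumed to be defined in items.py roughly like this:

```python
import scrapy


class RedisproItem(scrapy.Item):
    # URL of the scraped picture
    img_url = scrapy.Field()
```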
In settings.py, register the shared pipeline provided by the scrapy-redis component:

```python
ITEM_PIPELINES = {
    # Native (project-local) pipeline
    # 'redisPro.pipelines.RedisproPipeline': 300,
    # Shared pipeline provided by the distributed (scrapy-redis) component
    'scrapy_redis.pipelines.RedisPipeline': 400,
}
```
Also in settings.py, configure the scrapy-redis de-duplication and scheduler:

```python
# Use the scrapy-redis de-duplication queue
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Use the scheduler that comes with the scrapy-redis component
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'
# Whether the crawl may be paused (and resumed)
SCHEDULER_PERSIST = True
```
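The workers also need to know where the shared Redis instance is; assuming Redis runs on the default port (on a real cluster this would be the master's address), settings.py would additionally contain something like:

```python
# Location of the shared Redis server used by the scheduler and pipeline
REDIS_HOST = '127.0.0.1'
REDIS_PORT = 6379
```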
Run the spider file:

scrapy runspider qiubai.py
Put the start URL into the scheduler's queue (in redis-cli):
lpush <queue name (the redis_key)> <start url>
lpush qiubaispider https://www.qiushibaike.com/pic/
Example of an img_url value scraped by the spider:
https://pic.qiushibaike.com/system/pictures/12140/121401684/medium/59AUGYJ1J0ZAPSOL.jpg