Persistent Storage in Scrapy

Date: 2019-12-12

Tags: scrapy, persistent storage. Category: Python

Persistent storage via terminal commands

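- Make sure the spider's parse method returns an iterable object (usually a list or a dict). That return value can then be written to a file in a chosen format directly from the terminal command;

- Run the crawl with an output file in the desired format. The scraped data is written to that file, and the format follows the file extension: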

    scrapy crawl <spider_name> -o xxx.json
    scrapy crawl <spider_name> -o xxx.xml
    scrapy crawl <spider_name> -o xxx.csv
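
To make the requirement above concrete, here is a minimal standard-library sketch of what "an iterable of dict-like records written in a chosen format" amounts to. It is not Scrapy's feed-export code: the `fake_parse` generator, the sample records, and the `export` helper are all hypothetical stand-ins used only for illustration.

```python
import csv
import io
import json


def fake_parse():
    """Hypothetical stand-in for a spider's parse(): yields one dict per record."""
    yield {"author": "alice", "content": "first joke"}
    yield {"author": "bob", "content": "second joke"}


def export(records, fmt):
    """Serialize an iterable of dicts to a string in the given format."""
    rows = list(records)
    if fmt == "json":
        return json.dumps(rows, ensure_ascii=False)
    if fmt == "csv":
        buf = io.StringIO()
        # Column names are taken from the first record, if there is one
        fieldnames = list(rows[0]) if rows else []
        writer = csv.DictWriter(buf, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
        return buf.getvalue()
    raise ValueError(f"unsupported format: {fmt}")


json_text = export(fake_parse(), "json")
csv_text = export(fake_parse(), "csv")
```

The same records come out as one JSON array or as CSV rows under an `author,content` header, which is roughly the shape of file the `-o` option produces for each extension.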

Persistent storage via item pipelines

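- The Scrapy framework already ships with an efficient, convenient persistence mechanism that we can use directly:

　　- items.py: the data structure template file, which defines the data fields;

　　- pipelines.py: the pipeline file, which receives item objects and persists them;

- Persistence workflow:

　　- After the spider file obtains the data, wrap the data in an item object;

　　- Submit the item object to the pipeline with the yield keyword;

　　- The process_item method in the pipeline file receives each item object submitted by the spider; write the storage code there to persist the data held by the item;

　　- Enable the pipeline in settings.py: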

ITEM_PIPELINES = {
    'qiubaiPro.pipelines.QiubaiproPipelineByRedis': 300,
}

- Persistent storage example:

　　- Scrape the jokes and their authors from the Qiushibaike home page, then persist them

　　- Spider file:

# -*- coding: utf-8 -*-
import scrapy
from secondblood.items import SecondbloodItem


class QiubaidemoSpider(scrapy.Spider):
    name = 'qiubaiDemo'
    allowed_domains = ['www.qiushibaike.com']
    start_urls = ['http://www.qiushibaike.com/']

    def parse(self, response):
        odiv = response.xpath('//div[@id="content-left"]/div')
        for div in odiv:
            # xpath() returns a list of Selector objects that wrap the parsed
            # content; extract_first() pulls the first match out as a string.
            # default='' avoids calling strip() on None when nothing matches.
            author = div.xpath('.//div[@class="author clearfix"]//h2/text()').extract_first(default='')
            author = author.strip()  # trim surrounding newlines and whitespace
            content = div.xpath('.//div[@class="content"]/span/text()').extract_first(default='')
            content = content.strip()  # trim surrounding newlines and whitespace

            # Wrap the parsed data in an item object
            item = SecondbloodItem()
            item['author'] = author
            item['content'] = content

            yield item  # submit the item to the pipeline (pipelines.py)

　　- items.py:

import scrapy


class SecondbloodItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    author = scrapy.Field()   # stores the author
    content = scrapy.Field()  # stores the joke text

　　- pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


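class SecondbloodPipeline(object):
    # Constructor
    def __init__(self):
        self.fp = None  # file object, opened in open_spider

    # Scrapy calls the hook methods below on the pipeline when they are defined.
    # Called once when the spider starts.
    def open_spider(self, spider):
        print('Spider started')
        self.fp = open('./data.txt', 'w', encoding='utf-8')

    # Called once per item, so the file is opened and closed in the two
    # other hooks, each of which runs only once.
    def process_item(self, item, spider):
        # Persist the item submitted by the spider
        self.fp.write(item['author'] + ':' + item['content'] + '\n')
        return item

    # Called once when the spider finishes.
    def close_spider(self, spider):
        self.fp.close()
        print('Spider finished')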

　　- settings.py

# Enable the pipeline
ITEM_PIPELINES = {
    'secondblood.pipelines.SecondbloodPipeline': 300,  # 300 is the priority; lower values mean higher priority
}
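
The open/process/close pattern used by the file pipeline can be exercised without running a crawl. The following is a minimal sketch under stated assumptions: `JsonLinesPipeline`, its `path` argument, and the sample items are hypothetical, and the hooks are driven by hand in the order Scrapy would call them. In a real project Scrapy constructs the pipeline itself, so the path would normally be a constant or come from a setting rather than a constructor argument.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Hypothetical pipeline: writes each item as one JSON object per line."""

    def __init__(self, path):
        self.path = path
        self.fp = None

    def open_spider(self, spider):
        # Runs once: open the output file
        self.fp = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Runs for every item: append one line, then pass the item on
        self.fp.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        # Runs once: release the file
        self.fp.close()


# Drive the hooks by hand with sample items (no crawl involved)
out_path = os.path.join(tempfile.mkdtemp(), "data.jl")
pipeline = JsonLinesPipeline(out_path)
pipeline.open_spider(spider=None)
for sample in [
    {"author": "alice", "content": "first joke"},
    {"author": "bob", "content": "second joke"},
]:
    pipeline.process_item(sample, spider=None)
pipeline.close_spider(spider=None)

with open(out_path, encoding="utf-8") as fp:
    lines = fp.read().splitlines()
```

Writing one JSON object per line keeps each record independently parseable, which a plain `author:content` text line does not guarantee when the content itself contains a colon.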

MySQL-based pipeline storage

- pipelines.py file

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

# Import the MySQL driver
import pymysql


class mysqlPipeLine(object):

    conn = None    # MySQL connection object
    cursor = None  # MySQL cursor object

    def open_spider(self, spider):
        print('Spider started')
        # Connect to the database and create one cursor for the whole crawl
        self.conn = pymysql.connect(host='127.0.0.1', port=3306, user='root', password='123456', database='qiubai')
        self.cursor = self.conn.cursor()

    # Store each item in the database
    def process_item(self, item, spider):
        # Let the driver bind the values instead of formatting them into the SQL string
        sql = 'insert into qiubai values (%s, %s)'
        # Commit on success, roll back on failure
        try:
            self.cursor.execute(sql, (item['author'], item['content']))
            self.conn.commit()
        except Exception as e:
            print(e)
            self.conn.rollback()

        return item

    def close_spider(self, spider):
        print('Spider finished')
        self.cursor.close()
        self.conn.close()
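
Scraped text often contains quote characters, so it helps to see the database pipeline pattern with values passed as bound parameters. The sketch below swaps in the standard-library `sqlite3` module for MySQL so that it runs without a server; `SqlitePipeline`, the in-memory database, and the sample item are hypothetical. Note that `sqlite3` uses `?` placeholders, whereas PyMySQL uses `%s`.

```python
import sqlite3


class SqlitePipeline:
    """Hypothetical stand-in for a MySQL pipeline, using stdlib sqlite3."""

    def open_spider(self, spider):
        # One connection for the whole crawl; the table mirrors the two item fields
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute("CREATE TABLE qiubai (author TEXT, content TEXT)")

    def process_item(self, item, spider):
        try:
            # Values are bound by the driver, so quotes in the text are harmless
            self.conn.execute(
                "INSERT INTO qiubai VALUES (?, ?)",
                (item["author"], item["content"]),
            )
            self.conn.commit()
        except sqlite3.Error as exc:
            print(exc)
            self.conn.rollback()
        return item

    def close_spider(self, spider):
        self.conn.close()


db_pipeline = SqlitePipeline()
db_pipeline.open_spider(spider=None)
db_pipeline.process_item(
    {"author": 'he said "hi"', "content": "it's fine"}, spider=None
)
rows = db_pipeline.conn.execute("SELECT author, content FROM qiubai").fetchall()
db_pipeline.close_spider(spider=None)
```

Both the double quotes and the apostrophe in the sample item are stored unchanged, which would not be reliable if the values were spliced into the SQL string by hand.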

　　- settings.py

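# Enable the pipelines; custom pipelines can store the data in different databases
# 300 is the priority; the lower the number, the higher the priority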

ITEM_PIPELINES = {
   'boss.pipelines.BossPipeline': 300,
   'boss.pipelines.mysqlPipeLine': 301,
}
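
When several pipelines are registered, each item passes through them in ascending order of their numbers, and each stage receives whatever the previous stage returned. The following is a deliberately simplified model of that ordering, not Scrapy's implementation: `run_pipelines`, `UppercaseAuthorPipeline`, and `RecordingPipeline` are hypothetical names, and details such as dropped items are ignored.

```python
class UppercaseAuthorPipeline:
    """Hypothetical first stage: normalizes the author field."""

    def process_item(self, item, spider):
        item = dict(item)
        item["author"] = item["author"].upper()
        return item


class RecordingPipeline:
    """Hypothetical second stage: records whatever reaches it."""

    def __init__(self):
        self.seen = []

    def process_item(self, item, spider):
        self.seen.append(dict(item))
        return item


def run_pipelines(item, pipelines_by_priority, spider=None):
    """Pass one item through the pipelines, lowest priority number first."""
    ordered = sorted(pipelines_by_priority.items(), key=lambda pair: pair[1])
    for pipeline, _priority in ordered:
        item = pipeline.process_item(item, spider)
    return item


recorder = RecordingPipeline()
# Registration order does not matter; only the numbers do
registry = {recorder: 301, UppercaseAuthorPipeline(): 300}
result = run_pipelines({"author": "alice", "content": "first joke"}, registry)
```

Because the stage numbered 300 runs before the one numbered 301, the recorder sees the author already uppercased. This is also why every process_item should return the item: the next pipeline only gets what the previous one hands over.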

Redis-based pipeline storage

Edit the redis.windows.conf configuration file: set protected-mode to no and set bind to 0.0.0.0
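
Once Redis accepts connections, a pipeline can push each item into it. The following is a minimal sketch, assuming the third-party redis-py package, a server on 127.0.0.1 at the default port 6379, and a list key named `qiubai:items`; the class name, the key, and the `FakeClient` stand-in are all hypothetical. The client is injectable so that the sketch runs without a Redis server.

```python
import json


class RedisPipeline:
    """Hypothetical pipeline: pushes each item onto a Redis list as JSON."""

    def __init__(self, client=None):
        self.client = client

    def open_spider(self, spider):
        if self.client is None:
            # Imported lazily so the sketch also runs without redis-py installed
            import redis
            self.client = redis.Redis(host="127.0.0.1", port=6379)

    def process_item(self, item, spider):
        payload = json.dumps(dict(item), ensure_ascii=False)
        self.client.lpush("qiubai:items", payload)
        return item


class FakeClient:
    """In-memory stand-in that mimics lpush, for trying the pipeline offline."""

    def __init__(self):
        self.lists = {}

    def lpush(self, name, *values):
        for value in values:
            self.lists.setdefault(name, []).insert(0, value)
        return len(self.lists[name])


fake = FakeClient()
redis_pipeline = RedisPipeline(client=fake)
redis_pipeline.open_spider(spider=None)
redis_pipeline.process_item({"author": "alice", "content": "first joke"}, spider=None)
stored = [json.loads(value) for value in fake.lists["qiubai:items"]]
```

Serializing the item to JSON before pushing keeps both fields together in one list entry; storing them under separate keys would be an equally valid design.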