The Scrapy Framework: Log Levels and Passing Data Between Requests

Date: 2019-12-06


1. Scrapy log levels

When you run a spider with scrapy crawl spiderFileName, what gets printed to the terminal is Scrapy's log output.

1.1 Log levels (message types)

ERROR: errors
WARNING: warnings
INFO: general information
DEBUG: debugging information (the default level)
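
Scrapy's logging is built on Python's standard logging module, so the level names above correspond to the standard numeric severities. The sketch below uses only the standard library to show that ordering and how a threshold filters messages; the logger name demo is an arbitrary choice for illustration.

```python
import logging

# The four levels listed above, from least to most severe.
levels = ["DEBUG", "INFO", "WARNING", "ERROR"]
numeric = {name: getattr(logging, name) for name in levels}
print(numeric)

# A logger whose threshold is WARNING drops DEBUG and INFO messages.
logger = logging.getLogger("demo")  # arbitrary name, not a Scrapy logger
logger.setLevel(logging.WARNING)
enabled = [name for name in levels if logger.isEnabledFor(numeric[name])]
print(enabled)
```

A message is emitted only when its level is at or above the configured threshold, which is why a DEBUG threshold shows everything.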

1.2 Controlling what gets logged

Add the following anywhere in the settings.py configuration file:

# Restrict terminal output to a given log level
LOG_LEVEL = 'ERROR'   # only log messages at ERROR level or above

To store the log in a file instead of showing it in the terminal:

# Restrict log output to a given log level
LOG_LEVEL = 'ERROR'   # only log messages at ERROR level or above
LOG_FILE = 'log.txt'  # write the log to this file
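
What the two settings do together can be imitated with the standard logging module alone. This is a rough sketch, not Scrapy's own implementation: the logger name log_file_demo and the temporary directory are assumptions made so the example runs anywhere.

```python
import logging
import os
import tempfile

# Hypothetical stand-ins for the two settings shown above.
LOG_LEVEL = "ERROR"
log_path = os.path.join(tempfile.mkdtemp(), "log.txt")  # plays the role of LOG_FILE

logger = logging.getLogger("log_file_demo")
logger.setLevel(LOG_LEVEL)   # setLevel also accepts a level name
logger.propagate = False     # keep messages away from the root logger and the terminal
handler = logging.FileHandler(log_path, encoding="utf-8")
logger.addHandler(handler)

logger.debug("request sent")     # below the threshold: dropped
logger.error("download failed")  # at the threshold: written to the file
handler.close()

with open(log_path, encoding="utf-8") as fp:
    contents = fp.read()
print(contents)
```

Only the ERROR message reaches the file, and nothing is printed by the logger itself.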

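2. Passing data between requests

Passing data between requests is needed when the values being scraped are spread across more than one page.
Requirement: scrape the movie details (name, genre, director, language, runtime) from the id97 movie site.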

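2.1 Problem: how to store movie details parsed by two methods in a single item object

The meta parameter makes it possible to pass the item object along. scrapy.Request() accepts a meta parameter, and through it the item object can be handed to the callback.
Note: meta must be a dict. So wrap the item in a dict and assign that dict to the meta parameter; the dict is then delivered to the callback.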

def parse(self, response):
    # Select every child div of /html/body/div[1]/div[1]/div[2]
    div_list = response.xpath('/html/body/div[1]/div[1]/div[2]/div')
    for div in div_list:
        # ... code omitted: extract name, kind and url from div ...

        # Create the item object
        item = MovieproItem()
        item['name'] = name
        item['kind'] = kind
        # Issue the request manually, passing the item along in meta
        yield scrapy.Request(url=url, callback=self.parseBySecondPage, meta={'item': item})

The parseBySecondPage callback can then retrieve the dict that was passed through the meta parameter of Request.
It is available as response.meta, as shown below:

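def parseBySecondPage(self, response):
    """Parse the data on the second-level (detail) page."""

    # ... code omitted: extract actor, language and longTime from response ...

    # Retrieve the dict passed through the meta parameter of Request via response.meta
    item = response.meta['item']
    item['actor'] = actor
    item['language'] = language
    item['longTime'] = longTime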
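
The hand-off between the two callbacks can be traced without Scrapy or a network connection. The sketch below is a simulation only: FakeRequest and FakeResponse are hypothetical stand-ins written for this example (they are not Scrapy classes), the sample values are made up, and the only behaviour borrowed from Scrapy is that response.meta is a shortcut to the meta dict of the request that produced the response.

```python
class FakeRequest:
    """Hypothetical stand-in for scrapy.Request: it only stores its arguments."""

    def __init__(self, url, callback, meta=None):
        self.url = url
        self.callback = callback
        self.meta = meta or {}


class FakeResponse:
    """Hypothetical stand-in for a response; meta comes from the request."""

    def __init__(self, request):
        self.request = request

    @property
    def meta(self):
        return self.request.meta


def parse_list_page():
    # First callback: fill in the fields available on the list page.
    item = {"name": "Movie A", "kind": "Drama"}  # made-up sample values
    yield FakeRequest("https://example.com/detail/1", parse_detail_page,
                      meta={"item": item})


def parse_detail_page(response):
    # Second callback: pick the same item up again and complete it.
    item = response.meta["item"]
    item["language"] = "English"  # made-up sample value
    yield item


# A tiny "engine": run each request's callback with a response built from it.
results = []
for request in parse_list_page():
    results.extend(request.callback(FakeResponse(request)))
print(results)
```

Because a fresh item is created for every request, each detail page completes its own item rather than overwriting a shared one.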

2.2 The spider file movie.py

import scrapy
from moviePro.items import MovieproItem

class MovieSpider(scrapy.Spider):
    name = 'movie'
    # allowed_domains = ['www.id97.com']
    start_urls = ['https://www.55xia.com/movie']   # the site's address has changed

    def parseBySecondPage(self, response):
        """Parse the data on the second-level (detail) page."""
        # Director, language, runtime
        actor = response.xpath('/html/body/div[1]/div/div/div[1]/div[1]/div[2]/table/tbody/tr[1]/td[2]/a/text()').extract_first()
        language = response.xpath('/html/body/div[1]/div/div/div[1]/div[1]/div[2]/table/tbody/tr[6]/td[1]/text()').extract_first()
        longTime = response.xpath('/html/body/div[1]/div/div/div[1]/div[1]/div[2]/table/tbody/tr[8]/td[2]/text()').extract_first()

        # Retrieve the dict passed through the meta parameter of Request via response.meta
        item = response.meta['item']
        item['actor'] = actor
        item['language'] = language
        item['longTime'] = longTime

        # Hand the item to the pipeline
        yield item

    def parse(self, response):
        # Name, genre, director, language, runtime
        # Select every child div of /html/body/div[1]/div[1]/div[2]
        div_list = response.xpath('/html/body/div[1]/div[1]/div[2]/div')
        for div in div_list:
            # Movie name
            name = div.xpath('.//div[@class="meta"]/h1/a/text()').extract_first()

            # Movie genre: //text() collects every text node under this div
            # The xpath call below returns a list of length 4
            kind = div.xpath('.//div[@class="otherinfo"]//text()').extract()
            # Convert the kind list to a string
            kind = "".join(kind)

            # URL of the movie's detail page
            url = div.xpath('.//div[@class="meta"]/h1/a/@href').extract_first()

            # Create the item object
            item = MovieproItem()
            item['name'] = name
            item['kind'] = kind
            # Problem: how to store the details parsed by two methods in one item object? Answer: meta

            # Next step: request the url, fetch the page and parse the remaining fields
            # Issue the request manually
            yield scrapy.Request(url=url, callback=self.parseBySecondPage, meta={'item': item})

2.3 Other files

(1) items.py declares all the fields

import scrapy

class MovieproItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # Declare all the fields
    name = scrapy.Field()
    kind = scrapy.Field()
    actor = scrapy.Field()
    language = scrapy.Field()
    longTime = scrapy.Field()

(2) The pipeline file pipelines.py

import json

class MovieproPipeline(object):
    fp = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the output file
        self.fp = open('movie.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # detail = item['name'] + ':' + item['kind'] + ':' + item['actor'] + ':' + item['language'] + ':' + item['longTime'] + '\n\n\n'
        detail = dict(item)
        json.dump(detail, self.fp, ensure_ascii=False)
        self.fp.write('\n')  # one JSON object per line, so the file can be parsed back
        return item

    def close_spider(self, spider):
        # Called once when the spider closes: release the file
        self.fp.close()

(3) The settings.py configuration file

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36' # present the requests as coming from a browser

# Obey robots.txt rules
ROBOTSTXT_OBEY = False   # ignore the site's robots.txt so that no pages are skipped because of it

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'moviePro.pipelines.MovieproPipeline': 300,
}

2.4 Special handling

(1) Special handling of kind (movie genre):

kind = div.xpath('.//div[@class="otherinfo"]//text()').extract()
# Convert the kind list to a string
kind = "".join(kind)

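The xpath call that parses it matches several text nodes (a list of length 4), so extract_first() is not enough; use extract() to get the whole list.
The list then has to be converted to a string, which is done here with "".join(list).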
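
Text nodes pulled out with //text() often carry layout whitespace, which a plain join keeps. A small sketch of the difference, assuming a made-up four-element list in place of the real extract() result (the values are illustrative, not taken from the site):

```python
# Hypothetical result of .extract() on the "otherinfo" div: four text nodes,
# some of which carry layout whitespace.
kind_parts = ["\n  Drama", " / ", "Crime", "\n"]

# Plain concatenation keeps the whitespace.
kind_raw = "".join(kind_parts)

# Stripping each piece first gives a cleaner value.
kind_clean = "".join(part.strip() for part in kind_parts)

print(repr(kind_raw))
print(repr(kind_clean))
```

Whether to strip depends on how the value will be used; for storage in a file the cleaned form is usually easier to work with.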

(2) Persisting the data in the pipeline file

# Method 1: build a string and write it to the file
def process_item(self, item, spider):
    detail = item['name'] + ':' + item['kind'] + ':' + item['actor'] + ':' + item['language'] + ':' + item['longTime'] + '\n\n\n'
    self.fp.write(detail)
    return item

# Method 2: json.dump() serializes the dict to JSON and writes it to the file
def process_item(self, item, spider):
    detail = dict(item)
    json.dump(detail, self.fp, ensure_ascii=False)
    self.fp.write('\n')  # one JSON object per line
    return item
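
With method 2, json.dump() is called once per item, so the objects only stay separable if something is written between them. The sketch below assumes one object per line is acceptable (the JSON Lines convention) and uses an in-memory buffer in place of movie.txt; the sample items are made up.

```python
import io
import json

items = [  # made-up sample items
    {"name": "Movie A", "language": "English"},
    {"name": "电影B", "language": "汉语"},
]

fp = io.StringIO()  # stands in for the file opened in open_spider
for item in items:
    json.dump(item, fp, ensure_ascii=False)  # non-ASCII text is written as-is
    fp.write("\n")                           # one JSON object per line

text = fp.getvalue()
# Reading back: parse each line as its own JSON document.
restored = [json.loads(line) for line in text.splitlines()]
print(restored == items)
```

ensure_ascii=False keeps Chinese text readable in the file, which is why the output file is opened with encoding='utf-8'.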