Scrapy Framework [Basics]

Date: 2019-11-15

Tags: scrapy, framework basics | Category: Python

1. Install Scrapy

If installation fails, see the blog post >>> "Solutions for Scrapy installation failures"

pip install wheel
pip install twisted
pip install pywin32
pip install scrapy

2. Create the crawler project

scrapy startproject firstPro

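# firstPro is the project name (the remaining examples in this post use a project named first)

Project directory structure

In the cmd command line, run D:\爬虫项目\first>tree /f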

└─first
    │  items.py
    │  middlewares.py
    │  pipelines.py
    │  settings.py
    │  __init__.py
    │
    ├─spiders
    │  │  jingdong.py
    │  │  __init__.py
    │  │
    │  └─__pycache__
    │          jingdong.cpython-36.pyc
    │          __init__.cpython-36.pyc
    │
    └─__pycache__
            items.cpython-36.pyc
            pipelines.cpython-36.pyc
            settings.cpython-36.pyc
            __init__.cpython-36.pyc

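scrapy.cfg              # configuration file used when deploying Scrapy
first                   # the project's module; code is imported from here
    __init__.py
    items.py            # defines the structure of the scraped data
    middlewares.py      # defines the middlewares used while crawling
    pipelines.py        # defines the item pipelines
    settings.py         # project settings
    spiders             # folder that holds the spiders
        __init__.py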

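3. Create the spider

# Create a .py file in the spiders folder that fetches pages and parses the results
- Create it from the cmd command line
- D:\爬虫项目\taobao\taobao\spiders>scrapy genspider jingdong www.xxx.com
- jingdong is the spider name; www.xxx.com is the domain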


# The generated jingdong.py
import scrapy

class JingdongSpider(scrapy.Spider):
    name = 'jingdong'  # unique name within the project, used to tell spiders apart
    allowed_domains = ['www.xxx.com']  # domains the spider is allowed to crawl
    start_urls = ['http://www.xxx.com/']  # URLs fetched when the spider starts

    # Parses/extracts data from the responses to the start_urls requests
    def parse(self, response):
        pass

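Defining item fields

Before parsing any data, define the fields you want to extract in items.py: the parsed data has to be submitted to the pipeline, and the pipelines in this post work with Item objects (since Scrapy 1.0, plain dicts are accepted as well).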

import scrapy

class FirstItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()

Crawling and parsing the data

import scrapy
from first.items import FirstItem  # import the FirstItem class from items.py

class JingdongSpider(scrapy.Spider):
    name = 'jingdong'
    allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.cnblogs.com/songzhixue/p/10717975.html']

    # Pipeline-based storage; parse is the default callback for start_urls, so the method name must not be changed
    def parse(self, response):
        tetle = response.xpath('//*[@id="cnblogs_post_body"]/ul/li/a/text()').extract()

        for name in tetle:
            # Instantiate an item object
            item = FirstItem()
            # Put the parsed data into the item object
            item["name"] = name
            yield item  # submit the item to the pipeline

Here we used extract()

tetle = response.xpath('//*[@id="cnblogs_post_body"]/ul/li/a/text()')
# xpath() returns a SelectorList of Selector objects, for example:
<Selector xpath='//*[@id="cnblogs_post_body"]/ul/li/a/text()' data='基础【一】基础数据类型'>

# To get the data as strings, append .extract()
tetle = response.xpath('//*[@id="cnblogs_post_body"]/ul/li/a/text()').extract()  # returns a list

# extract_first() returns the first item of the list, the same as extract()[0] when there is a match
# If the XPath is wrong or matches nothing, extract_first() returns None (whereas extract()[0] would raise IndexError)

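4. Data persistence

The pipeline receives item objects, so the fields must be defined in items.py and the parsed data written into an item object.

# Pipeline-based persistence
# Pipeline that handles data from the jingdong spider
class FirstPipeline1(object):
    fp = None
    # Open the file; Scrapy calls this hook only once, when the spider starts
    def open_spider(self,spider):
        print("Spider started------")
        self.fp = open("./spider.txt","w",encoding="utf-8")  # any file extension can be used

    def process_item(self, item, spider):
        name = item["name"]
        self.fp.write(name+"\n")

        return item

    # Scrapy calls this hook only once, when the spider finishes
    def close_spider(self,spider):
        print("Spider finished------")
        self.fp.close()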
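
The same three hooks can write other formats. Below is a minimal sketch of a variant that stores one JSON object per line; the class name JsonLinesPipeline, the default file name spider.jl, and the sample values are assumptions, not part of the original project. Because a pipeline is a plain Python class, the hooks are called by hand here with a stand-in spider object, in the order Scrapy would call them.

```python
import json
import os
import tempfile
from types import SimpleNamespace


class JsonLinesPipeline:
    """Hypothetical pipeline: writes each item as one JSON object per line."""

    def __init__(self, path="./spider.jl"):
        self.path = path
        self.fp = None

    def open_spider(self, spider):
        # Called once when the spider starts
        self.fp = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # ensure_ascii=False keeps Chinese text readable in the file
        self.fp.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item  # hand the item to the next pipeline, if any

    def close_spider(self, spider):
        # Called once when the spider finishes
        self.fp.close()


# Drive the hooks by hand, writing to a temporary directory
out_path = os.path.join(tempfile.mkdtemp(), "spider.jl")
fake_spider = SimpleNamespace(name="jingdong")  # stand-in for a real spider
pipeline = JsonLinesPipeline(out_path)
pipeline.open_spider(fake_spider)
for name in ["基础数据类型", "functions"]:
    pipeline.process_item({"name": name}, fake_spider)
pipeline.close_spider(fake_spider)

with open(out_path, encoding="utf-8") as f:
    lines = f.read().splitlines()
```

Opening the file in open_spider and closing it in close_spider means the file is opened once per crawl rather than once per item.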

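Persistence workflow

items.py: data structure template file; defines the data attributes
pipelines.py: pipeline file; receives the data (items) and persists it

- Persistent storage
  - Via terminal command:
    - Prerequisite: only the return value of the parse method can be persisted to a local file
    - Command: scrapy crawl spiderName -o filePath
  - Via pipeline:
    1. Parse the data
    2. Define the relevant attributes in the item class (to hold the parsed data)
    3. Store the parsed data in an item object
    4. Submit the item object to the pipeline
    5. In the pipeline, receive the item and persist its data in whatever form is needed
    6. Enable the pipeline in the settings file

Terminal-based persistence does not require enabling a pipeline; simply return the parsed data from the spider file.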

 
class JingdongSpider(scrapy.Spider):
    name = 'jingdong'
    allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.cnblogs.com/songzhixue/p/10717975.html']

    # Terminal-based persistence; the data can only be stored to disk
    def parse(self, response):
        tetle = response.xpath('//*[@id="cnblogs_post_body"]/ul/li/a/text()').extract()
        all_list = []
        for name in tetle:
            dic = {
                "name": name
            }
            all_list.append(dic)
        return all_list

# Run this command in the terminal to store the data locally
D:\爬虫项目\first>scrapy crawl jingdong -o jingdong.csv

# Output formats supported by terminal-based persistence
'json', 'jsonlines', 'jl', 'csv', 'xml', 'marshal', 'pickle'

5. Enable the pipeline

Commonly used settings

# Request header configuration
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36'

# Enable the pipeline in the settings file
ITEM_PIPELINES = {
    # 300 is the priority (the lower the number, the higher the priority)
    'first.pipelines.FirstPipeline1': 300,
}
# Log level
LOG_LEVEL = 'WARNING'
# Whether to obey robots.txt
ROBOTSTXT_OBEY = False
# Whether to handle cookies
COOKIES_ENABLED = False
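
A few other settings.py entries tend to come up alongside the ones above. The fragment below is only a sketch: the setting names are standard Scrapy settings, but every value is an illustrative assumption and should be tuned for the site being crawled.

```python
# Additional settings.py entries (values are illustrative assumptions)

# Seconds to wait between consecutive requests to the same site
DOWNLOAD_DELAY = 1

# Upper bound on the number of requests Scrapy performs concurrently
CONCURRENT_REQUESTS = 8

# Write exported feeds (scrapy crawl ... -o file.json) as UTF-8 so that
# Chinese text is not emitted as \uXXXX escape sequences
FEED_EXPORT_ENCODING = "utf-8"

# Headers sent with every request unless a request overrides them
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
}
```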

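Details of Scrapy pipelines
    - Crawling and persisting data: storing the same set of data on different platforms
        - Each pipeline class is responsible for storing the data in one medium or platform
        - An item submitted by the spider file is handed only to the first pipeline class to be executed
        - In a pipeline class's process_item, return item passes the item received by the current pipeline class on to the next pipeline class to be executed

    - Note: how spider classes and pipeline classes exchange data
        - yield item: can only pass item-type objects
        - spider parameter: can pass data of any form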
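
To make the second channel concrete, here is a minimal sketch of a pipeline that reads extra data from the spider argument. The class name TaggingPipeline, the custom attribute source_site, and the item key source are all hypothetical, and the spiders are simple stand-in objects rather than real scrapy.Spider instances.

```python
from types import SimpleNamespace


class TaggingPipeline:
    """Hypothetical pipeline that reads extra data from the spider object."""

    def process_item(self, item, spider):
        # `source_site` is an assumed custom attribute; fall back to the
        # spider's name when a spider does not define it
        item["source"] = getattr(spider, "source_site", spider.name)
        return item


# Stand-ins for two spiders, one with the custom attribute and one without
jd_spider = SimpleNamespace(name="jingdong", source_site="www.xxx.com")
tb_spider = SimpleNamespace(name="taobao")

pipeline = TaggingPipeline()
jd_item = pipeline.process_item({"name": "a"}, jd_spider)
tb_item = pipeline.process_item({"name": "b"}, tb_spider)
```

With a real scrapy.Item instead of a dict, the source field would also have to be declared in items.py before it could be assigned.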

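- A crawler project can have several spiders, and each spider may persist its data differently; in that case several pipelines are needed to handle storage for the different spiders
# Pipeline that handles data from the jingdong spider
class FirstPipeline1(object):
    # Persists the data stored in the item object
    def process_item(self, item, spider):
        # spider is the spider object; its attributes (such as name) are read with dot notation to tell which spider is running
        if spider.name == "jingdong":

            # item is the data passed over after the spider has processed it
            print(item["name"])   # values must be read with square brackets

        # Return outside the if so that items from other spiders still reach the next pipeline
        return item

# Pipeline that handles data from the taobao spider
class FirstPipeline2(object):
    def process_item(self, item, spider):
        # spider is the spider object
        if spider.name == "taobao":
            print(item["name"])

        return item

# settings configuration
ITEM_PIPELINES = {
    # 300 is the priority (the lower the number, the higher the priority)
    'first.pipelines.FirstPipeline1': 300,
    'first.pipelines.FirstPipeline2': 301,
}
The pipeline with the higher priority (the lower number) runs first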
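
The ordering and hand-off rules can be modelled in a few lines of plain Python. This is a simplified sketch and not Scrapy's actual implementation: run_pipelines, ITEM_PIPELINES_MODEL, the two pipeline class names, and the handled_by key are hypothetical, and the dict maps classes directly instead of import paths.

```python
from types import SimpleNamespace


class JingdongPipeline:
    def process_item(self, item, spider):
        if spider.name == "jingdong":
            item["handled_by"] = "JingdongPipeline"
        return item  # always pass the item on, even when it was not ours


class TaobaoPipeline:
    def process_item(self, item, spider):
        if spider.name == "taobao":
            item["handled_by"] = "TaobaoPipeline"
        return item


def run_pipelines(item_pipelines, item, spider):
    """Simplified model: lower number first, each result feeds the next."""
    call_order = []
    ordered = sorted(item_pipelines.items(), key=lambda pair: pair[1])
    for pipeline_cls, _priority in ordered:
        call_order.append(pipeline_cls.__name__)
        item = pipeline_cls().process_item(item, spider)
    return item, call_order


# Mirrors the ITEM_PIPELINES dict, with classes instead of import paths
ITEM_PIPELINES_MODEL = {TaobaoPipeline: 301, JingdongPipeline: 300}

result, order = run_pipelines(
    ITEM_PIPELINES_MODEL, {"name": "demo"}, SimpleNamespace(name="taobao")
)
```

Here the item from the taobao spider passes through the jingdong pipeline untouched and is then handled by the taobao pipeline, which only works because every process_item returns the item unconditionally.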

Multiple spiders mapped to multiple pipeline files


6. Run the project

# Start from inside the project folder
# Run in cmd
scrapy crawl <spider name>
scrapy crawl jingdong --nolog  # do not print log output