Web Scraping with Scrapy

Date: 2019-12-13


I. Basic Usage of the Scrapy Framework

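　　Installation steps:

Linux:
    pip3 install scrapy

Windows:
    1. pip3 install wheel
    2. Download the Twisted wheel from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
    3. In the download directory, run: pip3 install Twisted-17.1.0-cp35-cp35m-win_amd64.whl
       (pick the wheel that matches your Python version and architecture; this one is for CPython 3.5, 64-bit)
    4. pip3 install pywin32
    5. pip3 install scrapy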

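　　Steps to create a project:

1. Create a folder anywhere.
2. Inside that folder, Shift + right-click and open a terminal (command line).
3. Create the project: scrapy startproject project_name
4. Enter the project: cd project_name
5. Create a spider file: scrapy genspider spider_name url (the URL can be a placeholder; you can change it later)
6. Open the folder in PyCharm; the new spider file appears in the project's spiders directory.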

　　Open the spider file

　　　　Terminal-based storage

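# -*- coding: utf-8 -*-
import time

import scrapy  # if the IDE marks this import in red, it can be ignored


class FirstSpider(scrapy.Spider):
    # name is the spider's name (the one passed to "scrapy crawl")
    name = 'first'

    # Allowed domains; usually commented out so requests are not restricted
    # allowed_domains = ['www.xxx.com']

    # Start URLs: a request is sent automatically for every URL in this list
    start_urls = ['http://pic.netbian.com/4kmeinv/index_2.html']

    # Parse the data; this method is called once per start URL
    def parse(self, response):
        # response.xpath is similar to, but not identical to, etree's xpath
        li_list = response.xpath('//div[@class="slist"]//li')
        names = []
        for li in li_list:
            time.sleep(0.2)  # may be unnecessary elsewhere; this site rejects a high request rate
            # xpath() always returns a list of Selector objects;
            # extract() returns the data held inside a Selector
            # author = div.xpath('path')[0].extract()
            src = li.xpath('.//img/@src').extract_first()
            if not src:  # skip <li> elements without an image
                continue
            all_src = 'http://pic.netbian.com' + src
            # For terminal-based persistence, parse must return dicts (or items) or None
            dic = {'name': all_src}
            names.append(dic)
            # print(src)
        return names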

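　　Running the spider

You do not instantiate the class or right-click to run the file; run it from the terminal. Before running, set ROBOTSTXT_OBEY = False in settings.py and also change USER_AGENT.

Run command (with log output):
    scrapy crawl spider_name

Without log output:
    scrapy crawl spider_name --nolog
Generally avoid --nolog, because errors will not be shown; use it only when you are sure the spider has no errors.

Persistent storage via a terminal command (this can only write to a file on disk, not to a database):
    scrapy crawl spider_name -o path/filename.csv
The file extension must be one of the supported feed formats (for example .json, .jsonlines, .csv, .xml); any other extension raises an error. For example:
    scrapy crawl spiderName -o beauty.csv
With no path given, the file is saved in the current directory.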
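
As a rough sketch, the two settings mentioned above could look like this in settings.py. The User-Agent string is only a placeholder (use whatever your own browser sends), and DOWNLOAD_DELAY is an optional addition that the original text does not mention; it is Scrapy's built-in way to space out requests.

```python
# settings.py (fragment)

# Do not consult robots.txt before crawling
ROBOTSTXT_OBEY = False

# Placeholder browser-like User-Agent; replace it with your own browser's string
USER_AGENT = (
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
    'AppleWebKit/537.36 (KHTML, like Gecko) '
    'Chrome/78.0.3904.108 Safari/537.36'
)

# Optional (not in the original text): seconds to wait between requests to the same site
DOWNLOAD_DELAY = 0.2
```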

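　　Pipeline-based storage

First enable ITEM_PIPELINES in the settings file (settings.py). The number after each class is its priority: the smaller the number, the higher the priority, so that pipeline runs earlier.
Several pipeline classes can be listed here, for example to store the data in different places (local file, database, and so on).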
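
A minimal sketch of such a setting follows. The package name firstblood is an assumption (the original does not give the project name) and MysqlPipeline is a hypothetical second class. The loop at the end is for illustration only and does not belong in settings.py; it simply prints the pipelines in the order they would run.

```python
# settings.py (fragment)
# 'firstblood' is an assumed project package name; MysqlPipeline is hypothetical
ITEM_PIPELINES = {
    'firstblood.pipelines.MysqlPipeline': 301,
    'firstblood.pipelines.FirstbloodPipeline': 300,
}

# Illustration only: pipelines run in ascending order of their number
run_order = sorted(ITEM_PIPELINES.items(), key=lambda pair: pair[1])
for path, priority in run_order:
    print(priority, path)
```

Here FirstbloodPipeline (300) receives each item first and must return it so that MysqlPipeline (301) receives it next.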

　　Spider file (xiaohua.py)

# -*- coding: utf-8 -*-
import scrapy
from ..items import FirstbloodItem


class XiaohuaSpider(scrapy.Spider):
    name = 'xiaohua'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.521609.com/daxuemeinv/']

    def parse(self, response):
        li_list = response.xpath('//*[@id="content"]/div[2]/div[2]//li')
        for li in li_list:
            src = li.xpath('.//img/@src').extract_first()
            if not src:
                continue
            all_src = 'http://www.521609.com' + src
            # Instantiate an item object
            item = FirstbloodItem()
            # Item fields behave like a dict, so values are set with []
            item['img_url'] = all_src
            # Submit the item to the pipeline
            yield item

　　pipelines.py

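# This class stores the parsed data on some platform (local file or database)
class FirstbloodPipeline(object):
    f = None

    # Items arrive one at a time across many calls, so the file cannot be
    # opened with a "with" block inside process_item
    def open_spider(self, spider):  # the spider parameter is part of the hook signature, so keep it
        print('Spider started')
        self.f = open('./xiaohua.txt', 'w', encoding='utf-8')

    # Purpose: perform the persistent storage
    # The item parameter receives each item object submitted by the spider
    def process_item(self, item, spider):
        img_url = item.get('img_url')
        self.f.write('Image link -> ' + img_url + '\n')
        # Returning the item passes it on to the next pipeline class to run
        return item

    def close_spider(self, spider):
        print('Spider finished')
        self.f.close()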
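
To see the order in which the three hooks are used, the pipeline can be driven by hand without Scrapy. This is only a sketch: FilePipeline is a hypothetical variant of FirstbloodPipeline that takes its output path as a parameter, plain dicts stand in for items, and the image URLs are invented.

```python
import os
import tempfile


class FilePipeline:
    """Hypothetical variant of FirstbloodPipeline with a configurable path."""

    def __init__(self, path):
        self.path = path
        self.f = None

    def open_spider(self, spider):
        # Called once, when the spider starts
        self.f = open(self.path, 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Called once per item the spider yields
        self.f.write('Image link -> ' + item.get('img_url') + '\n')
        return item

    def close_spider(self, spider):
        # Called once, when the spider finishes
        self.f.close()


# Imitate the order in which Scrapy calls the hooks
path = os.path.join(tempfile.mkdtemp(), 'xiaohua.txt')
pipeline = FilePipeline(path)
pipeline.open_spider(spider=None)
for url in ['http://www.521609.com/a.jpg', 'http://www.521609.com/b.jpg']:
    returned = pipeline.process_item({'img_url': url}, spider=None)
pipeline.close_spider(spider=None)

with open(path, encoding='utf-8') as fh:
    lines = fh.read().splitlines()
print(lines)
```

The file is opened once and closed once, while process_item runs for every item; that is why the file handle lives on the instance rather than inside a with block.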

　　items.py

import scrapy


class FirstbloodItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()

    # The attribute name is up to you, but the value must be scrapy.Field();
    # declare one field per piece of data you want to store
    img_url = scrapy.Field()

　　Then run scrapy crawl xiaohua in the terminal, and that's it!

II. Crawling Every Page of a Site

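　　Suppose the data you want is spread over many pages that share the same layout and differ only in page number. In that case you can crawl the data from all of those pages.

　　-- This example scrapes the same site as above, but across multiple pages.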

xiaohua.py (pipelines.py and items.py need no changes)

# -*- coding: utf-8 -*-
import scrapy
from ..items import FirstbloodItem


class XiaohuaSpider(scrapy.Spider):
    name = 'xiaohua'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.521609.com/daxuemeinv/']
    # A generic URL template
    url = 'http://www.521609.com/meinvxiaohua/list%s.html'
    pageNum = 121

    def parse(self, response):
        li_list = response.xpath('//*[@id="content"]/div[2]/div[2]//li')
        for li in li_list:
            src = li.xpath('.//img/@src').extract_first()
            if not src:
                continue
            all_src = 'http://www.521609.com' + src
            # Instantiate an item object
            item = FirstbloodItem()
            # Item fields behave like a dict, so values are set with []
            item['img_url'] = all_src
            # Submit the item to the pipeline
            yield item
        # Manually send requests for the other pages; this crawls pages 121-125
        if self.pageNum <= 125:
            new_url = self.url % self.pageNum
            # Increment after building the URL so that page 121 is not skipped
            self.pageNum += 1
            # Effectively a recursive call to parse
            yield scrapy.Request(url=new_url, callback=self.parse)
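
To make the URL template concrete, here is how it expands for pages 121 to 125, the range named in the comment. This is plain Python and does not need Scrapy; the variable names are chosen for illustration.

```python
url_template = 'http://www.521609.com/meinvxiaohua/list%s.html'

# Pages 121 through 125 inclusive
page_urls = [url_template % page for page in range(121, 126)]
for page_url in page_urls:
    print(page_url)
```

When the page range is known in advance, a list like this could also be assigned to start_urls instead of chaining requests from parse; that is an alternative design, not what the spider above does.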

III. Passing Data Between Requests (for when you need data from a detail page)

　　Spider file (pipelines.py and items.py are basically the same as before)

# -*- coding: utf-8 -*-
import scrapy
from ..items import MovieproItem


class MovieSpider(scrapy.Spider):
    name = 'movie'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.4567tv.tv/index.php/vod/show/id/9.html']

    # Receives the data that was passed along with the request
    def detail_parse(self, response):
        item = response.meta['item']
        desc = response.xpath('/html/body/div[1]/div/div/div/div[2]/p[5]/span[2]/text()').extract_first()
        item['desc'] = desc  # a field declared in items.py
        yield item

    def parse(self, response):
        li_list = response.xpath('//div[@class="stui-pannel_bd"]/ul/li')
        for li in li_list:
            name = li.xpath('.//h4[@class="title text-overflow"]/a/text()').extract_first()
            detail_url = 'https://www.4567tv.tv' + li.xpath(
                './/h4[@class="title text-overflow"]/a/@href').extract_first()
            item = MovieproItem()
            item['name'] = name  # a field declared in items.py
            # meta is a dict; all of its key-value pairs are passed to the specified callback
            yield scrapy.Request(url=detail_url, callback=self.detail_parse, meta={'item': item})
            # With more levels, keep passing the item down one level at a time,
            # so that the same item object is used throughout
