Learning Python from Scratch: First Experience with Scrapy

Date: 2019-11-10

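Recently I needed to scrape some things from the web, so I decided to use a crawler as an excuse to learn the powerful Python along the way. I'm recording the problems I ran into while learning so I can look them up later.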

1. Preparing the development environment (I'm on Windows 10 x64)

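There are quite a few crawler frameworks for Python; I picked Scrapy simply because it is one of the best known. First comes installation. I started from zero, beginning with installing Python itself, so that is where this write-up starts too.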

1.1 Installing Python

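At first I installed Python 3.7, but installing Scrapy kept failing with the dependency error "Microsoft Visual C++ 14.0 is required". I could not get past it no matter what I tried, until I came across a sentence in the official Scrapy tutorial that I took to mean only Python 2.7 was supported. WTF! That cost me a lot of time. (In hindsight the Python version was not the cause: as section 1.3 shows, the same error appears under 2.7, because it comes from compiling Twisted.) Fine, 2.7 it is. I downloaded python-2.7.15.amd64.msi from the official Python site. I don't remember whether the installer added the environment variables automatically; if it didn't, add them by hand, which is easy. Append the following paths to PATH: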

$(Python install path)

$(Python install path)\Scripts

My paths are

D:\softwares\Python27
D:\softwares\Python27\Scripts

Once installation finishes, press Win+R, run cmd, and type python. If the interpreter responds, the installation worked.
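
To double-check the PATH step without reading through the whole variable by eye, a small script can report which of the two directories are still missing. This is only a sketch: the helper name missing_path_entries is made up for illustration, and the install directory is simply the one used in this post.

```python
import ntpath
import os


def missing_path_entries(install_dir, path_value):
    """Return the required directories that are absent from a PATH string.

    Comparison follows Windows rules (case-insensitive, either slash style).
    ntpath is used so the helper behaves the same on any operating system.
    """
    required = [install_dir, ntpath.join(install_dir, "Scripts")]

    def normalize(entry):
        return ntpath.normcase(ntpath.normpath(entry.strip().strip('"')))

    present = set(normalize(p) for p in path_value.split(";") if p.strip())
    return [d for d in required if normalize(d) not in present]


if __name__ == "__main__":
    # D:\softwares\Python27 is the install path used in this post; adjust as needed.
    print(missing_path_entries(r"D:\softwares\Python27", os.environ.get("PATH", "")))
```

An empty list means both directories are already on PATH; anything it prints still needs to be added.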

1.2 Installing a Python IDE: PyCharm

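PyCharm seems to be widely used, so that is what I installed. It looks as though it was modeled on Visual Studio; the two feel very similar. PyCharm comes in a Professional and a Community edition, and being broke, I naturally went with Community. Users in mainland China apparently cannot open the site directly, but the download link itself seems to work, so, as with Python above, here is the version I downloaded: pycharm2018.1.4.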

1.3 Installing Scrapy

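One nice thing about Python is its package manager, pip, which makes it easy to download packages such as Scrapy instead of hunting for them around the web. The Python 2.7.15 we installed earlier already ships with pip, so we can use it to install Scrapy. In cmd, enter the following command: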

pip install scrapy

Then, barring surprises, you will usually get the following missing-package error:

building 'twisted.test.raiser' extension
error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": http://landinghub.visualstudio.com/visual-cpp-build-tools

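Don't panic. Download the matching precompiled package from this site, and there is no need to compile anything on your own machine. Here it is Twisted that is missing, so based on my system and Python version I chose Twisted-18.7.0-cp27-cp27m-win_amd64.whl. Once it is downloaded, install it from cmd with the following command: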

pip install d:\Twisted-18.7.0-cp27-cp27m-win_amd64.whl

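Once that finishes, install Scrapy again: rerun pip install scrapy and check for other dependency errors. If there are any, handle them the same way. At this point everything Scrapy needs is installed, and we can move on to actually scraping something.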
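
Choosing the right file "based on my system and Python version" comes down to reading the tags in the wheel's filename: cp27 is the Python tag (CPython 2.7), cp27m is the ABI tag, and win_amd64 is the platform tag (64-bit Windows). The sketch below splits a filename into those parts following the wheel naming convention from PEP 427. The function names are illustrative, not part of pip, and current_cpython_tag assumes the interpreter is CPython.

```python
import sys


def parse_wheel_filename(filename):
    """Split a wheel filename into its PEP 427 components."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel file: %r" % filename)
    parts = filename[:-len(".whl")].split("-")
    if len(parts) == 5:
        name, version, python_tag, abi_tag, platform_tag = parts
        build = None
    elif len(parts) == 6:  # an optional build tag sits after the version
        name, version, build, python_tag, abi_tag, platform_tag = parts
    else:
        raise ValueError("unexpected wheel filename: %r" % filename)
    return {
        "name": name,
        "version": version,
        "build": build,
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }


def current_cpython_tag():
    """Python tag of the running interpreter, assuming it is CPython."""
    return "cp%d%d" % sys.version_info[:2]


if __name__ == "__main__":
    info = parse_wheel_filename("Twisted-18.7.0-cp27-cp27m-win_amd64.whl")
    print("%s %s %s" % (info["python"], info["abi"], info["platform"]))
    print("interpreter tag: %s" % current_cpython_tag())
```

If the wheel's Python tag differs from the interpreter tag, pip refuses to install the file, so comparing the two before downloading saves a round trip.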

2. Scraping static images

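A Taobao product page makes a great target, because it contains both static and dynamic data, which is ideal for learning. Let's start with the product's option thumbnails, the images inside the J_isku block that the spider in this section selects.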

2.1 Creating the Scrapy project

First, create an empty Scrapy project with the following command.

scrapy startproject taobao

Once the project has been generated, open it in PyCharm. First we edit items.py; the class in it temporarily holds the scraped data:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class taobaoItem(scrapy.Item):
    url = scrapy.Field()
    name = scrapy.Field()
    image_urls = scrapy.Field()

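Here we store the product's address, its name, and the image URLs. Next we create a new spider; let's call it taobaoSpider. A spider requests pages and extracts the addresses of the things to scrape. Put simply, it does the link-handling work.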

# -*- coding: utf-8 -*-
import scrapy
from scrapy.selector import Selector
from taobao.items import taobaoItem
from scrapy_splash import SplashRequest
class taobaoSpider(scrapy.Spider):
    name = "taobaoSpider"  # must match the name passed to process.crawl() in main.py
    allowed_domains = ["taobao.com"]
    start_urls = []
    def start_requests(self):
        input_url = 'https://item.taobao.com/item.htm?spm=a1z10.1-c.w4023-18381915794.4.44d14551es5Ex7&id=556114290901'
        self.start_urls.append(input_url)
        for url in self.start_urls:
            yield SplashRequest(url=url, callback=self.parse)


    def parse(self, response):
        # Wrap the page source in a Scrapy Selector
        sel = Selector(response)
        for link in sel.xpath('//*[@id="J_isku"]/div/dl[1]/dd/ul/li/a'):
            # The thumbnail address is embedded in the inline style attribute
            url = link.xpath('@style').extract()[0]
            # Cut the address out of the style text and ask for the 400x400 version
            image_url = "http://" + url[17:-28] + "400x400.jpg"
            image_urls = [image_url]
            name = link.xpath('span/text()').extract()
            item = taobaoItem()
            item['url'] = url
            item['name'] = name
            item['image_urls'] = image_urls
            yield item  # pass the item on to the pipeline
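
The slice url[17:-28] in parse only works while the style attribute has exactly the expected length on both sides of the address. As a hedged alternative, a regular expression can pull the address out by shape rather than by position. The sample style string below is an assumption reconstructed from the slice offsets (17 characters for "background:url(//", 28 for "30x30.jpg) center no-repeat;"), not something captured from the site, and image_url_from_style is a hypothetical helper name.

```python
import re

# Assumed shape of the style attribute (hypothetical sample):
#   background:url(//img.example.com/path/pic.jpg_30x30.jpg) center no-repeat;
_STYLE_URL = re.compile(
    r"url\(\s*(?:https?:)?//(?P<base>[^)]+?)_\d+x\d+\.jpg\s*\)"
)


def image_url_from_style(style, size="400x400"):
    """Build the larger image URL from a thumbnail's inline style.

    Returns None when the style does not contain a recognizable thumbnail URL,
    instead of silently producing a malformed address.
    """
    match = _STYLE_URL.search(style)
    if match is None:
        return None
    return "http://" + match.group("base") + "_" + size + ".jpg"


if __name__ == "__main__":
    sample = "background:url(//img.example.com/path/pic.jpg_30x30.jpg) center no-repeat;"
    print(image_url_from_style(sample))
    # The fixed-offset slice from the spider gives the same result for this sample:
    print("http://" + sample[17:-28] + "400x400.jpg")
```

For the assumed format both approaches agree; the regular expression merely keeps working if the thumbnail size or the trailing CSS changes.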

Next, modify settings.py, the configuration file where various parameters are set:

# -*- coding: utf-8 -*-

# Scrapy settings for taobao project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'taobao'

SPIDER_MODULES = ['taobao.spiders']
NEWSPIDER_MODULE = 'taobao.spiders'

ITEM_PIPELINES = {
    'taobao.pipelines.taobaoPipeline': 1,
}
# Directory where downloaded images are saved
IMAGES_STORE = 'd:/download'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True

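Note: the settings.py generated by scrapy startproject sets ROBOTSTXT_OBEY to True, which makes Scrapy obey each site's robots.txt. With it set to True, Taobao's data cannot be crawled: the requests are filtered out and a notice appears in the log. It therefore has to be changed to False: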

ROBOTSTXT_OBEY = False
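
To see what kind of check ROBOTSTXT_OBEY switches on, the standard library's robots.txt parser can evaluate a rule set against a URL. This is not Scrapy's own middleware, only an illustration of the rule-matching idea; the two robots.txt bodies and the example.com URL are hypothetical and not copied from any real site.

```python
try:
    from urllib import robotparser  # Python 3
except ImportError:
    import robotparser  # Python 2.7, as used in this post


def allowed_by_robots(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt bodies for illustration only.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"
BLOCK_PRIVATE = "User-agent: *\nDisallow: /private/\n"

if __name__ == "__main__":
    url = "https://item.example.com/item.htm?id=1"
    print(allowed_by_robots(BLOCK_ALL, "Scrapy", url))
    print(allowed_by_robots(BLOCK_PRIVATE, "Scrapy", url))
```

A site whose robots.txt disallows everything for the crawler's user agent behaves like the first case, which is why no page gets fetched while the setting is True.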

Finally, set up pipelines.py, which persists the scraped data, i.e. whatever should be stored on disk or in a database:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import requests
from taobao import settings
import os

class taobaoPipeline(object):
    def process_item(self, item, spider):
        if 'image_urls' in item:  # only handle items that carry image URLs
            images = []  # local paths of this item's images

            dir_path = '%s/%s' % (settings.IMAGES_STORE, spider.name)

            if not os.path.exists(dir_path):
                os.makedirs(dir_path)
            for image_url in item['image_urls']:
                us = image_url.split('/')[-1:]
                image_file_name = '_'.join(us)
                file_path = '%s/%s' % (dir_path, image_file_name)
                images.append(file_path)
                if os.path.exists(file_path):
                    continue

                headers = {
                    'user-agent': "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:53.0) Gecko/20100101 Firefox/53.0",
                    'cookie': "user_trace_token=20170502200739-07d687303c1e44fa9c7f0259097266d6;"
                }
                # Request first, so a failed download does not leave an empty file
                # behind that the os.path.exists() check would then skip forever
                response = requests.get(image_url, stream=True, headers=headers)
                response.raise_for_status()
                with open(file_path, 'wb') as handle:
                    for block in response.iter_content(1024):
                        if block:
                            handle.write(block)
        return item
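
The file naming in process_item is easy to misread: image_url.split('/')[-1:] is a one-element list, so '_'.join(...) simply returns the last path segment of the URL. The sketch below isolates that step; local_image_path is a hypothetical helper name, and the URL in the usage lines is a made-up example.

```python
import posixpath


def local_image_path(store_dir, spider_name, image_url):
    """Map an image URL to <store_dir>/<spider_name>/<last URL segment>."""
    file_name = image_url.split("/")[-1]
    if not file_name:
        raise ValueError("URL has no file name: %r" % image_url)
    # Forward slashes are used throughout, matching IMAGES_STORE = 'd:/download'
    return posixpath.join(store_dir, spider_name, file_name)


if __name__ == "__main__":
    url = "http://img.example.com/path/pic.jpg_400x400.jpg"
    print(local_image_path("d:/download", "taobaoSpider", url))
    # Same file name as the pipeline's original expression:
    print("_".join(url.split("/")[-1:]))
```

Because only the last segment is kept, two images with the same file name under different URL paths would end up at the same local path.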

Finally, create a main.py file in the taobao directory to launch the crawl:

# -*- coding: utf-8 -*-
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
process = CrawlerProcess(get_project_settings())
# 'taobaoSpider' must match the name attribute of the spider class.
process.crawl("taobaoSpider")
process.start()  # the script will block here until the crawling is finished

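The project now contains items.py, settings.py, pipelines.py, the spider, and main.py.

In PyCharm, click Edit Configurations in the top-right corner, select main.py as the script, and run the program. The scraped images are then saved to the D:\download\taobaoSpider directory on disk.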
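
A quick way to confirm the run worked is to count the files in the download directory and flag any that are empty. Since the pipeline skips files that already exist, an empty file would not be downloaded again on the next run, so it is worth spotting. The helper name summarize_downloads is made up for this sketch.

```python
import os


def summarize_downloads(directory):
    """Return (number of files, sorted names of zero-byte files) in a directory."""
    names = sorted(os.listdir(directory))
    files = [n for n in names if os.path.isfile(os.path.join(directory, n))]
    empty = [n for n in files if os.path.getsize(os.path.join(directory, n)) == 0]
    return len(files), empty


if __name__ == "__main__":
    target = "d:/download/taobaoSpider"  # the directory from this post
    if os.path.isdir(target):
        count, empty = summarize_downloads(target)
        print("%d files, %d empty" % (count, len(empty)))
```

Deleting any files it reports as empty lets the next crawl fetch them again.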