Installing Scrapy and Creating a Project

Date: 2019-11-10

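Scrapy is a popular web crawling framework. Starting with this post, I will record my whole process of learning Scrapy under Python 3.6, so that I can extend and revisit it later.
This post covers installing Scrapy, creating a project, and the basic commands for testing a spider.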

Installing Scrapy

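Install Scrapy with pip. The installation may fail because of missing dependencies. If it does, download the required dependencies one by one as the error messages indicate, taking care to match your operating system and Python version.

The packages I installed, in order, were: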

    pip install pywin32-223-cp36-cp36m-win32.whl

    pip install Twisted-17.9.0-cp36-cp36m-win32.whl

    pip install scrapy

Unofficial Windows Binaries for Python Extension Packages: https://www.lfd.uci.edu/~gohlke/pythonlibs/
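Choosing the right .whl file depends on the Python version and on whether the interpreter is 32-bit or 64-bit. The following is a minimal sketch, not part of the original post, that prints the two tags to look for in a wheel file name (for example cp36 and win32 in the files above). The helper name wheel_tags is hypothetical, and the platform tag is only meaningful on Windows.

```python
import struct
import sys


def wheel_tags():
    """Return (python_tag, platform_tag) as they appear in Windows wheel names."""
    # e.g. Python 3.6 -> "cp36"
    python_tag = "cp{}{}".format(sys.version_info.major, sys.version_info.minor)
    # Pointer size tells us whether the interpreter itself is 32-bit or 64-bit.
    bits = struct.calcsize("P") * 8
    platform_tag = "win_amd64" if bits == 64 else "win32"
    return python_tag, platform_tag


if __name__ == "__main__":
    print(wheel_tags())
```

Run it with the same interpreter that pip uses, then pick the wheel whose file name contains both tags.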

Creating a Project

Once Scrapy is installed, open cmd, change to the directory where the project should be stored, and create a new project with the startproject command:

D:\>scrapy startproject scraptest
New Scrapy project 'scraptest', using template directory 'c:\\python36-32\\lib\\
site-packages\\scrapy\\templates\\project', created in:
    D:\scraptest

You can start your first spider with:
    cd scraptest
    scrapy genspider example example.com

The corresponding project directory tree is generated under D:\scraptest\:

scraptest/
    scrapy.cfg
    scraptest/
        __init__.py
        items.py          # defines the model for the scraped fields
        pipelines.py
        settings.py       # defines settings such as the user agent and crawl delay
        middlewares.py
        __pycache__/
        spiders/
            __pycache__/
            __init__.py
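As an illustration of the kind of values settings.py holds, here is a short excerpt. The setting names are standard Scrapy settings, but the values are assumptions chosen for this sketch and do not come from the original project.

```python
# settings.py (excerpt); illustrative values only
BOT_NAME = "scraptest"

# Hypothetical identification string sent as the User-Agent header.
USER_AGENT = "scraptest (+http://www.example.com)"

# Wait this many seconds between requests to the same site.
DOWNLOAD_DELAY = 5

# Send at most one request at a time to any single domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 1

# Respect the target site's robots.txt rules.
ROBOTSTXT_OBEY = True
```

A delay of a few seconds combined with a single concurrent request per domain keeps a learning project from putting noticeable load on the target site.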

Creating a Spider

Use the genspider command, passing the spider name, the domain, and optionally a template:

D:\scraptest>scrapy genspider country example.webscraping.com
Created spider 'country' using template 'basic' in module:
  scraptest.spiders.country

This creates country.py in the D:\scraptest\scraptest\spiders directory:

# -*- coding: utf-8 -*-
import scrapy

class CountrySpider(scrapy.Spider):
    name = 'country'
    allowed_domains = ['example.webscraping.com']
    start_urls = ['http://example.webscraping.com/']

    def parse(self, response):
        pass

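1. name is the spider's name and must be set. According to the source code, an empty value raises a ValueError.
2. start_urls lists the pages to crawl.
3. The parse method name must not be changed, because the source code uses it as the default callback.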
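The three points above can be sketched with a toy model. MiniSpider below is a simplified stand-in written for this illustration. It is not Scrapy's actual source code, and its error message and (url, callback) pairs are assumptions that only mirror the behaviour described in the list.

```python
class MiniSpider:
    """Toy stand-in for scrapy.Spider; NOT Scrapy's actual source code."""

    name = None
    start_urls = []

    def __init__(self):
        # Point 1: a spider without a name is rejected.
        if not self.name:
            raise ValueError("%s must have a name" % type(self).__name__)

    def start_requests(self):
        # Points 2 and 3: every start URL is paired with the default callback.
        for url in self.start_urls:
            yield url, self.parse

    def parse(self, response):
        raise NotImplementedError("subclasses define parse")


class CountrySpider(MiniSpider):
    name = "country"
    start_urls = ["http://example.webscraping.com/"]

    def parse(self, response):
        # A real spider would extract data here; the toy just echoes its input.
        return "parsed " + response


class NamelessSpider(MiniSpider):
    pass  # no name: instantiating this raises ValueError


def describe():
    """Feed the first start URL to its callback, as the toy engine would."""
    spider = CountrySpider()
    url, callback = next(spider.start_requests())
    return callback(url)
```

If CountrySpider renamed parse to something else, the toy engine would still call the base-class parse, which is the reason the name has to stay fixed unless a callback is specified explicitly.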

Testing the Spider

# -*- coding: utf-8 -*-
import scrapy
from lxml import etree

class CountrySpider(scrapy.Spider):
    name = 'country'
    allowed_domains = ['example.webscraping.com']
    start_urls = ['http://example.webscraping.com/places/default/view/Afghanistan-1']

    # This method name must not change: the Scrapy source uses parse as the default callback
    def parse(self, response):
        tree = etree.HTML(response.text)
        for node in tree.xpath('//tr/td[@class="w2p_fw"]'):
            print(node.text)

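Run the spider with the crawl command. The amount of log output can be set with -s LOG_LEVEL=DEBUG or -s LOG_LEVEL=ERROR; the run below uses --nolog to suppress logging entirely: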

D:\scraptest>scrapy crawl country --nolog
None
647,500 square kilometres
29,121,286
AF
Afghanistan
Kabul
None
.af
AFN
Afghani
93
None
None
fa-AF,ps,uz-AF,tk
None
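The None lines in this output most likely come from cells that are empty or whose content sits inside a child element (an image or a link), because node.text only returns the text that appears before the first child. The sketch below reproduces the effect with the standard library's xml.etree.ElementTree instead of lxml. The sample markup is a hypothetical fragment written for this illustration, not the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment imitating the structure of the country table.
SAMPLE = """
<table>
  <tr><td class="w2p_fl">Flag:</td><td class="w2p_fw"><img src="flag.png" /></td></tr>
  <tr><td class="w2p_fl">Area:</td><td class="w2p_fw">647,500 square kilometres</td></tr>
  <tr><td class="w2p_fl">Continent:</td><td class="w2p_fw"><a href="/continent/AS">AS</a></td></tr>
</table>
"""


def cell_texts(markup):
    """Return (direct_text, full_text) for every value cell in the table."""
    root = ET.fromstring(markup.strip())
    cells = []
    for node in root.findall(".//tr/td[@class='w2p_fw']"):
        direct = node.text  # None when the cell starts with a child element
        full = "".join(node.itertext()).strip()  # text of the cell and its children
        cells.append((direct, full))
    return cells


if __name__ == "__main__":
    for direct, full in cell_texts(SAMPLE):
        print(repr(direct), "|", repr(full))
```

Collecting the text of all descendants, as itertext() does here, recovers values that are wrapped in links, while cells that contain only an image still yield an empty string.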