Scrapy Crawler -- 01

Date: 2019-11-13

Tags: scrapy, crawler


Scrapy is a fast, high-level screen-scraping and web-crawling framework written in Python. It is used to crawl websites and extract structured data from their pages.

--from wikiweb

Put simply, it is a crawler framework built on Python.

Installation:

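Ubuntu 14.04
Python 2.7 (at the time of writing, Python 3 is not supported; this is not laziness on the author's part, but because Twisted, which Scrapy depends on, has not yet been fully ported to Python 3)
pip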

sudo pip2 install scrapy

Note: pip3 can also install Scrapy, but the supporting libraries are missing, so the result is unusable. Stick with Python 2.
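A quick way to confirm the installation worked is to try importing the package from the same interpreter. The following is a minimal sketch; the helper name `scrapy_version` is made up for illustration and is not part of Scrapy.

```python
def scrapy_version():
    """Return the installed Scrapy version string, or None if it cannot be imported."""
    try:
        import scrapy
    except ImportError:
        # Scrapy (or one of its dependencies) is missing for this interpreter.
        return None
    return scrapy.__version__


version = scrapy_version()
if version is None:
    print("Scrapy is not importable from this Python interpreter")
else:
    print("Scrapy version: %s" % version)
```

Run it with the interpreter that pip2 installed into (for example `python2 check.py`, where `check.py` is a hypothetical file name).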

Usage:

1. Create a new project

scrapy startproject tutorial

This creates the following directory structure:

tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ..
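To check that the command produced the layout listed above, a small standard-library script can be used. This is only a sketch: the list of paths is copied from the listing, and the names `EXPECTED` and `missing_files` are invented for this example.

```python
import os

# Relative paths copied from the directory listing above.
EXPECTED = [
    "scrapy.cfg",
    os.path.join("tutorial", "__init__.py"),
    os.path.join("tutorial", "items.py"),
    os.path.join("tutorial", "pipelines.py"),
    os.path.join("tutorial", "settings.py"),
    os.path.join("tutorial", "spiders", "__init__.py"),
]


def missing_files(project_root):
    """Return the expected files that do not exist under project_root."""
    return [rel for rel in EXPECTED
            if not os.path.isfile(os.path.join(project_root, rel))]
```

Calling `missing_files("tutorial")` from the directory where `scrapy startproject` was run should return an empty list if every file in the listing was generated.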

The official documentation explains them as follows:

scrapy.cfg: the project configuration file
tutorial/: the project’s python module, you’ll later import your code from here. (This is the part of the project you customize.)
tutorial/items.py: the project’s items file. (In practice, this defines the structure of the data to be scraped.)
tutorial/pipelines.py: the project’s pipelines file. (This is where you define how scraped data is exported. A scrapy-mongodb pipeline package is available through pip that can export scraped data directly to MongoDB.)
tutorial/settings.py: the project’s settings file.
tutorial/spiders/: a directory where you’ll later put your spiders. (The spiders stored here are typically what actually crawl the web pages.)
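To make the role of pipelines.py more concrete, here is a minimal sketch of an item pipeline that writes each scraped item to a file as one JSON object per line. A pipeline is a plain Python class with a `process_item(item, spider)` method, so the sketch needs no Scrapy import. The class name `JsonLinesPipeline` and the output path `items.jl` are assumptions chosen for this example, not something from the original post.

```python
import json


class JsonLinesPipeline(object):
    """Hypothetical pipeline: write each scraped item as one JSON line."""

    def __init__(self, path="items.jl"):
        # "items.jl" is an assumed output path used only for illustration.
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Scrapy calls this when the spider starts.
        self.file = open(self.path, "w")

    def close_spider(self, spider):
        # Scrapy calls this when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        # Serialize the item, then return it so later pipelines still see it.
        self.file.write(json.dumps(dict(item)) + "\n")
        return item
```

For Scrapy to call such a class, it would have to be listed in the `ITEM_PIPELINES` setting in settings.py, for example under the key `tutorial.pipelines.JsonLinesPipeline` if the class were placed in tutorial/pipelines.py.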

To be continued...