The Scrapy Framework: Command-Line Tools

Date: 2019-11-11

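Commonly used Scrapy commands:

Global commands, which can run without a project: startproject settings runspider shell fetch view version

Project-only commands: crawl check list edit parse genspider deploy bench

The sections below describe these commands one by one:

1. startproject: create a new crawler project

Syntax:

scrapy startproject <project_name>

2. genspider: create a new spider in the project

Syntax:

scrapy genspider [-t template] <name> <domain>

Four templates are available: basic, crawl, csvfeed, and xmlfeed. Use -d to preview the code a template generates: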

D:\crawler\lagou\spider>scrapy genspider -d basic
# -*- coding: utf-8 -*-
import scrapy


class $classname(scrapy.Spider):
    name = "$name"
    allowed_domains = ["$domain"]
    start_urls = (
        'http://www.$domain/',
    )

    def parse(self, response):
        pass
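
The $classname, $name, and $domain placeholders in the template follow the syntax of Python's string.Template. The sketch below fills them in by hand to show what a generated spider file could look like. It assumes plain string.Template substitution and is not Scrapy's own rendering code; the values LagouSpider, lagou, and lagou.com are hypothetical.

```python
from string import Template

# Template text as printed by `scrapy genspider -d basic` above.
BASIC_TEMPLATE = '''# -*- coding: utf-8 -*-
import scrapy


class $classname(scrapy.Spider):
    name = "$name"
    allowed_domains = ["$domain"]
    start_urls = (
        'http://www.$domain/',
    )

    def parse(self, response):
        pass
'''


def render_spider(classname: str, name: str, domain: str) -> str:
    """Fill the three placeholders and return the spider source as text."""
    return Template(BASIC_TEMPLATE).substitute(
        classname=classname, name=name, domain=domain
    )


# Hypothetical values, chosen only for illustration.
rendered = render_spider("LagouSpider", "lagou", "lagou.com")
print(rendered)
```

The rendered text is only a string here; genspider itself writes the file into the project's spiders directory.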

3. crawl: run a spider

Syntax:

scrapy crawl <spider>
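
When a crawl has to be started from a script rather than typed by hand, one option is to build the same command line and hand it to subprocess. The sketch below is one way to do that, not the only one. The -a (spider argument) and -o (output file) options belong to scrapy crawl; the helper names, the city argument, and the jobs.json file name are hypothetical.

```python
import subprocess
from typing import Dict, List, Optional


def build_crawl_argv(spider: str,
                     output: Optional[str] = None,
                     spider_args: Optional[Dict[str, str]] = None) -> List[str]:
    """Build the argument list for `scrapy crawl <spider>`."""
    argv = ["scrapy", "crawl", spider]
    # -a NAME=VALUE passes an argument to the spider (may be repeated).
    for key, value in (spider_args or {}).items():
        argv += ["-a", f"{key}={value}"]
    # -o FILE writes the scraped items to a feed file.
    if output:
        argv += ["-o", output]
    return argv


def run_crawl(project_dir: str, spider: str, **kwargs) -> int:
    """Run the crawl inside project_dir and return the process exit code."""
    return subprocess.call(build_crawl_argv(spider, **kwargs), cwd=project_dir)


# Only the command line is built here; run_crawl() is not called.
argv = build_crawl_argv("lagou", output="jobs.json",
                        spider_args={"city": "beijing"})
print(" ".join(argv))
```

Because crawl is a project-only command, run_crawl sets the working directory to the project root before launching it.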

4. check: run the spider's contract checks

Syntax:

scrapy check [-l] <spider>

For example:

D:\crawler\lagou\spider>scrapy check lagou

----------------------------------------------------------------------
Ran 0 contracts in 0.000s

OK
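
"Ran 0 contracts" means the spider's callbacks declare no contracts. A contract is an @-prefixed line in a callback's docstring; @url, @returns, and @scrapes are the built-in ones. The sketch below shows such a docstring on a plain function (in a real spider it would be a method) together with a hypothetical helper, list_contracts, that simply collects the @ lines. It is an illustration only, not how Scrapy parses contracts, and the URL and field names are made up.

```python
import inspect
from typing import Callable, List


def parse(response):
    """Callback with three contracts declared in its docstring.

    @url http://www.example.com/jobs
    @returns items 1 20
    @scrapes title company
    """
    return []


def list_contracts(callback: Callable) -> List[str]:
    """Return the docstring lines that start with '@' (illustrative helper)."""
    doc = inspect.getdoc(callback) or ""
    return [line.strip() for line in doc.splitlines()
            if line.strip().startswith("@")]


contracts = list_contracts(parse)
print(f"Found {len(contracts)} contracts")
for line in contracts:
    print(line)
```

With a docstring like this on a spider callback, scrapy check has something to run, and scrapy check -l lists the contracts it found.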

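5. fetch: download a given URL

Syntax:

scrapy fetch <url>

Downloads the given URL with the Scrapy downloader and writes the retrieved content to standard output.

For example, to view the headers for Baidu: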

D:\crawler\lagou\spider>scrapy fetch --nolog --headers http://www.baidu.com
> Accept-Language: en
> Accept-Encoding: gzip,deflate
> Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> User-Agent: Scrapy/1.0.3 (+http://scrapy.org)
>
< Bdqid: 0x9eec8e8400034b3c
< Bduserid: 0
< Set-Cookie: BAIDUID=20B9DAB5F75E14AFB3447C6857ACFDA3:FG=1; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
< Set-Cookie: BIDUPSID=20B9DAB5F75E14AFB3447C6857ACFDA3; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
< Set-Cookie: PSTM=1452846490; expires=Thu, 31-Dec-37 23:55:55 GMT; max-age=2147483647; path=/; domain=.baidu.com
< Set-Cookie: BDSVRTM=0; path=/
< Set-Cookie: BD_HOME=0; path=/
< Set-Cookie: H_PS_PSSID=18439_18720_1431_18878_12825_17565_18965_18768_18971_18778_18780_17000_18782_17072_15098_12356_18018_10634; path=/; domain=.baidu.com
< Expires: Fri, 15 Jan 2016 08:27:26 GMT
< Vary: Accept-Encoding
< X-Powered-By: HPHP
< Server: BWS/1.1
< Cxy_All: baidu+12d68e0b8747863f4dbde8d1321e60b9
< Cache-Control: private
< Date: Fri, 15 Jan 2016 08:28:10 GMT
< P3P: CP=" OTI DSP COR IVA OUR IND COM "
< Content-Type: text/html; charset=utf-8
< Bdpagetype: 1
< X-Ua-Compatible: IE=Edge,chrome=1
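
In this output, lines starting with > are the request headers Scrapy sent and lines starting with < are the response headers it received. If the output needs to be processed further, it can be split along those markers. The sketch below is one possible approach; parse_fetch_headers is a hypothetical helper, and the sample text is a shortened copy of the output above.

```python
from typing import Dict, List, Tuple

Headers = Dict[str, List[str]]


def parse_fetch_headers(output: str) -> Tuple[Headers, Headers]:
    """Split `scrapy fetch --headers` output into request and response headers."""
    request: Headers = {}
    response: Headers = {}
    for line in output.splitlines():
        if line.startswith(">"):
            target = request
        elif line.startswith("<"):
            target = response
        else:
            continue
        name, sep, value = line[1:].partition(":")
        if not sep:
            continue  # the bare ">" line that separates the two groups
        # A header such as Set-Cookie can repeat, so values are kept in a list.
        target.setdefault(name.strip(), []).append(value.strip())
    return request, response


sample = """\
> Accept-Language: en
> User-Agent: Scrapy/1.0.3 (+http://scrapy.org)
>
< Set-Cookie: BDSVRTM=0; path=/
< Set-Cookie: BD_HOME=0; path=/
< Server: BWS/1.1
< Content-Type: text/html; charset=utf-8
"""

request_headers, response_headers = parse_fetch_headers(sample)
print(request_headers["User-Agent"])
print(response_headers["Set-Cookie"])
```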

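6. view: open a URL in the browser as Scrapy sees it

Syntax:

scrapy view <url>

Some pages embed JavaScript or other scripts, so what Scrapy retrieves can differ from what a user sees in the browser. Use this command to inspect the page as the spider fetched it and confirm that it is what you expect.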

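7. shell: the Scrapy interactive console

Syntax:

scrapy shell [url]

Starts the Scrapy shell with the given URL already fetched, so that you can test XPath expressions and similar operations interactively.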
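
Inside the shell the fetched page is available as response, and an expression is typically tried with a call such as response.xpath('//title/text()').extract(). To keep the following sketch runnable without Scrapy, it uses the standard library's xml.etree.ElementTree as a stand-in. ElementTree supports only a limited XPath subset and needs well-formed markup, so this shows the kind of test, not the shell itself. The page content is made up.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup standing in for a downloaded page.
PAGE = """
<html>
  <head><title>Example jobs</title></head>
  <body>
    <a class="job" href="/jobs/1">Python developer</a>
    <a class="job" href="/jobs/2">Data engineer</a>
    <a href="/about">About</a>
  </body>
</html>
"""

root = ET.fromstring(PAGE.strip())

# Text of the <title> element.
title = root.findtext("./head/title")

# href and text of every <a> whose class attribute is "job".
job_anchors = root.findall(".//a[@class='job']")
job_links = [a.get("href") for a in job_anchors]
job_titles = [a.text for a in job_anchors]

print(title)
print(job_links)
print(job_titles)
```

Once an expression returns the expected nodes, it can be copied into the spider's parse method.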

8. runspider: run a spider without creating a project

Syntax:

scrapy runspider <spider_file>

9. deploy: deploy the project to a Scrapyd service

Syntax:

scrapy deploy [ <target:project> | -l <target> | -L ]

For details, see the official documentation: http://scrapyd.readthedocs.org/en/latest/deploy.html

When reposting, please credit OSChina: http://my.oschina.net/u/2463131/blog/603333