Scrapy is a mature crawler framework that fetches web pages and extracts structured data from them; it is already used in production by a number of companies. For more details, see the official website: www.scrapy.org.
Below we follow the official installation guide step by step, based mainly on http://doc.scrapy.org/en/latest/intro/install.html:
- Requirements
- Python 2.5, 2.6, 2.7 (3.x is not yet supported)
- Twisted 2.5.0, 8.0 or above (Windows users: you’ll need to install Zope.Interface and maybe pywin32 because of this Twisted bug)
- w3lib
- lxml or libxml2 (if using libxml2, version 2.6.28 or above is highly recommended)
- simplejson (not required if using Python 2.6 or above)
- pyopenssl (for HTTPS support. Optional, but highly recommended)
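At any point during the steps below, the list above can be re-checked with a few import attempts. A minimal sketch (the import names are assumptions where they differ from the package names, e.g. pyopenssl imports as OpenSSL):

```python
# Try importing each dependency and report which ones are present.
# Missing modules are reported rather than raising an error, so this is
# safe to run before everything is installed.
import importlib

def check(mods=("twisted", "w3lib", "lxml", "OpenSSL", "zope.interface")):
    results = {}
    for mod in mods:
        try:
            importlib.import_module(mod)
            results[mod] = "OK"
        except ImportError:
            results[mod] = "missing"
    return results

for mod, status in sorted(check().items()):
    print("%s: %s" % (mod, status))
```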
Below is a record of the whole process, from installing Python to installing Scrapy; at the end we verify the setup by running a command that fetches a page.
Preparation
Operating system: RHEL 5
Python version: Python-2.7.2
zope.interface version: zope.interface-3.8.0
Twisted version: Twisted-11.1.0
libxml2 version: libxml2-2.7.4.tar.gz
w3lib version: w3lib-1.0
Scrapy version: Scrapy-0.14.0.2841
Installation and configuration
1. Install zlib
First check whether zlib is already installed on your system. It is a data-compression library that the Scrapy stack depends on. On my RHEL 5 system, check with:
- [root@localhost scrapy]# rpm -qa zlib
- zlib-1.2.3-3
It was already installed by default on my system, so this step can be skipped. If it is not installed, you can download it from
http://www.zlib.net/
and build it yourself. Assuming you downloaded zlib-1.2.5.tar.gz, the commands are:
- [root@localhost scrapy]# tar -xvzf zlib-1.2.5.tar.gz
- [root@localhost scrapy]# cd zlib-1.2.5
- [root@localhost zlib-1.2.5]# ./configure
- [root@localhost zlib-1.2.5]# make
- [root@localhost zlib-1.2.5]# make install
2. Install Python
My system already had Python 2.4. Following the requirements and recommendations above, I chose Python-2.7.2; download links:
http://www.python.org/download/ (may require a proxy)
http://www.python.org/ftp/python/2.7.2/Python-2.7.2.tgz
I downloaded the Python source, built it, and installed it as follows:
- [root@localhost scrapy]# tar -zvxf Python-2.7.2.tgz
- [root@localhost scrapy]# cd Python-2.7.2
- [root@localhost Python-2.7.2]# ./configure
- [root@localhost Python-2.7.2]# make
- [root@localhost Python-2.7.2]# make install
By default, Python is installed under /usr/local/lib/python2.7.
If Python was not previously installed on your system, try running it from the command line:
- [root@localhost scrapy]# python
- Python 2.7.2 (default, Dec 5 2011, 22:04:07)
- [GCC 4.1.1 20070105 (Red Hat 4.1.1-52)] on linux2
- Type "help", "copyright", "credits" or "license" for more information.
- >>>
This shows the newly installed Python is ready to use.
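Since zlib was installed first, the new interpreter should have been built with zlib support; a quick compress/decompress round-trip confirms it:

```python
# Round-trip a sample through zlib to confirm the module was built in.
import zlib

data = b"scrapy " * 100
packed = zlib.compress(data)
assert zlib.decompress(packed) == data
print("original: %d bytes, compressed: %d bytes" % (len(data), len(packed)))
```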
If another Python version is already on your system (mine had 2.4), create a symbolic link so that python resolves to the new build:
- [root@localhost python2.7]# mv /usr/bin/python /usr/bin/python.bak
- [root@localhost python2.7]# ln -s /usr/local/bin/python /usr/bin/python
After this, running python picks up the new version.
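To confirm which build the python command now resolves to, sys reports the running interpreter's version and install prefix (a source build as above typically reports /usr/local):

```python
# Print the running interpreter's version and install prefix.
import sys

print("%d.%d.%d" % sys.version_info[:3])
print(sys.prefix)
```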
3. Install setuptools
This step installs a tool for managing Python modules; skip it if it is already installed. If you need it, see the following links:
http://pypi.python.org/pypi/setuptools/0.6c11#installation-instructions
http://pypi.python.org/packages/2.7/s/setuptools/setuptools-0.6c11-py2.7.egg#md5=fe1f997bc722265116870bc7919059ea
Note also that the Python-2.7.2 source tree contains a setup.py script, which builds and installs Python's optional extension modules; run:
- [root@localhost Python-2.7.2]# python setup.py install
After this runs, the modules are installed under /usr/local/lib/python2.7/site-packages.
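You can also ask the interpreter itself where third-party modules go. Note that site.getsitepackages() is present on standard builds but may be absent inside some virtualenvs:

```python
# List the site-packages directories this interpreter searches.
import site

for path in site.getsitepackages():
    print(path)
```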
4. Install zope.interface
Download links:
http://pypi.python.org/pypi/zope.interface/3.8.0
http://pypi.python.org/packages/source/z/zope.interface/zope.interface-3.8.0.tar.gz#md5=8ab837320b4532774c9c89f030d2a389
Install as follows:
- [root@localhost scrapy]# tar -xvzf zope.interface-3.8.0.tar.gz
- [root@localhost scrapy]# cd zope.interface-3.8.0
- [root@localhost zope.interface-3.8.0]# python setup.py build
- [root@localhost zope.interface-3.8.0]# python setup.py install
After installation you should see zope and zope.interface-3.8.0-py2.7.egg-info under /usr/local/lib/python2.7/site-packages.
5. Install Twisted
Download links:
http://twistedmatrix.com/trac/
http://pypi.python.org/packages/source/T/Twisted/Twisted-11.1.0.tar.bz2#md5=972f3497e6e19318c741bf2900ffe31c
Install as follows:
- [root@localhost scrapy]# bzip2 -d Twisted-11.1.0.tar.bz2
- [root@localhost scrapy]# tar -xvf Twisted-11.1.0.tar
- [root@localhost scrapy]# cd Twisted-11.1.0
- [root@localhost Twisted-11.1.0]# python setup.py install
After installation you should see twisted and Twisted-11.1.0-py2.7.egg-info under /usr/local/lib/python2.7/site-packages.
6. Install w3lib
Download links:
http://pypi.python.org/pypi/w3lib
http://pypi.python.org/packages/source/w/w3lib/w3lib-1.0.tar.gz#md5=f28aeb882f27a616e0fc43d01f4dcb21
Install as follows:
- [root@localhost scrapy]# tar -xvzf w3lib-1.0.tar.gz
- [root@localhost scrapy]# cd w3lib-1.0
- [root@localhost w3lib-1.0]# python setup.py install
After installation you should see w3lib and w3lib-1.0-py2.7.egg-info under /usr/local/lib/python2.7/site-packages.
7. Install libxml2
Download links:
http://download.chinaunix.net/download.php?id=28497&ResourceID=6095
http://download.chinaunix.net/down.php?id=28497&ResourceID=6095&site=1
Alternatively, you can find the matching release on http://xmlsoft.org.
Install as follows:
- [root@localhost scrapy]# tar -xvzf libxml2-2.7.4.tar.gz
- [root@localhost scrapy]# cd libxml2-2.7.4
- [root@localhost libxml2-2.7.4]# ./configure
- [root@localhost libxml2-2.7.4]# make
- [root@localhost libxml2-2.7.4]# make install
八、安装pyOpenSSL
该步骤可选,对应的安装包下载地址为:
https://launchpad.net/pyopenssl
若是须要的话,能够选择须要的版本。我这里直接跳过该步骤。
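pyOpenSSL is what Scrapy uses for HTTPS, but Python's bundled ssl module also shows whether OpenSSL was found when the interpreter was built, which is a cheap related sanity check:

```python
# Report the OpenSSL version the interpreter was linked against.
import ssl

print(ssl.OPENSSL_VERSION)
```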
9. Install Scrapy
Download links:
http://scrapy.org/download/
http://pypi.python.org/pypi/Scrapy
http://pypi.python.org/packages/source/S/Scrapy/Scrapy-0.14.0.2841.tar.gz#md5=fe63c5606ca4c0772d937b51869be200
Install as follows:
- [root@localhost scrapy]# tar -xvzf Scrapy-0.14.0.2841.tar.gz
- [root@localhost scrapy]# cd Scrapy-0.14.0.2841
- [root@localhost Scrapy-0.14.0.2841]# python setup.py install
Verifying the installation
With the steps above completed, Scrapy is installed. Verify it from the command line:
- [root@localhost scrapy]# scrapy
- Scrapy 0.14.0.2841 - no active project
-
- Usage:
- scrapy <command> [options] [args]
-
- Available commands:
- fetch Fetch a URL using the Scrapy downloader
- runspider Run a self-contained spider (without creating a project)
- settings Get settings values
- shell Interactive scraping console
- startproject Create new project
- version Print Scrapy version
- view Open URL in browser, as seen by Scrapy
-
- Use "scrapy <command> -h" to see more info about a command
The listing above includes a fetch command, which downloads a given page. Look at its help first:
- [root@localhost scrapy]# scrapy fetch --help
- Usage
- =====
- scrapy fetch [options] <url>
-
- Fetch a URL using the Scrapy downloader and print its content to stdout. You
- may want to use --nolog to disable logging
-
- Options
- =======
- --help, -h show this help message and exit
- --spider=SPIDER use this spider
- --headers print response HTTP headers instead of body
-
- Global Options
- --------------
- --logfile=FILE log file. if omitted stderr will be used
- --loglevel=LEVEL, -L LEVEL
- log level (default: DEBUG)
- --nolog disable logging completely
- --profile=FILE write python cProfile stats to FILE
- --lsprof=FILE write lsprof profiling stats to FILE
- --pidfile=FILE write process ID to FILE
- --set=NAME=VALUE, -s NAME=VALUE
- set/override setting (may be repeated)
Following the usage line, give it a URL and it fetches that page:
- [root@localhost scrapy]# scrapy fetch http://doc.scrapy.org/en/latest/intro/install.html > install.html
- 2011-12-05 23:40:04+0800 [scrapy] INFO: Scrapy 0.14.0.2841 started (bot: scrapybot)
- 2011-12-05 23:40:04+0800 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState
- 2011-12-05 23:40:04+0800 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
- 2011-12-05 23:40:04+0800 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
- 2011-12-05 23:40:04+0800 [scrapy] DEBUG: Enabled item pipelines:
- 2011-12-05 23:40:05+0800 [default] INFO: Spider opened
- 2011-12-05 23:40:05+0800 [default] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
- 2011-12-05 23:40:05+0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
- 2011-12-05 23:40:05+0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
- 2011-12-05 23:40:07+0800 [default] DEBUG: Crawled (200) <GET http://doc.scrapy.org/en/latest/intro/install.html> (referer: None)
- 2011-12-05 23:40:07+0800 [default] INFO: Closing spider (finished)
- 2011-12-05 23:40:07+0800 [default] INFO: Dumping spider stats:
- {'downloader/request_bytes': 227,
- 'downloader/request_count': 1,
- 'downloader/request_method_count/GET': 1,
- 'downloader/response_bytes': 22676,
- 'downloader/response_count': 1,
- 'downloader/response_status_count/200': 1,
- 'finish_reason': 'finished',
- 'finish_time': datetime.datetime(2011, 12, 5, 15, 40, 7, 918833),
- 'scheduler/memory_enqueued': 1,
- 'start_time': datetime.datetime(2011, 12, 5, 15, 40, 5, 5749)}
- 2011-12-05 23:40:07+0800 [default] INFO: Spider closed (finished)
- 2011-12-05 23:40:07+0800 [scrapy] INFO: Dumping global stats:
- {'memusage/max': 17711104, 'memusage/startup': 17711104}
- [root@localhost scrapy]# ll install.html
- -rw-r--r-- 1 root root 22404 Dec 5 23:40 install.html
- [root@localhost scrapy]#
As you can see, we have successfully fetched a web page.
From here, you can follow the official Scrapy tutorial to go further with the framework: http://doc.scrapy.org/en/latest/intro/tutorial.html.