Scrapy is a mature crawler framework that fetches web pages and extracts structured data from them; it is already used in production by a number of companies. For more details, see the official website: www.scrapy.org.
Below we follow the official installation guide step by step, based mainly on http://doc.scrapy.org/en/latest/intro/install.html:
- Requirements
- Python 2.5, 2.6, 2.7 (3.x is not yet supported)
- Twisted 2.5.0, 8.0 or above (Windows users: you’ll need to install Zope.Interface and maybe pywin32 because of this Twisted bug)
- w3lib
- lxml or libxml2 (if using libxml2, version 2.6.28 or above is highly recommended)
- simplejson (not required if using Python 2.6 or above)
- pyopenssl (for HTTPS support. Optional, but highly recommended)
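At any point during the steps below, the list above can be re-checked with a few import attempts. A minimal sketch (the import names are assumptions where they differ from the package names, e.g. pyopenssl imports as OpenSSL):

```python
# Try importing each dependency and report which ones are present.
# Missing modules are reported rather than raising an error, so this is
# safe to run before everything is installed.
import importlib

def check(mods=("twisted", "w3lib", "lxml", "OpenSSL", "zope.interface")):
    results = {}
    for mod in mods:
        try:
            importlib.import_module(mod)
            results[mod] = "OK"
        except ImportError:
            results[mod] = "missing"
    return results

for mod, status in sorted(check().items()):
    print("%s: %s" % (mod, status))
```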
Below is a record of the whole process, from installing Python to installing Scrapy; at the end we verify the setup by running a command that fetches a page.
Preparation
Operating system: RHEL 5
Python version: Python-2.7.2
zope.interface version: zope.interface-3.8.0
Twisted version: Twisted-11.1.0
libxml2 version: libxml2-2.7.4.tar.gz
w3lib version: w3lib-1.0
Scrapy version: Scrapy-0.14.0.2841
Installation and configuration
1. Install zlib
First check whether zlib is already installed on your system. It is a data-compression library that the Scrapy stack depends on. On my RHEL 5 system, check with:
- [root@localhost scrapy]# rpm -qa zlib
- zlib-1.2.3-3
It was already installed by default on my system, so this step can be skipped. If it is not installed, you can download it from
http://www.zlib.net/
and build it yourself. Assuming you downloaded zlib-1.2.5.tar.gz, the commands are:
- [root@localhost scrapy]# tar -xvzf zlib-1.2.5.tar.gz
- [root@localhost scrapy]# cd zlib-1.2.5
- [root@localhost zlib-1.2.5]# ./configure
- [root@localhost zlib-1.2.5]# make
- [root@localhost zlib-1.2.5]# make install
2. Install Python
My system already had Python 2.4. Following the requirements and recommendations above, I chose Python-2.7.2; download links:
http://www.python.org/download/ (may require a proxy)
http://www.python.org/ftp/python/2.7.2/Python-2.7.2.tgz
I downloaded the Python source, built it, and installed it as follows:
- [root@localhost scrapy]# tar -zvxf Python-2.7.2.tgz
- [root@localhost scrapy]# cd Python-2.7.2
- [root@localhost Python-2.7.2]# ./configure
- [root@localhost Python-2.7.2]# make
- [root@localhost Python-2.7.2]# make install
By default, Python is installed under /usr/local/lib/python2.7.
If Python was not previously installed on your system, try running it from the command line:
- [root@localhost scrapy]# python
- Python 2.7.2 (default, Dec 5 2011, 22:04:07)
- [GCC 4.1.1 20070105 (Red Hat 4.1.1-52)] on linux2
- Type "help", "copyright", "credits" or "license" for more information.
- >>>
This shows the newly installed Python is ready to use.
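Since zlib was installed first, the new interpreter should have been built with zlib support; a quick compress/decompress round-trip confirms it:

```python
# Round-trip a sample through zlib to confirm the module was built in.
import zlib

data = b"scrapy " * 100
packed = zlib.compress(data)
assert zlib.decompress(packed) == data
print("original: %d bytes, compressed: %d bytes" % (len(data), len(packed)))
```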
If another Python version is already on your system (mine had 2.4), create a symbolic link so that python resolves to the new build:
- [root@localhost python2.7]# mv /usr/bin/python /usr/bin/python.bak
- [root@localhost python2.7]# ln -s /usr/local/bin/python /usr/bin/python
After this, running python picks up the new version.
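To confirm which build the python command now resolves to, sys reports the running interpreter's version and install prefix (a source build as above typically reports /usr/local):

```python
# Print the running interpreter's version and install prefix.
import sys

print("%d.%d.%d" % sys.version_info[:3])
print(sys.prefix)
```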
3. Install setuptools
This step installs a tool for managing Python modules; skip it if it is already installed. If you need it, see the following links:
http://pypi.python.org/pypi/setuptools/0.6c11#installation-instructions
http://pypi.python.org/packages/2.7/s/setuptools/setuptools-0.6c11-py2.7.egg#md5=fe1f997bc722265116870bc7919059ea
Note also that the Python-2.7.2 source tree contains a setup.py script, which builds and installs Python's optional extension modules; run:
- [root@localhost Python-2.7.2]# python setup.py install
After this runs, the modules are installed under /usr/local/lib/python2.7/site-packages.
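You can also ask the interpreter itself where third-party modules go. Note that site.getsitepackages() is present on standard builds but may be absent inside some virtualenvs:

```python
# List the site-packages directories this interpreter searches.
import site

for path in site.getsitepackages():
    print(path)
```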
4. Install zope.interface
Download links:
http://pypi.python.org/pypi/zope.interface/3.8.0
http://pypi.python.org/packages/source/z/zope.interface/zope.interface-3.8.0.tar.gz#md5=8ab837320b4532774c9c89f030d2a389
Install as follows:
- [root@localhost scrapy]# tar -xvzf zope.interface-3.8.0.tar.gz
- [root@localhost scrapy]# cd zope.interface-3.8.0
- [root@localhost zope.interface-3.8.0]# python setup.py build
- [root@localhost zope.interface-3.8.0]# python setup.py install
After installation you should see zope and zope.interface-3.8.0-py2.7.egg-info under /usr/local/lib/python2.7/site-packages.
5. Install Twisted
Download links:
http://twistedmatrix.com/trac/
http://pypi.python.org/packages/source/T/Twisted/Twisted-11.1.0.tar.bz2#md5=972f3497e6e19318c741bf2900ffe31c
Install as follows:
- [root@localhost scrapy]# bzip2 -d Twisted-11.1.0.tar.bz2
- [root@localhost scrapy]# tar -xvf Twisted-11.1.0.tar
- [root@localhost scrapy]# cd Twisted-11.1.0
- [root@localhost Twisted-11.1.0]# python setup.py install
After installation you should see twisted and Twisted-11.1.0-py2.7.egg-info under /usr/local/lib/python2.7/site-packages.
6. Install w3lib
Download links:
http://pypi.python.org/pypi/w3lib
http://pypi.python.org/packages/source/w/w3lib/w3lib-1.0.tar.gz#md5=f28aeb882f27a616e0fc43d01f4dcb21
Install as follows:
- [root@localhost scrapy]# tar -xvzf w3lib-1.0.tar.gz
- [root@localhost scrapy]# cd w3lib-1.0
- [root@localhost w3lib-1.0]# python setup.py install
After installation you should see w3lib and w3lib-1.0-py2.7.egg-info under /usr/local/lib/python2.7/site-packages.
7. Install libxml2
Download links:
http://download.chinaunix.net/download.php?id=28497&ResourceID=6095
http://download.chinaunix.net/down.php?id=28497&ResourceID=6095&site=1
Alternatively, you can find the matching release on http://xmlsoft.org.
Install as follows:
- [root@localhost scrapy]# tar -xvzf libxml2-2.7.4.tar.gz
- [root@localhost scrapy]# cd libxml2-2.7.4
- [root@localhost libxml2-2.7.4]# ./configure
- [root@localhost libxml2-2.7.4]# make
- [root@localhost libxml2-2.7.4]# make install
八、安装pyOpenSSL
该步骤可选,对应的安装包下载地址为:
https://launchpad.net/pyopenssl
若是须要的话,能够选择须要的版本。我这里直接跳过该步骤。
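pyOpenSSL is what Scrapy uses for HTTPS, but Python's bundled ssl module also shows whether OpenSSL was found when the interpreter was built, which is a cheap related sanity check:

```python
# Report the OpenSSL version the interpreter was linked against.
import ssl

print(ssl.OPENSSL_VERSION)
```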
9. Install Scrapy
Download links:
http://scrapy.org/download/
http://pypi.python.org/pypi/Scrapy
http://pypi.python.org/packages/source/S/Scrapy/Scrapy-0.14.0.2841.tar.gz#md5=fe63c5606ca4c0772d937b51869be200
Install as follows:
- [root@localhost scrapy]# tar -xvzf Scrapy-0.14.0.2841.tar.gz
- [root@localhost scrapy]# cd Scrapy-0.14.0.2841
- [root@localhost Scrapy-0.14.0.2841]# python setup.py install
Verifying the installation
With the steps above completed, Scrapy is installed. Verify it from the command line:
- [root@localhost scrapy]# scrapy
- Scrapy 0.14.0.2841 - no active project
-
- Usage:
- scrapy <command> [options] [args]
-
- Available commands:
- fetch Fetch a URL using the Scrapy downloader
- runspider Run a self-contained spider (without creating a project)
- settings Get settings values
- shell Interactive scraping console
- startproject Create new project
- version Print Scrapy version
- view Open URL in browser, as seen by Scrapy
-
- Use "scrapy <command> -h" to see more info about a command
The listing above includes a fetch command, which downloads a given page. Look at its help first:
- [root@localhost scrapy]# scrapy fetch --help
- Usage
- =====
- scrapy fetch [options] <url>
-
- Fetch a URL using the Scrapy downloader and print its content to stdout. You
- may want to use --nolog to disable logging
-
- Options
- =======
- --help, -h show this help message and exit
- --spider=SPIDER use this spider
- --headers print response HTTP headers instead of body
-
- Global Options
- --------------
- --logfile=FILE log file. if omitted stderr will be used
- --loglevel=LEVEL, -L LEVEL
- log level (default: DEBUG)
- --nolog disable logging completely
- --profile=FILE write python cProfile stats to FILE
- --lsprof=FILE write lsprof profiling stats to FILE
- --pidfile=FILE write process ID to FILE
- --set=NAME=VALUE, -s NAME=VALUE
- set/override setting (may be repeated)
Following the usage line, give it a URL and it fetches that page:
- [root@localhost scrapy]# scrapy fetch http://doc.scrapy.org/en/latest/intro/install.html > install.html
- 2011-12-05 23:40:04+0800 [scrapy] INFO: Scrapy 0.14.0.2841 started (bot: scrapybot)
- 2011-12-05 23:40:04+0800 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState
- 2011-12-05 23:40:04+0800 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
- 2011-12-05 23:40:04+0800 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
- 2011-12-05 23:40:04+0800 [scrapy] DEBUG: Enabled item pipelines:
- 2011-12-05 23:40:05+0800 [default] INFO: Spider opened
- 2011-12-05 23:40:05+0800 [default] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
- 2011-12-05 23:40:05+0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
- 2011-12-05 23:40:05+0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
- 2011-12-05 23:40:07+0800 [default] DEBUG: Crawled (200) <GET http://doc.scrapy.org/en/latest/intro/install.html> (referer: None)
- 2011-12-05 23:40:07+0800 [default] INFO: Closing spider (finished)
- 2011-12-05 23:40:07+0800 [default] INFO: Dumping spider stats:
- {'downloader/request_bytes': 227,
- 'downloader/request_count': 1,
- 'downloader/request_method_count/GET': 1,
- 'downloader/response_bytes': 22676,
- 'downloader/response_count': 1,
- 'downloader/response_status_count/200': 1,
- 'finish_reason': 'finished',
- 'finish_time': datetime.datetime(2011, 12, 5, 15, 40, 7, 918833),
- 'scheduler/memory_enqueued': 1,
- 'start_time': datetime.datetime(2011, 12, 5, 15, 40, 5, 5749)}
- 2011-12-05 23:40:07+0800 [default] INFO: Spider closed (finished)
- 2011-12-05 23:40:07+0800 [scrapy] INFO: Dumping global stats:
- {'memusage/max': 17711104, 'memusage/startup': 17711104}
- [root@localhost scrapy]# ll install.html
- -rw-r--r-- 1 root root 22404 Dec 5 23:40 install.html
- [root@localhost scrapy]#
As you can see, we have successfully fetched a web page.
From here, you can follow the official Scrapy tutorial to go further with the framework: http://doc.scrapy.org/en/latest/intro/tutorial.html.