Installation
yum install libxslt-devel libffi-devel
pip install Scrapy
Create a project
scrapy startproject tutorial    # "tutorial" is the project name
Define an item (equivalent to one row in a database table)
vi tutorial/items.py
import scrapy

class myItem(scrapy.Item):
    title = scrapy.Field()  # each Field is equivalent to a column of the table
    link = scrapy.Field()
    desc = scrapy.Field()
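Write the spider
import scrapy

class DmozSpider(scrapy.spiders.Spider):  # one of several base spider classes, each with its own crawling strategy
    name = "dmoz"  # required; this is the name passed to "scrapy crawl"
    allowed_domains = ["dmoz.org"]  # optional
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]  # required (unless start_requests() is overridden)

    def parse(self, response):  # called with the downloaded page for each start URL
        # Use the last path segment of the URL as the file name
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)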
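The file name chosen in parse depends on the URL ending in a slash. The sketch below reproduces just that one expression in plain Python, without Scrapy, so the result can be checked for the two start URLs; the helper name filename_for is hypothetical and exists only for this illustration.

```python
def filename_for(url: str) -> str:
    """Return the second-to-last '/'-separated piece of the URL, as parse() does."""
    return url.split("/")[-2]


start_urls = [
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
]

filenames = [filename_for(u) for u in start_urls]
print(filenames)  # ['Books', 'Resources']

# Without the trailing slash the same expression picks the parent segment.
print(filename_for("http://www.dmoz.org/Computers/Programming/Languages/Python/Books"))  # Python
```

So the crawl should leave two files, Books and Resources, in the directory it is run from. If the start URLs were written without a trailing slash, both pages would be saved as Python and the second would overwrite the first.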
Run the crawl
scrapy crawl dmoz
Official Chinese documentation: http://scrapy-chs.readthedocs.org/zh_CN/0.24/ (note: this is not the latest version)
References:
http://www.cnblogs.com/rwxwsblog/p/4572367.html
http://blog.csdn.net/HanTangSongMing/article/details/24454453