Learning anything is more efficient when you have a concrete goal. Here, my goal is to scrape OSChina's blog list: http://www.oschina.net/blog
For a crawler like this that needs to follow URLs, Scrapy's CrawlSpider is very convenient; see the official CrawlSpider documentation for details.
Environment: Python 2.7.10, Scrapy 1.1.1
First, create the project:
scrapy startproject blogscrapy
This generates the following directory structure:
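The generated layout looks roughly like this (a sketch; the exact set of files can vary slightly between Scrapy versions):

blogscrapy/
    scrapy.cfg            # deploy configuration file
    blogscrapy/           # the project's Python module
        __init__.py
        items.py          # item definitions
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # spiders go here
            __init__.py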
Edit items.py and create a BlogScrapyItem to store the blog information:
import scrapy


class BlogScrapyItem(scrapy.Item):
    title = scrapy.Field()
    content = scrapy.Field()
    url = scrapy.Field()
Create a new spider file blog.py under the spiders directory and write the crawling logic:
# coding=utf-8
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader
from scrapy.spiders import Request, CrawlSpider, Rule

from blogscrapy.items import BlogScrapyItem


class WendaSpider(CrawlSpider):
    # unique identifier of this spider
    name = 'oschina'
    # allowed domains
    allowed_domains = ['oschina.net']
    # seed URLs
    start_urls = [
        'http://www.oschina.net/blog',
    ]
    rules = (
        # extract blog-detail URLs and pass them to parse_page;
        # with follow=False, links found on those pages are not crawled further
        Rule(LinkExtractor(allow=('my\.oschina\.net/.+/blog/\d+$',)),
             callback='parse_page', follow=False,),
    )

    # parse a blog detail page
    def parse_page(self, response):
        loader = ItemLoader(BlogScrapyItem(), response=response)
        loader.add_xpath('title', "//div[@class='title']/text()")
        loader.add_xpath('content', "//div[@class='BlogContent']")
        loader.add_value('url', response.url)
        return loader.load_item()
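Before running the full crawl, the two XPath expressions can be sanity-checked interactively with scrapy shell. The URL below is only a placeholder for any blog detail page:

scrapy shell "http://my.oschina.net/<user>/blog/<id>"
# inside the shell prompt:
response.xpath("//div[@class='title']/text()").extract()
response.xpath("//div[@class='BlogContent']").extract()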
Running scrapy crawl oschina -o blogs.json from the project directory generates a blogs.json file in the current directory containing the scraped BlogScrapyItem data as JSON.
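Each record in blogs.json will look roughly like the following (the values are illustrative, not real output). Note that ItemLoader returns every field as a list by default, and title/content still carry surrounding whitespace and HTML tags:

{
    "title": ["\n    Some post title\n  "],
    "content": ["<div class=\"BlogContent\"> ... raw HTML ... </div>"],
    "url": ["http://my.oschina.net/<user>/blog/<id>"]
}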
Since the scraped data is not in a clean format, you can add an input_processor to the item's fields to process values as they are loaded. items.py:
import scrapy
from scrapy.loader.processors import MapCompose
from w3lib.html import remove_tags


def filter_title(value):
    return value.strip()


class BlogScrapyItem(scrapy.Item):
    title = scrapy.Field(input_processor=MapCompose(remove_tags, filter_title))
    content = scrapy.Field(input_processor=MapCompose(remove_tags, filter_title))
    url = scrapy.Field(input_processor=MapCompose(remove_tags, filter_title))
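The input processors clean each extracted value, but ItemLoader still wraps every field in a list. If single values are preferred in the output, Scrapy's TakeFirst output processor can be added as well; this is an optional extension of the example above, not part of the original code:

import scrapy
from scrapy.loader.processors import MapCompose, TakeFirst
from w3lib.html import remove_tags


def filter_title(value):
    return value.strip()


class BlogScrapyItem(scrapy.Item):
    # remove_tags strips HTML tags, filter_title trims whitespace,
    # TakeFirst keeps only the first extracted value instead of a list
    title = scrapy.Field(input_processor=MapCompose(remove_tags, filter_title),
                         output_processor=TakeFirst())
    content = scrapy.Field(input_processor=MapCompose(remove_tags, filter_title),
                           output_processor=TakeFirst())
    url = scrapy.Field(input_processor=MapCompose(remove_tags, filter_title),
                       output_processor=TakeFirst())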
GitHub: https://github.com/chenglp1215/scrapy_demo/tree/master/blogscrapy