Crawling my own blog posts with the Scrapy framework

Date: 2019-11-09


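Scrapy is a fairly simple, easy-to-use crawling framework built on Python. A good Chinese translation of its documentation is at http://scrapy-chs.readthedocs.org/zh_CN/latest/

A few of the more important parts:

items.py: defines the fields you want to save. Each field is declared with Field, and an Item behaves much like a Python dict.

pipelines.py: processes the Items that have been extracted. What the processing does is up to you.

spiders: where you define your own spiders.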
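To make the pipelines.py role concrete, here is a minimal sketch of a pipeline. The class name TitleCleanupPipeline and the sample title text are hypothetical, and a plain dict stands in for the Website item so the snippet runs without a Scrapy project. In a real project the class would also have to be enabled through ITEM_PIPELINES in settings.py.

```python
class TitleCleanupPipeline:
    """Hypothetical pipeline: trims whitespace from the extracted title strings."""

    def process_item(self, item, spider):
        # headTitle is expected to be a list of strings, as .extract() returns
        titles = item.get("headTitle", [])
        item["headTitle"] = [t.strip() for t in titles if t.strip()]
        # Returning the item hands it on to the next pipeline stage
        return item


# A plain dict stands in for a Website item; the title text is made up
item = {
    "headTitle": ["  Example post - huhuuu  \n"],
    "url": "http://www.cnblogs.com/huhuuu/p/3384978.html",
}
cleaned = TitleCleanupPipeline().process_item(item, spider=None)
print(cleaned["headTitle"])
```

Scrapy calls process_item once for every item a spider returns, so cleanup like this can stay out of the spider itself.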

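There are several types of spider:

1) Spider: the most basic spider. The other spiders generally inherit from it. It requests the given URLs and passes each response to a callback, which defaults to the parse method.

2) CrawlSpider: a subclass of Spider and the one used most in practice. It follows and processes pages according to a set of Rules. One caveat: when writing the rules, do not name a callback parse, because CrawlSpider uses the parse method itself to implement its logic, and overriding it breaks the spider. The important part is writing the Rules, which requires analyzing the structure of the specific site.

3) XMLFeedSpider and CSVFeedSpider

In practice: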

In items.py:

from scrapy.item import Item, Field


class Website(Item):

    headTitle = Field()
    description = Field()
    url = Field()

In spider.py:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from dirbot.items import Website
import sys

# Redirect everything printed below into output.txt
sys.stdout = open('output.txt', 'w')

add = 0  # number of blog posts parsed so far


class DmozSpider(CrawlSpider):

    name = "huhu"
    allowed_domains = ["cnblogs.com"]
    start_urls = [
        "http://www.cnblogs.com/huhuuu",
    ]

    rules = (
        # Extract links matching huhuuu/default.html?page=<n> and follow them
        # (with no callback, follow defaults to True)
        Rule(LinkExtractor(allow=(r'huhuuu/default\.html\?page=(\w+)', ))),

        # Extract links matching 'huhuuu/p/' and parse them with the spider's parse_item method
        Rule(LinkExtractor(allow=(r'huhuuu/p/', )), callback='parse_item'),
    )

    def parse_item(self, response):
        global add  # counts the blog posts

        print(add)
        add += 1

        items = []

        item = Website()
        # XPath chosen by inspecting the page's HTML source
        item['headTitle'] = response.xpath('/html/head/title/text()').extract()
        item['url'] = response.url  # store the URL string, not the response object
        print(item)
        items.append(item)
        return items

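Finally, run scrapy crawl huhu from the project directory.

Result:

But my blog has around 400 posts, and the crawl only found about 100. What was going on?

I checked some of the URLs that were found, and most were posts from October 2013 up to the most recent ones, which made no sense. Then I looked at the URLs of the old posts: it turns out old post URLs have a different structure from new ones.

Current: http://www.cnblogs.com/huhuuu/p/3384978.html

Old: http://www.cnblogs.com/huhuuu/archive/2012/04/10/2441060.html

After adding a rule for the old-style pages, the crawl finds every post on the blog that is not password-protected: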

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from dirbot.items import Website
import sys

# Redirect everything printed below into output.txt
sys.stdout = open('output.txt', 'w')

add = 0  # number of blog posts parsed so far


class DmozSpider(CrawlSpider):

    name = "huhu"
    allowed_domains = ["cnblogs.com"]
    start_urls = [
        "http://www.cnblogs.com/huhuuu",
    ]

    rules = (
        # Extract links matching huhuuu/default.html?page=<n> and follow them
        # (with no callback, follow defaults to True)
        Rule(LinkExtractor(allow=(r'huhuuu/default\.html\?page=(\w+)', ))),

        # Extract links matching 'huhuuu/p/' and parse them with the spider's parse_item method
        Rule(LinkExtractor(allow=(r'huhuuu/p/', )), callback='parse_item'),
        # Older posts use the archive/ URL form, so they need their own rule
        Rule(LinkExtractor(allow=(r'huhuuu/archive/', )), callback='parse_item'),
    )

    def parse_item(self, response):
        global add  # counts the blog posts

        print(add)
        add += 1

        items = []

        item = Website()
        # XPath chosen by inspecting the page's HTML source
        item['headTitle'] = response.xpath('/html/head/title/text()').extract()
        item['url'] = response.url  # store the URL string, not the response object
        print(item)
        items.append(item)
        return items
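The effect of the extra archive/ rule can be checked outside Scrapy. The sketch below assumes that each allow pattern is applied as a regular-expression search against the absolute URL. The helper name matched_by is hypothetical, and the two URLs are the ones quoted in this post.

```python
import re

# Patterns from the first version of the spider
original_rules = [r"huhuuu/default\.html\?page=(\w+)", r"huhuuu/p/"]
# The same patterns plus the rule for old-style post URLs
extended_rules = original_rules + [r"huhuuu/archive/"]

new_url = "http://www.cnblogs.com/huhuuu/p/3384978.html"
old_url = "http://www.cnblogs.com/huhuuu/archive/2012/04/10/2441060.html"


def matched_by(patterns, url):
    """Return the patterns that match somewhere in the URL."""
    return [p for p in patterns if re.search(p, url)]


for url in (new_url, old_url):
    print(url)
    print("  original rules:", matched_by(original_rules, url))
    print("  extended rules:", matched_by(extended_rules, url))
```

With the original patterns the old URL matches nothing, which is consistent with only the newer posts being found.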

I also wrote a spider that crawls posts from the cnblogs home page. Only the Rules need to change:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from dirbot.items import Website
import sys

# Redirect everything printed below into output.txt
sys.stdout = open('output.txt', 'w')

add = 0  # number of blog posts parsed so far


class DmozSpider(CrawlSpider):

    name = "huhu"
    allowed_domains = ["cnblogs.com"]
    start_urls = [
        "http://www.cnblogs.com/",
    ]

    rules = (
        # Follow links matching sitehome/p/<number> (no callback, so they are only followed)
        Rule(LinkExtractor(allow=(r'sitehome/p/[0-9]+', ))),

        # Parse any link containing <something>/p/ with parse_item
        Rule(LinkExtractor(allow=(r'[^\s]+/p/', )), callback='parse_item'),
    )

    def parse_item(self, response):
        global add  # counts the blog posts

        print(add)
        add += 1

        items = []

        item = Website()
        # XPath chosen by inspecting the page's HTML source
        item['headTitle'] = response.xpath('/html/head/title/text()').extract()
        item['url'] = response.url  # store the URL string, not the response object
        print(item)
        items.append(item)
        return items
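The order of these two rules matters, because a sitehome/p/ link also matches the broader [^\s]+/p/ pattern. According to the Scrapy documentation, when several rules match the same link the first one defined is used. The sketch below imitates that dispatch with plain regular expressions. The function first_matching_rule is hypothetical, and the sitehome URL is an assumed example of the link form rather than one taken from the crawl.

```python
import re

# (pattern, callback) pairs in the same order as the spider's rules
rules = [
    (r"sitehome/p/[0-9]+", None),      # followed only, no callback
    (r"[^\s]+/p/", "parse_item"),      # parsed as a blog post
]


def first_matching_rule(url):
    """Return the first (pattern, callback) pair whose pattern matches the URL."""
    for pattern, callback in rules:
        if re.search(pattern, url):
            return pattern, callback
    return None


samples = [
    "http://www.cnblogs.com/sitehome/p/2",  # assumed listing-page URL
    "http://www.cnblogs.com/huhuuu/p/3384978.html",
    "http://www.cnblogs.com/huhuuu/archive/2012/04/10/2441060.html",
]
for url in samples:
    print(url, "->", first_matching_rule(url))
```

If the two rules were swapped, the sitehome link would be handed to parse_item and counted as a post. The archive-style URL matches neither pattern, so this spider skips old-style posts, just as the first version did.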


Reference: http://scrapy-chs.readthedocs.org/zh_CN/latest/topics/spiders.html