I stumbled across Scrapy, a web scraping framework, and couldn't resist trying it out. The attempt cost me another whole evening, and the result was hard-won, so I'm writing it down in case I need it again. I used Scrapy to scrape the article information of my CSDN blog; the whole process follows.

First install Scrapy:
pip install Scrapy
Open a command line, change to the directory where you want to store the project, and run:
scrapy startproject article
The generated project structure is as follows:
scrapy.cfg: the project's configuration file
article/: the project's Python module; code will be imported from here
article/items.py: the project's items file
article/pipelines.py: the project's pipelines file
article/settings.py: the project's settings file
article/spiders/: the directory that holds the spiders
Then enter the project directory and run the command below to create the spider file. Of course, you can also create the file by hand if you are already familiar with Scrapy.
cd article
scrapy genspider csdn csdn.net
After the command finishes, two new files appear under article/spiders:
__init__.py
csdn.py
The content of csdn.py is:
import scrapy


class CsdnSpider(scrapy.Spider):
    name = 'csdn'
    allowed_domains = ['csdn.net']
    start_urls = ['http://csdn.net/']

    def parse(self, response):
        pass
This generated code defines three class attributes (name, allowed_domains, start_urls) and one method (parse); none of these names may be changed, because Scrapy looks them up by name.
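To make their roles concrete, here is a hypothetical spider (the names and URLs are made up and it is not part of this project) showing what each of them does:

import scrapy


class DemoSpider(scrapy.Spider):
    name = 'demo'                         # used on the command line: scrapy crawl demo
    allowed_domains = ['example.com']     # requests to other domains are filtered out
    start_urls = ['http://example.com/']  # the crawl starts from these URLs

    def parse(self, response):
        # called with each downloaded page; yield items and/or new requests
        yield {'page_title': response.xpath('//title/text()').get()}
        # yield scrapy.Request(next_page_url, callback=self.parse)  # to follow links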
Define the items according to the content you want to crawl; add the following to article/items.py:
import scrapy


class ArticleItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()
    date = scrapy.Field()
    comment = scrapy.Field()
    readers = scrapy.Field()
    link = scrapy.Field()
These are the fields I want to crawl: the article title, the article link, the creation date, the number of comments, and the number of reads.
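An Item behaves like a dict whose keys are restricted to the declared fields; here is a quick sketch of how one gets filled in (the values are made up for illustration):

from article.items import ArticleItem

item = ArticleItem()
item['name'] = ['my first post']               # each field is stored as a one-element list in this project
item['link'] = ['https://example.com/post/1']
print(dict(item))                              # {'name': ['my first post'], 'link': ['https://example.com/post/1']}
# assigning an undeclared key such as item['author'] raises KeyError, which catches typos early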
Then fill in article/spiders/csdn.py:
# -*- coding: utf-8 -*-
import re

import scrapy

from article.items import ArticleItem


class CsdnSpider(scrapy.Spider):
    name = 'csdn'
    allowed_domains = ['csdn.net']
    start_urls = ['https://blog.csdn.net/xtfge0915']

    def parse(self, response):
        # each article on the list page sits in a div with class "article-item-box"
        articles = response.xpath('//div[contains(@class,"article-item-box")]')
        if len(articles) == 0:
            # an empty page means we have run past the last page of the list
            return
        for arc in articles:
            item = ArticleItem()
            item['date'] = [arc.xpath('./div[contains(@class,"info-box")]/p[1]/span/text()').extract()[0]]
            item['readers'] = arc.xpath('./div[contains(@class,"info-box")]/p[2]/span/text()').extract()[0]
            item['comment'] = arc.xpath('./div[contains(@class,"info-box")]/p[3]/span/text()').extract()[0]
            item['link'] = [arc.xpath('./h4/a/@href').extract()[0]]
            # the title's <a> tag contains an extra <span> label, so take the raw HTML
            # and strip everything up to the closing </span>
            item['name'] = arc.xpath('./h4/a').extract()[0]
            item['name'] = [re.findall(r'.*</span>\s*(.*)\s*</a>.*', item['name'])[0]]
            # the read/comment counts come as "label:number"; keep only the number
            item['readers'] = [item['readers'].split(':')[1]]
            item['comment'] = [item['comment'].split(':')[1]]
            yield item
        # request the next page of the article list; the page number is carried in
        # request.meta because parse() is called once per response
        page_index = response.meta.get('page_index', 1) + 1
        yield scrapy.Request(self.start_urls[0] + "/article/list/%d" % page_index,
                             callback=self.parse,
                             meta={'page_index': page_index})
For XPath syntax, see https://blog.csdn.net/xtfge0915/article/details/83840786
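If you want to experiment with the selectors before writing them into the spider, scrapy shell is handy; for example (what comes back depends on CSDN's current page markup):

scrapy shell https://blog.csdn.net/xtfge0915
>>> # each article entry on the list page
>>> response.xpath('//div[contains(@class,"article-item-box")]')
>>> # the links of all articles on this page
>>> response.xpath('//div[contains(@class,"article-item-box")]/h4/a/@href').extract()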
Next, write the item pipeline, i.e. the article/pipelines.py file:
# -*- coding: utf-8 -*-
import pandas as pd


class ArticlePipeline(object):

    def open_spider(self, spider):
        # called once when the spider starts: prepare the output path and the buffer
        self.filename = "E:\\list.xlsx"
        self.articles = {}

    def process_item(self, item, spider):
        # collect every field into a dict of lists, one list per column
        for key in item.keys():
            if key in self.articles.keys():
                self.articles[key].append(item[key][0])
            else:
                self.articles[key] = item[key]
        return item

    def close_spider(self, spider):
        # called once when the spider finishes: dump everything to Excel
        self.save_to_excel()

    def save_to_excel(self):
        df = pd.DataFrame(self.articles)
        df.to_excel(self.filename)
I save the list of blog articles to an Excel file.
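Note that DataFrame.to_excel needs an Excel writer engine installed (e.g. pip install openpyxl for .xlsx files). If you would rather avoid that dependency, a CSV works just as well; here is a sketch of the alternative, assuming the same self.articles dict (the path is only an example):

    def save_to_csv(self):
        df = pd.DataFrame(self.articles)
        # utf-8-sig keeps Chinese titles readable when the file is opened in Excel
        df.to_csv("E:\\list.csv", index=False, encoding='utf-8-sig')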
Open settings.py and add:
ITEM_PIPELINES = {
    # the integer (0-1000 by convention) is the pipeline's priority; lower runs earlier
    'article.pipelines.ArticlePipeline': 300
}
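With the pipeline registered, run the spider from the project root; the name after crawl is the name attribute defined in csdn.py:

scrapy crawl csdn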
The final result is an Excel file (E:\list.xlsx) containing the list of my blog articles.