Scraping Douban Movies with Python's Scrapy Framework and Visualizing the Results

Date: 2020-05-26


1. Introduction to the Scrapy framework

This section mainly covers the spiders, engine, scheduler, downloader, and item pipeline.

Common Scrapy commands are as follows:

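Correspondingly, a Scrapy project consists of the spider file you add yourself plus the files the system generates for items, pipelines, and settings. That is all there is to it.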

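items declares the names of the fields to scrape. pipelines holds the data-flow operations, such as writing to a file or loading into a database. The main spider file sets the domain and the XPath for each field, and fills in the matching information for every page.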

import scrapy


class DoubanItem(scrapy.Item):
    # One Field per attribute scraped from each Top 250 entry.
    movieRank = scrapy.Field()
    movieName = scrapy.Field()
    Director = scrapy.Field()
    movieDesc = scrapy.Field()
    movieRate = scrapy.Field()
    peopleCount = scrapy.Field()
    movieDate = scrapy.Field()
    movieCountry = scrapy.Field()
    movieCategory = scrapy.Field()
    moviePost = scrapy.Field()

import json


class DoubanPipeline(object):
    def __init__(self):
        # Output file: one JSON object per line.
        self.f = open("douban.json", "w", encoding="utf-8")

    def process_item(self, item, spider):
        # ensure_ascii=False keeps Chinese text readable in the file.
        content = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.f.write(content)
        return item

    def close_spider(self, spider):
        self.f.close()
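A pipeline only runs once it is registered in the project's settings. A minimal sketch of the relevant fragment follows. It assumes the project package is named `douban`, which the original does not state, so adjust the dotted path to your own project.

```python
# settings.py (fragment)
# Assumption: the project package is called "douban", so the pipeline class
# is importable as douban.pipelines.DoubanPipeline.
ITEM_PIPELINES = {
    # The integer (conventionally 0-1000) sets the run order; lower runs first.
    "douban.pipelines.DoubanPipeline": 300,
}
```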

While working out the XPath expressions, I recommend a Chrome extension called XPath Helper.

import scrapy

# Assumption: the project package is named "douban"; adjust to your project.
from douban.items import DoubanItem


class DoubanSpider(scrapy.Spider):
    # The class name and spider name are assumed; the original omits them.
    name = "douban"
    allowed_domains = ['douban.com']
    baseURL = "https://movie.douban.com/top250?start="
    offset = 0
    start_urls = [baseURL + str(offset)]

    def parse(self, response):
        node_list = response.xpath("//div[@class='item']")
        for node in node_list:
            item = DoubanItem()
            item['movieName'] = node.xpath("./div[@class='info']/div[1]/a/span/text()").extract()[0]
            item['movieRank'] = node.xpath("./div[@class='pic']/em/text()").extract()[0]
            item['Director'] = node.xpath("./div[@class='info']/div[@class='bd']/p[1]/text()[1]").extract()[0]
            # Some entries have no one-line quote, so fall back to an empty string.
            quote = node.xpath("./div[@class='info']/div[@class='bd']/p[@class='quote']/span[@class='inq']/text()")
            item['movieDesc'] = quote.extract()[0] if quote else ""
            item['movieRate'] = node.xpath("./div[@class='info']/div[@class='bd']/div[@class='star']/span[@class='rating_num']/text()").extract()[0]
            item['peopleCount'] = node.xpath("./div[@class='info']/div[@class='bd']/div[@class='star']/span[4]/text()").extract()[0]
            # Year, country and genre share one text node, separated by "\xa0/\xa0".
            meta = node.xpath("./div[2]/div[2]/p[1]/text()[2]").extract()[0].lstrip().split('\xa0/\xa0')
            item['movieDate'] = meta[0]
            item['movieCountry'] = meta[1]
            item['movieCategory'] = meta[2]
            item['moviePost'] = node.xpath("./div[@class='pic']/a/img/@src").extract()[0]
            yield item

        # 250 entries at 25 per page: the last page starts at offset 225.
        if self.offset < 225:
            self.offset += 25
            url = self.baseURL + str(self.offset)
            yield scrapy.Request(url, callback=self.parse)
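The spider gets the year, country, and genre out of a single text node by splitting on `'\xa0/\xa0'` (non-breaking space, slash, non-breaking space). The sketch below isolates that step so it can be tried without running a crawl. The helper name `split_meta` and the sample string are hypothetical, and the sample only imitates the shape the XPath is expected to return.

```python
def split_meta(raw):
    """Split a 'year / country / genre' text node into its three parts.

    Missing parts come back as empty strings instead of raising IndexError.
    """
    parts = [p.strip() for p in raw.strip().split("\xa0/\xa0")]
    parts += [""] * (3 - len(parts))  # pad rows that have fewer than 3 parts
    return tuple(parts[:3])


# Hypothetical sample, shaped like the text node the spider reads.
sample = "\n    1994\xa0/\xa0美国\xa0/\xa0犯罪 剧情\n   "
print(split_meta(sample))
```

Padding short rows is a design choice for illustration. The spider itself indexes the split result directly, so it would raise on a row with fewer than three parts.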

At this point the crawler basically works and produces the JSON file we need.

Next comes the visualization.

First, let's take stock of the data we have.

import pandas as pd

douban = pd.read_json('douban.json', lines=True, encoding='utf-8')
douban.info()
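`lines=True` is needed because the pipeline writes one JSON object per line rather than a single JSON array. The sketch below reproduces that format using only the standard library. The two records are sample values for illustration, not the scraped output.

```python
import io
import json

# Sample records, shaped like two of the item fields.
records = [
    {"movieRank": "1", "movieName": "肖申克的救赎"},
    {"movieRank": "2", "movieName": "霸王别姬"},
]

# Write the way the pipeline does: one json.dumps() result plus "\n" per item.
buffer = io.StringIO()
for record in records:
    buffer.write(json.dumps(record, ensure_ascii=False) + "\n")

# Read back line by line; the text as a whole is not one valid JSON document.
loaded = [json.loads(line) for line in buffer.getvalue().splitlines() if line]
print(loaded == records)
```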

Broadly, we can analyze the country of origin, the production year, the genre, and the influence of certain directors within the Top 250.

For a quick first look, we can use the value_counts() function.

import pandas as pd

douban = pd.read_json('douban.json', lines=True, encoding='utf-8')
# Use the first character of the field as a rough key for the first-listed country.
df_Country = douban['movieCountry'].str.strip().str[0]
print(df_Country.value_counts())

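American films account for roughly half the list, 122 of 250, which reflects the strength of the Hollywood film industry. Japanese and Hong Kong films likewise hold an important place in China. Surprisingly, the number of films from mainland China is underwhelming; Douban's film fans are still quite picky about domestic films.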

import pandas as pd

douban = pd.read_json('douban.json', lines=True, encoding='utf-8')
# The third character of the year is the decade digit, e.g. "1994" -> "9".
df_Date = douban['movieDate'].astype(str).str.strip().str[2]
print(df_Date.value_counts())

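Films released since 2000 make up more than 70% of the list. Considering that only nine years of the 2010s had passed and that ratings lag behind releases, newer films are on the whole more popular with the audience. This may be related to how Douban selects the Top 250: a film must have at least a certain number of raters.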
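The year tally above groups films by the third character of the year string, which yields bare decade digits. A more readable alternative, sketched here with the standard library and made-up sample years rather than the scraped data, labels each film with its decade. The helper name `decade_label` is hypothetical.

```python
from collections import Counter


def decade_label(year_text):
    """Turn a string that starts with a 4-digit year, e.g. '1994', into '1990s'."""
    year = int(year_text.strip()[:4])
    return f"{year // 10 * 10}s"


# Made-up sample years, not the scraped data.
years = ["1994", "1993", "2010", "2009", "1997", "2014"]
counts = Counter(decade_label(y) for y in years)
print(counts.most_common())
```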

import pandas as pd

douban = pd.read_json('douban.json', lines=True, encoding='utf-8')
# Use the first character of the field as a rough key for the first-listed genre.
df_Cate = douban['movieCategory'].str.strip().str[0]
print(df_Cate.value_counts())

Drama films, with their plot twists and turns, win audience approval more easily.

Below are a few of the visualization charts.

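I am not very good at plotting with Python, so the charts are a bit ugly. In practice I would recommend a library such as ECharts, or Excel or BI software, to produce the charts; these are more convenient and better looking.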
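When handing the tallies to ECharts, the counts usually need to be a list of name/value objects, which is the shape a pie series accepts as its data. A minimal sketch of that conversion follows. The helper name `to_echarts_data` and the counts are made up for illustration and are not the scraped results.

```python
import json


def to_echarts_data(counts):
    """Convert a {label: count} mapping into [{'name': ..., 'value': ...}, ...],
    sorted by count in descending order."""
    ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return [{"name": name, "value": value} for name, value in ordered]


# Made-up counts for illustration only.
sample_counts = {"剧情": 5, "喜剧": 2, "动画": 3}
data = to_echarts_data(sample_counts)
print(json.dumps(data, ensure_ascii=False))
```

The printed JSON can be pasted into the data field of a chart option or saved to a file for the page to load.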

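This is my first attempt at this kind of scraping and visualization, so there are bound to be shortcomings; please point them out.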