Scraping the First Post on Each User's Home Page from Sina Weibo Mobile with Scrapy

Date: 2020-05-22


Hello everyone, this is my first update this month.

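I recently started an internship working on web crawlers. It requires scraping a fairly large amount of data, and I found that crawlers built from my own hand-written functions were too slow. So I came back to Scrapy, which I had studied a little before. This time I practice on Weibo and then use part of the data to generate word clouds.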

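The target is the mobile version of Sina Weibo (https://m.weibo.cn/). The data scraped is the first post on each user's Weibo home page (as shown in the figure below): the text, repost count, comment count, like count and publication time, plus the user name and the user's region (which later makes it possible to analyze which hot topics Weibo users in different regions care about).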

1. Analyzing the pages

Getting the entry URL for a user's posts

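Browsing the site shows that the pages are rendered via Ajax. The post data (https://m.weibo.cn/api/container/getIndex?containerid=102803_ctg1_5088_-_ctg1_5088&openApp=0&since_id=1) is served as JSON, so the plan is to first get each user's URL from the post data (as shown in the figure below) and then scrape the rest from there.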
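As a rough sketch of that first step, the snippet below pulls user ids out of a feed response. The nesting `data -> cards -> mblog -> user -> id` is the one the spider later in this post relies on; the sample payload is hand-written for illustration and is not real Weibo data, and the helper name `extract_user_ids` is hypothetical.

```python
import json


def extract_user_ids(feed_json):
    """Return the user ids found in a feed response, skipping cards without a post."""
    payload = json.loads(feed_json)
    ids = []
    for card in payload.get("data", {}).get("cards", []):
        # Some cards are not posts and carry no "mblog" key.
        user = (card.get("mblog") or {}).get("user") or {}
        if "id" in user:
            ids.append(str(user["id"]))
    return ids


# Hand-written sample that mimics the assumed structure.
sample = json.dumps({
    "data": {"cards": [
        {"mblog": {"user": {"id": 2554757470}}},
        {"card_type": 11},
        {"mblog": {"user": {"id": 123}}},
    ]}
})
print(extract_user_ids(sample))
```

Using `.get()` with defaults means a card without a post is skipped quietly instead of raising a `KeyError`.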

Getting the data of the first post

This page is also rendered via Ajax, so as above it is enough to find the underlying request. The request URL is as follows:

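Taken as is, the URL shows no obvious pattern, but after simplifying it, https://m.weibo.cn/api/container/getIndex?containerid=1076032554757470 turns out to be enough. In containerid=107603(***), the digits in the parentheses are exactly the user's id, so we can use the id to construct the URL.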

Getting the user's region

The user's location is part of their basic profile, as shown in the figure below.

The address is:

Simplifying it in the same way gives https://m.weibo.cn/api/container/getIndex?containerid=230283(***)_-_INFO, where the part in parentheses is the user's id.

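The analysis above shows that the user id is the key to this crawl: once the URLs are built from the id, the rest of the scraping is relatively simple. The programming part follows.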
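The two URL patterns found above can be wrapped in small helpers. This is only a sketch: the function names `posts_url` and `profile_url` are made up for illustration, while the prefixes `107603` and `230283` and the `_-_INFO` suffix come from the simplified URLs in the analysis.

```python
API_BASE = "https://m.weibo.cn/api/container/getIndex"


def posts_url(user_id):
    """URL of the JSON post list for a user (prefix 107603 + user id)."""
    return f"{API_BASE}?containerid=107603{user_id}"


def profile_url(user_id):
    """URL of the JSON basic profile for a user (prefix 230283 + user id + _-_INFO)."""
    return f"{API_BASE}?containerid=230283{user_id}_-_INFO"


# 2554757470 is the user id embedded in the example URL from the analysis.
print(posts_url(2554757470))
print(profile_url(2554757470))
```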

2. Writing the crawler

Note: please credit the source if you repost the code.

First, create the Scrapy project and spider from the command line.

scrapy startproject sinaweibo
scrapy genspider xxx(spider name) xxx(domain)

Define the item fields in items.py

import scrapy


class SinaweiboItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()        # user name
    first_news = scrapy.Field()  # first post
    dates = scrapy.Field()       # publication time
    zhuanzai = scrapy.Field()    # repost count
    comment = scrapy.Field()     # comment count
    agree = scrapy.Field()       # like count
    city = scrapy.Field()        # user's region

Write the spider code

# -*- coding: utf-8 -*-
import json
import re

import scrapy
from sinaweibo.items import SinaweiboItem


class WeibodiyuSpider(scrapy.Spider):
    name = 'weibodiyu'  # spider name
    allowed_domains = ['m.weibo.cn']  # only crawl within this domain
    start_urls = [
        'https://m.weibo.cn/api/container/getIndex?containerid=102803_ctg1_4188_-_ctg1_4188&openApp=0&since_id=1'
    ]

    def parse1(self, response):
        """Read the user's post list and fill in the fields of the first post."""
        infos = json.loads(response.body)  # decode the JSON response
        item = response.meta['item']  # item passed in through meta
        city = response.meta['city']  # region passed in through meta
        try:
            mblog = infos["data"]["cards"][0]["mblog"]
            name = mblog["user"]["screen_name"]  # user name
            # keep only the runs of Chinese characters, dropping HTML and other noise
            first_news = re.findall('([\u4e00-\u9fa5]+)', str(mblog["text"]), re.S)
            dates = mblog["created_at"]  # publication time
            zhuanzai = mblog["reposts_count"]  # repost count
            comment = mblog["comments_count"]  # comment count
            agree = mblog["attitudes_count"]  # like count
            # assign the data to the item
            item['name'] = name
            item['first_news'] = first_news
            item['dates'] = dates
            item['zhuanzai'] = zhuanzai
            item['comment'] = comment
            item['agree'] = agree
            item['city'] = city
            return item
        except (IndexError, KeyError):  # a tuple is needed to catch both exception types
            pass

    def parse2(self, response):
        """Read the user's profile and extract the region."""
        infos = json.loads(response.body)
        try:
            item = response.meta['item']  # item passed in through meta
            city_cont = str(infos["data"]["cards"][1]["card_group"])
            city = re.findall('card_type.*?所在地.*?item.*?:(.*?)}]', city_cont, re.S)[0].replace('\'', '').replace(' ', '')  # region
            item['city'] = city
            ids = response.meta['ids']  # user id passed in through meta
            n_url1 = 'https://m.weibo.cn/api/container/getIndex?containerid=107603' + ids
            # next step: request the post list and handle it in parse1
            yield scrapy.Request(n_url1, meta={'item': item, 'city': city}, callback=self.parse1)
        except (IndexError, KeyError):
            pass

    def parse(self, response):
        datas = json.loads(response.body)
        for i in range(0, 20):
            try:
                ids = str(datas["data"]["cards"][i]["mblog"]["user"]["id"])  # user id
                n_url2 = 'https://m.weibo.cn/api/container/getIndex?containerid=230283{}_-_INFO'.format(ids)
                item = SinaweiboItem()  # a fresh item per user, so requests do not share one object
                yield scrapy.Request(n_url2, meta={'item': item, 'ids': ids}, callback=self.parse2)  # continue in parse2
            except (IndexError, KeyError):
                pass
        social_urls = [
            'https://m.weibo.cn/api/container/getIndex?containerid=102803_ctg1_4188_-_ctg1_4188&openApp=0&since_id={}'.format(
                str(i)) for i in range(2, 100)]
        celebritys_urls = [
            'https://m.weibo.cn/api/container/getIndex?containerid=102803_ctg1_4288_-_ctg1_4288&openApp=0&since_id={}'.format(
                str(j)) for j in range(1, 100)]
        hots_urls = [
            'https://m.weibo.cn/api/container/getIndex?containerid=102803&openApp=0&since_id={}'.format(str(t))
            for t in range(1, 100)]
        urls = celebritys_urls + social_urls + hots_urls  # entry URLs
        # parse() yields these URLs on every call; Scrapy's duplicate filter drops the repeats
        for url in urls:
            yield scrapy.Request(url, callback=self.parse)
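The regular expression in parse2 works on the string form of `card_group`, which is somewhat fragile. A possible alternative, sketched below, walks the list directly. It assumes each entry is a dict with `item_name` and `item_content` keys; those key names are inferred from the regex pattern, not confirmed against the live API, and the helper `find_location` and the sample data are made up for illustration.

```python
def find_location(card_group):
    """Return the value of the '所在地' (location) entry with spaces removed, or None."""
    for entry in card_group:
        if entry.get("item_name") == "所在地":
            return entry.get("item_content", "").replace(" ", "")
    return None


# Hand-written sample; field names are an assumption.
sample_group = [
    {"card_type": 41, "item_name": "昵称", "item_content": "example_user"},
    {"card_type": 41, "item_name": "所在地", "item_content": "北京 海淀区"},
]
print(find_location(sample_group))
```

Returning `None` when no location entry exists lets the caller decide whether to skip the user, rather than relying on an `IndexError` from an empty `re.findall` result.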

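Pay attention to the meta argument of scrapy.Request here. It is a dictionary used to pass data from one callback to the next. As the code above shows, to use the user id scraped in parse() inside parse2(), the id has to be put into meta when the request is created. I won't explain it further; my own understanding is still at the feeling-my-way stage, so readers who want more depth can look it up themselves.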

Configure the crawler in settings.py

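This time I only export the data to a CSV file, which makes it easy to filter later for the word clouds. If you scrape a larger amount of data, you can store it in a database instead.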

BOT_NAME = 'sinaweibo'

SPIDER_MODULES = ['sinaweibo.spiders']
NEWSPIDER_MODULE = 'sinaweibo.spiders'

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'  # User-Agent request header
DOWNLOAD_DELAY = 0.5  # wait 0.5 s between requests
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'sinaweibo (+http://www.yourdomain.com)'
FEED_URI = 'file:C:/Users/lenovo/Desktop/weibo.csv'  # output file location
FEED_FORMAT = 'csv'  # output format
ITEM_PIPELINES = {'sinaweibo.pipelines.SinaweiboPipeline': 300}  # pipeline settings
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
FEED_EXPORT_ENCODING = 'UTF8'  # output encoding
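On newer Scrapy versions (2.1 and later), `FEED_URI` and `FEED_FORMAT` are deprecated in favor of a single `FEEDS` dictionary. A roughly equivalent fragment might look like the following; the relative path `weibo.csv` is a placeholder chosen for illustration, not the path used in this post.

```python
# settings.py fragment, assuming Scrapy 2.1 or later.
# "weibo.csv" is a placeholder output path relative to where the crawl is started.
FEEDS = {
    "weibo.csv": {
        "format": "csv",     # same role as FEED_FORMAT
        "encoding": "utf8",  # per-feed counterpart of FEED_EXPORT_ENCODING
    },
}
```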

Since no images or other content are downloaded this time, pipelines.py is left with no custom code. Part of the scraped data is shown below:

That concludes the crawler part. The scraped content is fairly simple; what follows uses part of the data to generate word clouds.

Creating the word clouds

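I created a new file named weibo_analysis.py in the project and used the jieba library for word segmentation. Before that, the required data has to be extracted, which pandas handles easily.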

This part of the program is simple, so here is the code without further comment:

import pandas as pd
import jieba.analyse


def get_ciyun(texts):  # extract keywords and print them with their weights
    tags = jieba.analyse.extract_tags(str(texts), topK=100, withWeight=True)
    for item in tags:
        print(item[0] + '\t' + str(int(item[1] * 1000)))


need_citys = ['北京', '上海', '湖南', '四川', '广东']
beijing = []
shanghai = []
hunan = []
sichuan = []
gd = []
pd.set_option('expand_frame_repr', True)  # allow wrapped display
pd.set_option('display.max_rows', None)  # show all rows
pd.set_option('display.max_columns', None)  # show all columns
df = pd.read_csv(r'C:\Users\lenovo\Desktop\weibo.csv')  # read the file into a DataFrame

contents = df['first_news']  # post text
city = df['city']  # region
for i in range(len(city)):
    if need_citys[0] in city[i]:  # check the region and store the post
        beijing.append(contents[i])
    elif need_citys[1] in city[i]:
        shanghai.append(contents[i])
    elif need_citys[2] in city[i]:
        hunan.append(contents[i])
    elif need_citys[3] in city[i]:
        sichuan.append(contents[i])
    elif need_citys[4] in city[i]:
        gd.append(contents[i])
    else:
        pass

# output
get_ciyun(beijing)
print('-' * 20)
get_ciyun(shanghai)
print('-' * 20)
get_ciyun(hunan)
print('-' * 20)
get_ciyun(sichuan)
print('-' * 20)
get_ciyun(gd)
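The chain of `elif` branches above could also be written as a dictionary keyed by region, which scales better if more regions are added. The sketch below uses only the standard library and hand-written sample rows; the helper name `group_by_region` is hypothetical.

```python
from collections import defaultdict

REGIONS = ["北京", "上海", "湖南", "四川", "广东"]


def group_by_region(rows, regions=REGIONS):
    """Group post texts by the first region name contained in each row's city."""
    grouped = defaultdict(list)
    for city, text in rows:
        for region in regions:
            if region in str(city):  # str() guards against missing (NaN) values
                grouped[region].append(text)
                break
    return dict(grouped)


# Hand-written sample rows of (city, post text).
rows = [("北京海淀区", "post A"), ("广东广州", "post B"), ("海外", "post C"), ("北京", "post D")]
print(group_by_region(rows))
```

With the DataFrame from the script above, the rows could be supplied as `zip(df['city'], df['first_news'])`, and each resulting list passed to `get_ciyun`.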

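The word clouds were made with the Tagul website: import the keyword weights printed above, choose the cloud shape, font (if the font does not support Chinese, you can import a Chinese font pack yourself), colors and so on, then click visualize to generate the cloud. It is very convenient.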

Below are the word cloud images I generated this time:

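Using Scrapy really does improve crawling efficiency considerably. I also ran into many problems along the way: I don't know the framework deeply enough, there are many methods I can't use or don't know about, and my understanding of Python's object-oriented features is also lacking, so I need to keep studying. It also shows that I am still just a beginner.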

If you have questions or suggestions, corrections are welcome.

Original source: https://www.cnblogs.com/berryguotoshare/p/10852404.html