Scraping and Analyzing Baidu Tieba Data (Part 2): Basic Data Analysis

Date: 2019-12-11

The code for this tutorial is hosted on GitHub: https://github.com/w392807287/spider_baidu_bar

This tutorial uses the 7,405 posts that remained after cleaning 8,000 posts scraped from one Tieba forum.

Post date statistics

Import the libraries:

import numpy as np  # used by the norm() helper defined later
import pandas as pd
from datetime import datetime
import matplotlib.pyplot as plt

Read the CSV file with pandas:

# Specify the column names (the file has no header row)
names = ["帖子id","帖子标题","url","回复数目","发帖日期","open_id","open_type","做者昵称","做者性别","做者等级","等级名称"]
df = pd.read_csv('post_info.csv',names=names)

Prepare the data by converting each post date to a weekday number (0 is Monday, 6 is Sunday):

weekday_list = [datetime.strptime(y,"%Y-%m-%d %H:%M").weekday() for y in df["发帖日期"].values]

Count the posts per weekday (this tutorial calls the counting step "normalization"):

dict_week = {}
for week in weekday_list:
    if week not in dict_week.keys():
        dict_week[week] = 1
    else:
        dict_week[week] += 1
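The same tally can also be sketched with collections.Counter from the standard library. The dates below are made-up sample values in the same format as the 发帖日期 column, not data from the scraped file; indexing by range(7) keeps the counts in Monday-to-Sunday order even when a weekday never occurs.

```python
from collections import Counter
from datetime import datetime

# Made-up sample values in the "%Y-%m-%d %H:%M" format used by the post date column
sample_dates = ["2016-05-02 08:30", "2016-05-03 21:15", "2016-05-03 23:40", "2016-05-07 10:05"]

# Counter does the "increment or initialise" bookkeeping of the loop above
weekday_counts = Counter(
    datetime.strptime(d, "%Y-%m-%d %H:%M").weekday() for d in sample_dates
)

# weekday() returns 0 for Monday through 6 for Sunday
ordered_counts = [weekday_counts.get(day, 0) for day in range(7)]
print(ordered_counts)
```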

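Plot with matplotlib:

# Labels follow datetime.weekday(): 0 is Monday, 6 is Sunday
labels = ["MON","TUE","WED","THU","FRI","SAT","SUN"]
colors = ['red','yellowgreen','lightskyblue','white','red','yellowgreen','lightskyblue']
# Order the counts by weekday so that each slice matches its label
sizes = [dict_week.get(day, 0) for day in range(7)]
plt.pie(sizes,labels=labels,colors=colors,autopct = '%3.1f%%')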
plt.axis('equal')
plt.legend()
plt.show()

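Result:

The chart shows that relatively few posts are made on Saturday and Sunday, while Tuesday through Friday see relatively more. Perhaps classes are boring and people pass the time by posting?

Because the two operations above (counting and plotting) are needed in many places below, they are wrapped into functions here.

Counting: depending on the argument, the function returns a dict, or an array sorted in ascending or descending order.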

def norm(_list,get="dict"):
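    '''
    Count how many times each value occurs in the list.
    :param _list: the list to process
    :param get: return format: "dict"/0 for a dict; "array"/1 or "_array"/-1
                for an array sorted by value (ascending/descending); "array2"/2
                or "_array2"/-2 for an array sorted by count (ascending/descending)
    :return: a dict or a numpy array of (value, count) rows, depending on get
    '''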
    _dict = {}
    for i in _list:
        if i not in _dict.keys():
            _dict[i] = 1
        else:
            _dict[i] += 1
    if get == "dict" or get == 0:
        return _dict
    elif get == "array" or get == 1:
        return np.array(sorted(_dict.items(),key=lambda asd:asd[0]))
    elif get == "_array" or get == -1:
        return np.array(sorted(_dict.items(),key=lambda asd:asd[0],reverse=True))
    elif get == "array2" or get == 2:
        return np.array(sorted(_dict.items(), key=lambda asd: asd[1]))
    elif get == "_array2" or get == -2:
        return np.array(sorted(_dict.items(),key=lambda asd:asd[1],reverse=True))
    else:
        print("Please pass a valid value for get")
        return None
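The four sorted return formats of norm can be illustrated without numpy. The sketch below is not the tutorial's function: count_sorted and its by/reverse parameters are hypothetical names introduced here, and it returns a plain list of (value, count) tuples rather than an array.

```python
from collections import Counter


def count_sorted(values, by="key", reverse=False):
    """Hypothetical helper: count values, then sort the (value, count) pairs.

    by="key" sorts by the value itself, by="count" sorts by how often it occurs.
    """
    index = 0 if by == "key" else 1
    return sorted(Counter(values).items(), key=lambda pair: pair[index], reverse=reverse)


# Made-up author levels, used only to show the orderings
levels = [3, 1, 3, 2, 3, 1]

by_key = count_sorted(levels)                                  # comparable to get="array"
by_count_desc = count_sorted(levels, by="count", reverse=True)  # comparable to get="_array2"
print(by_key)
print(by_count_desc)
```

Keeping the pairs as tuples avoids the type coercion that happens when string keys and integer counts are packed into one numpy array.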

Pie chart helpers:

class dictPie:
    '''
    Draw a pie chart from a dict of value -> count
    '''
    def __init__(self,dict):
        self._dict = dict

    def show(self):
        plt.pie(list(self._dict.values()), labels=list(self._dict.keys()), startangle=90, autopct='%3.1f%%')
        plt.axis('equal')
        plt.legend()
        plt.show()

class arrayPie:
    '''
    Draw a pie chart from an array of (value, count) rows
    '''
    def __init__(self,array):
        self._array = array

    def show(self):
        plt.pie(self._array[:,1], labels=self._array[:,0], startangle=90, autopct='%3.1f%%')
        plt.axis('equal')
        plt.legend()
        plt.show()

def show_pie(thing,type = "dict"):
    shows = dict(dict=dictPie,array=arrayPie)
    return shows[type](thing)

Using the functions above

Pie chart of post years:

year_array = norm([x.split(" ")[0].split("-")[0] for x in df["发帖日期"]],get="array")
show_pie(year_array,type="array").show()

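The scraped data covers roughly the first 8,000 posts listed in this forum, so the post years are concentrated in this year (2016) and last year (2015), which matches expectations. Within these 8,000 records, however, 2012 has noticeably more posts than 2011, 2013, or 2014, which suggests something special about 2012, for example a larger number of featured posts from that year.

Pie chart of post months: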

month_array = norm([x.split(" ")[0].split("-")[1] for x in df["发帖日期"]],get="array")
show_pie(month_array,type="array").show()

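The chart above shows that June, July, August, and September together account for more than half of all posts, while the six months of January, February, March, April, November, and December account for only about a third. The weather is probably the reason: when it gets cold, enthusiasm for posting cools down too. January and February stand out in particular; anyone posting during the Spring Festival must be single, hehe.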
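A claim such as "four months account for more than half" can be checked directly from the counts instead of being read off the pie chart. The sketch below assumes a dict of month -> count like the one norm returns with get="dict"; the numbers are invented for illustration and are not the tutorial's results, and share is a hypothetical helper name.

```python
def share(counts, keys):
    """Return the fraction of all items whose key is in keys."""
    total = sum(counts.values())
    return sum(counts.get(key, 0) for key in keys) / total


# Invented month counts (keys use the same zero-padded strings as the date column)
month_counts = {"01": 2, "02": 1, "06": 5, "07": 6, "08": 4, "09": 5, "10": 3, "12": 2}

summer_share = share(month_counts, ["06", "07", "08", "09"])
print(f"June-September share: {summer_share:.1%}")
```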

Pie chart of post weekdays:

weekday_array = norm([datetime.strptime(y,"%Y-%m-%d %H:%M").weekday() for y in df["发帖日期"].values],get="array")
show_pie(weekday_array,type="array").show()

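Next, which days of the week see more posts? The distribution across the seven days is fairly even, but Saturday and Sunday are slightly lower; after all, people post a bit more when class is boring, hehe.

Pie chart of posting clients: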

open_id_dict = norm(df["open_id"])
show_pie(open_id_dict,type="dict").show()

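As for the posting client, the mobile app (tbclient) unsurprisingly accounts for the vast majority of posts, followed by the web site (tieba) and the WAP site.

Pie chart of posting client types: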

open_type_dict = norm(df["open_type"],get=2)
show_pie(open_type_dict,type="array").show()

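As for client type, Android is unsurprisingly the giant, followed by the web client (nan) and then iPhone users. The iPhone share feels like it should be larger than this; do rich people not browse Tieba? The rest are harder-to-identify clients that make up a small share.

Pie chart of author gender: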

sex_array = norm(df["做者性别"],get=1)[:3]
show_pie(sex_array,type="array").show()

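Here 1 means male, 2 means female, and 3 means the user has hidden their gender. People say there are few women at science and engineering schools; are there even fewer among those browsing Tieba?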
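Numeric codes make the pie chart hard to read, so the codes can be mapped to names before counting. This is a minimal sketch that assumes the code meanings stated above (1 male, 2 female, 3 hidden); SEX_LABELS and label_sex are hypothetical names, and the sample codes are made up.

```python
from collections import Counter

# Code meanings as described in the text; anything else is reported as "unknown"
SEX_LABELS = {1: "male", 2: "female", 3: "hidden"}


def label_sex(code):
    """Translate a numeric gender code into a readable label."""
    return SEX_LABELS.get(code, "unknown")


# Made-up sample codes, not values from the scraped file
sample_codes = [1, 1, 2, 3, 1]
labeled_counts = Counter(label_sex(code) for code in sample_codes)
print(dict(labeled_counts))
```

The resulting dict has the same shape as the output of norm with get="dict", so it could be passed to show_pie with type="dict".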

Pie chart of author levels:

level_array = norm(df["做者等级"],get=1)[1:16]
show_pie(level_array,type="array").show()

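Level 12 looks like a threshold. It seems that level 11 users post a lot because they want to reach level 12 sooner.

Bar chart of the most active posters:

author_array = norm(df["做者昵称"],get=-2)[:10]
plt.xticks(range(10),author_array[:,0])
plt.xlabel("Username")
plt.ylabel("Posts")
# Pass the bar positions positionally; the old left= keyword is no longer accepted by matplotlib 3
plt.bar(range(10),[int(x) for x in author_array[:,1]],color = 'g')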

plt.show()

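Usernames are private, so I did not bother fixing the garbled username labels; the chart looks roughly like this.

Bar chart of reply counts: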

reply_array = norm(df["回复数目"],get=1)[:-50]
print(reply_array)
plt.bar(reply_array[:,0],reply_array[:,1])
plt.show()

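When processing the reply counts, the tail of the array, where each value occurs only a few times, was dropped (the last 50 distinct reply counts). Reply counts are concentrated between 0 and 100, so the forum is not very lively...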
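An alternative to cutting off the tail is to group reply counts into fixed-width ranges, which keeps every post in the chart. The sketch below is not what the tutorial does: bucket and its width parameter are hypothetical, and the sample reply counts are made up.

```python
from collections import Counter


def bucket(reply_count, width=100):
    """Return the label of the fixed-width range that contains reply_count."""
    low = (reply_count // width) * width
    return f"{low}-{low + width - 1}"


# Made-up reply counts, not values from the scraped file
sample_replies = [0, 5, 99, 100, 250, 3]
bucket_counts = Counter(bucket(n) for n in sample_replies)
print(dict(bucket_counts))
```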

Distribution of posts over the 24 hours of the day:

hour_array = norm([x.split(" ")[1].split(":")[0] for x in df["发帖日期"]],get="array")
plt.xlabel("Hour of day")
plt.ylabel("Posts")
#show_pie(hour_array,type="array").show()
plt.bar(hour_array[:,0],[int(x) for x in hour_array[:,1]])
plt.show()

Few posts are made in the early morning hours, but why does the count shoot up late at night? A bunch of singles with no nightlife, hehe.
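If some hour has no posts at all, it is simply missing from the counted result and the bar chart skips it. A minimal sketch of filling in all 24 hours, assuming the same "%Y-%m-%d %H:%M" string format as the 发帖日期 column; the sample dates are made up.

```python
from collections import Counter

# Made-up sample values in the same format as the post date column
sample_dates = ["2016-05-02 00:10", "2016-05-02 23:05", "2016-05-03 23:59", "2016-05-04 08:30"]

# Extract the hour as an integer so that 0..23 sort numerically
hour_counts = Counter(int(d.split(" ")[1].split(":")[0]) for d in sample_dates)

# One entry per hour, with zero for hours that never appear
per_hour = [hour_counts.get(hour, 0) for hour in range(24)]
print(per_hour)
```

The list per_hour can then be plotted against range(24) so that the x-axis always covers the whole day.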

There is also a per-minute distribution of posts, but it is not meaningful.

To be continued...

You are welcome to visit the blog: http://liqiongyu.com/blog
