Python Visualization: Seaborn (Part 1)

Date: 2019-11-16

Tags: python, visualization, seaborn


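Descriptive statistics are indispensable in data analysis and data mining. For example, we need to see how each quantitative variable is distributed; a good visualization of those distributions lays the groundwork for later data modeling.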

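Seaborn is a powerful library for exactly this, and it produces great-looking data visualizations. Here we use the publicly available Lianjia (链家) second-hand housing dataset on Kesci (科赛网) to show how to use Seaborn for distribution visualization.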

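Note: all code in this article can be reproduced in K-Lab, the online collaborative data analysis tool. You can log in to Kesci and practice Seaborn visualization on different datasets.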

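Visualizing the distribution of quantitative variables comes down to two things:

Exploring the distribution of a variable on its own, that is, visualizing univariate distributions;
Exploring whether two variables are related in how they are distributed, that is, visualizing bivariate distributions.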

Seaborn organizes its plot functions along the same workflow.

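Univariate distribution visualization:

distplot --- plots the distribution of a single variable

kdeplot --- fits a kernel density estimate to the distribution of one variable, or to the joint distribution of two variables

rugplot --- draws each data point as a small tick (stick) along the axis

Bivariate distribution visualization:

jointplot --- plots the joint distribution of two variables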

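Loading the data

import pandas as pd

# The CSV is GBK-encoded, so the encoding must be given explicitly
sh = pd.read_csv('sh.csv', encoding='gbk')

To avoid bugs when decoding Chinese characters, the column headers are replaced with English names: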
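The code for this header replacement is not shown in the post. As a minimal sketch of one way to do it, the snippet below renames columns on a toy DataFrame. The Chinese header names and the sample values are hypothetical placeholders; only the target names Dist, Area, Price and Tprice come from the rest of the article.

```python
import pandas as pd

# Toy stand-in for the real file; these Chinese headers and values are hypothetical.
raw = pd.DataFrame(
    {
        "行政区": ["Xuhui", "Pudong"],
        "面积": [88.0, 120.5],
        "单价": [75000, 62000],
        "总价": [660, 747],
    }
)

# Map each original header to the ASCII name used later in the article.
header_map = {"行政区": "Dist", "面积": "Area", "单价": "Price", "总价": "Tprice"}
renamed = raw.rename(columns=header_map)

print(list(renamed.columns))
```

With ASCII column names, attribute-style access such as sh.Dist or sh.Tprice works without any encoding concerns.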

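Importing the plotting packages

import warnings
warnings.filterwarnings("ignore")  # silence library warnings in the notebook

import seaborn as sns
import matplotlib.pyplot as plt

# IPython/Jupyter magic: render figures inline (notebook only)
%matplotlib inline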

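A first look at single-variable visualization

In this dataset, the main quantitative variables are the floor area Area, the price per square meter Price, and the total price Tprice.

Let's first look at how the total price Tprice is distributed in each administrative district of Shanghai, using distplot. Note that by default distplot also overlays the fitted kernel density estimate curve on the histogram.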

dist = sh.Dist.unique()
plt.figure(1, figsize=(16, 30))
with sns.axes_style("ticks"):
    # One panel per district on a 6x3 grid (the original loops over 17 districts)
    for i in range(17):
        temp = sh[sh.Dist == dist[i]]
        plt.subplot(6, 3, i + 1)
        plt.title(dist[i])
        sns.distplot(temp.Tprice)
        plt.xlabel(' ')
plt.show()

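Of course, we can also turn off the kernel density curve and look at the histogram alone. The distplot API provides the kde and rug parameters, which correspond to the kernel density estimate and to rugplot (marking where each data point falls along the axis).

We pull out the data for Xuhui district and set these two parameters; the code below draws the resulting histogram.

temp = sh[sh.Dist == 'Xuhui']
plt.figure(1, figsize=(6, 6))
plt.title('Xuhui')
# kde=False drops the density curve; rug=True marks each data point on the axis
sns.distplot(temp.Tprice, kde=False, bins=20, rug=True)
plt.xlabel(' ')
plt.show()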
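To make the difference between the histogram and the density curve concrete, here is a small numeric sketch of the two quantities. It uses synthetic prices in place of temp.Tprice, and it assumes a Gaussian kernel via scipy.stats.gaussian_kde, which may differ in bandwidth details from what a given seaborn version draws.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
prices = rng.gamma(4.0, 150.0, size=500)  # synthetic stand-in for temp.Tprice

# Histogram with bins=20: plain counts over 20 equal-width bins.
counts, edges = np.histogram(prices, bins=20)

# Kernel density estimate: a smooth curve whose total area is 1.
kde = stats.gaussian_kde(prices)
grid = np.linspace(prices.min(), prices.max(), 200)
density = kde(grid)

# Area under the estimated density over the whole real line.
total_mass = kde.integrate_box_1d(-np.inf, np.inf)

print(counts.sum(), len(edges), total_mass)
```

The histogram bars add up to the number of observations, while the density curve is normalized, which is why a histogram drawn with kde=False shows counts on the y-axis instead of densities.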

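In Seaborn, we can also call kdeplot and rugplot directly.

Now let's look at how the floor area variable Area is distributed in the Xuhui data.

from scipy import stats

plt.figure(1, figsize=(12, 6))
with sns.axes_style("ticks"):
    # Left panel: KDE curve with a rug of the raw data points
    # (newer seaborn releases spell the shading option fill=True)
    plt.subplot(1, 2, 1)
    sns.kdeplot(temp.Area, shade=True)
    sns.rugplot(temp.Area)
    plt.title('Xuhui --- Area Distribution')
    # Right panel: histogram with a fitted gamma distribution
    plt.subplot(1, 2, 2)
    plt.title('Xuhui - Area Distribution fits with gamma distribution')
    sns.distplot(temp.Area, kde=False, fit=stats.gamma)
plt.show()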

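Left: kdeplot and rugplot called separately and overlaid, which shows how flexibly Seaborn plots can be combined.
Right: distplot with the fit parameter set, so that a gamma distribution is fitted to the data.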
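The gamma fit in the right panel can also be reproduced numerically. The sketch below assumes the fit amounts to a maximum-likelihood fit with scipy.stats.gamma.fit followed by evaluating the fitted PDF, and it uses synthetic areas in place of temp.Area.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
area = rng.gamma(5.0, 20.0, size=1000)  # synthetic stand-in for temp.Area

# Maximum-likelihood fit; for the gamma family this returns (shape, loc, scale).
shape, loc, scale = stats.gamma.fit(area)

# Evaluate the fitted density on a grid spanning the data.
grid = np.linspace(area.min(), area.max(), 200)
fitted_pdf = stats.gamma.pdf(grid, shape, loc=loc, scale=scale)

print(shape, loc, scale)
```

Having the fitted parameters in hand is useful when the fit is needed for more than the picture, for example to compare districts by their shape and scale values.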

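Visualizing two variables (pairs)

Having looked at single quantitative variables, let's see whether a pair of variables shows any joint distribution pattern.

Seaborn provides the jointplot function for this. Below we analyze the floor area (Area) and the total price (Tprice) over the whole dataset.

Scatterplot

sns.jointplot(x='Area', y='Tprice', data=sh)
plt.show()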

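We can see that the points are densely clustered where the total price is under 10 million yuan (Tprice < 1000, in units of 10,000 yuan) and the area is under 200 square meters. Let's set a filter, pull out this subset, and redraw the scatterplot.

# Keep only listings under 10 million yuan and under 200 square meters
test = sh[(sh.Tprice < 1000) & (sh.Area < 200)]
with sns.axes_style("white"):
    sns.jointplot(x='Area', y='Tprice', data=test)
plt.show()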
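The filter is a pandas boolean mask: each comparison yields a True/False Series, and & combines them element-wise. A minimal sketch on a toy DataFrame (the values are made up for illustration):

```python
import pandas as pd

# Toy listings; values are invented for illustration.
toy = pd.DataFrame(
    {"Area": [60, 150, 250, 90], "Tprice": [300, 1200, 900, 800]}
)

# Each comparison needs its own parentheses because & binds more
# tightly than < in Python.
mask = (toy.Tprice < 1000) & (toy.Area < 200)
subset = toy[mask]

print(subset)
```

Only the rows where both conditions hold survive; the original index labels are kept, which makes it easy to trace rows back to the full dataset.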

When the dataset is large, a hexbin plot shows more clearly where the data are concentrated.

with sns.axes_style("white"):
    sns.jointplot(x='Tprice', y='Area', data=test, kind='hex')
plt.show()
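A hexbin plot replaces individual points with hexagonal cells colored by how many points fall into each. The sketch below shows the same idea with matplotlib's own hexbin on synthetic data; the gridsize, colormap and the synthetic relationship between area and price are all assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs outside a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-ins for the filtered Area and Tprice columns.
rng = np.random.default_rng(3)
area = rng.normal(90, 25, size=2000)
tprice = 5 * area + rng.normal(0, 80, size=2000)

fig, ax = plt.subplots(figsize=(6, 6))
hexes = ax.hexbin(tprice, area, gridsize=30, cmap="Blues")
counts = hexes.get_array()  # number of points that landed in each hexagon
fig.colorbar(hexes, ax=ax, label="points per hexagon")
ax.set_xlabel("Tprice")
ax.set_ylabel("Area")
```

A smaller gridsize gives larger hexagons and a smoother picture, while a larger one shows finer structure at the cost of noisier counts.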

Of course, we can also use kernel density estimation to visualize the joint distribution.

with sns.axes_style("white"):
    sns.jointplot(x='Area', y='Tprice', data=test, kind='kde')
plt.show()
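The contours in a bivariate KDE plot are level sets of an estimated two-dimensional density. As a rough sketch of what is being estimated, the snippet below evaluates a Gaussian KDE on a grid with scipy.stats.gaussian_kde. The data are synthetic, and the kernel and bandwidth are scipy defaults, which need not match seaborn's.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for the filtered Area and Tprice columns.
rng = np.random.default_rng(4)
area = rng.normal(90, 25, size=1000)
tprice = 5 * area + rng.normal(0, 80, size=1000)

# gaussian_kde expects data with shape (n_dimensions, n_points).
kde = stats.gaussian_kde(np.vstack([area, tprice]))

# Evaluate the density on a 50x50 grid covering the data range.
gx, gy = np.meshgrid(
    np.linspace(area.min(), area.max(), 50),
    np.linspace(tprice.min(), tprice.max(), 50),
)
density = kde(np.vstack([gx.ravel(), gy.ravel()])).reshape(gx.shape)

print(density.shape, density.max())
```

The resulting density array is what a contour plot would be drawn from, for example with matplotlib's contour or contourf.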

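Summary

The neat thing about seaborn is that it visualizes a great deal with very little code, and its API is flexible enough that if you can imagine a plot, you can probably draw it.

This short article does not go far into exploring or interpreting the dataset itself. If you want to dig deeper, log in to Kesci and click "Datasets" (「数据集」) to view it.

This article was originally written by 保一雄, data analyst at Kesci (科赛网).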