Python数据可视化，完整版实操指南 !

时间 2021-04-08

标签 html git 程序员 github web 编程浏览器 app 机器学习 ide 栏目 Python 繁體版

原文原文链接

你们好，今天让咱们看一下使用Python进行数据可视化的主要库以及可使用它们完成的全部类型的图表。咱们还将看到建议在每种状况下使用哪一个库以及每一个库的独特功能。html

咱们将从最基本的可视化开始，直接查看数据，而后继续绘制图表，最后制做交互式图表。git

数据集

咱们将使用两个数据集来适应本文中显示的可视化效果，数据集可经过下方连接进行下载。程序员

数据集：github.com/albertsl/datgithub

这些数据集都是与人工智能相关的三个术语（数据科学，机器学习和深度学习）在互联网上搜索流行度的数据，从搜索引擎中提取而来。web

该数据集包含了两个文件temporal.csv和mapa.csv。在这个教程中，咱们将更多使用的第一个包括随时间推移（从2004年到2020年）的三个术语的受欢迎程度数据。另外，我添加了一个分类变量（1和0）来演示带有分类变量的图表的功能。编程

mapa.csv文件包含按国家/地区分隔的受欢迎程度数据。在最后的可视化地图时，咱们会用到它。浏览器

Pandas

在介绍更复杂的方法以前，让咱们从可视化数据的最基本方法开始。咱们将只使用熊猫来查看数据并了解其分布方式。app

咱们要作的第一件事是可视化一些示例，查看这些示例包含了哪些列、哪些信息以及如何对值进行编码等等。机器学习

import pandas as pd df = pd.read_csv('temporal.csv') df.head(10) #View first 10 data rowside

使用命令描述，咱们将看到数据如何分布，最大值，最小值，均值……

df.describe（）

使用info命令，咱们将看到每列包含的数据类型。咱们能够发现一列的状况，当使用head命令查看时，该列彷佛是数字的，可是若是咱们查看后续数据，则字符串格式的值将被编码为字符串。

df.info（）

一般状况下，pandas都会限制其显示的行数和列数。这可能让不少程序员感到困扰，由于你们都但愿可以可视化全部数据。

使用这些命令，咱们能够增长限制，而且能够可视化整个数据。对于大型数据集，请谨慎使用此选项，不然可能没法显示它们。

pd.set_option（'display.max_rows'，500） pd.set_option（'display.max_columns'，500） pd.set_option（'display.width'，1000）

使用Pandas样式，咱们能够在查看表格时得到更多信息。首先，咱们定义一个格式字典，以便以清晰的方式显示数字（以必定格式显示必定数量的小数、日期和小时，并使用百分比、货币等）。不要惊慌，这是仅显示而不会更改数据，之后再处理也不会有任何问题。

为了给出每种类型的示例，我添加了货币和百分比符号，即便它们对于此数据没有任何意义。

format_dict = {'data science':'${0:,.2f}', 'Mes':'{:%m-%Y}', 'machine learning':'{:.2%}'} #We make sure that the Month column has datetime format df['Mes'] = pd.to_datetime(df['Mes']) #We apply the style to the visualization df.head().style.format(format_dict)

咱们能够用颜色突出显示最大值和最小值。

format_dict = {'Mes':'{:%m-%Y}'} #Simplified format dictionary with values that do make sense for our data df.head().style.format(format_dict).highlight_max(color='darkgreen').highlight_min(color='#ff0000')

咱们使用颜色渐变来显示数据值。

df.head(10).style.format(format_dict).background_gradient(subset=['data science', 'machine learning'], cmap='BuGn')

咱们也能够用条形显示数据值。

df.head().style.format(format_dict).bar(color='red', subset=['data science', 'deep learning'])

此外，咱们还能够结合以上功能并生成更复杂的可视化效果。

df.head（10）.style.format（format_dict）.background_gradient（subset = ['data science'，'machine learning']，cmap ='BuGn'）。highlight_max（color ='yellow'）

Pandas分析

Pandas分析是一个库，可以使用咱们的数据生成交互式报告，咱们能够看到数据的分布，数据的类型以及可能出现的问题。它很是易于使用，只需三行，咱们就能够生成一个报告，该报告能够发送给任何人，即便您不了解编程也可使用。

from pandas_profiling import ProfileReport prof = ProfileReport(df) prof.to_file(output_file='report.html')

Matplotlib

Matplotlib是用于以图形方式可视化数据的最基本的库。它包含许多咱们能够想到的图形。仅仅由于它是基本的并不意味着它并不强大，咱们将要讨论的许多其余数据可视化库都基于它。

Matplotlib的图表由两个主要部分组成，即轴（界定图表区域的线）和图形（咱们在其中绘制轴，标题和来自轴区域的东西），如今让咱们建立最简单的图：

import matplotlib.pyplot as plt plt.plot(df['Mes'], df['data science'], label='data science') #The parameter label is to indicate the legend. This doesn't mean that it will be shown, we'll have to use another command that I'll explain later.

咱们能够在同一张图中制做多个变量的图，而后进行比较。

plt.plot（df ['Mes']，df ['data science']，label ='data science'） plt.plot（df ['Mes']，df ['machine learning']，label ='machine learning '） plt.plot（df ['Mes']，df ['deep learning']，label ='deep learning'）

每种颜色表明哪一个变量还不是很清楚。咱们将经过添加图例和标题来改进图表。

plt.plot(df['Mes'], df['data science'], label='data science') plt.plot(df['Mes'], df['machine learning'], label='machine learning') plt.plot(df['Mes'], df['deep learning'], label='deep learning') plt.xlabel('Date') plt.ylabel('Popularity') plt.title('Popularity of AI terms by date') plt.grid(True) plt.legend()

若是您是从终端或脚本中使用Python，则在使用咱们上面编写的函数定义图后，请使用plt.show（）。若是您使用的是Jupyter Notebook，则在制做图表以前，将％matplotlib内联添加到文件的开头并运行它。

咱们能够在一个图形中制做多个图形。这对于比较图表或经过单个图像轻松共享几种图表类型的数据很是有用。

fig, axes = plt.subplots(2,2) axes[0, 0].hist(df['data science']) axes[0, 1].scatter(df['Mes'], df['data science']) axes[1, 0].plot(df['Mes'], df['machine learning']) axes[1, 1].plot(df['Mes'], df['deep learning'])

咱们能够为每一个变量的点绘制具备不一样样式的图形：

plt.plot（df ['Mes']，df ['data science']，'r-'） plt.plot（df ['Mes']，df ['data science'] * 2，'bs'） plt .plot（df ['Mes']，df ['data science'] * 3，'g ^'）

如今让咱们看一些使用Matplotlib能够作的不一样图形的例子。咱们从散点图开始：

plt.scatter(df['data science'], df['machine learning'])

条形图示例：

plt.bar（df ['Mes']，df ['machine learning']，width = 20）

直方图示例：

plt.hist（df ['deep learning']，bins = 15）

咱们能够在图形中添加文本，并以与图形中看到的相同的单位指示文本的位置。在文本中，咱们甚至能够按照TeX语言添加特殊字符

咱们还能够添加指向图形上特定点的标记。

plt.plot(df['Mes'], df['data science'], label='data science') plt.plot(df['Mes'], df['machine learning'], label='machine learning') plt.plot(df['Mes'], df['deep learning'], label='deep learning') plt.xlabel('Date') plt.ylabel('Popularity') plt.title('Popularity of AI terms by date') plt.grid(True) plt.text(x='2010-01-01', y=80, s=r'$\lambda=1, r^2=0.8$') #Coordinates use the same units as the graph plt.annotate('Notice something?', xy=('2014-01-01', 30), xytext=('2006-01-01', 50), arrowprops={'facecolor':'red', 'shrink':0.05})

Seaborn

Seaborn是基于Matplotlib的库。基本上，它提供给咱们的是更好的图形和功能，只需一行代码便可制做复杂类型的图形。

咱们导入库并使用sns.set（）初始化图形样式，若是没有此命令，图形将仍然具备与Matplotlib相同的样式。咱们显示了最简单的图形之一，散点图：

import seaborn as sns sns.set() sns.scatterplot(df['Mes'], df['data science'])

咱们能够在同一张图中添加两个以上变量的信息。为此，咱们使用颜色和大小。咱们还根据类别列的值制做了一个不一样的图：

sns.relplot(x='Mes', y='deep learning', hue='data science', size='machine learning', col='categorical', data=df)

Seaborn提供的最受欢迎的图形之一是热图。一般使用它来显示数据集中变量之间的全部相关性：

sns.heatmap（df.corr（），annot = True，fmt ='。2f'）

另外一个最受欢迎的是配对图，它向咱们显示了全部变量之间的关系。若是您有一个大数据集，请谨慎使用此功能，由于它必须显示全部数据点的次数与有列的次数相同，这意味着经过增长数据的维数，处理时间将成倍增长。

sns.pairplot（df）

如今让咱们作一个成对图，显示根据分类变量的值细分的图表。

sns.pairplot（df，hue ='categorical'）

联合图是一个很是有用的图，它使咱们能够查看散点图以及两个变量的直方图，并查看它们的分布方式：

sns.jointplot(x='data science', y='machine learning', data=df)

另外一个有趣的图形是ViolinPlot：

sns.catplot(x='categorical', y='data science', kind='violin', data=df)

咱们能够像使用Matplotlib同样在一个图像中建立多个图形：

fig, axes = plt.subplots(1, 2, sharey=True, figsize=(8, 4)) sns.scatterplot(x="Mes", y="deep learning", hue="categorical", data=df, ax=axes[0]) axes[0].set_title('Deep Learning') sns.scatterplot(x="Mes", y="machine learning", hue="categorical", data=df, ax=axes[1]) axes[1].set_title('Machine Learning')

Bokeh

Bokeh是一个库，可用于生成交互式图形。咱们能够将它们导出到HTML文档中，并与具备Web浏览器的任何人共享。

当咱们有兴趣在图形中查找事物而且但愿可以放大并在图形中移动时，它是一个很是有用的库。或者，当咱们想共享它们并给其余人探索数据的可能性时。

咱们首先导入库并定义将要保存图形的文件：

from bokeh.plotting import figure, output_file, save output_file('data_science_popularity.html')

咱们绘制所需内容并将其保存在文件中：

p = figure(title='data science', x_axis_label='Mes', y_axis_label='data science') p.line(df['Mes'], df['data science'], legend='popularity', line_width=2) save(p)

将多个图形添加到单个文件：

output_file('multiple_graphs.html') s1 = figure(width=250, plot_height=250, title='data science') s1.circle(df['Mes'], df['data science'], size=10, color='navy', alpha=0.5) s2 = figure(width=250, height=250, x_range=s1.x_range, y_range=s1.y_range, title='machine learning') #share both axis range s2.triangle(df['Mes'], df['machine learning'], size=10, color='red', alpha=0.5) s3 = figure(width=250, height=250, x_range=s1.x_range, title='deep learning') #share only one axis range s3.square(df['Mes'], df['deep learning'], size=5, color='green', alpha=0.5) p = gridplot([[s1, s2, s3]]) save(p)

Altair

我认为Altair不会给咱们已经与其余图书馆讨论的内容带来任何新的东西，所以，我将不对其进行深刻讨论。我想提到这个库，由于也许在他们的示例画廊中，咱们能够找到一些能够帮助咱们的特定图形。

Folium

Folium是一项研究，可让咱们绘制地图，标记，也能够在上面绘制数据。Folium让咱们选择地图的提供者，这决定了地图的样式和质量。在本文中，为简单起见，咱们仅将OpenStreetMap视为地图提供者。

使用地图很是复杂，值得一读。在这里，咱们只是看一下基础知识，并用咱们拥有的数据绘制几张地图。

让咱们从基础开始，咱们将绘制一个简单的地图，上面没有任何内容。

import folium m1 = folium.Map(location=[41.38, 2.17], tiles='openstreetmap', zoom_start=18) m1.save('map1.html')

咱们为地图生成一个交互式文件，您能够在其中随意移动和缩放。

咱们能够在地图上添加标记：

m2 = folium.Map(location=[41.38, 2.17], tiles='openstreetmap', zoom_start=16) folium.Marker([41.38, 2.176], popup='<i>You can use whatever HTML code you want</i>', tooltip='click here').add_to(m2) folium.Marker([41.38, 2.174], popup='<b>You can use whatever HTML code you want</b>', tooltip='dont click here').add_to(m2) m2.save('map2.html')

你能够看到交互式地图文件，能够在其中单击标记。

在开头提供的数据集中，咱们有国家名称和人工智能术语的流行度。快速可视化后，您会发现有些国家缺乏这些值之一。咱们将消除这些国家，以使其变得更加容易。而后，咱们将使用Geopandas将国家/地区名称转换为可在地图上绘制的坐标。

from geopandas.tools import geocode df2 = pd.read_csv('mapa.csv') df2.dropna(axis=0, inplace=True) df2['geometry'] = geocode(df2['País'], provider='nominatim')['geometry'] #It may take a while because it downloads a lot of data. df2['Latitude'] = df2['geometry'].apply(lambda l: l.y) df2['Longitude'] = df2['geometry'].apply(lambda l: l.x)

如今，咱们已经按照纬度和经度对数据进行了编码，如今让咱们在地图上进行表示。咱们将从BubbleMap开始，在其中绘制各个国家的圆圈。它们的大小将取决于该术语的受欢迎程度，而颜色将是红色或绿色，具体取决于它们的受欢迎程度是否超过某个值。

m3 = folium.Map(location=[39.326234,-4.838065], tiles='openstreetmap', zoom_start=3) def color_producer(val): if val <= 50: return 'red' else: return 'green' for i in range(0,len(df2)): folium.Circle(location=[df2.iloc[i]['Latitud'], df2.iloc[i]['Longitud']], radius=5000*df2.iloc[i]['data science'], color=color_producer(df2.iloc[i]['data science'])).add_to(m3) m3.save('map3.html')

在什么时候使用哪一个库？

有了各类各样的库，怎么作选择？快速的答案是让你能够轻松制做所需图形的库。

对于项目的初始阶段，使用Pandas和Pandas分析，咱们将进行快速可视化以了解数据。若是须要可视化更多信息，可使用在matplotlib中能够找到的简单图形做为散点图或直方图。

对于项目的高级阶段，咱们能够在主库（Matplotlib，Seaborn，Bokeh，Altair）的图库中搜索咱们喜欢并适合该项目的图形。这些图形可用于在报告中提供信息，制做交互式报告，搜索特定值等。