Introduction
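Waterfall charts are a very useful tool for plotting certain types of data. Not surprisingly, we can use Pandas and matplotlib to create a repeatable waterfall chart.
Before going any further, I want to be clear about which type of chart I mean. I will be building the 2D waterfall chart described in the Wikipedia article.
A typical use for this kind of chart is to show the + and - values that "bridge" a starting value and an ending value. For this reason, finance people sometimes call it a bridge. Like the other examples I have used before, this type of plot is not easy to generate in Excel. There are certainly ways to do it, but they are not easy to remember.
The key thing to remember about a waterfall chart is that it is essentially a stacked bar chart. The special twist is that it has a blank bottom bar, so the top bar "floats" in the air. So, let's get started.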
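To make the "floating bar" idea concrete before bringing in Pandas, here is a minimal sketch in plain matplotlib. The three values and the names `changes`, `bases` and `running` are made up for illustration and are not the data used in the rest of this article. The `Agg` backend is selected only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed here so no display is needed
import matplotlib.pyplot as plt

# Hypothetical values: a starting amount followed by two changes.
changes = [100, -30, 20]

# Each bar starts where the running total stood before that change.
bases = []
running = 0
for change in changes:
    bases.append(running)
    running += change

fig, ax = plt.subplots()
# `bottom` is the invisible base; the visible bar spans base to base + change.
bars = ax.bar(range(len(changes)), changes, bottom=bases)
```

The second bar starts at 100 and extends down to 70, and the third starts at 70 and rises to 90. The rest of the article computes the same bases with a cumulative sum.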
First, run the standard imports and make sure IPython will display matplotlib plots.
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    %matplotlib inline
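Set up the data we want to plot as a waterfall chart and load it into a DataFrame.
The data needs to begin with your starting value, but you should leave out the final total. We will calculate it below.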
    index = ['sales','returns','credit fees','rebates','late charges','shipping']
    data = {'amount': [350000,-30000,-7500,-25000,95000,-7000]}
    trans = pd.DataFrame(data=data,index=index)
I am using the handy display function in IPython to more easily control what I want to display.
    from IPython.display import display
    display(trans)
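The biggest trick with a waterfall chart is working out what the bottom stacked bar should contain. I learned a lot about this from a discussion on stackoverflow.
First, let's get the cumulative sum.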
    display(trans.amount.cumsum())

    sales           350000
    returns         320000
    credit fees     312500
    rebates         287500
    late charges    382500
    shipping        375500
    Name: amount, dtype: int64
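This looks good, but we need to shift the data by one position, so that each row holds the running total before its own change. That shifted value is where each bar should start.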
    blank = trans.amount.cumsum().shift(1).fillna(0)
    display(blank)
    sales                0
    returns         350000
    credit fees     320000
    rebates         312500
    late charges    287500
    shipping        382500
    Name: amount, dtype: float64
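As a quick sanity check on the shift, the sketch below rebuilds the same data and confirms that each base plus its change lands exactly on the running total. The name `running` is introduced here only for illustration.

```python
import pandas as pd

index = ['sales', 'returns', 'credit fees', 'rebates', 'late charges', 'shipping']
trans = pd.DataFrame({'amount': [350000, -30000, -7500, -25000, 95000, -7000]},
                     index=index)

# Running total after each transaction.
running = trans['amount'].cumsum()

# Base of each bar: the running total before that transaction (0 for the first).
blank = running.shift(1).fillna(0)

# Base plus change should equal the running total for every row.
assert ((blank + trans['amount']) == running).all()
print(blank + trans['amount'])
```

If the assertion holds, every bar ends exactly where the next one begins, which is what makes the chart read as a continuous bridge.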
We need to add a net total to both trans and blank.
    total = trans.sum().amount
    trans.loc["net"] = total
    blank.loc["net"] = total
    display(trans)
    display(blank)
    sales                0
    returns         350000
    credit fees     320000
    rebates         312500
    late charges    287500
    shipping        382500
    net             375500
    Name: amount, dtype: float64
Create the steps we will use to show the changes.
    step = blank.reset_index(drop=True).repeat(3).shift(-1)
    # Use positional indexing: every second entry of each triple becomes a gap
    step.iloc[1::3] = np.nan
    display(step)
    0         0
    0       NaN
    0    350000
    1    350000
    1       NaN
    1    320000
    2    320000
    2       NaN
    2    312500
    3    312500
    3       NaN
    3    287500
    4    287500
    4       NaN
    4    382500
    5    382500
    5       NaN
    5    375500
    6    375500
    6       NaN
    6       NaN
    Name: amount, dtype: float64
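It may not be obvious why this series produces the connector lines. matplotlib breaks a line wherever it meets a NaN, so only consecutive non-NaN points get joined. The sketch below applies the same repeat/shift/NaN steps to three made-up base values (not the article's data) and lists the segments that would actually be drawn. The names `points` and `segments` are illustrative.

```python
import math

import numpy as np
import pandas as pd

# Hypothetical bases for three bars at x positions 0, 1 and 2.
blank = pd.Series([0.0, 100.0, 70.0])

# Same construction as in the article.
step = blank.repeat(3).shift(-1)
step.iloc[1::3] = np.nan

# Pair each x position with its y value.
points = [(int(x), float(y)) for x, y in zip(step.index, step.values)]

# A line segment is drawn only between two neighbouring non-NaN points.
segments = [
    (points[i], points[i + 1])
    for i in range(len(points) - 1)
    if not math.isnan(points[i][1]) and not math.isnan(points[i + 1][1])
]
print(segments)
# [((0, 100.0), (1, 100.0)), ((1, 70.0), (2, 70.0))]
```

Each surviving segment is a horizontal line from one bar to the next, drawn at the level where the next bar starts.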
For the "net" row, we need to make sure the blank value is 0 so that we do not double up the stack.
    blank.loc["net"] = 0
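To see what the chart will draw at this point, the sketch below tabulates where each bar starts and ends. It rebuilds the data from scratch and sets the "net" base straight to 0, skipping the intermediate step where it briefly holds the total. The `spans` table and its column names are assumptions made for illustration.

```python
import pandas as pd

index = ['sales', 'returns', 'credit fees', 'rebates', 'late charges', 'shipping']
trans = pd.DataFrame({'amount': [350000, -30000, -7500, -25000, 95000, -7000]},
                     index=index)
blank = trans['amount'].cumsum().shift(1).fillna(0)

total = trans['amount'].sum()
trans.loc['net'] = total
blank.loc['net'] = 0  # the net bar starts at the axis, not at the running total

# Each bar spans from its base to base + amount.
spans = pd.DataFrame({'bottom': blank, 'top': blank + trans['amount']})
print(spans)
```

Had the "net" base stayed at 375500, the final bar would have run from 375500 up to 751000, which is the doubling mentioned above.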
Then plot it and see what it looks like.
    my_plot = trans.plot(kind='bar', stacked=True, bottom=blank,legend=None, title="2014 Sales Waterfall")
    my_plot.plot(step.index, step.values,'k')
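It looks pretty good, but let's try formatting the Y axis to make it more readable. To do this, we use FuncFormatter and some Python 2.7+ syntax to round away the decimals and add thousands separators to the format.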
    def money(x, pos):
        'The two args are the value and tick position'
        return "${:,.0f}".format(x)

    from matplotlib.ticker import FuncFormatter
    formatter = FuncFormatter(money)
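The formatter can be tried out on its own, without drawing anything. The sketch below calls `money` directly and through `FuncFormatter`. It also shows how a negative value renders and adds `money_signed`, a hypothetical variant that is not part of the original script and moves the minus sign in front of the dollar sign.

```python
from matplotlib.ticker import FuncFormatter


def money(x, pos):
    """Format a tick value as whole dollars with thousands separators."""
    return "${:,.0f}".format(x)


def money_signed(x, pos):
    """Hypothetical variant: put the minus sign before the dollar sign."""
    sign = "-" if x < 0 else ""
    return "{}${:,.0f}".format(sign, abs(x))


formatter = FuncFormatter(money)

print(money(350000, 0))         # $350,000
print(formatter(287500.4, 1))   # $287,500
print(money(-7500, 0))          # $-7,500
print(money_signed(-7500, 0))   # -$7,500
```

The Y axis in this article never goes below zero, so the plain `money` function is enough here. The signed variant would only matter for a chart whose running total dips into negative territory.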
Then put it all together.
    my_plot = trans.plot(kind='bar', stacked=True, bottom=blank,legend=None, title="2014 Sales Waterfall")
    my_plot.plot(step.index, step.values,'k')
    my_plot.set_xlabel("Transaction Types")
    my_plot.yaxis.set_major_formatter(formatter)
Full Script
The basic chart works, but I wanted to add some labels and make a few minor formatting changes. Here is my final script:
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from matplotlib.ticker import FuncFormatter

    #Use python 2.7+ syntax to format currency
    def money(x, pos):
        'The two args are the value and tick position'
        return "${:,.0f}".format(x)
    formatter = FuncFormatter(money)

    #Data to plot. Do not include a total, it will be calculated
    index = ['sales','returns','credit fees','rebates','late charges','shipping']
    data = {'amount': [350000,-30000,-7500,-25000,95000,-7000]}

    #Store data and create a blank series to use for the waterfall
    trans = pd.DataFrame(data=data,index=index)
    blank = trans.amount.cumsum().shift(1).fillna(0)

    #Get the net total number for the final element in the waterfall
    total = trans.sum().amount
    trans.loc["net"] = total
    blank.loc["net"] = total

    #The steps graphically show the levels as well as used for label placement
    step = blank.reset_index(drop=True).repeat(3).shift(-1)
    step.iloc[1::3] = np.nan

    #When plotting the last element, we want to show the full bar,
    #Set the blank to 0
    blank.loc["net"] = 0

    #Plot and label
    my_plot = trans.plot(kind='bar', stacked=True, bottom=blank,legend=None, figsize=(10, 5), title="2014 Sales Waterfall")
    my_plot.plot(step.index, step.values,'k')
    my_plot.set_xlabel("Transaction Types")

    #Format the axis for dollars
    my_plot.yaxis.set_major_formatter(formatter)

    #Get the y-axis position for the labels
    y_height = trans.amount.cumsum().shift(1).fillna(0)

    #Get an offset so labels don't sit right on top of the bar
    #Use the scalar column maximum and avoid shadowing the built-in max()
    max_amount = trans.amount.max()
    neg_offset = max_amount / 25
    pos_offset = max_amount / 50
    plot_offset = int(max_amount / 15)

    #Start label loop
    loop = 0
    for label, row in trans.iterrows():
        # For the last item in the list, we don't want to double count
        if row['amount'] == total:
            y = y_height.iloc[loop]
        else:
            y = y_height.iloc[loop] + row['amount']
        # Determine if we want a neg or pos offset
        if row['amount'] > 0:
            y += pos_offset
        else:
            y -= neg_offset
        my_plot.annotate("{:,.0f}".format(row['amount']),(loop,y),ha="center")
        loop += 1

    #Scale up the y axis so there is room for the labels
    my_plot.set_ylim(0,blank.max()+int(plot_offset))
    #Rotate the labels
    my_plot.set_xticklabels(trans.index,rotation=0)
    my_plot.get_figure().savefig("waterfall.png",dpi=200,bbox_inches='tight')
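Running the script generates a nice chart, saved as waterfall.png.
If you were not familiar with waterfall charts before, hopefully this example shows you how useful they can be. I imagine some people will think that this much scripting for a single chart is a bit much. In some respects, I agree. If you are only making one waterfall chart and will never touch it again, then stick with the Excel approach.
However, what if the chart turns out to be really useful and you now need to replicate it for 100 customers? What would you do next? Using Excel at that point would be a challenge, whereas using the script in this article to create 100 different charts would be fairly easy. Once again, the real value of this process is that it gives you an easily repeatable program when you need to scale the solution.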
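One way to get from a single script to many charts is to wrap the steps in a function and call it once per customer. The sketch below is only an outline under stated assumptions: the function name `waterfall`, the `customers` dictionary and its numbers are all hypothetical, the labels from the full script are left out for brevity, and the `Agg` backend is chosen so it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed so no display is needed
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter


def waterfall(amounts, title):
    """Draw a waterfall chart for a Series of signed amounts and return the Axes."""
    trans = amounts.to_frame("amount")
    blank = trans["amount"].cumsum().shift(1).fillna(0)

    # Add the net total as the final bar.
    total = trans["amount"].sum()
    trans.loc["net"] = total
    blank.loc["net"] = total

    # Connector lines between bars; NaN entries break the line.
    step = blank.reset_index(drop=True).repeat(3).shift(-1)
    step.iloc[1::3] = np.nan

    # The net bar is drawn in full, starting from zero.
    blank.loc["net"] = 0

    ax = trans.plot(kind="bar", stacked=True, bottom=blank,
                    legend=False, title=title)
    ax.plot(step.index, step.values, "k")
    ax.yaxis.set_major_formatter(
        FuncFormatter(lambda x, pos: "${:,.0f}".format(x)))
    return ax


# Hypothetical per-customer data; in practice this would come from a real source.
customers = {
    "Customer A": pd.Series([350000, -30000, -7500],
                            index=["sales", "returns", "credit fees"]),
    "Customer B": pd.Series([120000, -8000, 15000],
                            index=["sales", "returns", "late charges"]),
}

charts = {}
for name, amounts in customers.items():
    ax = waterfall(amounts, title=name + " Waterfall")
    charts[name] = ax
    # To write each chart to disk, one could call for example:
    # ax.get_figure().savefig(name + ".png", dpi=200, bbox_inches="tight")
    plt.close(ax.get_figure())
```

Closing each figure after use keeps memory flat when the loop runs over a long customer list.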
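I really enjoy learning more about Pandas, matplotlib and IPython. I am glad if this approach helps you, and I hope others can learn something from it too and apply this lesson to their daily work.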