How to Create a Waterfall Chart in Python

Date: 2019-12-11

Introduction

Waterfall charts are a very useful tool for plotting certain kinds of data. Not surprisingly, we can use Pandas and matplotlib to create a repeatable waterfall chart.

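Before going any further, I want to make clear which type of chart I am referring to. I will be building the 2D waterfall chart described in the Wikipedia article.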

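A typical use for this kind of chart is to show the + and - values that "bridge" a start value and an end value. For this reason, finance people sometimes call it a bridge. As with the other examples I have used before, this type of plot is not easy to generate in Excel. There are certainly ways to do it, but they are not easy to remember.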

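The key thing to keep in mind about a waterfall chart is that at its heart it is a stacked bar chart. The special twist is that it has a blank bottom bar, so the top bar "floats" in space. Let's get started.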
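The floating-bar idea can be sketched on its own before any Pandas is involved. The snippet below is a minimal illustration, not the code used in the rest of this article: the labels and amounts are made up, and it calls matplotlib's `bar` directly with its `bottom` argument so each visible bar starts where the running total stood before it.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical values: a starting amount followed by two changes.
labels = ["start", "decrease", "increase"]
amounts = [100, -30, 20]

# Each bar starts where the running total stood before that item.
bottoms = []
running = 0
for amount in amounts:
    bottoms.append(running)
    running += amount

fig, ax = plt.subplots()
# The blank base is never drawn; passing it as `bottom` is enough
# to make the visible bars appear to float.
bars = ax.bar(labels, amounts, bottom=bottoms)
print(bottoms)  # [0, 100, 70]
```

The rest of the article computes the same `bottoms` values with `cumsum()` and `shift()` instead of a loop.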

Creating the Chart

First, execute the standard imports and make sure IPython will display matplotlib plots.

 
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt

    %matplotlib inline

Set up the data we want to chart as a waterfall and load it into a DataFrame.

The data needs to begin with your starting value, but you should leave out the final total. We will calculate it below.

 
   
    
      
      
    index = ['sales', 'returns', 'credit fees', 'rebates', 'late charges', 'shipping']
    data = {'amount': [350000, -30000, -7500, -25000, 95000, -7000]}
    trans = pd.DataFrame(data=data, index=index)

I am using the handy display function in IPython to make it easier to control what I want to show.

 
    from IPython.display import display
    display(trans)

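The biggest trick with a waterfall chart is working out what the bottom stacked bar should contain. I learned a lot about this from a discussion on Stack Overflow.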

First, we get the cumulative sum.

 
    display(trans.amount.cumsum())

    sales           350000
    returns         320000
    credit fees     312500
    rebates         287500
    late charges    382500
    shipping        375500
    Name: amount, dtype: int64

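This looks good, but we need to shift the data one place to the right, so that each bar starts at the running total of the items before it.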

 
    blank = trans.amount.cumsum().shift(1).fillna(0)
    display(blank)

 
    sales                0
    returns         350000
    credit fees     320000
    rebates         312500
    late charges    287500
    shipping        382500
    Name: amount, dtype: float64
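One way to convince yourself the shift is right is to check that each blank base plus the bar's own amount lands exactly on the running total. The following is a small self-contained check using the same numbers as above; the names `running_total` and `tops` are introduced only for this illustration.

```python
import pandas as pd

amounts = pd.Series(
    [350000, -30000, -7500, -25000, 95000, -7000],
    index=["sales", "returns", "credit fees", "rebates", "late charges", "shipping"],
    name="amount",
)

running_total = amounts.cumsum()
# Shifting by one makes every bar start at the total of the items before it;
# the first bar has nothing before it, so its base is filled with 0.
blank = running_total.shift(1).fillna(0)

# Base plus the bar's own height should land on the running total.
tops = blank + amounts
print((tops == running_total).all())  # True
```

A negative amount gives a bar that hangs down from its base, which is why returns starts at 350000 and ends at 320000.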

We need to add a net total amount to the trans and blank data frames.

 
    total = trans.sum().amount
    trans.loc["net"] = total
    blank.loc["net"] = total
    display(trans)
    display(blank)

 
    sales                0
    returns         350000
    credit fees     320000
    rebates         312500
    late charges    287500
    shipping        382500
    net             375500
    Name: amount, dtype: float64

Create the steps we use to show the changes.

 
    step = blank.reset_index(drop=True).repeat(3).shift(-1)
    step[1::3] = np.nan
    display(step)

 
    0         0
    0       NaN
    0    350000
    1    350000
    1       NaN
    1    320000
    2    320000
    2       NaN
    2    312500
    3    312500
    3       NaN
    3    287500
    4    287500
    4       NaN
    4    382500
    5    382500
    5       NaN
    5    375500
    6    375500
    6       NaN
    6       NaN
    Name: amount, dtype: float64
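The repeat-and-shift trick is compact but hard to read at first glance. Plotted against its index, the series above draws one horizontal line from each bar to the next at the running total, with the NaN values breaking the line between segments. As a rough equivalent (an alternative sketch, not the approach used in the rest of this article), the same connector segments can be built explicitly; the names `levels`, `xs`, and `ys` are made up for this illustration.

```python
import numpy as np
import pandas as pd

# Running totals from the example data; each one is the level
# at which the next bar starts.
levels = pd.Series([350000, -30000, -7500, -25000, 95000, -7000]).cumsum()

xs, ys = [], []
for i, level in enumerate(levels):
    # Horizontal segment from bar i to bar i + 1 at the running total,
    # followed by a NaN so matplotlib breaks the line before the next segment.
    xs.extend([i, i + 1, np.nan])
    ys.extend([level, level, np.nan])

print(len(levels), "segments, last one at level", ys[-2])
```

Passing `xs` and `ys` to the axes' `plot` method with the `'k'` style should give the same black connector lines, assuming the bars sit at integer x positions as they do in a Pandas bar plot.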

For the "net" row, we need to make sure the blank value is 0 so that the stack is not doubled.

 
    blank.loc["net"] = 0

Then plot it and see what it looks like.

 
   
    
      
      
    my_plot = trans.plot(kind='bar', stacked=True, bottom=blank, legend=None, title="2014 Sales Waterfall")
    my_plot.plot(step.index, step.values, 'k')

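That looks pretty good, but let's format the Y axis to make it more readable. To do this, we use FuncFormatter and some Python 2.7+ format syntax to drop the decimals (rounding to whole dollars) and add thousands separators.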

 
    def money(x, pos):
        'The two args are the value and tick position'
        return "${:,.0f}".format(x)

    from matplotlib.ticker import FuncFormatter
    formatter = FuncFormatter(money)
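To see what the formatter produces without drawing a chart, it can be called directly, since a FuncFormatter simply passes the tick value and position on to the wrapped function. The sample values below are arbitrary and only serve as an illustration.

```python
from matplotlib.ticker import FuncFormatter


def money(x, pos):
    """Format a tick value as whole dollars with thousands separators."""
    return "${:,.0f}".format(x)


formatter = FuncFormatter(money)

# Call the formatter the way matplotlib would: (value, tick position).
labels = [formatter(value, i) for i, value in enumerate([0, 50000, 350000, 1234567.8])]
print(labels)  # ['$0', '$50,000', '$350,000', '$1,234,568']
```

Note that negative values come out as `$-30,000` with this format string, which is not an issue here because the Y axis of this chart starts at 0.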

Then put it all together.

 
    my_plot = trans.plot(kind='bar', stacked=True, bottom=blank, legend=None, title="2014 Sales Waterfall")
    my_plot.plot(step.index, step.values, 'k')
    my_plot.set_xlabel("Transaction Types")
    my_plot.yaxis.set_major_formatter(formatter)

Full Script

The basic chart works, but I wanted to add some labels and make a few minor formatting changes. Here is my final script:

 
   
    
      
      
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from matplotlib.ticker import FuncFormatter

    # Use Python 2.7+ syntax to format currency
    def money(x, pos):
        'The two args are the value and tick position'
        return "${:,.0f}".format(x)
    formatter = FuncFormatter(money)

    # Data to plot. Do not include a total, it will be calculated
    index = ['sales', 'returns', 'credit fees', 'rebates', 'late charges', 'shipping']
    data = {'amount': [350000, -30000, -7500, -25000, 95000, -7000]}

    # Store data and create a blank series to use for the waterfall
    trans = pd.DataFrame(data=data, index=index)
    blank = trans.amount.cumsum().shift(1).fillna(0)

    # Get the net total number for the final element in the waterfall
    total = trans.sum().amount
    trans.loc["net"] = total
    blank.loc["net"] = total

    # The steps graphically show the levels and are also used for label placement
    step = blank.reset_index(drop=True).repeat(3).shift(-1)
    step[1::3] = np.nan

    # When plotting the last element, we want to show the full bar,
    # so set the blank to 0
    blank.loc["net"] = 0

    # Plot and label
    my_plot = trans.plot(kind='bar', stacked=True, bottom=blank, legend=None,
                         figsize=(10, 5), title="2014 Sales Waterfall")
    my_plot.plot(step.index, step.values, 'k')
    my_plot.set_xlabel("Transaction Types")

    # Format the axis for dollars
    my_plot.yaxis.set_major_formatter(formatter)

    # Get the y-axis position for the labels
    y_height = trans.amount.cumsum().shift(1).fillna(0)

    # Get an offset so labels don't sit right on top of the bar
    # (a scalar named max_amount, so the built-in max() is not shadowed)
    max_amount = trans.amount.max()
    neg_offset = max_amount / 25
    pos_offset = max_amount / 50
    plot_offset = int(max_amount / 15)

    # Start label loop
    loop = 0
    for index, row in trans.iterrows():
        # For the last item in the list, we don't want to double count
        if row['amount'] == total:
            y = y_height.iloc[loop]  # .iloc: look up by position, not by label
        else:
            y = y_height.iloc[loop] + row['amount']
        # Determine if we want a neg or pos offset
        if row['amount'] > 0:
            y += pos_offset
        else:
            y -= neg_offset
        my_plot.annotate("{:,.0f}".format(row['amount']), (loop, y), ha="center")
        loop += 1

    # Scale up the y axis so there is room for the labels
    my_plot.set_ylim(0, blank.max() + int(plot_offset))
    # Rotate the labels
    my_plot.set_xticklabels(trans.index, rotation=0)
    my_plot.get_figure().savefig("waterfall.png", dpi=200, bbox_inches='tight')

Running the script will generate this nice-looking chart:

Final Thoughts

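If you were not familiar with waterfall charts before, I hope this example shows you how useful they can be. I imagine some people will think that this is an awful lot of scripting for a single chart. In some respects, I agree. If you are only going to make one waterfall chart and never touch it again, then stick with an Excel solution.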

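However, what if the chart is really useful and you now need to replicate it for 100 customers? What would you do next? Using Excel would be a challenge at that point, while using the script in this article to create 100 different charts would be fairly easy. Once again, the real value of this process is that it lets you build an easily repeatable program when you need to scale the solution.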
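As a rough sketch of what that scaling could look like, the core steps of the script can be wrapped in a function and called once per customer. Everything named here is an assumption made for illustration: the `waterfall` helper, the `customers` dictionary, and its numbers are not part of the original script, and the connector lines, labels, and axis formatting are left out to keep the example short.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd


def waterfall(amounts, title):
    """Draw a basic waterfall chart from a Series of signed amounts."""
    trans = amounts.rename("amount").to_frame()
    # Blank base for each bar: the running total of everything before it.
    blank = trans["amount"].cumsum().shift(1).fillna(0)
    # Append the net total as a full bar that starts at the axis.
    trans.loc["net"] = trans["amount"].sum()
    blank.loc["net"] = 0
    ax = trans.plot(kind="bar", stacked=True, bottom=blank,
                    legend=None, title=title)
    ax.set_xticklabels(trans.index, rotation=0)
    return ax


# Hypothetical per-customer data, one Series of transactions each.
customers = {
    "Customer A": pd.Series([350000, -30000, -7500],
                            index=["sales", "returns", "credit fees"]),
    "Customer B": pd.Series([120000, -5000, 15000],
                            index=["sales", "returns", "late charges"]),
}

charts = {}
for name, amounts in customers.items():
    charts[name] = waterfall(amounts, "{} Waterfall".format(name))
    # To write each chart to disk, call savefig on charts[name].get_figure().
```

With the data for each customer loaded from a file or database instead of a hard-coded dictionary, the same loop would cover 100 customers as easily as two.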

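I really enjoy learning more about Pandas, matplotlib, and IPython. I am glad this approach could help you, and I hope others can learn something from it and apply the lesson to their daily work.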

About the author: PyPer