ch8_02 数据规整:聚合、合并、重塑

Jupyter notebook】更好的阅读体验!

接上一部分


轴向连接

  • 另一种数据合并运算也被称作连接(concatenation)、绑定(binding)或堆叠(stacking)。NumPy的concatenation函数可以用NumPy数组来做:
import numpy as np
import pandas as pd
arr = np.arange(12).reshape((3,4))
arr
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
np.concatenate([arr,arr],axis=1)
array([[ 0,  1,  2,  3,  0,  1,  2,  3],
       [ 4,  5,  6,  7,  4,  5,  6,  7],
       [ 8,  9, 10, 11,  8,  9, 10, 11]])
np.concatenate([arr,arr],axis=0)
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

pandas 的concat()函数

s1 = pd.Series([0,1], index=['a','b'])
s2 = pd.Series([2,3,4],index= ['c','d','e'])
s3 = pd.Series([5,6],index=['f','g'])
# 使用concat可以将这些Series的值和索引连接起来
pd.concat([s1,s2,s3])
a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: int64
  • 默认情况下是axis=0,也可以传入参数在列上合并,这样结果就会变成一个DataFrame
pd.concat([s1,s2,s3],axis=1,sort=False)
0 1 2
a 0.0 NaN NaN
b 1.0 NaN NaN
c NaN 2.0 NaN
d NaN 3.0 NaN
e NaN 4.0 NaN
f NaN NaN 5.0
g NaN NaN 6.0
s4 = pd.concat([s1,s3])
s4
a    0
b    1
f    5
g    6
dtype: int64
pd.concat([s1,s4],axis=1,sort=False)
0 1
a 0.0 0
b 1.0 1
f NaN 5
g NaN 6
pd.concat([s1,s4],axis=1,join='inner')
0 1
a 0 0
b 1 1
  • 通过join_axes指定要在其它轴上使用的索引:
pd.concat([s1,s4],axis=1, join_axes=[['a','c','b','e']])
0 1
a 0.0 0.0
c NaN NaN
b 1.0 1.0
e NaN NaN
#不过有个问题,参与连接的片段在结果中区分不开。
#假设你想要在连接轴上创建一个层次化索引。使用keys参数即可达到这个目的:
result = pd.concat([s1,s2,s3],keys = ['one','two','three'])
result
one    a    0
       b    1
two    c    2
       d    3
       e    4
three  f    5
       g    6
dtype: int64
result.unstack()
a b c d e f g
one 0.0 1.0 NaN NaN NaN NaN NaN
two NaN NaN 2.0 3.0 4.0 NaN NaN
three NaN NaN NaN NaN NaN 5.0 6.0
  • 如果沿着axis=1 进行合并,则keys就会成为列头
pd.concat([s1,s2,s3],axis=1,keys=['one','two','three'],sort=False)
one two three
a 0.0 NaN NaN
b 1.0 NaN NaN
c NaN 2.0 NaN
d NaN 3.0 NaN
e NaN 4.0 NaN
f NaN NaN 5.0
g NaN NaN 6.0
  • 同样的逻辑也适用于DataFrame
df1 = pd.DataFrame(np.arange(6).reshape((3,2)),
                   index = ['a','b','c'],
                   columns=['one','two'])
df2 = pd.DataFrame(5+np.arange(4).reshape((2,2)),
                  index = ['a','c'],
                  columns = ['three','four'])
df1
one two
a 0 1
b 2 3
c 4 5
df2
three four
a 5 6
c 7 8
pd.concat([df1,df2],axis=1,keys=['level1','level2'],sort=False)
level1 level2
one two three four
a 0 1 5.0 6.0
b 2 3 NaN NaN
c 4 5 7.0 8.0
  • 如果传入的不是一个列表,而是一个字典,那么字典的键就会被当作keys的选项
pd.concat({'level1':df1, 'level2':df2},axis=1,sort=False)
level1 level2
one two three four
a 0 1 5.0 6.0
b 2 3 NaN NaN
c 4 5 7.0 8.0
  • 此外还可以传入names参数
pd.concat([df1,df2],axis=1,keys=['level1','level2'],names=['upper','lower'],sort=False)
upper level1 level2
lower one two three four
a 0 1 5.0 6.0
b 2 3 NaN NaN
c 4 5 7.0 8.0
  • concat的参数如下:
    1

合并重叠数据集

a = pd.Series([np.nan,2.5,np.nan,3.5,4.5,np.nan],index=['f','e','d','c','b','a'])
b = pd.Series(np.arange(len(a),dtype=np.float64),index=a.index)
a
f    NaN
e    2.5
d    NaN
c    3.5
b    4.5
a    NaN
dtype: float64
b[-1]=np.nan
b
f    0.0
e    1.0
d    2.0
c    3.0
b    4.0
a    NaN
dtype: float64
np.where(pd.isnull(a),b,a)
array([0. , 2.5, 2. , 3.5, 4.5, nan])
b[:-2].combine_first(a[1:])
a    NaN
b    4.5
c    3.0
d    2.0
e    1.0
f    0.0
dtype: float64
  • 注: np.where(condition, x, y)满足condition则输出x,否则输出y.上面的代码:如果a的元素是nan则输出b对应的元素

  • 对于DataFrame,combine_first会做同样的事情

df1 = pd.DataFrame({'a': [1., np.nan, 5., np.nan],
                    'b': [np.nan, 2., np.nan, 6.],
                    'c': range(2, 18, 4)})
df2 = pd.DataFrame({'a': [5., 4., np.nan, 3., 7.],
                    'b': [np.nan, 3., 4., 6., 8.]})
df1
a b c
0 1.0 NaN 2
1 NaN 2.0 6
2 5.0 NaN 10
3 NaN 6.0 14
df2
a b
0 5.0 NaN
1 4.0 3.0
2 NaN 4.0
3 3.0 6.0
4 7.0 8.0
df1.combine_first(df2)
a b c
0 1.0 NaN 2.0
1 4.0 2.0 6.0
2 5.0 4.0 10.0
3 3.0 6.0 14.0
4 7.0 8.0 NaN

8.3重塑和轴向旋转

重塑层次化索引

  • 主要方法有:
    • stack():将列旋转为行
    • unstack():将行旋转为列
data = pd.DataFrame(np.arange(6).reshape((2,3)),
                   index=pd.Index(['Ohio','Colordo'],name='state'),
                   columns=pd.Index(['one','two','three'],name='number'))
data
number one two three
state
Ohio 0 1 2
Colordo 3 4 5
result = data.stack()
result
state    number
Ohio     one       0
         two       1
         three     2
Colordo  one       3
         two       4
         three     5
dtype: int32
result.unstack()
number one two three
state
Ohio 0 1 2
Colordo 3 4 5
  • 默认情况下,unstack操作的是最内层(stack也是如此)。传入分层级别的编号或名称即可对其它级别进行unstack操作:
result.unstack(0)
state Ohio Colordo
number
one 0 3
two 1 4
three 2 5
result.unstack('state')
state Ohio Colordo
number
one 0 3
two 1 4
three 2 5
  • 如果有的值找不到,就会引入缺失数据
s1 = pd.Series([0,1,2,3],index=['a','b','c','d'])
s2 = pd.Series([4,5,6],index=['c','d','e'])
data = pd.concat([s1,s2],keys=['one','two'])
data
one  a    0
     b    1
     c    2
     d    3
two  c    4
     d    5
     e    6
dtype: int64
data.unstack()
a b c d e
one 0.0 1.0 2.0 3.0 NaN
two NaN NaN 4.0 5.0 6.0
data.unstack().stack()
one  a    0.0
     b    1.0
     c    2.0
     d    3.0
two  c    4.0
     d    5.0
     e    6.0
dtype: float64
data.unstack().stack(dropna=False)
one  a    0.0
     b    1.0
     c    2.0
     d    3.0
     e    NaN
two  a    NaN
     b    NaN
     c    4.0
     d    5.0
     e    6.0
dtype: float64
  • DataFrame进行unstack操作时,作为旋转轴的级别将会成为结果中的最低级别:
df = pd.DataFrame({'left': result, 'right': result + 5},
                  columns=pd.Index(['left', 'right'], name='side'))
df
side left right
state number
Ohio one 0 5
two 1 6
three 2 7
Colordo one 3 8
two 4 9
three 5 10
df.unstack('state')
side left right
state Ohio Colordo Ohio Colordo
number
one 0 3 5 8
two 1 4 6 9
three 2 5 7 10
# 调用stack()可以指明轴的名字
df.unstack('state').stack('side')
state Colordo Ohio
number side
one left 3 0
right 8 5
two left 4 1
right 9 6
three left 5 2
right 10 7

将长格式转换为宽格式

  • 多个时间序列数据通常是以所谓的“长格式”(long)或“堆叠格式”(stacked)存储在数据库和CSV中的。
data = pd.read_csv('data/examples/macrodata.csv')
data.head()
year quarter realgdp realcons realinv realgovt realdpi cpi m1 tbilrate unemp pop infl realint
0 1959.0 1.0 2710.349 1707.4 286.898 470.045 1886.9 28.98 139.7 2.82 5.8 177.146 0.00 0.00
1 1959.0 2.0 2778.801 1733.7 310.859 481.301 1919.7 29.15 141.7 3.08 5.1 177.830 2.34 0.74
2 1959.0 3.0 2775.488 1751.8 289.226 491.260 1916.4 29.35 140.5 3.82 5.3 178.657 2.74 1.09
3 1959.0 4.0 2785.204 1753.7 299.356 484.052 1931.3 29.37 140.0 4.33 5.6 179.386 0.27 4.06
4 1960.0 1.0 2847.699 1770.5 331.722 462.199 1955.5 29.54 139.6 3.50 5.2 180.007 2.31 1.19
periods = pd.PeriodIndex(year=data.year,quarter=data.quarter,name='data')
columns = pd.Index(['realgdp','infl','unemp'],name='item')
data = data.reindex(columns=columns)
data.index = periods.to_timestamp('D','end')
ldata = data.stack().reset_index().rename(columns={0:'value'})
ldata[:5]
data item value
0 1959-03-31 realgdp 2710.349
1 1959-03-31 infl 0.000
2 1959-03-31 unemp 5.800
3 1959-06-30 realgdp 2778.801
4 1959-06-30 infl 2.340
pivoted = ldata.pivot('data','item','value')
pivoted[:5]
item infl realgdp unemp
data
1959-03-31 0.00 2710.349 5.8
1959-06-30 2.34 2778.801 5.1
1959-09-30 2.74 2775.488 5.3
1959-12-31 0.27 2785.204 5.6
1960-03-31 2.31 2847.699 5.2

将宽格式旋转为长格式

df = pd.DataFrame({'key':['foo','bar','baz'],
                  'A':[1,2,3],
                  'B':[4,5,6],
                  'C':[7,8,9]})
df
key A B C
0 foo 1 4 7
1 bar 2 5 8
2 baz 3 6 9
melted = pd.melt(df,['key'])
melted
key variable value
0 foo A 1
1 bar A 2
2 baz A 3
3 foo B 4
4 bar B 5
5 baz B 6
6 foo C 7
7 bar C 8
8 baz C 9
reshaped = melted.pivot('key','variable','value')
reshaped
variable A B C
key
bar 2 5 8
baz 3 6 9
foo 1 4 7
reshaped.reset_index()
variable key A B C
0 bar 2 5 8
1 baz 3 6 9
2 foo 1 4 7