【Jupyter notebook】更好的阅读体验!
接上一部分
轴向连接
- 另一种数据合并运算也被称作连接(concatenation)、绑定(binding)或堆叠(stacking)。NumPy的concatenation函数可以用NumPy数组来做:
import numpy as np
import pandas as pd
arr = np.arange(12).reshape((3,4))
arr
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
np.concatenate([arr,arr],axis=1)
array([[ 0, 1, 2, 3, 0, 1, 2, 3],
[ 4, 5, 6, 7, 4, 5, 6, 7],
[ 8, 9, 10, 11, 8, 9, 10, 11]])
np.concatenate([arr,arr],axis=0)
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
pandas 的concat()函数
s1 = pd.Series([0,1], index=['a','b'])
s2 = pd.Series([2,3,4],index= ['c','d','e'])
s3 = pd.Series([5,6],index=['f','g'])
pd.concat([s1,s2,s3])
a 0
b 1
c 2
d 3
e 4
f 5
g 6
dtype: int64
- 默认情况下是axis=0,也可以传入参数在列上合并,这样结果就会变成一个DataFrame
pd.concat([s1,s2,s3],axis=1,sort=False)
|
0 |
1 |
2 |
a |
0.0 |
NaN |
NaN |
b |
1.0 |
NaN |
NaN |
c |
NaN |
2.0 |
NaN |
d |
NaN |
3.0 |
NaN |
e |
NaN |
4.0 |
NaN |
f |
NaN |
NaN |
5.0 |
g |
NaN |
NaN |
6.0 |
s4 = pd.concat([s1,s3])
s4
a 0
b 1
f 5
g 6
dtype: int64
pd.concat([s1,s4],axis=1,sort=False)
|
0 |
1 |
a |
0.0 |
0 |
b |
1.0 |
1 |
f |
NaN |
5 |
g |
NaN |
6 |
pd.concat([s1,s4],axis=1,join='inner')
- 通过join_axes指定要在其它轴上使用的索引:
pd.concat([s1,s4],axis=1, join_axes=[['a','c','b','e']])
|
0 |
1 |
a |
0.0 |
0.0 |
c |
NaN |
NaN |
b |
1.0 |
1.0 |
e |
NaN |
NaN |
result = pd.concat([s1,s2,s3],keys = ['one','two','three'])
result
one a 0
b 1
two c 2
d 3
e 4
three f 5
g 6
dtype: int64
result.unstack()
|
a |
b |
c |
d |
e |
f |
g |
one |
0.0 |
1.0 |
NaN |
NaN |
NaN |
NaN |
NaN |
two |
NaN |
NaN |
2.0 |
3.0 |
4.0 |
NaN |
NaN |
three |
NaN |
NaN |
NaN |
NaN |
NaN |
5.0 |
6.0 |
- 如果沿着axis=1 进行合并,则keys就会成为列头
pd.concat([s1,s2,s3],axis=1,keys=['one','two','three'],sort=False)
|
one |
two |
three |
a |
0.0 |
NaN |
NaN |
b |
1.0 |
NaN |
NaN |
c |
NaN |
2.0 |
NaN |
d |
NaN |
3.0 |
NaN |
e |
NaN |
4.0 |
NaN |
f |
NaN |
NaN |
5.0 |
g |
NaN |
NaN |
6.0 |
df1 = pd.DataFrame(np.arange(6).reshape((3,2)),
index = ['a','b','c'],
columns=['one','two'])
df2 = pd.DataFrame(5+np.arange(4).reshape((2,2)),
index = ['a','c'],
columns = ['three','four'])
df1
|
one |
two |
a |
0 |
1 |
b |
2 |
3 |
c |
4 |
5 |
df2
pd.concat([df1,df2],axis=1,keys=['level1','level2'],sort=False)
|
level1 |
level2 |
|
one |
two |
three |
four |
a |
0 |
1 |
5.0 |
6.0 |
b |
2 |
3 |
NaN |
NaN |
c |
4 |
5 |
7.0 |
8.0 |
- 如果传入的不是一个列表,而是一个字典,那么字典的键就会被当作keys的选项
pd.concat({'level1':df1, 'level2':df2},axis=1,sort=False)
|
level1 |
level2 |
|
one |
two |
three |
four |
a |
0 |
1 |
5.0 |
6.0 |
b |
2 |
3 |
NaN |
NaN |
c |
4 |
5 |
7.0 |
8.0 |
pd.concat([df1,df2],axis=1,keys=['level1','level2'],names=['upper','lower'],sort=False)
upper |
level1 |
level2 |
lower |
one |
two |
three |
four |
a |
0 |
1 |
5.0 |
6.0 |
b |
2 |
3 |
NaN |
NaN |
c |
4 |
5 |
7.0 |
8.0 |
- concat的参数如下:
合并重叠数据集
a = pd.Series([np.nan,2.5,np.nan,3.5,4.5,np.nan],index=['f','e','d','c','b','a'])
b = pd.Series(np.arange(len(a),dtype=np.float64),index=a.index)
a
f NaN
e 2.5
d NaN
c 3.5
b 4.5
a NaN
dtype: float64
b[-1]=np.nan
b
f 0.0
e 1.0
d 2.0
c 3.0
b 4.0
a NaN
dtype: float64
np.where(pd.isnull(a),b,a)
array([0. , 2.5, 2. , 3.5, 4.5, nan])
b[:-2].combine_first(a[1:])
a NaN
b 4.5
c 3.0
d 2.0
e 1.0
f 0.0
dtype: float64
-
注: np.where(condition, x, y)满足condition则输出x,否则输出y.上面的代码:如果a的元素是nan则输出b对应的元素
-
对于DataFrame,combine_first会做同样的事情
df1 = pd.DataFrame({'a': [1., np.nan, 5., np.nan],
'b': [np.nan, 2., np.nan, 6.],
'c': range(2, 18, 4)})
df2 = pd.DataFrame({'a': [5., 4., np.nan, 3., 7.],
'b': [np.nan, 3., 4., 6., 8.]})
df1
|
a |
b |
c |
0 |
1.0 |
NaN |
2 |
1 |
NaN |
2.0 |
6 |
2 |
5.0 |
NaN |
10 |
3 |
NaN |
6.0 |
14 |
df2
|
a |
b |
0 |
5.0 |
NaN |
1 |
4.0 |
3.0 |
2 |
NaN |
4.0 |
3 |
3.0 |
6.0 |
4 |
7.0 |
8.0 |
df1.combine_first(df2)
|
a |
b |
c |
0 |
1.0 |
NaN |
2.0 |
1 |
4.0 |
2.0 |
6.0 |
2 |
5.0 |
4.0 |
10.0 |
3 |
3.0 |
6.0 |
14.0 |
4 |
7.0 |
8.0 |
NaN |
8.3重塑和轴向旋转
重塑层次化索引
- 主要方法有:
- stack():将列旋转为行
- unstack():将行旋转为列
data = pd.DataFrame(np.arange(6).reshape((2,3)),
index=pd.Index(['Ohio','Colordo'],name='state'),
columns=pd.Index(['one','two','three'],name='number'))
data
number |
one |
two |
three |
state |
|
|
|
Ohio |
0 |
1 |
2 |
Colordo |
3 |
4 |
5 |
result = data.stack()
result
state number
Ohio one 0
two 1
three 2
Colordo one 3
two 4
three 5
dtype: int32
result.unstack()
number |
one |
two |
three |
state |
|
|
|
Ohio |
0 |
1 |
2 |
Colordo |
3 |
4 |
5 |
- 默认情况下,unstack操作的是最内层(stack也是如此)。传入分层级别的编号或名称即可对其它级别进行unstack操作:
result.unstack(0)
state |
Ohio |
Colordo |
number |
|
|
one |
0 |
3 |
two |
1 |
4 |
three |
2 |
5 |
result.unstack('state')
state |
Ohio |
Colordo |
number |
|
|
one |
0 |
3 |
two |
1 |
4 |
three |
2 |
5 |
s1 = pd.Series([0,1,2,3],index=['a','b','c','d'])
s2 = pd.Series([4,5,6],index=['c','d','e'])
data = pd.concat([s1,s2],keys=['one','two'])
data
one a 0
b 1
c 2
d 3
two c 4
d 5
e 6
dtype: int64
data.unstack()
|
a |
b |
c |
d |
e |
one |
0.0 |
1.0 |
2.0 |
3.0 |
NaN |
two |
NaN |
NaN |
4.0 |
5.0 |
6.0 |
data.unstack().stack()
one a 0.0
b 1.0
c 2.0
d 3.0
two c 4.0
d 5.0
e 6.0
dtype: float64
data.unstack().stack(dropna=False)
one a 0.0
b 1.0
c 2.0
d 3.0
e NaN
two a NaN
b NaN
c 4.0
d 5.0
e 6.0
dtype: float64
- DataFrame进行unstack操作时,作为旋转轴的级别将会成为结果中的最低级别:
df = pd.DataFrame({'left': result, 'right': result + 5},
columns=pd.Index(['left', 'right'], name='side'))
df
|
side |
left |
right |
state |
number |
|
|
Ohio |
one |
0 |
5 |
two |
1 |
6 |
three |
2 |
7 |
Colordo |
one |
3 |
8 |
two |
4 |
9 |
three |
5 |
10 |
df.unstack('state')
side |
left |
right |
state |
Ohio |
Colordo |
Ohio |
Colordo |
number |
|
|
|
|
one |
0 |
3 |
5 |
8 |
two |
1 |
4 |
6 |
9 |
three |
2 |
5 |
7 |
10 |
df.unstack('state').stack('side')
|
state |
Colordo |
Ohio |
number |
side |
|
|
one |
left |
3 |
0 |
right |
8 |
5 |
two |
left |
4 |
1 |
right |
9 |
6 |
three |
left |
5 |
2 |
right |
10 |
7 |
将长格式转换为宽格式
- 多个时间序列数据通常是以所谓的“长格式”(long)或“堆叠格式”(stacked)存储在数据库和CSV中的。
data = pd.read_csv('data/examples/macrodata.csv')
data.head()
|
year |
quarter |
realgdp |
realcons |
realinv |
realgovt |
realdpi |
cpi |
m1 |
tbilrate |
unemp |
pop |
infl |
realint |
0 |
1959.0 |
1.0 |
2710.349 |
1707.4 |
286.898 |
470.045 |
1886.9 |
28.98 |
139.7 |
2.82 |
5.8 |
177.146 |
0.00 |
0.00 |
1 |
1959.0 |
2.0 |
2778.801 |
1733.7 |
310.859 |
481.301 |
1919.7 |
29.15 |
141.7 |
3.08 |
5.1 |
177.830 |
2.34 |
0.74 |
2 |
1959.0 |
3.0 |
2775.488 |
1751.8 |
289.226 |
491.260 |
1916.4 |
29.35 |
140.5 |
3.82 |
5.3 |
178.657 |
2.74 |
1.09 |
3 |
1959.0 |
4.0 |
2785.204 |
1753.7 |
299.356 |
484.052 |
1931.3 |
29.37 |
140.0 |
4.33 |
5.6 |
179.386 |
0.27 |
4.06 |
4 |
1960.0 |
1.0 |
2847.699 |
1770.5 |
331.722 |
462.199 |
1955.5 |
29.54 |
139.6 |
3.50 |
5.2 |
180.007 |
2.31 |
1.19 |
periods = pd.PeriodIndex(year=data.year,quarter=data.quarter,name='data')
columns = pd.Index(['realgdp','infl','unemp'],name='item')
data = data.reindex(columns=columns)
data.index = periods.to_timestamp('D','end')
ldata = data.stack().reset_index().rename(columns={0:'value'})
ldata[:5]
|
data |
item |
value |
0 |
1959-03-31 |
realgdp |
2710.349 |
1 |
1959-03-31 |
infl |
0.000 |
2 |
1959-03-31 |
unemp |
5.800 |
3 |
1959-06-30 |
realgdp |
2778.801 |
4 |
1959-06-30 |
infl |
2.340 |
pivoted = ldata.pivot('data','item','value')
pivoted[:5]
item |
infl |
realgdp |
unemp |
data |
|
|
|
1959-03-31 |
0.00 |
2710.349 |
5.8 |
1959-06-30 |
2.34 |
2778.801 |
5.1 |
1959-09-30 |
2.74 |
2775.488 |
5.3 |
1959-12-31 |
0.27 |
2785.204 |
5.6 |
1960-03-31 |
2.31 |
2847.699 |
5.2 |
将宽格式旋转为长格式
df = pd.DataFrame({'key':['foo','bar','baz'],
'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9]})
df
|
key |
A |
B |
C |
0 |
foo |
1 |
4 |
7 |
1 |
bar |
2 |
5 |
8 |
2 |
baz |
3 |
6 |
9 |
melted = pd.melt(df,['key'])
melted
|
key |
variable |
value |
0 |
foo |
A |
1 |
1 |
bar |
A |
2 |
2 |
baz |
A |
3 |
3 |
foo |
B |
4 |
4 |
bar |
B |
5 |
5 |
baz |
B |
6 |
6 |
foo |
C |
7 |
7 |
bar |
C |
8 |
8 |
baz |
C |
9 |
reshaped = melted.pivot('key','variable','value')
reshaped
variable |
A |
B |
C |
key |
|
|
|
bar |
2 |
5 |
8 |
baz |
3 |
6 |
9 |
foo |
1 |
4 |
7 |
reshaped.reset_index()
variable |
key |
A |
B |
C |
0 |
bar |
2 |
5 |
8 |
1 |
baz |
3 |
6 |
9 |
2 |
foo |
1 |
4 |
7 |