>>> from pandas import Series,DataFrame
>>> import pandas as pd
>>> import numpy as np
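A Series is a one-dimensional array-like object made up of a set of data and an associated set of data labels (its index).
>>> obj=Series([4,7,-5,3])
>>> obj
0    4
1    7
2   -5
3    3
dtype: int64
The values and index attributes return its array representation and its index object.
>>> obj.values
array([ 4,  7, -5,  3])
>>> obj.index
RangeIndex(start=0, stop=4, step=1)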
A Series can be created with an index that labels each data point.
>>> obj2=Series([4,7,-5,3],index=['d','b','a','c'])
>>> obj2
d    4
b    7
a   -5
c    3
dtype: int64
>>> obj2.index
Index([u'd', u'b', u'a', u'c'], dtype='object')
Unlike a plain NumPy array, a Series lets you select a single value or a set of values by index label.
>>> obj2['a']
-5
>>> obj2[['a','b','c']]
a   -5
b    7
c    3
dtype: int64
A Series can also be treated as a fixed-length, ordered dict.
>>> 'b' in obj2
True
>>> 'e' in obj2
False
If the data is stored in a Python dict, a Series can be created directly from that dict.
>>> sdata={'Ohio':35000,'Texas':71000,'Oregon':16000,'Utah':5000}
>>> obj3=Series(sdata)
>>> obj3
Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64
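If only a dict is passed, the index of the resulting Series is the dict's keys.
If an index is passed as well, the values whose keys in sdata match the labels in states are arranged in the order given by states; a label with no match ('California' here) gets NaN.
>>> states=['California','Ohio','Oregon','Texas']
>>> obj4=Series(sdata,index=states)
>>> obj4
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64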
The pandas isnull and notnull functions detect missing data.
>>> pd.isnull(obj4)
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
Series also provides these as instance methods.
>>> obj4.isnull()
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
The most important Series feature is data alignment: arithmetic operations automatically align data that is indexed differently.
>>> obj3+obj4
California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64
Both the Series object itself and its index have a name attribute.
>>> obj4.name='population'
>>> obj4.index.name='state'
>>> obj4
state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64
The index of a Series can be changed in place by assignment.
>>> obj.index=['Bob','Steve','Jeff','Ryan']
>>> obj
Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64
To build a DataFrame, pass a dict of equal-length lists or NumPy arrays.
An index is assigned automatically, and in the pandas version used here the columns are placed in sorted order.
>>> data={'state':['Ohio','Ohio','Ohio','Nevada','Nevada'],
...       'year':[2000,2001,2002,2001,2002],
...       'pop':[1.5,1.7,3.6,2.4,2.9]}
>>> frame=DataFrame(data)
>>> frame
   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002
If a sequence of columns is specified, the DataFrame's columns are arranged in exactly that order.
>>> DataFrame(data,columns=['year','state','pop'])
   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
As with Series, a column that is passed but not found in the data is filled with NA values.
>>> frame2=DataFrame(data,columns=['year','state','pop','debt'],
...                  index=['one','two','three','four','five'])
>>> frame2
       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN
>>> frame2.columns
Index([u'year', u'state', u'pop', u'debt'], dtype='object')
A DataFrame column can be retrieved as a Series either by dict-like notation or by attribute. The returned Series has the same index as the DataFrame, and its name attribute is set accordingly.
>>> frame2['state']
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object
>>> frame2.year
one      2000
two      2001
three    2002
four     2001
five     2002
Name: year, dtype: int64
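Rows can also be retrieved by position or by name, for example with the ix indexing field (ix was deprecated in pandas 0.20 and removed in 1.0; loc and iloc replace it).
>>> frame2.ix['three']
year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object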
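Since ix is no longer available in current pandas, here is a minimal sketch of the same row lookup with loc (by label) and iloc (by integer position). It assumes a recent pandas install; the names by_label and by_position are only illustrative.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four', 'five'])

by_label = frame2.loc['three']   # select the row by its index label
by_position = frame2.iloc[2]     # select the same row by integer position
print(by_label)
```

Both calls return the row as a Series whose name is the row label.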
Columns can be modified by assignment.
>>> frame2['debt']=16.5
>>> frame2['debt']=np.arange(5.)
>>> frame2
       year   state  pop  debt
one    2000    Ohio  1.5   0.0
two    2001    Ohio  1.7   1.0
three  2002    Ohio  3.6   2.0
four   2001  Nevada  2.4   3.0
five   2002  Nevada  2.9   4.0
When a list or array is assigned to a column, its length must match the length of the DataFrame.
If a Series is assigned, it is aligned exactly to the DataFrame's index, and any holes are filled with missing values.
>>> val=Series([-1.2,-1.5,-1.7],index=['two','four','five'])
>>> frame2['debt']=val
>>> frame2
       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7
Assigning to a column that does not exist creates a new column.
>>> frame2['eastern']=frame2.state == 'Ohio'
>>> frame2
       year   state  pop  debt eastern
one    2000    Ohio  1.5   NaN    True
two    2001    Ohio  1.7  -1.2    True
three  2002    Ohio  3.6   NaN    True
four   2001  Nevada  2.4  -1.5   False
five   2002  Nevada  2.9  -1.7   False
The del keyword deletes a column.
>>> del frame2['eastern']
>>> frame2
       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7
A column returned by indexing is only a view on the underlying data, not a copy, so any in-place modification of the returned Series is reflected in the source DataFrame. Use the Series's copy method when an explicit copy is needed.
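Another common form of data is a nested dict (a dict of dicts).
The outer dict keys become the columns and the inner keys become the row index.
>>> pop={'Nevada':{2001:2.4,2002:2.9},
...      'Ohio':{2000:1.5,2001:1.7,2002:3.6}}
>>> frame3=DataFrame(pop)
>>> frame3
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6
The result can be transposed.
>>> frame3.T
        2000  2001  2002
Nevada   NaN   2.4   2.9
Ohio     1.5   1.7   3.6
The keys of the inner dicts are unioned and sorted to form the final index, unless an explicit index is passed:
>>> DataFrame(pop,index=[2001,2002,2003])
      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2003     NaN   NaN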
The name attributes of a DataFrame's index and columns can be set.
>>> frame3.index.name='year';frame3.columns.name='state'
>>> frame3
state  Nevada  Ohio
year
2000      NaN   1.5
2001      2.4   1.7
2002      2.9   3.6
As with Series, the values attribute returns the data in the DataFrame, here as a two-dimensional ndarray.
>>> frame3.values
array([[ nan,  1.5],
       [ 2.4,  1.7],
       [ 2.9,  3.6]])
If the DataFrame's columns have different dtypes, the dtype of the values array is chosen to accommodate all of the columns.
>>> frame2.values
array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, -1.2],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, -1.5],
       [2002, 'Nevada', 2.9, -1.7]], dtype=object)
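Index objects are an important part of the pandas data model.
They hold the axis labels and other metadata. Any array or other sequence of labels used when constructing a Series or DataFrame is converted to an Index.
>>> obj=Series(range(3),index=['a','b','c'])
>>> index=obj.index
>>> index
Index([u'a', u'b', u'c'], dtype='object')
Index objects are immutable, which makes it safe to share an Index among several data structures.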
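A small sketch of what immutability means in practice: item assignment on an Index raises TypeError, and the same labels can then be handed to a second Series. The flag name mutated and the values in obj2 are made up for illustration.

```python
import pandas as pd

obj = pd.Series(range(3), index=['a', 'b', 'c'])
index = obj.index

try:
    index[1] = 'd'          # item assignment is not supported on an Index
    mutated = True
except TypeError:
    mutated = False

# Because the labels cannot change underneath us, they can be reused safely.
obj2 = pd.Series([1.5, -2.5, 0.0], index=index)
print(mutated, list(obj2.index))
```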
>>> obj=Series([4.5,7.2,-5.3,3.6],index=['d','b','a','c'])
>>> obj
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64
Calling reindex on this Series rearranges the data according to the new index and introduces missing values for any index value that was not already present.
>>> obj2=obj.reindex(['a','b','c','d','e'])
>>> obj2
a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64
A default value for the missing entries can be specified with fill_value.
>>> obj2=obj.reindex(['a','b','c','d','e'],fill_value=0)
>>> obj2
a   -5.3
b    7.2
c    3.6
d    4.5
e    0.0
dtype: float64
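For ordered data such as time series, you may need to do some interpolation or filling of values when reindexing.
The method option does this.
>>> obj3=Series(['blue','purple','yellow'],index=[0,2,4])
The reindex method (interpolation) options:
ffill or pad    fill (or carry) values forward
>>> obj3.reindex(range(6),method='ffill')
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object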
With DataFrame, reindex can alter the row index, the columns, or both. When passed just a sequence, it reindexes the rows.
>>> frame=DataFrame(np.arange(9).reshape((3,3)),index=['a','c','d'],
...                 columns=['Ohio','Texas','California'])
>>> frame
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8
>>> frame2=frame.reindex(['a','b','c','d'])
>>> frame2
   Ohio  Texas  California
a   0.0    1.0         2.0
b   NaN    NaN         NaN
c   3.0    4.0         5.0
d   6.0    7.0         8.0
The columns can be reindexed with the columns keyword.
>>> states=['Texas','Utah','California']
>>> frame.reindex(columns=states)
   Texas  Utah  California
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8
Rows and columns can be reindexed in one call, but interpolation is applied only row-wise (along axis 0).
>>> frame.reindex(index=['a','b','c','d'],method='ffill',columns=states)
   Texas  Utah  California
a      1   NaN           2
b      1   NaN           2
c      4   NaN           5
d      7   NaN           8
Reindexing can be written more succinctly by label-indexing with ix (in this pandas version; ix has since been removed).
>>> frame.ix[['a','b','c','d'],states]
   Texas  Utah  California
a    1.0   NaN         2.0
b    NaN   NaN         NaN
c    4.0   NaN         5.0
d    7.0   NaN         8.0
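Arguments of the reindex function:
index         New sequence to use as the index
method        Interpolation (fill) method
fill_value    Substitute value to use when reindexing introduces missing data
limit         Maximum size of the gap to fill when forward- or backfilling
level         Match a simple Index on the given level of a MultiIndex; otherwise select a subset of it
copy          Defaults to True, which always copies the data; if False, the data is not copied when the new index is equivalent to the old one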
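The sketch below exercises three of the arguments listed above (method, limit and fill_value) on a small Series. The variable names filled, limited and constant, and the placeholder string 'none', are illustrative choices rather than anything from the original notes.

```python
import pandas as pd

obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# method='ffill' carries the last seen value forward into new labels.
filled = obj3.reindex(range(6), method='ffill')

# limit=1 allows at most one consecutive forward fill per source value,
# so labels 6 and 7 stay missing.
limited = obj3.reindex(range(8), method='ffill', limit=1)

# fill_value substitutes a constant for every newly introduced label.
constant = obj3.reindex(range(6), fill_value='none')

print(limited)
```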
>>> obj=Series(np.arange(5.),index=['a','b','c','d','e'])
>>> obj
a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64
>>> new_obj=obj.drop('c')
>>> new_obj
a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64
With DataFrame, index values can be deleted from either axis.
>>> data=DataFrame(np.arange(16).reshape((4,4)),
...                index=['Ohio','Colorado','Utah','New York'],
...                columns=['one','two','three','four'])
>>> data
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
>>> data.drop(['Colorado','Ohio'])
          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15
>>> data.drop('two',axis=1)
          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15
>>> data.drop(['two','four'],axis=1)
          one  three
Ohio        0      2
Colorado    4      6
Utah        8     10
New York   12     14
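Series indexing works like NumPy array indexing, except that the index values need not be integers.
Slicing a Series with labels differs from normal Python slicing in that the endpoint is inclusive.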
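A minimal sketch of the difference, using a throwaway Series; by_label and by_position are illustrative names.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

by_label = obj.loc['b':'c']    # label slice: the endpoint 'c' is included
by_position = obj.iloc[1:2]    # positional slice: the endpoint is excluded

print(list(by_label.index), list(by_position.index))
```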
>>> data=DataFrame(np.arange(16).reshape((4,4)),
...                index=['Ohio','Colorado','Utah','New York'],
...                columns=['one','two','three','four'])
>>> data
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
Slicing and selecting from a DataFrame:
>>> data[:2]
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
>>> data<5
            one    two  three   four
Ohio       True   True   True   True
Colorado   True  False  False  False
Utah      False  False  False  False
New York  False  False  False  False
>>> data[data<5]=0
>>> data
          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
>>> data.ix['Colorado',['two','three']]
two      5
three    6
Name: Colorado, dtype: int64
>>> data.ix[['Colorado','Utah'],[3,0,1]]
          four  one  two
Colorado     7    0    5
Utah        11    8    9
>>> data.ix[2]
one       8
two       9
three    10
four     11
Name: Utah, dtype: int64
>>> data.ix[:'Utah','two']
Ohio        0
Colorado    5
Utah        9
Name: two, dtype: int64
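Automatic data alignment introduces NA values at the labels that do not overlap.
With DataFrame, alignment is performed on both the rows and the columns.
To substitute a value for the missing operand instead, use the add method and pass the other object together with a fill_value argument: obj.add(obj2,fill_value=0)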
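To make the effect of fill_value concrete, here is a short sketch comparing the plain + operator with add. The Series s1 and s2 and their values are invented for the example.

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
s2 = pd.Series([10.0, 20.0], index=['b', 'd'])

plain = s1 + s2                     # labels present on only one side become NaN
filled = s1.add(s2, fill_value=0)   # the missing operand is treated as 0 instead

print(plain)
print(filled)
```

Only the label 'b' exists in both inputs, so it is the only non-missing entry of plain, while filled has a value for every label in the union.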
>>> arr=np.arange(12.).reshape((3,4))
>>> arr
array([[  0.,   1.,   2.,   3.],
       [  4.,   5.,   6.,   7.],
       [  8.,   9.,  10.,  11.]])
>>> arr[0]
array([ 0.,  1.,  2.,  3.])
>>> arr-arr[0]
array([[ 0.,  0.,  0.,  0.],
       [ 4.,  4.,  4.,  4.],
       [ 8.,  8.,  8.,  8.]])
This is called broadcasting: the subtraction is performed once for each row.
>>> frame=DataFrame(np.arange(12.).reshape((4,3)),columns=list('bde'),
...                 index=['Utah','Ohio','Texas','Oregon'])
>>> series=frame.ix[0]
>>> frame
          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0
>>> series
b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64
By default, arithmetic between a DataFrame and a Series matches the index of the Series against the DataFrame's columns and then broadcasts down the rows.
>>> frame-series
          b    d    e
Utah    0.0  0.0  0.0
Ohio    3.0  3.0  3.0
Texas   6.0  6.0  6.0
Oregon  9.0  9.0  9.0
If an index value is not found in either the DataFrame's columns or the Series's index, the two objects are reindexed to form the union.
>>> series2=Series(range(3),index=['b','e','f'])
>>> series2
b    0
e    1
f    2
dtype: int64
>>> frame+series2
          b   d     e   f
Utah    0.0 NaN   3.0 NaN
Ohio    3.0 NaN   6.0 NaN
Texas   6.0 NaN   9.0 NaN
Oregon  9.0 NaN  12.0 NaN
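To match on the rows and broadcast across the columns instead, you have to use one of the arithmetic methods and pass the axis to match on.
>>> series3=frame['d']
>>> frame.sub(series3,axis=0)
          b    d    e
Utah   -1.0  0.0  1.0
Ohio   -1.0  0.0  1.0
Texas  -1.0  0.0  1.0
Oregon -1.0  0.0  1.0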
Many of the most common array statistics are implemented as DataFrame methods. For anything else, apply runs a function on each column (or row), and that function may return a Series.
>>> frame=DataFrame(np.random.randn(4,3),columns=list('bde'),
...                 index=['Utah','Ohio','Texas','Oregon'])
>>> frame
               b         d         e
Utah   -1.120701 -0.772813 -1.183221
Ohio   -0.690566  0.610834  0.382371
Texas   0.287303 -0.001705 -1.055101
Oregon  1.149945  1.056177 -0.178909
>>> def f(x):
...     return Series([x.min(),x.max()],index=['min','max'])
...
>>> frame.apply(f)
            b         d         e
min -1.120701 -0.772813 -1.183221
max  1.149945  1.056177  0.382371
Element-wise functions work too; for example, applymap can build a formatted string from each floating-point value in frame.
>>> format=lambda x:'%.2f' % x
>>> frame.applymap(format)
            b      d      e
Utah    -1.12  -0.77  -1.18
Ohio    -0.69   0.61   0.38
Texas    0.29  -0.00  -1.06
Oregon   1.15   1.06  -0.18
Series has a map method for applying an element-wise function.
>>> frame['e'].map(format)
Utah      -1.18
Ohio       0.38
Texas     -1.06
Oregon    -0.18
Name: e, dtype: object
>>> obj=Series(range(4),index=['d','a','b','c'])
>>> obj.sort_index()
a    1
b    2
c    3
d    0
dtype: int64
With a DataFrame, you can sort by the index on either axis.
Data is sorted in ascending order by default, but it can be sorted in descending order too.
>>> frame=DataFrame(np.arange(8).reshape((2,4)),index=['three','one'],
...                 columns=['d','a','b','c'])
>>> frame.sort_index(axis=1)
       a  b  c  d
three  1  2  3  0
one    5  6  7  4
>>> frame.sort_index(axis=1,ascending=False)
       d  c  b  a
three  0  3  2  1
one    4  7  6  5
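To sort a Series by its values, use its sort_values method (the older order method has been removed from pandas).
>>> obj=Series([4,7,-3,2])
>>> obj.sort_values()
2   -3
3    2
0    4
1    7
dtype: int64
When sorting, any missing values are placed at the end of the Series by default.
On a DataFrame, sort_values takes the column or columns to sort by in the by option.
>>> frame=DataFrame({'b':[4,7,-3,2],'a':[0,1,0,1]})
>>> frame
   a  b
0  0  4
1  1  7
2  0 -3
3  1  2
>>> frame.sort_values(by='b')
   a  b
2  0 -3
3  1  2
0  0  4
1  1  7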
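Ranking assigns ranks from 1 up to the number of valid data points and breaks ties according to a rule; by default, each tied group gets its mean rank.
>>> obj=Series([7,-5,7,4,2,0,4])
>>> obj
0    7
1   -5
2    7
3    4
4    2
5    0
6    4
dtype: int64
>>> obj.rank()
0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64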
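The method options for breaking ties when ranking:
average    Default: assign the average rank to each entry in the tied group
min        Use the minimum rank for the whole group
max        Use the maximum rank for the whole group
first      Assign ranks in the order the values appear in the data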
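As a quick side-by-side of the four options, the sketch below ranks the same values with each method and collects the results in one DataFrame. The frame name ranks is just for illustration.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# One column per tie-breaking method, so the differences line up row by row.
ranks = pd.DataFrame({
    'average': obj.rank(),
    'min': obj.rank(method='min'),
    'max': obj.rank(method='max'),
    'first': obj.rank(method='first'),
})
print(ranks)
```

The rows for the tied values (the two 7s and the two 4s) are the only ones where the columns disagree.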
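Many pandas functions (for example reindex) require the labels to be unique, but uniqueness is not enforced.
The index's is_unique attribute tells you whether its labels are unique.
>>> obj=Series(range(5),index=['a','a','b','b','c'])
>>> obj.index.is_unique
False
Indexing a label that has multiple entries returns a Series:
>>> obj['a']
a    0
a    1
dtype: int64
A label with a single entry returns a scalar value:
>>> obj['c']
4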
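sum returns the column sums; passing axis=1 sums across each row instead.
NA values are excluded automatically unless the entire slice (row or column) is NA.
This behavior can be disabled with the skipna option, for example df.mean(axis=1,skipna=False).
describe produces multiple summary statistics in one shot.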
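A minimal sketch of these reductions on a small frame with missing values. The data and the names col_sums, row_means, strict_means and summary are invented for the example; mean is used for the row-wise case because it shows the NA handling most clearly.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'], columns=['one', 'two'])

col_sums = df.sum()                            # one value per column, NA skipped
row_means = df.mean(axis=1)                    # one value per row; all-NA row 'c' stays NaN
strict_means = df.mean(axis=1, skipna=False)   # any NA in a row makes the result NaN
summary = df.describe()                        # count, mean, std, min, quartiles, max

print(summary)
```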
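Another group of methods extracts information about the values contained in a one-dimensional Series. unique returns an array of the distinct values.
>>> obj=Series(['c','a','d','a','a','b','b','c','c'])
>>> uniques=obj.unique()
>>> uniques
array(['c', 'a', 'd', 'b'], dtype=object)
value_counts computes how often each value occurs in a Series.
>>> obj.value_counts()
c    3
a    3
b    2
d    1
dtype: int64
isin performs a vectorized set-membership check and can be used to select a subset of the data in a Series or a DataFrame column.
>>> mask=obj.isin(['b','c'])
>>> mask
0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool
>>> obj[mask]
0    c
5    b
6    b
7    c
8    c
dtype: object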
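Missing data is common in most data analysis applications, and one of the design goals of pandas is to make working with it as painless as possible.
Python's built-in None value is also treated as NA.
>>> from numpy import nan as NA
>>> data=Series([1,NA,3.5,NA,7])
>>> data
0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64
>>> data.dropna()
0    1.0
2    3.5
4    7.0
dtype: float64
The same result can be obtained with boolean indexing.
>>> data[data.notnull()]
0    1.0
2    3.5
4    7.0
dtype: float64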
With DataFrame, dropna by default drops any row that contains a missing value.
Passing how='all' drops only the rows that are entirely NA; passing axis=1 applies the same logic to columns.
>>> data=DataFrame([[1.,6.5,3.],[1.,NA,NA],[NA,NA,NA],[NA,6.5,3.]])
>>> data
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
>>> data.dropna(how='all')
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0
>>> data[4]=NA
>>> data
     0    1    2   4
0  1.0  6.5  3.0 NaN
1  1.0  NaN  NaN NaN
2  NaN  NaN  NaN NaN
3  NaN  6.5  3.0 NaN
>>> data.dropna(axis=1,how='all')
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
With time series data, you may want to keep only the rows that contain a certain number of observations.
>>> df=DataFrame(np.random.randn(7,3))
>>> df
          0         1         2
0  1.367974 -0.556556  0.679336
1 -0.480919 -1.535185 -0.299710
2  0.230583  0.140626  0.604209
3  0.437830 -0.467286 -0.859989
4 -0.254706 -0.227431 -0.956299
5  0.966204 -2.010860 -0.010693
6 -0.673721  1.497827 -0.257273
>>> df.ix[:4,1]=NA
>>> df.ix[:2,2]=NA
>>> df
          0         1         2
0  1.367974       NaN       NaN
1 -0.480919       NaN       NaN
2  0.230583       NaN       NaN
3  0.437830       NaN -0.859989
4 -0.254706       NaN -0.956299
5  0.966204 -2.010860 -0.010693
6 -0.673721  1.497827 -0.257273
With thresh=3, a row is kept only if it has at least 3 non-NA values.
>>> df.dropna(thresh=3)
          0         1         2
5  0.966204 -2.010860 -0.010693
6 -0.673721  1.497827 -0.257273
fillna is the workhorse function for filling in missing data.
>>> df.fillna(0)
          0         1         2
0  1.367974  0.000000  0.000000
1 -0.480919  0.000000  0.000000
2  0.230583  0.000000  0.000000
3  0.437830  0.000000 -0.859989
4 -0.254706  0.000000 -0.956299
5  0.966204 -2.010860 -0.010693
6 -0.673721  1.497827 -0.257273
Calling fillna with a dict fills each column with a different value. A key that matches no column (3 here) has no effect, so column 2 keeps its NaN values.
>>> df.fillna({1:0.5,3:-1})
          0         1         2
0  1.367974  0.500000       NaN
1 -0.480919  0.500000       NaN
2  0.230583  0.500000       NaN
3  0.437830  0.500000 -0.859989
4 -0.254706  0.500000 -0.956299
5  0.966204 -2.010860 -0.010693
6 -0.673721  1.497827 -0.257273
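fillna returns a new object by default, but it can also modify the existing object in place.
In the pandas version used here, the in-place call returns a reference to the filled object; recent releases return None instead.
>>> _=df.fillna(0,inplace=True)
>>> df
          0         1         2
0  1.367974  0.000000  0.000000
1 -0.480919  0.000000  0.000000
2  0.230583  0.000000  0.000000
3  0.437830  0.000000 -0.859989
4 -0.254706  0.000000 -0.956299
5  0.966204 -2.010860 -0.010693
6 -0.673721  1.497827 -0.257273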
The interpolation methods available for reindex can also be used with fillna.
>>> df=DataFrame(np.random.randn(6,3))
>>> df.ix[2:,1]=NA;df.ix[4:,2]=NA
>>> df
          0         1         2
0  0.647866  0.891312 -0.211922
1 -1.455856 -0.629213 -1.043685
2  2.078467       NaN -0.067846
3 -0.223047       NaN  0.513800
4  0.306559       NaN       NaN
5  0.404265       NaN       NaN
ffill fills each gap with the nearest preceding value in the same column, working down the rows; limit caps the number of consecutive values filled.
>>> df.fillna(method='ffill')
          0         1         2
0  0.647866  0.891312 -0.211922
1 -1.455856 -0.629213 -1.043685
2  2.078467 -0.629213 -0.067846
3 -0.223047 -0.629213  0.513800
4  0.306559 -0.629213  0.513800
5  0.404265 -0.629213  0.513800
>>> df.fillna(method='ffill',limit=2)
          0         1         2
0  0.647866  0.891312 -0.211922
1 -1.455856 -0.629213 -1.043685
2  2.078467 -0.629213 -0.067846
3 -0.223047 -0.629213  0.513800
4  0.306559       NaN  0.513800
5  0.404265       NaN  0.513800
Hierarchical indexing
It lets a single axis have multiple (two or more) index levels.
>>> data=Series(np.random.randn(10),
...             index=[['a','a','a','b','b','b','c','c','d','d'],
...                    [1,2,3,1,2,3,1,2,2,3]])
>>> data
a  1   -0.521370
   2    0.658209
   3    0.841101
b  1    0.354237
   2   -0.426983
   3    0.835357
c  1   -0.246308
   2    0.709859
d  2   -1.215098
   3    0.400793
dtype: float64
This is the formatted output of a Series with a MultiIndex as its index. Indexing with an outer-level label selects the matching subset.
>>> data.index
MultiIndex(levels=[[u'a', u'b', u'c', u'd'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 1, 2]])
>>> data['b']
1    0.354237
2   -0.426983
3    0.835357
dtype: float64
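Selection on the inner level is also possible.
>>> data[:,2]
a    0.658209
b   -0.426983
c    0.709859
d   -1.215098
dtype: float64
Hierarchical indexing plays an important role in reshaping data and in group-based operations such as building a pivot table. For example, unstack rearranges the data into a DataFrame.
>>> data.unstack()
          1         2         3
a -0.521370  0.658209  0.841101
b  0.354237 -0.426983  0.835357
c -0.246308  0.709859       NaN
d       NaN -1.215098  0.400793
The inverse operation of unstack is stack.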
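A small round-trip sketch: unstack moves the inner index level into the columns, and stack moves it back. The four values are made up, and the example deliberately has no missing combinations so the round trip is exact.

```python
import pandas as pd

data = pd.Series([1.0, 2.0, 3.0, 4.0],
                 index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]])

wide = data.unstack()     # inner level (1, 2) becomes the columns
restored = wide.stack()   # stack pivots the columns back into the index

print(wide)
print(restored)
```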
With a DataFrame, either axis can have a hierarchical index.
>>> frame=DataFrame(np.arange(12).reshape((4,3)),
...                 index=[['a','a','b','b'],[1,2,1,2]],
...                 columns=[['Ohio','Ohio','Colorado'],['Green','Red','Green']])
>>> frame
     Ohio     Colorado
    Green Red    Green
a 1     0   1        2
  2     3   4        5
b 1     6   7        8
  2     9  10       11
Each level can have a name (a string or any other Python object).
>>> frame.index.names=['key1','key2']
>>> frame
           Ohio     Colorado
          Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11
>>> frame.columns.names=['state','color']
>>> frame
state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11
With partial column indexing, you can easily select groups of columns.
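For instance, indexing with just the outer column label returns every column under it. This is a minimal sketch on the same 4x3 frame; the name ohio is illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])

# Partial indexing: only the outer level is given, so the result keeps
# the inner level ('Green', 'Red') as its columns.
ohio = frame['Ohio']
print(ohio)
```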
A MultiIndex can be created on its own and then reused.
>>> pd.MultiIndex.from_arrays([['Ohio','Ohio','Colorado'],['Green','Red','Green']],
...                           names=['state','color'])
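swaplevel takes two level numbers or names and returns a new object with those levels interchanged (the data itself is unchanged).
>>> frame.swaplevel('key1','key2')
state      Ohio     Colorado
color     Green Red    Green
key2 key1
1    a        0   1        2
2    a        3   4        5
1    b        6   7        8
2    b        9  10       11
sortlevel sorts the data by the values in a single level (later pandas versions use sort_index with a level argument for this).
>>> frame
state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11
>>> frame.sortlevel(1)
state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
b    1        6   7        8
a    2        3   4        5
b    2        9  10       11
>>> frame.swaplevel(0,1).sortlevel(0)
state      Ohio     Colorado
color     Green Red    Green
key2 key1
1    a        0   1        2
     b        6   7        8
2    a        3   4        5
     b        9  10       11
On a hierarchically indexed object, data selection performs much better when the index is sorted lexicographically from the outermost level inward, which is the result of calling sortlevel(0) or sort_index().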
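Because sortlevel is not available in current pandas, the following sketch shows the same two sorts with sort_index(level=...). It assumes a recent pandas install; the plain column names 'x', 'y', 'z' are placeholders chosen to keep the example short.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=['x', 'y', 'z'])   # placeholder column names
frame.index.names = ['key1', 'key2']

# Sort by the inner level (key2); remaining levels break ties.
by_inner = frame.sort_index(level=1)

# Swap the levels, then sort from the new outermost level inward,
# which leaves the index lexicographically sorted.
swapped = frame.swaplevel(0, 1).sort_index(level=0)

print(by_inner)
print(swapped.index.is_monotonic_increasing)
```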
Many descriptive and summary statistics on DataFrame and Series have a level option that specifies the level to aggregate by on a particular axis.
>>> frame.sum(level='key2')
state  Ohio     Colorado
color Green Red    Green
key2
1         6   8       10
2        12  14       16
>>> frame.sum(level='color',axis=1)
color      Green  Red
key1 key2
a    1         2    1
     2         8    4
b    1        14    7
     2        20   10
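Later pandas releases dropped the level argument from sum and similar reductions. A hedged sketch of an equivalent using groupby(level=...) follows; for the column-wise case it transposes, groups, and transposes back, which is one workable approach rather than the only one. The names by_key2 and by_color are illustrative.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

# Sum rows that share the same key2 value.
by_key2 = frame.groupby(level='key2').sum()

# Sum columns that share the same color: group the transposed frame
# on the 'color' level, then transpose the result back.
by_color = frame.T.groupby(level='color').sum().T

print(by_key2)
print(by_color)
```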
You may want to use one or more of a DataFrame's columns as the row index, or move the row index into the DataFrame's columns.
>>> frame=DataFrame({'a':range(7),'b':range(7,0,-1),
...                  'c':['one','one','one','two','two','two','two'],
...                  'd':[0,1,2,0,1,2,3]})
>>> frame
   a  b    c  d
0  0  7  one  0
1  1  6  one  1
2  2  5  one  2
3  3  4  two  0
4  4  3  two  1
5  5  2  two  2
6  6  1  two  3
The set_index() function turns one or more columns into the row index and creates a new DataFrame.
>>> frame2=frame.set_index(['c','d'])
>>> frame2
       a  b
c   d
one 0  0  7
    1  1  6
    2  2  5
two 0  3  4
    1  4  3
    2  5  2
    3  6  1
By default those columns are removed from the DataFrame, but they can be kept by passing drop=False.
>>> frame.set_index(['c','d'],drop=False)
       a  b    c  d
c   d
one 0  0  7  one  0
    1  1  6  one  1
    2  2  5  one  2
two 0  3  4  two  0
    1  4  3  two  1
    2  5  2  two  2
    3  6  1  two  3
reset_index does the opposite: the levels of the hierarchical index are moved into the columns.
>>> frame2.reset_index()
     c  d  a  b
0  one  0  0  7
1  one  1  1  6
2  one  2  2  5
3  two  0  3  4
4  two  1  4  3
5  two  2  5  2
6  two  3  6  1