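Getting Started with pandas

Introduction to pandas Data Structures

pandas provides two main data structures: Series and DataFrame.

Series

A Series is similar to a one-dimensional array: it holds a sequence of values together with an index. If no index is given, a default integer index is created.
The values and the index are available through the values and index attributes.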

from pandas import Series, DataFrame
import pandas as pd
import numpy as np
obj=Series([4,7,-5,3])
obj
0    4
1    7
2   -5
3    3
dtype: int64
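
The values and index attributes mentioned above are not shown in the session, so here is a minimal sketch of both. It rebuilds the same Series and uses nothing beyond standard pandas attributes.

```python
import pandas as pd

obj = pd.Series([4, 7, -5, 3])

# .values returns the underlying NumPy array
print(obj.values)

# .index returns the index object; with no explicit index it is a RangeIndex
print(obj.index)
```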

To use a custom index, pass it as the index argument. In the example below, index is a list of labels:

obj2=Series([4,7,-5,3],index=['b','d','a','c'])
obj2
b    4
d    7
a   -5
c    3
dtype: int64

Index labels select a single value or a set of values from a Series. The argument is either one label or a list of labels:

obj2[['a','b','c']]
a   -5
b    4
c    3
dtype: int64

A Series behaves much like a dict, because it maps index labels to values. A Series can also be created directly from a dict.

'b' in obj2
True
sdata={'Ohio':35000,'Texas':71000,'Oregon':16000,'Utah':5000}
obj3=Series(sdata)
obj3
Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

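In the example above only a dict is passed, so the index of the Series is taken from the dict's keys. If an explicit index is passed as well, labels that are not keys of the dict get NaN, and keys that are not in the index are dropped. Handling NaN is covered in detail later.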

states=['California','Ohio','Oregon','Texas']
obj4=Series(sdata,index=states)
obj4
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64
pd.isnull(obj4)
California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

DataFrame

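A DataFrame is a tabular data structure containing an ordered collection of columns, each of which can hold a different value type.
A DataFrame has both a row index and a column index.
The most common way to build a DataFrame is to pass a dict of equal-length lists or NumPy arrays: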

data={
    'state':['Ohio','Ohio','Ohio','Nevada','Nevada'],
    'year':[2000,2001,2002,2001,2002],
    'pop':[1.5,1.7,3.6,2.4,2.9]
}
frame=DataFrame(data)
frame
    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9

If a sequence of columns is specified, the DataFrame's columns are arranged in that order:

DataFrame(data,columns=['year','state','pop'])
   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9

If a requested column is not present in the data, it is filled with NA values:

frame2=DataFrame(data,columns=['year','state','pop','debt'],
                 index=['one','two','three','four','five'])
frame2
       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN

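A column can be retrieved as a Series either with dict-style notation, frame2['state'], or by attribute, frame2.year.
Rows are retrieved by label with the loc indexer, for example frame2.loc['three']. The ix indexer used in older tutorials was deprecated in pandas 0.20 and removed in pandas 1.0.

frame2.year
one      2000
two      2001
three    2002
four     2001
five     2002
Name: year, dtype: int64
frame2.loc['three']
year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object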
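
Since ix is no longer available in current pandas, the following sketch shows the two indexers that replace it for row selection. The DataFrame is rebuilt from the same data as above, without the debt column to keep the example short.

```python
import pandas as pd

data = {
    'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
    'year': [2000, 2001, 2002, 2001, 2002],
    'pop': [1.5, 1.7, 3.6, 2.4, 2.9],
}
frame2 = pd.DataFrame(data, index=['one', 'two', 'three', 'four', 'five'])

by_label = frame2.loc['three']     # row selected by its index label
by_position = frame2.iloc[2]       # the same row selected by integer position

print(by_label)
print(by_position)

# Several rows and a single column in one call
print(frame2.loc[['one', 'five'], 'pop'])
```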

Columns can be modified by assignment. When a list or array is assigned to a column, its length must match the length of the DataFrame. When a Series is assigned, its labels are aligned exactly to the DataFrame's index, and every position without a match is filled with a missing value:

frame2['debt']=16.5
frame2
       year   state  pop  debt
one    2000    Ohio  1.5  16.5
two    2001    Ohio  1.7  16.5
three  2002    Ohio  3.6  16.5
four   2001  Nevada  2.4  16.5
five   2002  Nevada  2.9  16.5
frame2['debt']=np.arange(5.)
frame2
       year   state  pop  debt
one    2000    Ohio  1.5   0.0
two    2001    Ohio  1.7   1.0
three  2002    Ohio  3.6   2.0
four   2001  Nevada  2.4   3.0
five   2002  Nevada  2.9   4.0
val=Series([-1.2,-1.5,-1.7],
          index=['two','four','five'])
frame2['debt']=val
frame2
       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7

Assigning to a column that does not exist creates a new column. The del keyword deletes a column:

frame2['eastern']=frame2.state=='Ohio'
frame2
       year   state  pop  debt  eastern
one    2000    Ohio  1.5   NaN     True
two    2001    Ohio  1.7  -1.2     True
three  2002    Ohio  3.6   NaN     True
four   2001  Nevada  2.4  -1.5    False
five   2002  Nevada  2.9  -1.7    False
del frame2['eastern']
frame2.columns
Index(['year', 'state', 'pop', 'debt'], dtype='object')

If a nested dict is used to create a DataFrame, the keys of the outer dict become the columns and the keys of the inner dicts become the row index:

pop={
    'Nevada':{2001:2.4,2002:2.9},
    'Ohio':{2000:1.5,2001:1.7,2002:3.6}
}
frame3=DataFrame(pop)
frame3
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6
frame3.T
        2000  2001  2002
Nevada   NaN   2.4   2.9
Ohio     1.5   1.7   3.6
DataFrame(pop,index=[2001,2002,2003])
      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2003     NaN   NaN

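Index Objects

When a Series or DataFrame is built, any array or other sequence of labels is converted to an Index object. Index objects are immutable, which is what makes it safe to share one Index among several data structures.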

index=pd.Index(np.arange(3))
obj2=Series([1.5,-2.5,0],index=index)
obj2.index is index
True

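Index methods and properties:
append, concatenates another Index object and produces a new Index
difference, computes the set difference and returns it as an Index (this method was called diff in very old pandas versions)
delete, removes the element at position i and returns a new Index
drop, removes the passed values and returns a new Index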
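
The four methods listed above can be tried on a small Index. This is only a sketch; the labels are made up for illustration.

```python
import pandas as pd

idx = pd.Index(['a', 'b', 'c'])
other = pd.Index(['c', 'd'])

print(idx.append(other))       # concatenation, duplicates are kept
print(idx.difference(other))   # labels in idx that are not in other
print(idx.delete(0))           # remove the element at position 0
print(idx.drop(['b']))         # remove elements by value

# Index objects are immutable, so item assignment raises TypeError
try:
    idx[0] = 'z'
except TypeError as exc:
    print('TypeError:', exc)
```

Each call returns a new Index and leaves idx unchanged.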

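Essential Functionality

Reindexing

The reindex method creates a new object that conforms to a new index.
Calling reindex on a Series rearranges the data according to the new index. If an index value is not already present, a missing value is introduced, or the value given by fill_value if that option is set.
The method option fills values by interpolation while reindexing: ffill or pad fills forward, bfill or backfill fills backward.
For example: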

obj=Series([4.5,7.2,-5.3,3.6],index=['d','b','a','c'])
obj2=obj.reindex(['a','b','c','d','e'],fill_value=0)
obj2
a   -5.3
b    7.2
c    3.6
d    4.5
e    0.0
dtype: float64
obj3=Series(['blue','purple','yellow'],index=[0,2,4])
obj3.reindex(range(6),method='ffill')
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

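The columns keyword reindexes the columns. Interpolation with method is meant for the rows, that is, the index direction, so the columns are reindexed in a separate step here.

#assumes frame has a sorted label index such as ['a','c','d']; ffill requires a monotonic index
frame.reindex(index=['a','b','c','d'],method='ffill').reindex(columns=states)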
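
The frame defined earlier in this post has an integer index, so the call above does not apply to it directly. The sketch below uses a hypothetical frame and a hypothetical states list, both chosen only for illustration, and fills along the rows before reindexing the columns.

```python
import numpy as np
import pandas as pd

# Hypothetical frame and column list, chosen only for illustration
frame = pd.DataFrame(np.arange(9).reshape(3, 3),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])
states = ['Texas', 'Utah', 'California']

# Forward-fill along the rows; the existing index must be monotonic for this
filled = frame.reindex(index=['a', 'b', 'c', 'd'], method='ffill')

# Then reindex the columns; 'Utah' is new, so it becomes an all-NaN column
result = filled.reindex(columns=states)
print(result)
```

Row b did not exist, so it is filled with the values of row a.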

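Dropping Entries from an Axis

The drop method takes a single index label or a list of labels and returns a new object with those entries removed.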

obj=Series(np.arange(5.),index=['a','b','c','d','e'])
new_obj=obj.drop(['b','c'])
new_obj
a    0.0
d    3.0
e    4.0
dtype: float64
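
drop works on a DataFrame as well, on either axis. A minimal sketch, using a small DataFrame built only for this example:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape(4, 4),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

rows_dropped = data.drop(['Colorado', 'Ohio'])      # axis=0 (rows) is the default
cols_dropped = data.drop(columns=['two', 'four'])   # same as drop([...], axis=1)

print(rows_dropped)
print(cols_dropped)
print(data.shape)   # the original object is not modified
```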

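Indexing, Selection, and Filtering

Indexing a Series works much like indexing a NumPy array, except that index labels can be used and not only integers. For example: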

obj=Series(np.arange(4.),index=['a','b','c','d'])
obj['b']
1.0
obj.iloc[1]#select by integer position
1.0
obj[2:4]#integer slices are positional and exclude the endpoint
c    2.0
d    3.0
dtype: float64
obj[['b','a','d']]
b    1.0
a    0.0
d    3.0
dtype: float64
obj.iloc[[1,3]]
b    1.0
d    3.0
dtype: float64
obj[obj>2]
d    3.0
dtype: float64
obj['b':'c']#label-based slices include the endpoint
b    1.0
c    2.0
dtype: float64
obj['b':'c']=5#assigning to a slice sets the selected values
obj
a    0.0
b    5.0
c    5.0
d    3.0
dtype: float64

Indexing a DataFrame with a column label or a list of labels retrieves one or more columns:

data=DataFrame(np.arange(16).reshape(4,4),
              index=['Ohio','Colorado','Utah','New York'],
              columns=['one','two','three','four'])
data
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
data['two']#select the column labeled two
Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int32
data[:2]#select the first two rows
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
data[data['three']>5]#select the rows where column three is greater than 5
          one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
data<5#elementwise comparison of every value with 5, giving a boolean DataFrame
            one    two  three   four
Ohio       True   True   True   True
Colorado   True  False  False  False
Utah      False  False  False  False
New York  False  False  False  False
data[data<5]=0#set every element smaller than 5 to 0
data
          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15

To select rows of a DataFrame, use the dedicated indexers: .loc selects by label and .iloc selects by integer position.

data.loc['Colorado',['two','three']]
two      5
three    6
Name: Colorado, dtype: int32
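
Only .loc is shown above, so here is a short sketch that puts .loc and .iloc side by side. It rebuilds the original 4x4 DataFrame (before any values were set to 0) so that it can run on its own.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape(4, 4),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# Label-based: row 'Colorado', columns 'two' and 'three'
by_label = data.loc['Colorado', ['two', 'three']]

# Position-based: second row, second and third columns
by_position = data.iloc[1, [1, 2]]

print(by_label.tolist())      # [5, 6]
print(by_position.tolist())   # [5, 6]

# Label slices include the endpoint, position slices do not
print(data.loc[:'Utah', 'two'].tolist())   # [1, 5, 9]
print(data.iloc[:2, 1].tolist())           # [1, 5]
```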

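Operations between DataFrame and Series

arr=np.arange(12.).reshape(3,4)
arr-arr[0]
#NumPy broadcasts arr[0] across every row of arr. By default, arithmetic between a DataFrame and a Series works the same way: the Series index is matched against the DataFrame's columns, and the operation is broadcast down the rows.
array([[ 0.,  0.,  0.,  0.],
       [ 4.,  4.,  4.,  4.],
       [ 8.,  8.,  8.,  8.]])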
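
The session above only shows the NumPy case. The following sketch applies the same idea to a DataFrame and a Series, using a small frame made up for this example. It also shows the sub method with axis='index', which matches on the rows instead.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape(4, 3),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# Default: the Series index is matched on the columns and broadcast down the rows
first_row = frame.iloc[0]
row_diff = frame - first_row
print(row_diff)

# To match on the row index instead, use an arithmetic method with axis='index'
col_d = frame['d']
col_diff = frame.sub(col_d, axis='index')
print(col_diff)
```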

Function Application and Mapping

frame=DataFrame(np.random.randn(4,3),
               columns=list('bde'),
               index=['Utah','Ohio','Texas','Oregon'])
np.abs(frame)
               b         d         e
Utah    0.855613  1.696205  0.503547
Ohio    1.086818  1.448180  1.568419
Texas   0.360607  0.674741  0.590972
Oregon  1.270708  0.461014  0.427092
f=lambda x: x.max()-x.min()
frame.apply(f)#the default axis=0 applies the function to each column, going down the rows; pass axis=1 to apply it to each row
b    0.910101
d    2.370946
e    2.071966
dtype: float64
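
To make the effect of axis easier to check by hand, here is a sketch with fixed numbers in place of random ones. The values are made up for illustration. It also shows Series.map, which applies a function to each element of a single column.

```python
import pandas as pd

frame = pd.DataFrame([[1.0, -2.0, 3.5],
                      [4.0, 0.5, -1.0]],
                     columns=list('bde'),
                     index=['Utah', 'Ohio'])

spread = lambda x: x.max() - x.min()

per_column = frame.apply(spread)          # axis=0: one result per column
per_row = frame.apply(spread, axis=1)     # axis=1: one result per row

print(per_column.tolist())   # [3.0, 2.5, 4.5]
print(per_row.tolist())      # [5.5, 5.0]

# Series.map applies a function to every element of one column
formatted = frame['e'].map(lambda v: f'{v:.2f}')
print(formatted.tolist())    # ['3.50', '-1.00']
```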

Sorting and Ranking

To sort by the row index or the column index, use the sort_index method:

obj=Series(range(4),index=['d','a','c','b'])
obj.sort_index()
#sort by index
a    1
b    3
c    2
d    0
dtype: int64
frame=DataFrame(np.arange(8).reshape(2,4),
               index=['three','one'],
               columns=['d','a','b','c'])
frame.sort_index()
#three was originally first; after sorting, one comes first. The default is axis=0, which sorts the row index. Pass ascending=False to sort in descending order
       d  a  b  c
one    4  5  6  7
three  0  1  2  3
frame.sort_index(axis=1)
       a  b  c  d
three  1  2  3  0
one    5  6  7  4

To sort a Series by its values, use the sort_values method:

obj=pd.Series(np.random.randn(8))
obj.sort_values()
6   -0.896499
2   -0.827439
3   -0.520070
5   -0.216063
7    0.353973
1    0.400870
0    0.902996
4    1.854120
dtype: float64
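
This section is titled sorting and ranking, but ranking is not demonstrated. The sketch below shows rank on a Series and sort_values on a DataFrame. The numbers are made up for illustration.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# rank() starts at 1; tied values share the mean of their ranks by default
mean_ranks = obj.rank()
print(mean_ranks.tolist())    # [6.5, 1.0, 6.5, 4.5, 3.0, 2.0, 4.5]

# method='first' breaks ties by order of appearance
first_ranks = obj.rank(method='first')
print(first_ranks.tolist())   # [6.0, 1.0, 7.0, 4.0, 3.0, 2.0, 5.0]

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

# Sort the rows by column a first, then by column b
sorted_frame = frame.sort_values(by=['a', 'b'])
print(sorted_frame)
```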

Summarizing and Computing Descriptive Statistics

df=DataFrame(np.arange(8.).reshape(4,2),
            index=['a','b','c','d'],
            columns=['one','two'])
df.sum()
#by default the sum runs down each column (axis=0); pass axis=1 to sum across each row. skipna=True, the default, excludes NA values automatically
one    12.0
two    16.0
dtype: float64
df.describe()
#computes summary statistics for a Series or for each column of a DataFrame
            one       two
count  4.000000  4.000000
mean   3.000000  4.000000
std    2.581989  2.581989
min    0.000000  1.000000
25%    1.500000  2.500000
50%    3.000000  4.000000
75%    4.500000  5.500000
max    6.000000  7.000000
df.cumsum()
#cumulative sum of the values
    one   two
a   0.0   1.0
b   2.0   4.0
c   6.0   9.0
d  12.0  16.0
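
The axis and skipna options mentioned in the comments matter most when NA values are present. Here is a sketch with a small DataFrame that contains some, with values made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])

row_sums = df.sum(axis=1)                     # across each row, NA values skipped
strict_means = df.mean(axis=1, skipna=False)  # any NA in a row gives NaN
max_labels = df.idxmax()                      # index label of each column's maximum

print(row_sums)
print(strict_means)
print(max_labels)
```

With the default skipna=True, the all-NA row c sums to 0.0 instead of NaN.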

Correlation and Covariance

from pandas_datareader import data as web
all_data={}
for ticker in ['AAPL','IBM','MSFT','GOOG']:
    all_data[ticker]=web.get_data_yahoo(ticker,'1/1/2000','1/1/2010')
price=DataFrame({tic:data['Adj Close'] for tic,data in all_data.items()})
volume=DataFrame({tic:data['Volume'] for tic,data in all_data.items()})

returns=price.pct_change()
returns.tail()
#This example could not be run: the Yahoo page was no longer reachable.
---------------------------------------------------------------------------

RemoteDataError                           Traceback (most recent call last)

<ipython-input-45-5ca20168c7a5> in <module>()
      2 all_data={}
      3 for ticker in ['AAPL','IBM','MSFT','GOOG']:
----> 4     all_data[ticker]=web.get_data_yahoo(ticker,'1/1/2000','1/1/2010')
      5 price=DataFrame({tic:data['Adj Close'] for tic,data in all_data.items()})
      6 volume=DataFrame({tic:data['Volume'] for tic,data in all_data.items()})


c:\py35\lib\site-packages\pandas_datareader\data.py in get_data_yahoo(*args, **kwargs)
     38 
     39 def get_data_yahoo(*args, **kwargs):
---> 40     return YahooDailyReader(*args, **kwargs).read()
     41 
     42 


c:\py35\lib\site-packages\pandas_datareader\yahoo\daily.py in read(self)
    113         """ read one data from specified URL """
    114         try:
--> 115             df = super(YahooDailyReader, self).read()
    116             if self.ret_index:
    117                 df['Ret_Index'] = _calc_return_index(df['Adj Close'])


c:\py35\lib\site-packages\pandas_datareader\base.py in read(self)
    179         if isinstance(self.symbols, (compat.string_types, int)):
    180             df = self._read_one_data(self.url,
--> 181                                      params=self._get_params(self.symbols))
    182         # Or multiple symbols, (e.g., ['GOOG', 'AAPL', 'MSFT'])
    183         elif isinstance(self.symbols, DataFrame):


c:\py35\lib\site-packages\pandas_datareader\base.py in _read_one_data(self, url, params)
     77         """ read one data from specified URL """
     78         if self._format == 'string':
---> 79             out = self._read_url_as_StringIO(url, params=params)
     80         elif self._format == 'json':
     81             out = self._get_response(url, params=params).json()


c:\py35\lib\site-packages\pandas_datareader\base.py in _read_url_as_StringIO(self, url, params)
     88         Open url (and retry)
     89         """
---> 90         response = self._get_response(url, params=params)
     91         text = self._sanitize_response(response)
     92         out = StringIO()


c:\py35\lib\site-packages\pandas_datareader\base.py in _get_response(self, url, params, headers)
    137         if params is not None and len(params) > 0:
    138             url = url + "?" + urlencode(params)
--> 139         raise RemoteDataError('Unable to read URL: {0}'.format(url))
    140 
    141     def _get_crumb(self, *args):


RemoteDataError: Unable to read URL: https://query1.finance.yahoo.com/v7/finance/download/IBM?crumb=%5Cu002FUftz31NJjj&period1=946656000&interval=1d&period2=1262361599&events=history
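
Because the download fails, the correlation and covariance calls can still be tried on synthetic data. The sketch below is an assumption-laden stand-in: the prices are random walks generated with NumPy, not real market data, and the ticker names are reused only as column labels.

```python
import numpy as np
import pandas as pd

# Synthetic prices (an assumption, NOT real market data): geometric random walks
rng = np.random.default_rng(0)
dates = pd.date_range('2009-01-01', periods=250, freq='B')
log_steps = rng.normal(0.0, 0.01, size=(250, 4))
price = pd.DataFrame(100 * np.exp(log_steps.cumsum(axis=0)),
                     index=dates,
                     columns=['AAPL', 'IBM', 'MSFT', 'GOOG'])

returns = price.pct_change()   # percent change from one row to the next

print(returns['MSFT'].corr(returns['IBM']))   # correlation of two columns
print(returns['MSFT'].cov(returns['IBM']))    # covariance of two columns

corr_matrix = returns.corr()                  # full correlation matrix
print(corr_matrix)

vs_ibm = returns.corrwith(returns['IBM'])     # every column against one Series
print(vs_ibm)
```

The first row of returns is NaN because it has no previous row, and corr and cov skip it.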

Handling Missing Data

from numpy import nan as NA
data=Series([1,NA,3.5,NA,7])
data.dropna()
#dropna returns a Series containing only the non-null values and their index labels
0    1.0
2    3.5
4    7.0
dtype: float64
data=DataFrame([
    [1.,6.5,3.],[1.,NA,NA],
    [NA,NA,NA],[NA,6.5,3.]
])
cleaned=data.dropna()#for a DataFrame, dropna by default drops every row that contains a missing value;
#passing how='all' drops only the rows that are entirely NA
data
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
cleaned
     0    1    2
0  1.0  6.5  3.0
data.fillna(0)
#fill in the missing values
     0    1    2
0  1.0  6.5  3.0
1  1.0  0.0  0.0
2  0.0  0.0  0.0
3  0.0  6.5  3.0
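
The how='all' option is mentioned in a comment above but not run. The sketch below runs it on the same data, together with a few other standard options: thresh, a per-column fill dict, and forward fill.

```python
import numpy as np
import pandas as pd

NA = np.nan
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
                     [NA, NA, NA], [NA, 6.5, 3.]])

all_na_dropped = data.dropna(how='all')   # drops only row 2, which is entirely NA
at_least_two = data.dropna(thresh=2)      # keeps rows with at least 2 non-NA values

# A dict gives a different fill value per column (keys are column labels)
filled = data.fillna({0: 0.0, 1: 0.5, 2: -1.0})

# Forward fill propagates the last valid value down each column
forward = data.ffill()

print(all_na_dropped)
print(at_least_two)
print(filled)
print(forward)
```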

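Hierarchical Indexing

Hierarchical indexing allows several index levels on one axis, which makes it possible to work with higher-dimensional data in a lower-dimensional form.
Take a Series as an example: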

data=Series(np.random.randn(10),
           index=[
               ['a','a','a','b','b','b','c','c','d','d'],
               [1,2,3,1,2,3,1,2,2,3]
           ])
data
#the index is a MultiIndex
a  1    0.704940
   2    1.034785
   3   -0.575555
b  1    1.465815
   2   -2.065133
   3   -0.191078
c  1    2.251724
   2   -1.282849
d  2    0.270976
   3    1.014202
dtype: float64
data['b']
1    1.465815
2   -2.065133
3   -0.191078
dtype: float64
data.unstack()
#a Series with a hierarchical index can be rearranged into a DataFrame with the unstack method; the inverse operation is stack
          1         2         3
a  0.704940  1.034785 -0.575555
b  1.465815 -2.065133 -0.191078
c  2.251724 -1.282849       NaN
d       NaN  0.270976  1.014202
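
The following sketch shows partial indexing on both levels and the round trip through unstack and stack. It uses the same index as above but fixed values (0.0 to 9.0) in place of random numbers, so the results can be checked by hand.

```python
import numpy as np
import pandas as pd

data = pd.Series(np.arange(10.),
                 index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])

outer_slice = data.loc['b':'c']   # slice on the outer level, endpoints included
inner_pick = data.loc[:, 2]       # inner-level label 2 from every outer group

print(outer_slice)
print(inner_pick)

wide = data.unstack()             # the inner level becomes the columns
print(wide)

# stack() reverses unstack(); dropna() removes the cells that unstack padded with NaN
restored = wide.stack().dropna()
print(restored)
```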

For a DataFrame, either axis can have a hierarchical index:

frame=DataFrame(np.arange(12).reshape(4,3),
               index=[
                   ['a','a','b','b'],[1,2,1,2]
               ],
               columns=[
                   ['Ohio','Ohio','Colorado'],['Green','Red','Green']
               ])
frame
     Ohio     Colorado
    Green Red    Green
a 1     0   1        2
  2     3   4        5
b 1     6   7        8
  2     9  10       11
frame.index.names=['key1','key2']
frame.columns.names=['state','color']
frame
#every level can have a name
state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11

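Reordering and Sorting Levels

frame.swaplevel('key1','key2')
#swaplevel takes two level numbers or names and returns a new object with the levels interchanged
frame.sort_index(level=1)
#sort_index can sort by the values in a single level
state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
b    1        6   7        8
a    2        3   4        5
b    2        9  10       11
frame.groupby(level='key2').sum()
#sum(level=...) was removed in pandas 2.0; groupby(level=...) gives the same result
state  Ohio     Colorado
color Green Red    Green
key2
1         6   8       10
2        12  14       16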
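
The steps in this section can be combined into one runnable sketch. It rebuilds the same frame and aggregates by level with groupby. Aggregating over a column level by transposing first is one possible approach, not the only one.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape(4, 3),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

# Swap the two row levels, then sort so the new outer level is in order
swapped = frame.swaplevel('key1', 'key2').sort_index(level=0)
print(swapped)

# Sum the rows that share the same key2 value
by_key2 = frame.groupby(level='key2').sum()
print(by_key2)

# Sum over a column level: transpose, group the rows, transpose back
by_color = frame.T.groupby(level='color').sum().T
print(by_color)
```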

If you found this interesting, you can follow my WeChat official account: 一步一步学Python
