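Introduction to pandas data structures
pandas has two main data structures: Series and DataFrame.
Series
A Series is similar to a one-dimensional array: it holds data together with an index. By default an integer index is created.
The data and the index are available through the values and index attributes.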
from pandas import Series, DataFrame
import pandas as pd
import numpy as np
obj=Series([4,7,-5,3])
obj
0 4
1 7
2 -5
3 3
dtype: int64
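As a small sketch of the values and index attributes mentioned above (the variable names values and index are only illustrative):

```python
import pandas as pd

obj = pd.Series([4, 7, -5, 3])

# The underlying data as a NumPy array
values = obj.values

# The index object; a default integer index here
index = obj.index

print(values)
print(index)
```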
To define a custom index, pass it as index, which is simply a list:
obj2=Series([4,7,-5,3],index=['b','d','a','c'])
obj2
b 4
d 7
a -5
c 3
dtype: int64
Select a single value or a group of values from a Series by index; the argument is one index label or a list of index labels:
obj2[['a','b','c']]
a -5
b 4
c 3
dtype: int64
A Series is similar to a dict in that the index maps to the data, so a Series can be created directly from a dict.
'b' in obj2
True
sdata={'Ohio':35000,'Texas':71000,'Oregon':16000,'Utah':5000}
obj3=Series(sdata)
obj3
Ohio 35000
Oregon 16000
Texas 71000
Utah 5000
dtype: int64
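In the example above only a dict is passed, so the Series index is the keys of the original dict. If an index is also passed and some of its labels are not among the keys, those labels get NaN. Handling NaN is covered in more detail later.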
states=['California','Ohio','Oregon','Texas']
obj4=Series(sdata,index=states)
obj4
California NaN
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
dtype: float64
pd.isnull(obj4)
California True
Ohio False
Oregon False
Texas False
dtype: bool
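A minimal sketch of how a missing-value mask like the one above can be used to pick out the missing or the present entries. isna and notna are the current pandas spellings of isnull and notnull; the variable names missing_labels and present are illustrative.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)

# Labels whose value is missing
missing_labels = obj4[obj4.isna()].index.tolist()

# Only the entries that have a value
present = obj4[obj4.notna()]

print(missing_labels)
print(present)
```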
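DataFrame
A DataFrame is a tabular data structure holding an ordered set of columns, each of which can have a different value type.
A DataFrame has both a row index and a column index.
A common way to build a DataFrame is to pass a dict of equal-length lists or NumPy arrays: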
data={
'state':['Ohio','Ohio','Ohio','Nevada','Nevada'],
'year':[2000,2001,2002,2001,2002],
'pop':[1.5,1.7,3.6,2.4,2.9]
}
frame=DataFrame(data)
frame
   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002
If a sequence of columns is specified, the DataFrame columns are arranged in that order:
DataFrame(data,columns=['year','state','pop'])
   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
If a requested column has no matching data, it is filled with NA values:
frame2=DataFrame(data,columns=['year','state','pop','debt'],
index=['one','two','three','four','five'])
frame2
       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN
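frame2['state'] or frame2.year returns a Series, that is, a single column.
A row is retrieved by label with the loc indexer, for example frame2.loc['three'] (the older ix indexer was removed in pandas 1.0).
frame2.year
one 2000
two 2001
three 2002
four 2001
five 2002
Name: year, dtype: int64
frame2.loc['three']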
year 2002
state Ohio
pop 3.6
debt NaN
Name: three, dtype: object
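Columns can be modified by assignment. When a list or array is assigned to a column, its length must match the length of the DataFrame. When a Series is assigned, it is aligned exactly to the DataFrame's index, and every position without a match is filled with a missing value: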
frame2['debt']=16.5
frame2
       year   state  pop  debt
one    2000    Ohio  1.5  16.5
two    2001    Ohio  1.7  16.5
three  2002    Ohio  3.6  16.5
four   2001  Nevada  2.4  16.5
five   2002  Nevada  2.9  16.5
frame2['debt']=np.arange(5.)
frame2
       year   state  pop  debt
one    2000    Ohio  1.5   0.0
two    2001    Ohio  1.7   1.0
three  2002    Ohio  3.6   2.0
four   2001  Nevada  2.4   3.0
five   2002  Nevada  2.9   4.0
val=Series([-1.2,-1.5,-1.7],
index=['two','four','five'])
frame2['debt']=val
frame2
       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7
Assigning to a column that does not exist creates a new column, and the del keyword deletes a column:
frame2['eastern']=frame2.state=='Ohio'
frame2
       year   state  pop  debt  eastern
one    2000    Ohio  1.5   NaN     True
two    2001    Ohio  1.7  -1.2     True
three  2002    Ohio  3.6   NaN     True
four   2001  Nevada  2.4  -1.5    False
five   2002  Nevada  2.9  -1.7    False
del frame2['eastern']
frame2.columns
Index(['year', 'state', 'pop', 'debt'], dtype='object')
If a DataFrame is created from a nested dict, the outer keys become the columns and the inner keys become the row index:
pop={
'Nevada':{2001:2.4,2002:2.9},
'Ohio':{2000:1.5,2001:1.7,2002:3.6}
}
frame3=DataFrame(pop)
frame3
      Nevada  Ohio
2000     NaN   1.5
2001     2.4   1.7
2002     2.9   3.6
frame3.T
        2000  2001  2002
Nevada   NaN   2.4   2.9
Ohio     1.5   1.7   3.6
DataFrame(pop,index=[2001,2002,2003])
      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2003     NaN   NaN
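Index objects
When a Series or DataFrame is built, any array or other sequence of labels is converted to an Index object. Index objects are immutable, which is what makes it safe to share them between data structures.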
index=pd.Index(np.arange(3))
obj2=Series([1.5,-2.5,0],index=index)
obj2.index is index
True
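A short sketch of what immutability means in practice: assigning to an element of an Index raises a TypeError. The flag name immutable is only illustrative.

```python
import numpy as np
import pandas as pd

index = pd.Index(np.arange(3))

# Attempting to change an element of an Index is rejected
try:
    index[1] = 10
    immutable = False
except TypeError:
    immutable = True

print(immutable)
```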
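Index methods and properties:
append: concatenate another Index object, producing a new Index
difference: compute the set difference as an Index (called diff in early pandas versions)
delete: remove the element at position i, returning a new Index
drop: remove the passed values, returning a new Index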
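The methods listed above can be tried on a small Index. This is a sketch with made-up labels; each call returns a new Index and leaves the original untouched.

```python
import pandas as pd

idx = pd.Index(['a', 'b', 'c'])

# append: concatenate another Index
appended = idx.append(pd.Index(['d']))

# difference: labels in idx that are not in the other Index
diff = idx.difference(pd.Index(['b']))

# delete: remove the element at position 0
deleted = idx.delete(0)

# drop: remove the given labels
dropped = idx.drop(['c'])

print(appended, diff, deleted, dropped, sep='\n')
```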
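Basic functionality
Reindexing
The reindex method creates a new object conformed to a new index.
Calling reindex on a Series rearranges the data according to the new index. If an index value is not present, a missing value is introduced, or the value given by fill_value.
The method option fills by interpolation: ffill or pad fills forward, bfill or backfill fills backward.
For example: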
obj=Series([4.5,7.2,-5.3,3.6],index=['d','b','a','c'])
obj2=obj.reindex(['a','b','c','d','e'],fill_value=0)
obj2
a -5.3
b 7.2
c 3.6
d 4.5
e 0.0
dtype: float64
obj3=Series(['blue','purple','yellow'],index=[0,2,4])
obj3.reindex(range(6),method='ffill')
0 blue
1 blue
2 purple
3 purple
4 yellow
5 yellow
dtype: object
The columns keyword reindexes the columns, but interpolation can only be applied along the rows, that is, in the index direction.
frame.reindex(index=['a','b','c','d'],method='ffill',columns=states)
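A minimal sketch of reindexing both axes of a DataFrame, assuming a small frame with string row labels (the frame and the states list here are made up for illustration). The forward fill is done as a separate step after the row reindex, so it is explicit that it runs down the rows only.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape(3, 3),
                     index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])
states = ['Texas', 'Utah', 'California']

# Reindex the rows; the new label 'b' is introduced as NaN, then forward-filled
filled = frame.reindex(index=['a', 'b', 'c', 'd']).ffill()

# Reindex the columns; 'Utah' has no data and stays NaN
result = filled.reindex(columns=states)

print(result)
```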
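Dropping entries from an axis
The drop method takes an index label or a list of labels and returns a new object with those entries removed.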
obj=Series(np.arange(5.),index=['a','b','c','d','e'])
new_obj=obj.drop(['b','c'])
new_obj
a 0.0
d 3.0
e 4.0
dtype: float64
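The same method works on a DataFrame. As a sketch on a made-up 4x4 frame, drop removes rows by default and removes columns when axis=1 is passed:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape(4, 4),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# Drop rows by label (axis=0 is the default)
rows_dropped = data.drop(['Colorado', 'Ohio'])

# Drop a column by label
cols_dropped = data.drop('two', axis=1)

print(rows_dropped)
print(cols_dropped)
```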
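Indexing, selection, and filtering
Series indexing is similar to NumPy array indexing, except that the index values do not have to be integers. For example: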
obj=Series(np.arange(4.),index=['a','b','c','d'])
obj['b']
1.0
obj[1]
1.0
obj[2:4]  # a positional slice excludes the endpoint
c 2.0
d 3.0
dtype: float64
obj[['b','a','d']]
b 1.0
a 0.0
d 3.0
dtype: float64
obj[[1,3]]
b 1.0
d 3.0
dtype: float64
obj[obj>2]
d 3.0
dtype: float64
obj['b':'c']  # a label slice includes the endpoint
b 1.0
c 2.0
dtype: float64
obj['b':'c']=5  # assigning through a slice sets the values
obj
a 0.0
b 5.0
c 5.0
d 3.0
dtype: float64
Indexing a DataFrame retrieves one or more columns:
data=DataFrame(np.arange(16).reshape(4,4),
index=['Ohio','Colorado','Utah','New York'],
columns=['one','two','three','four'])
data
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
data['two']  # select the column labeled two
Ohio 1
Colorado 5
Utah 9
New York 13
Name: two, dtype: int32
data[:2]  # select the first two rows
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
data[data['three']>5]  # select the rows where column three is greater than 5
          one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
data<5  # boolean comparison of every element with 5
            one    two  three   four
Ohio       True   True   True   True
Colorado   True  False  False  False
Utah      False  False  False  False
New York  False  False  False  False
data[data<5]=0  # set every element less than 5 to 0
data
          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
For label-based indexing on the rows of a DataFrame use the dedicated .loc indexer; for position-based indexing use .iloc.
data.loc['Colorado',['two','three']]
two 5
three 6
Name: Colorado, dtype: int32
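A sketch contrasting the two indexers on a fresh copy of the 4x4 frame (rebuilt here so the example is self-contained). The point to notice is that a .loc slice includes its end label while an .iloc slice excludes its end position.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape(4, 4),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# Same selection by label and by position
by_label = data.loc['Colorado', ['two', 'three']]
by_position = data.iloc[1, [1, 2]]

# Label slices include the endpoint, positional slices do not
label_slice = data.loc[:'Utah', 'two']
position_slice = data.iloc[:2, 1]

print(by_label)
print(label_slice)
print(position_slice)
```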
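Arithmetic between DataFrame and Series
arr=np.arange(12.).reshape(3,4)
arr-arr[0]
# By default, arithmetic between a DataFrame and a Series matches the Series index against the DataFrame columns and broadcasts down the rows, just as this NumPy subtraction is broadcast to every row.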
array([[ 0., 0., 0., 0.],
[ 4., 4., 4., 4.],
[ 8., 8., 8., 8.]])
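The NumPy example carries over to pandas. As a sketch with a made-up frame: subtracting a row Series matches on the columns and broadcasts down the rows, and the sub method with axis=0 matches on the row index instead.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape(4, 3),
                     columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

# Series indexed by column labels: subtracted from every row
row = frame.iloc[0]
by_columns = frame - row

# Series indexed by row labels: match on the rows with axis=0
col = frame['d']
by_rows = frame.sub(col, axis=0)

print(by_columns)
print(by_rows)
```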
Function application and mapping
frame=DataFrame(np.random.randn(4,3),
columns=list('bde'),
index=['Utah','Ohio','Texas','Oregon'])
np.abs(frame)
               b         d         e
Utah    0.855613  1.696205  0.503547
Ohio    1.086818  1.448180  1.568419
Texas   0.360607  0.674741  0.590972
Oregon  1.270708  0.461014  0.427092
f=lambda x: x.max()-x.min()
frame.apply(f)  # the default axis=0 applies the function down each column; pass axis=1 to apply it across each row
b 0.910101
d 2.370946
e 2.071966
dtype: float64
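Because the frame above is random, here is a sketch with fixed, made-up numbers so the results can be checked by hand. It applies the same max-minus-min function along both axes and uses Series.map for an element-wise transformation of one column.

```python
import pandas as pd

frame = pd.DataFrame([[1.0, -2.0, 3.0],
                      [4.0, 0.0, -1.0]],
                     columns=list('bde'),
                     index=['Utah', 'Ohio'])

spread = lambda x: x.max() - x.min()

# axis=0 (default): one result per column
per_column = frame.apply(spread)

# axis=1: one result per row
per_row = frame.apply(spread, axis=1)

# Element-wise function on a single column
formatted = frame['e'].map(lambda x: '%.2f' % x)

print(per_column)
print(per_row)
print(formatted)
```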
Sorting and ranking
To sort by the row or column index, use the sort_index method:
obj=Series(range(4),index=['d','a','c','b'])
obj.sort_index()
# sort by index
a 1
b 3
c 2
d 0
dtype: int64
frame=DataFrame(np.arange(8).reshape(2,4),
index=['three','one'],
columns=['d','a','b','c'])
frame.sort_index()
# three was on top originally and one is on top after sorting, so the default is to sort the row index (axis=0). Pass ascending=False to sort in descending order.
       d  a  b  c
one    4  5  6  7
three  0  1  2  3
frame.sort_index(axis=1)
       a  b  c  d
three  1  2  3  0
one    5  6  7  4
To sort a Series by value, use the sort_values method:
obj=pd.Series(np.random.randn(8))
obj.sort_values()
6 -0.896499
2 -0.827439
3 -0.520070
5 -0.216063
7 0.353973
1 0.400870
0 0.902996
4 1.854120
dtype: float64
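The section title also mentions ranking, so here is a short sketch with made-up numbers: rank assigns each value its position in sorted order (ties share the average rank by default), and sort_values on a DataFrame sorts the rows by a column named with by.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4])

# Ties (the two 7s) share the average of ranks 3 and 4
ranks = obj.rank()

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

# Sort the rows by the values in column b
sorted_frame = frame.sort_values(by='b')

print(ranks)
print(sorted_frame)
```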
Summarizing and computing descriptive statistics
df=DataFrame(np.arange(8.).reshape(4,2),
index=['a','b','c','d'],
columns=['one','two'])
df.sum()
# sums down each column by default (axis=0); pass axis=1 to sum across each row. skipna=True excludes NA values and is the default.
one 12.0
two 16.0
dtype: float64
df.describe()
# compute summary statistics for a Series or for each DataFrame column
            one       two
count  4.000000  4.000000
mean   3.000000  4.000000
std    2.581989  2.581989
min    0.000000  1.000000
25%    1.500000  2.500000
50%    3.000000  4.000000
75%    4.500000  5.500000
max    6.000000  7.000000
df.cumsum()
# cumulative sum of the values
    one   two
a   0.0   1.0
b   2.0   4.0
c   6.0   9.0
d  12.0  16.0
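The axis and skipna options only make a visible difference when the data contains NA values, so this sketch uses a made-up frame with some missing entries:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan],
                   [7.1, -4.5],
                   [np.nan, np.nan],
                   [0.75, -1.3]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['one', 'two'])

# Sum across each row; NA values are skipped, an all-NA row sums to 0
row_sums = df.sum(axis=1)

# With skipna=False any NA in a row makes the result NA
strict_means = df.mean(axis=1, skipna=False)

print(row_sums)
print(strict_means)
```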
Correlation and covariance
from pandas_datareader import data as web
all_data={}
for ticker in ['AAPL','IBM','MSFT','GOOG']:
all_data[ticker]=web.get_data_yahoo(ticker,'1/1/2000','1/1/2010')
price=DataFrame({tic:data['Adj Close'] for tic,data in all_data.items()})
volume=DataFrame({tic:data['Volume'] for tic,data in all_data.items()})
returns=price.pct_change()
returns.tail()
# This example is not demonstrated because the Yahoo page could not be opened.
---------------------------------------------------------------------------
RemoteDataError Traceback (most recent call last)
<ipython-input-45-5ca20168c7a5> in <module>()
2 all_data={}
3 for ticker in ['AAPL','IBM','MSFT','GOOG']:
----> 4 all_data[ticker]=web.get_data_yahoo(ticker,'1/1/2000','1/1/2010')
5 price=DataFrame({tic:data['Adj Close'] for tic,data in all_data.items()})
6 volume=DataFrame({tic:data['Volume'] for tic,data in all_data.items()})
c:\py35\lib\site-packages\pandas_datareader\data.py in get_data_yahoo(*args, **kwargs)
38
39 def get_data_yahoo(*args, **kwargs):
---> 40 return YahooDailyReader(*args, **kwargs).read()
41
42
c:\py35\lib\site-packages\pandas_datareader\yahoo\daily.py in read(self)
113 """ read one data from specified URL """
114 try:
--> 115 df = super(YahooDailyReader, self).read()
116 if self.ret_index:
117 df['Ret_Index'] = _calc_return_index(df['Adj Close'])
c:\py35\lib\site-packages\pandas_datareader\base.py in read(self)
179 if isinstance(self.symbols, (compat.string_types, int)):
180 df = self._read_one_data(self.url,
--> 181 params=self._get_params(self.symbols))
182 # Or multiple symbols, (e.g., ['GOOG', 'AAPL', 'MSFT'])
183 elif isinstance(self.symbols, DataFrame):
c:\py35\lib\site-packages\pandas_datareader\base.py in _read_one_data(self, url, params)
77 """ read one data from specified URL """
78 if self._format == 'string':
---> 79 out = self._read_url_as_StringIO(url, params=params)
80 elif self._format == 'json':
81 out = self._get_response(url, params=params).json()
c:\py35\lib\site-packages\pandas_datareader\base.py in _read_url_as_StringIO(self, url, params)
88 Open url (and retry)
89 """
---> 90 response = self._get_response(url, params=params)
91 text = self._sanitize_response(response)
92 out = StringIO()
c:\py35\lib\site-packages\pandas_datareader\base.py in _get_response(self, url, params, headers)
137 if params is not None and len(params) > 0:
138 url = url + "?" + urlencode(params)
--> 139 raise RemoteDataError('Unable to read URL: {0}'.format(url))
140
141 def _get_crumb(self, *args):
RemoteDataError: Unable to read URL: https://query1.finance.yahoo.com/v7/finance/download/IBM?crumb=%5Cu002FUftz31NJjj&period1=946656000&interval=1d&period2=1262361599&events=history
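Since the download fails, here is a sketch of the same calculation on synthetic data. The tickers AAA, BBB and CCC and the random-walk prices are made up for illustration and are not market data; only pct_change, corr and cov are the point.

```python
import numpy as np
import pandas as pd

# Synthetic random-walk "prices" for three hypothetical tickers
rng = np.random.default_rng(0)
prices = pd.DataFrame(100 + rng.standard_normal((250, 3)).cumsum(axis=0),
                      columns=['AAA', 'BBB', 'CCC'])

# Percentage change from one row to the next
returns = prices.pct_change()

# Pairwise correlation and covariance matrices of the columns
corr_matrix = returns.corr()
cov_matrix = returns.cov()

# Correlation between two individual Series
pair = returns['AAA'].corr(returns['BBB'])

print(corr_matrix)
print(cov_matrix)
```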
Handling missing data
from numpy import nan as NA
data=Series([1,NA,3.5,NA,7])
data.dropna()
# dropna returns a Series containing only the non-null data and their index values
0 1.0
2 3.5
4 7.0
dtype: float64
data=DataFrame([
[1.,6.5,3.],[1.,NA,NA],
[NA,NA,NA],[NA,6.5,3.]
])
cleaned=data.dropna()  # for a DataFrame, dropna drops any row containing a missing value by default;
# passing how='all' drops only the rows that are all NA
data
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
cleaned
     0    1    2
0  1.0  6.5  3.0
data.fillna(0)
# fill in missing data
     0    1    2
0  1.0  6.5  3.0
1  1.0  0.0  0.0
2  0.0  0.0  0.0
3  0.0  6.5  3.0
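A sketch of the options mentioned in the comments above, on the same small frame: the default dropna, dropna with how='all', and fillna with a dict that gives a different fill value per column (the fill values 0.5 and -1 are arbitrary).

```python
import pandas as pd
from numpy import nan as NA

data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
                     [NA, NA, NA], [NA, 6.5, 3.]])

# Default: drop every row that contains at least one NA
any_na_dropped = data.dropna()

# how='all': drop only the rows where every value is NA
all_na_dropped = data.dropna(how='all')

# A dict fills each listed column with its own value; column 0 is left alone
filled = data.fillna({1: 0.5, 2: -1})

print(any_na_dropped)
print(all_na_dropped)
print(filled)
```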
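Hierarchical indexing
Hierarchical indexing puts multiple index levels on one axis, which makes it possible to work with higher-dimensional data in a lower-dimensional form.
Take a Series as an example: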
data=Series(np.random.randn(10),
index=[
['a','a','a','b','b','b','c','c','d','d'],
[1,2,3,1,2,3,1,2,2,3]
])
data
# the index is a MultiIndex
a 1 0.704940
2 1.034785
3 -0.575555
b 1 1.465815
2 -2.065133
3 -0.191078
c 1 2.251724
2 -1.282849
d 2 0.270976
3 1.014202
dtype: float64
data['b']
1 1.465815
2 -2.065133
3 -0.191078
dtype: float64
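Selecting by the outer level as above is called partial indexing. As a sketch on the same index layout, with fixed values 0 to 9 instead of random numbers so the results can be checked, the inner level can be selected too:

```python
import numpy as np
import pandas as pd

data = pd.Series(np.arange(10.),
                 index=[['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])

# A range of outer labels; the label slice includes 'c'
outer_range = data.loc['b':'c']

# Every outer label, inner label 2
inner = data.loc[:, 2]

print(outer_range)
print(inner)
```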
data.unstack()
# a multi-level Series can be rearranged into a DataFrame with unstack; the inverse operation is stack
          1         2         3
a  0.704940  1.034785 -0.575555
b  1.465815 -2.065133 -0.191078
c  2.251724 -1.282849       NaN
d       NaN  0.270976  1.014202
For a DataFrame, either axis can have a hierarchical index:
frame=DataFrame(np.arange(12).reshape(4,3),
index=[
['a','a','b','b'],[1,2,1,2]
],
columns=[
['Ohio','Ohio','Colorado'],['Green','Red','Green']
])
frame
     Ohio     Colorado
    Green Red    Green
a 1     0   1        2
  2     3   4        5
b 1     6   7        8
  2     9  10       11
frame.index.names=['key1','key2']
frame.columns.names=['state','color']
frame
# each level can have a name
state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11
Reordering levels
frame.swaplevel('key1','key2')
# swaplevel takes two level numbers or names and returns a new object with the levels interchanged.
frame.sort_index(level=1)
# sort_index can sort by the values in a single level.
state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
b    1        6   7        8
a    2        3   4        5
b    2        9  10       11
frame.groupby(level='key2').sum()
# sum by index level; the older spelling frame.sum(level='key2') was removed in pandas 2.0
state  Ohio     Colorado
color Green Red    Green
key2
1         6   8       10
2        12  14       16
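A self-contained sketch of summing by level on the same frame, using groupby. Summing by a column level is done here by transposing first; that is one way to do it, chosen to keep the example on the row-wise groupby.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape(4, 3),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

# Sum the rows that share the same key2 value
by_key2 = frame.groupby(level='key2').sum()

# Sum the columns that share the same color: transpose, group, transpose back
by_color = frame.T.groupby(level='color').sum().T

print(by_key2)
print(by_color)
```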
