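pandas provides a large number of functions for data exploration. These descriptive-statistics functions summarize the overall distribution of the data, and most of them are available as methods of the pandas DataFrame and Series objects.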
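sum(): computes the sum of the sample data (column-wise)
mean(): computes the arithmetic mean of the sample data
var(): computes the variance of the sample data
std(): computes the standard deviation of the sample data
corr(): computes the correlation matrix of the sample data (Pearson by default; Spearman is available via method='spearman')
cov(): computes the covariance matrix of the sample data
skew(): skewness of the sample values (third moment)
kurt(): kurtosis of the sample values (fourth moment)
describe(): gives a basic description of the sample (basic statistics such as the mean and standard deviation)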
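To make the list above concrete, here is a minimal sketch that calls each method on a small DataFrame. The `toy` data and its column names are made up for illustration and are not part of the tutorial's dataset.

```python
import pandas as pd

# Hypothetical toy data (not from the tutorial): b is exactly 2 * a
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
})

totals = toy.sum()                       # column-wise sums
means = toy.mean()                       # arithmetic means
variances = toy.var()                    # sample variance (ddof=1 by default)
stds = toy.std()                         # sample standard deviation
pearson = toy.corr()                     # Pearson correlation by default
spearman = toy.corr(method="spearman")   # rank-based alternative
covariance = toy.cov()                   # covariance matrix
skewness = toy.skew()                    # 0 for this symmetric data
kurtosis = toy.kurt()                    # excess kurtosis
summary = toy.describe()                 # count, mean, std, min, quartiles, max

print(summary)
```

Each call returns a Series indexed by column name, except corr(), cov() and describe(), which return DataFrames.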
import pandas as pd
import sys

print('Python version ' + sys.version)
print('Pandas version ' + pd.__version__)
# Create a DataFrame indexed by date
States = ['NY', 'NY', 'NY', 'NY', 'FL', 'FL', 'GA', 'GA', 'FL', 'FL']
data = [1.0, 2, 3, 4, 5, 6, 7, 8, 9, 10]
idx = pd.date_range('1/1/2012', periods=10, freq='MS')
df1 = pd.DataFrame(data, index=idx, columns=['Revenue'])
df1['State'] = States
# Create a second DataFrame
data2 = [10.0, 10.0, 9, 9, 8, 8, 7, 7, 6, 6]
idx2 = pd.date_range('1/1/2013', periods=10, freq='MS')
df2 = pd.DataFrame(data2, index=idx2, columns=['Revenue'])
df2['State'] = States
For details on how the index is built, see the pandas date_range function for time series.
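As a quick illustration of the date_range call used for the index, the sketch below builds a shorter range with the same arguments. The name `idx_demo` is only for this example; freq='MS' stands for "month start", so each entry falls on the first day of a month.

```python
import pandas as pd

# Same start date and frequency as the tutorial, but only 3 periods
idx_demo = pd.date_range("1/1/2012", periods=3, freq="MS")

# Format the dates as strings to make the result easy to read
dates = list(idx_demo.strftime("%Y-%m-%d"))
print(dates)  # ['2012-01-01', '2012-02-01', '2012-03-01']
```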
# Combine the DataFrames
df = pd.concat([df1, df2])
df
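Note: outlier detection based on the mean and standard deviation is only appropriate for data that follows a Gaussian (normal) distribution.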
# Method 1
# Make a copy of df
newdf = df.copy()
newdf['x-Mean'] = abs(newdf['Revenue'] - newdf['Revenue'].mean())
newdf['1.96*std'] = 1.96*newdf['Revenue'].std()
newdf['Outlier'] = abs(newdf['Revenue'] - newdf['Revenue'].mean()) > 1.96*newdf['Revenue'].std()
newdf
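The text does not explain where the factor 1.96 comes from. Under a normal distribution, about 95% of values lie within 1.96 standard deviations of the mean, which is presumably why it serves as the cutoff here. A minimal check using only the Python standard library:

```python
from statistics import NormalDist

z = 1.96
standard_normal = NormalDist()  # mean 0, standard deviation 1

# Probability mass between -z and +z
coverage = standard_normal.cdf(z) - standard_normal.cdf(-z)
print(round(coverage, 4))  # 0.95
```

This also shows why the rule is tied to the Gaussian assumption: for skewed or heavy-tailed data, the same cutoff can cover a noticeably different share of the values.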
# Method 2
# Group by item
# Make a copy of df
newdf = df.copy()
State = newdf.groupby('State')
# Select 'Revenue' explicitly so that each transform returns a single column
newdf['Outlier'] = State['Revenue'].transform(lambda x: abs(x-x.mean()) > 1.96*x.std())
newdf['x-Mean'] = State['Revenue'].transform(lambda x: abs(x-x.mean()))
newdf['1.96*std'] = State['Revenue'].transform(lambda x: 1.96*x.std())
newdf
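Method 2 relies on the fact that a groupby transform returns a result with the same index as the original data, so it can be assigned straight back as a new column. The sketch below shows this on a tiny made-up DataFrame (`toy` and its values are hypothetical, chosen so the group means are easy to verify by hand).

```python
import pandas as pd

# Hypothetical toy data: two states, two rows each
toy = pd.DataFrame({
    "State": ["NY", "NY", "FL", "FL"],
    "Revenue": [1.0, 3.0, 10.0, 20.0],
})

# Each row receives the mean of its own group: NY -> 2.0, FL -> 15.0
group_mean = toy.groupby("State")["Revenue"].transform("mean")

# Absolute distance of each row from its group mean
toy["x-Mean"] = (toy["Revenue"] - group_mean).abs()
print(toy)
```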
# Method 2
# Group by multiple items
# Make a copy of the original df
newdf = df.copy()
# Group by the 'State' column and by the month of the date index
StateMonth = newdf.groupby(['State', lambda x: x.month])
# Select 'Revenue' explicitly so that each transform returns a single column
newdf['Outlier'] = StateMonth['Revenue'].transform(lambda x: abs(x-x.mean()) > 1.96*x.std())
newdf['x-Mean'] = StateMonth['Revenue'].transform(lambda x: abs(x-x.mean()))
newdf['1.96*std'] = StateMonth['Revenue'].transform(lambda x: 1.96*x.std())
newdf
# Method 3
# Group by item
# Make a copy of the original df
newdf = df.copy()
State = newdf.groupby('State')

def s(group):
    # Work on a copy so the group passed in by pandas is not modified
    group = group.copy()
    group['x-Mean'] = abs(group['Revenue'] - group['Revenue'].mean())
    group['1.96*std'] = 1.96*group['Revenue'].std()
    group['Outlier'] = abs(group['Revenue'] - group['Revenue'].mean()) > 1.96*group['Revenue'].std()
    return group

Newdf2 = State.apply(s)
Newdf2
# Method 3
# Group by multiple items
# Make a copy of the original df
newdf = df.copy()
# Group by the 'State' column and by the month of the date index
StateMonth = newdf.groupby(['State', lambda x: x.month])

def s(group):
    # Work on a copy so the group passed in by pandas is not modified
    group = group.copy()
    group['x-Mean'] = abs(group['Revenue'] - group['Revenue'].mean())
    group['1.96*std'] = 1.96*group['Revenue'].std()
    group['Outlier'] = abs(group['Revenue'] - group['Revenue'].mean()) > 1.96*group['Revenue'].std()
    return group

Newdf2 = StateMonth.apply(s)
Newdf2
Now assume a non-Gaussian distribution (if you plot the data, it does not look like a normal distribution).
# Make a copy of the original df
newdf = df.copy()
State = newdf.groupby('State')
# Lower fence: Q1 - 1.5 * IQR, computed per state
newdf['Lower'] = State['Revenue'].transform(lambda x: x.quantile(q=.25) - (1.5*(x.quantile(q=.75)-x.quantile(q=.25))))
# Upper fence: Q3 + 1.5 * IQR, computed per state
newdf['Upper'] = State['Revenue'].transform(lambda x: x.quantile(q=.75) + (1.5*(x.quantile(q=.75)-x.quantile(q=.25))))
newdf['Outlier'] = (newdf['Revenue'] < newdf['Lower']) | (newdf['Revenue'] > newdf['Upper'])
newdf
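The quantile-based rule above flags values that lie more than 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile. The sketch below isolates that rule in a small helper; the name `iqr_bounds`, the parameter `k`, and the `sample` data are hypothetical and only serve the illustration.

```python
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical sample with one obvious extreme value
sample = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])

lower, upper = iqr_bounds(sample)   # Q1 = 2.0, Q3 = 4.0, IQR = 2.0
outliers = sample[(sample < lower) | (sample > upper)]
print(lower, upper)                 # -1.0 7.0
print(outliers.tolist())            # [100.0]
```

Because quartiles are barely affected by a single extreme value, the fences stay tight around the bulk of the data, which is what makes this rule usable when the data is not normally distributed.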
This tutorial was rewritten by CDS.