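Reposted from https://www.cnblogs.com/jkmiao/p/5200635.html
This document covers the topics listed below and is worth reading for anyone interested in doing statistical analysis in Python, since the Python libraries are somewhat scattered: material that looks like it belongs together is spread across scipy, pandas, sympy and others. What follows is the general-purpose statistics functionality, which lives in scipy. Topics such as time series live elsewhere, and those libraries in turn lack the features described here.
Drawing random variates (sampling)
84 continuous distributions (only the count is given; they are not described individually)
12 discrete distributions
Distribution functions: probability density function, cumulative distribution function, survival function, percent point (quantile) function, inverse survival function
Distribution statistics: mean, variance, kurtosis, skewness, moments
Generating distributions by linear transformation (shifting and scaling)
Fitting distributions to data
Building distributions
Descriptive statistics
t-test, KS test, chi-square test, normality tests, tests for identical distribution
Kernel density estimation (estimating the probability density function from a sample)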
Statistics (scipy.stats)
Introduction
In this tutorial we discuss many, but certainly not all, features of scipy.stats. The intention is to give the user a working knowledge of this package. We refer to the reference manual for further details.
Note: This documentation is a work in progress.
Random Variables
Two general distribution classes have been implemented for encapsulating continuous random variables and discrete random variables. Over 80 continuous random variables (RVs) and 10 discrete random variables have been implemented using these classes. Besides this, new routines and distributions can easily be added by the end user. (If you create one, please contribute it.)
All of the statistics functions are located in the sub-package scipy.stats, and a fairly complete listing of these functions can be obtained using info(stats). The list of available random variables can also be obtained from the docstring of the stats sub-package.
In the discussion below we mostly focus on continuous RVs. Nearly everything also applies to discrete variables, but we point out some differences in the section "Specific Points for Discrete Distributions".
Getting Help
First of all, every distribution is accompanied by help text. To obtain just some basic information, we can print the docstring:
>>> from scipy import stats
>>> from scipy.stats import norm
>>> print(norm.__doc__)
To find the support, i.e., the lower and upper bounds of the distribution, call:
>>> print('bounds of distribution lower: %s, upper: %s' % (norm.a, norm.b))
bounds of distribution lower: -inf, upper: inf
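As a complementary sketch, the support can also be queried through the support() method, which takes loc and scale into account. This assumes a reasonably recent SciPy (1.4 or newer, where support() is available); the choice of expon and uniform as extra examples is ours, not the original tutorial's.

```python
from scipy.stats import norm, expon, uniform

# support() returns the (lower, upper) end points of the distribution.
# Assumption: SciPy 1.4 or newer, where this method is available.
norm_support = norm.support()
expon_support = expon.support()

# loc and scale move the end points: uniform on [loc, loc + scale]
uniform_support = uniform.support(loc=1, scale=4)

print(norm_support)
print(expon_support)
print(uniform_support)
```

Unlike the a and b attributes, which describe the unshifted distribution, support() reports the end points after shifting and scaling.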
We can list all methods and properties of the distribution with dir(norm). As it turns out, some of the methods are private even though they are not named as such (their names do not start with a leading underscore). For example, veccdf is only available for internal calculation; such methods give warnings when one tries to use them and will be removed at some point.
To obtain the real main methods, we list the methods of the frozen distribution. (We explain the meaning of a frozen distribution below.)
>>> rv = norm()
>>> dir(rv) # reformatted
['__class__', '__delattr__', '__dict__', '__doc__', '__getattribute__',
'__hash__', '__init__', '__module__', '__new__', '__reduce__', '__reduce_ex__',
'__repr__', '__setattr__', '__str__', '__weakref__', 'args', 'cdf', 'dist',
'entropy', 'isf', 'kwds', 'moment', 'pdf', 'pmf', 'ppf', 'rvs', 'sf', 'stats']
Finally, we can obtain the list of available distributions through introspection. The counts shown are from the SciPy version the tutorial was written against and will differ in other versions:
>>> import warnings
>>> warnings.simplefilter('ignore', DeprecationWarning)
>>> dist_continu = [d for d in dir(stats) if
...                 isinstance(getattr(stats, d), stats.rv_continuous)]
>>> dist_discrete = [d for d in dir(stats) if
...                  isinstance(getattr(stats, d), stats.rv_discrete)]
>>> print('number of continuous distributions:', len(dist_continu))
number of continuous distributions: 84
>>> print('number of discrete distributions: ', len(dist_discrete))
number of discrete distributions:  12
Common Methods
The main public methods for continuous RVs are:
rvs: Random Variates
pdf: Probability Density Function
cdf: Cumulative Distribution Function
sf: Survival Function (1-CDF)
ppf: Percent Point Function (Inverse of CDF)
isf: Inverse Survival Function (Inverse of SF)
stats: Return mean, variance, (Fisher’s) skew, or (Fisher’s) kurtosis
moment: non-central moments of the distribution
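The relationships between these methods can be checked numerically. The following is a minimal sketch using the standard normal distribution; the evaluation point x and the probability level q are arbitrary values chosen for illustration.

```python
import numpy as np
from scipy.stats import norm

x = 1.3   # arbitrary evaluation point (illustrative)
q = 0.9   # arbitrary probability level (illustrative)

# sf is the complement of cdf: sf(x) = 1 - cdf(x)
sf_matches = np.isclose(norm.sf(x), 1.0 - norm.cdf(x))

# ppf inverts cdf: ppf(cdf(x)) = x
ppf_roundtrip = np.isclose(norm.ppf(norm.cdf(x)), x)

# isf inverts sf, which is the same as isf(q) = ppf(1 - q)
isf_matches = np.isclose(norm.isf(q), norm.ppf(1.0 - q))

print(sf_matches, ppf_roundtrip, isf_matches)
```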
Let’s take a normal RV as an example.
>>> norm.cdf(0)
0.5
To compute the cdf at a number of points, we can pass a list or a numpy array.
>>> norm.cdf([-1., 0, 1])
array([ 0.15865525, 0.5 , 0.84134475])
>>> import numpy as np
>>> norm.cdf(np.array([-1., 0, 1]))
array([ 0.15865525, 0.5 , 0.84134475])
Thus, the basic methods such as pdf, cdf, and so on are vectorized: they accept scalars, lists, and numpy arrays alike.
Other generally useful methods are supported too:
>>> norm.mean(), norm.std(), norm.var()
(0.0, 1.0, 1.0)
>>> norm.stats(moments = "mv")
(array(0.0), array(1.0))
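The stats and moment methods from the list above can be sketched in the same way. The example below requests all four summary statistics with moments="mvsk" and computes a non-central second moment, which equals the variance plus the squared mean; the loc and scale values are illustrative choices.

```python
from scipy.stats import norm

# mean, variance, skewness and (Fisher, i.e. excess) kurtosis of the standard normal
mean, var, skew, kurt = norm.stats(moments="mvsk")

# Non-central second moment E[X^2] of a normal with loc=3, scale=4 (illustrative values).
# It should equal variance + mean^2 = 16 + 9.
second_moment = norm.moment(2, loc=3, scale=4)

print(float(mean), float(var), float(skew), float(kurt))
print(float(second_moment))
```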
To find the median of a distribution, we can use the percent point function ppf, which is the inverse of the cdf:
>>> norm.ppf(0.5)
0.0
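The same function gives any other quantile. As a small sketch, the 2.5% and 97.5% quantiles of the standard normal bound the central 95% of the probability mass; the probability levels are our illustrative choice.

```python
from scipy.stats import norm

# The median is the 50% quantile
median = norm.ppf(0.5)

# ppf is vectorized, so several quantiles can be requested at once
lower, upper = norm.ppf([0.025, 0.975])

# By symmetry of the normal distribution, lower is the negative of upper
print(median, lower, upper)
```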
To generate a sequence of random variates, use the size keyword argument:
>>> norm.rvs(size=5)
array([-0.35687759, 1.34347647, -0.11710531, -1.00725181, -0.51275702])
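Because rvs draws random numbers, the values differ from run to run. A minimal sketch of making the draw reproducible, assuming a SciPy version in which rvs accepts the random_state keyword (the seed value is arbitrary):

```python
import numpy as np
from scipy.stats import norm

SEED = 12345  # arbitrary seed chosen for illustration

# Passing the same integer seed yields the same sample each time
sample_a = norm.rvs(size=5, random_state=SEED)
sample_b = norm.rvs(size=5, random_state=SEED)

print(sample_a.shape)
print(np.array_equal(sample_a, sample_b))
```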
Don’t think that norm.rvs(5) generates 5 variates. It returns a single one:
>>> norm.rvs(5)
7.131624370075814
This brings us, in fact, to the topic of the next subsection.
Shifting and Scaling
All continuous distributions take loc and scale as keyword parameters to adjust the location and scale of the distribution. For the normal distribution, for example, the location is the mean and the scale is the standard deviation.
>>> norm.stats(loc = 3, scale = 4, moments = "mv")
(array(3.0), array(16.0))
In general, the standardized distribution for a random variable X is obtained through the transformation (X - loc) / scale. The default values are loc = 0 and scale = 1.
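The transformation can be verified numerically. In this sketch (the loc, scale and evaluation points are illustrative), evaluating the cdf with loc and scale matches evaluating the standard cdf at the standardized points, while the pdf additionally picks up a factor 1/scale:

```python
import numpy as np
from scipy.stats import norm

loc, scale = 3.0, 4.0                    # illustrative parameters
x = np.array([-1.0, 3.0, 7.0, 11.0])     # illustrative evaluation points
z = (x - loc) / scale                    # standardized points

# cdf with loc/scale equals the standard cdf at the standardized points
cdf_direct = norm.cdf(x, loc=loc, scale=scale)
cdf_standardized = norm.cdf(z)

# the density is rescaled by 1/scale under the same change of variable
pdf_direct = norm.pdf(x, loc=loc, scale=scale)
pdf_standardized = norm.pdf(z) / scale

print(np.allclose(cdf_direct, cdf_standardized))
print(np.allclose(pdf_direct, pdf_standardized))
```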
Smart use of loc and scale can help modify the standard distributions in many ways. To illustrate the scaling further, the cdf of an exponentially distributed RV with mean 1/λ is given by
F(x) = 1 − exp(−λx)
Applying the scaling rule above, we see that taking scale = 1./lambda gives the proper scale, so the mean of the distribution equals the value passed as scale:
>>> from scipy.stats import expon
>>> expon.mean(scale=3.)
3.0
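As a quick check of the formula, the following sketch compares the closed form F(x) = 1 − exp(−λx) with expon.cdf called with scale = 1/λ. The rate lam and the grid of points are illustrative choices.

```python
import numpy as np
from scipy.stats import expon

lam = 0.5                          # illustrative rate parameter (lambda)
x = np.linspace(0.0, 10.0, 6)      # illustrative evaluation points

# closed-form cdf of the exponential distribution with rate lam
closed_form = 1.0 - np.exp(-lam * x)

# the same cdf obtained from scipy by setting scale = 1 / lam
scipy_cdf = expon.cdf(x, scale=1.0 / lam)

# the mean equals 1 / lam
mean = expon.mean(scale=1.0 / lam)

print(np.allclose(closed_form, scipy_cdf))
print(mean)
```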
The uniform distribution is also interesting. With loc = 1 and scale = 4 it is uniform on the interval [1, 5]:
>>> from scipy.stats import uniform
>>> uniform.cdf([0, 1, 2, 3, 4, 5], loc = 1, scale = 4)
array([ 0. , 0. , 0.25, 0.5 , 0.75, 1. ])
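Since scipy parameterizes the uniform distribution by loc and scale rather than by its end points, a small wrapper can make the interval explicit. The helper uniform_cdf below is hypothetical (it is not part of scipy) and only sketches the mapping loc = a, scale = b - a.

```python
from scipy.stats import uniform


def uniform_cdf(x, a, b):
    """Hypothetical helper: cdf of the uniform distribution on [a, b]."""
    # scipy's uniform lives on [loc, loc + scale]
    return uniform.cdf(x, loc=a, scale=b - a)


midpoint_cdf = uniform_cdf(3.0, 1.0, 5.0)      # cdf at the middle of [1, 5]
interval_mean = uniform.mean(loc=1.0, scale=4.0)

print(midpoint_cdf, interval_mean)
```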
Finally, recall from the previous section that we still have to explain the meaning of norm.rvs(5). As it turns out, when a distribution is called like this, the first argument, i.e., the 5, is passed as the loc parameter. Let’s see:
>>> np.mean(norm.rvs(5, size=500))
4.983550784784704
Thus, to explain the output of the example in the last section: norm.rvs(5) generates a single normally distributed random variate with mean loc=5. I prefer to set the loc and scale parameters explicitly, by passing the values as keywords rather than as positional arguments. This is less of a hassle than it may seem. We clarify this below when we explain the topic of freezing a RV.
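A short sketch of the keyword style recommended here, assuming rvs accepts the random_state keyword. The seed, the sample sizes and the parameter values are illustrative.

```python
import numpy as np
from scipy.stats import norm

# Explicit keywords make the roles of the arguments unambiguous
sample = norm.rvs(loc=5, scale=2, size=10000, random_state=0)
sample_mean = sample.mean()
sample_std = sample.std(ddof=1)

# A bare first positional argument is interpreted as loc, not as size
positional = norm.rvs(5, size=3, random_state=0)
keyword = norm.rvs(loc=5, size=3, random_state=0)

print(sample_mean, sample_std)
print(np.array_equal(positional, keyword))
```

With the keywords spelled out, the sample mean lands close to loc and the sample standard deviation close to scale.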