Statistical Analysis in Python: The scipy.stats Documentation

84 continuous distributions (the tutorial tells you how many there are without covering each one individually)
12 discrete distributions

t-test, KS test, chi-square test, normality test, test for identical distributions

Statistics (scipy.stats)
Introduction

In this tutorial we discuss many, but certainly not all, features of scipy.stats. The intention here is to provide a user with a working knowledge of this package. We refer to the reference manual for further details.

Note: This documentation is work in progress.

Random Variables

There are two general distribution classes that have been implemented for encapsulating continuous random variables and discrete random variables. Over 80 continuous random variables (RVs) and 10 discrete random variables have been implemented using these classes. Besides this, new routines and distributions can easily be added by the end user. (If you create one, please contribute it.)

All of the statistics functions are located in the sub-package scipy.stats and a fairly complete listing of these functions can be obtained using info(stats). The list of the random variables available can also be obtained from the docstring for the stats sub-package.

In the discussion below we mostly focus on continuous RVs. Nearly all applies to discrete variables also, but we point out some differences here: Specific Points for Discrete Distributions.

Getting Help

First of all, all distributions are accompanied by help functions. To obtain just some basic information, we can call

>>> from scipy import stats
>>> from scipy.stats import norm
>>> print(norm.__doc__)

To find the support, i.e., upper and lower bound of the distribution, call:

>>> print('bounds of distribution lower: %s, upper: %s' % (norm.a, norm.b))
bounds of distribution lower: -inf, upper: inf

We can list all methods and properties of the distribution with dir(norm). As it turns out, some of the methods are private although they are not named as such (their names do not start with a leading underscore); for example, veccdf is only available for internal calculation (such methods will give warnings when one tries to use them, and will be removed at some point).

To obtain the real main methods, we list the methods of the frozen distribution. (We explain the meaning of a frozen distribution below).

>>> rv = norm()
>>> dir(rv)  # reformatted
['__class__', '__delattr__', '__dict__', '__doc__', '__getattribute__',
 '__hash__', '__init__', '__module__', '__new__', '__reduce__', '__reduce_ex__',
 '__repr__', '__setattr__', '__str__', '__weakref__', 'args', 'cdf', 'dist',
 'entropy', 'isf', 'kwds', 'moment', 'pdf', 'pmf', 'ppf', 'rvs', 'sf', 'stats']

Finally, we can obtain the list of available distributions through introspection:

>>> import warnings
>>> warnings.simplefilter('ignore', DeprecationWarning)
>>> dist_continu = [d for d in dir(stats) if
...                 isinstance(getattr(stats, d), stats.rv_continuous)]
>>> dist_discrete = [d for d in dir(stats) if
...                  isinstance(getattr(stats, d), stats.rv_discrete)]
>>> print('number of continuous distributions:', len(dist_continu))
number of continuous distributions: 84
>>> print('number of discrete distributions:  ', len(dist_discrete))
number of discrete distributions:   12

Common Methods

The main public methods for continuous RVs are:

rvs: Random Variates
pdf: Probability Density Function
cdf: Cumulative Distribution Function
sf: Survival Function (1-CDF)
ppf: Percent Point Function (Inverse of CDF)
isf: Inverse Survival Function (Inverse of SF)
stats: Return mean, variance, (Fisher’s) skew, or (Fisher’s) kurtosis
moment: non-central moments of the distribution
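
These methods fit together in simple ways. As a quick sanity check (my own illustration, not part of the original tutorial), sf is the complement of cdf, and isf inverts sf exactly as ppf inverts cdf:

```python
from scipy.stats import norm

# sf is the complement of cdf, and isf inverts sf just as ppf inverts cdf
x, q = 1.5, 0.05
print(norm.sf(x), 1 - norm.cdf(x))   # agree to machine precision
print(norm.isf(q), norm.ppf(1 - q))  # agree to machine precision

# stats() and moment() expose the moments directly
print(norm.stats(moments="mvsk"))    # mean, variance, skew, kurtosis
print(norm.moment(4))                # fourth non-central moment, about 3
```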
Let’s take a normal RV as an example.

>>> norm.cdf(0)
0.5

To compute the cdf at a number of points, we can pass a list or a numpy array.

>>> norm.cdf([-1., 0, 1])
array([ 0.15865525,  0.5       ,  0.84134475])
>>> import numpy as np
>>> norm.cdf(np.array([-1., 0, 1]))
array([ 0.15865525,  0.5       ,  0.84134475])

Thus, the basic methods such as pdf, cdf, and so on are vectorized with np.vectorize.
Other generally useful methods are supported too:

>>> norm.mean(), norm.std(), norm.var()
(0.0, 1.0, 1.0)
>>> norm.stats(moments="mv")
(array(0.0), array(1.0))

To find the median of a distribution we can use the percent point function ppf, which is the inverse of the cdf:

>>> norm.ppf(0.5)
0.0

To generate a set of random variates:

>>> norm.rvs(size=5)
array([-0.35687759,  1.34347647, -0.11710531, -1.00725181, -0.51275702])

Don’t think that norm.rvs(5) generates 5 variates:

>>> norm.rvs(5)
7.131624370075814

This brings us, in fact, to the topic of the next subsection.

Shifting and Scaling

All continuous distributions take loc and scale as keyword parameters to adjust the location and scale of the distribution, e.g. for the standard normal distribution the location is the mean and the scale is the standard deviation.

>>> norm.stats(loc=3, scale=4, moments="mv")
(array(3.0), array(16.0))

In general the standardized distribution for a random variable X is obtained through the transformation (X – loc) / scale. The default values are loc = 0 and scale = 1.
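
The transformation can be verified numerically. The following check (my own, using the normal distribution) shows exactly how loc and scale enter the pdf and cdf:

```python
import numpy as np
from scipy.stats import norm

x = np.array([-2.0, 1.0, 3.0, 7.0])
loc, scale = 3.0, 4.0

# pdf with loc/scale is the standardized pdf at (x - loc)/scale, divided by scale
assert np.allclose(norm.pdf(x, loc=loc, scale=scale),
                   norm.pdf((x - loc) / scale) / scale)

# the cdf needs no rescaling, only the standardized argument
assert np.allclose(norm.cdf(x, loc=loc, scale=scale),
                   norm.cdf((x - loc) / scale))
```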

Smart use of loc and scale can help modify the standard distributions in many ways. To illustrate the scaling further, the cdf of an exponentially distributed RV with mean 1/λ is given by

F(x) = 1 − exp(−λx)

By applying the scaling rule above, it can be seen that taking scale = 1./λ yields the proper scale.

>>> from scipy.stats import expon
>>> expon.mean(scale=3.)
3.0

The uniform distribution is also interesting:

>>> from scipy.stats import uniform
>>> uniform.cdf([0, 1, 2, 3, 4, 5], loc=1, scale=4)
array([ 0.  ,  0.  ,  0.25,  0.5 ,  0.75,  1.  ])

Finally, recall from the previous paragraph that we are left with the problem of the meaning of norm.rvs(5). As it turns out, when calling a distribution like this, the first argument, i.e., the 5, gets passed to set the loc parameter. Let’s see:

>>> np.mean(norm.rvs(5, size=500))
4.983550784784704

Thus, to explain the output of the example of the last section: norm.rvs(5) generates a normally distributed random variate with mean loc=5.
I prefer to set the loc and scale parameters explicitly, passing the values as keywords rather than as positional arguments. This is less of a hassle than it may seem. We clarify this below when we explain the topic of freezing an RV.
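
As an illustration of my own, the keyword form leaves no doubt which parameter is which:

```python
import numpy as np
from scipy.stats import norm

# explicit keywords: mean (loc) 5, standard deviation (scale) 2, 1000 variates
sample = norm.rvs(loc=5, scale=2, size=1000)
print(sample.mean(), sample.std())   # roughly 5 and 2
```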

Shape Parameters

While a general continuous random variable can be shifted and scaled with the loc and scale parameters, some distributions require additional shape parameters. For instance, the gamma distribution, with density

γ(x, a) = λ(λx)^(a−1) e^(−λx) / Γ(a),

requires the shape parameter a. Observe that setting λ can be obtained by setting the scale keyword to 1/λ.

Let’s check the number and name of the shape parameters of the gamma distribution. (We know from the above that this should be 1.)

>>> from scipy.stats import gamma
>>> gamma.numargs
1
>>> gamma.shapes
'a'

Now we set the value of the shape variable to 1 to obtain the exponential distribution, so that we can easily check whether we get the results we expect.

>>> gamma(1, scale=2.).stats(moments="mv")
(array(2.0), array(4.0))

Notice that we can also specify shape parameters as keywords:

>>> gamma(a=1, scale=2.).stats(moments="mv")
(array(2.0), array(4.0))

Freezing a Distribution

Passing the loc and scale keywords time and again can become quite bothersome. The concept of freezing a RV is used to solve such problems.

>>> rv = gamma(1, scale=2.)

By using rv we no longer have to include the scale or shape parameters. Thus, distributions can be used in one of two ways: either by passing all distribution parameters to each method call (as we did earlier) or by freezing the parameters for the instance of the distribution. Let us check this:

>>> rv.mean(), rv.std()
(2.0, 2.0)

This is indeed what we should get.

The basic methods pdf and so on satisfy the usual numpy broadcasting rules. For example, we can calculate the critical values for the upper tail of the t distribution for different probabilities and degrees of freedom.

>>> stats.t.isf([0.1, 0.05, 0.01], [[10], [11]])
array([[ 1.37218364,  1.81246112,  2.76376946],
       [ 1.36343032,  1.79588482,  2.71807918]])

Here, the first row contains the critical values for 10 degrees of freedom and the second row those for 11 degrees of freedom (d.o.f.). Thus, the broadcasting rules give the same result as calling isf twice:

>>> stats.t.isf([0.1, 0.05, 0.01], 10)
array([ 1.37218364,  1.81246112,  2.76376946])
>>> stats.t.isf([0.1, 0.05, 0.01], 11)
array([ 1.36343032,  1.79588482,  2.71807918])

If the array with probabilities, i.e., [0.1, 0.05, 0.01], and the array of degrees of freedom, i.e., [10, 11, 12], have the same array shape, then element-wise matching is used. As an example, we can obtain the 10% tail for 10 d.o.f., the 5% tail for 11 d.o.f. and the 1% tail for 12 d.o.f. by calling

>>> stats.t.isf([0.1, 0.05, 0.01], [10, 11, 12])
array([ 1.37218364,  1.79588482,  2.68099799])

Specific Points for Discrete Distributions

Discrete distributions have mostly the same basic methods as the continuous distributions. However, pdf is replaced by the probability mass function pmf, no estimation methods such as fit are available, and scale is not a valid keyword parameter. The location parameter, keyword loc, can still be used to shift the distribution.
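
For instance (an illustration of my own, using the Poisson distribution), pmf takes the role of pdf, and loc shifts the support:

```python
import math
from scipy.stats import poisson

# pmf plays the role that pdf plays for continuous RVs
print(poisson.pmf(0, mu=2))    # exp(-2), about 0.1353

# loc shifts the whole support; this Poisson starts at 10 instead of 0
shifted = poisson(mu=2, loc=10)
print(shifted.pmf(10))         # equals poisson.pmf(0, mu=2)
print(shifted.pmf(9))          # 0.0: below the shifted support
```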

The computation of the cdf requires some extra attention. In the case of a continuous distribution, the cumulative distribution function is, in most standard cases, strictly monotonically increasing in the bounds (a, b) and therefore has a unique inverse. The cdf of a discrete distribution, however, is a step function, hence the inverse cdf, i.e., the percent point function, requires a different definition:
ppf(q) = min{x : cdf(x) >= q, x integer}

For further info, see the scipy.stats reference documentation.

We can look at the hypergeometric distribution as an example.

>>> from scipy.stats import hypergeom
>>> [M, n, N] = [20, 7, 12]

If we use the cdf at some integer points and then evaluate the ppf at those cdf values, we get the initial integers back, for example

>>> x = np.arange(4)*2
>>> x
array([0, 2, 4, 6])
>>> prb = hypergeom.cdf(x, M, n, N)
>>> prb
array([ 0.0001031991744066,  0.0521155830753351,  0.6083591331269301,
        0.9897832817337386])
>>> hypergeom.ppf(prb, M, n, N)
array([ 0.,  2.,  4.,  6.])

If we use values that are not at the kinks of the cdf step function, we get the next higher integer back:

>>> hypergeom.ppf(prb + 1e-8, M, n, N)
array([ 1.,  3.,  5.,  7.])
>>> hypergeom.ppf(prb - 1e-8, M, n, N)
array([ 0.,  2.,  4.,  6.])

Fitting Distributions

The main additional methods of the non-frozen distribution are related to the estimation of distribution parameters:

fit: maximum likelihood estimation of distribution parameters, including location and scale
fit_loc_scale: estimation of location and scale when shape parameters are given
nnlf: negative log likelihood function
expect: calculate the expectation of a function against the pdf or pmf
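
A minimal sketch of fit (my own example, recovering known parameters from simulated data; the random_state keyword assumes a reasonably recent scipy):

```python
from scipy.stats import norm

# simulate data with known loc and scale, then recover them by
# maximum likelihood with fit
data = norm.rvs(loc=5, scale=2, size=2000, random_state=0)
loc_hat, scale_hat = norm.fit(data)
print(loc_hat, scale_hat)   # close to 5 and 2
```
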
Performance Issues and Cautionary Remarks

The performance of the individual methods, in terms of speed, varies widely by distribution and method. The results of a method are obtained in one of two ways: either by explicit calculation, or by a generic algorithm that is independent of the specific distribution.

Explicit calculation, on the one hand, requires that the method is directly specified for the given distribution, either through analytic formulas or through special functions in scipy.special or numpy.random for rvs. These are usually relatively fast calculations.

The generic methods, on the other hand, are used if the distribution does not specify any explicit calculation. To define a distribution, only one of pdf or cdf is necessary; all other methods can be derived using numeric integration and root finding. However, these indirect methods can be very slow. As an example, rgh = stats.gausshyper.rvs(0.5, 2, 2, 2, size=100) creates random variables in a very indirect way and takes about 19 seconds for 100 random variables on my computer, while one million random variables from the standard normal or from the t distribution take just above one second.
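
To see the generic machinery at work, here is a toy distribution of my own that defines only _pdf; cdf, ppf and rvs then fall back to numeric integration and root finding, which is exactly what makes such distributions slow:

```python
from scipy import stats

class linear_gen(stats.rv_continuous):
    """Toy distribution with density f(x) = 2x on [0, 1]."""
    def _pdf(self, x):
        return 2.0 * x

linear = linear_gen(a=0.0, b=1.0, name='linear')
print(linear.cdf(0.5))     # generic numeric integration: about 0.25
print(linear.ppf(0.25))    # generic root finding: about 0.5
print(linear.rvs(size=3))  # generic rvs = ppf applied to uniform draws
```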

Remaining Issues

The distributions in scipy.stats have recently been corrected and improved and have gained a considerable test suite; however, a few issues remain:

The distributions have been tested over some range of parameters; however, in some corner ranges, a few incorrect results may remain.

The maximum likelihood estimation in fit does not work with default starting parameters for all distributions, and the user needs to supply good starting parameters. Also, for some distributions, using a maximum likelihood estimator might inherently not be the best choice.
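
As an illustration of supplying starting values (my own example, not from the tutorial): the positional argument after the data is the starting guess for the shape parameter, and floc fixes the location at 0 so that it is not estimated:

```python
from scipy.stats import gamma

# simulate gamma data with known shape and scale
data = gamma.rvs(a=2.0, scale=3.0, size=2000, random_state=1)

# 2.0 is the starting guess for the shape a; floc=0 pins the location
a_hat, loc_hat, scale_hat = gamma.fit(data, 2.0, floc=0)
print(a_hat, scale_hat)   # close to the true values 2 and 3
```
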
Building Specific Distributions

The next examples show how to build your own distributions. Further examples show the usage of the distributions and some statistical tests.

Making a Continuous Distribution, i.e., Subclassing rv_continuous

Making continuous distributions is fairly simple.

>>> from scipy import stats
>>> class deterministic_gen(stats.rv_continuous):
...     def _cdf(self, x):
...         return np.where(x < 0, 0., 1.)
...     def _stats(self):
...         return 0., 0., 0., 0.

>>> deterministic = deterministic_gen(name="deterministic")
>>> deterministic.cdf(np.arange(-3, 3, 0.5))
array([ 0.,  0.,  0.,  0.,  0.,  0.,  1.,  1.,  1.,  1.,  1.,  1.])

Interestingly, the pdf is now computed automatically:

>>> deterministic.pdf(np.arange(-3, 3, 0.5))
array([  0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         5.83333333e+04,   4.16333634e-12,   4.16333634e-12,
         4.16333634e-12,   4.16333634e-12,   4.16333634e-12])

Be aware of the performance issues mentioned in Performance Issues and Cautionary Remarks. The computation of unspecified common methods can become very slow, since only generic methods are called which, by their very nature, cannot use any specific information about the distribution. Thus, as a cautionary example:

>>> from scipy.integrate import quad
>>> quad(deterministic.pdf, -1e-1, 1e-1)
(4.163336342344337e-13, 0.0)

But this is not correct: the integral over this pdf should be 1. Let’s make the integration interval smaller:

>>> quad(deterministic.pdf, -1e-3, 1e-3)  # warning removed
(1.000076872229173, 0.0010625571718182458)

This looks better. However, the problem originated from the fact that the pdf is not specified in the class definition of the deterministic distribution.
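
One remedy, sketched below with a toy distribution of my own where the density actually exists, is to specify _pdf (and, when known, _cdf) explicitly so that no generic fallback is needed:

```python
from scipy import stats
from scipy.integrate import quad

class cubic_gen(stats.rv_continuous):
    """Density f(x) = 3x**2 on [0, 1] with its exact cdf F(x) = x**3."""
    def _pdf(self, x):
        return 3.0 * x ** 2
    def _cdf(self, x):
        return x ** 3

cubic = cubic_gen(a=0.0, b=1.0, name='cubic')
print(quad(cubic.pdf, 0, 1)[0])   # the pdf now integrates to 1, as it should
print(cubic.ppf(0.125))           # inverse of x**3 at 0.125: 0.5
```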
