Python - Apply SciPy Beta Distribution to all rows of Pandas DataFrame

【Question title】: Python - Apply SciPy Beta Distribution to all rows of Pandas DataFrame
【Posted】: 2017-10-22 00:25:26
【Question】:

In SciPy, the CDF of a beta distribution can be evaluated as follows:

import scipy.stats

x = 640495496
alpha = 1.5017096
beta = 628.110247
A = 0
B = 148000000000
p = scipy.stats.beta.cdf(x, alpha, beta, loc=A, scale=B-A)
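As a side note on what loc and scale do in the call above: scipy.stats distributions are location-scale families, so passing loc=A and scale=B-A should be equivalent to rescaling x onto [0, 1] by hand. The sketch below checks this with the values from the question; the second set of values (x2, a2, b2, A2, B2) is hypothetical and only there to exercise a non-zero lower bound.

```python
import numpy as np
from scipy import stats

# Values taken from the question.
x, a, b = 640495496, 1.5017096, 628.110247
A, B = 0, 148000000000

# CDF using SciPy's loc/scale parameters.
p_loc_scale = stats.beta.cdf(x, a, b, loc=A, scale=B - A)
# Same CDF after rescaling x onto the standard [0, 1] support by hand.
p_rescaled = stats.beta.cdf((x - A) / (B - A), a, b)

# Hypothetical values (not from the question) with a non-zero lower bound.
x2, a2, b2, A2, B2 = 12.5, 2.0, 3.0, 10.0, 20.0
q_loc_scale = stats.beta.cdf(x2, a2, b2, loc=A2, scale=B2 - A2)
q_rescaled = stats.beta.cdf((x2 - A2) / (B2 - A2), a2, b2)

print(np.isclose(p_loc_scale, p_rescaled), np.isclose(q_loc_scale, q_rescaled))
```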

Now suppose I have a Pandas DataFrame with the columns x, alpha, beta, A and B. How can I apply the beta distribution to each row and append the result as a new column?

【Comments】:

Tags: python pandas scipy beta-distribution

【Solution 1】:

Use apply with the function scipy.stats.beta.cdf and axis=1, so that the lambda receives one row at a time:

df['p'] = df.apply(lambda x:  scipy.stats.beta.cdf(x['x'], 
                                                   x['alpha'], 
                                                   x['beta'], 
                                                   loc=x['A'], 
                                                   scale=x['B']-x['A']), axis=1)

Example:

import pandas as pd
import scipy.stats

df = pd.DataFrame({'x':[640495496, 640495440],
                   'alpha':[1.5017096,1.5017045],
                   'beta':[628.110247, 620.110],
                   'A':[0,0],
                   'B':[148000000000,148000000000]})
print (df)
   A             B     alpha        beta          x
0  0  148000000000  1.501710  628.110247  640495496
1  0  148000000000  1.501704  620.110000  640495440

df['p'] = df.apply(lambda x:  scipy.stats.beta.cdf(x['x'], 
                                                   x['alpha'], 
                                                   x['beta'], 
                                                   loc=x['A'], 
                                                   scale=x['B']-x['A']), axis=1)
print (df)
   A             B     alpha        beta          x         p
0  0  148000000000  1.501710  628.110247  640495496  0.858060
1  0  148000000000  1.501704  620.110000  640495440  0.853758

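【Comments】:

I have imported scipy, but when I use apply it returns an error: NameError: ("global name 'scipy' is not defined", u'occurred at index 0')
Using import scipy alone does not import scipy.stats. To use scipy.stats, you must use import scipy.stats.
Yes, to clarify, I am using import scipy.stats, but it still doesn't seem to work. The answer below does work, though.
Hmm, I tested it in Python 3 in Spyder and it works for me. But maybe I'm wrong.
@Cameron - Doesn't Warren Weckesser's suggestion work for you? import scipy.stats ?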

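【Solution 2】:

Since I suspect that pandas apply simply iterates over all the rows, and the scipy.stats distributions carry quite a bit of overhead on each call, I would use the vectorized version: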

>>> from scipy import stats
>>> df['p'] = stats.beta.cdf(df['x'], df['alpha'], df['beta'], loc=df['A'], scale=df['B']-df['A'])
>>> df
   A             B     alpha        beta          x         p
0  0  148000000000  1.501710  628.110247  640495496  0.858060
1  0  148000000000  1.501704  620.110000  640495440  0.853758
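To check that the vectorized call and the row-wise apply from Solution 1 agree, something like the following sketch could be used. It rebuilds the example DataFrame from this thread; the helper name row_cdf and the variable names p_apply, p_vec and same are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Example data from the thread.
df = pd.DataFrame({'x': [640495496, 640495440],
                   'alpha': [1.5017096, 1.5017045],
                   'beta': [628.110247, 620.110],
                   'A': [0, 0],
                   'B': [148000000000, 148000000000]})

def row_cdf(row):
    # Hypothetical helper: evaluate the beta CDF for a single row.
    return stats.beta.cdf(row['x'], row['alpha'], row['beta'],
                          loc=row['A'], scale=row['B'] - row['A'])

# Row-wise evaluation (one scipy call per row).
p_apply = df.apply(row_cdf, axis=1)

# Vectorized evaluation (one scipy call for the whole frame).
p_vec = stats.beta.cdf(df['x'], df['alpha'], df['beta'],
                       loc=df['A'], scale=df['B'] - df['A'])

# Both approaches should give the same probabilities.
same = np.allclose(p_apply.to_numpy(), p_vec)
print(same)
```

On a frame this small the speed difference is negligible; the vectorized form is mainly worth it when the DataFrame has many rows.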

【Comments】: