Python Distribution Fitting with Sum of Square Error (SSE): Answers

【Question title】: Python Distribution Fitting with Sum of Square Error (SSE)
【Posted】: 2017-08-26 08:29:20
【Question description】:

I am trying to find the best distribution curve to fit my data, which consists of

y-axis = [0, 0, 0, 0, 0.24, 0.53, 0.49, 0.64, 0.54, 0.78, 0.59, 0.44, 
          0.34, 0.88, 0.2, 0.49, 0.39, 0.39, 0.29, 0.2, 0.05, 0.05, 
          0.25, 0.05, 0.1, 0.15, 0.1, 0.1, 0.1, 0, 0, 0, 0, 0]

The y-axis is the probability of an event occurring in each time bin on the x-axis:

x-axis = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 
          12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 
          22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 
          32.0, 33.0, 34.0]

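I am doing this in Python, following the example given in Fitting empirical distribution to theoretical ones with Scipy (Python)?

Specifically, I am trying to recreate the section titled "Distribution Fitting with Sum of Square Error (SSE)", where you run through different distributions to find the right fit to the data.

How can I modify that example so that it works with my data inputs? (Answered.)

Updated version based on Bill's response. I am now trying to plot the fitted curve against the data, and something looks off: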

%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import scipy
import scipy.stats
import numpy as np
from scipy.stats import gamma, lognorm, loglaplace
from scipy.optimize import curve_fit

x_axis = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0]
y_axis = [0, 0, 0, 0, 0.24, 0.53, 0.49, 0.64, 0.54, 0.78, 0.59, 0.44, 0.34, 0.88, 0.2, 0.49, 0.39, 0.39, 0.29, 0.2, 0.05, 0.05, 0.25, 0.05, 0.1, 0.15, 0.1, 0.1, 0.1, 0, 0, 0, 0, 0]

matplotlib.rcParams['figure.figsize'] = (16.0, 12.0)
matplotlib.style.use('ggplot')

def f(x, a, loc, scale):
    return gamma.pdf(x, a, loc, scale)

result, pcov = curve_fit(f, x_axis, y_axis)

# get curve shape, location, scale
shape = result[:-2]
loc = result[-2]
scale = result[-1]

# construct the curve
x = np.linspace(0, 36, 100)
y = f(x, *result)

plt.bar(x_axis, y_axis, width=0.8, alpha=0.75)  # 0.8 is matplotlib's default bar width
plt.plot(x, y, c='g')

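【Question discussion】:

Why don't you show us what you have tried, and explain why it doesn't work the way you want?
Actually, a lot of things look confused here. If the y values are samples from a random variable within [0, 1], then why build a histogram with range(48)? That makes no sense, since every sample will always fall in the first bin. If instead y is a function of x, then it is not really a sample of a random variable, and I don't understand what distribution you are trying to fit with this code.
So I have a hypothetical x-y function, and I am trying to find/fit a distribution curve that best describes the shape of the data. In this case, x is the month within a term (a 47-month term) of a lottery game, and y is the probability of winning the lottery in that particular month. I looked at historical data and grouped it this way to get the probability for each month. Now I want a simple equation, in the shape of a distribution curve, that fits my data.

Tags: python numpy scipy statistics distribution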

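【Solution 1】:

Your situation is not the same as the one handled in the question you cited. You have both the ordinates and the abscissas of the data points, rather than the usual i.i.d. sample. I suggest you use scipy curve_fit. Here is a sample.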

x_axis = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 32.0, 33.0, 34.0]
y_axis = [0, 0, 0, 0, 0.24, 0.53, 0.49, 0.64, 0.54, 0.78, 0.59, 0.44, 0.34, 0.88, 0.2, 0.49, 0.39, 0.39, 0.29, 0.2, 0.05, 0.05, 0.25, 0.05, 0.1, 0.15, 0.1, 0.1, 0.1, 0, 0, 0, 0, 0]

# y_axis values must be normalised (made to sum to 1)
sum_ys = sum(y_axis)
y_axis = [y / sum_ys for y in y_axis]
print(sum(y_axis))

from scipy.stats import gamma, norm
from scipy.optimize import curve_fit

def gamma_f(x, a, loc, scale):
    return gamma.pdf(x, a, loc, scale)

def norm_f(x, loc, scale):
    return norm.pdf(x, loc, scale)

fitting = norm_f

result = curve_fit(fitting, x_axis, y_axis)
print(result)

import matplotlib.pyplot as plt

plt.plot(x_axis, y_axis, 'ro')
plt.plot(x_axis, [fitting(x, *result[0]) for x in x_axis], 'b-')
plt.axis([0, 35, 0, .5])
plt.show()

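This version shows how to plot the normal fit to the data. (The gamma gives a poor fit.) The normal needs only two parameters. In general, you need only the first part of the output result, which holds the parameter estimates: shape, location and scale.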

(array([  2.3352639 ,  -3.08105104,  10.15024823]), array([[   5954.86532869,  -27818.92220973,  -19675.22421994],
       [ -27818.92220973,  133161.76500251,   90741.43608615],
       [ -19675.22421994,   90741.43608615,   66054.79087992]]))

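Note that the pdf of the gamma distribution is also available in scipy, as I think are the others you need, which saves you the effort of coding them yourself.

The most important thing I omitted from the first version of the code was the need to normalise the y values, that is, to make them sum to 1, since they are supposed to approximate histogram heights.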
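
The same curve_fit approach extends to the question's original goal of comparing several distributions by SSE. What follows is only a minimal sketch, not the method of the linked question: the candidate list, the starting guesses `p0`, and the lower bounds are all assumptions chosen for illustration, and a different starting point can change which fit comes out best.

```python
import numpy as np
from scipy import stats
from scipy.optimize import curve_fit

x = np.arange(1.0, 35.0)  # time bins 1..34
y = np.array([0, 0, 0, 0, 0.24, 0.53, 0.49, 0.64, 0.54, 0.78, 0.59, 0.44,
              0.34, 0.88, 0.2, 0.49, 0.39, 0.39, 0.29, 0.2, 0.05, 0.05,
              0.25, 0.05, 0.1, 0.15, 0.1, 0.1, 0.1, 0, 0, 0, 0, 0], dtype=float)
y = y / y.sum()  # normalise so the heights sum to 1

# name: (pdf, starting guess, lower bounds). Guesses and bounds are assumptions.
candidates = {
    "norm": (stats.norm.pdf, [10.0, 5.0], [-np.inf, 1e-6]),
    "gamma": (stats.gamma.pdf, [2.0, 4.0, 4.0], [1.0, -np.inf, 1e-6]),
    "lognorm": (stats.lognorm.pdf, [0.5, 0.0, 10.0], [1e-6, -np.inf, 1e-6]),
}

sse_by_name = {}
for name, (pdf, p0, lower) in candidates.items():
    try:
        # Bounds keep shape and scale positive during the search
        popt, _ = curve_fit(pdf, x, y, p0=p0, bounds=(lower, np.inf),
                            max_nfev=5000)
    except (RuntimeError, ValueError):
        continue  # skip candidates whose fit does not converge
    sse = float(np.sum((y - pdf(x, *popt)) ** 2))
    if np.isfinite(sse):
        sse_by_name[name] = sse

# Smallest SSE first
for name, sse in sorted(sse_by_name.items(), key=lambda item: item[1]):
    print(f"{name}: SSE = {sse:.6f}")
```

The SSE values are only comparable because every candidate is fitted to the same normalised y values.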

【Discussion】:

Thanks, this helps! Now, if I want to recreate the curve using the three parameters and plot it against the initial data, would I do the following (something is not working): result = curve_fit(f, x_axis, y_axis) shape = result[:-2] loc = result[-2] scale = result[-1] x = np.linspace(0, 36, 100) y = f(x, *popt) * 10 plt.bar(x_axis, y_axis, width, alpha=0.75 ) plt.plot(x, y, c='g')
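
One likely cause of the trouble in the comment above is that curve_fit returns a tuple of two items, the parameter estimates and their covariance matrix, so slicing `result` directly slices the tuple rather than the parameters. A minimal sketch of the unpacking follows; the starting guess `p0` and the bounds are assumptions added here to keep the gamma fit in a valid region, and they are not part of the original answer.

```python
import numpy as np
from scipy.stats import gamma
from scipy.optimize import curve_fit

def f(x, a, loc, scale):
    return gamma.pdf(x, a, loc, scale)

x_axis = np.arange(1.0, 35.0)
y_axis = np.array([0, 0, 0, 0, 0.24, 0.53, 0.49, 0.64, 0.54, 0.78, 0.59, 0.44,
                   0.34, 0.88, 0.2, 0.49, 0.39, 0.39, 0.29, 0.2, 0.05, 0.05,
                   0.25, 0.05, 0.1, 0.15, 0.1, 0.1, 0.1, 0, 0, 0, 0, 0],
                  dtype=float)
y_axis = y_axis / y_axis.sum()  # normalise, as the answer recommends

# curve_fit returns (optimal parameters, covariance matrix)
popt, pcov = curve_fit(f, x_axis, y_axis,
                       p0=[2.0, 4.0, 4.0],                    # assumed guess
                       bounds=([1.0, -np.inf, 1e-6], np.inf),  # assumed bounds
                       max_nfev=5000)
shape, loc, scale = popt

# Evaluate the fitted curve on a fine grid for plotting
x = np.linspace(0, 36, 100)
y = f(x, *popt)
print(shape, loc, scale)
```

With matplotlib available, `plt.bar(x_axis, y_axis, alpha=0.75)` followed by `plt.plot(x, y, c='g')` overlays the curve on the normalised data, with no extra scaling factor needed.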

【Solution 2】:

I tried your example using the OpenTURNS platform. Here is what I got.

After importing openturns, and openturns.viewer.View for plotting, I start with the same data as you

    import openturns as ot
    from openturns.viewer import View

    x_axis = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 
          12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 19.0, 20.0, 21.0, 
          22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 30.0, 31.0, 
          32.0, 33.0, 34.0]

    y_axis = [0, 0, 0, 0, 0.24, 0.53, 0.49, 0.64, 0.54, 0.78, 0.59, 0.44, 
          0.34, 0.88, 0.2, 0.49, 0.39, 0.39, 0.29, 0.2, 0.05, 0.05, 
          0.25, 0.05, 0.1, 0.15, 0.1, 0.1, 0.1, 0, 0, 0, 0, 0]

Step 1: we can define the corresponding distribution

    distribution = ot.UserDefined(ot.Sample([[s] for s in x_axis]), y_axis)
    graph = distribution.drawPDF()
    graph.setColors(["black"])
    graph.setLegends(["your input"])

At this stage, calling View(graph) displays the PDF of your input.

Step 2: we can draw a sample from the resulting distribution

    sample = distribution.getSample(10000)

This sample can then be used to fit any kind of distribution. I tried the WeibullMin and Gamma distributions

    # WeibullMin Factory
    distribution2 = ot.WeibullMinFactory().build(sample)
    print(distribution2)
    # >>> WeibullMin(beta = 8.83969, alpha = 1.48142, gamma = 4.76832)
    graph2 = distribution2.drawPDF()
    graph2.setLegends(["Best WeibullMin"])

    # Gamma Factory
    distribution3 = ot.GammaFactory().build(sample)
    print(distribution3)
    # >>> Gamma(k = 2.08142, lambda = 0.25157, gamma = 4.9995)
    graph3 = distribution3.drawPDF()
    graph3.setLegends(["Best Gamma"])
    graph3.setColors(["blue"])

    # Plot all the results on one graph
    graph.add(graph2)
    graph.add(graph3)
    View(graph)
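
For readers without OpenTURNS, the resample-then-fit idea can be sketched with NumPy and SciPy. This is a swapped-in analogue, not the OpenTURNS implementation: the seed, the sample size, and the fixed location `floc = 4.0` (the last bin before the nonzero probabilities begin) are assumptions, and the estimates need not match the factory output printed above.

```python
import numpy as np
from scipy.stats import gamma

x_axis = np.arange(1.0, 35.0)
y_axis = np.array([0, 0, 0, 0, 0.24, 0.53, 0.49, 0.64, 0.54, 0.78, 0.59, 0.44,
                   0.34, 0.88, 0.2, 0.49, 0.39, 0.39, 0.29, 0.2, 0.05, 0.05,
                   0.25, 0.05, 0.1, 0.15, 0.1, 0.1, 0.1, 0, 0, 0, 0, 0],
                  dtype=float)
weights = y_axis / y_axis.sum()  # probabilities must sum to 1

# Step 1 and 2: treat (x, weights) as a discrete distribution and sample it
rng = np.random.default_rng(0)   # assumed seed, for reproducibility
sample = rng.choice(x_axis, size=10000, p=weights)

# Step 3: maximum-likelihood gamma fit with the location held fixed
floc = 4.0                       # assumption: support starts after bin 4
a, loc, scale = gamma.fit(sample, floc=floc)
print(f"gamma: a = {a:.4f}, loc = {loc:.4f}, scale = {scale:.4f}")
```

Unlike curve_fit, this route fits by likelihood on a sample, so it needs no normalisation of the curve heights beyond turning them into sampling weights.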

【Discussion】:

【Solution 3】:

I think this is the best and simplest way to compute the sum of squared errors:

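# Write the function
import numpy as np

def SSE(y_true, y_pred):
    # Convert to arrays so that plain Python lists work too
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    sse = np.sum((y_true - y_pred) ** 2)
    print(sse)
    return sse  # return the value so that fits can be compared

# Now call the function to get the result
SSE(y_true, y_pred)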
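
To show how an SSE value is used to choose between fits, here is a small worked sketch. The three observed heights and the two sets of model predictions are made-up numbers for illustration only, and the helper name `sse` is hypothetical.

```python
import numpy as np

def sse(y_true, y_pred):
    """Sum of squared differences between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sum((y_true - y_pred) ** 2))

# Made-up example values, not taken from the question's data
observed = [0.20, 0.50, 0.30]
predictions = {
    "model_a": [0.25, 0.45, 0.30],  # errors: 0.05, 0.05, 0.00
    "model_b": [0.10, 0.60, 0.30],  # errors: 0.10, 0.10, 0.00
}

scores = {name: sse(observed, pred) for name, pred in predictions.items()}
best = min(scores, key=scores.get)  # the lowest SSE is the closest fit
print(scores, best)
```

Here model_a scores 0.05² + 0.05² = 0.005 and model_b scores 0.10² + 0.10² = 0.02, so model_a is the closer fit.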

【Discussion】: