【Question title】: What's the proper distribution for the following data?
【Posted】: 2021-01-25 03:22:19
【Question】:

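I have the following sample data.
It looks like the right half of a normal distribution.

Suppose the data are read times for blog posts. What I want to do is find out how each blog post performs in terms of read time.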

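With a regular normal distribution, I would find the population mean and std, then, given a sample (a blog), find the blog's mean read time and compute the p-value of that sample mean.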
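The procedure described above can be sketched roughly as a two-sided z-test. This is only an illustration under the normality assumption the question goes on to doubt; the function name `sample_mean_p_value` and the numbers are made up for the example.

```python
from math import sqrt
from statistics import NormalDist, mean, pstdev


def sample_mean_p_value(population, sample):
    """Two-sided z-test p-value for the mean of `sample` against `population`."""
    mu = mean(population)
    sigma = pstdev(population)
    # Standard error of the mean for a sample of this size.
    se = sigma / sqrt(len(sample))
    z = (mean(sample) - mu) / se
    # Two-sided tail probability under the standard normal.
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Made-up read times, purely for illustration.
all_reads = [12.0, 18.5, 20.0, 22.5, 25.0, 27.5, 30.0, 35.5]
one_blog = [28.0, 31.0, 33.5, 30.5]
print(sample_mean_p_value(all_reads, one_blog))
```

A small p-value would suggest that the blog's mean read time differs from the population mean, provided the normal approximation for the sample mean is reasonable.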

But since the distribution is not normal, what should I do?

Here is the data.

tds_ = [28.965,
 12.172,
 17.042,
 36.98,
 20.323,
 3.481,
 18.43,
 5.638,
 20.763,
 48.104,
 8.015,
 21.2,
 48.122,
 32.51,
 16.87,
 10.402,
 7.896,
 3.827,
 0.078,
 18.63,
 42.428,
 0.975,
 11.392,
 15.937,
 4.531,
 44.635,
 10.457,
 53.821,
 43.046,
 39.572,
 6.31,
 52.039,
 36.726,
 19.67,
 43.719,
 9.421,
 2.798,
 20.013,
 32.888,
 43.622,
 13.093,
 38.688,
 57.199,
 13.627,
 42.571,
 34.076,
 18.812,
 49.251,
 57.412,
 35.089,
 8.093,
 15.141,
 58.05,
 17.936,
 4.673,
 5.475,
 11.731,
 46.649,
 12.403,
 6.442,
 22.542,
 44.069,
 7.893,
 26.484,
 4.199,
 6.575,
 3.209,
 32.125,
 40.202,
 37.918,
 27.567,
 22.634,
 43.355,
 44.481,
 17.854,
 29.538,
 2.39,
 16.52,
 34.321,
 8.003,
 28.034,
 20.963,
 16.509,
 26.279,
 13.541,
 22.654,
 32.074,
 9.474,
 1.054,
 11.612,
 2.108,
 19.015,
 0.864,
 7.577,
 9.927,
 7.295,
 6.689,
 13.908,
 2.063,
 31.57]

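Here I show the distribution alongside a normal one. (From the sample, I created a negated copy and appended it to the sample.) The result then looks like a normal distribution.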

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import norm


fig, ax = plt.subplots(1, 1)

tds_half = pd.Series(tds_)
tds_inverse = tds_half * -1

tds = np.append(tds_half, tds_inverse)

mean = np.mean(tds)
std = np.std(tds)

# Frozen normal distribution with the mean and std of the mirrored data.
rv = norm(mean, std)
x = np.linspace(rv.ppf(0.01), rv.ppf(0.99), 100)
ax.plot(x, rv.pdf(x), 'r-', lw=5, alpha=0.6, label='norm pdf')
ax.plot(x, rv.pdf(x), 'k-', lw=2, label='frozen pdf')

r = tds

ax.hist(r, density=True, histtype='stepfilled', alpha=0.2)
ax.legend(loc='best', frameon=False)
plt.show()

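【Comments】:

  • ...This isn't really a programming question, is it? (A program that finds the distribution of some data is possible, but it is a hard task. Mathematica has FindDistribution.)
  • (By the way, as you already know, there is apparently something called the Half-normal distribution. But it could also be geometric or something else.)
  • OK, I guess you're right, it's not a programming question, but apparently there is a program, Mathematica, that can help me, haha! @user202729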
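As a rough illustration of the half-normal suggestion in the comments, the sketch below fits `scipy.stats.halfnorm` and runs a Kolmogorov-Smirnov check. It uses synthetic stand-in data (an assumption, not the poster's values); the real list `tds_` could be substituted for `read_times`.

```python
import numpy as np
from scipy.stats import halfnorm, kstest

# Synthetic stand-in for the read times; replace with the real list (tds_).
rng = np.random.default_rng(0)
read_times = halfnorm.rvs(loc=0, scale=25, size=100, random_state=rng)

# Fit with the location pinned at 0, since read times cannot be negative.
loc, scale = halfnorm.fit(read_times, floc=0)

# Kolmogorov-Smirnov test against the fitted distribution.
# Caveat: the parameters were estimated from the same data, so this
# p-value is optimistic and should be read as a rough check only.
stat, p_value = kstest(read_times, 'halfnorm', args=(loc, scale))
print(f"scale={scale:.2f}, KS statistic={stat:.3f}, p={p_value:.3f}")

# Fraction of read times expected to exceed 40 under the fitted model.
tail = halfnorm.sf(40, loc=loc, scale=scale)
print(f"P(read time > 40) = {tail:.3f}")
```

If the fit looks acceptable, `halfnorm.sf` and `halfnorm.cdf` give tail probabilities for an individual read time directly, without mirroring the data.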

Tags: python normal-distribution


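【Solution 1】:

I draw 10,000 random samples from a normal distribution. Next, I plot the histogram and the probability density function using the mean and standard deviation of the data.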

tds = [28.965, 12.172, 17.042, 36.98, 20.323, 3.481, 18.43, 5.638,20.763, 48.104, 8.015, 21.2, 48.122, 32.51, 16.87, 
    10.402, 7.896, 3.827, 0.078, 18.63, 42.428, 0.975, 11.392, 15.937, 4.531, 44.635, 10.457, 53.821, 43.046, 39.572,
    6.31, 52.039, 36.726, 19.67, 43.719, 9.421, 2.798, 20.013, 32.888, 43.622, 13.093, 38.688, 57.199, 13.627, 42.571,
    34.076, 18.812, 49.251, 57.412, 35.089, 8.093, 15.141, 58.05, 17.936, 4.673, 5.475, 11.731, 46.649, 12.403, 6.442,
    22.542, 44.069, 7.893, 26.484, 4.199, 6.575, 3.209, 32.125, 40.202, 37.918, 27.567, 22.634, 43.355, 44.481, 17.854,
    29.538, 2.39, 16.52, 34.321, 8.003, 28.034, 20.963, 16.509, 26.279, 13.541, 22.654, 32.074, 9.474, 1.054, 11.612,
    2.108, 19.015, 0.864, 7.577, 9.927, 7.295, 6.689, 13.908, 2.063, 31.57]

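import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

mean = np.mean(tds)
std = np.std(tds)
N = 10000

fig, ax = plt.subplots(figsize=(10, 8))
# Draw N samples from a normal distribution with the data's mean and std.
results = norm.rvs(mean, std, size=N)
ax.hist(results, bins=100)
# Overlay the normal pdf on a second y-axis.
twin_ax = ax.twinx()
x = np.linspace(norm(mean, std).ppf(0.01),
                norm(mean, std).ppf(0.99), N)
twin_ax.plot(x, norm(mean, std).pdf(x), 'r-', lw=5, alpha=0.6, label='norm pdf')
twin_ax.legend()
plt.show()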

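You should test whether the data is normally distributed.

For a normal distribution:

  1. About 68% of values lie within 1 standard deviation of the mean
  2. About 95% lie within 2 standard deviations
  3. About 99.7% lie within 3 standard deviations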
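One way to check these proportions is to count how much of the data falls within each band, then add a formal normality test. The sketch below is an assumption-laden illustration: the helper `within_k_std` is hypothetical, the data are synthetic, and Shapiro-Wilk is just one of several possible tests.

```python
import numpy as np
from scipy.stats import shapiro


def within_k_std(data, k):
    """Fraction of values within k standard deviations of the mean."""
    data = np.asarray(data, dtype=float)
    return float(np.mean(np.abs(data - data.mean()) <= k * data.std()))


# Synthetic normal data as a stand-in; replace with the real values (tds).
rng = np.random.default_rng(0)
data = rng.normal(loc=20, scale=5, size=1000)

for k, expected in [(1, 0.68), (2, 0.95), (3, 0.997)]:
    print(f"within {k} std: {within_k_std(data, k):.3f} (normal: about {expected})")

# Shapiro-Wilk test: a small p-value is evidence against normality.
stat, p_value = shapiro(data)
print(f"Shapiro-Wilk W={stat:.3f}, p={p_value:.3f}")
```

Fractions far from 68/95/99.7, or a very small Shapiro-Wilk p-value, would indicate that a normal model is a poor description of the data.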

【Comments】:
