How to calculate a confidence interval using numpy.percentile() in Python

【Question title】: How to calculate a Confidence Interval using numpy.percentile() in Python
【Posted】: 2019-09-15 09:13:10
【Question description】:

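A homework problem asks me to calculate a confidence interval for the mean. I get different answers when I use the traditional method and when I use numpy.percentile().

I think I may be misunderstanding how or when to use np.percentile(). My two questions are: 1. Am I using it wrong (incorrect inputs, etc.)? 2. Am I using it in the wrong place, i.e. should it be used for a bootstrap CI rather than the traditional method?

I have calculated the CI both with the traditional formula and with np.percentile():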


import math
import numpy as np
import scipy.stats as ss

# 11427 = mean of the original vector, 5845 = std of the original vector
price = np.random.normal(11427, 5845, 30)
print(price)

[14209.99205723 7793.06283131 10403.87407888 10910.59681669 14427.87437741 4426.8122023 13890.22030853 5652.39284669 22436.9686157 9591.28194843 15543.24262609 11951.15170839 16242.64433138 3673.40741792 18962.90840397 11320.92073514 12984.61905211 8716.97883291 15539.80873528 19324.24734807 12507.9268783 11226.36772026 8869.27092532 9117.52393498 11786.21064418 11273.61893921 17093.20022578 10163.75037277 13962.10004709 17094.70579814]

x_bar = np.mean(price) # mean of vector
s = np.std(price) # std of vector (np.std defaults to ddof=0, the population formula)
n = len(price) # number of obs
z = 1.96 # for a 95% CI

lower = x_bar - (z * (s/math.sqrt(n)))
upper = x_bar + (z * (s/math.sqrt(n)))
med = np.median(price)

print(lower, med, upper)

10838.458908888499 11868.68117628698 13901.386475143861

np.percentile(price, [2.5, 50, 97.5])

[4219.6258866 11868.68117629 20180.24569667]

ss.scoreatpercentile(price, [2.5, 50, 97.5])

[4219.6258866 11868.68117629 20180.24569667]

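I expected lower, med, and upper to equal the output of np.percentile().

The medians are the same, but the upper and lower bounds differ a lot.

Also, scipy.stats.scoreatpercentile gives the same output as numpy.percentile.

Any ideas?

Thanks!

Edited to show the price vector.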

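【Comments】:

Could you provide the array price?
@kmario23 I edited it to "show" the price array. It is a column from a DF, but I just made a random normal vector with its parameters. The discrepancy is still there and still large. Any help would be great!
You will get a better explanation of confidence intervals vs. percentiles on stats.stackexchange.com than I can give.

Tags: python numpy confidence-interval percentile

【Solution 1】: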

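A confidence interval and percentiles are not the same thing. The formulas for the two are quite different: the CI describes the uncertainty in the estimated mean, while percentiles describe the spread of the data itself.

The number of samples you have affects your confidence interval, but does not change the percentiles (much).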
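To make the difference concrete, here is a minimal sketch that computes both quantities on the same data for two sample sizes. The helper names `mean_ci` and `percentile_range`, the seed, and the sample sizes are illustrative choices, not part of the original post. It uses the sample standard deviation (`ddof=1`), which differs slightly from the question's `np.std(price)`.

```python
import numpy as np

def mean_ci(sample, z=1.96):
    """Normal-approximation CI for the mean: x_bar +/- z * s / sqrt(n)."""
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    x_bar = sample.mean()
    s = sample.std(ddof=1)  # sample standard deviation (assumption, see lead-in)
    half_width = z * s / np.sqrt(n)
    return x_bar - half_width, x_bar + half_width

def percentile_range(sample, lo=2.5, hi=97.5):
    """Range that contains the central 95% of the observations themselves."""
    low, high = np.percentile(sample, [lo, hi])
    return low, high

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
for n in (30, 3000):
    sample = rng.normal(0, 1, n)
    ci_low, ci_high = mean_ci(sample)
    p_low, p_high = percentile_range(sample)
    print(f"n={n}: CI width={ci_high - ci_low:.3f}, "
          f"percentile width={p_high - p_low:.3f}")
```

The CI width carries a 1/sqrt(n) factor, so it should shrink as n grows, while the percentile width has no such factor and should stay near the spread of the underlying distribution.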

For example:

import numpy as np

price = np.random.normal(0, 1, 10000)
print(np.percentile(price, [2.5, 50, 97.5]))

gives

[-1.97681778  0.01808908  1.93659551]

and

price = np.random.normal(0, 1, 100000000)
print(np.percentile(price, [2.5, 50, 97.5]))

gives almost the same:

[-1.96012643  9.82108813e-05  1.96030460]

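But if you run your CI calculation code and greatly increase the number of samples, your confidence interval shrinks, because you are now 95% confident that the mean of the distribution lies within a smaller range.

Using two price arrays generated the same way (mean = 0, sd = 1), one with 10 samples and one with 10,000 samples, your code gives: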

-0.5051688819759096 0.17504324224822834 0.744716862363091 # 10 samples
-0.02645090158517636 -0.006759616493022626 0.012353106820212557 # 10000 samples

As you can see, the CI is narrower with more samples (as you would expect from the formula for the CI!).
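On the question's second point: np.percentile() does fit a percentile bootstrap CI, where the percentiles are taken over the means of many resamples rather than over the raw data. The following is a minimal sketch under that assumption. The function name `bootstrap_mean_ci`, the number of resamples, and the seeds are hypothetical choices, not from the original post.

```python
import numpy as np

def bootstrap_mean_ci(sample, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of `sample`."""
    sample = np.asarray(sample, dtype=float)
    rng = np.random.default_rng(seed)
    n = sample.size
    # Draw n_boot resamples of size n with replacement (one per row).
    idx = rng.integers(0, n, size=(n_boot, n))
    boot_means = sample[idx].mean(axis=1)
    # Percentiles of the resampled means, not of the raw observations.
    lower, upper = np.percentile(
        boot_means, [100 * alpha / 2, 100 * (1 - alpha / 2)]
    )
    return lower, upper

rng = np.random.default_rng(42)      # arbitrary seed for reproducibility
price = rng.normal(11427, 5845, 30)  # same parameters as in the question
lower, upper = bootstrap_mean_ci(price)
print(lower, np.mean(price), upper)
```

For roughly normal data like this, the resulting interval should land close to the formula-based CI and be much narrower than np.percentile(price, [2.5, 97.5]).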

【Comments】:

Thanks for the answer! I was confusing percentiles with CIs. Facepalm. Thanks again.