Pandas DataFrame and numpy standard deviation are different (answers)

【Question title】: Pandas DataFrame and numpy standard deviation are different
【Posted】: 2020-10-14 14:52:36
【Question】:

Just asking: why are these standard deviations different?

>>> import numpy
>>> import pandas as pd
>>>
>>> arr = [10, 386, 479, 627, 20, 523, 482, 483, 542, 699, 535, 617, 577, 471, 615, 583, 441, 562,
...        563, 527, 453, 530, 433, 541, 585, 704, 443, 569, 430, 637, 331, 511, 552, 496, 484, 566, 554, 472,
...        335, 440, 579, 341, 545, 615, 548, 604, 439, 556, 442, 461, 624, 611, 444, 578, 405, 487, 490, 496,
...        398, 512, 422, 455, 449, 432, 607, 679, 434, 597, 639, 565, 415, 486, 668, 414, 665, 763, 557, 304,
...        404, 454, 689, 610, 483, 441, 657, 590, 492, 476, 437, 483, 529, 363, 711, 543]
>>> elements = numpy.asarray(arr)
>>> arr_D = {"A":arr}
>>> df = pd.DataFrame(arr_D)
>>>
>>> print(numpy.std(elements, axis=0))
118.51857760182034
>>> print(numpy.std(df['A']))
118.5185776018204
>>> print(df['A'].std(axis=0))
119.15407050904474

Is my understanding of this wrong? As far as I know, pandas is built on numpy, so the DataFrame std and the numpy std of the same column should be identical.

Is this a bug?


Tags: python pandas numpy data-science

【Solution 1】:

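pandas uses the unbiased estimate by default (ddof=1, dividing by n-1) while numpy does not (ddof=0, dividing by n). Neither result is wrong; the two libraries simply compute std with different defaults.
To make numpy use the unbiased estimate, pass ddof=1 to std: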

>>> import numpy
>>> import pandas

>>> df = pandas.DataFrame(numpy.random.rand(100))

>>> numpy.std(df[0])  # numpy default (ddof=0): biased estimate
0.2877601644414916

>>> numpy.std(df[0], ddof=1)  # ddof=1: unbiased estimate
0.2892098469889083

>>> df[0].std()  # pandas default (ddof=1): matches numpy.std with ddof=1
0.2892098469889083
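
To make the role of ddof concrete, here is a minimal sketch that computes the standard deviation by hand and compares it with both libraries. The helper name `std_with_ddof` and the sample `data` are made up for illustration; they are not part of numpy or pandas.

```python
import numpy as np
import pandas as pd


def std_with_ddof(values, ddof):
    """Hand-rolled standard deviation that divides by (n - ddof)."""
    x = np.asarray(values, dtype=float)
    n = x.size
    squared_dev = (x - x.mean()) ** 2
    return float(np.sqrt(squared_dev.sum() / (n - ddof)))


# Hypothetical sample: mean is 5, squared deviations sum to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]

population_std = std_with_ddof(data, ddof=0)  # divides by n     -> 2.0
sample_std = std_with_ddof(data, ddof=1)      # divides by n - 1 -> about 2.138

# numpy's default agrees with ddof=0, pandas' default with ddof=1.
print(population_std, np.std(data))
print(sample_std, pd.Series(data).std(), np.std(data, ddof=1))
```

If the two libraries must agree in a pipeline, passing ddof explicitly on both sides (for example `df['A'].std(ddof=0)`) avoids relying on either default.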


【Solution 2】:

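numpy uses the biased std and pandas uses the unbiased one. In other words, when averaging the squared deviations, numpy divides by n (the number of elements) and pandas divides by n-1. Try the following to see that they match: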

print(df['A'].std(axis=0) / numpy.sqrt(len(arr)) * numpy.sqrt(len(arr) - 1))
#118.51857760182033
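
The rescaling in that line follows from the two definitions. As a short derivation (the symbols $\sigma_n$ for the numpy default and $s$ for the pandas default are notation chosen here, not taken from either library):

```latex
\sigma_n = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^2},
\qquad
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^2}
\quad\Longrightarrow\quad
\sigma_n = s\,\sqrt{\frac{n-1}{n}}
```

The question's arr appears to hold n = 94 values, so $119.154 \times \sqrt{93/94} \approx 118.519$, which is consistent with the numbers printed above.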
