Why is "median" 2x faster than "mean" using the statistics package? Answers

【Question title】: How come "median" is 2x faster than "mean" using statistics package?
【Posted】: 2016-10-27 12:41:14
【Question description】:

This surprised me... To illustrate, I used this small piece of code to compute the mean and median of 1M random numbers:

import numpy as np
import statistics as st

import time

listofrandnum = np.random.rand(1000000,)

t = time.time()
print('mean is:', st.mean(listofrandnum))
print('time to calc mean:', time.time()-t)

print('\n')

t = time.time()
print('median is:', st.median(listofrandnum))
print('time to calc median:', time.time()-t)

The results are:

mean is: 0.499866595037
time to calc mean: 2.0767598152160645


median is: 0.499721597395
time to calc median: 0.9687695503234863

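My question: why is the mean slower than the median? The median needs some sorting algorithm (i.e. comparisons), while the mean only needs a sum. Does it make sense for summing to be slower than comparing?

I would appreciate any insight into this.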

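【Comments】:

FYI, you don't need to sort the whole array to find the median. You can do it in O(n) on average using quickselect.
I removed numpy from the title, since this question is about the statistics module, not numpy performance. I'll leave it in the tags.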
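
To illustrate the quickselect comment above, here is a minimal pure-Python sketch. The names quickselect and median_by_selection are hypothetical helpers written for this illustration, not part of statistics or NumPy, and the code assumes the data contains no NaN values. It favors readability over speed, so it is not meant as a replacement for a library routine.

```python
import random


def quickselect(values, k):
    """Return the k-th smallest element (0-based) in O(n) time on average."""
    items = list(values)
    if not 0 <= k < len(items):
        raise IndexError("k out of range")
    while True:
        pivot = random.choice(items)
        lows = [x for x in items if x < pivot]
        n_equal = sum(1 for x in items if x == pivot)
        if k < len(lows):
            # The target lies among the elements smaller than the pivot.
            items = lows
        elif k < len(lows) + n_equal:
            # The target is the pivot itself (or a duplicate of it).
            return pivot
        else:
            # Skip everything <= pivot and keep searching the larger part.
            k -= len(lows) + n_equal
            items = [x for x in items if x > pivot]


def median_by_selection(values):
    """Median computed with selection only; no full sort is performed."""
    items = list(values)
    n = len(items)
    if n == 0:
        raise ValueError("no median for empty data")
    mid = n // 2
    if n % 2:
        return quickselect(items, mid)
    # Even length: average the two middle elements.
    return (quickselect(items, mid - 1) + quickselect(items, mid)) / 2
```

Each pass discards the part of the data that cannot contain the target, which is where the average O(n) bound comes from.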

Tags: python python-3.x numpy statistics

【Solution 1】:

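statistics is not part of NumPy. It is a Python standard library module with a rather different design philosophy: it goes for accuracy at all costs, even for unusual input data types and extremely poorly conditioned input. Performing a sum the way the statistics module does is really expensive, more so than performing a sort.

If you want an efficient mean or median of a NumPy array, use the NumPy routines: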

numpy.mean(whatever)
numpy.median(whatever)
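
A quick way to check this on your own machine is to time both NumPy routines on an array of the same size as in the question. This is only a sketch: the seed, the repeat count of 10, and the variable names are arbitrary choices made for illustration, and the timings will differ from one machine to another.

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)   # seeded so repeated runs use the same data
data = rng.random(1_000_000)     # 1M uniform random floats in [0, 1)

# Average the wall-clock time over 10 calls of each routine.
mean_time = timeit.timeit(lambda: np.mean(data), number=10) / 10
median_time = timeit.timeit(lambda: np.median(data), number=10) / 10

print(f"np.mean:   {mean_time * 1e3:.2f} ms per call")
print(f"np.median: {median_time * 1e3:.2f} ms per call")
```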

If you want to see the expensive work the statistics module goes through, you can look at the source code:

def _sum(data, start=0):
    """_sum(data [, start]) -> (type, sum, count)

    Return a high-precision sum of the given numeric data as a fraction,
    together with the type to be converted to and the count of items.

    If optional argument ``start`` is given, it is added to the total.
    If ``data`` is empty, ``start`` (defaulting to 0) is returned.


    Examples
    --------

    >>> _sum([3, 2.25, 4.5, -0.5, 1.0], 0.75)
    (<class 'float'>, Fraction(11, 1), 5)

    Some sources of round-off error will be avoided:

    >>> _sum([1e50, 1, -1e50] * 1000)  # Built-in sum returns zero.
    (<class 'float'>, Fraction(1000, 1), 3000)

    Fractions and Decimals are also supported:

    >>> from fractions import Fraction as F
    >>> _sum([F(2, 3), F(7, 5), F(1, 4), F(5, 6)])
    (<class 'fractions.Fraction'>, Fraction(63, 20), 4)

    >>> from decimal import Decimal as D
    >>> data = [D("0.1375"), D("0.2108"), D("0.3061"), D("0.0419")]
    >>> _sum(data)
    (<class 'decimal.Decimal'>, Fraction(6963, 10000), 4)

    Mixed types are currently treated as an error, except that int is
    allowed.
    """
    count = 0
    n, d = _exact_ratio(start)
    partials = {d: n}
    partials_get = partials.get
    T = _coerce(int, type(start))
    for typ, values in groupby(data, type):
        T = _coerce(T, typ)  # or raise TypeError
        for n,d in map(_exact_ratio, values):
            count += 1
            partials[d] = partials_get(d, 0) + n
    if None in partials:
        # The sum will be a NAN or INF. We can ignore all the finite
        # partials, and just look at this special one.
        total = partials[None]
        assert not _isfinite(total)
    else:
        # Sum all the partial sums using builtin sum.
        # FIXME is this faster if we sum them in order of the denominator?
        total = sum(Fraction(n, d) for d, n in sorted(partials.items()))
    return (T, total, count)
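
To make the cost more concrete, here is a simplified sketch of the same idea: every value is converted to an exact integer ratio, and numerators are accumulated per denominator. The function exact_sum is a hypothetical stand-in written for illustration. It is not the module's code, and it leaves out the type coercion and the NaN/INF handling that _sum performs, so it assumes finite ints and floats only.

```python
from fractions import Fraction


def exact_sum(values):
    """Sum finite ints/floats exactly, in the spirit of statistics._sum."""
    partials = {}  # maps denominator -> running sum of numerators
    for x in values:
        ratio = Fraction(x)  # exact conversion, no rounding for a float
        d = ratio.denominator
        partials[d] = partials.get(d, 0) + ratio.numerator
    # Combine the per-denominator partial sums into one exact Fraction.
    return sum(Fraction(n, d) for d, n in partials.items())


data = [1e50, 1, -1e50] * 1000
print(exact_sum(data))  # prints 1000
```

Each element goes through an exact conversion and arbitrary-precision integer arithmetic in Python-level code, which is far more work per item than a floating-point add.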

【Comments】:

Using the numpy methods, on my computer the mean takes 2 ms and the median takes 16 ms.