Why do np.mean() and mean() give me different output? Answers

【Question title】: Why do np.mean() and mean() give me different output?
【Posted】: 2020-08-06 22:29:42
【Question】:

Interestingly, using np.mean() or mean() gives me different output.

from statistics import mean
import numpy as np
import matplotlib.pyplot as plt

xs = np.array([1, 2, 3, 4, 5, 6])
ys = np.array([5, 4, 6, 5, 6, 7])

def best_fit_slope(xs, ys):
    numerator = (mean(xs)*mean(ys)) - mean(xs*ys)
    denominator = mean(xs)**2 - mean(xs**2)
    return numerator/denominator 

m = best_fit_slope(xs, ys)
print(m)

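Output >>> 0.8333333333333334

But if I replace mean() with np.mean(), the output is >>> 0.42857142857142866.

I was following this video. He just uses mean() and gets an output of 0.42857. Can anyone explain why there is a difference? I know that for most linear algebra operations, or operations involving arrays, I would rather use np.mean().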

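【Comments】:

Please provide the expected minimal, reproducible example. Show where the intermediate results differ from what you expected.
You have five mean operations. Which of the returned values did you not expect? How do the function definitions differ? We expect you to do basic diagnostic tracing to pin down the point of confusion. "My program gives different output" is more general than what we expect of you; put in some intermediate prints to find where you get confused.
Interesting, mean seems to truncate/round. Maybe it calls np.int64 on the result?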
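A minimal sketch of the diagnostic tracing suggested in the comments above: it prints every mean that best_fit_slope computes, once with statistics.mean and once with np.mean, along with the type that comes back. The names terms and results are illustrative only and do not appear in the original question.

```python
from statistics import mean
import numpy as np

xs = np.array([1, 2, 3, 4, 5, 6])
ys = np.array([5, 4, 6, 5, 6, 7])

# The four distinct arrays that best_fit_slope takes the mean of
# (mean(xs) is used twice, which makes five calls in total).
terms = {"xs": xs, "ys": ys, "xs*ys": xs * ys, "xs**2": xs ** 2}

results = {}
for label, arr in terms.items():
    stat_value = mean(arr)     # statistics.mean
    np_value = np.mean(arr)    # numpy.mean
    results[label] = (stat_value, np_value)
    print(
        f"{label:6} "
        f"statistics.mean={stat_value!r} ({type(stat_value).__name__})  "
        f"np.mean={np_value!r} ({type(np_value).__name__})"
    )
```

If statistics.mean comes back as an integer type while np.mean comes back as a float, the mismatch sits in the individual means rather than in the slope formula itself.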

Tags: python arrays numpy linear-regression

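【Solution 1】:

This is due to how the statistics package tries to give you consistent output based on the numeric type you pass in, so that it handles int, float, decimal.Decimal, and fractions.Fraction the way you would hope. Unfortunately, numpy types do not play nicely with the Python numeric type hierarchy. So we can look at the source code (this is the Python version; your runtime may be using a fast C version, but they should work equivalently...):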

def mean(data):
    """Return the sample arithmetic mean of data.
    >>> mean([1, 2, 3, 4, 4])
    2.8
    >>> from fractions import Fraction as F
    >>> mean([F(3, 7), F(1, 21), F(5, 3), F(1, 3)])
    Fraction(13, 21)
    >>> from decimal import Decimal as D
    >>> mean([D("0.5"), D("0.75"), D("0.625"), D("0.375")])
    Decimal('0.5625')
    If ``data`` is empty, StatisticsError will be raised.
    """
    if iter(data) is data:
        data = list(data)
    n = len(data)
    if n < 1:
        raise StatisticsError('mean requires at least one data point')
    T, total, count = _sum(data)
    assert count == n
    return _convert(total/n, T)

So, essentially, it uses a type-aware sum that returns the type, the total, and the count. Then total/count is coerced to T. Note:

In [28]: T, total, count = statistics._sum(np.array([1,2,3]))

In [29]: T, total, count
Out[29]: (numpy.int64, Fraction(6, 1), 3)

In [30]: total / count
Out[30]: Fraction(2, 1)

In [31]: T(total / count)
Out[31]: 2

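Note that all the integer objects you see here are actually numpy.int64, not vanilla int objects. But why doesn't this happen when we do statistics.mean([1,2,3,4])? Well, because the library was built assuming normal Python numeric types. Take a peek at the _convert function: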

def _convert(value, T):
    """Convert value to given numeric type T."""
    if type(value) is T:
        # This covers the cases where T is Fraction, or where value is
        # a NAN or INF (Decimal or float).
        return value
    if issubclass(T, int) and value.denominator != 1:
        T = float
    try:
        # FIXME: what do we do if this overflows?
        return T(value)
    except TypeError:
        if issubclass(T, Decimal):
            return T(value.numerator)/T(value.denominator)
        else:
            raise

You will notice that it special-cases if issubclass(T, int) and value.denominator != 1, i.e. you have an int type and the denominator is not one, so you need a float:

        T = float

But:

In [36]: issubclass(np.int64, int)
Out[36]: False

So T stays np.int64, and:

In [37]: total / count
Out[37]: Fraction(2, 1)

In [38]: np.int64(total / count)
Out[38]: 2
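
One possible workaround, sketched under the assumption that you want to keep using statistics.mean: hand it plain Python floats so the type check in _convert no longer sees a numpy integer type. The helper float_mean and the mean_fn parameter are hypothetical names introduced for this illustration; they are not part of statistics or numpy.

```python
from statistics import mean
import numpy as np

xs = np.array([1, 2, 3, 4, 5, 6])
ys = np.array([5, 4, 6, 5, 6, 7])

# The check that _convert relies on: numpy integers are not int subclasses,
# whereas np.float64 does subclass the built-in float.
print(issubclass(np.int64, int))      # False
print(issubclass(np.float64, float))  # True

def float_mean(values):
    """Hypothetical helper: statistics.mean over plain Python floats."""
    return mean(np.asarray(values, dtype=float).tolist())

def best_fit_slope(xs, ys, mean_fn):
    # Same formula as in the question, with the mean function passed in.
    numerator = mean_fn(xs) * mean_fn(ys) - mean_fn(xs * ys)
    denominator = mean_fn(xs) ** 2 - mean_fn(xs ** 2)
    return numerator / denominator

slope_np = best_fit_slope(xs, ys, np.mean)
slope_stat = best_fit_slope(xs, ys, float_mean)
print(slope_np, slope_stat)  # both should be close to 3/7 (about 0.42857)
```

Simply calling np.mean, as the question already tried, is the more direct route when the data is a numpy array anyway.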

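【Comments】:

Oh, nice to find a # FIXME in a production-ready release :)
@juanpa.arrivillaga, what is a vanilla int object?
@juanpa.arrivillaga & DeepSpace you guys rock!
@Yi.D I mean a plain int. numpy uses entirely different types; in fact, arrays do not actually contain objects of those types. numpy is essentially an object-oriented wrapper around primitive, numeric, C-like arrays. When you access an element of a numpy array, e.g. array[0], it creates a new Python object each time, and that object is not an int but something like numpy.int64. Note that this is why array[0] is array[0] will never be true... but for a Python list, mylist[0] is mylist[0] is always true.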
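
A short sketch of the point made in the comment above, using illustrative variable names (arr, lst, elem_type) that are not from the thread: indexing a numpy integer array yields a fresh numpy scalar on every access, while indexing a list hands back the stored object itself.

```python
import numpy as np

arr = np.array([1, 2, 3])
lst = [1, 2, 3]

elem_type = type(arr[0])
print(elem_type)                          # a numpy integer type, e.g. numpy.int64
print(isinstance(arr[0], int))            # False: not a plain Python int
print(issubclass(elem_type, np.integer))  # True

# Each arr[0] builds a new scalar object, so the identity check fails.
same_array_obj = arr[0] is arr[0]
# A list stores references, so both lookups return the same object.
same_list_obj = lst[0] is lst[0]
print(same_array_obj, same_list_obj)      # False True
```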

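【Solution 2】:

Interestingly, this nuance is not explicitly documented in the official docs, but it can be inferred from the examples provided there.

statistics.mean does its best to return output of the same type as its input. When you give it np.array([1, 2, 3, 4, 5, 6]) (an array of np.int32 here), it assumes an integer output is expected: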

xs = np.array([1, 2, 3, 4, 5, 6])
print(mean(xs))
# 3
print(type(mean(xs)))
# <class 'numpy.int32'>

Forcing a single value in the array to be a float is enough to "convince" it that we want a float back:

xs = np.array([1.0, 2, 3, 4, 5, 6])
# or np.array([1,2,3,4,5,6],dtype=np.float64) or any other way that gives a float `dtype`
print(mean(xs))
# 3.5
print(type(mean(xs)))
# <class 'numpy.float64'>

If we dig into its implementation, we can see where this behavior comes from. It uses the _sum function, which is documented as follows:

def _sum(data, start=0):
    """_sum(data [, start]) -> (type, sum, count)

    Return a high-precision sum of the given numeric data as a fraction,
    together with the type to be converted to and the count of items.

    If optional argument ``start`` is given, it is added to the total.
    If ``data`` is empty, ``start`` (defaulting to 0) is returned.


    Examples
    --------

    >>> _sum([3, 2.25, 4.5, -0.5, 1.0], 0.75)
    (<class 'float'>, Fraction(11, 1), 5)

    Some sources of round-off error will be avoided:

    # Built-in sum returns zero.
    >>> _sum([1e50, 1, -1e50] * 1000)
    (<class 'float'>, Fraction(1000, 1), 3000)

    Fractions and Decimals are also supported:

    >>> from fractions import Fraction as F
    >>> _sum([F(2, 3), F(7, 5), F(1, 4), F(5, 6)])
    (<class 'fractions.Fraction'>, Fraction(63, 20), 4)

    >>> from decimal import Decimal as D
    >>> data = [D("0.1375"), D("0.2108"), D("0.3061"), D("0.0419")]
    >>> _sum(data)
    (<class 'decimal.Decimal'>, Fraction(6963, 10000), 4)

    Mixed types are currently treated as an error, except that int is
    allowed.
    """

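【Comments】:

That is not an array of int, it is an array with dtype=numpy.int64, but yes, that is basically it
@juanpa.arrivillaga True, but the point still holds. I will update the answer anyway
Well, not exactly, because it gives the correct value for a list of int objects, or for that matter a numpy array of int objects (with dtype=object), so it is a bit more subtle
np.array([1., 2, 3, 4, 5, 6]) is just confusing; you should do something like np.array([1,2,3,4,5,6],dtype=np.float64), or make them all floats.
denominator = mean(xs)**2 - mean(xs**2) --> check this line; the parentheses could make a difference, and np.mean(number) may be where the difference comes from.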