Accounting for errors when creating a histogram

【Question title】: Accounting for errors when creating a histogram
【Posted】: 2014-11-30 19:23:27
【Question description】:

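I have a set of N observations distributed as points (x[i], y[i]), i=0..N, in 2D space. Each point has an associated error in both coordinates (e_x[i], e_y[i], i=0..N) and a weight attached to it (w[i], i=0..N).

I would like to generate a 2D histogram of these N points that accounts not only for the weights but also for the errors, which could spread each point across many bins if the error values are large enough (assuming the errors follow a standard Gaussian distribution, although other distributions could be considered).

I see that numpy.histogram2d has a weights parameter, so that part is taken care of. The question is how to account for the errors in each of the N observed points.

Is there a function that would let me do this? I'm open to anything in numpy and scipy.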

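【Comments】:

What do these error values represent? Are they standard deviations along the main axes?
OK, so this set of parameters makes up a multivariate GMM with the given weights (\pi_i), the samples as the means (\mu_i), and covariance matrices (\Sigma_i) given by [[e_x[i]**2, 0], [0, e_y[i]**2]]. Unlike the standard normal case you assumed (which corresponds to all e_x and e_y being equal to 1.0), you have covariance matrices whose diagonal entries can take different values. This corresponds to ellipses whose major axes lie along the main axes, rather than circles. Does this help you move forward?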
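
Written out, the mixture described in the comment above has the density sketched below. One assumption is added here that the comment does not state: the weights are normalized to sum to one, i.e. \pi_i = w_i / \sum_j w_j.

```latex
p(x, y) = \sum_{i} \pi_i \,
  \mathcal{N}\!\left( \begin{pmatrix} x \\ y \end{pmatrix} ;\, \mu_i, \Sigma_i \right),
\qquad
\mu_i = \begin{pmatrix} x_i \\ y_i \end{pmatrix},
\quad
\Sigma_i = \begin{pmatrix} e_{x,i}^2 & 0 \\ 0 & e_{y,i}^2 \end{pmatrix},
\quad
\pi_i = \frac{w_i}{\sum_j w_j}
```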

Tags: python numpy scipy histogram histogram2d

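【Solution 1】:

Following user1415946's comment, you can assume that each point represents a bivariate normal distribution with covariance matrix given by [[e_x[i]**2, 0], [0, e_y[i]**2]]. However, the resulting distribution is not a normal distribution: after running the example you will see that the histogram does not resemble a single Gaussian at all, but rather a set of them.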
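
As an aside, because the covariance matrix above is diagonal, the probability that a point falls in a given rectangular bin factorizes into a product of two 1D normal CDF differences, so the expected histogram can also be computed without random sampling. The following is only a minimal sketch of that idea, not part of the original answer: the function name `expected_histogram2d` and the bin edges are hypothetical choices made for illustration, and it assumes independent Gaussian errors in x and y.

```python
import numpy as np
from scipy.stats import norm


def expected_histogram2d(x, y, ex, ey, w, xedges, yedges):
    """Expected weighted 2D histogram for points with independent Gaussian errors.

    Hypothetical helper (not a numpy/scipy function). Returns an array of
    shape (len(xedges) - 1, len(yedges) - 1), indexed as hist[x_bin, y_bin].
    """
    x, y, ex, ey, w = (np.asarray(a, dtype=float) for a in (x, y, ex, ey, w))
    xedges = np.asarray(xedges, dtype=float)
    yedges = np.asarray(yedges, dtype=float)
    # Probability that point i lands in each x bin: CDF difference across edges.
    # Resulting shapes: (n_points, n_xbins) and (n_points, n_ybins).
    px = np.diff(norm.cdf(xedges[None, :], loc=x[:, None], scale=ex[:, None]), axis=1)
    py = np.diff(norm.cdf(yedges[None, :], loc=y[:, None], scale=ey[:, None]), axis=1)
    # Diagonal covariance: the 2D bin probability is px * py for each point.
    # Sum over points, scaling each point's contribution by its weight.
    return np.einsum('i,ij,ik->jk', w, px, py)


# Same artificial data as in this answer's example; columns: x, y, ex, ey, w
data = np.array([[1, 2, 1, 1, 1],
                 [3, 0, 1, 1, 2],
                 [0, 1, 2, 1, 5],
                 [7, 7, 1, 3, 1]], dtype=float)
xedges = np.linspace(-10, 20, 31)  # assumed bin edges, wide enough to hold the data
yedges = np.linspace(-10, 20, 31)
hist = expected_histogram2d(*data.T, xedges, yedges)
```

Here each point contributes a total of w[i] spread over the bins, so the values are on a different scale from a sampled histogram in which every generated sample carries the full weight; multiplying by the number of samples per point would make the two comparable.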

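To create a histogram from this set of distributions, one approach I can see is to generate random samples from each point using numpy.random.multivariate_normal. See the example code below, which uses some artificial data.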

import numpy as np
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (importing registers the '3d' projection on older matplotlib)
import matplotlib.pyplot as plt


# This is a function I like to use for plotting histograms
def plotHistogram3d(hist, xedges, yedges):
    fig = plt.figure()
    ax = fig.add_subplot(111, projection='3d')
    hist = hist.transpose()
    # Transposing is done so that bar3d x and y match hist shape correctly
    dx = np.mean(np.diff(xedges))
    dy = np.mean(np.diff(yedges))

    # Computing the number of elements
    elements = (len(xedges) - 1) * (len(yedges) - 1)
    # Generating mesh grids.
    xpos, ypos = np.meshgrid(xedges[:-1]+dx/2.0, yedges[:-1]+dy/2.0)

    # Vectorizing matrices
    xpos = xpos.flatten()
    ypos = ypos.flatten()
    zpos = np.zeros(elements)
    # 0.5 factor leaves room between bars; use 1.0 to have all bars 'glued' to each other
    dx = dx * np.ones_like(zpos) * 0.5
    dy = dy * np.ones_like(zpos) * 0.5
    dz = hist.flatten()

    ax.bar3d(xpos, ypos, zpos, dx, dy, dz, color='b')
    ax.set_xlabel('x')
    ax.set_ylabel('y')
    ax.set_zlabel('Count')
    return

"""
INPUT DATA
"""
#                 x  y ex ey  w
data = np.array([[1, 2, 1, 1, 1],
                 [3, 0, 1, 1, 2],
                 [0, 1, 2, 1, 5],
                 [7, 7, 1, 3, 1]])

"""
Generate samples
"""
# Sample size (100 samples will be generated for each data point)
SAMPLE_SIZE = 100
# I want to fill in a table with columns [x, y, w]. Each data point generates SAMPLE_SIZE
# samples, so we have SAMPLE_SIZE * (number of data points) generated points
points = np.zeros((SAMPLE_SIZE * data.shape[0], 3))  # Initializing this matrix

for i, element in enumerate(data):  # For each row in the data set
    meanVector = element[:2]
    covarianceMatrix = np.diag(element[2:4]**2)  # Diagonal matrix with elements equal to error^2
    # For columns 0 and 1, add generated x and y samples
    points[SAMPLE_SIZE*i:SAMPLE_SIZE*(i+1), :2] = \
        np.random.multivariate_normal(meanVector, covarianceMatrix, SAMPLE_SIZE)
    # For column 2, simply copy original weight
    points[SAMPLE_SIZE*i:SAMPLE_SIZE*(i+1), 2] = element[4]  # weights

hist, xedges, yedges = np.histogram2d(points[:, 0], points[:, 1], weights=points[:, 2])
plotHistogram3d(hist, xedges, yedges)
plt.show()

The result is shown in the figure below:

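【Comments】:

Gabriel, could you add some comments describing what each line in your example does? Also, which version of matplotlib are you running? I have version 1.3.1, and trying to run your example gives me ValueError: Unknown projection '3d'; this is odd, since the example given here, stackoverflow.com/q/3810865/1391441, does work correctly.
I'm using the same version as you, but I mistakenly deleted an import line before posting the answer. It should work now. Thanks.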