Python: Binning one coordinate and averaging another based on these bins

【Question title】: Python: Binning one coordinate and averaging another based on these bins
【Posted】: 2016-03-12 01:53:17
【Question】:

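I have two vectors, rev_count and stars. Their elements form pairs (say rev_count is the x coordinate and stars is the y coordinate).

I would like to bin the data by rev_count and then average stars within each rev_count bin (that is, bin along the x axis and compute the mean y coordinate in each bin).

This is the code I tried (inspired by my MATLAB background):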

import matplotlib.pyplot as plt
import numpy

binwidth = numpy.max(rev_count)/10
revbin = range(0, numpy.max(rev_count), binwidth)
revbinnedstars = [None]*len(revbin)

for i in range(0, len(revbin)-1):
    revbinnedstars[i] = numpy.mean(stars[numpy.argwhere((revbin[i]-binwidth/2) < rev_count < (revbin[i]+binwidth/2))])

print('Plotting binned stars with count')
plt.figure(3)
plt.plot(revbin, revbinnedstars, '.')
plt.show()

However, this seems very slow and inefficient. Is there a more natural way to do this in Python?

【Comments】:

Tags: python numpy matplotlib binning

【Answer 1】:

SciPy has a function for this:

from scipy.stats import binned_statistic

revbinnedstars, edges, _ = binned_statistic(rev_count, stars, 'mean', bins=10)
revbin = edges[:-1]
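
Note that edges[:-1] holds the left edge of each bin. For plotting, bin midpoints are often a more natural x position. The following is a minimal sketch on made-up sample data (the values of rev_count and stars here are assumptions for illustration, not the asker's data):

```python
import numpy as np
from scipy.stats import binned_statistic

# Hypothetical sample data standing in for rev_count (x) and stars (y).
rev_count = np.array([1, 2, 3, 11, 12, 28, 29, 30])
stars = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 2.0, 3.0, 4.0])

# Mean of stars in 3 equal-width bins spanning [min(rev_count), max(rev_count)].
mean_stars, edges, _ = binned_statistic(rev_count, stars, statistic='mean', bins=3)

# Midpoint of each bin: one value per bin, usable as x positions in a plot.
centers = (edges[:-1] + edges[1:]) / 2
```

With this data, centers and mean_stars have the same length, so they can be passed straight to plt.plot(centers, mean_stars, '.').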

If you don't want to use SciPy, NumPy also has a histogram function:

sums, edges = numpy.histogram(rev_count, bins=10, weights=stars)
counts, _ = numpy.histogram(rev_count, bins=10)
revbinnedstars = sums / counts
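
One caveat with the division above: a bin that contains no points has a count of zero. A possible way to handle that, sketched here on made-up sample data (the arrays are assumptions for illustration), is to divide only where the count is positive and leave empty bins as NaN:

```python
import numpy as np

# Hypothetical sample data; no point falls into the middle bin.
rev_count = np.array([0, 1, 2, 28, 29, 30])
stars = np.array([1.0, 2.0, 3.0, 3.0, 4.0, 5.0])

sums, edges = np.histogram(rev_count, bins=3, weights=stars)
counts, _ = np.histogram(rev_count, bins=3)

# Divide only where a bin has data; empty bins keep the NaN fill value
# instead of triggering a divide-by-zero warning.
mean_stars = np.full(counts.shape, np.nan)
np.divide(sums, counts, out=mean_stars, where=counts > 0)
```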

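【Comments】:

Will try it, looks promising. I already use SciPy in another part of the code.

【Answer 2】:

I assume you are using Python 2, but if not, you should change the division to // (floor division) when calculating the step. Otherwise numpy will complain that it cannot interpret a float as the step.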

binwidth = numpy.max(rev_count)//10 # Changed this to floor division
revbin = range(0, numpy.max(rev_count), binwidth)
revbinnedstars = [None]*len(revbin)

for i in range(0, len(revbin)-1):
    # I actually don't know what you wanted to do but I guess you wanted the
    # "logical and" combination in that bin (you don't need to use np.where here)
    # You can put that all in one statement but it gets crowded so I'll split it:
    index1 = revbin[i]-binwidth/2 < rev_count
    index2 = rev_count < revbin[i]+binwidth/2
    revbinnedstars[i] = numpy.mean(stars[numpy.logical_and(index1, index2)])

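This should at least work and give the correct results. It will be very inefficient if you have a huge dataset and need more than 10 bins.

One very important takeaway:

Don't use np.argwhere if you want to index an array. Its result is meant to be human-readable. If you really want the coordinates, use np.where. That result can be used as an index, but it is not pretty to read if you have multidimensional input.

The numpy documentation supports me on this point:

The output of argwhere is not suited for indexing arrays. For this purpose use where(a) instead.

This is also why your code was so slow. It tried to do something you did not want it to do, which can be very expensive in memory and CPU usage, and it did not give you the right result either.

What I did here is called boolean masks. They are shorter to write than np.where(condition) and involve one less calculation.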
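
To make the difference concrete, here is a small sketch on made-up data (the arrays and the bin limits 0 and 10 are assumptions for illustration). It compares the shapes you get back when indexing with np.argwhere, np.where, and a boolean mask:

```python
import numpy as np

# Hypothetical sample data.
rev_count = np.array([5, 15, 25, 8])
stars = np.array([1.0, 2.0, 3.0, 4.0])

# Boolean mask, shape (4,). A chained comparison such as
# 0 < rev_count < 10 raises ValueError on arrays with more than one element,
# so the two conditions are combined with & instead.
mask = (rev_count > 0) & (rev_count < 10)

coords = np.argwhere(mask)            # shape (2, 1): one row per match
picked_argwhere = stars[coords]       # shape (2, 1): an extra dimension appears

picked_where = stars[np.where(mask)]  # shape (2,)
picked_mask = stars[mask]             # shape (2,), no index array is built
```

The mean over picked_argwhere happens to equal the mean over picked_mask for 1-D input, but the extra dimension is a sign that argwhere is not meant to be used as an index.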

You can use a completely vectorized approach by defining a grid that knows which stars are in which bin:

import numpy as np

bins = 10
binwidth = np.max(rev_count)//bins
revbin = np.arange(0, np.max(rev_count)+binwidth+1, binwidth)

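A better way to define the bins is the following. Be aware that you have to add one to the maximum, since you want to include it, and one to the number of bins, because you are interested in the bin start and end points rather than the bin centers: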

number_of_bins = 10
revbin = np.linspace(np.min(rev_count), np.max(rev_count)+1, number_of_bins+1)

Then you can set up the grid:

grid = np.logical_and(rev_count[None, :] >= revbin[:-1, None], rev_count[None, :] < revbin[1:, None])

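The grid has shape bins x rev_count (because of broadcasting: I added one dimension to each array, but not the same one). It essentially checks whether a point is greater than or equal to the lower bin edge and less than the upper bin edge (hence the [:-1] and [1:] indices). This is done multidimensionally, with the counts along the second dimension (numpy axis=1) and the bins along the first dimension (numpy axis=0).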
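
The shapes involved in that broadcast can be checked on a tiny example. This is only a sketch with made-up data and two bins (both assumptions for illustration); the intermediate names lower, upper and points are hypothetical and not part of the answer's code:

```python
import numpy as np

# Hypothetical sample data.
rev_count = np.array([0, 3, 7, 9])
revbin = np.linspace(np.min(rev_count), np.max(rev_count) + 1, 2 + 1)  # 2 bins

lower = revbin[:-1, None]     # shape (2, 1): lower edge of each bin
upper = revbin[1:, None]      # shape (2, 1): upper edge of each bin
points = rev_count[None, :]   # shape (1, 4): one column per data point

# Broadcasting (2, 1) against (1, 4) gives a (2, 4) boolean grid.
grid = np.logical_and(points >= lower, points < upper)
```

Each column of grid contains exactly one True, because every point falls into exactly one half-open bin.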

So we can get the y coordinates of the stars in the appropriate bin by multiplying them with this grid:

stars * grid

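To compute the mean, we need the sum of the coordinates in each bin divided by the number of stars in that bin (the points of each bin lie along axis=1; stars that are not in the bin contribute only zeros along this axis):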

revbinnedstars = np.sum(stars * grid, axis=1) / np.sum(grid, axis=1)

I don't actually know whether this is more efficient. It will be much more expensive in memory, but possibly a bit cheaper in CPU.
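
If the memory cost of the bins x rev_count grid is a concern, one alternative (a different technique from the grid above, shown only as a sketch on made-up data) is to compute one bin index per point with np.digitize and then accumulate with np.bincount. This keeps memory proportional to the number of points:

```python
import numpy as np

# Hypothetical sample data.
rev_count = np.array([0, 3, 7, 9])
stars = np.array([1.0, 3.0, 2.0, 4.0])
number_of_bins = 2

revbin = np.linspace(np.min(rev_count), np.max(rev_count) + 1, number_of_bins + 1)

# Bin index of every point. digitize returns 1 for the first bin, so shift to 0.
idx = np.digitize(rev_count, revbin) - 1

# Per-bin sum of stars and per-bin number of points.
sums = np.bincount(idx, weights=stars, minlength=number_of_bins)
counts = np.bincount(idx, minlength=number_of_bins)
revbinnedstars = sums / counts   # assumes every bin holds at least one point
```

Because the last edge is max(rev_count)+1, no point lands on the final edge, so every index stays within 0 .. number_of_bins-1.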

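【Comments】:

This is Python 3 and numpy didn't complain, but I will switch to floor division. I didn't realize Python supported boolean masks; I'll try that now. The code still seems slow. Once the first approach finishes executing I'll try your second approach. Thanks for your help! Edit: Oh, I read a
@Ilya - The second approach still had a small bug. I have updated the answer. Depending on your sample size and number of bins, these approaches differ completely in execution time and memory usage. Do you have numbers for those sizes?
There are several billion rows. I'll try scipy.stats.binned_statistic, as the other poster suggested.

【Answer 3】:

The function I use for binning (x, y) data and determining summary statistics, such as the mean within those bins, is based on the scipy.stats.binned_statistic() function. I have written a wrapper for it because I use it a lot. You may find it useful...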

from scipy import stats

def binXY(x,y,statistic='mean',xbins=10,xrange=None):
    """
    Finds statistical value of x and y values in each x bin. 
    Returns the same type of statistic for both x and y.
    See scipy.stats.binned_statistic() for options.
    
    Parameters
    ----------
    x : array
        x values.
    y : array
        y values.
    statistic : string or callable, optional
        See documentation for scipy.stats.binned_statistic(). Default is mean.
    xbins : int or sequence of scalars, optional
        If xbins is an integer, it is the number of equal bins within xrange.
        If xbins is an array, then it is the location of xbin edges, similar
        to definitions used by np.histogram. Default is 10 bins.
        All but the last (righthand-most) bin is half-open. In other words, if 
        bins is [1, 2, 3, 4], then the first bin is [1, 2) (including 1, but 
        excluding 2) and the second [2, 3). The last bin, however, is [3, 4], 
        which includes 4.    
        
    xrange : (float, float) or [(float, float)], optional
        The lower and upper range of the bins. If not provided, range is 
        simply (x.min(), x.max()). Values outside the range are ignored.
    
    Returns
    -------
    x_stat : array
        The x statistic (e.g. mean) in each bin. 
    y_stat : array
        The y statistic (e.g. mean) in each bin.       
    n : array of dtype int
        The count of y values in each bin.
        """
    x_stat, xbin_edges, binnumber = stats.binned_statistic(x, x, 
                                 statistic=statistic, bins=xbins, range=xrange)
    
    y_stat, xbin_edges, binnumber = stats.binned_statistic(x, y, 
                                 statistic=statistic, bins=xbins, range=xrange)
    
    n, xbin_edges, binnumber = stats.binned_statistic(x, y, 
                                 statistic='count', bins=xbins, range=xrange)
            
    return x_stat, y_stat, n

【Comments】: