How to Perform Vectorised Binning Using Only Pandas

【Question title】: How to Perform Vectorised Binning Using Only Pandas
【Posted】: 2018-08-29 09:57:40
【Question】:

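I am trying to find the correct syntax for selecting a slice of rows in a Pandas DataFrame, conditional on a multi-dimensional slice.

I want to perform histogram binning by supplying the bins in a multi-dimensional numpy array and checking, in a vectorised way, whether each record falls into one bin or another. The result should be a one-dimensional numpy array containing the number of items in each bin.

My initial mock-up attempt is shown below for reference, although I have provided a partial implementation (using a loop instead) in my answer below: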

import numpy as np
import pandas as pd

## Generate Random Data
X = np.random.normal(0.5,0.1,100)

## Populate a Pandas DataFrame
DF = pd.DataFrame({'x':X})

## Some example, hardcoded 1D bins. 
bins = np.array([
                [[0.0,0.2]],
                [[0.2,0.4]],
                [[0.4,0.6]],
                [[0.6,0.8]],
                [[0.8,1.0]]
                ])

hist = np.zeros(shape=(4,))
hist[:] = np.sum(
                 DF.loc[   (DF >= bins[:,:,0]) &
                           (DF > bins[:,:,1])
                        ].dropna(how='all')
                 )

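In general the data are n-dimensional, and the bins follow the pattern above, with:

[[x_min, x_max], [y_min, ymax], [z_min, z_max]]

for each bin (hence the apparently "extra" level of nesting in the one-dimensional example above). The slice should therefore apply to a DataFrame with several columns, such that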

DF['x'] >= x_min and DF['x'] < x_max and 
DF['y'] >= y_min and DF['y'] < y_max

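and so on. It therefore needs to be independent of the number of dimensions; slicing seems the most natural way to achieve this and, if it can be done, should be more computationally efficient.

If not, the list-comprehension approach in my answer could be used, but I am having trouble with the multi-dimensionality.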
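For illustration, the per-column condition described above can be written for a single bin without naming any column. This is only a minimal sketch under stated assumptions: the small DataFrame `df`, the bin `one_bin` and the variable names are made up for the example, and it relies on pandas aligning a length-n array with the n columns of the frame.

```python
import numpy as np
import pandas as pd

# Small hand-made data set (hypothetical) so the count can be checked by eye.
df = pd.DataFrame({'x': [0.1, 0.3, 0.5, 0.7], 'y': [0.2, 0.9, 0.4, 0.6]})

# One bin in the question's layout: [[x_min, x_max], [y_min, y_max]].
one_bin = np.array([[0.2, 0.8], [0.0, 0.7]])
lower, upper = one_bin[:, 0], one_bin[:, 1]

# A length-n array is matched against the n columns, so each column is
# compared with its own edge; .all(axis=1) then ANDs the tests across columns.
in_bin = ((df >= lower) & (df < upper)).all(axis=1)

count = int(in_bin.sum())
print(count)
```

The same expression works unchanged for one, two or more columns, as long as the bin has one [min, max] pair per column in column order.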

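【Comments】:

Your code does not run, probably because bins[:][0] and bins[:][1] are still arrays. However, please make sure your code runs, or include code with errors only if it directly illustrates what your problem is.
Thanks; I know the code does not run, and that is the problem! I cannot see how to write that line (the one beginning hist[:] =) so that the hist object is populated correctly. I would appreciate advice on how to write this slice correctly or, if that is not possible, an explanation of why, or of how to write it differently.

Tags: python pandas histogram vectorization slice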

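【Solution 1】:

I am not sure whether you really need pandas, but numpy has a multi-dimensional histogram function called histogramdd.

Here is a test loop that generates three arrays with an increasing number of columns, each 100 rows long, together with the corresponding bin arrays, all using the example bin edges from above.

See whether this is what you are looking for: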

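import numpy as np

# i is the number of dimensions (columns) of the test data
for i in range(1, 4):
    data = np.random.random([100, i])
    # the same five equal-width bins on [0, 1] along every dimension
    bins = np.linspace(0, 1, 6)
    bins = [bins for _ in range(i)]
    print('shape of data: ', np.shape(data))
    print('bin borders: ',bins)
    print('\nresult: ', np.histogramdd(data, bins), '\n\n')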

Result:

shape of data:  (100, 1)
bin borders:  [array([ 0. ,  0.2,  0.4,  0.6,  0.8,  1. ])]

result:  (array([ 14.,  26.,  21.,  24.,  15.]), [array([ 0. ,  0.2,  0.4,  0.6,  0.8,  1. ])]) 


shape of data:  (100, 2)
bin borders:  [array([ 0. ,  0.2,  0.4,  0.6,  0.8,  1. ]), array([ 0. ,  0.2,  0.4,  0.6,  0.8,  1. ])]

result:  (array([[  5.,   7.,   5.,   2.,   3.],
       [  5.,   4.,   5.,   3.,   1.],
       [  5.,   3.,   7.,   1.,   3.],
       [  2.,   6.,   4.,   3.,   7.],
       [  1.,  11.,   3.,   2.,   2.]]), [array([ 0. ,  0.2,  0.4,  0.6,  0.8,  1. ]), array([ 0. ,  0.2,  0.4,  0.6,  0.8,  1. ])]) 


shape of data:  (100, 3)
bin borders:  [array([ 0. ,  0.2,  0.4,  0.6,  0.8,  1. ]), array([ 0. ,  0.2,  0.4,  0.6,  0.8,  1. ]), array([ 0. ,  0.2,  0.4,  0.6,  0.8,  1. ])]

result:  (array([[[ 1.,  0.,  0.,  0.,  2.],
        [ 0.,  1.,  1.,  1.,  0.],
        [ 0.,  1.,  1.,  2.,  1.],
        [ 2.,  2.,  0.,  2.,  0.],
        [ 1.,  1.,  1.,  2.,  1.]],

       [[ 2.,  0.,  1.,  1.,  1.],
        [ 0.,  0.,  0.,  1.,  0.],
        [ 1.,  2.,  2.,  0.,  1.],
        [ 0.,  1.,  1.,  2.,  0.],
        [ 0.,  0.,  1.,  1.,  0.]],

       [[ 1.,  0.,  0.,  0.,  1.],
        [ 1.,  0.,  2.,  0.,  4.],
        [ 0.,  1.,  0.,  1.,  1.],
        [ 2.,  0.,  0.,  0.,  0.],
        [ 1.,  1.,  0.,  1.,  0.]],

       [[ 1.,  2.,  1.,  1.,  0.],
        [ 0.,  1.,  1.,  0.,  2.],
        [ 2.,  1.,  1.,  0.,  1.],
        [ 2.,  0.,  1.,  1.,  0.],
        [ 0.,  2.,  0.,  2.,  1.]],

       [[ 1.,  3.,  0.,  1.,  0.],
        [ 1.,  1.,  0.,  0.,  0.],
        [ 1.,  1.,  0.,  0.,  0.],
        [ 1.,  1.,  2.,  1.,  1.],
        [ 1.,  1.,  1.,  0.,  1.]]]), [array([ 0. ,  0.2,  0.4,  0.6,  0.8,  1. ]), array([ 0. ,  0.2,  0.4,  0.6,  0.8,  1. ]), array([ 0. ,  0.2,  0.4,  0.6,  0.8,  1. ])])
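
If the data already live in a DataFrame, the same call can be applied to its values. The following is a minimal sketch, not part of the original answer: the column names, the seeded generator and the variable names are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical two-column DataFrame of uniform random values in [0, 1).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((100, 2)), columns=['x', 'y'])

# One array of bin edges per column: five equal-width bins on [0, 1].
edges = [np.linspace(0, 1, 6) for _ in df.columns]

# histogramdd expects an (n_rows, n_dims) array, which to_numpy() provides.
counts, returned_edges = np.histogramdd(df.to_numpy(), bins=edges)

print(counts.shape)       # one axis per column
print(int(counts.sum()))  # total number of rows that fell inside the grid
```

Note that histogramdd bins on a regular grid formed from the per-axis edges, so it cannot take an arbitrary list of boxes like the bins array in the question.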

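【Comments】:

Thanks. I am familiar with numpy's own histogram routines and their limitations; in fact I am hoping to "replace" them with a more general formulation. As mentioned, I have now partially solved the problem, so I will post that as an answer; but it is not vectorised and does not seem optimal, so I will wait for a better approach, hopefully one that takes some inspiration from my answer.

【Solution 2】:

As I mentioned in my comment on SpghttCd's answer, I have found a working approach that uses a list comprehension rather than slicing when populating the histogram. It appears to count the number of records in each bin accurately (tested in 1D and 2D), but it is not elegant, and I would appreciate improvements from those more familiar with the pandas library. It may look a little dodgy because of the integer rounding.

I present the code below, with the example above extended to two dimensions.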

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as patches

## Generate Random Data
X = np.random.normal(0.5,0.1,150)
Y = np.random.normal(0.5,0.2,150)

## Populate a Pandas DataFrame
DF = pd.DataFrame({'x':X,'y':Y})

## Some example, hardcoded 2D bins. 
bins = np.array([
            [[0.0,0.2],[0.0,1.5]],
            [[0.2,0.4],[0.0,1.5]],
            [[0.4,0.6],[0.0,1.5]],
            [[0.6,0.8],[0.0,1.5]],
            [[0.8,1.0],[0.0,1.5]]
            ])


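# For each bin: count the rows inside it column by column, multiply the
# per-column counts together and divide by the number of rows.
# NOTE: in this 2D example the result is guaranteed to equal the true count
# only when the y interval [0.0, 1.5) contains every row; otherwise the
# value is fractional and the int32 cast truncates it.
hist = np.array([
                ((DF >= bins[i,:,0]) &
                 (DF <  bins[i,:,1])).sum(axis=0).prod()/len(DF)
                 for i in range(len(bins)) ], dtype=np.int32)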


print(hist)    
print(sum(hist))

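## 2D Plot
# The 'seaborn' style was renamed 'seaborn-v0_8' in Matplotlib 3.6.
plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')
fig, axes = plt.subplots(figsize=(4, 3.5))
plt.scatter(DF['x'],DF['y'], 5, 'k')
axes.set_xlabel('x')
axes.set_ylabel('y')
axes.set_xlim(-0.5,1.5)
axes.set_ylim(-0.5,2)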

# Create a Rectangle patch for each bin and plot
for i,bin in enumerate(bins):

    rect = patches.Rectangle(   (bin[0][0],bin[1][0]),
                                bin[0][1]-bin[0][0],
                                bin[1][1]-bin[1][0],
                                linewidth=1,
                                edgecolor='r',facecolor='none')
    # Add the patch to the Axes
    axes.add_patch(rect)

plt.show()
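
The list comprehension above multiplies per-column counts, which is what makes the rounding look dodgy. One possible alternative, offered only as a sketch, tests each row against each bin across all columns at once using NumPy broadcasting on the DataFrame's values. The helper name `count_in_bins` and the small data set are hypothetical, and this steps outside "pandas only" by working on the underlying array.

```python
import numpy as np
import pandas as pd

def count_in_bins(df, bins):
    """Count the rows of df inside each box of bins, shape (n_bins, n_dims, 2).

    Hypothetical helper: intervals are half-open, [min, max), per dimension.
    """
    values = df.to_numpy()[:, None, :]   # (n_rows, 1, n_dims)
    lower = bins[None, :, :, 0]          # (1, n_bins, n_dims)
    upper = bins[None, :, :, 1]          # (1, n_bins, n_dims)
    # Broadcasting compares every row with every bin in every dimension.
    inside = (values >= lower) & (values < upper)
    # A row is in a bin only if it is inside along all dimensions.
    return inside.all(axis=2).sum(axis=0)  # (n_bins,)

# Hand-made data so the counts can be checked by eye; the fourth row has
# y = 1.6, which lies outside every bin's y interval.
df = pd.DataFrame({'x': [0.1, 0.3, 0.5, 0.5, 0.9],
                   'y': [0.2, 0.9, 0.4, 1.6, 0.1]})
bins = np.array([
    [[0.0, 0.2], [0.0, 1.5]],
    [[0.2, 0.4], [0.0, 1.5]],
    [[0.4, 0.6], [0.0, 1.5]],
    [[0.6, 0.8], [0.0, 1.5]],
    [[0.8, 1.0], [0.0, 1.5]],
])

hist = count_in_bins(df, bins)
print(hist)
```

The intermediate boolean array has n_rows x n_bins x n_dims elements, so for very large inputs it may be necessary to process the bins in chunks.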

This is part of a personal project to re-invent N-dimensional histograms in Python, following the discussion in a SciComp question.

【Comments】: