Numpy: Finding count of distinct values from associations through binning

【Question title】: Numpy: Finding count of distinct values from associations through binning
【Posted】: 2018-11-29 00:24:16
【Question description】:

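Prerequisite

This question is an extension of an earlier post, so parts of the problem introduction are similar to that post.

Problem

Suppose result is a 2D array and values is a 1D array. values holds some values associated with each element of result. The mapping from elements of values to result is stored in x_mapping and y_mapping. A position in result can be associated with several different values. Each (x, y) pair from x_mapping and y_mapping is associated with result[-y, x]. I have to find the count of distinct values, grouped by association.

An example makes this clearer.

The result array: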

[[ 0.,  0.],
[ 0.,  0.],
[ 0.,  0.],
[ 0.,  0.]]

The values array:

[ 1.,  2.,  1.,  1.,  5.,  6.,  7.,  1.]

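Note: Here the result array and values happen to have the same number of elements, but that need not be the case. There is no relation between the two sizes at all.

x_mapping and y_mapping hold the mapping from the 1D values to the 2D result. x_mapping, y_mapping and values all have the same size.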

x_mapping - [0, 1, 0, 0, 0, 0, 0, 0]

y_mapping - [0, 3, 2, 2, 0, 3, 2, 0]

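Here, the 1st value (values[0]), the 5th value (values[4]) and the 8th value (values[7]) all have x = 0 and y = 0 (their entries in x_mapping and y_mapping), so they are associated with result[0, 0]. Counting the distinct values in this group, (1, 5, 1), gives 2. @WarrenWeckesser Now let's see how the (x, y) pair [1, 3] from x_mapping and y_mapping affects result. Only one value, 2, is associated with that group, so result[-3, 1] is 1, because the number of distinct values associated with that cell is 1.

Another example: let's compute result[-1, 1]. The mappings contain no value associated with that cell, so result[-1, 1] is zero.

Similarly, position [-2, 0] of result has the value 2.

Note that if a cell has no association at all, its value in result defaults to zero.

The result after computation: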

[[ 2.,  0.],
[ 1.,  1.],
[ 2.,  0.],
[ 0.,  0.]]
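
For checking faster implementations against the rule described above, a deliberately slow reference can help. The following is only a sketch: the function name distinct_counts_reference is made up for illustration, and it assumes every y satisfies 0 <= y < number of rows, so that result[-y, x] and result[(-y) % rows, x] address the same cell.

```python
import numpy as np
from collections import defaultdict

x_mapping = np.array([0, 1, 0, 0, 0, 0, 0, 0])
y_mapping = np.array([0, 3, 2, 2, 0, 3, 2, 0])
values = np.array([1., 2., 1., 1., 5., 6., 7., 1.], dtype=np.float32)
result = np.zeros([4, 2], dtype=np.float32)

def distinct_counts_reference(x_mapping, y_mapping, values, shape):
    """Hypothetical helper: slow but obvious, one Python set per cell."""
    rows = shape[0]
    groups = defaultdict(set)
    for x, y, v in zip(x_mapping.tolist(), y_mapping.tolist(), values.tolist()):
        # (x, y) is associated with result[-y, x]; make the row non-negative
        groups[((-y) % rows, x)].add(v)
    out = np.zeros(shape, dtype=np.float32)  # cells without association stay 0
    for (row, col), distinct in groups.items():
        out[row, col] = len(distinct)
    return out

expected = distinct_counts_reference(x_mapping, y_mapping, values, result.shape)
print(expected)
# [[2. 0.]
#  [1. 1.]
#  [2. 0.]
#  [0. 0.]]
```

This is far too slow for large inputs, but it is useful as a ground truth when comparing vectorized versions.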

Current working solution

Using the answer from @Divakar, I was able to find a working solution.

import numpy as np

x_mapping = np.array([0, 1, 0, 0, 0, 0, 0, 0])
y_mapping = np.array([0, 3, 2, 2, 0, 3, 2, 0])
values = np.array([ 1.,  2.,  1.,  1.,  5.,  6.,  7.,  1.], dtype=np.float32)
result = np.zeros([4, 2], dtype=np.float32)

m,n = result.shape
out_dtype = result.dtype
# Flat (row-major) index of result[-y, x] for every value
lidx = ((-y_mapping)%m)*n + x_mapping

# Sort so that values belonging to the same cell are contiguous
sidx = lidx.argsort()
idx = lidx[sidx]
val = values[sidx]

# Start position of each run of equal flat indices
m_idx = np.flatnonzero(np.r_[True,idx[:-1] != idx[1:]])
unq_ids = idx[m_idx]

# Count the distinct values in each run, one run at a time
r_res = np.zeros(m_idx.size, dtype=np.float32)
for i in range(0, m_idx.shape[0]):
    _next = None
    arr = None
    if i == m_idx.shape[0]-1:
        _next = val.shape[0]
    else:
        _next = m_idx[i+1]
    _start = m_idx[i]

    if _start >= _next:
        arr = val[_start]
    else:
        arr = val[_start:_next]
    r_res[i] = np.unique(arr).size
result.flat[unq_ids] = r_res

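Question

Currently, the above solution takes 15 ms to operate on 19943 values. I am looking for a way to compute the result faster. Is there a more performant way to do this?

Side note

I am using Numpy version 1.14.3 with Python 3.5.2.

Edit

Thanks to @WarrenWeckesser for pointing out that I had not explained how an element of result is associated with an (x, y) pair from the mappings. I have updated the post and added examples for clarity.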
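
To make the result[-y, x] association concrete, here is a small sketch of how the negative row index translates into a flat index. It assumes 0 <= y < m for every y, so Python's negative indexing and the modulo agree; the variable names rows, flat_manual and flat_wrap are only for illustration.

```python
import numpy as np

result_shape = (4, 2)
m, n = result_shape
x_mapping = np.array([0, 1, 0, 0, 0, 0, 0, 0])
y_mapping = np.array([0, 3, 2, 2, 0, 3, 2, 0])

# Row actually addressed by result[-y, x]: negative indices count from the end
rows = (-y_mapping) % m
print(rows)         # [0 1 2 2 0 1 2 0]

# Flat (row-major) index computed by hand
flat_manual = rows * n + x_mapping
print(flat_manual)  # [0 3 4 4 0 2 4 0]

# The same flat index via ravel_multi_index, letting mode='wrap' handle -y
flat_wrap = np.ravel_multi_index((-y_mapping, x_mapping), result_shape, mode='wrap')
print(np.array_equal(flat_manual, flat_wrap))  # True
```

For example, the pair (x, y) = (1, 3) lands in row (-3) % 4 = 1, column 1, which is flat index 3.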

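【Question discussion】:

I cannot reconcile your description of how result[0,0] is computed with the rest of the values in result (produced by what you say is working code). For example, in the x_mapping and y_mapping arrays, the (x, y) pair [1, 3] occurs once. My understanding is that these are the column and row indices of result. So why isn't result[3, 1] equal to 1? In the computed result, you have result[1, 0] = 1 and result[1, 1] = 1, but neither of the (x, y) pairs [0, 1] and [1, 1] occurs in the mapping arrays.
@WarrenWeckesser, thanks for pointing that out. I apologize for not adding details about how (x, y) pairs are associated with elements of result. Each (x, y) pair is associated with result[-y, x]. I have updated the post and added examples for clarity. Thanks.

Tags: python arrays numpy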

【Solution 1】:

Here is one solution:

import numpy as np

x_mapping = np.array([0, 1, 0, 0, 0, 0, 0, 0])
y_mapping = np.array([0, 3, 2, 2, 0, 3, 2, 0])
values = np.array([ 1.,  2.,  1.,  1.,  5.,  6.,  7.,  1.], dtype=np.float32)
result = np.zeros([4, 2], dtype=np.float32)

# Get flat indices
idx_mapping = np.ravel_multi_index((-y_mapping, x_mapping), result.shape, mode='wrap')
# Sort flat indices and reorder values accordingly
reorder = np.argsort(idx_mapping)
idx_mapping = idx_mapping[reorder]
values = values[reorder]
# Get unique values
val_uniq = np.unique(values)
# Find where each unique value appears
val_uniq_hit = values[:, np.newaxis] == val_uniq
# Find reduction indices (slices with the same flat index)
reduce_idx = np.concatenate([[0], np.nonzero(np.diff(idx_mapping))[0] + 1])
# Reduce slices
reduced = np.logical_or.reduceat(val_uniq_hit, reduce_idx)
# Count distinct values on each slice
counts = np.count_nonzero(reduced, axis=1)
# Put counts in result
result.flat[idx_mapping[reduce_idx]] = counts

print(result)
# [[2. 0.]
#  [1. 1.]
#  [2. 0.]
#  [0. 0.]]
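
The key step is the np.logical_or.reduceat call, so it may help to look at it in isolation. The toy arrays below (hits, starts) are invented for illustration and are not taken from the example data; they only mimic the shape of val_uniq_hit and reduce_idx.

```python
import numpy as np

# One row per (sorted) sample, one column per unique value:
# hits[i, j] is True when sample i equals unique value j.
hits = np.array([
    [True,  False, False],   # cell A, value u0
    [False, True,  False],   # cell A, value u1
    [True,  False, False],   # cell A, value u0 again
    [False, False, True],    # cell B, value u2
])
starts = np.array([0, 3])    # cell A is rows 0..2, cell B is row 3

# OR the rows of each slice together: True where a value occurs at least once
present = np.logical_or.reduceat(hits, starts)
# Number of distinct values per cell
counts = np.count_nonzero(present, axis=1)
print(counts)  # [2 1]
```

Repeated values within a cell collapse into a single True, which is why counting the True entries per row gives the number of distinct values.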

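This broadcast-and-reduceat approach uses more memory (O(len(values) * len(np.unique(values)))), but a small benchmark shows a significant speedup compared to your original solution (although that depends on the actual size of the problem):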

import numpy as np

np.random.seed(100)
result = np.zeros([400, 200], dtype=np.float32)
values = np.random.randint(100, size=(20000,)).astype(np.float32)
x_mapping = np.random.randint(result.shape[1], size=values.shape)
y_mapping = np.random.randint(result.shape[0], size=values.shape)

res1 = solution_orig(x_mapping, y_mapping, values, result)
res2 = solution(x_mapping, y_mapping, values, result)
print(np.allclose(res1, res2))
# True

# Original solution
%timeit solution_orig(x_mapping, y_mapping, values, result)
# 76.2 ms ± 623 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

# This solution
%timeit solution(x_mapping, y_mapping, values, result)
# 13.8 ms ± 51.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Full code of the benchmarked functions:

import numpy as np

def solution(x_mapping, y_mapping, values, result):
    result = np.array(result)
    idx_mapping = np.ravel_multi_index((-y_mapping, x_mapping), result.shape, mode='wrap')
    reorder = np.argsort(idx_mapping)
    idx_mapping = idx_mapping[reorder]
    values = values[reorder]
    val_uniq = np.unique(values)
    val_uniq_hit = values[:, np.newaxis] == val_uniq
    reduce_idx = np.concatenate([[0], np.nonzero(np.diff(idx_mapping))[0] + 1])
    reduced = np.logical_or.reduceat(val_uniq_hit, reduce_idx)
    counts = np.count_nonzero(reduced, axis=1)
    result.flat[idx_mapping[reduce_idx]] = counts
    return result

def solution_orig(x_mapping, y_mapping, values, result):
    result = np.array(result)
    m,n = result.shape
    out_dtype = result.dtype
    lidx = ((-y_mapping)%m)*n + x_mapping

    sidx = lidx.argsort()
    idx = lidx[sidx]
    val = values[sidx]

    m_idx = np.flatnonzero(np.r_[True,idx[:-1] != idx[1:]])
    unq_ids = idx[m_idx]

    r_res = np.zeros(m_idx.size, dtype=np.float32)
    for i in range(0, m_idx.shape[0]):
        _next = None
        arr = None
        if i == m_idx.shape[0]-1:
            _next = val.shape[0]
        else:
            _next = m_idx[i+1]
        _start = m_idx[i]

        if _start >= _next:
            arr = val[_start]
        else:
            arr = val[_start:_next]
        r_res[i] = np.unique(arr).size
    result.flat[unq_ids] = r_res
    return result
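
If the O(len(values) * len(np.unique(values))) boolean matrix becomes too large, one possible alternative is to sort by (cell, value) and count the first occurrence of each pair. This is not part of the answer above and has not been benchmarked here; solution_lexsort is a hypothetical name, and the sketch assumes values contains no NaN (NaNs would each be counted as distinct).

```python
import numpy as np

def solution_lexsort(x_mapping, y_mapping, values, result):
    """Hypothetical variant using O(len(values)) extra memory."""
    result = np.array(result)
    idx = np.ravel_multi_index((-y_mapping, x_mapping), result.shape, mode='wrap')
    if idx.size == 0:
        return result
    # lexsort uses the LAST key as primary: sort by cell, then by value
    order = np.lexsort((values, idx))
    idx = idx[order]
    val = values[order]
    # A sample opens a new (cell, value) pair if either key changed
    new_pair = np.r_[True, (idx[1:] != idx[:-1]) | (val[1:] != val[:-1])]
    # Distinct pairs per cell == distinct values per cell
    counts = np.bincount(idx[new_pair], minlength=result.size)
    cells = np.flatnonzero(counts)
    result.flat[cells] = counts[cells]  # untouched cells keep their input value
    return result

x_mapping = np.array([0, 1, 0, 0, 0, 0, 0, 0])
y_mapping = np.array([0, 3, 2, 2, 0, 3, 2, 0])
values = np.array([1., 2., 1., 1., 5., 6., 7., 1.], dtype=np.float32)
result = np.zeros([4, 2], dtype=np.float32)

res3 = solution_lexsort(x_mapping, y_mapping, values, result)
print(res3)
# [[2. 0.]
#  [1. 1.]
#  [2. 0.]
#  [0. 0.]]
```

Whether this is faster than the reduceat version likely depends on the number of distinct values, so it is worth timing both on the real data.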

【Discussion】:

Thanks for your answer. I modified my existing solution using your np.logical_or.reduceat logic, and it is faster. Thanks.