Sum array by number in numpy: answers

【Question title】: Sum array by number in numpy
【Posted】: 2010-12-07 05:16:43
【Question description】:

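Assuming I have a numpy array like [1,2,3,4,5,6] and another array [0,0,1,2,2,1], I want to sum the items in the first array by group (the second array) and obtain n group results in group-number order (in this case the result would be [3, 9, 9]). How do I do this in numpy?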

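【Question discussion】:

Why do you need numpy for this? Aren't you just using vanilla Python lists? If not, what numpy type are you using?
I need numpy for this because I don't want to loop through the array n times for n groups, since my array sizes can be arbitrarily large. I'm not using Python lists; I was just showing an example data set in brackets. The data type is int.
Related: stackoverflow.com/questions/7089379/…

Tags: python numpy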

【Solution 1】:

The numpy function bincount was made exactly for this purpose, and I believe it will be much faster than the other methods for all sizes of input:

import numpy as np

data = [1,2,3,4,5,6]
ids  = [0,0,1,2,2,1]

np.bincount(ids, weights=data) #returns [3,9,9] as a float64 array

The i-th element of the output is the sum of all the data elements corresponding to "id" i.
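To make the "i-th element" statement concrete, here is a small sketch that checks np.bincount against an explicit accumulation loop. The names `expected`, `sums` and `padded` are only for illustration; `minlength` is an existing bincount parameter that pads the result when the highest ids have no data.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6])
ids = np.array([0, 0, 1, 2, 2, 1])

# Reference result: add each element into the slot named by its id.
expected = np.zeros(ids.max() + 1)
for i, value in zip(ids, data):
    expected[i] += value

sums = np.bincount(ids, weights=data)
assert np.array_equal(sums, expected)  # both are [3., 9., 9.]

# minlength pads the output when the highest ids do not occur in `ids`.
padded = np.bincount(ids, weights=data, minlength=5)  # [3., 9., 9., 0., 0.]
```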

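Hope that helps.

【Discussion】:

Can confirm that this is very fast. About 10x faster than the sum_by_group method provided by Bi Rico on small inputs.
What if data is made up of vectors?
It looks like the weights parameter must be 1-D. One solution is to run bincount once for each dimension of the vectors (i.e. twice if the data is a set of 2-D vectors). A slight modification of Peter's answer should also work.
Nice approach. Note that bincount requires int ids.
This makes the most sense when not every id you want in the output actually has to occur in the input.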
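The "one bincount per dimension" suggestion from the comments above could look roughly like this. It is only a sketch: the `vectors` array and the name `n_groups` are made up for the example. Since bincount rejects float ids, whole-number ids stored as floats would need something like `ids.astype(int)` first.

```python
import numpy as np

ids = np.array([0, 0, 1, 2, 2, 1])
# Hypothetical 2-D data: one row per item, one column per vector component.
vectors = np.array([[1, 10], [2, 20], [3, 30], [4, 40], [5, 50], [6, 60]])

n_groups = ids.max() + 1
# One bincount call per column, stacked back into shape (n_groups, 2).
sums = np.column_stack([
    np.bincount(ids, weights=vectors[:, k], minlength=n_groups)
    for k in range(vectors.shape[1])
])
# sums is [[3., 30.], [9., 90.], [9., 90.]]
```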

【Solution 2】:

I tried different methods to do this, and I found that using np.bincount is indeed the fastest. See Alex's answer.

    import numpy as np
    import time
    
    size = 10000
    ngroups = 10
    
    groups = np.random.randint(low=0,high=ngroups,size=size)
    values = np.random.rand(size)
    
    
    # Test 1
    beg = time.time()
    result = np.zeros(ngroups)
    for i in range(size):
        result[groups[i]] += values[i]
    print('Test 1 took:',time.time()-beg)
    
    # Test 2
    beg = time.time()
    result = np.zeros(ngroups)
    for g,v in zip(groups,values):
        result[g] += v
    print('Test 2 took:',time.time()-beg)
    
    # Test 3
    beg = time.time()
    result = np.zeros(ngroups)
    for g in np.unique(groups):
        wh = np.where(groups == g)
        result[g] = np.sum(values[wh[0]])
    print('Test 3 took:',time.time()-beg)
    
    
    # Test 4
    beg = time.time()
    result = np.zeros(ngroups)
    for g in np.unique(groups):
        wh = groups == g
        result[g] = np.sum(values, where = wh)
    print('Test 4 took:',time.time()-beg)
    
    # Test 5
    beg = time.time()
    result = np.array([np.sum(values[np.where(groups == g)[0]]) for g in np.unique(groups) ])
    print('Test 5 took:',time.time()-beg)
    
    # Test 6
    beg = time.time()
    result = np.array([np.sum(values, where = groups == g) for g in np.unique(groups) ])
    print('Test 6 took:',time.time()-beg)
    
    # Test 7
    beg = time.time()
    result = np.bincount(groups, weights = values)
    print('Test 7 took:',time.time()-beg)

Results:

    Test 1 took: 0.005615234375
    Test 2 took: 0.004812002182006836
    Test 3 took: 0.0006084442138671875
    Test 4 took: 0.0005099773406982422
    Test 5 took: 0.000499725341796875
    Test 6 took: 0.0004980564117431641
    Test 7 took: 1.9073486328125e-05
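Single time.time() measurements like the ones above can be noisy, especially for the sub-millisecond cases. As a rough sketch (not the original benchmark), the same comparison can be repeated with timeit, which also gives a chance to check that the approaches agree. The function names and the fixed seed are arbitrary choices made for this example.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
size, ngroups = 10_000, 10
groups = rng.integers(0, ngroups, size=size)
values = rng.random(size)

def loop_sum():
    # Pure Python loop over every element (like Test 2).
    result = np.zeros(ngroups)
    for g, v in zip(groups, values):
        result[g] += v
    return result

def mask_sum():
    # One boolean mask per group (similar in spirit to Tests 3 and 5).
    return np.array([values[groups == g].sum() for g in range(ngroups)])

def bincount_sum():
    # Single vectorized call (like Test 7).
    return np.bincount(groups, weights=values, minlength=ngroups)

# All three approaches should agree up to floating-point rounding.
assert np.allclose(loop_sum(), bincount_sum())
assert np.allclose(mask_sum(), bincount_sum())

for fn in (loop_sum, mask_sum, bincount_sum):
    best = min(timeit.repeat(fn, number=10, repeat=3)) / 10
    print(f"{fn.__name__}: {best:.2e} s per call")
```

Absolute timings depend on the machine and the numpy version, so only the relative ordering is meaningful.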

【Discussion】:

【Solution 3】:

There's more than one way to do this, but here's one way:

import numpy as np
data = np.arange(1, 7)
groups = np.array([0,0,1,2,2,1])

unique_groups = np.unique(groups)
sums = []
for group in unique_groups:
    sums.append(data[groups == group].sum())

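You can vectorize things so that there's no for loop at all, but I'd recommend against it. It becomes unreadable, and it requires a couple of 2D temporary arrays, which could need large amounts of memory if you have a lot of data.

Edit: Here's one way to vectorize it entirely. Keep in mind that this may (and likely will) be slower than the version above. (There may also be a better way to vectorize this, but it's late and I'm tired, so this is just the first thing that popped into my head...)

However, keep in mind that this is a bad example... You're really better off (both in speed and in readability) with the loop above...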

import numpy as np
data = np.arange(1, 7)
groups = np.array([0,0,1,2,2,1])

unique_groups = np.unique(groups)

# Forgive the bad naming here...
# I can't think of more descriptive variable names at the moment...
x, y = np.meshgrid(groups, unique_groups)
data_stack = np.tile(data, (unique_groups.size, 1))

data_in_group = np.zeros_like(data_stack)
data_in_group[x==y] = data_stack[x==y]

sums = data_in_group.sum(axis=1)

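【Discussion】:

Thanks! Memory is not a problem and I would like to avoid loops. How would you vectorize it?
@Scribble Master - See the edit... There's nothing wrong with looping over the unique groups, though. The second version is likely to be slow and is much harder to read. With the loop, you only loop (in Python, anyway) over the number of unique groups. The inner comparison data[groups == group] will be quite fast.
What kind of black magic is the data[groups == group] construct? Comparing an array to a scalar yields some kind of slice or view? o_O
@Karl - groups == group yields a boolean array. You can index by arrays in numpy. It's a very common idiom in numpy (and Matlab). I find it very readable (think of it as "where") and quite useful.
@Joe: Very nice, but perhaps a bit too magical for me. I haven't done much with Numpy (I haven't needed it as much as I thought I would yet); it will take some getting used to.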
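For anyone else puzzled by data[groups == group], here is a small step-by-step sketch of what the comment above describes. The names `mask`, `selected` and `total` are just for illustration.

```python
import numpy as np

data = np.arange(1, 7)                   # [1 2 3 4 5 6]
groups = np.array([0, 0, 1, 2, 2, 1])

mask = groups == 1        # elementwise comparison gives a boolean array
# mask is [False, False, True, False, False, True]
selected = data[mask]     # keeps the elements where mask is True: [3, 6]
total = selected.sum()    # 9

# Boolean indexing returns a copy, not a view:
# editing `selected` leaves `data` untouched.
selected[0] = 100
```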

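【Solution 4】:

Here is a vectorized way of doing this sum, based on the implementation of numpy.unique. According to my timings it is 500 times faster than the loop method and 100 times faster than the histogram method.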

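import numpy as np

def sum_by_group(values, groups):
    # values and groups must be 1-D numpy arrays of equal length
    order = np.argsort(groups)
    groups = groups[order]
    values = values[order]          # fancy indexing copies, so the caller's array is untouched
    values.cumsum(out=values)       # running total over the sorted values
    index = np.ones(len(groups), 'bool')
    index[:-1] = groups[1:] != groups[:-1]   # True at the last element of each group
    values = values[index]          # running total at the end of each group
    groups = groups[index]
    values[1:] = values[1:] - values[:-1]    # difference of totals = per-group sum
    return values, groups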
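The same sort-then-segment idea can also be written without the cumulative sum. The sketch below is not the answer's code: it swaps in np.add.reduceat, and the function name `sum_by_group_reduceat` is made up for this example. It assumes non-empty 1-D numpy arrays of equal length.

```python
import numpy as np

def sum_by_group_reduceat(values, groups):
    """Hypothetical variant: sort by group, then sum each contiguous run."""
    order = np.argsort(groups, kind="stable")
    groups = groups[order]
    values = values[order]
    # True at the first element of every run of equal group labels.
    starts = np.ones(len(groups), dtype=bool)
    starts[1:] = groups[1:] != groups[:-1]
    # reduceat sums values[start_i:start_next] for consecutive start indices.
    sums = np.add.reduceat(values, np.flatnonzero(starts))
    return sums, groups[starts]

values = np.array([1, 2, 3, 4, 5, 6])
groups = np.array([0, 0, 1, 2, 2, 1])
sums, labels = sum_by_group_reduceat(values, groups)
# sums is [3, 9, 9] and labels is [0, 1, 2]
```

Summing each run directly avoids subtracting large running totals, which may matter for integer overflow or floating-point rounding on long inputs.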

【Discussion】:

【Solution 5】:

You're all wrong! The best way to do it is:

import numpy as np

a = [1,2,3,4,5,6]
ix = [0,0,1,2,2,1]
accum = np.zeros(np.max(ix)+1)
np.add.at(accum, ix, a)   # unbuffered add: repeated indices accumulate
print(accum)
# [3. 9. 9.]

【Discussion】:

Actually, you should just use Alex's np.bincount answer.
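A short sketch of why np.add.at is needed here rather than a plain indexed +=: with repeated indices, the ordinary in-place add is buffered and only the last write per index survives. The name `buffered` is just for illustration.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6])
ix = np.array([0, 0, 1, 2, 2, 1])

# Buffered: for a repeated index only the last write survives.
buffered = np.zeros(ix.max() + 1)
buffered[ix] += a            # gives [2., 6., 5.], not the group sums

# Unbuffered: every occurrence of an index is accumulated.
accum = np.zeros(ix.max() + 1)
np.add.at(accum, ix, a)      # gives [3., 9., 9.]
```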

【Solution 6】:

I tried everyone's scripts, and my observations are:

User	Comment
Joe	Will only work if you have few groups.
kevpie	Too slow because of loops (this is not the pythonic way).
Bi_Rico and Sven	Nice performance, but will only work for Int32 (if the sum goes over 2^32/2 it will fail).
Alex	Is the fastest one, the best solution for sum.

But if you want more flexibility and the possibility of grouping by other statistics, use SciPy:

import numpy as np
from scipy import ndimage

data = np.arange(10000000)
unique_groups = np.arange(1000)
groups = unique_groups.repeat(10000)

ndimage.sum(data, groups, unique_groups)

This is nice because you have many statistics to group by (sum, mean, variance, ...).
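As a small sketch of those other statistics, the same labels/index pattern works with scipy.ndimage.mean and scipy.ndimage.variance. The tiny arrays and the name `labels` are just for illustration; as far as I can tell the variance here is the population variance (divide by n), so check the SciPy docs if you need the sample variance.

```python
import numpy as np
from scipy import ndimage

data = np.arange(1, 7)
groups = np.array([0, 0, 1, 2, 2, 1])
labels = np.unique(groups)          # which group ids to report, in order

sums = ndimage.sum(data, groups, labels)            # 3, 9, 9
means = ndimage.mean(data, groups, labels)          # 1.5, 4.5, 4.5
variances = ndimage.variance(data, groups, labels)  # 0.25, 2.25, 0.25
```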

【Discussion】:

This solution is neat.

【Solution 7】:

I noticed the numpy tag, but if you don't mind using pandas, this task becomes a one-liner:

import pandas as pd
import numpy as np

data = np.arange(1, 7)
groups = np.array([0, 0, 1, 2, 2, 1])

df = pd.DataFrame({'data': data, 'groups': groups})

So df looks like this:

   data  groups
0     1       0
1     2       0
2     3       1
3     4       2
4     5       2
5     6       1

Now you can use the functions groupby() and sum():

print(df.groupby(['groups'], sort=False).sum())

which gives you the desired output:

        data
groups      
0          3
1          9
2          9

By default, groupby() sorts the group keys, so I use the flag sort=False, which might improve speed for huge dataframes.
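If you want the result back as plain numpy arrays rather than a DataFrame, something like the following sketch should work. It groups a Series directly by the `groups` array instead of building a DataFrame first; the names `result` and `group_ids` are just for illustration.

```python
import numpy as np
import pandas as pd

data = np.arange(1, 7)
groups = np.array([0, 0, 1, 2, 2, 1])

# Group a Series by an array of the same length; keys are sorted by default.
sums = pd.Series(data).groupby(groups).sum()
result = sums.to_numpy()              # array([3, 9, 9])
group_ids = sums.index.to_numpy()     # array([0, 1, 2])
```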

【Discussion】:

【Solution 8】:

If the groups are indexed by consecutive integers, you can abuse the numpy.histogram() function to get the result:

import numpy

data = numpy.arange(1, 7)
groups = numpy.array([0,0,1,2,2,1])
sums = numpy.histogram(groups, 
                       bins=numpy.arange(groups.min(), groups.max()+2), 
                       weights=data)[0]
# array([3, 9, 9])

This avoids any Python loops.

【Discussion】:

【Solution 9】:

Pure Python implementation:

l = [1,2,3,4,5,6]
g = [0,0,1,2,2,1]

from collections import defaultdict

def group_sum(l, g):
    groups = defaultdict(int)
    for li, gi in zip(l, g):
        groups[gi] += li
    # return the sums ordered by group key
    return [total for _, total in sorted(groups.items())]

print(group_sum(l, g))

[3, 9, 9]

【Discussion】:

【Solution 10】:

Also, a note on Alex's answer:

data = [1,2,3,4,5,6]
ids  = [0,0,1,2,2,1]
np.bincount(ids, weights=data) #returns [3,9,9] as a float64 array

If your indices are not consecutive, you might get stuck wondering why you keep getting a lot of zeros.

For example:

data = [1,2,3,4,5,6]
ids  = [1,1,3,5,5,3]
np.bincount(ids, weights=data)

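will give you:

array([0., 3., 0., 9., 0., 9.])

This means that bincount builds a bin for every integer from 0 up to the maximum id in the list, and then returns the sum for each bin. Ids that never occur get a sum of 0.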
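If you want one sum per id that actually occurs, without the zero-filled gaps, one option is to remap the ids to 0..n-1 first. This is a sketch using np.unique with return_inverse=True; the names `unique_ids` and `compact` are just for illustration.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6])
ids = np.array([1, 1, 3, 5, 5, 3])

# unique_ids is [1, 3, 5]; compact is [0, 0, 1, 2, 2, 1]
unique_ids, compact = np.unique(ids, return_inverse=True)
sums = np.bincount(compact, weights=data)   # [3., 9., 9.], one entry per id present
```

Here sums[k] is the total for unique_ids[k].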

【Discussion】:

【Solution 11】:

Here's a method for summing objects of any dimension, grouped by values of any type (not only int):

grouping = np.array([1.1, 10, 1.1, 15])
to_sum = np.array([
    [1, 0],
    [0, 1],
    [0.5, 0.3],
    [2, 5],
])

groups, element_group_ixs = np.unique(grouping, return_inverse=True)
accum = np.zeros((groups.shape[0], *to_sum.shape[1:]))
np.add.at(accum, element_group_ixs, to_sum)

Result:

groups = array([ 1.1, 10. , 15. ])
accum = array([
    [1.5, 0.3],
    [0. , 1. ],
    [2. , 5. ]
])

(The np.add.at idea is taken from Peter's answer.)

【Discussion】: