NumPy: Create a dict() by grouping values in a column by another column's values

【Question title】: NumPy: Create a dict() by grouping values in a column by another column values
【Posted】: 2017-09-12 12:43:04
【Question】:

Suppose I have a 2D NumPy array like the one below:

arr = numpy.array([[1,0], [1, 4.6], [2, 10.1], [2, 0], [2, 3.53]])
arr
Out[39]: 
array([[  1.  ,   0.  ],
       [  1.  ,   4.6 ],
       [  2.  ,  10.1 ],
       [  2.  ,   0.  ],
       [  2.  ,   3.53]])

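What is the fastest way to group the values in the second column by the values in the first column and build a dictionary from them (desired output below)?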

{1: [0, 4.6], 2: [10.1, 0, 3.53]}

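I currently use a loop, which is slow because my actual array has more than 1 million rows and the first column has more than 5000 unique values. I would prefer not to use pandas.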

【Comments】:

Tags: python arrays numpy dictionary grouping

【Solution 1】:

Here is one approach:

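import numpy as np

def create_dict(arr):
    # Sort rows by column 0 (only needed if not already sorted)
    a = arr[arr[:,0].argsort()]
    # Start index of each group, plus the array length as the final boundary
    s0 = np.r_[0,np.flatnonzero(a[1:,0] > a[:-1,0])+1,a.shape[0]]
    # Key of each group, taken from its first row
    ids = a[s0[:-1],0]
    return {ids[i]:a[s0[i]:s0[i+1],1].tolist() for i in range(len(s0)-1)}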

Sample run:

In [64]: arr
Out[64]: 
array([[  2.  ,   0.  ],
       [  1.  ,   4.6 ],
       [  2.  ,  10.1 ],
       [  4.  ,   0.5 ],
       [  1.  ,   0.  ],
       [  4.  ,   0.23],
       [  2.  ,   3.53]])

In [65]: create_dict(arr)
Out[65]: {1.0: [4.6, 0.0], 2.0: [0.0, 10.1, 3.53], 4.0: [0.5, 0.23]}
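
To make the boundary computation easier to follow, here is a minimal sketch that exposes the intermediate values for the sample array above. The names `starts`, `boundaries`, `keys` and `groups` are illustrative only, and `kind="stable"` is my own substitution (not part of the original answer) so that rows with equal keys are guaranteed to keep their original order.

```python
import numpy as np

arr = np.array([[2, 0], [1, 4.6], [2, 10.1], [4, 0.5],
                [1, 0], [4, 0.23], [2, 3.53]])

# Assumption: a stable sort, so equal keys keep their original row order
a = arr[arr[:, 0].argsort(kind="stable")]

# Positions where the key increases mark the start of a new group
starts = np.flatnonzero(a[1:, 0] > a[:-1, 0]) + 1
s0 = np.r_[0, starts, a.shape[0]]

boundaries = s0.tolist()          # [0, 2, 5, 7]
keys = a[s0[:-1], 0].tolist()     # [1.0, 2.0, 4.0]

# Slice column 1 between consecutive boundaries to get each group
groups = {k: a[s0[i]:s0[i + 1], 1].tolist() for i, k in enumerate(keys)}
```

Each pair of consecutive entries in `boundaries` delimits one group in the sorted array, which is why the dict comprehension only needs plain slices.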

Runtime test

Other approaches:

from collections import defaultdict

# @Moinuddin Quadri's soln
def defaultdict_based(arr):
    my_list  = arr.tolist()
    my_dict = defaultdict(list)
    for key, value in my_list:
        my_dict[key].append(value)
    return my_dict

# @Psidom's soln
def numpy_split_based(arr):
    sort_arr = arr[arr[:, 0].argsort(), :]
    split_arr = np.split(sort_arr, np.where(np.diff(sort_arr[:,0]))[0] + 1) 
    return {s[0,0]: s[:,1].tolist() for s in split_arr}

Timings:

# Create sample random array with the first col having 1000000 elems
# with 5000 unique ones as stated in the question
In [102]: arr = np.random.randint(0,5000,(1000000,2))

In [103]: %timeit defaultdict_based(arr)
     ...: %timeit numpy_split_based(arr)
     ...: %timeit create_dict(arr)
     ...: 
1 loops, best of 3: 634 ms per loop
1 loops, best of 3: 270 ms per loop
1 loops, best of 3: 260 ms per loop

Bottlenecks of the approaches:

With the defaultdict-based approach, converting to a list with .tolist() turns out to be the heavy part (> 50% of the total runtime):

In [104]: %timeit arr.tolist()
1 loops, best of 3: 372 ms per loop

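For the other two approaches, the sorting at the start (if needed) and the splitting / loop comprehension at the end are the time-consuming parts. Runtime of the sorting step (roughly 50% of the total runtime):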

In [106]: %timeit arr[arr[:,0].argsort()]
10 loops, best of 3: 140 ms per loop
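
Before comparing timings, it can be worth checking that the loop-based and sort-based approaches actually return the same grouping. The sketch below is one way to do that on a smaller random array. It is not taken from the answers: the function bodies are simplified re-implementations, `kind="stable"` is an assumption added so that the order inside each group matches the loop version, and the keys are converted with `int()` because the test data is integer-valued.

```python
import numpy as np
from collections import defaultdict

def defaultdict_based(arr):
    # Plain Python loop over the rows, appending in original order
    d = defaultdict(list)
    for key, value in arr.tolist():
        d[key].append(value)
    return dict(d)

def split_based(arr):
    # Assumption: stable sort keeps the original row order inside each group
    a = arr[arr[:, 0].argsort(kind="stable")]
    cut = np.flatnonzero(np.diff(a[:, 0])) + 1
    return {int(s[0, 0]): s[:, 1].tolist() for s in np.split(a, cut)}

# Smaller than the benchmark above so the check runs quickly
rng = np.random.default_rng(0)
arr = rng.integers(0, 50, size=(10_000, 2))

same = defaultdict_based(arr) == split_based(arr)
```

With the default (non-stable) sort the groups would still contain the same values, but possibly in a different order, so a direct dict comparison could fail.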

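【Comments】:

I'd like to know the time difference when arr.tolist() is called outside the function in my solution, since that would make for a better comparison. Could you share the stats with me? Also, what happens if the numpy array is used with defaultdict directly? :)
@MoinuddinQuadri Added the timing for that step. Since the question takes a NumPy array as input, I think we need to keep it inside the function call to be fair :)
@MoinuddinQuadri I tested defaultdict with the numpy array directly, and it turned out worse at 977 ms.
Thanks. Good to know. My current setup doesn't have numpy installed, so sorry for bothering you to recompute the stats. But it's good to have the additional information in the answer :)
@MoinuddinQuadri No worries at all! Love timing stuff :)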

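【Solution 2】:

You can do this without numpy by using collections.defaultdict. In fact, based on the example you provided, you don't even need a NumPy array; a Python list is enough for your requirement. Here is an example: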

from collections import defaultdict
my_list = [[1,0], [1, 4.6], [2, 10.1], [2, 0], [2, 3.53]]

my_dict = defaultdict(list)
for key, value in my_list:
    my_dict[key].append(value)

    # if you want the values as float in the dict, use:
    #     my_dict[float(key)].append(float(value))

The final contents of my_dict will be:

{1: [0, 4.6], 2: [10.1, 0, 3.53]}
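
If the data starts out as a NumPy array, as in the question, the same loop can be fed from `.tolist()`. The following is a minimal sketch under two assumptions that are mine rather than the answer's: the keys are integer-valued (hence `int(key)`), and a plain `dict` is wanted at the end instead of a `defaultdict`.

```python
from collections import defaultdict
import numpy as np

arr = np.array([[1, 0], [1, 4.6], [2, 10.1], [2, 0], [2, 3.53]])

my_dict = defaultdict(list)
# .tolist() converts the rows to Python floats once, up front
for key, value in arr.tolist():
    # Assumption: keys are whole numbers, so int() is lossless here
    my_dict[int(key)].append(value)

# Convert to a plain dict so missing keys raise KeyError again
result = dict(my_dict)
```

Because the array has a float dtype, the grouped values come back as floats (`0.0` rather than `0`).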

【Comments】:

Indeed, and simpler! When Python itself can do this in a single iteration with linear complexity, I don't see the point of using numpy.

【Solution 3】:

You can use np.split:

# sort array by the first column if it isn't
sort_arr = arr[arr[:, 0].argsort(), :]

# split the array and construct the dictionary
split_arr = np.split(sort_arr, np.where(np.diff(sort_arr[:,0]))[0] + 1)

{s[0,0]: s[:,1].tolist() for s in split_arr}
# {1.0: [0.0, 4.6], 2.0: [10.1, 0.0, 3.53]}
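
To show where the split points come from, here is a small sketch that breaks the `np.where(np.diff(...))` expression into steps on the question's array, which is already sorted by its first column. The intermediate names `steps`, `cut`, `pieces` and `shapes` are illustrative only.

```python
import numpy as np

arr = np.array([[1, 0], [1, 4.6], [2, 10.1], [2, 0], [2, 3.53]])

keys = arr[:, 0]                 # already sorted in this example
steps = np.diff(keys)            # nonzero wherever the key changes
cut = np.where(steps)[0] + 1     # index of the first row of each new group

# np.split cuts along axis 0 at the given row indices
pieces = np.split(arr, cut)
shapes = [p.shape for p in pieces]
```

Here `cut` holds a single index, 2, so `np.split` returns two sub-arrays: the two rows with key 1 and the three rows with key 2.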

【Comments】:

【Solution 4】:

Assuming your first column is sorted, this will work.

In [165]: d = {}

In [166]: uniq, idx, idxinv, counts = np.unique(arr[:, 0], return_index=True, return_inverse=True, return_counts=True)

In [167]: [d.update({arr[:, 0][el]: arr[:, 1][el:el + counts[ix]]}) for ix, el in enumerate(idx)]
Out[167]: [None, None]

In [168]: d
Out[168]: {1.0: array([ 0. ,  4.6]), 2.0: array([ 10.1 ,   0.  ,   3.53])}
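
If the first column is not sorted, one possible variant sorts first and then uses `np.unique(..., return_index=True)` to find where each group starts. This is a sketch rather than the answer's own code: the function name `group_with_unique` is hypothetical, and `kind="stable"` is an assumption added to keep the original row order inside each group.

```python
import numpy as np

def group_with_unique(arr):
    # Assumption: stable sort, so equal keys keep their original order
    order = arr[:, 0].argsort(kind="stable")
    keys_sorted = arr[order, 0]
    values_sorted = arr[order, 1]

    # On sorted keys, `first` holds the start index of each group
    uniq, first = np.unique(keys_sorted, return_index=True)

    # first[0] is always 0, so split at the remaining start indices
    groups = np.split(values_sorted, first[1:])
    return {k: g.tolist() for k, g in zip(uniq.tolist(), groups)}

# Unsorted version of the question's data
arr = np.array([[2, 10.1], [1, 0], [2, 0], [1, 4.6], [2, 3.53]])
result = group_with_unique(arr)
```

Only the value column is split here, which avoids slicing the key column for every group.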

【Comments】: