Optimizing numpy concatenate operations for very large data sets

【Question title】: Optimizing numpy concatenate operations for very large data sets
【Posted】: 2020-02-23 02:19:26
【Question】:

I have a dictionary with a large number of keys (~300k and growing), and each value is a set that also contains a large number of items (~20k).

dictionary = {
    1: {1, 2, 3},
    2: {3, 4},
    3: {5, 6},
    4: {1, 5, 12, 13},
    5: set()
}

What I want to achieve is to create two arrays:

keys  = [1 1 1 2 2 3 3 4 4 4  4]
items = [1 2 3 3 4 5 6 1 5 12 13]

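This is essentially a mapping of every item in every set to its corresponding key.

I tried to do this with numpy, but it still takes a long time, and I would like to know whether it can be optimized.

numpy code: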

keys = np.concatenate(list(map(lambda x: np.repeat(x[0], len(x[1])), dictionary.items())))
items = np.concatenate(list(map(lambda x: list(x), dictionary.values())))

keys = np.array(keys, dtype=np.uint32)
items = np.array(items, dtype=np.uint16)

return keys, items

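The second part is an attempt to reduce the memory footprint of these variables by giving each one an appropriate data type. But I know that they will still default to 64-bit values in the first two operations (before the dtype change is applied), so that memory will be allocated anyway and I may run out of RAM.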
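One way to sidestep the 64-bit intermediates described above is to allocate both output arrays once, in their final dtypes, and fill them slice by slice. The following is only a minimal sketch under that assumption: the helper name `build_arrays_preallocated` is hypothetical and not part of the original question, and the sketch assumes every key fits in `uint32` and every item fits in `uint16`.

```python
import numpy as np

def build_arrays_preallocated(dictionary):
    """Hypothetical helper: fill preallocated uint32/uint16 arrays per set."""
    # One pass to learn the final length, so each array is allocated once.
    total = sum(len(s) for s in dictionary.values())
    keys = np.empty(total, dtype=np.uint32)
    items = np.empty(total, dtype=np.uint16)

    pos = 0
    for key, values in dictionary.items():
        n = len(values)
        if n == 0:
            continue  # empty sets contribute nothing
        keys[pos:pos + n] = key            # broadcast the key over the slice
        items[pos:pos + n] = list(values)  # only one set is materialized at a time
        pos += n
    return keys, items

# Sample data from the question.
dictionary = {
    1: {1, 2, 3},
    2: {3, 4},
    3: {5, 6},
    4: {1, 5, 12, 13},
    5: set(),
}

keys, items = build_arrays_preallocated(dictionary)
```

The largest temporary here is the list built from a single set, so peak memory stays close to the size of the two result arrays. The order of items within one key follows set iteration order and is therefore not guaranteed to be sorted.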

【Question comments】:

Tags: python numpy optimization out-of-memory large-data

【Solution 1】:

I am not sure whether it will perform better, but the straightforward approach would be this:

import numpy as np

keys = np.array(list(dictionary.keys()), dtype=np.uint32).repeat([len(s) for s in dictionary.values()])

values = np.concatenate([np.array(list(s), np.uint16) for s in dictionary.values()])

display(keys)
display(values)

【Comments】:

【Solution 2】:

For this small sample, a pure-list version is much faster than the numpy version:

In [14]: timeit list(itertools.chain.from_iterable([[item[0]]*len(item[1]) for item in dictionary.items()]))                                           
2.71 µs ± 18.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [15]: timeit np.concatenate(list(map(lambda x: np.repeat(x[0], len(x[1])), dictionary.items())))                                                    
52.2 µs ± 284 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

and

In [24]: list(itertools.chain.from_iterable(dictionary.values()))               
Out[24]: [1, 2, 3, 3, 4, 5, 6, 1, 13, 12, 5]
In [25]: timeit list(itertools.chain.from_iterable(dictionary.values()))        
876 ns ± 10.4 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [26]: timeit np.concatenate(list(map(lambda x: list(x), dictionary.values())))                                                                      
13.8 µs ± 32.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

And Paul Panzer's version:

In [41]: timeit np.fromiter(itertools.chain.from_iterable(dictionary.values()),'int32')                                                                
3.69 µs ± 9.07 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

【Comments】:

【Solution 3】:

It may be better to use np.fromiter here. It is certainly easier on memory, since it avoids creating all those temporary objects.
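As a small illustration of that idea on the sample dictionary from the question, the sketch below builds both arrays directly in their target dtypes. It assumes that all keys fit in `uint32` and all items fit in `uint16`. The `count` argument tells `np.fromiter` the final length up front, so the output can be allocated once instead of being grown.

```python
import itertools
import numpy as np

# Sample data from the question.
dictionary = {
    1: {1, 2, 3},
    2: {3, 4},
    3: {5, 6},
    4: {1, 5, 12, 13},
    5: set(),
}

n_keys = len(dictionary)

# Size of each set, in dictionary order.
sizes = np.fromiter(map(len, dictionary.values()), dtype=np.intp, count=n_keys)

# Each key repeated once per item in its set; empty sets repeat zero times.
keys = np.fromiter(dictionary.keys(), dtype=np.uint32, count=n_keys).repeat(sizes)

# All items streamed from the sets straight into a uint16 array.
items = np.fromiter(
    itertools.chain.from_iterable(dictionary.values()),
    dtype=np.uint16,
    count=keys.size,
)
```

Both `dictionary.keys()` and `dictionary.values()` iterate in the same order, so position `i` in `items` belongs to the key at position `i` in `keys`. The order of items within a single key follows set iteration order.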

Timings:

import numpy as np
import itertools as it
from simple_benchmark import BenchmarkBuilder

B = BenchmarkBuilder()

@B.add_function()
def pp(a):
    szs = np.fromiter(map(len,a.values()),int,len(a))
    ks = np.fromiter(a.keys(),np.uint32,len(a)).repeat(szs)
    vls = np.fromiter(it.chain.from_iterable(a.values()),np.uint16,ks.size)
    return ks,vls

@B.add_function()
def OP(a):
    keys = np.concatenate(list(map(lambda x: np.repeat(x[0], len(x[1])), a.items())))
    items = np.concatenate(list(map(list, a.values())))
    return keys, items

@B.add_function()
def DevKhadka(a):
    keys = np.array(list(a.keys()), dtype=np.uint32).repeat([len(s) for s in a.values()])
    values = np.concatenate([np.array(list(s), np.uint16) for s in a.values()])
    return keys,values

@B.add_function()
def hpaulj(a):
    ks = list(it.chain.from_iterable([[item[0]]*len(item[1]) for item in a.items()]))                                           
    vls = list(it.chain.from_iterable(a.values()))
    return ks,vls


@B.add_arguments('total no items')
def argument_provider():
    for exp in range(1,12):
        sz = 2**exp
        a = {j: set(np.random.randint(1, 2**16, np.random.randint(1, sz)).tolist())
             for j in range(1, 10*sz)}
        yield sum(map(len,a.values())),a

r = B.run()
r.plot()

import pylab
pylab.savefig('dct2np.png')

【Comments】: