Numba with a list of tuples: answers

【Question title】: Numba with list of tuples
【Posted】: 2021-06-04 05:04:18
【Question】:

I have a list of tuples queried from a database.

tuple_data = [(1.1,1,"one"),(2.1,2,"two"),(3.1,3,"three")]

Each tuple contains different data types.

From this list I need the first element of every tuple, so I did this:

data = [result[0] for result in tuple_data]

Now I am trying to use the numba module instead of the list comprehension.

So I tried the following:

import numba

@numba.njit(cache=True)
def loop_faster(results):
    # Collect the first element of every row
    res = []
    for result in results:
        res.append(result[0])
    return res

This raises a NumbaPendingDeprecationWarning, and according to the numba documentation I cannot iterate over a list of tuples.

So I converted it to a numpy array:

L_arr = np.array(tuple_data)

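Now everything works. The loop_faster method runs fine.

The catch is that my original data is (float, int, str), and after converting to a numpy array every element becomes (str, str, str), which is expected.

The problem is that I need the data itself as float.

So my code looks like this: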

import numba, logging
import numpy as np

numba_logger = logging.getLogger('numba')
numba_logger.setLevel(logging.WARNING)

@numba.njit(cache = True)
def loop_faster_1(results, n):
    res = []
    for result in results:
        res.append(result[0])
    print(res)

t1 = [(1.1,1,"one"),(2.1,2,"two"),(3.1,3,"three")]
L_arr = np.array(t1)
loop_faster_1(L_arr,0)
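
The dtype change described above can be checked directly. The following is a minimal sketch using only NumPy (numba is left out on purpose); it shows that the mixed tuples collapse into one Unicode string dtype and that the first column has to be converted back explicitly.

```python
import numpy as np

t1 = [(1.1, 1, "one"), (2.1, 2, "two"), (3.1, 3, "three")]
L_arr = np.array(t1)

# With mixed float/int/str input, NumPy settles on a single string dtype.
print(L_arr.dtype.kind)      # 'U' means Unicode string
print(type(L_arr[0, 0]))     # a NumPy string type, no longer a Python float

# Getting floats back requires an explicit conversion of the first column.
first_col = L_arr[:, 0].astype(np.float64)
print(first_col.dtype)
```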

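In the real case my list of tuples is large. I convert it to a numpy array for numba, and since I need float data I then have to convert all the str values back to float.

Basically, with numba:

list of tuples
convert to a numpy array
call the numba method
convert back to float
use for further processing.

But with a list comprehension:

list of tuples
list comprehension
use for further processing.

Is there a better way to use numba here? Or should I just use the list comprehension and drop these extra steps? With them, I feel I am actually defeating the purpose of reducing the loop time.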

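【Question comments】:

L_arr[:,0] is the first column of the array. There is no need for numba. But converting the list to an array takes time. You can avoid the float-to-string conversion by making a structured array.

Tags: python list numpy tuples

【Solution 1】: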

You do not need numba to get the first column of a numpy array quickly. That is a basic indexing operation.

Your sample:

In [23]: tuple_data = [(1.1,1,"one"),(2.1,2,"two"),(3.1,3,"three")]

The obvious list comprehension:

In [24]: [x[0] for x in tuple_data]
Out[24]: [1.1, 2.1, 3.1]

The array approach:

In [25]: np.array(tuple_data)[:,0].astype(float)
Out[25]: array([1.1, 2.1, 3.1])

The list comprehension is much faster, at least for this small sample:

In [26]: timeit [x[0] for x in tuple_data]
349 ns ± 7.76 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
In [27]: timeit np.array(tuple_data)[:,0].astype(float)
15.4 µs ± 55.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

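The main time consumer is np.array(...). The indexing itself is fast. The comprehension should scale linearly with the list size. But so should np.array, since it also processes the list element by element. There may be a size at which the array approach eventually becomes faster, but past experience suggests it would be well over 1000 tuples.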
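
To see how the two approaches scale on a given machine, a rough timing sketch such as the following can be used. The helper make_tuples and the chosen sizes are assumptions for illustration, not part of the original answer, and the printed timings will differ between systems.

```python
import timeit
import numpy as np

def make_tuples(n):
    # Hypothetical helper: builds n tuples shaped like the sample data.
    return [(i + 0.1, i, str(i)) for i in range(n)]

def first_by_comprehension(rows):
    return [row[0] for row in rows]

def first_by_array(rows):
    # Same expression as In [25]: build a string array, slice, convert back.
    return np.array(rows)[:, 0].astype(float)

for n in (3, 1_000, 10_000):
    rows = make_tuples(n)
    t_list = timeit.timeit(lambda: first_by_comprehension(rows), number=10)
    t_arr = timeit.timeit(lambda: first_by_array(rows), number=10)
    print(f"n={n}: comprehension {t_list:.6f}s, array {t_arr:.6f}s")
```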

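Another approach is to use a compound dtype and make a structured array. Now the floats are not converted to strings. But this does not improve the timing.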

In [28]: np.array(tuple_data, dtype='f,f,U10')
Out[28]: 
array([(1.1, 1., 'one'), (2.1, 2., 'two'), (3.1, 3., 'three')],
      dtype=[('f0', '<f4'), ('f1', '<f4'), ('f2', '<U10')])
In [29]: np.array(tuple_data, dtype='f,f,U10')['f0']
Out[29]: array([1.1, 2.1, 3.1], dtype=float32)
In [30]: timeit np.array(tuple_data, dtype='f,f,U10')['f0']
21.1 µs ± 1.08 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
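
Note that 'f' in the dtype string above means float32, which is why Out[29] reports float32 and the integer column is stored as a float. If float64 values and a true integer column are wanted, a variant such as the following might be used. The dtype string 'f8,i8,U10' is an assumption on my part, and the U10 width would truncate longer strings.

```python
import numpy as np

tuple_data = [(1.1, 1, "one"), (2.1, 2, "two"), (3.1, 3, "three")]

# 'f8,i8,U10' = float64, int64, Unicode string of up to 10 characters.
records = np.array(tuple_data, dtype="f8,i8,U10")

first = records["f0"]         # default field names are f0, f1, f2
print(first.dtype)            # float64
print(records["f1"].dtype)    # int64
```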

Here is another approach, creating an object dtype array:

In [31]: np.array(tuple_data, dtype=object)[:,0]
Out[31]: array([1.1, 2.1, 3.1], dtype=object)
In [32]: timeit np.array(tuple_data, dtype=object)[:,0]
4.52 µs ± 9.41 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

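This is better than the other array approaches, though they all still lag behind the comprehension. It creates a (3,3) object array holding the same objects as the tuples. It is much like a list, except that it supports multidimensional indexing.

The full arrays: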
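
A small sketch of what the object array holds: the elements keep their original Python types, so the first column only needs one conversion if a numeric float64 array is wanted afterwards. The final astype step is an addition for illustration and is not part of the timing above.

```python
import numpy as np

tuple_data = [(1.1, 1, "one"), (2.1, 2, "two"), (3.1, 3, "three")]
obj_arr = np.array(tuple_data, dtype=object)

print(obj_arr.shape)          # (3, 3)
print(type(obj_arr[0, 0]))    # Python float, not a string
print(type(obj_arr[0, 2]))    # Python str

# Optional: turn the object column into a numeric float64 array.
col = obj_arr[:, 0].astype(np.float64)
print(col.dtype)
```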

In [33]: np.array(tuple_data, dtype=object)
Out[33]: 
array([[1.1, 1, 'one'],
       [2.1, 2, 'two'],
       [3.1, 3, 'three']], dtype=object)
In [34]: np.array(tuple_data, dtype='f,f,U10')
Out[34]: 
array([(1.1, 1., 'one'), (2.1, 2., 'two'), (3.1, 3., 'three')],
      dtype=[('f0', '<f4'), ('f1', '<f4'), ('f2', '<U10')])

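【Comments】:

The only reason I built a numpy array was to use it with the numba method. If the list comprehension takes less time than all of these, and in my scenario I cannot use numba with a list of tuples anyway, then I will stick with the list comprehension!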
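
If a numeric array is still needed later (for example as input to other compiled code), one option not discussed in the thread is to pull out only the float column and never build the string array. The sketch below uses np.fromiter for that; it is a suggestion under that assumption and has not been timed against the approaches above.

```python
import numpy as np

tuple_data = [(1.1, 1, "one"), (2.1, 2, "two"), (3.1, 3, "three")]

# Build a float64 array from the first element of each tuple only.
# Passing count lets NumPy allocate the output once.
first = np.fromiter(
    (row[0] for row in tuple_data),
    dtype=np.float64,
    count=len(tuple_data),
)
print(first.dtype)
```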