Searching a nested list with a lot of data in Python: answers

【Question title】: Looking through a nested list with a lot of data in Python
【Posted】: 2016-11-19 20:44:13
【Question】:

I have to find which lists in a nested list contain a given word and return a boolean numpy array.

nested_list = [['a','b','c'],['a','b'],['b','c'],['c']]
words = 'c'
result=[1,0,1,1]

I'm using this list comprehension to do it, and it works:

np.array([words in x for x in nested_list])

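But my nested list contains 700k lists, so it takes a long time. I also have to do this many times: the lists are static, but the word can change.

One pass with the list comprehension takes 0.36 s. I need a faster way. Is there one?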

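【Comments】:

If the lists are static and you do this often, you could index them once and then use that index. Building the index is itself expensive, so it isn't worth it for a single pass.
At any one time, would words be just one character, or could there be more than one?
Actually, words can contain more than one character. If words = ['c','b'], then I need 2 boolean arrays: result=[[1,0,1,1],[1,1,1,0]].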
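
The indexing idea from the first comment could look roughly like the sketch below. It is only an illustration, not code from the thread: the names index and lookup are made up here, and it assumes the items are hashable strings. The index is built once, after which each word costs one dictionary lookup plus one array assignment.

```python
from collections import defaultdict

import numpy as np

nested_list = [['a', 'b', 'c'], ['a', 'b'], ['b', 'c'], ['c']]

# Build the index once: item -> positions of the sublists that contain it.
positions_by_item = defaultdict(set)
for pos, sublist in enumerate(nested_list):
    for item in sublist:
        positions_by_item[item].add(pos)

# Freeze each set into a sorted integer array so it can be used for indexing.
index = {item: np.array(sorted(positions), dtype=np.intp)
         for item, positions in positions_by_item.items()}

NO_MATCH = np.empty(0, dtype=np.intp)


def lookup(word, n=len(nested_list)):
    """Return a boolean array marking the sublists that contain `word`."""
    mask = np.zeros(n, dtype=bool)
    mask[index.get(word, NO_MATCH)] = True
    return mask


# One row per word, as requested in the comment above.
result = np.array([lookup(w) for w in ['c', 'b']])
```

Whether this pays off depends on how many different words are queried, since building the index means one full pass over every item.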

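Tags: python string performance numpy list-comprehension

【Solution 1】:

We can flatten the elements of all the sub-lists into a single 1D array. Then we simply look for any occurrence of 'c' within the limits of each sub-list in that flattened 1D array. Following that idea, there are two approaches, which differ in how we count the occurrences of 'c'.

Approach #1: One approach uses np.bincount -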

lens = np.array([len(i) for i in nested_list])
arr = np.concatenate(nested_list)
ids = np.repeat(np.arange(lens.size),lens)
out = np.bincount(ids, arr=='c')!=0

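Since, as stated in the question, nested_list does not change between iterations, we can reuse everything and loop over only the final step.

Approach #2: Another approach uses np.add.reduceat, reusing arr and lens from before -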
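
As a rough sketch of that reuse (the helper name contains and the minlength argument are additions for illustration, not part of the answer), the setup runs once and only the bincount call is repeated per word:

```python
import numpy as np

nested_list = [['a', 'b', 'c'], ['a', 'b'], ['b', 'c'], ['c']]

# One-time setup, reused for every word.
lens = np.array([len(sub) for sub in nested_list])
arr = np.concatenate(nested_list)              # flattened items
ids = np.repeat(np.arange(lens.size), lens)    # sublist id of each flattened item


def contains(word):
    """Boolean array: True where the corresponding sublist contains `word`."""
    # The boolean comparison is used as weights, so each bin counts the matches
    # in one sublist. minlength fixes the output at one entry per sublist.
    return np.bincount(ids, arr == word, minlength=lens.size) != 0


# Only this step is repeated when the word changes.
results = {w: contains(w) for w in ['c', 'b']}
```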

grp_idx = np.append(0,lens[:-1].cumsum())
out = np.add.reduceat(arr=='c', grp_idx)!=0

When there is a list of words to loop through, we can vectorize the final step: use np.add.reduceat along an axis together with broadcasting to get a 2D boolean array, like so -

np.add.reduceat(arr==np.array(words)[:,None], grp_idx, axis=1)!=0

Sample run -

In [344]: nested_list
Out[344]: [['a', 'b', 'c'], ['a', 'b'], ['b', 'c'], ['c']]

In [345]: words
Out[345]: ['c', 'b']

In [346]: lens = np.array([len(i) for i in nested_list])
     ...: arr = np.concatenate(nested_list)
     ...: grp_idx = np.append(0,lens[:-1].cumsum())
     ...: 

In [347]: np.add.reduceat(arr==np.array(words)[:,None], grp_idx, axis=1)!=0
Out[347]: 
array([[ True, False,  True,  True],    # matches for 'c'
       [ True,  True,  True, False]])   # matches for 'b'
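
One way to gain confidence in the vectorized version is to compare it with the plain list comprehension from the question, as in the sketch below. It assumes that no sub-list is empty: np.add.reduceat does not return 0 for an empty segment (it returns the element at that offset instead), so empty sub-lists would need separate handling.

```python
import numpy as np

nested_list = [['a', 'b', 'c'], ['a', 'b'], ['b', 'c'], ['c']]
words = ['c', 'b']

# Assumption: every sublist is non-empty (see the note above).
lens = np.array([len(sub) for sub in nested_list])
arr = np.concatenate(nested_list)
grp_idx = np.append(0, lens[:-1].cumsum())     # start offset of each sublist

# Vectorized: one row per word, one column per sublist.
fast = np.add.reduceat(arr == np.array(words)[:, None], grp_idx, axis=1) != 0

# Reference: the list comprehension from the question, once per word.
slow = np.array([[w in sub for sub in nested_list] for w in words])

assert np.array_equal(fast, slow)
```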

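【Comments】:

Do lens and arr need to be recomputed on every loop?
@jevanio With approach #1, the last step, np.bincount(ids, arr=='c')!=0, would be the only thing that is looped. With approach #2, you don't need to loop at all, as shown in the sample run.
It does make it faster. It now takes 0.13 s.
@jevanio How long did your original approach take?
Originally it took 0.3 s, so the time was roughly cut in half, but that is still long given how many times I have to run it u.u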

【Solution 2】:

A generator expression is the better choice (performance-wise) when you iterate only once.
A solution using the numpy.fromiter function:

nested_list = [['a','b','c'],['a','b'],['b','c'],['c']]
words = 'c'
arr = np.fromiter((words in l for l in nested_list), int)

print(arr)

Output:

[1 0 1 1]

https://docs.scipy.org/doc/numpy/reference/generated/numpy.fromiter.html
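
Since the question asks for a boolean array, a small variation is to pass dtype=bool, and to pass count so that numpy can allocate the output up front. This is a sketch of that variation, not something timed in the thread:

```python
import numpy as np

nested_list = [['a', 'b', 'c'], ['a', 'b'], ['b', 'c'], ['c']]
word = 'c'

# count tells fromiter how many items to read, so the output is allocated once.
mask = np.fromiter((word in sub for sub in nested_list),
                   dtype=bool, count=len(nested_list))
```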

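【Comments】:

A bit faster, but only a bit u.u

【Solution 3】:

How long does the loop take to complete? In my test case it takes only a couple of hundred milliseconds.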

import random

# generate the nested lists
a = list('abcdefghijklmnop')
nested_list = [ [random.choice(a) for x in range(random.randint(1,30))]
                for n in range(700000)]

%%timeit -n 10
word = 'c'
b = [word in x for x in nested_list]
# 10 loops, best of 3: 191 ms per loop

Reducing each inner list to a set saves some time...

nested_sets = [set(x) for x in nested_list]
%%timeit -n 10
word = 'c'
b = [word in s for s in nested_sets]
# 10 loops, best of 3: 132 ms per loop

Once you have a list of sets, you can build a list of boolean tuples. It doesn't really save any time, though.

%%timeit -n 10
words = list('abcde')
b = [(word in s for word in words) for s in nested_sets]
# 10 loops, best of 3: 749 ms per loop
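
Note that, as written, the inner parenthesized expression in the last snippet creates lazy generator objects rather than tuples. A minimal sketch that materializes the tuples and arranges them one row per word (the layout the asker describes in the question's comments) might look like this. The names rows and result are illustrative, and the timing above does not apply to this variant.

```python
import numpy as np

nested_list = [['a', 'b', 'c'], ['a', 'b'], ['b', 'c'], ['c']]
nested_sets = [set(sub) for sub in nested_list]
words = ['c', 'b']

# tuple(...) forces evaluation; a bare (... for ...) only creates a generator.
rows = [tuple(word in s for word in words) for s in nested_sets]

# rows has one tuple per sublist; transpose to get one row per word.
result = np.array(rows).T
```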

【Comments】:

It now takes me 0.327 s per loop, which is too high u.u