Filtering two lists by comparing all items to each other with numpy or tabular

【Question title】: Filtering two lists by comparing all items to each other with numpy or tabular
【Posted】: 2011-05-26 19:35:25
【Question】:

I have two lists of tuples, and the tuples within each list are unique. The lists have the following format:

[('col1', 'col2', 'col3', 'col4'), ...]

I am using nested loops to find the members of both lists that have the same values in the given columns, col2 and col3:

temp1 = set()
temp2 = set()
for item1 in list1:
    for item2 in list2:
        # col2 and col3 are positions 1 and 2 of each plain tuple
        # (with tabular records, item1['col2'] etc. would be used instead)
        if item1[1] == item2[1] and item1[2] == item2[2]:
            temp1.add(item1)
            temp2.add(item2)

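This works, but when the lists contain tens of thousands of items it takes several minutes to finish.

Using tabular, I can filter list1 against col2 and col3 of a single item of list2, like this: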

list1 = tb.tabular(records=[...], names=['col1','col2','col3','col4'])
...

for (col1, col2, col3, col4) in list2:
    list1[(list1['col2'] == col2) & (list1['col3'] == col3)]

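This is clearly "doing it wrong", and it is much slower than the first approach.

How can I efficiently check the items of one list of tuples against all items of another, using numpy or tabular?

Thanks.

【Comments】:

Tags: python arrays numpy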

【Solution 1】:

Try this:

temp1 = set([])
temp2 = set([])

dict1 = dict()
dict2 = dict()

for key, value in zip([tuple(l[1:3]) for l in list1], list1):
    dict1.setdefault(key, list()).append(value)

for key, value in zip([tuple(l[1:3]) for l in list2], list2):
    dict2.setdefault(key, list()).append(value)

for key in dict1:
    if key in dict2:
        temp1.update(dict1[key])
        temp2.update(dict2[key])

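It's dirty, but it should work.

【Comments】:

Works great, thanks. For testing I used two lists of 10000 tuples each, with 4 random integers per tuple. The nested loops took 59.0425438881 seconds, whereas yours took 0.13046002388 seconds.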
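A comparison like the one in the comment above could be set up roughly as follows. This is only a sketch: the function names, the `random_tuples` helper, the value range, and the list size of 1000 are assumptions made for illustration (the comment used 10000 tuples per list), and timings will differ from machine to machine.

```python
import random
import time
from collections import defaultdict


def nested_loops(list1, list2):
    """Quadratic approach from the question: compare every pair of items."""
    temp1, temp2 = set(), set()
    for item1 in list1:
        for item2 in list2:
            if item1[1] == item2[1] and item1[2] == item2[2]:
                temp1.add(item1)
                temp2.add(item2)
    return temp1, temp2


def grouped_by_key(list1, list2):
    """Dictionary approach: group each list by (col2, col3), then match keys."""
    dict1, dict2 = defaultdict(list), defaultdict(list)
    for item in list1:
        dict1[item[1:3]].append(item)
    for item in list2:
        dict2[item[1:3]].append(item)
    temp1, temp2 = set(), set()
    for key in dict1.keys() & dict2.keys():
        temp1.update(dict1[key])
        temp2.update(dict2[key])
    return temp1, temp2


def random_tuples(n, rng, upper=100):
    """Hypothetical helper: n tuples of 4 random integers below `upper`."""
    return [tuple(rng.randrange(upper) for _ in range(4)) for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    list1 = random_tuples(1000, rng)
    list2 = random_tuples(1000, rng)
    for func in (nested_loops, grouped_by_key):
        start = time.perf_counter()
        result = func(list1, list2)
        elapsed = time.perf_counter() - start
        print(func.__name__, len(result[0]), len(result[1]), elapsed)
```

Both functions should return the same pair of sets, so the timing difference comes only from how the matching keys are found.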

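【Solution 2】:

"How can I efficiently check the items of one list of tuples against all items of another, using numpy or tabular?"

Well, I have no experience with tabular, nor with numpy, so I can't give you an exact "canned" solution. But I think I can point you in the right direction. If list1 has length X and list2 has length Y, you are doing X * Y checks... whereas you only need to do X + Y checks.

You should do something like the following (I'll assume these are lists of regular Python tuples, not tabular records; I'm sure you can make the necessary adjustments):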

common = {}
for item in list1:
    key = (item[1], item[2])
    if key in common:
        common[key].append(item)
    else:
        common[key] = [item]

first_group = []
second_group = []
for item in list2:
    key = (item[1], item[2])
    if key in common:
        first_group.extend(common[key])
        second_group.append(item)

temp1 = set(first_group)
temp2 = set(second_group)
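The same X + Y idea can also be expressed with two sets of keys and a set intersection. The sketch below is one possible variant, not the answer's own code: the function name `filter_by_shared_key` and its `key` parameter are made up for illustration, and the sample data is borrowed from elsewhere in this thread.

```python
def filter_by_shared_key(list1, list2, key=lambda item: (item[1], item[2])):
    """Return the items of each list whose key also occurs in the other list."""
    keys1 = {key(item) for item in list1}   # one pass over list1
    keys2 = {key(item) for item in list2}   # one pass over list2
    shared = keys1 & keys2                  # keys present in both lists
    temp1 = {item for item in list1 if key(item) in shared}
    temp2 = {item for item in list2 if key(item) in shared}
    return temp1, temp2


list1 = [(0, 1, 2, 0), (0, 3, 4, 0), (1, 2, 3, 12)]
list2 = [(0, 1, 1, 0), (0, 3, 9, 9), (42, 2, 3, 12)]
temp1, temp2 = filter_by_shared_key(list1, list2)
print(temp1)  # {(1, 2, 3, 12)}
print(temp2)  # {(42, 2, 3, 12)}
```

Each list is walked twice here, so the work still grows with X + Y rather than X * Y.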

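【Comments】:

You're right, I completely missed the point about the comparison :) Thanks.

【Solution 3】:

I would create a subclass of tuple with special __eq__ and __hash__ methods: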

>>> class SpecialTuple(tuple):
...     def __eq__(self, t):
...             return self[1] == t[1] and self[2] == t[2]
...     def __hash__(self):
...             return hash((self[1], self[2]))
...

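It compares the elements at indices 1 and 2 (col2 and col3 in the question) and treats two tuples as equal whenever those columns match.

Filtering is then just a set intersection on these special tuples: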

>>> list1 = [ (0, 1, 2, 0), (0, 3, 4, 0), (1, 2, 3, 12) ]
>>> list2 = [ (0, 1, 1, 0), (0, 3, 9, 9), (42, 2, 3, 12) ]
>>> set(map(SpecialTuple, list1)) & set(map(SpecialTuple, list2))
set([(42, 2, 3, 12)])

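I don't know how fast it is. Let me know. :)

【Comments】:

The hash makes the set operations overwrite (?) existing items in the set, which leads to a smaller result :)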
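The effect described in this comment can be reproduced with a small experiment. The sketch below assumes Python 3 and uses made-up sample data in which several tuples share the same (col2, col3) values. Because such tuples compare equal and hash alike, a set keeps only one of them, so the intersection reports fewer tuples than the nested loops from the question would.

```python
class SpecialTuple(tuple):
    """Tuple that compares and hashes on elements 1 and 2 only."""

    def __eq__(self, other):
        return self[1] == other[1] and self[2] == other[2]

    def __ne__(self, other):
        return not self.__eq__(other)

    def __hash__(self):
        return hash((self[1], self[2]))


# Illustrative data: the first two tuples of list1 share the key (1, 2),
# and both tuples of list2 share the key (2, 3).
list1 = [(0, 1, 2, 0), (5, 1, 2, 7), (1, 2, 3, 12)]
list2 = [(42, 2, 3, 12), (43, 2, 3, 13)]

set1 = set(map(SpecialTuple, list1))
set2 = set(map(SpecialTuple, list2))

print(len(list1), len(set1))  # 3 2 : two tuples collapsed into one entry
print(len(list2), len(set2))  # 2 1
print(len(set1 & set2))       # 1 : a single representative for key (2, 3)
```

For this data the nested-loop approach would collect one tuple from list1 and both tuples from list2, so the set-based version loses matches whenever a key occurs more than once.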