Python list filtering: remove subsets from list of lists (answers)

【Question title】: Python list filtering: remove subsets from list of lists
【Posted】: 2009-08-23 16:23:20
【Question】:

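Using Python, how do you reduce a list of lists [[..],[..],..] by ordered subset matching?

In the context of this question, a list L is a subset of a list M if M contains all members of L, and in the same order. For example, the list [1,2] is a subset of the list [1,2,3], but not of the list [2,1,3].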

Example input:

a. [[1, 2, 4, 8], [1, 2, 4, 5, 6], [1, 2, 3], [2, 3, 21], [1, 2, 3, 4], [1, 2, 3, 4, 5, 6, 7]]
b. [[2, 16, 17], [1, 2, 3, 4, 5, 6, 7], [1], [1, 2, 3, 4], [1, 2], [17, 18, 19, 22, 41, 48], [2, 3], [1, 2, 3], [50, 69], [1, 2, 3], [2, 3, 21], [1, 2, 3], [1, 2, 4, 8], [1, 2, 4, 5, 6]]

Expected result:

a. [[1, 2, 4, 8], [2, 3, 21], [1, 2, 3, 4, 5, 6, 7]]
b. [[2, 16, 17], [1, 2, 3, 4, 5, 6, 7], [17, 18, 19, 22, 41, 48], [50, 69],  [2, 3, 21], [1, 2, 4, 8], [1, 2, 4, 5, 6]]

Further examples:

L = [[1, 2, 3, 4, 5, 6, 7], [1, 2, 5, 6]] - No reduce

L = [[1, 2, 3, 4, 5, 6, 7], ~~[1, 2, 3]~~, [1, 2, 4, 8]] - Yes reduce

L = [[1, 2, 3, 4, 5, 6, 7], [7, 6, 5, 4, 3, 2, 1]] - No reduce

(Sorry for the confusion caused by the incorrect data set.)
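
The definition above leaves open whether the matching members must be adjacent in the larger list. As a minimal sketch (the function names are illustrative and not part of the question), the two possible readings can be written side by side and tried against the examples:

```python
def is_subsequence(needle, haystack):
    """True if needle's items appear in haystack in order, gaps allowed."""
    it = iter(haystack)
    # `x in it` consumes the iterator up to the first match, so order is enforced.
    return all(x in it for x in needle)


def is_contiguous_sublist(needle, haystack):
    """True if needle appears in haystack as one unbroken run."""
    n = len(needle)
    return any(haystack[i:i + n] == needle
               for i in range(len(haystack) - n + 1))


full = [1, 2, 3, 4, 5, 6, 7]
for cand in ([1, 2, 3], [1, 2, 5, 6], [7, 6, 5, 4, 3, 2, 1]):
    print(cand, is_subsequence(cand, full), is_contiguous_sublist(cand, full))
```

Only the contiguous reading leaves [1, 2, 5, 6] unreduced, which is what the first of the further examples expects; under the gaps-allowed reading it would be removed.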

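【Question discussion】:

What is a superset list? Is it any list that does not appear as a subset of another?
Shouldn't [1,2,4,5,6] appear in the result?
No. By the question's definition, [1,2,4,5,6] is a "subset" of [1, 2, 3, 4, 5, 6, 7].
I think you need to produce a definitive set of test cases; I'd be happy to write code against them. It seems none of my answers is entirely correct.
I don't understand. [1,2,4,5,6] is omitted from one test data set because of [1,2,3,4,5,6,7], but not from this one? [[1, 2, 3, 4, 5, 6, 7], [1, 2, 4, 5, 6]] Am I misreading the "No reduce" comment?

Tags: python list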

【Solution 1】:

This could be simplified, but:

l = [[1, 2, 4, 8], [1, 2, 4, 5, 6], [1, 2, 3], [2, 3, 21], [1, 2, 3, 4], [1, 2, 3, 4, 5, 6, 7]]
l2 = l[:]

for m in l:
    for n in l:
        if set(m).issubset(set(n)) and m != n:
            l2.remove(m)
            break

print(l2)
# Output: [[1, 2, 4, 8], [2, 3, 21], [1, 2, 3, 4, 5, 6, 7]]
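
One caveat, sketched here under the question's own definition of an ordered subset: set.issubset() tests membership only, so it ignores order (and repeated items). The variable names below are illustrative.

```python
m = [2, 1]
n = [1, 2, 3]

# Membership-only test: order is ignored, so this reports a match.
set_based = set(m).issubset(set(n))

# Order-aware test: walk n once and require m's items in sequence.
it = iter(n)
order_aware = all(x in it for x in m)

print(set_based, order_aware)  # True False
```

So a set-based filter would treat [2, 1] as a subset of [1, 2, 3], which the question explicitly rules out.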

【Discussion】:

Sets, a list comprehension, and enumerate(): l2 = [m for i, m in enumerate(l) if not any(set(m).issubset(set(n)) for n in (l[:i] + l[i+1:]))]
Input: [[2, 16, 17], [1, 2, 3, 4, 5, 6, 7], [1], [1, 2, 3, 4], [1, 2], [17, 18, 19, 22, 41, 48], [2, 3], [1, 2, 3], [50, 69], [1, 2, 3], [2, 3, 21], [1, 2, 3], [1, 2, 4, 8], [1, 2, 4, 5, 6]] Output: [[2, 16, 17], [1, 2, 3, 4, 5, 6, 7], [17, 18, 19, 22, 41, 48], [50, 69], [2, 3, 21], [1, 2, 4, 8]] The sequence [1, 2, 4, 5, 6] is lost?
[1, 2, 4, 5, 6] should be removed, since it is an ordered subset of [1, 2, 3, 4, 5, 6, 7], right?

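【Solution 2】:

This code should be fairly memory-efficient. Beyond storing your initial list of lists, it uses negligible extra memory (no temporary sets or copies of lists are created).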

def is_subset(needle,haystack):
   """ Check if needle is ordered subset of haystack in O(n)  """

   if len(haystack) < len(needle): return False

   index = 0
   for element in needle:
      try:
         index = haystack.index(element, index) + 1
      except ValueError:
         return False
   else:
      return True

def filter_subsets(lists):
   """ Given list of lists, return new list of lists without subsets  """

   for needle in lists:
      if not any(is_subset(needle, haystack) for haystack in lists
         if needle is not haystack):
         yield needle

my_lists = [[1, 2, 4, 8], [1, 2, 4, 5, 6], [1, 2, 3], 
            [2, 3, 21], [1, 2, 3, 4], [1, 2, 3, 4, 5, 6, 7]]    
print(list(filter_subsets(my_lists)))
# Output: [[1, 2, 4, 8], [2, 3, 21], [1, 2, 3, 4, 5, 6, 7]]

And, just for fun, a one-liner:

def filter_list(L):
    return [x for x in L if not any(set(x)<=set(y) for y in L if x is not y)]

【Discussion】:

This line is a good idea: "index = haystack.index(element, index)". I shortened the list each time instead.
However, I suspect this code will say that [1,1,1,1,1,1] is a subset of [1]. You need "index = 1 + haystack.index(element, index)".
@hugh, your example would be caught by the length check first, but you are right: [1,1,1] is a subset of [2,1,3] in this code. Changing it now.
Same problem as @iElectric's solution: a sequence is lost. In: [[2, 16, 17], [1, 2, 3, 4, 5, 6, 7], [1], [1, 2, 3, 4], [1, 2], [17, 18, 19, 22, 41, 48], [2, 3], [1, 2, 3], [50, 69], [1, 2, 3], [2, 3, 21], [1, 2, 3], [1, 2, 4, 8], [1, 2, 4, 5, 6]] Out: [[2, 16, 17], [1, 2, 3, 4, 5, 6, 7], [17, 18, 19, 22, 41, 48], [50, 69], [2, 3, 21], [1, 2, 4, 8]]
1,2,4,5,6 is an ordered subset of 1,2,3,4,5,6,7. According to your spec, it should be removed.
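
The off-by-one raised in the comments above can be reproduced with a small sketch. Both function names are made up for this illustration, and the length pre-check is left out on purpose, since it cannot catch lists of equal length:

```python
def is_subset_no_advance(needle, haystack):
    """Variant that does NOT step past the matched position (for contrast only)."""
    index = 0
    for element in needle:
        try:
            index = haystack.index(element, index)      # may re-match the same slot
        except ValueError:
            return False
    return True


def is_subset_advance(needle, haystack):
    """Variant that resumes the search one past each match."""
    index = 0
    for element in needle:
        try:
            index = haystack.index(element, index) + 1  # next search starts after the match
        except ValueError:
            return False
    return True


print(is_subset_no_advance([1, 1, 1], [2, 1, 3]))  # True  (wrong)
print(is_subset_advance([1, 1, 1], [2, 1, 3]))     # False
```

Without the "+ 1", a repeated value keeps matching the same position in the haystack.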

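【Solution 3】:

A list is a superlist if it is not a sublist of any other list. A list is a sublist of another list if every one of its elements can be found, in order, in that other list.

Here is my code: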

def is_sublist_of_any_list(cand, lists):
    # Compare candidate to a single list
    def is_sublist_of_list(cand, target):
        try:
            i = 0
            for c in cand:
                i = 1 + target.index(c, i)
            return True
        except ValueError:
            return False
    # See if candidate matches any other list
    return any(is_sublist_of_list(cand, target) for target in lists if len(cand) <= len(target))

# Compare candidates to all other lists
def super_lists(lists):
    return [cand for i, cand in enumerate(lists) if not is_sublist_of_any_list(cand, lists[:i] + lists[i+1:])]

if __name__ == '__main__':
    lists = [[1, 2, 4, 8], [1, 2, 4, 5, 6], [1, 2, 3], [2, 3, 21], [1, 2, 3, 4], [1, 2, 3, 4, 5, 6, 7]]
    superlists = super_lists(lists)
    print(superlists)

Here is the result:

[[1, 2, 4, 8], [2, 3, 21], [1, 2, 3, 4, 5, 6, 7]]

Edit: results for your later data set.

>>> lists = [[2, 16, 17], [1, 2, 3, 4, 5, 6, 7], [1], [1, 2, 3, 4], [1, 2], [17, 18, 19, 22, 41, 48], [2, 3], [1, 2, 3], [50, 69], [1, 2, 3], [2, 3, 21], [1, 2, 3], [1, 2, 4, 8], [1, 2, 4, 5, 6]]
>>> superlists = super_lists(lists)
>>> expected = [[2, 16, 17], [1, 2, 3, 4, 5, 6, 7], [17, 18, 19, 22, 41, 48], [50, 69], [2, 3, 21], [1, 2, 4, 8]]
>>> assert(superlists == expected)
>>> print(superlists)
[[2, 16, 17], [1, 2, 3, 4, 5, 6, 7], [17, 18, 19, 22, 41, 48], [50, 69], [2, 3, 21], [1, 2, 4, 8]]

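【Discussion】:

Same problem: a sequence is lost.
Same problem as what? What does "a sequence is lost" mean? Does it mean the code doesn't produce the expected result? If not, please explain with an example. The code above produces the result shown.
OK, I tried it on your new data set, and the result is exactly what you expected/wanted.
Asking this late at night was not a good idea, and I feel terrible. I had omitted [1,2,4,5,6] from the expected result.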

【Solution 4】:

This seems to work:

original=[[1, 2, 4, 8], [1, 2, 4, 5, 6], [1, 2, 3], [2, 3, 21], [1, 2, 3, 4], [1, 2, 3, 4, 5, 6, 7]]

target=[[1, 2, 4, 8], [2, 3, 21], [1, 2, 3, 4, 5, 6, 7]]

class SetAndList:
    def __init__(self,aList):
        self.list=aList
        self.set=set(aList)
        self.isUnique=True
    def compare(self,aList):
        s=set(aList)
        if self.set.issubset(s):
            #print(self.list, 'superseded by', aList)
            self.isUnique=False

def listReduce(lists):
    temp=[]
    for l in lists:
        for t in temp:
            t.compare(l)
        temp.append( SetAndList(l) )

    return [t.list for t in temp if t.isUnique]

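print(listReduce(original))
print(target)

This prints the computed list and the target for visual comparison.

Uncomment the print line in the compare method to see how the various lists get superseded.

Tested with Python 2.6.2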

【Discussion】:

Fails to reduce fully. Given [[2, 16, 17], [1, 2, 3, 4, 5, 6, 7], [1], [1, 2, 3, 4], [1, 2], [17, 18, 19, 22, 41, 48], [2, 3], [1, 2, 3], [50, 69], [1, 2, 3], [2, 3, 21], [1, 2, 3], [1, 2, 4, 8], [1, 2, 4, 5, 6]] the output is [[2, 16, 17], [1, 2, 3, 4, 5, 6, 7], [1, 2, 3, 4], [17, 18, 19, 22, 41, 48], [50, 69], [2, 3, 21], [1, 2, 3], [1, 2, 4, 8], [1, 2, 4, 5, 6]]. It failed to reduce [1,2,3] into one of the larger groups.
@OP: see my next answer from today, 24 August.

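【Solution 5】:

I implemented a different issubseq because yours does not say that [1, 2, 4, 5, 6] is a subsequence of [1, 2, 3, 4, 5, 6, 7], for example (besides being very slow). The solution I came up with looks like this: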

def is_subseq(a, b):
    if len(a) > len(b): return False
    start = 0
    for el in a:
        while start < len(b):
            if el == b[start]:
                break
            start = start + 1
        else:
            return False
        # Step past the matched item so a repeated value cannot match it again.
        start = start + 1
    return True

def filter_partial_matches(sets):
    return [s for s in sets if all([not(is_subseq(s, ss)) for ss in sets if s != ss])]

A simple test case, given your input and output:

>>> test = [[1, 2, 4, 8], [1, 2, 4, 5, 6], [1, 2, 3], [2, 3, 21], [1, 2, 3, 4], [1, 2, 3, 4, 5, 6, 7]]
>>> another_test = [[1, 2, 3, 4], [2, 4, 3], [3, 4, 5]]
>>> filter_partial_matches(test)
[[1, 2, 4, 8], [2, 3, 21], [1, 2, 3, 4, 5, 6, 7]]
>>> filter_partial_matches(another_test)
[[1, 2, 3, 4], [2, 4, 3], [3, 4, 5]]

Hope this helps!

【Discussion】:

Same problem as noted in the comments on the other solutions: a sequence is lost.

【Solution 6】:

list0=[[1, 2, 4, 8], [1, 2, 4, 5, 6], [1, 2, 3], [2, 3, 21], [1, 2, 3, 4], [1, 2, 3, 4, 5, 6, 7]]

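for list1 in list0[:]:
    for list2 in list0:
        if list2!=list1:
            len1=len(list1)
            c=0
            for n in list2:
                if n==list1[c]:
                    c+=1
                if c==len1:
                    list0.remove(list1)
                    break
            else:
                continue  # list1 is not contained in list2; try the next list2
            break  # list1 was removed; stop so it is not removed a second time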

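This filters list0 in place while iterating over a copy of it. That works well if the expected result is about the same size as the original, with only a few "subsets" to remove.

If the expected result is small and the original list is large, you may prefer this more memory-friendly version, since it does not copy the original list.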

list0=[[1, 2, 4, 8], [1, 2, 4, 5, 6], [1, 2, 3], [2, 3, 21], [1, 2, 3, 4], [1, 2, 3, 4, 5, 6, 7]]
result=[]

for list1 in list0:
    subset=False
    for list2 in list0:
        if list2!=list1:
            len1=len(list1)
            c=0
            for n in list2:
                if n==list1[c]:
                    c+=1
                if c==len1:
                    subset=True
                    break
            if subset:
                break
    if not subset:
        result.append(list1)

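【Discussion】:

If you keep track of your position, you don't need to compare list1 and list2. Use enumerate() to get the index and build a list that omits that entry: "for i, list1 in enumerate(list0):\n for list2 in (list0[:i] + list0[i+1:]):\n"
OK, though I'm not sure it's worth it.
Same problem as described on the other solutions.

【Solution 7】:

Edit: I really need to improve my reading comprehension. Here is an answer to the actual question. It exploits the fact that "A is super of B" implies "len(A) > len(B) or A == B".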

def advance_to(it, value):
    """Advances an iterator until it matches the given value. Returns False
    if not found."""
    for item in it:
        if item == value:
            return True
    return False

def has_supersequence(seq, super_sequences):
    """Checks if the given sequence has a supersequence in the list of
    supersequences.""" 
    # Build a real list of iterators; a lazy map object has no len() in Python 3.
    candidates = [iter(s) for s in super_sequences]
    for next_item in seq:
        candidates = [it for it in candidates if advance_to(it, next_item)]
    return len(candidates) > 0

def find_supersequences(sequences):
    """Finds the supersequences in the given list of sequences.

    Sequence A is a supersequence of sequence B if B can be created by removing
    items from A."""
    super_seqs = []
    for candidate in sorted(sequences, key=len, reverse=True):
        if not has_supersequence(candidate, super_seqs):
            super_seqs.append(candidate)
    return super_seqs

print(find_supersequences([[1, 2, 4, 8], [1, 2, 4, 5, 6], [1, 2, 3],
    [2, 3, 21], [1, 2, 3, 4], [1, 2, 3, 4, 5, 6, 7]]))
#Output: [[1, 2, 3, 4, 5, 6, 7], [1, 2, 4, 8], [2, 3, 21]]

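If you also need to preserve the original order of the sequences, the find_supersequences() function needs to keep track of each sequence's position and then sort the output.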
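
A minimal sketch of such an order-preserving variant follows. It is not the answer author's code: find_supersequences_in_order and is_subsequence are hypothetical names, and it assumes the same gaps-allowed notion of supersequence as the answer above.

```python
def is_subsequence(needle, haystack):
    """True if needle's items appear in haystack in order (gaps allowed)."""
    it = iter(haystack)
    return all(x in it for x in needle)


def find_supersequences_in_order(sequences):
    """Keep only sequences with no supersequence, in their original order."""
    # Visit longest first, remembering each sequence's original position.
    by_length = sorted(enumerate(sequences), key=lambda p: len(p[1]), reverse=True)
    kept = []  # (original_index, sequence) pairs
    for index, candidate in by_length:
        if not any(is_subsequence(candidate, s) for _, s in kept):
            kept.append((index, candidate))
    # Restore input order by sorting on the remembered index.
    return [seq for _, seq in sorted(kept, key=lambda p: p[0])]


print(find_supersequences_in_order(
    [[1, 2, 4, 8], [1, 2, 4, 5, 6], [1, 2, 3],
     [2, 3, 21], [1, 2, 3, 4], [1, 2, 3, 4, 5, 6, 7]]))
# [[1, 2, 4, 8], [2, 3, 21], [1, 2, 3, 4, 5, 6, 7]]
```

Because equal sequences count as supersequences of each other here, exact duplicates collapse to their first occurrence.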

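【Discussion】:

This doesn't respect list order; e.g. given [[1,2,3,4],[2,4,3],[3,4,5]] the result is [[1,2,3,4],[2,4,3]] when I would expect it to return the initial input.
@Triptych: he didn't state that in the original question.
I did say that order matters: "order must be respected". But it's not a deal-breaker. Thanks for the possible solution.
@Oli_UK: if order doesn't matter, then using sets is the clear winner, and an iterative solution would be a mistake. Can you clarify this?
Your second solution should not include [1, 2, 4, 5, 6]. is_superseq() seems to assume that the elements must be contiguous for a list to be disqualified.

【Solution 8】:

Refined answer after the new test case: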

original= [[2, 16, 17], [1, 2, 3, 4, 5, 6, 7], [1], [1, 2, 3, 4], [1, 2], [17, 18, 19, 22, 41, 48], [2, 3], [1, 2, 3], [50, 69], [1, 2, 3], [2, 3, 21], [1, 2, 3], [1, 2, 4, 8], [1, 2, 4, 5, 6]]

class SetAndList:
    def __init__(self,aList):
        self.list=aList
        self.set=set(aList)
        self.isUnique=True
    def compare(self,other):
        if self.set.issubset(other.set):
            #print(self.list, 'superseded by', other.list)
            self.isUnique=False

def listReduce(lists):
    temp=[]
    for l in lists:
        s=SetAndList(l)
        for t in temp:
            t.compare(s)
            s.compare(t)
        temp.append( s )
        temp=[t for t in temp if t.isUnique]

    return [t.list for t in temp if t.isUnique]

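print(listReduce(original))

You did not provide the desired output, but I'm guessing this is right, because [1,2,3] does not appear in the output.

【Discussion】:

Re-reading the question (it may have changed since I last read it), my solution is still incorrect. I missed the requirement that [1,2] is a subset of the list [1,2,3] but not of the list [2,1,3].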

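【Solution 9】:

Thanks to everyone who proposed solutions and coped with my sometimes erroneous data sets. I took @hughdbrown's solution and modified it to do what I wanted:

The modification is to use a sliding window over the target to make sure the subset sequence is found. I think I should have used a more suitable word than "set" to describe my problem.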

def is_sublist_of_any_list(cand, lists):
    # Compare candidate to a single list
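    def is_sublist_of_list(cand, target):
        try:
            start = target.index(cand[0])
            # Slide a window of len(cand) along target, jumping between
            # occurrences of cand[0]; stop once the window no longer fits.
            while start + len(cand) <= len(target):
                if cand == target[start:start + len(cand)]:
                    return True
                start = target.index(cand[0], start + 1)
        except (ValueError, IndexError):
            # cand[0] does not occur (again) in target, or cand is empty
            return False
        return False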

    # See if candidate matches any other list
    return any(is_sublist_of_list(cand, target) for target in lists if len(cand) <= len(target))

# Compare candidates to all other lists
def super_lists(lists):
    a = [cand for i, cand in enumerate(lists) if not is_sublist_of_any_list(cand, lists[:i] + lists[i+1:])]
    return a

lists = [[2, 16, 17], [1, 2, 3, 4, 5, 6, 7], [1], [1, 2, 3, 4], [1, 2], [17, 18, 19, 22, 41, 48], [2, 3], [1, 2, 3], [50, 69], [1, 2, 3], [2, 3, 21], [1, 2, 3], [1, 2, 4, 8], [1, 2, 4, 5, 6]]
expect = [[2, 16, 17], [1, 2, 3, 4, 5, 6, 7], [17, 18, 19, 22, 41, 48], [50, 69],  [2, 3, 21], [1, 2, 4, 8], [1, 2, 4, 5, 6]]

def test():
    out = super_lists(list(lists))

    print("In  : ", lists)
    print("Out : ", out)

    assert (out == expect)

Result:

In  :  [[2, 16, 17], [1, 2, 3, 4, 5, 6, 7], [1], [1, 2, 3, 4], [1, 2], [17, 18, 19, 22, 41, 48], [2, 3], [1, 2, 3], [50, 69], [1, 2, 3], [2, 3, 21], [1, 2, 3], [1, 2, 4, 8], [1, 2, 4, 5, 6]]
Out :  [[2, 16, 17], [1, 2, 3, 4, 5, 6, 7], [17, 18, 19, 22, 41, 48], [50, 69], [2, 3, 21], [1, 2, 4, 8], [1, 2, 4, 5, 6]]

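【Discussion】:

Last try: my most recently submitted code is simpler.

【Solution 10】:

So what you really want is to know whether one list is a substring, so to speak, of another, with all the matching elements consecutive. Here is code that converts the candidate and target lists to comma-separated strings and does a substring comparison to see whether the candidate appears in the target list: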

def is_sublist_of_any_list(cand, lists):
    def comma_list(l):
        return "," + ",".join(str(x) for x in l) + ","
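    # Keep the string form in its own name: the length filter below must
    # compare list lengths, not a string length against a list length.
    cand_str = comma_list(cand)
    return any(cand_str in comma_list(target) for target in lists if len(cand) <= len(target))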


def super_lists(lists):
    return [cand for i, cand in enumerate(lists) if not is_sublist_of_any_list(cand, lists[:i] + lists[i+1:])]

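The function comma_list() puts leading and trailing commas on the list to ensure that the integers are fully delimited. Otherwise, for example, [1] would be a subset of [100].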
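
A small sketch of why the delimiters matter. comma_list() is restated here so the snippet runs on its own; it assumes the string form of each item contains no comma.

```python
def comma_list(items):
    """Join items with commas, plus a leading and trailing comma."""
    return "," + ",".join(str(x) for x in items) + ","


# Without delimiters, the digits of one number can match inside another number.
print("1" in "100")                                     # True  (false positive)

# With delimiters, each number has to match as a whole token.
print(comma_list([1]) in comma_list([100]))             # False
print(comma_list([2, 3]) in comma_list([1, 2, 3, 4]))   # True
```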

【Discussion】: