Cluster list of lists based on shared values [duplicate]: answers

【Question title】: Cluster list of lists based on shared values [duplicate]
【Posted】: 2018-11-05 08:46:15
【Question】:

I have a list of lists, where each sublist contains some integers:

o = [[1,2],[3,4],[2,3],[5,4]]

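I'd like to create a new list of lists in which any two sublists of o that share a common member are merged. The merging should continue until no two sublists share an element. Given o, we would merge [1,2] with [2,3] because they share a 2, then merge that group with [3,4] because [1,2,3] and [3,4] both contain a 3, and so on.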

The expected output of clustering o is [[1,2,3,4,5]]

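I have a hunch there is an approach to this task that is far superior to my current one (see below). Any suggestions on the most efficient way to do this (time first, then space) would be greatly appreciated.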

from collections import defaultdict

o = [[1,2],[3,4],[2,3],[5,4]]

def group_lists(list_of_lists):
  '''
  Given a list of lists, continue combining sublist
  elements that share an element until no two sublist
  items share an element.
  '''
  to_cluster = set(tuple(i) for i in list_of_lists)
  keep_clustering = True
  while keep_clustering:
    keep_clustering = False
    d = defaultdict(set)
    for i in to_cluster:
      for j in i:
        d[j].add(i)
    clustered = set()
    for i in d.values():
      # remove duplicate entries from each cluster
      flat = tuple(set([item for sublist in i for item in sublist]))
      clustered.add(flat)
    if not to_cluster == clustered:
      keep_clustering = True
      to_cluster = clustered
  # done clustering!
  return clustered

print(group_lists(o))

【Question discussion】:

Tags: python algorithm

【Solution 1】:

You can use recursion:

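def cluster(d, current = []):
  # sublists of d that share at least one element with the current cluster
  options = [i for i in d if any(c in current for c in i)]
  _flattened = [i for b in options for i in b]
  # keep only the sublists that were not absorbed into the current cluster
  d = list(filter(lambda x:x not in options, d))
  if not options or not d:
    # the current cluster cannot grow any further, so emit it
    yield current+_flattened
  if d and not options:
    # start a new cluster seeded with the next remaining sublist
    yield from cluster(d[1:], d[0])
  elif d:
    # the cluster grew; check the remaining sublists against it again
    yield from cluster(d, current+_flattened)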

for a, *b in [[[1,2],[6,4],[2,3],[5,4]], [[1,2],[3,4],[2,3],[5,4]], [[1,2],[3,4],[2,3],[5,4], [10, 11, 12], [13, 15], [4,6], [6, 8], [23,25]]]:
  print([list(set(i)) for i in cluster(b, a)])

Output:

[[1, 2, 3], [4, 5, 6]]
[[1, 2, 3, 4, 5]]
[[1, 2, 3, 4, 5, 6, 8], [10, 11, 12], [13, 15], [25, 23]]

【Discussion】:

This algorithm returns incorrect results. Try o = [[1,2],[6,4],[2,3],[5,4]]...
@duhaime Please see my recent edit.

【Solution 2】:

from collections import deque

o = [[1,2],[3,4],[2,3],[5,4], [10, 11, 12], [13, 15], [4,6], [6, 8], [23,25]]

o = sorted(o, key=lambda x:min(x))
queue = deque(o)

grouped = []
while len(queue) >= 2:
    l1 = queue.popleft()
    l2 = queue.popleft()
    s1 = set(l1)
    s2 = set(l2)

    if s1 & s2:
        queue.appendleft(s1 | s2)
    else:
        grouped.append(s1)
        queue.appendleft(s2)
#         print(set(l1).union(set(l2)))
if queue:
    grouped.append(queue.pop())

print(grouped)

Output:

[set([1, 2, 3, 4, 5, 6, 8]), set([10, 11, 12]), set([13, 15]), set([25, 23])]

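【Discussion】:

o = [[1, 4], [2, 3], [4, 5]] results in [{1, 4}, {2, 3}, {4, 5}]
@niemmi You're right. It fails on this input. I think a simple fix would be to also sort by the max value and see whether the result differs from my min-sorted one; if it does, use whichever has fewer items. In any case, the graph-based solutions in the thread you linked are much better.
@Asterisk Even with the change you mention, this technique doesn't maximally merge groups. I'll post some tests you can run in that thread later...
@duhaime I agree. Thanks for letting me know.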
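
The comments above point to graph-based solutions without showing one. As a rough illustration only (this is not code from the linked thread), the sketch below treats each sublist as a set of connected nodes and groups them with a union-find (disjoint-set) structure. The function name `cluster_union_find` and its helpers are hypothetical, and the sketch assumes the elements are hashable and mutually sortable, as the integers in the question are.

```python
def cluster_union_find(list_of_lists):
    """Merge sublists that share any element, using a disjoint-set forest."""
    parent = {}  # maps each element to its parent in the forest

    def find(x):
        # Walk up to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for sublist in list_of_lists:
        for item in sublist:
            parent.setdefault(item, item)  # register unseen elements
        # Link every element of the sublist to its first element.
        for item in sublist[1:]:
            union(sublist[0], item)

    # Collect elements by their root to form the final clusters.
    groups = {}
    for item in parent:
        groups.setdefault(find(item), []).append(item)
    return sorted(sorted(group) for group in groups.values())


print(cluster_union_find([[1, 4], [2, 3], [4, 5]]))          # [[1, 4, 5], [2, 3]]
print(cluster_union_find([[1, 2], [3, 4], [2, 3], [5, 4]]))  # [[1, 2, 3, 4, 5]]
```

The first call uses the input from the comments that breaks the pairwise queue approach. Because every element is linked to its group regardless of input order, no sorting of the sublists is needed beforehand.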