Answers: Elegant and optimized solution to a function that groups sublists under a threshold length in Python

【Question title】: Elegant and optimized solution to a function that groups sublists under a threshold length in Python
【Posted】: 2020-03-18 19:49:23
【Question description】:

Given a list L such as [[1,1,1,1], [1,1,1], [1,1], [1]] and max_len=8, I want to create a new list LN like this: [[[1, 1, 1, 1], [1, 1, 1]], [[1, 1], [1]]].

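So I have a list of lists. I want to group consecutive sublists so that the sum of the lengths of the sublists in each group is at most max_len.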

I have been trying to do this in the most Pythonic and efficient way. It should be O(n). With someone's help, this is the code I have so far:

def chunks(list_to_chunck, max_len):
    if any(len(sub_list) > max_len for sub_list in list_to_chunck):
        return None

    new_list = []
    while list_to_chunck:
        copy_list = [list_to_chunck.pop(0)]
        while list_to_chunck:
            if len(list_to_chunck[0]) + sum(len(sub_list) for sub_list in copy_list) <= max_len:
                copy_list.append(list_to_chunck.pop(0))
            else:
                break
        new_list.append(copy_list)

    return new_list

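【Question comments】:

Please state clearly what the problem with your code is. See minimal, reproducible example.
lst.pop(0) will kill your efficiency. It also mutates the list passed as an argument, which is an anti-pattern in almost all cases (a small private helper function that does this for efficiency and is not exposed as part of the public API may be an acceptable case).
@Prune I edited the code to make it more readable.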
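
To illustrate the comment about lst.pop(0): each pop(0) on a list shifts all remaining elements, so the loop in the question can become quadratic, and it also empties the caller's list. Below is a minimal sketch of one way around both issues, assuming a collections.deque copy of the input is acceptable. The name chunks_deque and the running group_len counter are illustrative and not from the original post.

```python
from collections import deque


def chunks_deque(list_to_chunk, max_len):
    # Same contract as the question's code: give up if any sublist is too long.
    if any(len(sub_list) > max_len for sub_list in list_to_chunk):
        return None

    # Shallow copy into a deque: popleft() is O(1) and the caller's list
    # is left untouched.
    pending = deque(list_to_chunk)
    new_list = []
    while pending:
        group = [pending.popleft()]
        group_len = len(group[0])  # running total instead of re-summing
        while pending and group_len + len(pending[0]) <= max_len:
            group_len += len(pending[0])
            group.append(pending.popleft())
        new_list.append(group)
    return new_list


data = [[1, 1, 1, 1], [1, 1, 1], [1, 1], [1]]
print(chunks_deque(data, 8))  # [[[1, 1, 1, 1], [1, 1, 1]], [[1, 1], [1]]]
print(len(data))              # 4, the input list is not consumed
```

Keeping a running total also avoids recomputing the sum of the current group on every inner iteration.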

Tags: python list algorithm

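【Solution 1】:

You can use a variable size to track the current size of the last sublist in the output list, and append a new sublist to the output whenever adding the current sublist in an iteration would push it past max_len. Initialize size with a value greater than max_len so that a new sublist is always added on the first iteration. With this approach the time complexity is O(n):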

def chunks(lst, max_len):
    output = []
    size = max_len + 1
    for s in lst:
        size += len(s)
        if size > max_len:
            output.append([])
            size = len(s)
            if size > max_len:
                return
        output[-1].append(s)
    return output

This way chunks([[1, 1, 1, 1], [1, 1, 1], [1, 1], [1]], 8) returns:

[[[1, 1, 1, 1], [1, 1, 1]], [[1, 1], [1]]]

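【Comments】:

But it fails if one of the sublists is larger than max_len, e.g. chunks([[1, 1, 1, 1], [1, 1, 1], [1, 1], [1]], 3) gives [[[1, 1, 1, 1]], [[1, 1, 1]], [[1, 1], [1]]] instead of None or similar.
@norok2 Yes, but as in my code above, I make sure to check whether any sub_list is larger than max_len: if any(len(sub_list) > max_len for sub_list in list_to_chunck): return None
I see. Indeed, I had not noticed that part of the behavior of the OP's code. I have updated the answer accordingly. Thanks.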
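
The oversize case discussed in these comments can also be reported with an exception rather than with None. The following is a hedged sketch of that variant of the answer's loop. The name chunks_strict and the choice of ValueError are assumptions made for illustration and are not part of the original answer.

```python
def chunks_strict(lst, max_len):
    output = []
    # Start above the limit so the first sublist always opens a new group.
    size = max_len + 1
    for s in lst:
        if len(s) > max_len:
            # A single sublist can never fit into any group.
            raise ValueError(
                f"sublist of length {len(s)} exceeds max_len={max_len}"
            )
        size += len(s)
        if size > max_len:
            output.append([])
            size = len(s)
        output[-1].append(s)
    return output


data = [[1, 1, 1, 1], [1, 1, 1], [1, 1], [1]]
print(chunks_strict(data, 8))  # [[[1, 1, 1, 1], [1, 1, 1]], [[1, 1], [1]]]
try:
    chunks_strict(data, 3)
except ValueError as exc:
    print("rejected:", exc)
```

Raising makes the failure hard to ignore, whereas a None return has to be checked by every caller. Which of the two is preferable depends on how the function is used.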

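【Solution 2】:

Here is a sketch of an O(N) approach. It creates a new list and does not modify the original. It doesn't handle all edge cases, but it should get you started: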

In [1]: data = [[1,1,1,1], [1,1,1], [1,1], [1]]
   ...:

In [2]: def chunks(nested, maxlen):
   ...:     total = 0
   ...:     result = []
   ...:     piece = []
   ...:     for sub in nested:
   ...:         length = len(sub)
   ...:         if total + length > maxlen:
   ...:             result.append(piece)
   ...:             piece = [sub]
   ...:             total = length
   ...:         else:
   ...:             piece.append(sub)
   ...:             total += length
   ...:     if piece:
   ...:         result.append(piece)
   ...:     return result
   ...:

In [3]: chunks(data, 8)
Out[3]: [[[1, 1, 1, 1], [1, 1, 1]], [[1, 1], [1]]]

【Comments】:

Thank you too, but I prefer the other answer, which uses fewer lines of code. Thanks!

【Solution 3】:

You don't seem to have any constraint on the total number of lists, so a greedy approach should be fine:

def chunks(items, max_len):
    ret = [[]]
    remaining = max_len
    for i in items:
        if len(i) > remaining:
            ret.append([])
            remaining = max_len
            if len(i) > remaining:
                return None  # Could raise on impossible
        ret[-1].append(i)
        remaining -= len(i)
    return ret

Using your example:

items = [[1,1,1,1], [1,1,1], [1,1], [1]]
assert chunks(items, 8) == [[[1, 1, 1, 1], [1, 1, 1]], [[1, 1], [1]]]
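
Beyond a single assert, a grouping can be checked against the two properties the question asks for. The helper below is a hypothetical sketch (is_valid_grouping is not from the original post). It verifies that concatenating the groups gives back the input sublists in order and that no group's total length exceeds max_len. It does not check that the grouping is the greedy one.

```python
def is_valid_grouping(items, groups, max_len):
    # Flattening one level must reproduce the input sublists in order.
    flattened = [sub for group in groups for sub in group]
    if flattened != list(items):
        return False
    # Every group must respect the length budget.
    return all(
        sum(len(sub) for sub in group) <= max_len for group in groups
    )


items = [[1, 1, 1, 1], [1, 1, 1], [1, 1], [1]]
good = [[[1, 1, 1, 1], [1, 1, 1]], [[1, 1], [1]]]
bad = [[[1, 1, 1, 1], [1, 1, 1], [1, 1]], [[1]]]  # first group totals 9

print(is_valid_grouping(items, good, 8))  # True
print(is_valid_grouping(items, bad, 8))   # False
```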

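Seeing the other answers, this is pretty much the consensus, so I figured I'd toss in a less readable option, without the length guarantee =)

import itertools

def chunks(items, max_len):
    count = [0, 0]  # [current group index, running length of the current group]
    def group(item):
        count[1] += len(item)
        if count[1] > max_len:
            # item does not fit: start a new group that already contains it
            count[0] += 1
            count[1] = len(item)
        return count[0]
    return [list(v) for k, v in itertools.groupby(items, key=group)]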

【Solution 4】:

A generator-based solution:

def group_subseqs(seq, max_len):
    curr_size = 0
    result = []
    for subseq in seq:
        len_subseq = len(subseq)
        if curr_size + len_subseq <= max_len:
            result.append(subseq)
            curr_size += len_subseq
        else:
            if result:
                yield result
                if len_subseq <= max_len:
                    result = [subseq]
                    curr_size = len_subseq
                else:
                    return
            else:
                return
    if result:
        yield result

Works as expected (sort of... if a sublist is larger than max_len, it stops yielding rather than yielding nothing at all):

a = [[1,1,1,1], [1,1,1], [1,1], [1]]
print(list(group_subseqs(a, 8)))
# [[[1, 1, 1, 1], [1, 1, 1]], [[1, 1], [1]]]

print(list(group_subseqs(a, 4)))
# [[[1, 1, 1, 1]], [[1, 1, 1]], [[1, 1], [1]]]

print(list(group_subseqs(a, 3)))
# []

print(list(group_subseqs(a[::-1], 3)))
# [[[1], [1, 1]], [[1, 1, 1]]]
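
If the all-or-nothing behavior of the question's code is wanted on top of a generator like this, one option is to detect the early stop after the fact. The sketch below makes two assumptions: group_subseqs is restated in condensed form so the block runs on its own (it is meant to yield the same groups as the generator above), and group_or_none is a hypothetical wrapper name.

```python
def group_subseqs(seq, max_len):
    # Condensed restatement of the generator above.
    curr_size = 0
    result = []
    for subseq in seq:
        if len(subseq) > max_len:
            break  # oversized sublist: stop, keeping what was grouped so far
        if curr_size + len(subseq) > max_len:
            yield result
            result, curr_size = [], 0
        result.append(subseq)
        curr_size += len(subseq)
    if result:
        yield result


def group_or_none(seq, max_len):
    seq = list(seq)
    groups = list(group_subseqs(seq, max_len))
    # If the generator stopped early, fewer sublists were grouped than given.
    grouped = sum(len(group) for group in groups)
    return groups if grouped == len(seq) else None


a = [[1, 1, 1, 1], [1, 1, 1], [1, 1], [1]]
print(group_or_none(a, 8))        # [[[1, 1, 1, 1], [1, 1, 1]], [[1, 1], [1]]]
print(group_or_none(a[::-1], 3))  # None
```

The wrapper materializes the whole result, so it gives up the laziness of the generator in exchange for the None-on-failure contract.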
