链接 Python 生成器以遍历嵌套数据结构；最佳实践答案

【问题标题】：Chaining of Python generators to traverse a nested data structure; best practices链接 Python 生成器以遍历嵌套数据结构；最佳实践
【发布时间】：2015-04-26 14:43:03
【问题描述】：

假设我有一个要遍历的嵌套数据结构。此数据结构包含节点，这些节点又可以通过node.get_children_generator() 提供他们的孩子。当然，这些孩子也是node 类型，并且以惰性方式进行评估，即由生成器枚举。为简单起见，让我们假设node 没有孩子，函数get_children_generator 只返回一个空列表/生成器（因此我们不必手动检查它是否为空）。

为了遍历嵌套节点的这种数据结构，简单地迭代链接所有生成器是不是一个好主意？那是在创造链条链条等等？还是会产生过多的开销？

我的想法是这样的：

import itertools as it

def traverse_nodes(start_node):
    """Traverses nodes in breadth first manner.
    
    First returns the start node.
    
    For simplicity we require that 
    there are no cycles in the data structure,
    i.e. we are dealing with a simple tree.
    
    """
    node_queue = iter([start_node])
    while True:
        try:
            next_node = node_queue.next()
            yield next_node
            
            # Next get the children
            child_gen = next_node.get_children_generator()
            
            # The next code line is the one I am worried about
            # is it a good idea to make a chain of chains?
            node_queue = it.chain(node_queue, child_gen)
        except StopIteration:
            # There are no more nodes
            break

node_queue = it.chain(node_queue, child_gen) 行是遍历的好方法吗？制作一个链条链条等是个好主意吗？

这样你就有了可以执行的东西，这是一个相当愚蠢的 node 类。生成器有点没用，但假设在现实世界的示例中评估孩子的成本有点高，并且确实需要生成器。

class Node(object):
    """Rather silly example of a nested node.

    The children are actually stored in a list,
    so the generator is actually not needed.
    
    But simply assume that returning a children
    requires a lazy evaluation.
    
    """
    counter = 0 # Counter for node identification
    
    def __init__(self):
        self.children = [] # children list
        self.node_number = Node.counter # identifies the node
        Node.counter += 1
    
    def __repr__(self):
        return 'I am node #%d' % self.node_number
    
    def get_children_generator(self):
        """Returns a generator over children"""
        return (x for x in self.children)

所以下面的代码sn -p

node0 = Node()
node1 = Node()
node2 = Node()
node3 = Node()
node4 = Node()
node5 = Node()
node6 = Node()
node0.children = [node1, node2]
node1.children = [node6]
node2.children = [node3, node5]
node3.children = [node4]

for node in traverse_nodes(node0):
    print(node)

打印

我是节点#0

我是节点 #1

我是节点 #2

我是节点 #6

我是节点 #3

我是节点#5

我是节点 #4

【问题讨论】：

为什么不简单的迭代器堆栈/队列？这样你就可以像 DFS/BFS 一样遍历你的结构。
所以我有两个嵌套循环？一个在堆栈/队列上，另一个在我从中弹出的生成器上？这比上面的方法更好吗？为什么？
它是否真的比你的方法更好是一个品味问题，但可以肯定的是它不会受到你的情况下发生的嵌套调用的影响——你需要遍历所有的链来获得队列的第一个元素。考虑到一个节点有 n 个邻居，那么获取最后一个邻居将花费您通过迭代器的 next() 方法的 n 个递归调用。
老实说，我不完全明白你的意思。你有没有想过通过next 获取node_queue 的第一个元素需要花费O(n) 而不是O(1)？那么它就不再是品味的问题了。除非：>>我喜欢我的排序平方
查看我的答案以获得一些解释。

标签： python tree generator itertools

【解决方案1】：

链接多个chains 会导致递归函数调用开销与链接在一起的chains 的数量成正比。

首先，我们的纯 python chain 实现，这样我们就不会丢失堆栈信息。 C 实现是here，你可以看到它基本上做同样的事情——在底层迭代上调用next()。

from inspect import stack


def chain(it1, it2):
    for collection in [it1, it2]:
        try:
            for el in collection:
                yield el

        except StopIteration:
            pass

我们只关心 chain 的 2-iterables 版本。我们首先使用第一个可迭代对象，然后是另一个。

class VerboseListIterator(object):
    def __init__(self, collection, node):
        self.collection = collection
        self.node = node
        self.idx = 0

    def __iter__(self):
        return self

    def __next__(self):
        print('Printing {}th child of "{}". Stack size: {}'.format(self.idx, self.node, len(stack())))
        if self.idx >= len(self.collection):
            raise StopIteration()

        self.idx += 1
        return self.collection[self.idx - 1]

这是我们方便的列表迭代器，它将告诉我们返回包装列表的下一个元素时有多少堆栈帧。

class Node(object):
    """Rather silly example of a nested node.

    The children are actually stored in a list,
    so the generator is actually not needed.

    But simply assume that returning a children
    requires a lazy evaluation.

    """
    counter = 0 # Counter for node identification

    def __init__(self):
        self.children = [] # children list
        self.node_number = Node.counter # identifies the node
        Node.counter += 1

    def __repr__(self):
        return 'I am node #%d' % self.node_number

    def get_children_generator(self):
        """Returns a generator over children"""
        return VerboseListIterator(self.children, self)


def traverse_nodes(start_node):
    """Traverses nodes in breadth first manner.

    First returns the start node.

    For simplicity we require that
    there are no cycles in the data structure,
    i.e. we are dealing with a simple tree.

    """
    node_queue = iter([start_node])
    while True:
        try:
            next_node = next(node_queue)
            yield next_node

            # Next get the children
            child_gen = next_node.get_children_generator()

            # The next code line is the one I am worried about
            # is it a good idea to make a chain of chains?
            node_queue = chain(node_queue, child_gen)
        except StopIteration:
            # There are no more nodes
            break

这些是您使用的 Python 版本 (3.4) 的实现。

nodes = [Node() for _ in range(10)]
nodes[0].children = nodes[1:6]
nodes[1].children = [nodes[6]]
nodes[2].children = [nodes[7]]
nodes[3].children = [nodes[8]]
nodes[4].children = [nodes[9]]

节点的图形初始化。根连接到前 5 个节点，这些节点依次连接到 i + 5th 节点。

for node in traverse_nodes(nodes[0]):
    print(node)

本次交互的结果如下：

I am node #0
Printing 0th child of "I am node #0". Stack size: 4
I am node #1
Printing 1th child of "I am node #0". Stack size: 5
I am node #2
Printing 2th child of "I am node #0". Stack size: 6
I am node #3
Printing 3th child of "I am node #0". Stack size: 7
I am node #4
Printing 4th child of "I am node #0". Stack size: 8
I am node #5
Printing 5th child of "I am node #0". Stack size: 9
Printing 0th child of "I am node #1". Stack size: 8
I am node #6
Printing 1th child of "I am node #1". Stack size: 9
Printing 0th child of "I am node #2". Stack size: 8
I am node #7
Printing 1th child of "I am node #2". Stack size: 9
Printing 0th child of "I am node #3". Stack size: 8
I am node #8
Printing 1th child of "I am node #3". Stack size: 9
Printing 0th child of "I am node #4". Stack size: 8
I am node #9
Printing 1th child of "I am node #4". Stack size: 9
Printing 0th child of "I am node #5". Stack size: 8
Printing 0th child of "I am node #6". Stack size: 7
Printing 0th child of "I am node #7". Stack size: 6
Printing 0th child of "I am node #8". Stack size: 5
Printing 0th child of "I am node #9". Stack size: 4

如您所见，我们越接近node0 的子列表末尾，堆栈就越大。这是为什么？让我们仔细看看每个步骤 - 枚举每个 chain 调用以进行澄清：

node_queue = [node0]
调用next(node_queue)，产生node0。 node_queue = chain1([node0], [node1, node2, node3, node4, node5])。
呼叫next(node_queue)。列表[node0] 被消费，第二个列表正在被消费。 node1 得到屈服，node_queue = chain2(chain1([node0], [node1, ...]), [node6])。
调用next(node_queue) 向下传播到chain1（来自chain2），并产生node2。 node_queue = chain3(chain2(chain1([node0], [...]), [node6]), [node7])。

这种模式一直持续到我们即将产出node5：

next(chain5(chain4, [node9]))
              |
              V
    next(chain4(chain3, [node8]))
                  |
                  V
        next(chain3(chain2, [node7]))
                      |
                      V
            next(chain2(chain1, [node6]))
                          |
                          V
                next(chain1([node0], [node1, node2, node3, node4, node5]))
                                                               ^
                                                             yield

比简单的 BFS/DFS 更快吗？

而不是。单个next(node_queue) 调用实际上会导致大量递归调用，这与每个 BFS 中常规迭代器队列的大小成正比，或者简单地说 - 图中节点的最大子节点数量。

tl;博士

这是一个展示算法的 gif：http://i.imgur.com/hnPIVG4.gif

【讨论】：