【Question title】: merge join two generators in python
【Posted】: 2011-02-16 23:05:11
【Question】:

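I want to merge two Kyoto Cabinet B-tree databases by key (using the Kyoto Cabinet Python API). The resulting list should contain every unique key (and its value) from either of the two input databases.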

The following code works, but I think it is ugly.
left_generator and right_generator are two cursor objects. It is especially odd that get() returns None once the generator is exhausted.

def merge_join_kv(left_generator, right_generator):
    stop = False
    while left_generator.get() or right_generator.get():
        try:
            comparison = cmp(right_generator.get_key(), left_generator.get_key())
            if comparison == 0:
                yield left_generator.get_key(), left_generator.get_value()
                left_generator.next()
                right_generator.next()
            elif (comparison < 0) or (not left_generator.get() or not right_generator.get()):
                yield right_generator.get_key(), right_generator.get_value()
                right_generator.next()
            else:
                yield left_generator.get_key(), left_generator.get_value()
                left_generator.next()
        except StopIteration:
            if stop:
                raise
            stop = True

More generally: is there a function or library that merges generators using cmp()?
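For reference, the merge join being asked about can be sketched on plain iterators, independent of Kyoto Cabinet. This is a minimal Python 3 sketch, assuming each input yields (key, value) pairs sorted by key with no repeated keys within one input; the name merge_join and the "left wins on equal keys" rule are choices made here for illustration, mirroring the behaviour of the code above.

```python
def merge_join(left, right):
    """Yield (key, value) pairs from two key-sorted iterables, one pair per key.

    When a key occurs on both sides, the pair from `left` is kept.
    """
    done = object()  # sentinel marking an exhausted input
    left, right = iter(left), iter(right)
    l, r = next(left, done), next(right, done)

    # Both inputs still have items: emit the smaller key and advance that side.
    while l is not done and r is not done:
        if l[0] == r[0]:
            yield l
            l, r = next(left, done), next(right, done)
        elif l[0] < r[0]:
            yield l
            l = next(left, done)
        else:
            yield r
            r = next(right, done)

    # At most one input still has items; drain it.
    while l is not done:
        yield l
        l = next(left, done)
    while r is not done:
        yield r
        r = next(right, done)
```

Using next(iterator, sentinel) avoids both the None checks and the StopIteration bookkeeping.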

【Question comments】:

    Tags: python generator


    【Solution 1】:

    I think this is what you need. orderedMerge is based on Gnibbler's code, but adds a custom key function and a unique parameter:

    import kyotocabinet
    import collections
    import heapq
    
    class IterableCursor(kyotocabinet.Cursor, collections.Iterator):
        def __init__(self, *args, **kwargs):
            kyotocabinet.Cursor.__init__(self, *args, **kwargs)
            collections.Iterator.__init__(self)
    
        def next(self):
            "Return (key,value) pair"
            res = self.get(True)
            if res is None:
                raise StopIteration
            else:
                return res
    
    def orderedMerge(*iterables, **kwargs):
        """Take a list of ordered iterables; return as a single ordered generator.
    
        @param key:     function, for each item return key value
                        (Hint: to sort descending, return negated key value)
    
        @param unique:  boolean, return only first occurrence for each key value?
        """
        key     = kwargs.get('key', (lambda x: x))
        unique  = kwargs.get('unique', False)
    
        _heapify       = heapq.heapify
        _heapreplace   = heapq.heapreplace
        _heappop       = heapq.heappop
        _StopIteration = StopIteration
    
        # preprocess iterators as heapqueue
        h = []
        for itnum, it in enumerate(map(iter, iterables)):
            try:
                next  = it.next
                data   = next()
                keyval = key(data)
                h.append([keyval, itnum, data, next])
            except _StopIteration:
                pass
        _heapify(h)
    
        # process iterators in ascending key order
        oldkeyval = None
        while True:
            try:
                while True:
                    keyval, itnum, data, next = s = h[0]  # get smallest-key value
                                                          # raises IndexError when h is empty
                    # if unique, skip duplicate keys
                    if unique and keyval==oldkeyval:
                        pass
                    else:
                        yield data
                        oldkeyval = keyval
    
                    # load replacement value from same iterator
                    s[2] = data = next()        # raises StopIteration when exhausted
                    s[0] = key(data)
                    _heapreplace(h, s)          # restore heap condition
            except _StopIteration:
                _heappop(h)                     # remove empty iterator
            except IndexError:
                return    
    

    Your function can then be written as

    from operator import itemgetter
    
    def merge_join_kv(leftGen, rightGen):
        # assuming that kyotocabinet.Cursor has a copy initializer
        leftIter = IterableCursor(leftGen)
        rightIter = IterableCursor(rightGen)
    
        return orderedMerge(leftIter, rightIter, key=itemgetter(0), unique=True)
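
A generator function is a lighter alternative to subclassing the cursor. The sketch below assumes, as the answer does, that cursor.get(True) returns the current (key, value) pair and steps forward, returning None at the end, and that the cursor is already positioned at the first record. FakeCursor is a hypothetical stand-in used only so the example runs without Kyoto Cabinet installed.

```python
def iter_cursor(cursor):
    """Yield (key, value) pairs until cursor.get(True) returns None."""
    while True:
        record = cursor.get(True)  # assumed: return current record, then step
        if record is None:
            return
        yield record


class FakeCursor:
    """Hypothetical stand-in for a database cursor, for demonstration only."""

    def __init__(self, records):
        self._records = list(records)
        self._pos = 0

    def get(self, step=False):
        # Return the current record, or None once past the last one.
        if self._pos >= len(self._records):
            return None
        record = self._records[self._pos]
        if step:
            self._pos += 1
        return record


pairs = list(iter_cursor(FakeCursor([("a", "1"), ("b", "2")])))
```

The same effect can be had with the two-argument form of the built-in, iter(lambda: cursor.get(True), None).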
    

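    【Comments】:

    • Looks good, thanks. I don't think the cursor wrapper is necessary: the cursor raises StopIteration when it reaches the end (but it still yields Nones after that).
    • @yawniek: Quite possibly. I don't have kyotocabinet installed; for testing I relied on a mock class based on the documentation. The documentation does not mention Cursor.get() raising StopIteration, so I assumed it does not; also, Cursor.__next__() is documented as returning only the key.
    • heapq.merge() accepts a key function starting with Python 3.5: docs.python.org/dev/library/heapq.html#heapq.merge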
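
Following up on the last comment, on Python 3.5 or later the standard library covers most of this. A minimal sketch, assuming the inputs are (key, value) pairs sorted by key: heapq.merge with key= does the ordered merge, and itertools.groupby keeps only the first pair per key. The name merge_unique is made up for this example.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def merge_unique(*iterables, key=itemgetter(0)):
    """Merge key-sorted iterables lazily, keeping the first item for each key."""
    merged = heapq.merge(*iterables, key=key)
    for _, group in groupby(merged, key=key):
        # Items with equal keys arrive in argument order, so the
        # earliest iterable wins.
        yield next(group)
```

Both heapq.merge and groupby are lazy, so the databases are never loaded into memory all at once.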
    【Solution 2】:

    Python 2.6 has a merge in heapq, but it does not support a user-defined cmp/key function.

    from heapq import heapify, heappop, heapreplace

    def merge(*iterables):
        '''Merge multiple sorted inputs into a single sorted output.
    
        Similar to sorted(itertools.chain(*iterables)) but returns a generator,
        does not pull the data into memory all at once, and assumes that each of
        the input streams is already sorted (smallest to largest).
    
        >>> list(merge([1,3,5,7], [0,2,4,8], [5,10,15,20], [], [25]))
        [0, 1, 2, 3, 4, 5, 5, 7, 8, 10, 15, 20, 25]
    
        '''
        _heappop, _heapreplace, _StopIteration = heappop, heapreplace, StopIteration
    
        h = []
        h_append = h.append
        for itnum, it in enumerate(map(iter, iterables)):
            try:
                next = it.next
                h_append([next(), itnum, next])
            except _StopIteration:
                pass
        heapify(h)
    
        while 1:
            try:
                while 1:
                    v, itnum, next = s = h[0]   # raises IndexError when h is empty
                    yield v
                    s[0] = next()               # raises StopIteration when exhausted
                    _heapreplace(h, s)          # restore heap condition
            except _StopIteration:
                _heappop(h)                     # remove empty iterator
            except IndexError:
                return
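
A merge without key support can still be driven by a key function through decoration: wrap each item in a tuple that sorts by the key, merge the tuples, then unwrap. The sketch below is written for Python 3 and uses heapq.merge without its key argument to stand in for the function above; merge_by_key and decorate are names invented for this example.

```python
import heapq


def merge_by_key(key, *iterables):
    """Merge iterables that are each sorted by key(item), without a key= argument."""

    def decorate(index, iterable):
        # (key, source index, position) is unique per item, so the
        # item itself is never compared.
        for position, item in enumerate(iterable):
            yield key(item), index, position, item

    decorated = [decorate(i, it) for i, it in enumerate(iterables)]
    for *_, item in heapq.merge(*decorated):
        yield item
```

For example, merge_by_key(str.lower, ['a', 'C'], ['B']) yields 'a', 'B', 'C', whereas a plain merge would compare the strings case-sensitively.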
    

    【Comments】:
