【Question Title】: How can memoization be applied to this algorithm?
【Posted】: 2011-03-14 07:59:55
【Question Description】:

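After finding the difflib.SequenceMatcher class in Python's standard library unsuitable for my needs, I wrote a generic "diff" module to cover the problem space. Having had several months to think more about what it is doing, I believe the recursive algorithm searches more than it needs to: it re-searches regions of a sequence that a separate "search thread" may already have examined.

The purpose of the diff module is to compute the differences and similarities between a pair of sequences (list, tuple, string, bytes, bytearray, and so on). The initial version was much slower; the current form of the code is about ten times faster. How can memoization be applied to the following code? What is the best way to rewrite the algorithm to gain any further speed?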


class Slice:

    __slots__ = 'prefix', 'root', 'suffix'

    def __init__(self, prefix, root, suffix):
        self.prefix = prefix
        self.root = root
        self.suffix = suffix

################################################################################

class Match:

    __slots__ = 'a', 'b', 'prefix', 'suffix', 'value'

    def __init__(self, a, b, prefix, suffix, value):
        self.a = a
        self.b = b
        self.prefix = prefix
        self.suffix = suffix
        self.value = value

################################################################################

class Tree:

    __slots__ = 'nodes', 'index', 'value'

    def __init__(self, nodes, index, value):
        self.nodes = nodes
        self.index = index
        self.value = value

################################################################################

def search(a, b):
    # Initialize startup variables.
    nodes, index = [], []
    a_size, b_size = len(a), len(b)
    # Begin to slice the sequences.
    for size in range(min(a_size, b_size), 0, -1):
        for a_addr in range(a_size - size + 1):
            # Slice "a" at address and end.
            a_term = a_addr + size
            a_root = a[a_addr:a_term]
            for b_addr in range(b_size - size + 1):
                # Slice "b" at address and end.
                b_term = b_addr + size
                b_root = b[b_addr:b_term]
                # Find out if slices are equal.
                if a_root == b_root:
                    # Create prefix tree to search.
                    a_pref, b_pref = a[:a_addr], b[:b_addr]
                    p_tree = search(a_pref, b_pref)
                    # Create suffix tree to search.
                    a_suff, b_suff = a[a_term:], b[b_term:]
                    s_tree = search(a_suff, b_suff)
                    # Make completed slice objects.
                    a_slic = Slice(a_pref, a_root, a_suff)
                    b_slic = Slice(b_pref, b_root, b_suff)
                    # Finish the match calculation.
                    value = size + p_tree.value + s_tree.value
                    match = Match(a_slic, b_slic, p_tree, s_tree, value)
                    # Append results to tree lists.
                    nodes.append(match)
                    index.append(value)
        # Return largest matches found.
        if nodes:
            return Tree(nodes, index, max(index))
    # Give caller null tree object.
    return Tree(nodes, index, 0)
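
The repeated work described above can be made visible by counting how often each pair of arguments is visited. The sketch below is a value-only simplification of search (it returns just the best score, not a Tree). The names search_value, calls, and repetition_report are illustrative and do not appear in the original module, and the sketch assumes hashable sequences (str, bytes, tuple) so that they can serve as Counter keys.

```python
from collections import Counter

# Tally of how many times each (a, b) argument pair is searched.
calls = Counter()


def search_value(a, b):
    """Value-only simplification of search(): return the best match score."""
    calls[(a, b)] += 1
    for size in range(min(len(a), len(b)), 0, -1):
        best = None
        for a_addr in range(len(a) - size + 1):
            a_term = a_addr + size
            for b_addr in range(len(b) - size + 1):
                b_term = b_addr + size
                if a[a_addr:a_term] == b[b_addr:b_term]:
                    # Recurse on the prefixes and suffixes, as search() does.
                    value = (size
                             + search_value(a[:a_addr], b[:b_addr])
                             + search_value(a[a_term:], b[b_term:]))
                    best = value if best is None else max(best, value)
        # Like search(), stop at the largest size that produced a match.
        if best is not None:
            return best
    return 0


def repetition_report(a, b):
    """Return (total calls, distinct argument pairs) for one search."""
    calls.clear()
    search_value(a, b)
    return sum(calls.values()), len(calls)
```

Whenever the first number returned by repetition_report is larger than the second, some argument pair was searched more than once, and that is exactly the work a cache keyed on (a, b) would save.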

Reference: How to optimize a recursive algorithm to not repeat itself?

【Question Comments】:

    Tags: python optimization recursion diff memoization


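    【Solution 1】:

    It has been more than nine years since this question was asked, but the idea of caching results internally to speed up the algorithm was finally applied to the code today. The result of that change is shown below: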

    #! /usr/bin/env python3
    """Compute differences and similarities between a pair of sequences.
    
    After finding the "difflib.SequenceMatcher" class unsuitable, this module
    was written and re-written several times into the polished version below."""
    
    __author__ = 'Stephen "Zero" Chappell <Noctis.Skytower@gmail.com>'
    __date__ = '3 September 2019'
    __version__ = '$Revision: 4 $'
    
    
    class Slice:
        __slots__ = 'prefix', 'root', 'suffix'
    
        def __init__(self, prefix, root, suffix):
            self.prefix = prefix
            self.root = root
            self.suffix = suffix
    
    
    class Match:
        __slots__ = 'a', 'b', 'prefix', 'suffix', 'value'
    
        def __init__(self, a, b, prefix, suffix, value):
            self.a = a
            self.b = b
            self.prefix = prefix
            self.suffix = suffix
            self.value = value
    
    
    class Tree:
        __slots__ = 'nodes', 'index', 'value'
    
        def __init__(self, nodes, index, value):
            self.nodes = nodes
            self.index = index
            self.value = value
    
    
    def search(a, b):
        return _search(a, b, {})
    
    
    def _search(a, b, memo):
        # Initialize startup variables.
        nodes, index = [], []
        a_size, b_size = len(a), len(b)
        # Begin to slice the sequences.
        for size in range(min(a_size, b_size), 0, -1):
            for a_addr in range(a_size - size + 1):
                # Slice "a" at address and end.
                a_term = a_addr + size
                a_root = a[a_addr:a_term]
                for b_addr in range(b_size - size + 1):
                    # Slice "b" at address and end.
                    b_term = b_addr + size
                    b_root = b[b_addr:b_term]
                    # Find out if slices are equal.
                    if a_root == b_root:
                        # Create prefix tree to search.
                        key = a_prefix, b_prefix = a[:a_addr], b[:b_addr]
                        if key not in memo:
                            memo[key] = _search(a_prefix, b_prefix, memo)
                        p_tree = memo[key]
                        # Create suffix tree to search.
                        key = a_suffix, b_suffix = a[a_term:], b[b_term:]
                        if key not in memo:
                            memo[key] = _search(a_suffix, b_suffix, memo)
                        s_tree = memo[key]
                        # Make completed slice objects.
                        a_slice = Slice(a_prefix, a_root, a_suffix)
                        b_slice = Slice(b_prefix, b_root, b_suffix)
                        # Finish the match calculation.
                        value = size + p_tree.value + s_tree.value
                        match = Match(a_slice, b_slice, p_tree, s_tree, value)
                        # Append results to tree lists.
                        nodes.append(match)
                        index.append(value)
            # Return largest matches found.
            if nodes:
                return Tree(nodes, index, max(index))
        # Give caller null tree object.
        return Tree(nodes, index, 0)
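
One caveat with the memo dictionary above: its keys are pairs of slices, so the slices must be hashable. Strings, bytes, and tuples qualify, while slices of lists and bytearrays raise TypeError when used as dictionary keys. A minimal sketch of one workaround, assuming flat sequences whose items are themselves hashable, is to normalize the inputs before calling search. The helper as_hashable is hypothetical and not part of the module above.

```python
def as_hashable(seq):
    """Return seq unchanged if it is hashable, else an immutable tuple copy.

    Assumption: seq is a flat sequence whose items are hashable.
    """
    try:
        hash(seq)
    except TypeError:
        # Lists and bytearrays land here; a tuple copy can be a dict key.
        return tuple(seq)
    return seq
```

Calling search(as_hashable(a), as_hashable(b)) would then let list or bytearray inputs use the cache, at the cost that the roots stored in the resulting Slice objects are tuples rather than the original type.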
    

    【Comments】:

      【Solution 2】:

      As ~unutbu said, try the memoized decorator together with the following changes:

      @memoized
      def search(a, b):
          # Initialize startup variables.
          nodes, index = [], []
          a_size, b_size = len(a), len(b)
          # Begin to slice the sequences.
          for size in range(min(a_size, b_size), 0, -1):
              for a_addr in range(a_size - size + 1):
                  # Slice "a" at address and end.
                  a_term = a_addr + size
                  a_root = list(a)[a_addr:a_term] #change to list
                  for b_addr in range(b_size - size + 1):
                      # Slice "b" at address and end.
                      b_term = b_addr + size
                      b_root = list(b)[b_addr:b_term] #change to list
                      # Find out if slices are equal.
                      if a_root == b_root:
                          # Create prefix tree to search.
                          a_pref, b_pref = list(a)[:a_addr], list(b)[:b_addr]
                          p_tree = search(a_pref, b_pref)
                          # Create suffix tree to search.
                          a_suff, b_suff = list(a)[a_term:], list(b)[b_term:]
                          s_tree = search(a_suff, b_suff)
                          # Make completed slice objects.
                          a_slic = Slice(a_pref, a_root, a_suff)
                          b_slic = Slice(b_pref, b_root, b_suff)
                          # Finish the match calculation.
                          value = size + p_tree.value + s_tree.value
                          match = Match(a_slic, b_slic, p_tree, s_tree, value)
                          # Append results to tree lists.
                          nodes.append(match)
                          index.append(value)
              # Return largest matches found.
              if nodes:
                  return Tree(nodes, index, max(index))
          # Give caller null tree object.
          return Tree(nodes, index, 0)
      

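      For memoization, dictionaries are best, but they cannot be sliced, so they have to be changed to lists as indicated in the comments above.

      【Comments】:

      • Thanks for your help! As stated above, the code was written for sequences [that support slicing] (list, tuple, string, bytes, bytearray, and any custom-made containers). Judging by your changes, proper memoization appears to require immutable sequences.
      【Solution 3】:

      You could use the memoized decorator from the Python Decorator Library like this: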

      @memoized
      def search(a, b):
      

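      The first time search is called with arguments a, b, the result is computed and memoized (saved in a cache). The second time search is called with the same arguments, the result is returned from the cache.

      Note that for the memoized decorator to work, the arguments must be hashable. If a and b are tuples of numbers, they are hashable. If they are lists, you could convert them to tuples before passing them to search. It does not look like search takes dicts as arguments, but if it did, they would not be hashable and the memoization decorator would be unable to save the result in the cache.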
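
The behavior described above can be sketched with a small dictionary-backed decorator. This is an illustration under stated assumptions, not the Python Decorator Library's implementation: the fallback to an uncached call for unhashable arguments is a design choice of this sketch, and common_length and computed are hypothetical stand-ins used only to show when the function body actually runs.

```python
import functools


def memoized(func):
    """Cache results of func, keyed by its positional arguments (sketch)."""
    cache = {}

    @functools.wraps(func)
    def wrapper(*args):
        try:
            hit = args in cache
        except TypeError:
            # Unhashable arguments (for example lists) cannot be dict keys,
            # so this sketch simply calls the function without caching.
            return func(*args)
        if not hit:
            cache[args] = func(*args)
        return cache[args]

    return wrapper


# Records every pair of arguments for which the body really executed.
computed = []


@memoized
def common_length(a, b):
    """Hypothetical stand-in for search: length of the common prefix."""
    computed.append((a, b))
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count
```

With tuple arguments, a repeated call is answered from the cache and the body runs once. With list arguments, the body runs on every call, which is why converting lists to tuples before the call matters.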

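      【Comments】:

      • Your answer might not be the best, since his algorithm uses list slices... The algorithm itself might need to be refactored.
      • @kzh: Could you elaborate? Sorry, I'm not following you. What problem would arise if there are list slices?
      • He would need to change the dictionaries back to lists for the slicing parts. I posted my answer as an example.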