How can memoization be applied to this algorithm?

【Question title】: How can memoization be applied to this algorithm?
【Posted】: 2011-03-14 07:59:55
【Question description】:

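After finding the difflib.SequenceMatcher class in Python's standard library unsuitable for my needs, I wrote a generic "diff" module to solve the problem. Having had several months to think more about what it is doing, I believe the recursive algorithm searches more than it needs to: it re-searches the same areas of a sequence that a separate "search thread" may already have examined.

The purpose of the diff module is to compute the differences and similarities between a pair of sequences (list, tuple, string, bytes, bytearray, and so on). The current form of the code is about ten times faster than the initial version. How can memoization be applied to the following code? What is the best way to rewrite the algorithm to gain any further speed?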

class Slice:

    __slots__ = 'prefix', 'root', 'suffix'

    def __init__(self, prefix, root, suffix):
        self.prefix = prefix
        self.root = root
        self.suffix = suffix

################################################################################

class Match:

    __slots__ = 'a', 'b', 'prefix', 'suffix', 'value'

    def __init__(self, a, b, prefix, suffix, value):
        self.a = a
        self.b = b
        self.prefix = prefix
        self.suffix = suffix
        self.value = value

################################################################################

class Tree:

    __slots__ = 'nodes', 'index', 'value'

    def __init__(self, nodes, index, value):
        self.nodes = nodes
        self.index = index
        self.value = value

################################################################################

def search(a, b):
    # Initialize startup variables.
    nodes, index = [], []
    a_size, b_size = len(a), len(b)
    # Begin to slice the sequences.
    for size in range(min(a_size, b_size), 0, -1):
        for a_addr in range(a_size - size + 1):
            # Slice "a" at address and end.
            a_term = a_addr + size
            a_root = a[a_addr:a_term]
            for b_addr in range(b_size - size + 1):
                # Slice "b" at address and end.
                b_term = b_addr + size
                b_root = b[b_addr:b_term]
                # Find out if slices are equal.
                if a_root == b_root:
                    # Create prefix tree to search.
                    a_pref, b_pref = a[:a_addr], b[:b_addr]
                    p_tree = search(a_pref, b_pref)
                    # Create suffix tree to search.
                    a_suff, b_suff = a[a_term:], b[b_term:]
                    s_tree = search(a_suff, b_suff)
                    # Make completed slice objects.
                    a_slic = Slice(a_pref, a_root, a_suff)
                    b_slic = Slice(b_pref, b_root, b_suff)
                    # Finish the match calculation.
                    value = size + p_tree.value + s_tree.value
                    match = Match(a_slic, b_slic, p_tree, s_tree, value)
                    # Append results to tree lists.
                    nodes.append(match)
                    index.append(value)
        # Return largest matches found.
        if nodes:
            return Tree(nodes, index, max(index))
    # Give caller null tree object.
    return Tree(nodes, index, 0)

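Reference: How to optimize a recursive algorithm to not repeat itself?

【Comments】:

Tags: python optimization recursion diff memoization

【Solution 1】:

It has been more than 9 years since this question was asked, but the idea of caching results internally to speed up the algorithm has finally been applied to the code today. The result is shown below: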

#! /usr/bin/env python3
"""Compute differences and similarities between a pair of sequences.

After finding the "difflib.SequenceMatcher" class unsuitable, this module
was written and re-written several times into the polished version below."""

__author__ = 'Stephen "Zero" Chappell <Noctis.Skytower@gmail.com>'
__date__ = '3 September 2019'
__version__ = '$Revision: 4 $'


class Slice:
    __slots__ = 'prefix', 'root', 'suffix'

    def __init__(self, prefix, root, suffix):
        self.prefix = prefix
        self.root = root
        self.suffix = suffix


class Match:
    __slots__ = 'a', 'b', 'prefix', 'suffix', 'value'

    def __init__(self, a, b, prefix, suffix, value):
        self.a = a
        self.b = b
        self.prefix = prefix
        self.suffix = suffix
        self.value = value


class Tree:
    __slots__ = 'nodes', 'index', 'value'

    def __init__(self, nodes, index, value):
        self.nodes = nodes
        self.index = index
        self.value = value


def search(a, b):
    return _search(a, b, {})


def _search(a, b, memo):
    # Initialize startup variables.
    nodes, index = [], []
    a_size, b_size = len(a), len(b)
    # Begin to slice the sequences.
    for size in range(min(a_size, b_size), 0, -1):
        for a_addr in range(a_size - size + 1):
            # Slice "a" at address and end.
            a_term = a_addr + size
            a_root = a[a_addr:a_term]
            for b_addr in range(b_size - size + 1):
                # Slice "b" at address and end.
                b_term = b_addr + size
                b_root = b[b_addr:b_term]
                # Find out if slices are equal.
                if a_root == b_root:
                    # Create prefix tree to search.
                    key = a_prefix, b_prefix = a[:a_addr], b[:b_addr]
                    if key not in memo:
                        memo[key] = _search(a_prefix, b_prefix, memo)
                    p_tree = memo[key]
                    # Create suffix tree to search.
                    key = a_suffix, b_suffix = a[a_term:], b[b_term:]
                    if key not in memo:
                        memo[key] = _search(a_suffix, b_suffix, memo)
                    s_tree = memo[key]
                    # Make completed slice objects.
                    a_slice = Slice(a_prefix, a_root, a_suffix)
                    b_slice = Slice(b_prefix, b_root, b_suffix)
                    # Finish the match calculation.
                    value = size + p_tree.value + s_tree.value
                    match = Match(a_slice, b_slice, p_tree, s_tree, value)
                    # Append results to tree lists.
                    nodes.append(match)
                    index.append(value)
        # Return largest matches found.
        if nodes:
            return Tree(nodes, index, max(index))
    # Give caller null tree object.
    return Tree(nodes, index, 0)
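
The memo above is keyed on the slices themselves, so it only works when the slices are hashable (str, bytes, tuple). As a rough sketch of one alternative, not the author's method, the cache can be keyed on index ranges instead, which also works for lists and bytearrays. The name `best_value` and its inner helper `solve` are hypothetical, and the function computes only the numeric score (the `value` of the resulting tree), not the full `Tree`/`Match` structure.

```python
def best_value(a, b):
    """Return the score that search(a, b).value is meant to produce.

    Hypothetical helper: the memo is keyed on index ranges instead of
    slices, so unhashable sequences such as lists are fine.
    """
    memo = {}

    def solve(a_lo, a_hi, b_lo, b_hi):
        key = (a_lo, a_hi, b_lo, b_hi)
        if key in memo:
            return memo[key]
        best = 0
        # Try the largest block size first, as the original search does.
        for size in range(min(a_hi - a_lo, b_hi - b_lo), 0, -1):
            found = False
            for i in range(a_lo, a_hi - size + 1):
                for j in range(b_lo, b_hi - size + 1):
                    if a[i:i + size] == b[j:j + size]:
                        found = True
                        # Score the regions before and after the match.
                        value = (size
                                 + solve(a_lo, i, b_lo, j)
                                 + solve(i + size, a_hi, j + size, b_hi))
                        best = max(best, value)
            # Stop at the first size that produced any match.
            if found:
                break
        memo[key] = best
        return best

    return solve(0, len(a), 0, len(b))
```

One trade-off: equal slices at different positions no longer share a cache entry, so this variant may do more work than slice keys on highly repetitive input.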

【Comments】:

【Solution 2】:

As ~unutbu said, try the memoized decorator together with the following changes:

@memoized
def search(a, b):
    # Initialize startup variables.
    nodes, index = [], []
    a_size, b_size = len(a), len(b)
    # Begin to slice the sequences.
    for size in range(min(a_size, b_size), 0, -1):
        for a_addr in range(a_size - size + 1):
            # Slice "a" at address and end.
            a_term = a_addr + size
            a_root = list(a)[a_addr:a_term] #change to list
            for b_addr in range(b_size - size + 1):
                # Slice "b" at address and end.
                b_term = b_addr + size
                b_root = list(b)[b_addr:b_term] #change to list
                # Find out if slices are equal.
                if a_root == b_root:
                    # Create prefix tree to search.
                    a_pref, b_pref = list(a)[:a_addr], list(b)[:b_addr]
                    p_tree = search(a_pref, b_pref)
                    # Create suffix tree to search.
                    a_suff, b_suff = list(a)[a_term:], list(b)[b_term:]
                    s_tree = search(a_suff, b_suff)
                    # Make completed slice objects.
                    a_slic = Slice(a_pref, a_root, a_suff)
                    b_slic = Slice(b_pref, b_root, b_suff)
                    # Finish the match calculation.
                    value = size + p_tree.value + s_tree.value
                    match = Match(a_slic, b_slic, p_tree, s_tree, value)
                    # Append results to tree lists.
                    nodes.append(match)
                    index.append(value)
        # Return largest matches found.
        if nodes:
            return Tree(nodes, index, max(index))
    # Give caller null tree object.
    return Tree(nodes, index, 0)

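For memoization, dictionaries are best, but they cannot be sliced, so they have to be changed to lists as indicated in the comments in the code above.

【Comments】:

Thanks for your help! As mentioned above, the code was written for sequences [that support slicing] (list, tuple, string, bytes, bytearray, and any custom-made containers). Judging from your changes, it appears that proper memoization requires immutable sequences.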
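
To illustrate the point about immutable sequences: a dict-based cache needs hashable keys, so mutable inputs have to be converted once before the search starts. A minimal sketch, assuming a hypothetical helper named `freeze` (not part of the original code), and assuming the elements of the sequence are themselves hashable:

```python
def freeze(seq):
    """Return a hashable, sliceable equivalent of seq (hypothetical helper)."""
    if isinstance(seq, (str, bytes, tuple)):
        return seq              # already immutable and hashable
    if isinstance(seq, bytearray):
        return bytes(seq)       # immutable counterpart of bytearray
    return tuple(seq)           # lists and other iterables


# Slices of the frozen sequences stay hashable, so they can be cache keys.
a, b = freeze([1, 2, 3, 4]), freeze(bytearray(b"abcd"))
cache = {(a[:2], b[2:]): "cached result"}
```

Because slicing a tuple, str, or bytes object yields the same type, every sub-sequence produced during the recursion remains usable as a key.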

【Solution 3】:

You could use the memoize decorator from the Python Decorator Library, like this:

@memoized
def search(a, b):

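The first time search is called with arguments a, b, the result is computed and stored in a cache. The second time search is called with the same arguments, the result is returned from the cache.

Note that for the memoized decorator to work, the arguments must be hashable. If a and b are tuples of numbers, they are hashable. If they are lists, you could convert them to tuples before passing them to search. It does not look like search takes dicts as arguments, but if it did, they would not be hashable and the memoization decorator would not be able to save the result in the cache.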
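
For readers without the Python Decorator Library at hand, here is a minimal sketch of what such a decorator could look like. It is an illustration written for this thread, not the library's actual code; `common_prefix_length` and the `calls` list are hypothetical names used only to show the caching behaviour.

```python
import functools


def memoized(func):
    """Cache results keyed by the positional arguments (illustrative sketch)."""
    cache = {}

    @functools.wraps(func)
    def wrapper(*args):
        try:
            return cache[args]
        except KeyError:
            # First call with these arguments: compute and remember.
            result = cache[args] = func(*args)
            return result
        except TypeError:
            # Unhashable arguments (e.g. lists): skip the cache entirely.
            return func(*args)

    return wrapper


calls = []  # records every real (uncached) invocation


@memoized
def common_prefix_length(a, b):
    calls.append((a, b))
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


common_prefix_length((1, 2, 3), (1, 2, 4))   # computed
common_prefix_length((1, 2, 3), (1, 2, 4))   # served from the cache
common_prefix_length([1, 2, 3], [1, 2, 4])   # lists: never cached
```

With tuple arguments the function body runs once; with list arguments it runs on every call, which is why converting to tuples matters. The standard library's `functools.lru_cache` has the same hashability requirement, but raises TypeError on unhashable arguments instead of falling back.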

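【Comments】:

Your answer might not be the best, because his algorithm uses list slicing... the algorithm itself might need to be refactored.
@kzh: Could you elaborate? Sorry, I'm not following you. What problem would arise if there is list slicing?
He would need to change the dictionary back into a list for the slicing parts. I posted my answer as an example.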