How to optimize this Python code (from ThinkPython, Exercise 10.10): answers

【Question title】: How to optimize this Python code (from ThinkPython, Exercise 10.10)
【Posted】: 2011-04-02 12:10:43
【Question】:

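I'm working through Allen Downey's How To Think Like A Computer Scientist, and I've written what I believe is a functionally correct solution to Exercise 10.10. But it took just over 10 hours (!) to run, so I'm wondering whether I'm missing some really obvious and helpful optimization.

Here is the exercise:

"Two words 'interlock' if taking alternating letters from each forms a new word. For example, 'shoe' and 'cold' interlock to form 'schooled'. Write a program that finds all pairs of words that interlock. Hint: Don't enumerate all pairs!"

(For these word-list problems, Downey supplies a file containing 113809 words. We may assume that these words are in a list, one word per item.)

Here is my solution: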

from bisect import bisect_left

def index(lst, target):
    """If target is in list, returns the index of target; otherwise returns None"""
    i = bisect_left(lst, target)
    if i != len(lst) and lst[i] == target:
        return i
    else:
        return None

def interlock(str1, str2):
    "Takes two strings of equal length and 'interlocks' them."
    if len(str1) == len(str2):
        lst1 = list(str1)
        lst2 = list(str2)
        result = []
        for i in range(len(lst1)):
            result.append(lst1[i])
            result.append(lst2[i])
        return ''.join(result)
    else:
        return None

def interlockings(word_lst):
    """Checks each pair of equal-length words to see if their interlocking is a word; prints each successful pair and the total number of successful pairs."""
    total = 0
    for i in range(1, 12):  # 12 because max word length is 22
        # to shorten the loops, get a sublist of words of equal length
        sub_lst = filter(lambda(x): len(x) == i, word_lst)
        for word1 in sub_lst[:-1]:
            for word2 in sub_lst[sub_lst.index(word1)+1:]: # pair word1 only with words that come after word1
                word1word2 = interlock(word1, word2) # interlock word1 with word2
                word2word1 = interlock(word2, word1) # interlock word2 with word1
                if index(lst, word1word2): # check to see if word1word2 is actually a word
                    total += 1
                    print "Word 1: %s, Word 2: %s, Interlock: %s" % (word1, word2, word1word2)
                if index(lst, word2word1): # check to see if word2word1 is actually a word
                    total += 1
                    print "Word 2, %s, Word 1: %s, Interlock: %s" % (word2, word1, word2word1)
    print "Total interlockings: ", total

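The print statements aren't the problem; my program found only 652 such pairs. The problem is the nested loops, right? I mean, even though I'm looping over lists that contain only words of the same length, there are (for example) 21727 words of length 7, which means my program has to check over 400 million "interlockings" to see whether they're actual words, and that's just for the words of length 7.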
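As a rough check of that estimate, the number of interlock tests for a single word length can be computed directly. This is only a back-of-the-envelope sketch (written for Python 3); the count 21727 is the figure quoted above, and the variable names are illustrative.

```python
# Count the interlock checks the nested loops perform for one word length.
n = 21727  # number of 7-letter words, as quoted in the question

unordered_pairs = n * (n - 1) // 2  # each pair (word1, word2) is visited once
checks = 2 * unordered_pairs        # word1+word2 and word2+word1 are both tested

print(unordered_pairs)  # 236020401
print(checks)           # 472040802
```

Each of those checks also builds a new 14-letter string and performs a binary search, and `sub_lst.index(word1)` adds a linear scan on every pass of the outer loop.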

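Again, this code took 10 hours to run (and, in case you're curious, found no pairs involving words of length 5 or greater). Is there a better way to solve this problem?

Thanks in advance for any and all insight. I'm aware that "premature optimization is the root of all evil", and perhaps I've fallen into that trap already, but in general, while I can usually write code that runs correctly, I often struggle to write code that runs well.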

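【Comments】:

The answers below make a lot of sense, but I'm curious whether you tried profiling this code to identify what was actually making it slow. Another useful thing to try for a speedup would be simply running it through Psyco or PyPy.
@Glenjamin: I didn't profile the code because I don't know how. Could you provide a link to some documentation that explains how to do it? Thanks!
docs.python.org/library/profile.html explains it much better than I could :) Short version: python -m cProfile myscript.py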
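For reference, the same profiler can also be driven from inside a script. The following is a minimal sketch for Python 3; `slow_pairs` and the tiny word list are hypothetical stand-ins, not the code from the question.

```python
import cProfile
import io
import pstats

def slow_pairs(words):
    """Hypothetical stand-in for the function being profiled."""
    found = 0
    for a in words:
        for b in words:
            if a + b in words:  # linear scan of a list: deliberately slow
                found += 1
    return found

words = ["ab", "cd", "abcd", "ef", "cdef"] * 20

profiler = cProfile.Profile()
profiler.enable()
result = slow_pairs(words)
profiler.disable()

# Collect the report in a string and show the five costliest entries.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats(5)
report = stream.getvalue()
print(report)
```

Sorting by cumulative time puts the functions that dominate the run at the top of the report, which is usually the quickest way to see where the time goes.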

Tags: python

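【Solution 1】:

Do it the other way around: iterate over all words and split each one into two words by taking its even-indexed and odd-indexed letters. Then look up those two words in the dictionary.

As a side note, two words that interlock do not necessarily have the same length; their lengths may also differ by 1.

Some (untested) code: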

words = set(line.strip() for line in open("words"))
for w in words:
    even, odd = w[::2], w[1::2]
    if even in words and odd in words:
        print even, odd
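The snippet above uses a Python 2 print statement. A Python 3 sketch of the same idea follows; the function name `interlocked_pairs` and the sample words are assumptions made for illustration. Because the slices `w[::2]` and `w[1::2]` work on words of any length, the differ-by-1 case (for example "wit" and "as" forming "waist") is covered without extra code.

```python
def interlocked_pairs(words):
    """Return sorted (even, odd, word) triples where both halves are words."""
    word_set = set(words)  # set membership tests take constant time on average
    triples = []
    for w in word_set:
        even, odd = w[::2], w[1::2]  # letters at even and odd positions
        if even in word_set and odd in word_set:
            triples.append((even, odd, w))
    return sorted(triples)

# Small illustrative word list; the real input would be the 113809-word file.
sample = ["shoe", "cold", "schooled", "wit", "as", "waist", "python"]
print(interlocked_pairs(sample))
# [('shoe', 'cold', 'schooled'), ('wit', 'as', 'waist')]
```

This makes one pass over the word list with two set lookups per word, instead of testing every pair of words.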

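【Comments】:

Thanks! I'll try to implement this today and see whether it helps. Regarding your side note: I thought of that and decided it was too complicated for a first pass; once I had the all-equal-length case working, I would go back and consider the differing-length case. Once I've successfully implemented your suggestion, I'll fold that in.
Holy cow! Elapsed time just went from 10 hours to 15.6 seconds. And that includes the differ-by-1 case in the new implementation (which was trivial to implement). Wow. Thank you so much!
I think this is the right approach, but the result is not correct. That is, you are breaking a single word into its even and odd parts. The problem as stated is to take two words and combine them into a new word.
@drewk: The problem is: "Write a program that finds all pairs of words that interlock." The code above does exactly that.
@drewk: If the list of all words contains only "cold" and "shoe", then "schooled" is not a defined word, and "cold" and "shoe" do not interlock. For that case, the code above correctly prints nothing.

【Solution 2】:

An alternative definition of interlock: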

import itertools

def interlock(str1, str2):
    "Takes two strings of equal length and 'interlocks' them."
    return ''.join(itertools.chain(*zip(str1, str2)))
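Note that `zip` stops at the shorter input, so the version above silently drops the last letter when the first word is one character longer. A possible variant that also handles that case is sketched below for Python 3 using `itertools.zip_longest`; returning None for other length combinations is an assumption that mirrors the question's original function.

```python
from itertools import chain, zip_longest

def interlock(str1, str2):
    """Interleave str1 and str2; str1 may be at most one character longer."""
    if not 0 <= len(str1) - len(str2) <= 1:
        return None  # these lengths cannot interlock in this order
    # fillvalue="" pads the shorter word so the last letter of str1 is kept
    return "".join(chain.from_iterable(zip_longest(str1, str2, fillvalue="")))

print(interlock("shoe", "cold"))  # schooled
print(interlock("wit", "as"))     # waist
print(interlock("as", "wit"))     # None
```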

【Comments】:

【Solution 3】:

Another version:

with open('words.txt') as inf:
    words = set(wd.strip() for wd in inf)

word_gen = ((word, word[::2], word[1::2]) for word in words)
interlocked = [word for word,a,b in word_gen if a in words and b in words]

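On my machine, this runs in 0.16 seconds and returns 1254 words.

Edit: As @John Machin pointed out in "Why is this program faster in Python than Objective-C?", this can be improved further by lazy evaluation (the second slice is taken only if the first slice yields a valid word):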

with open('words.txt') as inf:
    words = set(wd.strip() for wd in inf)
interlocked = [word for word in words if word[::2] in words and word[1::2] in words]

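This cuts execution time by a third, down to 0.104 seconds.

【Comments】:

This does not work. If words is set(['cold', 'shoe']) (the OP's example), word_gen is then [('cold', 'cl', 'od'), ('shoe', 'so', 'he')]
@drewk: If words is set(['cold', 'shoe']), there is no pair of interlocking words, and the code above finds none. If words is set(['cold', 'shoe', 'schooled']), there is one pair of interlocking words, and the code above finds it.

【Solution 4】:

~~One important thing is your index function: it runs more often than any other function. When you don't need the index of the word that was found, why define a function to find that index?~~

if word1word2 in lst: is enough in place of if index(lst, word1word2):

The same goes for if index(lst, word2word1):

OK. Bisection works faster than the in syntax. To improve the speed a little more, I suggest using the bisect_left function directly in your interlockings function.

For example, instead of: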

        if index(lst, word1word2): # check to see if word1word2 is actually a word
            total += 1
            print "Word 1: %s, Word 2: %s, Interlock: %s" % (word1, word2, word1word2)

use:

        q = bisect_left(lst, word1word2)
        if q != len(lst) and lst[q] == word1word2:
            total += 1
            print "Word 1: %s, Word 2: %s, Interlock: %s" % (word1, word2, word1word2)

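This gives a slight speed improvement.

【Comments】:

Downey mentions earlier in the book that the "item in list" syntax runs more slowly than the bisection algorithm that my "index" function uses. I admit I haven't tested his assertion to see whether my index function really runs faster than the built-in "in" syntax, so perhaps I'll test that later.
@Alex: You are right that your index() function is faster than Hossein's suggestion, word in lst. Faster than both is building s = set(words) and testing word in s.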