Python-代码优化帮助-查找单词的所有字典有效字谜答案

【问题标题】：Python- Code Optimization Help- Find all dictionary-valid anagrams of a wordPython-代码优化帮助-查找单词的所有字典有效字谜
【发布时间】：2011-08-30 03:30:14
【问题描述】：

我用这个（非常低效的方法）解决了这个问题：

def createList(word, wordList):
    #Made a set, because for some reason permutations was returning duplicates.  
    #Returns all permutations if they're in the wordList

    return set([''.join(item) for item in itertools.permutations(word) if ''.join(item) in wordList])

def main():

    a = open('C:\\Python32\\megalist.txt', 'r+')
    wordList = [line.strip() for line in a]
    maximum = 0
    length = 0
    maxwords = ""

    for words in wordList:
        permList = createList(words, wordList)
        length = len(permList)
        if length > maximum:
            maximum = length
            maxwords = permList
    print (maximum, maxwords)

大约需要 10 分钟才能找到具有最多字典有效字谜的五个字母单词。我想在没有字母限制的单词上运行它，但这会花费大量时间。有什么办法可以优化吗？

【问题讨论】：

标签： python memory-management performance anagram

【解决方案1】：

以下内容似乎适用于小型词典。通过对单词中的字母进行排序，可以很容易地测试两个单词是否是一个字谜。从那个起点开始，只是以某种方式积累单词的问题。修改它以报告所有匹配项并不难，而不仅仅是第一个匹配项

如果确实需要对字母数量添加限制，使用迭代器是过滤掉某些单词的便捷方式。

def wordIterator(dictionaryFilename):
    with open(dictionaryFilename,'r') as f:
        for line in f:
            word = line.strip()
            yield word

def largestAnagram(words):
    import collections
    d = collections.defaultdict(list)
    for word in words:
        sortedWord = str(sorted(word))
        d[ hash(sortedWord) ].append(word)
    maxKey = max( d.keys(), key = lambda k : len(d[k]) )
    return d[maxKey]

iter = wordIterator( 'C:\\Python32\\megalist.txt' )
#iter = ( word for word in iter if len(word) == 5 )
print largestAnagram(iter)

编辑：

针对评论，hash(sortedWord) 是一种节省空间的优化，在这种情况下可能为时过早，将 sortedWord 减少回整数，因为我们并不真正关心键，只要我们可以总是唯一地恢复所有相关的字谜。仅使用 sortedWord 作为密钥同样有效。

max 的 key 关键字参数可让您根据谓词查找集合中的最大元素。所以语句maxKey = max( d.keys(), key = lambda k : len(d[k]) ) 是一个简洁的python 表达式，用于回答查询，给定字典中的键，哪个键具有最大长度的关联值？。在这种情况下，对max 的调用可以写成（更冗长的）valueWithMaximumLength(d)，其中该函数被定义为：

def valueWithMaximumLength( dictionary ):
    maxKey = None
    for k, v in dictionary.items():
        if not maxKey or len(dictionary[maxKey]) < len(v):
            maxKey = k
    return maxKey

【讨论】：

+1 很好地展示了如何在数据获取步骤中解决实际问题，这使得查找/查询步骤几乎是微不足道的。强调初始数据表示的影响。
这太棒了；我理解这个想法，尽管我对这些行的具体作用感到困惑： d[ hash(sortedWord) ].append(word) maxKey = max( d.keys(), key = lambda k : len(d[k ]) ) 你能解释一下吗
次要评论：max(d.items(), key=lambda k, v: len(v)) 更简洁/可能更快。（否则您必须进行两次查找：一次查找键，然后另一次获取值。）

【解决方案2】：

wordList 应该是set。

测试列表中的成员资格要求您扫描列表中的所有元素，检查它们是否是您生成的单词。测试集合中的成员资格不（平均而言）取决于集合的大小。

下一个明显的优化是，一旦你测试了一个单词，你就可以从wordList 中删除它的所有排列，因为它们将在createList 中给出完全相同的集合。如果一切都使用sets 完成，这是一个非常简单的操作——实际上，您只需使用二进制减号。

【讨论】：

这当然很有帮助，尽管接受的答案要优化得多。

【解决方案3】：

没有必要做所有的排列，这是对内存和 CPU 的浪费。所以，首先你的字典应该保存在一个二叉树中，像这样：

   e.g. Dict = ['alex', 'noise', 'mother', 'xeal', 'apple', 'google', 'question']

   Step 1: find the "middle" word for the dictionary, e.g. "mother", because "m" 
           is somewhere in the middle of the English alphabet 
           (this step is not necessary, but it helps to balance the tree a bit)

   Step 2: Build the binary tree:

               mother 
              /      \
             /        \
         alex          noise
           \            /  \
            \          /    \
           apple   question xeal
             \
              \
             google

  Step 3: start looking for an anagram by permutations:
  alex: 1. "a"... (going deeper into binary tree, we find a word 'apple')
        1.1 # of, course we should omit apple, because len('apple') != len('alex')
            #  but that's enough for an example:
        2. Check if we can build word "pple" from "lex": ("a" is already reserved!)
           -- there is no "p" in "lex", so skipping, GOTO 1            
        ... 
        1. "l"... - nothing begins with "l"...
        1. "l"... - nothing begins with "e"...
        1. "x" - going deeper, "xeal" found
        2. Check if "eal" could be built from "ale" ("x" is already reserved)
           for letter in "eal":
               if not letter in "ale":
                  return False
           return True

就是这样 :) 这个算法应该运行得更快。

编辑：

检查bintrees 包以避免花费时间在二叉树实现上。

【讨论】：