【Question title】: Is there an easy way to generate a probable list of words from an unspaced sentence in Python?
【Posted】: 2013-02-28 04:29:40
【Question】:

I have some text:

 s="Imageclassificationmethodscan beroughlydividedinto two broad families of approaches:"

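I would like to parse this into individual words. I took a quick look at enchant and nltk, but didn't see anything that looked immediately useful. If I had time to invest in this, I would look into writing a dynamic program that uses enchant's ability to check whether a word is English. I would have thought there would be something online that does this. Am I wrong?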

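【Question comments】:

You could encode your word dictionary as a trie and use a greedy algorithm: pull off the longest word that matches, then move on to the next word, backtracking on failure. Probably not optimal. Try this for suggestions on data structures: kmike.ru/python-data-structures
Interesting question. I would guess the answer (to "an easy way") is going to be "no".
A similar question asked earlier didn't have much luck: stackoverflow.com/questions/13034330/…
For example, how would your algorithm know that it's not "be roughly divide din to"? All of those are correct English words...
@Tim Pietzcker: Because that is not the greedy approach. "Greed, for lack of a better word, is good. Greed is right. Greed works." en.wikipedia.org/wiki/…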

Tags: python nlp

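【Solution 1】:

This is a problem that comes up often in Asian NLP. If you have a dictionary, you can use this: http://code.google.com/p/mini-segmenter/ (disclaimer: I wrote it, hope you don't mind).

Note that the search space might be extremely large, because a sentence written in alphabetic English contains far more characters than the same sentence in syllabic Chinese or Japanese.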
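
To give a rough feel for that search space, here is a minimal sketch (Python 3, not part of mini-segmenter) that counts how many ways a string can be cut into contiguous pieces before any dictionary filtering. The helper name `count_splits` is made up for illustration.

```python
def count_splits(text):
    """Count the ways to cut `text` into contiguous, non-empty pieces."""
    # Each of the len(text) - 1 gaps between characters is either cut or not.
    gaps = max(len(text) - 1, 0)
    return 2 ** gaps


print(count_splits("expertsexchange"))  # 15 characters -> 16384
```

The count doubles with every extra character, which is why a dictionary (and usually some pruning) is needed to keep segmentation tractable.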

【Comments】:

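【Solution 2】:

Greedy approach using a trie

Try this using Biopython (pip install biopython). The code is written for Python 2 and relies on the Bio.trie module, which recent Biopython releases no longer include: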

from Bio import trie
import string


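def get_trie(dictfile='/usr/share/dict/american-english'):
    """Build a trie from a word list file containing one word per line."""
    tr = trie.trie()
    with open(dictfile) as f:
        for line in f:
            word = line.rstrip()
            try:
                # Python 2: calling encode() on a byte string that contains
                # non-ASCII characters raises UnicodeDecodeError, so those
                # words are skipped by the except clause below.
                word = word.encode(encoding='ascii', errors='ignore')
                tr[word] = len(word)
                assert tr.has_key(word), "Missing %s" % word
            except UnicodeDecodeError:
                pass
    return tr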


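def get_trie_word(tr, s):
    """Return (longest dictionary prefix of s, remainder), or (None, s)."""
    # Try the longest candidate prefix first, then shrink it by one character.
    for end in reversed(range(len(s))):
        word = s[:end + 1]
        if tr.has_key(word):
            return word, s[end + 1:]
    return None, s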

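def main(s):
    tr = get_trie()
    while s:
        word, s = get_trie_word(tr, s)
        if word is None:
            # No dictionary word starts here; stop instead of looping forever.
            print("unmatched: %s" % s)
            break
        print(word)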

if __name__ == '__main__':
    s = "Imageclassificationmethodscan beroughlydividedinto two broad families of approaches:"
    s = s.strip(string.punctuation)
    s = s.replace(" ", '')
    s = s.lower()
    main(s)

Result

>>> if __name__ == '__main__':
...     s = "Imageclassificationmethodscan beroughlydividedinto two broad families of approaches:"
...     s = s.strip(string.punctuation)
...     s = s.replace(" ", '')
...     s = s.lower()
...     main(s)
... 
image
classification
methods
can
be
roughly
divided
into
two
broad
families
of
approaches

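Caveats

This won't handle degenerate cases in English, where the longest matching word leaves a remainder that cannot be split into words. You would need backtracking to deal with those, but this should get you started.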
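
As an illustration of the backtracking mentioned above, here is a minimal sketch in Python 3. It is not the code from this answer: it swaps the trie for a plain set, and the tiny vocabulary `WORDS` and the function name `segment` are made up for the example.

```python
from functools import lru_cache

# Hypothetical toy vocabulary; a real run would load a full word list.
WORDS = frozenset(["the", "them", "theme", "end", "mend"])


def segment(text, words=WORDS):
    """Return one segmentation of `text` as a list of words, or None."""
    max_len = max(map(len, words), default=0)

    @lru_cache(maxsize=None)
    def solve(start):
        if start == len(text):
            return ()  # Reached the end: nothing left to split.
        # Try the longest candidate first, as the greedy version does,
        # but fall back to shorter words if the remainder cannot be split.
        longest = min(len(text), start + max_len)
        for end in range(longest, start, -1):
            candidate = text[start:end]
            if candidate in words:
                rest = solve(end)
                if rest is not None:
                    return (candidate,) + rest
        return None  # No word starting here leads to a full segmentation.

    result = solve(0)
    return None if result is None else list(result)


print(segment("themend"))  # ['them', 'end']
```

A purely greedy pass would take "theme" first and then get stuck on "nd". The memoized `solve` remembers which start positions are dead ends, so each position is explored only once.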

Obligatory test

>>> main("expertsexchange")
experts
exchange
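
If Bio.trie is not available, the same longest-prefix lookup can be approximated with a trie built from nested dicts. This is only a sketch in Python 3 under that assumption; `build_trie`, `longest_prefix`, and the `END` sentinel are hypothetical names, not part of Biopython.

```python
END = None  # Sentinel key marking "a word ends at this node".


def build_trie(words):
    """Build a nested-dict trie with one level per character."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def longest_prefix(trie, text):
    """Return the longest word in the trie that is a prefix of `text`."""
    node = trie
    best = 0
    for i, ch in enumerate(text):
        node = node.get(ch)
        if node is None:
            break  # No word in the trie continues with this character.
        if END in node:
            best = i + 1  # Remember the longest complete word seen so far.
    return text[:best]


toy_trie = build_trie(["ex", "expert", "experts", "exchange"])
print(longest_prefix(toy_trie, "expertsexchange"))  # experts
```

Unlike testing every prefix length separately, this walks the input once and stops as soon as no dictionary word can continue. It returns an empty string when nothing matches.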

【Comments】:

This is awesome. Exactly what I wanted!