How to split a string of characters/alphabets without space/separator into dictionary words?

【Question title】: How to split a string of characters/alphabets without space/separator into dictionary words?
【Posted】: 2017-12-04 17:21:42
【Question】:

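I have a string that contains two or more English dictionary words, but the spaces between the words are missing. How can I separate the words in R or Python?

Example: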

Input_string = "thequickbrownfox"

Desired_output_string = "the quick brown fox"

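Is there an algorithm for this kind of text processing?

【Comments】:

Good luck. I believe this is off-topic for SO, but you might have better luck asking about methods that can solve this kind of problem (rather than packages)... and this question would be a better fit on Cross Validated or (less likely) Software Recs.
Fair enough, methods are welcome.
Nothing is perfect... for example, take the string "ilovetherapists"; is that "i love therapists" or "i love the rapists"?
Sure. The more options, the better. But more importantly, we cannot split it.

Tags: python r text nlp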

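【Solution 1】:

This is not a linear problem. Among other difficulties, some character sequences can be split into more than one plausible string of words.

However, the approach is straightforward with a recursive routine. Go through your lexicon (the dictionary of legal words) and find every word that can be formed from the beginning of the given sentence. Iterate over these words; for each one, parse the rest of the sentence. If that succeeds, return the properly separated input (current word + parsed remainder).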

# Parse a character sequence;
#   return a list of legal word separations.
# Assumes a word list, lexicon, as a global
#   (the toy contents below are for illustration only).
lexicon = {"the", "quick", "brown", "fox"}

def sep_string(sentence):
    # Base case: an empty sentence has exactly one parsing, the empty one.
    # Without it, no parsing could ever be completed.
    if not sentence:
        return [""]

    result = []
    for word_size in range(1, len(sentence) + 1):
        word = sentence[:word_size]  # next potential word

        if word in lexicon:
            # Found a legal word; remove it and parse
            #   the rest of the sequence
            sep_rest = sep_string(sentence[word_size:])
            # sep_rest is a list of parsings for
            #   the rest of the sequence

            for solution in sep_rest:
                # strip() drops the trailing space left by the empty parsing
                result.append((word + " " + solution).strip())

    return result
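
The recursive routine above may parse the same tail of the sentence many times. As a minimal sketch (not part of the original answer), the Python version below caches the parsings for each start position with `functools.lru_cache`. The names `split_words`, `parse_from`, and `LEXICON` are hypothetical, and the five-word lexicon is a toy assumption chosen to reproduce the ambiguous example from the comments.

```python
from functools import lru_cache

# Toy lexicon for illustration only; a real run would load a full word list.
LEXICON = frozenset({"i", "love", "the", "therapists", "rapists"})


def split_words(sentence, lexicon=LEXICON):
    """Return every way to split `sentence` into words from `lexicon`."""

    @lru_cache(maxsize=None)
    def parse_from(start):
        # Base case: nothing left to parse, so there is one (empty) parsing.
        if start == len(sentence):
            return ((),)
        parsings = []
        for end in range(start + 1, len(sentence) + 1):
            word = sentence[start:end]
            if word in lexicon:
                # Each parsing of the remainder extends the current word.
                for rest in parse_from(end):
                    parsings.append((word,) + rest)
        return tuple(parsings)

    return [" ".join(words) for words in parse_from(0)]


print(split_words("ilovetherapists"))
# prints ['i love the rapists', 'i love therapists']
```

Caching only avoids re-parsing the same tail; the number of valid splits can still grow quickly for long inputs, and choosing among them (for example by word frequency) is a separate problem.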

【Discussion】: