Is it possible to tokenize all except pre-defined words? Answers

【Question title】: Is it possible to tokenize all except pre-defined words?
【Posted】: 2015-10-19 15:17:47
【Question description】:

I want to tokenize a sentence but keep pre-defined words intact. For example, I want to turn

"i went to university of abc and had a wonderful time there!"

into

["i", "went", "to", "university of abc", "and", "had", "a", "wonderful", "time", "there", "!"]

because "university of abc" is a pre-defined word.

I cannot find such a parameter or control in any of the NLTK tokenizers. Is there a way to hack around this? Thanks!

【Question discussion】:

Tags: python regex text nlp

【Solution 1】:

Use this regex, and match against it instead of splitting:

(university of abc|\w+|[^\w\s]+)


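You can add more pre-defined words on the left-hand side (LHS) of the alternation in the regex above. They have to come before `\w+`, because the alternatives are tried from left to right.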
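As a minimal sketch of how this regex might be applied in Python, the snippet below uses `re.findall` from the standard library so that every match becomes one token. The variable names are illustrative and not part of the original answer.

```python
import re

# Alternatives are tried left to right: the pre-defined phrase first,
# then a run of word characters, then a run of punctuation.
TOKEN_PATTERN = re.compile(r"university of abc|\w+|[^\w\s]+")

sentence = "i went to university of abc and had a wonderful time there!"

# findall returns every non-overlapping match; whitespace matches no
# alternative, so it is skipped between tokens.
tokens = TOKEN_PATTERN.findall(sentence)
print(tokens)
# ['i', 'went', 'to', 'university of abc', 'and', 'had', 'a', 'wonderful', 'time', 'there', '!']
```

Matching rather than splitting is what keeps the punctuation: "!" is returned as its own token instead of being discarded as a separator.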

【Discussion】:

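Thanks. Can I set the LHS of the regex to any pre-defined word? I.e., W = "university of abc", and then have the variable W somewhere in the regex?
You can build the regex using string concatenation.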
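One way the concatenation could look in Python is sketched here. The helper name `build_token_pattern` and the second phrase "new york" are hypothetical, added only for illustration. `re.escape` is used on the assumption that a phrase may contain regex metacharacters, and the phrases are sorted longest first so that a longer phrase is not shadowed by a shorter one that is its prefix.

```python
import re

# Fallback alternatives from the answer: word runs, then punctuation runs.
FALLBACK = r"\w+|[^\w\s]+"

def build_token_pattern(phrases):
    """Hypothetical helper: compile a pattern that keeps each phrase as one token."""
    # Longest phrases first, so "university of abc" wins over "university".
    ordered = sorted(phrases, key=len, reverse=True)
    # re.escape keeps characters such as "." or "+" inside a phrase literal.
    alternatives = [re.escape(p) for p in ordered] + [FALLBACK]
    return re.compile("|".join(alternatives))

W = "university of abc"
pattern = build_token_pattern([W, "new york"])
tokens = pattern.findall("i moved from new york to university of abc!")
print(tokens)
# ['i', 'moved', 'from', 'new york', 'to', 'university of abc', '!']
```

Note that this sketch does not check word boundaries, so a phrase would also match at the start of a longer word (for example "university of abcd").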

【Solution 2】:

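You could use the regexp tokenizer and write a regular expression that, for example, splits on all whitespace that is not part of "the university of abc", but that would be cumbersome. A hack-y approach might be to make a pass over the text, or write a regex, that replaces "the university of abc" with "the-university-of-abc" or some other string that will not be broken into separate tokens (depending on the tokenizer you use).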
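The replacement hack could be sketched roughly as follows. This uses a plain regex as a stand-in tokenizer that happens to keep hyphenated words together; it is not an NLTK tokenizer, and the function name `tokenize_protecting` is hypothetical. Mapping the placeholder back to the original phrase at the end is an extra step that the answer does not mention.

```python
import re

# Stand-in tokenizer (an assumption, not NLTK): hyphen-joined word runs
# stay in one token, and punctuation runs form their own tokens.
HYPHEN_AWARE = re.compile(r"\w+(?:-\w+)*|[^\w\s]+")

def tokenize_protecting(text, phrase):
    """Hypothetical helper: keep `phrase` whole by hyphenating it first."""
    placeholder = phrase.replace(" ", "-")
    # Step 1: replace the phrase with a string the tokenizer will not split.
    protected = text.replace(phrase, placeholder)
    # Step 2: tokenize as usual.
    tokens = HYPHEN_AWARE.findall(protected)
    # Step 3 (added here): map the placeholder back to the original phrase.
    return [phrase if tok == placeholder else tok for tok in tokens]

result = tokenize_protecting(
    "i went to university of abc and had a wonderful time there!",
    "university of abc",
)
print(result)
# ['i', 'went', 'to', 'university of abc', 'and', 'had', 'a', 'wonderful', 'time', 'there', '!']
```

A caveat of this sketch: if the text already contains the hyphenated form literally, it is also mapped back to the spaced phrase, so a placeholder that cannot occur in the input would be safer.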

【Discussion】: