Why does gensim's simple_preprocess Python tokenizer seem to skip the "i" token?

【Question title】: Why does gensim's simple_preprocess Python tokenizer seem to skip the "i" token?
【Posted】: 2020-04-06 07:42:56
【Question】:

import gensim
list(gensim.utils.simple_preprocess("i you he she I it we you they", deacc=True))

gives the result:

['you', 'he', 'she', 'it', 'we', 'you', 'they']

Is this normal? Are there other words it skips? Should I use a different tokenizer?

Bonus question: what does the "deacc=True" parameter mean?

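【Comments】:

This is explained in the documentation; everyone should get into the habit of reading it.
Thanks, it turns out the min_len parameter defaults to 2. Great, thank you very much!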

Tags: python nlp tokenize gensim

【Solution 1】:

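As @user2357112-supports-monica mentioned in their comment, this is part of the designed behavior of simple_preprocess(): according to its documentation, it discards any token shorter than min_len=2 characters. Both "i" and "I" are a single character, so both are dropped.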
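To make the length filter concrete, here is a minimal standard-library sketch of the kind of filtering described above. It is not gensim's code: sketch_preprocess and TOKEN_RE are hypothetical names, and the regex and the max_len=15 upper bound are assumptions modelled on the documented defaults.

```python
import re

# Hypothetical stand-in for gensim.utils.simple_preprocess. The pattern and the
# default bounds (min_len=2, max_len=15) are assumptions based on gensim's docs.
TOKEN_RE = re.compile(r"[^\W\d]+")  # runs of word characters that are not digits


def sketch_preprocess(doc, min_len=2, max_len=15):
    """Lowercase the text, split it into alphabetic tokens, filter by length."""
    tokens = TOKEN_RE.findall(doc.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]


text = "i you he she I it we you they"
default_tokens = sketch_preprocess(text)         # one-letter tokens are dropped
all_tokens = sketch_preprocess(text, min_len=1)  # one-letter tokens are kept
print(default_tokens)
print(all_tokens)
```

With gensim itself, the corresponding change would be to pass min_len=1 to simple_preprocess(). Note that tokens are lowercased first, so "I" comes back as "i".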

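Your "bonus question" is also answered in the same documentation:

deacc (bool, optional) - Remove accent marks from tokens using deaccent()?

(The deaccent() function is another utility function, documented at the link, and it does exactly what its name and documentation suggest: it removes accent marks from letters, so that, for example, 'é' becomes 'e'.)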
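As a rough illustration of what accent removal involves, the sketch below uses only the standard library's unicodedata module. strip_accents is a hypothetical helper, not gensim's deaccent(); it assumes that accent removal means decomposing each character, dropping the combining marks, and recomposing.

```python
import unicodedata


def strip_accents(text):
    """Hypothetical helper: decompose, drop combining marks, then recompose."""
    decomposed = unicodedata.normalize("NFD", text)
    # Unicode category "Mn" (nonspacing mark) covers combining accents.
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", kept)


plain = strip_accents("café déjà vu")
print(plain)
```

With deacc=True, simple_preprocess() applies this kind of normalization to the text, so accented and unaccented spellings of a word end up as the same token.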

【Discussion】: