How can I restrict the token length while using CountVectorizer?

【Question title】: How can I restrict the token length while using CountVectorizer?
【Posted】: 2018-10-13 14:39:36
【Question description】:

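I do not want terms shorter than 3 characters or longer than 7. There is a straightforward way to do this in R, but I am not sure how to do it in Python. I tried the following, but it still does not work: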

from sklearn.feature_extraction.text import CountVectorizer
regex1 = '/^[a-zA-Z]{3,7}$/'
vectorizer = CountVectorizer( analyzer='word',tokenizer= tokenize,stop_words = stopwords,token_pattern  = regex1,min_df= 2, max_df = 0.9,max_features = 2000)
vectorizer1 = vectorizer.fit_transform(token_dict.values())

I also tried these other regular expressions:

  "^[a-zA-Z]{3,7}$"
r'^[a-zA-Z]{3,7}$'

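【Question comments】:

Why was this downvoted? Please explain.
@VivekKumar I don't think that is the problem. If it were, it would raise an error.
@rock321987 Yes, it could be. But until we get an MCVE, how can we decide?

Tags: python python-3.x scikit-learn countvectorizer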

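【Solution 1】:

According to the CountVectorizer documentation, the default token_pattern selects tokens of 2 or more alphanumeric characters. If you want to change this, pass your own regular expression.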
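To see what that default means in practice, here is a minimal sketch using only the `re` module. It assumes the default pattern is `(?u)\b\w\w+\b` and that the pattern is applied to each document with `findall`; the sample string is made up for illustration.

```python
import re

# Assumed default token_pattern of CountVectorizer.
default_pattern = r"(?u)\b\w\w+\b"

# Illustrative input, not taken from the question's data.
sample = "a yr ar horses x2"

# Single-character tokens are dropped; everything with 2+ word characters is kept.
default_tokens = re.findall(default_pattern, sample)
print(default_tokens)  # ['yr', 'ar', 'horses', 'x2']
```

Two-letter tokens such as 'yr' and 'ar' pass the default pattern, so they remain in the vocabulary unless a stricter pattern is actually in effect.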

In your case, add token_pattern = "^[a-zA-Z]{3,7}$" to the CountVectorizer options.

Edit

The regular expression that should be used is [a-zA-Z]{3,7}. See the example below:

doc1 = ["Elon Musk is genius", "Are you mad", "Constitutional Ammendments in Indian Parliament",\
        "Constitutional Ammendments in Indian Assembly", "House of Cards", "Indian House"]

from sklearn.feature_extraction.text import CountVectorizer

regex1 = '[a-zA-Z]{3,7}'
vectorizer = CountVectorizer(analyzer='word', stop_words = 'english', token_pattern  = regex1)
vectorizer1 = vectorizer.fit_transform(doc1)

vectorizer.vocabulary_

Result:

{u'ammendm': 0,
 u'assembl': 1,
 u'cards': 2,
 u'constit': 3,
 u'elon': 4,
 u'ent': 5,
 u'ents': 6,
 u'genius': 7,
 u'house': 8,
 u'indian': 9,
 u'mad': 10,
 u'musk': 11,
 u'parliam': 12,
 u'utional': 13}
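Note the fragments such as 'ammendm', 'ents' and 'utional' in that vocabulary: without anchors or word boundaries, a word longer than 7 letters is cut into 3 to 7 letter chunks instead of being dropped. The sketch below reproduces this with plain `re.findall`, assuming that is how the pattern is applied to each lowercased document. The `\b` variant is shown only for comparison.

```python
import re

text = "constitutional ammendments in indian parliament"

# Unanchored: long words are split into 3-7 letter chunks.
chunked = re.findall(r"[a-zA-Z]{3,7}", text)

# With word boundaries: only whole words of 3-7 letters survive.
whole_words = re.findall(r"\b[a-zA-Z]{3,7}\b", text)

print(chunked)      # ['constit', 'utional', 'ammendm', 'ents', 'indian', 'parliam', 'ent']
print(whole_words)  # ['indian']
```

If the goal is to discard over-long words rather than truncate them, the word-boundary form is probably the one you want.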

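【Comments】:

Then something is wrong with what I am doing. I tried all of the regular expressions and it still does not work.
Are you getting "ValueError: empty vocabulary; perhaps the documents only contain stop words"? If so, that is a problem with the min_df and max_df conditions.

【Solution 2】:

I think your regex pattern is wrong here. It is JavaScript syntax. It should be like this: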

regex1 = r'^[a-zA-Z]{3,7}$'

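I am also assuming that the regex should match the whole string, NOT some substring. So a string like aaaaabbb cc should be discarded.

If not, you should use word boundaries \b instead of the start ^ and end $ anchors. So it should be: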

regex1 = r'\b[a-zA-Z]{3,7}\b'
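A quick way to compare the three patterns is to run them through `re.findall` on a one-word and a multi-word string. This is a sketch with made-up strings, assuming the vectorizer applies the pattern to each whole document in the same way.

```python
import re

js_style = '/^[a-zA-Z]{3,7}$/'      # slashes are literal characters in Python
anchored = r'^[a-zA-Z]{3,7}$'       # matches only if the whole string is one word
bounded = r'\b[a-zA-Z]{3,7}\b'      # matches each whole word of 3-7 letters

single = "horses"
multi = "horses ab yr gallop"

results = {
    name: (re.findall(pattern, single), re.findall(pattern, multi))
    for name, pattern in [("js_style", js_style), ("anchored", anchored), ("bounded", bounded)]
}

for name, (on_single, on_multi) in results.items():
    print(name, on_single, on_multi)
# js_style [] []
# anchored ['horses'] []
# bounded ['horses'] ['horses', 'gallop']
```

The anchored form only yields a token when a document consists of exactly one 3 to 7 letter word, which is why it tends to produce an empty vocabulary on ordinary sentences.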

Here is a working example:

from sklearn.feature_extraction.text import CountVectorizer

# Keep only whole words of 3 to 7 ASCII letters
regex1 = r'\b[a-zA-Z]{3,7}\b'
token_dict = {123: 'horses', 345: 'ab'}
vectorizer = CountVectorizer(token_pattern=regex1)
vectorizer1 = vectorizer.fit_transform(token_dict.values())

# get_feature_names() was removed in scikit-learn 1.2; on older versions call it instead
print(vectorizer.get_feature_names_out())

Output

['horses']
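One more thing worth checking: the question's call also passes tokenizer=tokenize. As far as I can tell from the scikit-learn documentation, a custom tokenizer overrides the tokenization step, so token_pattern is not used at all in that case. If the custom tokenizer has to stay, the length limit can be enforced inside it. The helper below is a hypothetical sketch, not the asker's tokenize function.

```python
import re

def tokenize_3_to_7(text):
    """Hypothetical tokenizer: lowercase alphabetic words of length 3 to 7."""
    words = re.findall(r"[a-zA-Z]+", text.lower())
    return [w for w in words if 3 <= len(w) <= 7]

# Illustrative input, not taken from the question's data.
tokens = tokenize_3_to_7("The yr 2018 ar horses Constitutional")
print(tokens)  # ['the', 'horses']
```

Such a function could then be passed as CountVectorizer(tokenizer=tokenize_3_to_7), leaving token_pattern out of the call.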

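【Comments】:

@Indi What is the error? You should also add the data from token_dict.
I am not getting any error, but I see tokens like 'yr' and 'ar' appearing in the DTM. token_dict contains the tokens and their counts.
@Indi Are you using the regex written in the answer rather than yours? I guess you are not. Can you check?
@Indi Also, are the token_dict key-value pairs like {token: count} or {count: token}?
Sorry, token_dict is of the format { : }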