Using custom vocabulary n-grams for sklearn CountVectorizer: answers

【Question title】: Using custom vocabulary n-grams for sklearn CountVectorizer
【Posted】: 2021-06-26 14:45:36
【Question description】:

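I want a CountVectorizer with a custom vocabulary that records the presence or absence of expressions. I want it to detect combinations of words, not just single words.

Given my custom vocabulary, I want sklearn to detect "big dog".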

from sklearn.feature_extraction.text import CountVectorizer

cvec = CountVectorizer(vocabulary=['big dog', 'cat'])

cvec.fit_transform(['The big dog and the cat']).toarray()

array([[0, 1]], dtype=int64)

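It does not seem to detect the word combination "big dog" that I am looking for. Is there a way to do this, or can this feature only detect single words?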

【Question comments】:

Tags: python numpy scikit-learn

【Solution 1】:

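If you want sklearn to consider two-word combinations, set ngram_range wider than the default (1, 1), for example (1, 2). With the default, the analyzer emits only single words, so the vocabulary entry 'big dog' can never match.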

from sklearn.feature_extraction.text import CountVectorizer

cvec = CountVectorizer(vocabulary=['big dog', 'cat'], ngram_range=(1, 2))

cvec.fit_transform(['The big dog and the cat']).toarray()

array([[1, 1]], dtype=int64)
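
To see why the default setting misses 'big dog', it can help to inspect the tokens the analyzer produces. The following is a minimal sketch, not part of the original answer: the variable names and the second test sentence are illustrative assumptions. It also sets binary=True, an existing CountVectorizer parameter, on the assumption that you want presence or absence rather than counts, as the question describes.

```python
from sklearn.feature_extraction.text import CountVectorizer

doc = 'The big dog and the cat'
vocab = ['big dog', 'cat']

# Default ngram_range=(1, 1): the analyzer yields single lowercased words only,
# so the two-word entry 'big dog' is never produced.
unigram_tokens = CountVectorizer(vocabulary=vocab).build_analyzer()(doc)
print(unigram_tokens)

# ngram_range=(1, 2) also yields space-joined bigrams such as 'big dog'.
# binary=True caps every nonzero count at 1 (presence/absence).
cvec = CountVectorizer(vocabulary=vocab, ngram_range=(1, 2), binary=True)
bigram_tokens = cvec.build_analyzer()(doc)
print(bigram_tokens)

# Illustrative sentence: 'big dog' occurs twice, 'cat' does not occur.
presence = cvec.transform(['The big dog chased another big dog']).toarray()
print(presence)
```

Because the vocabulary is fixed in advance, transform can be called directly; the column order follows the order of the vocabulary list, so column 0 is 'big dog' and column 1 is 'cat'.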

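【Comments】:

As the output shows, "big dog" was not detected, because there is a 0 at index 0.
Oh, it works if I use the parameter ngram_range=(1, 2). If you add that to your answer, I will accept it.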