Setting Sklearn's CountVectorizer vocabulary to a dict of phrases: answers

【Question title】: Setting Sklearn's CountVectorizer's vocabulary to a dict of phrases
【Posted】: 2017-09-14 18:23:27
【Question description】:

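Hello, I've been experimenting with scikit-learn for text analysis, and I had the idea of using CountVectorizer to detect whether a document contains a set of keywords and phrases.

I know we can do this: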

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

words = ['cat', 'dog', 'walking']
example = ['I was walking my dog and cat in the park']
vect = CountVectorizer(vocabulary=words)
dtm = vect.fit_transform(example)
>>> pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names())

...

   cat  dog  walking
    1    1        1

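I'd like to know whether something can be tweaked so that I can use multi-word phrases instead of single words.

Building on the example above, this is the result I'd like to get: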

phrases = ['cat in the park', 'walking my dog']
example = ['I was walking my dog and cat in the park']
vect = CountVectorizer(vocabulary=phrases)
dtm = vect.fit_transform(example)
>>> pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names()) 
... 

       cat in the park   walking my dog
            1                   1

Right now, the code using phrases just outputs:

cat in the park   walking my dog
     0                   0

Thanks in advance!

【Question discussion】:

Tags: python scikit-learn nlp

【Solution 1】:

Try this:

In [104]: lens = [len(x.split()) for x in phrases]

In [105]: mn, mx = min(lens), max(lens)

In [106]: vect = CountVectorizer(vocabulary=phrases, ngram_range=(mn, mx))

In [107]: dtm = vect.fit_transform(example)

In [108]: pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names())
Out[108]:
   cat in the park  walking my dog
0                1               1

In [109]: print(mn, mx)
3 4
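
Why this seems to work: with the default ngram_range=(1, 1) the vectorizer only emits single-word tokens, so a multi-word vocabulary entry can never match. Once ngram_range covers the phrase lengths (3 to 4 words here), the analyzer also emits word sequences of those lengths. Below is a minimal sketch that inspects this with build_analyzer(), assuming the default tokenizer and lowercasing; the names unigrams, ngrams and counts are only for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

phrases = ['cat in the park', 'walking my dog']
example = ['I was walking my dog and cat in the park']

# Phrase lengths in words decide the n-gram range (3 to 4 here).
lens = [len(p.split()) for p in phrases]
mn, mx = min(lens), max(lens)

# Default settings emit single words only, so no phrase can match.
unigrams = CountVectorizer().build_analyzer()(example[0])

# With ngram_range=(mn, mx) the analyzer emits 3- and 4-word sequences.
ngrams = CountVectorizer(ngram_range=(mn, mx)).build_analyzer()(example[0])

print(any(p in unigrams for p in phrases))  # False
print(all(p in ngrams for p in phrases))    # True

# Columns follow the order of the vocabulary list.
vect = CountVectorizer(vocabulary=phrases, ngram_range=(mn, mx))
counts = vect.fit_transform(example).toarray()
print(counts.tolist())  # [[1, 1]]
```

Note that get_feature_names(), as used in this thread, was removed in scikit-learn 1.2; on current versions use get_feature_names_out() instead.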

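【Discussion】:

Works great for the example above, but when I use this approach inside the function I'm building to set the vocabulary, it doesn't detect the phrases in the documents. I'll try to troubleshoot it myself and see what's going on.
Got it. I've accepted the answer, thanks again for your help!
Thanks, very helpful.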
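
One possible reason a vocabulary built elsewhere fails to match (this is a guess; the commenter never said what the cause was): the default analyzer lowercases text and drops single-character tokens, so a phrase containing capitals or one-letter words such as "a" never appears among the generated n-grams. Here is a minimal sketch that passes each phrase through the same tokenizer before using it as vocabulary. raw_phrases and normalize_phrase are hypothetical names, not part of scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical raw phrases, e.g. produced by some other function.
raw_phrases = ['Walking my dog', 'a cat in the park']
example = ['I was walking my dog and a cat in the park']

# Reuse the default unigram analyzer (lowercases, drops 1-char tokens).
tokenize = CountVectorizer().build_analyzer()


def normalize_phrase(phrase):
    """Rewrite a phrase the way the vectorizer will see it in a document."""
    return ' '.join(tokenize(phrase))


phrases = [normalize_phrase(p) for p in raw_phrases]

lens = [len(p.split()) for p in phrases]
vect = CountVectorizer(vocabulary=phrases, ngram_range=(min(lens), max(lens)))
counts = vect.fit_transform(example).toarray()

print(phrases)          # ['walking my dog', 'cat in the park']
print(counts.tolist())  # [[1, 1]]
```

If two raw phrases normalize to the same string, deduplicate them first, since CountVectorizer rejects duplicate terms in a vocabulary list.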