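【Posted】: 2017-09-14 18:23:27
【Question】:
Hello, I have been experimenting with scikit-learn for text analysis, and it occurred to me to use CountVectorizer to detect whether a document contains a given set of keywords and phrases.
I know we can do this: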
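import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

words = ['cat', 'dog', 'walking']
example = ['I was walking my dog and cat in the park']
vect = CountVectorizer(vocabulary=words)
dtm = vect.fit_transform(example)
# Note: on scikit-learn >= 1.2, use vect.get_feature_names_out() instead.
>>> pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names())
...
cat  dog  walking
  1    1        1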
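I would like to know whether this can be adjusted so that I can use multi-word phrases instead of single words.
Adapting the example above, this is the output I would like to get: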
phrases = ['cat in the park', 'walking my dog']
example = ['I was walking my dog and cat in the park']
vect = CountVectorizer(vocabulary=phrases)
dtm = vect.fit_transform(example)
>>> pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names())
...
cat in the park  walking my dog
              1               1
Right now, the code using phrases just outputs:
cat in the park  walking my dog
              0               0
Thanks in advance!
【Discussion】:
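A possible explanation for the zeros, offered as a sketch rather than a definitive answer: with its default `ngram_range=(1, 1)`, CountVectorizer only produces single-word tokens, so a multi-word vocabulary entry never matches anything. Widening `ngram_range` so that it covers the longest phrase should let the phrases be counted. The `max_len` helper variable below is introduced only for illustration, and the example assumes the default tokenizer (lowercasing, single-character tokens such as "I" dropped), so the phrases must be written the way that tokenizer would emit them.

```python
from sklearn.feature_extraction.text import CountVectorizer

phrases = ['cat in the park', 'walking my dog']
example = ['I was walking my dog and cat in the park']

# Longest phrase length in words (illustrative helper, not part of the question).
max_len = max(len(p.split()) for p in phrases)

# Generate n-grams from 1 word up to max_len words so that the
# multi-word vocabulary entries can actually be matched.
vect = CountVectorizer(vocabulary=phrases, ngram_range=(1, max_len))
dtm = vect.fit_transform(example)

# A list vocabulary keeps its order, so columns line up with `phrases`.
counts = dict(zip(phrases, dtm.toarray()[0].tolist()))
print(counts)
```

If all phrases had the same length, a tighter range such as `ngram_range=(3, 3)` would avoid generating n-grams that can never match; the wider range here is just the simplest setting that covers both example phrases.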
Tags: python scikit-learn nlp