【Question title】: CountVectorizer raising error on short words
【Posted】: 2018-08-04 08:46:12
【Question】:

Can someone explain why CountVectorizer raises this error whenever I try to fit_transform any short word? Even with stop_words=None I still get the same error. Here is the code:

from sklearn.feature_extraction.text import CountVectorizer

text = ['don\'t know when I shall return to the continuation of my scientific work. At the moment I can do absolutely nothing with it, and limit myself to the most necessary duty of my lectures; how much happier I would be to be scientifically active, if only I had the necessary mental freshness.']

cv = CountVectorizer(stop_words=None).fit(text)

This works as expected. However, if I then call fit_transform on another text:

cv.fit_transform(['q'])

the following error is raised:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-acbd560df1a2> in <module>()
----> 1 cv.fit_transform(['q'])

~/.local/lib/python3.6/site-packages/sklearn/feature_extraction/text.py in fit_transform(self, raw_documents, y)
    867 
    868         vocabulary, X = self._count_vocab(raw_documents,
--> 869                                           self.fixed_vocabulary_)
    870 
    871         if self.binary:

~/.local/lib/python3.6/site-packages/sklearn/feature_extraction/text.py in _count_vocab(self, raw_documents, fixed_vocab)
    809             vocabulary = dict(vocabulary)
    810             if not vocabulary:
--> 811                 raise ValueError("empty vocabulary; perhaps the documents only"
    812                                  " contain stop words")
    813 

ValueError: empty vocabulary; perhaps the documents only contain stop words

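I have read several threads about this error, since CountVectorizer seems to raise it quite often, but everything I found only covers cases where the text really does contain nothing but stop words. I can't figure out what the problem is in my case, so any help would be much appreciated!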

【Question comments】:

    Tags: python machine-learning scikit-learn valueerror countvectorizer


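    【Answer 1】:

    By default, CountVectorizer uses token_pattern='(?u)\\b\\w\\w+\\b', which only tokenizes words (tokens) of 2 or more characters. A single-character document such as 'q' therefore produces no tokens, and the vocabulary ends up empty. Note that fit_transform learns the vocabulary from the documents passed to it, so the earlier fit on the longer text does not help here.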

    You can change this default behavior:

    vect = CountVectorizer(token_pattern='(?u)\\b\\w+\\b')
    

    Test:

    In [29]: vect.fit_transform(['q'])
    Out[29]:
    <1x1 sparse matrix of type '<class 'numpy.int64'>'
            with 1 stored elements in Compressed Sparse Row format>
    
    In [30]: vect.get_feature_names()
    Out[30]: ['q']
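
To see why the default pattern leaves nothing to count, here is a minimal sketch that applies the two regular expressions directly with Python's re module. It assumes the tokenizer amounts to lowercasing the document and calling findall with the token pattern; the helper name tokenize is made up for this illustration and is not part of scikit-learn.

```python
import re

# Default token_pattern: a word boundary, then two or more word characters.
DEFAULT_PATTERN = re.compile(r"(?u)\b\w\w+\b")
# Relaxed pattern from the answer: one or more word characters.
RELAXED_PATTERN = re.compile(r"(?u)\b\w+\b")


def tokenize(pattern, doc):
    """Hypothetical helper: lowercase the document, then extract tokens."""
    # Lowercasing mirrors CountVectorizer's default lowercase=True (assumption
    # that the rest of the preprocessing is irrelevant for this input).
    return pattern.findall(doc.lower())


default_tokens = tokenize(DEFAULT_PATTERN, "q")          # single character is dropped
relaxed_tokens = tokenize(RELAXED_PATTERN, "q")          # single character is kept
mixed_tokens = tokenize(DEFAULT_PATTERN, "I can do it")  # only "I" is dropped

print(default_tokens)  # []
print(relaxed_tokens)  # ['q']
print(mixed_tokens)    # ['can', 'do', 'it']
```

With the default pattern the document 'q' yields an empty token list, which is what leads to the "empty vocabulary" ValueError; stop words are not involved at all.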
    

    【Comments】:
