【Question title】: Given a dictionary of word and frequency pairs, how to proceed with text mining in scikit-learn
【Posted】: 2015-03-27 13:17:13
【Question description】:

I already have word frequencies and categories like this:

y = ['animals', 'restaurants', 'sports']
x = [{'cat':1, 'dog':2}, {'food':4, 'drink':2}, {'baseball':4, 'basketball':5}]

How should I go on to build a pipeline like the one in the following tutorial?

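>>> from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
>>> from sklearn.naive_bayes import MultinomialNB
>>> from sklearn.pipeline import Pipeline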
>>> text_clf = Pipeline([('vect', CountVectorizer()),
...                      ('tfidf', TfidfTransformer()),
...                      ('clf', MultinomialNB()),
... ])

>>> text_clf = text_clf.fit(twenty_train.data, twenty_train.target)

CountVectorizer expects strings as input. I suppose I could build a string from each dictionary, repeating every word as many times as it occurs?
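
For reference, the string-repetition workaround described above could look roughly like the sketch below. The helper name `counts_to_text` is hypothetical (it is not part of scikit-learn), and the sketch assumes every word survives CountVectorizer's default tokenizer, which drops single-character tokens.

```python
from sklearn.feature_extraction.text import CountVectorizer

x = [{'cat': 1, 'dog': 2}, {'food': 4, 'drink': 2}, {'baseball': 4, 'basketball': 5}]

def counts_to_text(counts):
    """Hypothetical helper: repeat each word `count` times, space-separated."""
    return ' '.join(' '.join([word] * n) for word, n in counts.items())

# Rebuild one pseudo-document per dictionary.
docs = [counts_to_text(d) for d in x]

# CountVectorizer re-tokenizes the text and recovers the original counts.
vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(docs)
```

This works, but it rebuilds text only to have it tokenized and counted again, so it is a roundabout route when the counts are already available.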

http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

【Comments】:

    Tags: python scikit-learn text-mining


    【Solution 1】:

    If you already have the word frequencies, use DictVectorizer instead of CountVectorizer:

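    from sklearn.feature_extraction import DictVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline
    
    # DictVectorizer maps each {word: count} dict to a sparse feature row,
    # so no tokenization of raw text is needed.
    pipeline = Pipeline([('dvect', DictVectorizer()),
                         ('tfidf', TfidfTransformer()),
                         ('clf', MultinomialNB())])
    model = pipeline.fit(x, y)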
    

    Then you can predict the category of a new word-count dictionary:

    >>> model.predict([{'cat':1}])[0]
    'animals'
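
To see what the DictVectorizer step hands to the rest of the pipeline, a minimal sketch like the following may help. It reuses the question's `x`; the variable names `dvect`, `dense`, and `unseen` are only illustrative.

```python
from sklearn.feature_extraction import DictVectorizer

x = [{'cat': 1, 'dog': 2}, {'food': 4, 'drink': 2}, {'baseball': 4, 'basketball': 5}]

dvect = DictVectorizer()
matrix = dvect.fit_transform(x)   # sparse matrix: one row per dict, one column per word

# Column order follows the (sorted) feature names learned during fit.
feature_names = dvect.feature_names_
dense = matrix.toarray()

# Words that were not seen during fit are silently ignored at transform time.
unseen = dvect.transform([{'cat': 1, 'unicorn': 3}]).toarray()
```

Each row of `dense` holds the counts from the corresponding dictionary, with zeros for words that dictionary does not contain.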
    

    【Discussion】:
