【Question title】: Random forest problems w/ 20 newsgroups
【Posted】: 2018-08-22 17:36:25
【Question】:

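I am trying to run a random forest algorithm on the 20 newsgroups dataset, but I can't figure out how to fix the error below. I have previously used SVM and NB on the same dataset, and both worked fine.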

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer


dataset_train=fetch_20newsgroups(subset='train',shuffle=True)
dataset_test=fetch_20newsgroups(subset='test',shuffle=True)
vectorizer=CountVectorizer()

x_train_counts=vectorizer.fit_transform(dataset_train.data)

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer=TfidfVectorizer(stop_words='english',lowercase=True,ngram_range=(1,5))
x_train_tfidf=vectorizer.fit_transform(dataset_train.data)

from sklearn.ensemble import RandomForestClassifier
model=RandomForestClassifier(n_estimators=10)
model=model.fit(dataset_train.data,dataset_train.target)

This is the error:

Traceback (most recent call last):
  File "C:/Users/new_randomforest.py", line 18, in <module>
    model=model.fit(dataset_train.data,dataset_train.target)
  File "C:\Users\forest.py", line 247, in fit
    X = check_array(X, accept_sparse="csc", dtype=DTYPE)
  File "C:\Users\validation.py", line 433, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: "From: lerxst@wam.umd.edu (where's my thing)\nSubject: WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization: University of Maryland, College Park\nLines: 15\n\n I was wondering if anyone out there could enlighten me on this car I saw\nthe other day. It was a 2-door sports car, looked to be from the late 60s/\nearly 70s. It was called a Bricklin. The doors were really small. In addition,\nthe front bumper was separate from the rest of the body. This is \nall I know. If anyone can tellme a model name, engine specs, years\nof production, where this car is made, history, or whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,\n- IL\n   ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n"

【Comments】:

    Tags: python-2.7 machine-learning random-forest decision-tree


    【Answer 1】:

    You need to fit the model on the training vectors you created (here, x_train_tfidf), not on the raw text:

    model.fit(x_train_tfidf,dataset_train.target)
    

    dataset_train.data is a list of strings here, and that is what causes the error.
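As a minimal sketch of the full train/test flow, the snippet below fits the forest on vectorized text and then reuses the same fitted vectorizer on held-out text. The toy corpus, the labels, and the names `train_docs`, `test_docs`, and `x_test_tfidf` are hypothetical stand-ins for `dataset_train.data`, `dataset_train.target`, and `dataset_test.data`, chosen so the example runs without downloading the dataset; `random_state=0` is also an addition for reproducibility.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for dataset_train.data / dataset_test.data.
train_docs = [
    "the engine and bumper of this sports car",
    "a two door car with small doors",
    "the goalie stopped the puck in the hockey game",
    "the hockey team won the game last night",
]
train_labels = [0, 0, 1, 1]  # stand-in for dataset_train.target
test_docs = ["what car is this", "who won the hockey game"]

vectorizer = TfidfVectorizer(stop_words="english", lowercase=True)

# Learn the vocabulary and IDF weights on the training text only.
x_train_tfidf = vectorizer.fit_transform(train_docs)

# Reuse the fitted vectorizer on the test text (transform, not fit_transform),
# so both matrices share the same feature columns.
x_test_tfidf = vectorizer.transform(test_docs)

model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(x_train_tfidf, train_labels)  # numeric features, not raw strings

predictions = model.predict(x_test_tfidf)
print(predictions)
```

With the real dataset, the same pattern would apply with `dataset_train.data` and `dataset_test.data` in place of the toy lists.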

    PS: The error basically means that you are trying to pass strings into your model, which is not allowed. Quoting the documentation:

    fit(X, y, sample_weight=None)
    

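    X : array-like or sparse matrix of shape = [n_samples, n_features]

    The training input samples. Internally, its dtype will be converted to dtype=np.float32. If a sparse matrix is provided, it will be converted into a sparse csc_matrix.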
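The documented behaviour can be checked with a small sketch: raw strings fail the float conversion during input validation, while a numeric sparse matrix is accepted. The two-row inputs here are hypothetical, and the exact wording of the error message may differ between scikit-learn versions, so the example only checks the exception type.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.ensemble import RandomForestClassifier

labels = [0, 1]
model = RandomForestClassifier(n_estimators=10, random_state=0)

# Raw text cannot be cast to float, so fit() rejects it with a ValueError.
try:
    model.fit([["some raw text"], ["more raw text"]], labels)
    raised = False
except ValueError:
    raised = True

# A numeric sparse matrix (such as the output of a vectorizer) is accepted.
X = csr_matrix(np.array([[1.0, 0.0], [0.0, 1.0]]))
model.fit(X, labels)
predictions = model.predict(X)
```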

    【Comments】:
