Classification of int and string txt throws ValueError: Number of features of the model must match the input. Model n_features

[Question title]: Classification of int and string txt throws ValueError: Number of features of the model must match the input. Model n_features
[Posted]: 2020-07-03 01:34:48
[Question description]:

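I'm new to machine learning, so apologies in advance. I'm trying to read from a txt file that contains training samples like this:

123 this is a long string of text

325 another text

My labels.txt file looks like this: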

123 1

325 2

After many attempts, I managed to read them with pandas:

train_labels = pd.read_csv('train_labels.txt', nrows=200, dtype=str, delimiter="\t", header=None)

train_samples = pd.read_csv('train_samples.txt', nrows=200, dtype=str, encoding="UTF-8", delimiter="\t", header=None)

Then I use a vectorizer to transform the string column of my training samples:

from sklearn.feature_extraction.text import TfidfVectorizer

tfidfconverter = TfidfVectorizer(max_features=1500, min_df=5, max_df=0.7, stop_words=stop_words)

X = tfidfconverter.fit_transform(train_samples.iloc[:, 1]).toarray()

Then I try to fit a random forest classifier:

clf = RandomForestClassifier(n_estimators=1000, random_state=0)

clf.fit(X, train_labels) -> error

Then I read the validation samples to compute my accuracy score:

validation_source_samples = pd.read_csv('validation_source_samples.txt', nrows=200, dtype=str, encoding="UTF-8", delimiter="\t", header=None)

validation_source_labels = pd.read_csv('validation_source_labels.txt', nrows=200, dtype=str, delimiter="\t", header=None)

T = tfidfconverter.fit_transform(validation_source_samples.iloc[:, 1]).toarray()


pred = clf.predict(T)

At clf.predict I get the error:

`ValueError: Number of features of the model must match the input`.

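Model n_features is 780 and input n_features is 879

I have searched for answers to this kind of error, but nothing seemed to match my actual input files and problem. Apologies in advance if this has already been answered.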

[Discussion]:

Tags: python machine-learning scikit-learn

[Solution 1]:

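This happens because you fit the vectorizer again on the validation data, while the model was trained on features produced by the vectorizer fitted on the training data. Refitting builds a new vocabulary, so the number of columns changes (879 instead of 780). You can fix it by changing fit_transform to transform on the validation line, like this: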

T = tfidfconverter.transform(validation_source_samples.iloc[:, 1]).toarray()
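
To make the column mismatch concrete, here is a minimal sketch on made-up toy data (the sentences, labels, and variable names below are illustrative assumptions, not the asker's files). It fits the vectorizer once on the training text, reuses it for the validation text, and, for contrast, shows that a refit on the validation text alone yields a different number of features.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the text column of the training/validation files (made up).
train_texts = ["red apple pie", "green apple tart", "fast red car", "slow green truck"]
train_y = ["1", "1", "2", "2"]
val_texts = ["apple pie recipe", "red truck engine"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(train_texts)  # learns vocabulary and IDF weights from training text
T = vectorizer.transform(val_texts)        # reuses that vocabulary; nothing is refit

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X, train_y)
pred = clf.predict(T)                      # works: T has the same columns as X

# For contrast only: refitting on the validation text builds a different vocabulary.
T_refit = TfidfVectorizer().fit_transform(val_texts)

print("train features:", X.shape[1])
print("validation features (transform):", T.shape[1])
print("validation features (refit):", T_refit.shape[1])
```

Words that appear only in the validation text (here "recipe" and "engine") are simply ignored by transform, which is what keeps the column count aligned with what the classifier saw during fit.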

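[Comments]:

I see! That fixed the error, but it seems my predictions come from the model built on the first train_samples/labels rather than from the set I'm testing (validation..)
Are you sure? How do you know?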
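
One way to check this, sketched on made-up toy data (all names and values below are assumptions for illustration): the model is always learned from the training set, so what should come from the validation set is the rows being predicted and the labels the predictions are scored against. The sketch confirms that there is one prediction per validation row and scores them with accuracy_score.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score

# Toy stand-ins for the training and validation files (made up).
train_texts = ["red apple pie", "green apple tart", "fast red car", "slow green truck"]
train_y = ["1", "1", "2", "2"]
val_texts = ["apple pie recipe", "red truck engine", "green apple"]
val_y = ["1", "2", "1"]

vectorizer = TfidfVectorizer()
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(vectorizer.fit_transform(train_texts), train_y)  # model learned from training data

pred = clf.predict(vectorizer.transform(val_texts))      # predictions for validation rows

# One prediction per validation row means the output refers to the validation set.
print("rows predicted:", len(pred), "validation rows:", len(val_texts))

# Score against the validation labels, not the training labels.
acc = accuracy_score(val_y, pred)
print("validation accuracy:", acc)
```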