Running a trained machine learning model on a different dataset

【Question title】: Run trained Machine Learning model on a different dataset
【Posted】: 2019-05-13 09:18:47
【Question】:

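I'm new to machine learning and am trying to run a simple classification model, which I trained and saved with pickle, on another dataset of the same format. I have the following Python code.

Code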

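import numpy as np
import pandas as pd
from sklearn import model_selection
from termcolor import colored

# Training set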
features = pd.read_csv('../Data/Train_sop_Computed.csv')
#Testing set
testFeatures = pd.read_csv('../Data/Test_sop_Computed.csv')

print(colored('\nThe shape of our features is:','green'), features.shape)
print(colored('\nThe shape of our Test features is:','green'), testFeatures.shape)

features = pd.get_dummies(features)
testFeatures = pd.get_dummies(testFeatures)

features.iloc[:,5:].head(5)
testFeatures.iloc[:,5].head(5)

labels = np.array(features['Truth'])
testlabels = np.array(testFeatures['Truth'])

features= features.drop('Truth', axis = 1)
testFeatures = testFeatures.drop('Truth', axis = 1)

feature_list = list(features.columns)
testFeature_list = list(testFeatures.columns)

def add_missing_dummy_columns(d, columns):
    missing_cols = set(columns) - set(d.columns)
    for c in missing_cols:
        d[c] = 0


def fix_columns(d, columns):
    add_missing_dummy_columns(d, columns)

    # make sure we have all the columns we need
    assert (set(columns) - set(d.columns) == set())

    extra_cols = set(d.columns) - set(columns)
    if extra_cols: print("extra columns:", extra_cols)

    d = d[columns]
    return d


testFeatures = fix_columns(testFeatures, features.columns)

features = np.array(features)
testFeatures = np.array(testFeatures)

train_samples = 100

X_train, X_test, y_train, y_test = model_selection.train_test_split(features, labels, test_size = 0.25, random_state = 42)
testX_train, textX_test, testy_train, testy_test = model_selection.train_test_split(testFeatures, testlabels, test_size= 0.25, random_state = 42)

print(colored('\n        TRAINING SET','yellow'))
print(colored('\nTraining Features Shape:','magenta'), X_train.shape)
print(colored('Training Labels Shape:','magenta'), X_test.shape)
print(colored('Testing Features Shape:','magenta'), y_train.shape)
print(colored('Testing Labels Shape:','magenta'), y_test.shape)

print(colored('\n        TESTING SETS','yellow'))
print(colored('\nTraining Features Shape:','magenta'), testX_train.shape)
print(colored('Training Labels Shape:','magenta'), textX_test.shape)
print(colored('Testing Features Shape:','magenta'), testy_train.shape)
print(colored('Testing Labels Shape:','magenta'), testy_test.shape)

from sklearn.metrics import precision_recall_fscore_support

import pickle

loaded_model_RFC = pickle.load(open('../other/SOPmodel_RFC', 'rb'))
result_RFC = loaded_model_RFC.score(textX_test, testy_test)
print(colored('Random Forest Classifier: ','magenta'),result_RFC)

loaded_model_SVC = pickle.load(open('../other/SOPmodel_SVC', 'rb'))
result_SVC = loaded_model_SVC.score(textX_test, testy_test)
print(colored('Support Vector Classifier: ','magenta'),result_SVC)

loaded_model_GPC = pickle.load(open('../other/SOPmodel_Gaussian', 'rb'))
result_GPC = loaded_model_GPC.score(textX_test, testy_test)
print(colored('Gaussian Process Classifier: ','magenta'),result_GPC)

loaded_model_SGD = pickle.load(open('../other/SOPmodel_SGD', 'rb'))
result_SGD = loaded_model_SGD.score(textX_test, testy_test)
print(colored('Stochastic Gradient Descent: ','magenta'),result_SGD)

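I am able to get results for the test set.

The problem I'm facing is that I need to run the models on the entire Test_sop_Computed.csv dataset, but they only run on the test subset I split off. I would appreciate any suggestions on how to run the loaded models on the entire dataset. I know I'm going wrong in the following line of code.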

testX_train, textX_test, testy_train, testy_test = model_selection.train_test_split(testFeatures, testlabels, test_size= 0.25, random_state = 42)

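Both the training and test datasets have the columns Subject, Predicate, Object, Computed and Truth, and Truth is the class to predict. The test dataset contains the actual values for this Truth column, which I remove with testFeatures = testFeatures.drop('Truth', axis = 1). I intend to use the various loaded classifier models to predict Truth as 0 or 1 for the entire dataset and then get the predictions as an array.

This is what I have done so far. But I think I am also splitting my test dataset. Is there a way to pass the entire test dataset through, even though it is in a different file?

This test dataset has the same format as the training set. I checked the shapes of both and got the following results.

Confirming features and shapes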

Shape of the Train features is: (1860, 5)
Shape of the Test features is: (1386, 5)

         TRAINING SET

Training Features Shape: (1395, 1045)
Training Labels Shape: (465, 1045)
Testing Features Shape: (1395,)
Testing Labels Shape: (465,)

          TEST SETS

Training Features Shape: (1039, 1045)
Training Labels Shape: (347, 1045)
Testing Features Shape: (1039,)
Testing Labels Shape: (347,)

Any suggestions in this regard would be highly appreciated.

【Comments】:

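Since this question has nothing to do with tensorflow, please don't spam that tag (removed); scikit-learn is more appropriate (added).
I'm quite confused about your dataset and the way you handle it. How can your training set contain both trainX and testX? What is testX_train supposed to mean?
@offeltoffel, the test prefix identifies the subsets from the unnecessary split I made of the test set. That was what I needed clarified, and it works now. Thanks for responding to my query.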

Tags: python machine-learning scikit-learn training-data

【Solution 1】:

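Your question is a bit unclear, but as I understand it, you want to run your model on both testX_train and testX_test (spelled textX_test in your code), which are just testFeatures split into two sub-datasets.

So you can run the model on testX_train just as you do on testX_test, for example: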

result_RFC_train = loaded_model_RFC.score(testX_train, testy_train)

Or you can delete the following line:

testX_train, textX_test, testy_train, testy_test = model_selection.train_test_split(testFeatures, testlabels, test_size= 0.25, random_state = 42)

That way you don't split the data, and you run the model on the complete dataset:

result_RFC_train = loaded_model_RFC.score(testFeatures, testlabels)
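As a rough, self-contained illustration of this second option, the sketch below trains a model, pickles it, loads it back, and scores it on a whole second dataset without calling train_test_split. The synthetic arrays and the names X_other, y_other and loaded_model are assumptions standing in for testFeatures, testlabels and loaded_model_RFC; the pickle round trip happens in memory rather than through the files used in the question.

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-ins for the two CSV files (assumption: identical feature columns).
rng = np.random.default_rng(42)
X_train = rng.normal(size=(200, 5))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)
X_other = rng.normal(size=(80, 5))                          # plays the role of testFeatures
y_other = (X_other[:, 0] + X_other[:, 1] > 0).astype(int)   # plays the role of testlabels

# Train and pickle a model (in memory here, instead of writing a file).
model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_train, y_train)
blob = pickle.dumps(model)

# Load the model back and score it on every row of the second dataset.
loaded_model = pickle.loads(blob)
full_accuracy = loaded_model.score(X_other, y_other)

print("rows scored:", X_other.shape[0])
print("accuracy on the full dataset:", full_accuracy)
```

Because the model was already fitted before it was pickled, the second dataset only needs the same columns in the same order; no further splitting is required to evaluate it.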

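【Comments】:

Thanks, Alexander. That works for me. One more thing I'd like to know: how do I get the predicted values, i.e. use predict and get the model's predictions as an array/list?
Don't hesitate to open a separate question to get a full, detailed answer, but I'll answer here. Depending on your model, you can get class values or probabilities: predicted_y_RFC = loaded_model_RFC.predict(testFeatures) and predicted_probas_y_RFC = loaded_model_RFC.predict_proba(testFeatures)
Thanks a million, Alexander. That's exactly what I was looking for.
You're welcome! Don't hesitate to look at the sklearn documentation: scikit-learn.org/stable/modules/generated/… It explains loading models, fitting, predicting, and so on. :) If you search the sklearn site a bit, you'll also find how to save models, plot data/predictions, score models, etc.
Thank you for the thorough explanation. I really appreciate it.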
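For reference, getting predictions as an array or list might look like the sketch below. It uses synthetic data and a freshly trained RandomForestClassifier as a stand-in for the loaded model; the names X_new, y_new, predicted and probas are assumptions for illustration. Note that predict_proba is not available on every estimator in every configuration: SVC, for example, only provides it when constructed with probability=True.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_fscore_support

# Synthetic data standing in for the training and test CSV files (assumption).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5))
y_train = (X_train[:, 0] > 0).astype(int)
X_new = rng.normal(size=(50, 5))
y_new = (X_new[:, 0] > 0).astype(int)

# Stand-in for a model loaded with pickle.
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

predicted = model.predict(X_new)       # one class label (0 or 1) per row
probas = model.predict_proba(X_new)    # one column per class, ordered as model.classes_
predicted_list = predicted.tolist()    # plain Python list, if an array is not wanted

# Compare the predictions with the known Truth values.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_new, predicted, average="binary"
)

print("predictions shape:", predicted.shape)
print("probabilities shape:", probas.shape)
print("precision, recall, F1:", precision, recall, f1)
```

The same calls apply to the full test set from the question, since predict only needs the feature columns and not the Truth labels.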