How can I make a single prediction on a model using sklearn in Python?

【Question title】: How can I make a single prediction on a model using sklearn in Python?
【Posted】: 2020-04-11 14:27:56
【Question description】:

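I have trained a machine-learning model with sklearn on a dataset of companies. The dataset has the following attributes: name, domain, year_founded, industry, size_range, locality, country, linkedin_url, current_employee_estimate, total_employee_estimate.

I want the model to predict the size_range value (one of eight categories, depending on the company's size) from the name and year_founded attributes. I did this with the following training code: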

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import logistic
from tools import pickleFile
from tools import unpickleFile
from tools import cleanDataset
from tools import getPrettyTimestamp
import sklearn
import pandas as pd
import numpy as np
import datetime
import sys


def train_model(clf, X_train, y_train, epochs=10):
    """
    Trains a specific model and returns a list of results

    :param clf: sklearn model
    :param X_train: encoded training data (attributes)
    :param y_train: training data (attribute to predict)
    :param epochs: number of iterations (default=10)
    :return: result (accuracy) for this training data
    """
    scores = []
    print("Starting training...")
    for i in range(1, epochs + 1):
        print("Epoch:" + str(i) + "/" + str(epochs) + " -- " + str(datetime.datetime.now()))
        clf.fit(X_train, y_train)
        score = clf.score(X_train, y_train)
        scores.append(score)
    print("Done training.  The score(s) is/are: " + str(scores))
    return scores

def main():

    # Parse the arguments.
    userRequestedTrain, filename = parseArgs()

    # Some custom Pandas settings - TODO remove this
    pd.set_option('display.max_columns', 30)
    pd.set_option('display.max_rows', 1000)

    dataset = pd.read_csv("companies_sorted.csv", nrows=50000)


    origLen = len(dataset)
    print(origLen)

    dataset = cleanDataset(dataset)

    cleanLen = len(dataset)
    print(cleanLen)

    print("\n======= Some Dataset Info =======\n")
    print("Dataset size (original):\t" + str(origLen))
    print("Dataset size (cleaned):\t" + str(len(dataset)))
    print("\nValues of size_range:\n")
    print(dataset['size_range'].value_counts())
    print()

    # size_range is the attribute to be predicted, so we pop it from the dataset
    sizeRange = dataset.pop("size_range").values

    # We split our dataset and attribute-to-be-predicted into training and testing subsets.
    xTrain, xTest, yTrain, yTest = train_test_split(dataset, sizeRange, test_size=0.25, random_state=1)


    print(xTrain.transpose())
    le = LabelEncoder()
    ohe = OneHotEncoder(sparse=False, handle_unknown='ignore')

    # Our feature set, i.e. the inputs to our machine-learning model.
    featureSet = ['name', 'year_founded']

    # Making a copy of test and train sets with only the columns we want.
    xTrain_sf = xTrain[featureSet].copy()
    xTest_sf = xTest[featureSet].copy()

    # Apply one-hot encoding to columns
    ohe.fit(xTrain_sf)

    print(xTrain_sf)
    print(xTest_sf)

    featureNames = ohe.get_feature_names()

    # Encoding test and train sets
    xTrain_sf_encoded = ohe.transform(xTrain_sf)
    xTest_sf_encoded = ohe.transform(xTest_sf)

    # ------ Using Logistic Regression classifier - TRAINING PHASE ------

    if userRequestedTrain:
        # We define the model we're going to use.
        lrModel = LogisticRegression(solver='lbfgs', multi_class="multinomial", max_iter=1000, random_state=1)

        # Now, let's train it.
        lrScores = train_model(lrModel, xTrain_sf_encoded, yTrain, 1)

        # Save the model as a file.
        filename = "models/Model_" + getPrettyTimestamp()
        print("Training done! Pickling model to " + str(filename) + "...")
        pickleFile(lrModel, filename)

    # Reload the model for testing.  If we didn't train the model ourselves, then it was specified as an argument.
    lrModel = unpickleFile(filename)

    PRED = lrModel.predict(xTrain_sf_encoded[0:10])

    print("Unpickled successfully from file " + str(filename))

    # ------- TESTING PHASE -------

    testLrScores = train_model(lrModel, xTest_sf_encoded, yTest, 1)

    if userRequestedTrain:
        trainScore = lrScores[0]
    else:
        trainScore = 0.9201578143173162  # Modal training score - substitute if we didn't train model ourselves

    testScore = testLrScores[0]

    scores = sorted([(trainScore, 'train'), (testScore, 'test')], key=lambda x: x[0], reverse=True)
    better_score = scores[0]  # largest score
    print(scores)

    # Which score was better?
    print("Better score: %s" % "{}".format(better_score))

    print("Pickling....")

    pickleFile(lrModel, "models/TESTING_" + getPrettyTimestamp())

This code runs successfully: the training and testing phases complete, and the testing phase reports an accuracy of about 60%:

Starting training...
Epoch:1/1 -- 2019-12-18 20:03:13.462479
Done training.  The score(s) is/are: [0.8854667949951877]
Training done! Pickling model to models/Model_2019-12-18_2003...
Unpickled successfully from file models/Model_2019-12-18_2003
= = = = = = = = = = = = = = = = = = = 

First 10 predictions:

['5001 - 10000' '10001+' '1001 - 5000' '5001 - 10000' '1001 - 5000'
 '1001 - 5000' '5001 - 10000' '1001 - 5000' '1001 - 5000' '1001 - 5000']
['5001 - 10000' '10001+' '1001 - 5000' '5001 - 10000' '1001 - 5000'
 '1001 - 5000' '5001 - 10000' '1001 - 5000' '1001 - 5000' '1001 - 5000']
 = = = = = = = = = = = = = 
Starting training...
Epoch:1/1 -- 2019-12-18 20:03:20.775392
Done training.  The score(s) is/are: [0.5906466512702079]
[(0.8854667949951877, 'train'), (0.5906466512702079, 'test')]
Better score: (0.8854667949951877, 'train')
Pickling....

Process finished with exit code 0

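However, suppose I want to use this model to make a SINGLE prediction, i.e. by passing it a company name and the year the company was founded. I do the following: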

lrModel = pickle.load(open(filename, 'rb'))
predictedSet = lrModel.predict([["SomeRandomCompany", 2019]])

But when I do this, I get the following ValueError:

  X = check_array(X, accept_sparse='csr')
Traceback (most recent call last):
  File "/home/ivor/Documents/companySizeEstimator/companySizeEstimator.py", line 85, in <module>
    main()
  File "/home/ivor/Documents/companySizeEstimator/companySizeEstimator.py", line 58, in main
    predictions(model, reducedSetEncoded, reducedSet)
  File "/home/ivor/Documents/companySizeEstimator/companySizeEstimator.py", line 80, in predictions
    predictedSet = lrModel.predict([["SomeCompany", 2019]])
  File "/home/ivor/Documents/companySizeEstimator/venv/lib/python3.8/site-packages/sklearn/linear_model/_base.py", line 293, in predict
    scores = self.decision_function(X)
  File "/home/ivor/Documents/companySizeEstimator/venv/lib/python3.8/site-packages/sklearn/linear_model/_base.py", line 272, in decision_function
    raise ValueError("X has %d features per sample; expecting %d"
ValueError: X has 2 features per sample; expecting 54897

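It appears to want a dataset with exactly the same shape as the one used to train it, i.e. a dataset with 11,000 rows. It gives good predictions in the testing phase, so the model is clearly capable of making predictions. How can I get it to make a prediction based on just one value, as shown above?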

【Question comments】:

Tags: python pandas machine-learning scikit-learn

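【Solution 1】:

When you train the model you use a dataset with N features, and the model expects the same number of features at prediction time. Because it was trained on those N features, the input to predict must have the same dimension. That is why you get the error "X has 2 features per sample; expecting 54897".

One thing you can do is create a matrix or DataFrame of zeros that matches the required dimension (N), then fill in the values to predict on at their exact positions.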
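
The zero-padded row described above does not have to be built by hand. A minimal sketch, assuming the fitted encoder from training is still available at prediction time: passing the single sample through that same `OneHotEncoder` yields a row of the width the model expects. The toy rows, labels and snake_case variable names below are hypothetical stand-ins, not the asker's data.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy rows standing in for companies_sorted.csv.
train = pd.DataFrame({
    "name": ["Acme", "Globex", "Initech", "Umbrella"],
    "year_founded": [1999, 2005, 1999, 2012],
})
labels = ["1 - 10", "11 - 50", "1 - 10", "11 - 50"]

feature_set = ["name", "year_founded"]

# Fit the encoder once, on the training features only.
ohe = OneHotEncoder(handle_unknown="ignore")
X_train = ohe.fit_transform(train[feature_set])

model = LogisticRegression(max_iter=1000)
model.fit(X_train, labels)

# A single sample is still a 2-D input with the same columns as the training frame...
single = pd.DataFrame([["SomeRandomCompany", 2019]], columns=feature_set)
# ...and it goes through the SAME fitted encoder before reaching the model.
single_encoded = ohe.transform(single)

prediction = model.predict(single_encoded)
print(single_encoded.shape, prediction)
```

Because `handle_unknown="ignore"` is set, a name or year the encoder has never seen becomes all zeros in its block of columns instead of raising an error. In the question's script only `lrModel` is pickled, so this approach would presumably also require saving `ohe` next to it.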

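【Comments】:

I am just curious what the prediction would be if only these two features are set and all the others are zero. If that works, it would mean all the other features are useless and there was no need to train on them.
What I don't understand is that I trained the model with 2 features, not 54,897; I only used name and year_founded. Is there really no way to get just one prediction from the model by passing it a company name (string) and a year (int)? Do I have to create a blank DataFrame the same size as the one used for training and replace the first entry with my values?
@ivorysoap Yes, the same number of columns. The number of rows can differ. This is one way to get predictions from the model. If name and year_founded are the first and second columns of your training set, make sure they are also the first and second columns of the test set, and set the other columns to zero.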

【Solution 2】:

I think you should double-check the df used for training, xTrain_sf_encoded: it should be a 2-column DataFrame, yet for some reason it has 54,897 columns.
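
When checking the width of the encoded data, it may help to look at the encoder itself: a `OneHotEncoder` emits one output column per distinct value it saw in each input column, so two input columns can become much wider after encoding. A minimal sketch on hypothetical toy data (the real dataset has far more distinct names):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with two input columns.
frame = pd.DataFrame({
    "name": ["Acme", "Globex", "Initech", "Umbrella", "Acme"],
    "year_founded": [1999, 2005, 1999, 2012, 2005],
})

ohe = OneHotEncoder(handle_unknown="ignore")
encoded = ohe.fit_transform(frame)

# categories_ holds the distinct values found in each input column.
columns_per_feature = [len(cats) for cats in ohe.categories_]

print(frame.shape)          # shape of the raw input
print(columns_per_feature)  # distinct names, distinct years
print(encoded.shape)        # width equals the sum of the counts above
```

Applying the same check to the `ohe` object in the question would show how the 54897 columns are split between name and year_founded.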

One more thing: why are you doing this in the testing phase?

testLrScores = train_model(lrModel, xTest_sf_encoded, yTest, 1)

You are retraining the model, whereas I believe you want to test it like this:

# Print Predictions
yPred = lrModel.predict(xTest_sf_encoded)
print(yPred)
# Print the actual values
print(yTest)
# Compare
print(yPred==yTest)
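
The element-wise comparison above can also be reduced to one accuracy figure without calling `fit` again. A minimal sketch, using hypothetical toy arrays in place of `xTest_sf_encoded` and `yTest`:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical toy arrays standing in for the encoded train/test sets.
X_train = np.array([[0.0], [0.2], [0.8], [1.0]])
y_train = np.array(["small", "small", "large", "large"])
X_test = np.array([[0.1], [0.9]])
y_test = np.array(["small", "large"])

# The model is fitted on the training data only.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Evaluate on held-out data without calling fit() again.
y_pred = model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)

# For classifiers, model.score(X, y) reports the same mean accuracy.
print(test_accuracy, model.score(X_test, y_test))
```

In the question's script, `train_model` calls `clf.fit` before scoring, so the reported "test" score is measured on a model that was refitted on the test set.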

【Comments】: