How can we make a prediction using Scikit-Learn Classifiers? Answers

【Question title】: How can we make a prediction using Scikit-Learn Classifiers?
【Posted】: 2022-09-27 15:43:55
【Question description】:

I can easily train and test a classifier using the code below.

# Load libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier

# Step 1: Create data set


# Define the headers since the data does not have any
headers = ["symboling", "normalized_losses", "make", "fuel_type", "aspiration",
           "num_doors", "body_style", "drive_wheels", "engine_location",
           "wheel_base", "length", "width", "height", "curb_weight",
           "engine_type", "num_cylinders", "engine_size", "fuel_system",
           "bore", "stroke", "compression_ratio", "horsepower", "peak_rpm",
           "city_mpg", "highway_mpg", "price"]

# Read in the CSV file and convert "?" to NaN
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data",
                  header=None, names=headers, na_values="?" )
df.head()

df.columns

df_fin = pd.DataFrame({col: df[col].astype('category').cat.codes for col in df}, index=df.index)
df_fin


X = df_fin[['symboling', 'normalized_losses', 'make', 'fuel_type', 'aspiration',
       'num_doors', 'body_style', 'drive_wheels', 'engine_location',
       'wheel_base', 'length', 'width', 'height', 'curb_weight', 'engine_type',
       'num_cylinders', 'engine_size', 'fuel_system', 'bore', 'stroke',
       'compression_ratio', 'horsepower', 'peak_rpm']]
y = df_fin['city_mpg']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


# Fit a Decision Tree model
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy_score(y_test, y_pred)

Now, how can I predict the target (dependent) variable from the independent variables?

Something like this should work, I think, but it doesn't...

clf.predict([[2,164,'audi','gas','std','four','sedan','fwd','front',99.8,176.6,66.2,54.3,2337,'ohc','four',109,'mpfi',3.19,3.4,10,102,5500,24,30,13950,]])

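I want to predict the dependent variable while keeping the numbers as numbers and putting quotes around the labels, but I can't because of the label data. If the data were all numeric and this were a regression problem, it would work! My question is: how can we pass in numbers and labels the way a real person would understand them, rather than the numbers the labels were converted into? I have to believe that the labels get converted into numbers (one-hot encoding, categorical codes, or whatever) before training and testing are done, right?

Here is the error message I get.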

clf.predict([[2,164,'audi','gas','std','four','sedan','fwd','front',99.8,176.6,66.2,54.3,2337,'ohc','four',109,'mpfi',3.19,3.4,10,102,5500,24,30,13950,]])
C:\Users\ryans\anaconda3\lib\site-packages\sklearn\base.py:450: UserWarning: X does not have valid feature names, but DecisionTreeClassifier was fitted with feature names
  warnings.warn(

Traceback (most recent call last):

  Input In [20] in <cell line: 1>
    clf.predict([[2,164,'audi','gas','std','four','sedan','fwd','front',99.8,176.6,66.2,54.3,2337,'ohc','four',109,'mpfi',3.19,3.4,10,102,5500,24,30,13950,]])

  File ~\anaconda3\lib\site-packages\sklearn\tree\_classes.py:505 in predict
    X = self._validate_X_predict(X, check_input)

  File ~\anaconda3\lib\site-packages\sklearn\tree\_classes.py:471 in _validate_X_predict
    X = self._validate_data(X, dtype=DTYPE, accept_sparse="csr", reset=False)

  File ~\anaconda3\lib\site-packages\sklearn\base.py:577 in _validate_data
    X = check_array(X, input_name="X", **check_params)

  File ~\anaconda3\lib\site-packages\sklearn\utils\validation.py:856 in check_array
    array = np.asarray(array, order=order, dtype=dtype)

ValueError: could not convert string to float: 'audi'

Please post the traceback.

Tags: python machine-learning scikit-learn data-science classification

【Solution 1】:

You can build, for each column, a mapping from the column's raw values to their corresponding category codes:

col_dictionary = {}
for col in df:
    dictionary = dict(enumerate(df[col].astype('category').cat.categories))
    col_dictionary[col] = {v: k for k, v in dictionary.items()}

This gives:

{'symboling': {-2: 0, -1: 1, 0: 2, ..., 3: 5},
 'normalized_losses': {65.0: 0, 74.0: 1, ..., 197.0: 48, 231.0: 49, 256.0: 50},
 'make': {'alfa-romero': 0, 'audi': 1, 'bmw': 2, 'chevrolet': 3, 'dodge': 4, ..., 'volkswagen': 20, 'volvo': 21},
 'fuel_type': {'diesel': 0, 'gas': 1},
 'aspiration': {'std': 0, 'turbo': 1},
 'num_doors': {'four': 0, 'two': 1},
 'body_style': {'convertible': 0, 'hardtop': 1, 'hatchback': 2, 'sedan': 3, 'wagon': 4},
 'drive_wheels': {'4wd': 0, 'fwd': 1, 'rwd': 2},
 'engine_location': {'front': 0, 'rear': 1},
 'wheel_base': {86.6: 0, 88.4: 1, ..., 115.6: 51, 120.9: 52},
 'length': {141.1: 0, 144.6: 1, ..., 202.6: 73, 208.1: 74},
 'width': {60.3: 0, 61.8: 1, ..., 59.1: 47, 59.8: 48},
 'curb_weight': {1488: 0, 1713: 1, 1819: 2, ..., 4066: 170},
 'engine_type': {'dohc': 0, 'dohcv': 1, 'l': 2, 'ohc': 3, 'ohcf': 4, 'ohcv': 5, 'rotor': 6},
 'num_cylinders': {'eight': 0, 'five': 1, 'four': 2, 'six': 3, 'three': 4, 'twelve': 5, 'two': 6},
 'engine_size': {61: 0, 70: 1, 79: 2, ..., 304: 41, 308: 42, 326: 43},
 'fuel_system': {'1bbl': 0, '2bbl': 1, '4bbl': 2, 'idi': 3, 'mfi': 4, 'mpfi': 5, 'spdi': 6, 'spfi': 7},
 'bore': {2.54: 0, 2.68: 1, ..., 3.94: 37},
 'stroke': {2.07: 0, 2.19: 1, ..., 3.9: 34, 4.17: 35},
 'compression_ratio': {7.0: 0, 7.5: 1, ..., 23.0: 31},
 'horsepower': {48.0: 0, 52.0: 1, ..., 288.0: 58},
 'peak_rpm': {4150.0: 0, ..., 6600.0: 22},
 'city_mpg': {13: 0, 14: 1, 15: 2, ..., 49: 28},
 'highway_mpg': {16: 0, ..., 53: 28, 54: 29},
 'price': {5118.0: 0, 5151.0: 1, ..., 41315.0: 184, 45400.0: 185}}
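
To make it clearer where these codes come from, here is a minimal sketch on a made-up Series (the values and the names `s`, `mapping` and `inverse` are illustrative only, not part of the original data). It relies on pandas assigning codes in sorted category order and giving missing values the code -1, so a NaN never appears as a key in the mapping:

```python
import numpy as np
import pandas as pd

# Hypothetical toy column standing in for something like 'fuel_type'
s = pd.Series(["gas", "diesel", np.nan, "gas"])
cat = s.astype("category")

# Codes follow the sorted categories; NaN is encoded as -1
codes = cat.cat.codes.tolist()

# Raw value -> code, built the same way as col_dictionary
mapping = {value: code for code, value in enumerate(cat.cat.categories)}

# Code -> raw value, for translating codes back into readable labels
inverse = dict(enumerate(cat.cat.categories))

print(codes)
print(mapping)
print(inverse)
```

Because NaN has no entry, looking up a missing value in such a mapping raises a `KeyError`, as does any value that never occurred in the data used to build it.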

Then use this mapping to convert the row you want to predict:

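# One raw value per column of X (23 features); city_mpg, highway_mpg and price are not inputs
prediction_values = [2, 164, 'audi', 'gas', 'std', 'four', 'sedan', 'fwd', 'front', 99.8, 176.6, 66.2, 54.3, 2337, 'ohc', 'four', 109, 'mpfi', 3.19, 3.4, 10, 102, 5500]
to_predict = []
for (column, value) in zip(X.columns, prediction_values):
    # Look up the code for this raw value (KeyError if the value never occurred in df)
    to_predict.append(col_dictionary[column][value])
to_predict_df = pd.DataFrame([to_predict], columns=X.columns)
# Pass the DataFrame itself so the feature names match those seen during fit
pred_code = clf.predict(to_predict_df)[0]
# y was encoded as well, so translate the predicted code back into a city_mpg value
code_to_mpg = {code: mpg for mpg, code in col_dictionary['city_mpg'].items()}
code_to_mpg[pred_code]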
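
A hand-built mapping has to be reapplied for every prediction. One alternative, sketched here as an assumption rather than as part of the original answer, is to let scikit-learn do the encoding inside a `Pipeline`, so that `predict` accepts the readable labels directly. The toy DataFrame, the column subset and the names `X_toy`, `y_toy` and `model` are hypothetical, and the sketch assumes no missing values. `OrdinalEncoder` is used because it mirrors integer category codes; one-hot encoding would be an equally reasonable choice.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data standing in for the auto dataset (no missing values)
train = pd.DataFrame({
    "make": ["audi", "bmw", "audi", "volvo", "bmw", "volvo"],
    "fuel_type": ["gas", "gas", "diesel", "gas", "diesel", "diesel"],
    "horsepower": [102, 121, 110, 114, 182, 106],
    "city_mpg": [24, 21, 19, 23, 16, 26],
})
categorical = ["make", "fuel_type"]
X_toy = train.drop(columns="city_mpg")
y_toy = train["city_mpg"]  # raw target values, so predictions need no decoding

# Encode only the string columns; numeric columns pass through unchanged.
# Labels unseen during fit are encoded as -1 instead of raising an error.
preprocess = ColumnTransformer(
    [("cat", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1), categorical)],
    remainder="passthrough",
)
model = Pipeline([
    ("preprocess", preprocess),
    ("tree", DecisionTreeClassifier(random_state=0)),
])
model.fit(X_toy, y_toy)

# Predict from human-readable values; the pipeline encodes them internally
new_car = pd.DataFrame([{"make": "audi", "fuel_type": "gas", "horsepower": 102}])
prediction = model.predict(new_car)
print(prediction)
```

Because the encoder is fitted together with the classifier, the same label-to-number conversion is applied at training and prediction time without any separate bookkeeping.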

【Discussion】: