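[Posted]: 2021-02-04 22:59:29
[Question]:
I'm currently learning ML, and I'm using scikit-learn to preprocess two txt files (one for training, one for testing).
After loading the data from the two files into two DataFrames and moving the labels into two further DataFrames (train_y and test_y), I apply transformations that one-hot encode the labels and standardize the numeric data.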
import pandas as pd

# col_names is defined elsewhere (not shown in this post)
train = pd.read_csv("./training.txt", delimiter="\t", header=None, names=col_names)
test = pd.read_csv("./test.txt", delimiter="\t", header=None, names=col_names)
# Shuffle the rows of each set
train = train.sample(frac=1).reset_index(drop=True)
test = test.sample(frac=1).reset_index(drop=True)
# Separate the features from the "style" label
train_x = train.drop(["style"], axis=1)
test_x = test.drop(["style"], axis=1)
train_y = train["style"].to_frame()
test_y = test["style"].to_frame()
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_attribs_X = [
    "calorific_value", "nitrogen", "turbidity", "alcohol",
    "sugars", "bitterness", "colour", "degree_of_fermentation",
]
num_pipeline_X = Pipeline([('std_scaler', StandardScaler())])
full_pipeline_X = ColumnTransformer([
    ("num", num_pipeline_X, num_attribs_X),
    ("cat", OneHotEncoder(), ["beer_id"]),
])
full_pipeline_Y = ColumnTransformer([
    ("cat", OneHotEncoder(), ["style"]),
])
train_x_prepared = full_pipeline_X.fit_transform(train_x)
test_x_prepared = full_pipeline_X.fit_transform(test_x)
train_y_prepared = full_pipeline_Y.fit_transform(train_y)
test_y_prepared = full_pipeline_Y.fit_transform(test_y)
However, although train_x and test_x have the same number of columns, train_x_prepared and test_x_prepared do not. I'm not sure why this happens. Is there a better way to do what I did above?
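A likely cause, sketched here on made-up data: calling fit_transform on the test set refits OneHotEncoder, so it learns its category list from whatever beer_id values happen to be in that file, and the number of one-hot columns then differs between the two sets. The toy DataFrames and the build_preprocessor helper below are hypothetical stand-ins for the real files and pipelines. The sketch fits once on the training data and only calls transform on the test data, and it assumes that test-only categories should simply be ignored (handle_unknown="ignore").

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data standing in for training.txt / test.txt.
train_x = pd.DataFrame({
    "alcohol": [4.5, 5.0, 6.2, 7.1],
    "beer_id": ["a", "b", "c", "d"],
})
test_x = pd.DataFrame({
    "alcohol": [4.8, 6.0],
    "beer_id": ["a", "e"],  # "e" never appears in the training set
})


def build_preprocessor():
    """Hypothetical helper: returns a fresh, unfitted ColumnTransformer."""
    return ColumnTransformer([
        ("num", StandardScaler(), ["alcohol"]),
        # Assumption: categories unseen during fit become all-zero rows.
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["beer_id"]),
    ])


# What the post does: refit on each set, so each encoder learns
# a different list of beer_id categories.
refit_train = build_preprocessor().fit_transform(train_x)
refit_test = build_preprocessor().fit_transform(test_x)

# Fit on the training set only, then reuse the fitted transformer.
preprocessor = build_preprocessor()
train_x_prepared = preprocessor.fit_transform(train_x)
test_x_prepared = preprocessor.transform(test_x)

print("refit:   ", refit_train.shape, refit_test.shape)
print("fit once:", train_x_prepared.shape, test_x_prepared.shape)
```

With this toy data the refitted outputs have 5 and 3 columns, while the fit-once outputs both have 5. The same fit-on-train, transform-on-test pattern would apply to the label pipeline (full_pipeline_Y) in the post.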
[Discussion]:
Tags: python pandas machine-learning scikit-learn