How can I store one-hot encodings as an object?

【Question title】: How can I store one-hot encodings as an object?
【Posted】: 2019-09-23 16:33:58
【Question】:

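First, some background on my model architecture.

The inputs to my Keras model are fairly simple:

Categorical variable A
Categorical variable B
Numeric input C, in the range [0,1].

The model has a single output:

A number in [0,1]

When training the model, my input data is a dataframe fetched from a SQL database with pd.read_sql(). I one-hot encode the categorical variables A and B (which are in col1 and col2 of the dataframe original_data, respectively) with the following function: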

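import numpy as np
from keras import utils as np_utils

def preprocess_categorical_features(self):
    # to_categorical expects integer class indices and, unless num_classes
    # is given, sizes its output from the largest index it sees
    col1 = np_utils.to_categorical(np.copy(self.original_data.CURRENT_RTIF.values))
    col2 = np_utils.to_categorical(np.copy(self.original_data.NEXT_RTIF.values))
    # concatenate the two encodings column-wise
    cat_input_data = np.append(col1, col2, axis=1)
    return cat_input_data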

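Later, when I need to make predictions with this model, the input data comes from a live feed that RabbitMQ delivers as dictionaries. That RabbitMQ data has to be handled by its own (different) preprocess_categorical_features() function.

This brings me to my question: how can I make sure the one-hot encodings are exactly the same whether I am preprocessing data from the database or from the RabbitMQ feed?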

One-hot encoding of A as applied to the database data:

|---------------------|------------------|
|          A          | One-Hot-Encoding |
|---------------------|------------------|
|       "coconut"     |      <0,1,0,0>   |
|---------------------|------------------|
|       "apple"       |      <1,0,0,0>   |
|---------------------|------------------|
|       "quince"      |      <0,0,0,1>   |
|---------------------|------------------|
|       "plum"        |      <0,0,1,0>   |
|---------------------|------------------|

One-hot encoding of A as applied to the RabbitMQ data (the two must be identical):

|---------------------|------------------|
|          A          | One-Hot-Encoding |
|---------------------|------------------|
|       "coconut"     |      <0,1,0,0>   |
|---------------------|------------------|
|       "apple"       |      <1,0,0,0>   |
|---------------------|------------------|
|       "quince"      |      <0,0,0,1>   |
|---------------------|------------------|
|       "plum"        |      <0,0,1,0>   |
|---------------------|------------------|

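Is there a way to save the encoding as a dataframe, a numpy ndarray, or a dictionary, so that I can pass it from the function that preprocesses my training data to the function that preprocesses my live input data? I am open to using a library other than Keras for the one-hot encoding, but I would like to know whether this can be done with Keras's to_categorical function, which I am currently using.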
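One way the dictionary idea could look is sketched below. It is only an illustration under stated assumptions: the helper names build_category_index and one_hot, the sample values, and the choice of JSON for storage are all hypothetical and not part of the original post. The sketch uses plain numpy so that it runs without Keras; the comment marks where keras.utils.to_categorical with an explicit num_classes could be used instead.

```python
import json
import numpy as np

def build_category_index(values):
    """Map each distinct category to a fixed integer index (sorted for determinism)."""
    return {category: i for i, category in enumerate(sorted(set(values)))}

def one_hot(values, category_index):
    """One-hot encode values using a previously stored category -> index mapping."""
    indices = [category_index[v] for v in values]  # raises KeyError on an unseen category
    # Could be replaced by:
    # keras.utils.to_categorical(indices, num_classes=len(category_index))
    return np.eye(len(category_index), dtype="float32")[indices]

# Training time: build the mapping once and persist it (hypothetical sample values).
training_values = ["coconut", "apple", "quince", "plum", "apple"]
category_index = build_category_index(training_values)
stored = json.dumps(category_index)  # could be written to a file instead

# Prediction time: reload the mapping and encode a single live value.
restored = json.loads(stored)
live_row = one_hot(["plum"], restored)
```

Fixing the vector width to the size of the stored mapping matters here: without an explicit number of classes, a single live value would be encoded with a width derived from that value alone rather than from the training data.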

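【Question discussion】:

Please also post the expected dataframe.
@anky_91 Hey, thanks! I'm not really expecting a dataframe. The function I wrote above takes a dataframe as input (via self.original_data) but returns cat_input_data, which is a numpy ndarray.
What I'm trying to figure out is how to store the mapping from categorical values to the one-hot vectors produced by Keras's utils.to_categorical() function, so that I can reuse that mapping in later preprocessing.

Tags: python numpy keras categorical-data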

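【Solution 1】:

I decided to use sklearn.preprocessing.OneHotEncoder instead of relying on Keras's utils.to_categorical method. This lets me create a one-hot encoder object, self.encoder, while processing the training data: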

import numpy as np
from sklearn.preprocessing import OneHotEncoder

class TrainingData:
    def preprocess_categorical_features(self):
        # create the OneHotEncoder and keep it on the instance for later reuse
        # (scikit-learn < 1.2 spells this argument sparse=False)
        self.encoder = OneHotEncoder(sparse_output=False)

        # fit the encoder on CURRENT_RTIF; this assumes every category in
        # NEXT_RTIF also occurs in CURRENT_RTIF, otherwise transform() raises
        self.encoder.fit(self.original_data.CURRENT_RTIF.values.reshape(-1, 1))

        # one-hot encode both categorical columns of the training data
        col1 = self.encoder.transform(self.original_data.CURRENT_RTIF.values.reshape(-1, 1))
        col2 = self.encoder.transform(self.original_data.NEXT_RTIF.values.reshape(-1, 1))

        # return the one-hot-encoded data as a numpy ndarray
        cat_input_data = np.append(col1, col2, axis=1)
        return cat_input_data
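If the training and prediction code run in separate processes, the fitted encoder cannot simply be passed as an argument and would need to be stored somewhere. The sketch below assumes the standard-library pickle module is acceptable for that; the sample category values are illustrative and not taken from the original data. It leaves the encoder's output at its sparse default and densifies with .toarray().

```python
import pickle
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Fit on the training categories (illustrative values).
train_values = np.array(["coconut", "apple", "quince", "plum"]).reshape(-1, 1)
encoder = OneHotEncoder()  # default output is a sparse matrix; densified below
encoder.fit(train_values)

# Serialize the fitted encoder; in practice the bytes could be written to a file.
blob = pickle.dumps(encoder)

# Later, possibly in another process: restore the encoder and reuse it.
restored_encoder = pickle.loads(blob)
live_values = np.array(["plum"]).reshape(-1, 1)
encoded = restored_encoder.transform(live_values).toarray()
```

The learned categories are available as restored_encoder.categories_, which is a convenient way to check that the column order matches what the model was trained on. A pickled estimator is best loaded with the same scikit-learn version that created it.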

Later, I can reuse that encoder by passing it (as the training_data_ohe_encoder argument) to the method that preprocesses the input data needed to make predictions.

import numpy as np

class LiveData:
    def preprocess_categorical_features(self, training_data_ohe_encoder):
        # note the training_data_ohe_encoder parameter; this is the
        # encoder attribute from the TrainingData class

        # one-hot encode the live data with the encoder fitted on the training data
        col1 = training_data_ohe_encoder.transform(np.copy(self.preprocessed_data.CURRENT_RTIF.values).reshape(-1, 1))
        col2 = training_data_ohe_encoder.transform(np.copy(self.preprocessed_data.NEXT_RTIF.values).reshape(-1, 1))

        # return the one-hot-encoded data as a numpy ndarray
        cat_input_data = np.append(col1, col2, axis=1)
        return cat_input_data
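Since the live feed arrives as dictionaries, a single message can also be encoded directly, without building a dataframe first. The sketch below is a variation on the approach above, not the author's code: the encode_message helper, the message contents, and the sample categories are hypothetical. It also makes two different choices: the encoder is fitted on the union of both columns, and handle_unknown="ignore" is set so that a category never seen in training becomes an all-zero vector rather than an error.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training columns; NEXT_RTIF contains a value ("plum")
# that never appears in CURRENT_RTIF.
current_rtif = np.array(["apple", "coconut", "quince"])
next_rtif = np.array(["coconut", "plum", "apple"])

# Fit on the union of both columns so either column can be transformed.
all_values = np.concatenate([current_rtif, next_rtif]).reshape(-1, 1)
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(all_values)

def encode_message(message, fitted_encoder):
    """One-hot encode both categorical fields of a single dict-shaped message."""
    col1 = fitted_encoder.transform([[message["CURRENT_RTIF"]]]).toarray()
    col2 = fitted_encoder.transform([[message["NEXT_RTIF"]]]).toarray()
    return np.append(col1, col2, axis=1)

# Hypothetical live message; "mango" was never seen during training.
row = encode_message({"CURRENT_RTIF": "plum", "NEXT_RTIF": "mango"}, encoder)
```

Whether an all-zero vector is an acceptable input for an unseen category depends on the model; keeping the default handle_unknown="error" may be preferable if unseen categories should stop the pipeline.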

【Discussion】: