TensorFlow: create y-indices from class labels

【Question title】: TensorFlow: create y-indices from class labels
【Posted】: 2020-12-25 07:31:39
【Question】:

My class labels are:

y = ["class1", "class2", "class3"]

To use them in a model, I want to convert these classes into y_indices such as 1, 2, using methods from Keras and/or TensorFlow 2.0.

What I am currently doing is:

tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(y)
y_train = tokenizer.texts_to_sequences(y)

I know the Tokenizer is being misused here. Is there a better, more compact way to convert class labels to indices? Thanks.

【Comments】:

Tags: python numpy tensorflow keras deep-learning

【Solution 1】:

You cannot use Tokenizer for this, because Tokenizer indices start at 1, not 0. You can use tf.where instead:

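import tensorflow as tf

y = ['class3', 'class1', 'class1', 'class2', 'class3', 'class1', 'class2']

# The position of each name in this list is its class index (class1 -> 0, ...)
names = ["class1", "class2", "class3"]

# Compare a label with every name; tf.where returns the position of the match
labeler = lambda x: tf.where(tf.equal(x, names))

dataset = tf.data.Dataset.from_tensor_slices(y).map(labeler)

# First element: 'class3' -> index 2
next(iter(dataset))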

<tf.Tensor: shape=(1, 1), dtype=int64, numpy=array([[2]], dtype=int64)>

If you want to do this on a list or NumPy array, you can use scikit-learn:

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

le.fit_transform(y)

array([2, 0, 0, 1, 2, 0, 1], dtype=int64)
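If scikit-learn is not available, a similar result can probably be obtained with NumPy alone. The sketch below is not part of the original answer; it swaps in np.unique and assumes that alphabetical order of the class names is an acceptable index order, which matches the LabelEncoder output shown above.

```python
import numpy as np

y = ['class3', 'class1', 'class1', 'class2', 'class3', 'class1', 'class2']

# np.unique returns the sorted unique labels; return_inverse additionally gives,
# for each element of y, its position in that sorted array (a 0-based index).
classes, y_indices = np.unique(y, return_inverse=True)

print(classes)    # ['class1' 'class2' 'class3']
print(y_indices)  # [2 0 0 1 2 0 1]
```

The `classes` array can be kept to map predicted indices back to label names, for example `classes[y_indices]`.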

As I said earlier, your implementation starts indexing at 1:

[[2], [1], [1], [3], [2], [1], [3]]

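This will break Keras when it computes the loss and metrics. It will return nan, because you will have three output neurons, but the targets run from the second index to the fourth. tl;dr: do not use indices that start at 1 in Keras.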
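To make the off-by-one problem concrete, here is a minimal sketch (an illustration, not part of the original answer) that shifts the 1-based Tokenizer output shown above to 0-based labels and checks that every label fits a model with three output neurons. The names `to_zero_based` and `num_classes` are hypothetical and introduced only for this example.

```python
# Tokenizer output from the question's approach: nested lists, 1-based indices.
tokenizer_output = [[2], [1], [1], [3], [2], [1], [3]]
num_classes = 3  # assumed number of output neurons


def to_zero_based(sequences, num_classes):
    """Flatten single-token sequences and shift indices from 1..N to 0..N-1."""
    labels = [seq[0] - 1 for seq in sequences]
    # Valid targets for N output neurons are 0..N-1.
    if min(labels) < 0 or max(labels) >= num_classes:
        raise ValueError("labels must lie in [0, num_classes)")
    return labels


y_train = to_zero_based(tokenizer_output, num_classes)
print(y_train)  # [1, 0, 0, 2, 1, 0, 2]
```

Note that in the Tokenizer output above class3 maps to 2 and class2 to 3, so the shifted labels do not follow the alphabetical order that LabelEncoder produces. An explicit name-to-index mapping is easier to reason about than a shifted Tokenizer result.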

【Comments】:

I cheated in a small case by using 4 output neurons. However, this is very useful information. Thanks for your answer!