How to verify that validation labels are in the same range as training labels, Python Numpy (answers)

【Question title】: How to verify that validation labels are in the same range as training labels, Python Numpy
【Posted】: 2020-03-11 20:15:02
【Question description】:

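As part of a project, I need to train a multi-label text classifier in Python. I am following a guide, but because of my limited experience with Python I am having trouble understanding the part of the code that verifies that the validation labels are in the same range as the training labels. This is also the part that throws the error.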

The code I am trying to understand is the following (more specifically, the first two lines are what confuse me):

num_classes = max(np.array(train_labels)) + 1
missing_classes = [i for i in range(num_classes) if i not in train_labels]
if len(missing_classes):
    raise ValueError('Missing samples with label value(s) '
                     '{missing_classes}. Please make sure you have '
                     'at least one sample for every label value '
                     'in the range(0, {max_class})'.format(
                        missing_classes=missing_classes,
                        max_class=num_classes - 1))

if num_classes <= 1:
    raise ValueError('Invalid number of labels: {num_classes}.'
                     'Please make sure there are at least two classes '
                     'of samples'.format(num_classes=num_classes))

unexpected_labels = [v for v in test_labels if v not in range(num_classes)]
if len(unexpected_labels):
    raise ValueError('Unexpected label values found in the test set:'
                     ' {unexpected_labels}. Please make sure that the '
                     'labels in the validation set are in the same range '
                     'as training labels.'.format(
                         unexpected_labels=unexpected_labels))

And this is the error it gives me:

    num_classes = max(np.array(train_labels)) 
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

In case it matters, the code that runs before this block is:

lb = preprocessing.LabelBinarizer()
train_labels = lb.fit_transform(train_df['label'])
train_labels = np.squeeze(train_labels)

print(lb.classes_)

test_labels=lb.transform(test_df['label'])
test_labels=np.squeeze(test_labels)

Which gives me this output: [67 68 69 70]

Any help in understanding this better would be appreciated.

【Question comments】:

Tags: python numpy error-handling preprocessor text-classification

【Solution 1】:

num_classes = max(np.array(train_labels)) + 1

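This only makes sense if train_labels contains integer values in range(0, n_classes), something like [1, 0, 3, 2, 0, 2, 3, 1, 0] for 4 classes. That doesn't seem to be what you have here...

lb.classes_ == [67, 68, 69, 70] means that these are the unique values in train_df['label']. LabelBinarizer takes an array of arbitrary labels and "one-hot" encodes them into an array of 0s and 1s with shape (n_samples, n_classes). For example: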

>> train_labels
array([[1, 0, 0, 0],
       [0, 0, 0, 1],
       ...
       [0, 1, 0, 0]])

No matter how many classes there are, the maximum value in this array is always 1.
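To make the shape of that array concrete, here is a minimal NumPy-only sketch of the same kind of one-hot encoding. It is a stand-in for illustration, not how LabelBinarizer is implemented, and raw_labels is a made-up sample using the label values from the question.

```python
import numpy as np

# Hypothetical raw labels, using the values reported by lb.classes_ in the question.
raw_labels = np.array([67, 70, 68, 69, 68])

# Sorted unique labels, plus the position of each sample's label within them.
classes, class_index = np.unique(raw_labels, return_inverse=True)

# Pick one row of the identity matrix per sample: shape (n_samples, n_classes).
one_hot = np.eye(len(classes), dtype=int)[class_index]

print(classes)        # the unique label values
print(one_hot.shape)  # (number of samples, number of classes)
print(one_hot.max())  # the largest entry of a one-hot array
```

Whatever the original label values are, the encoded array only contains 0s and 1s, so taking its maximum and adding 1 cannot recover the number of classes.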

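Moreover, you can only call the built-in max function on 1-D arrays. The error you get comes from max trying to compare the items of the iterable, which for a 2-D numpy array are row vectors. Comparing two rows produces an array of booleans instead of a single True or False, which is ambiguous, and that is why you cannot find the maximum this way: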

>> np.array([0, 1, 0]) > np.array([1, 0, 0])
array([False,  True, False])

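(To actually find the maximum of a numpy array, use np.max() instead.)

In any case, if you want the number of classes represented by the binarized label array, you can simply take the number of columns: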
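A small sketch that shows both behaviors side by side. The one_hot array is made up for illustration; any 2-D array with more than one row and more than one column should behave the same way.

```python
import numpy as np

# Hypothetical one-hot labels: 3 samples, 4 classes.
one_hot = np.array([[1, 0, 0, 0],
                    [0, 0, 0, 1],
                    [0, 1, 0, 0]])

message = None
try:
    # Built-in max iterates over rows and compares them with ">",
    # which yields a boolean array that has no single truth value.
    max(one_hot)
except ValueError as exc:
    message = str(exc)

print(message)

# np.max reduces over every element and returns a single scalar.
overall_max = np.max(one_hot)
print(overall_max)
```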

>> train_labels.shape[-1]
4

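The following line is meant to check that train_labels contains at least one instance of every class, but again it makes no sense if train_labels is a binary 2-D array: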

missing_classes = [i for i in range(num_classes) if i not in train_labels]

You can convert train_labels into an array of integer labels:

>> np.argmax(train_labels, axis=-1)
array([0, 3, ... , 1])
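Once the labels are integers again, the checks from the question work as written. A minimal sketch, assuming one-hot arrays like the ones above; train_one_hot and test_one_hot are made-up examples standing in for the LabelBinarizer output.

```python
import numpy as np

# Hypothetical one-hot labels with 4 classes.
train_one_hot = np.array([[1, 0, 0, 0],
                          [0, 0, 0, 1],
                          [0, 1, 0, 0],
                          [0, 0, 1, 0]])
test_one_hot = np.array([[0, 0, 1, 0],
                         [1, 0, 0, 0]])

# Column index of the 1 in each row, as plain Python ints.
train_ids = np.argmax(train_one_hot, axis=-1).tolist()
test_ids = np.argmax(test_one_hot, axis=-1).tolist()

# The original checks, now applied to 1-D integer labels.
num_classes = max(train_ids) + 1
missing_classes = [i for i in range(num_classes) if i not in train_ids]
unexpected_labels = [v for v in test_ids if v not in range(num_classes)]

print(train_ids, num_classes, missing_classes, unexpected_labels)
```

One caveat: np.argmax returns 0 for a row that contains only zeros, so a row without any 1 would be silently counted as class 0. If such rows can occur in your data, it is probably safer to check the one-hot array directly.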

Or you can check that each column contains at least one non-zero value:

>> np.sum(train_labels, axis=0) > 0
array([ True,  True,  True,  True])

>> np.all(np.sum(train_labels, axis=0) > 0)
True
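Putting the pieces together, the validation from the question could be rewritten to work on the one-hot arrays directly. This is only a sketch under the assumption that both arrays have shape (n_samples, n_classes); check_one_hot_labels is a hypothetical helper name, not something from the guide or from scikit-learn, and the error messages are shortened.

```python
import numpy as np

def check_one_hot_labels(train_labels, test_labels):
    """Hypothetical helper: validate one-hot arrays of shape (n_samples, n_classes)."""
    train_labels = np.asarray(train_labels)
    test_labels = np.asarray(test_labels)
    if train_labels.ndim != 2 or test_labels.ndim != 2:
        raise ValueError('Expected 2-D one-hot label arrays.')

    # The number of classes is the number of columns, not max() + 1.
    num_classes = train_labels.shape[-1]
    if num_classes <= 1:
        raise ValueError('Invalid number of labels: {}.'.format(num_classes))

    # A column that never contains a 1 is a class with no training sample.
    missing_classes = np.flatnonzero(train_labels.sum(axis=0) == 0).tolist()
    if missing_classes:
        raise ValueError('Missing samples for class column(s) {}.'.format(missing_classes))

    if test_labels.shape[-1] != num_classes:
        raise ValueError('Test labels have a different number of classes.')

    # Assumption: every valid test row marks exactly one class.
    bad_rows = np.flatnonzero(test_labels.sum(axis=1) != 1).tolist()
    if bad_rows:
        raise ValueError('Unexpected label rows in the test set: {}.'.format(bad_rows))
    return num_classes

train = [[1, 0, 0, 0], [0, 0, 0, 1], [0, 1, 0, 0], [0, 0, 1, 0]]
test = [[0, 0, 1, 0], [1, 0, 0, 0]]
print(check_one_hot_labels(train, test))
```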

【Comments】: