How to select random MNIST digits of random quantity, whose labels do not repeat, while excluding a certain digit?

【Question title】: How to select random MNIST digits of random quantity, whose labels do not repeat, while excluding a certain digit?
【Posted】: 2019-12-09 23:24:47
【Question description】:

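I am fairly new to coding and would appreciate any help, but please be gentle with me.

I am working with the MNIST database for neural networks because I want to transfer the results to a different problem. What I am trying to do is manipulate the MNIST training dataset by adding a set of images to the image that is to be classified. Allow me to outline the approach:

When training a neural network, the MNIST database provides images of handwritten digits (x_train) and their labels/classes (y_train).
However, I do not want to train the neural network with a single image as input only; I also want to give it a set of optional images to choose from.
So if I want the machine to classify the digit "5", I feed in an image of the digit "5" together with a set of random images, and the size of that set should be random:

-> Input = image to classify "5" | images to refer to "1", "4", "5"; the next one would be image to classify "0" | images to refer to "0", "9", "3", "5", "6", etc.

The "images to refer to" should always contain the digit to be classified, but not the image to be classified itself. That is, the index of the image to classify "5" must differ from the index of the reference image "5".
So far I have managed to select a random number (random_with_N_digits()) of random images (digit_randomizer()). What I am missing is:
1. Excluding the index itself: the index of the "5" to be classified must not be the index of the "5" offered for selection
2. The images to refer to should not contain repeated digits

To 1.: Below you can see my function digit_randomizer(). At the moment I have no idea how to solve this other than using a nested loop that checks "np.where(j != i)".

To 2.: I was thinking about splitting y_train into 10 different sets of labels (one set per digit). But I do not know what kind of command I should write, since I need to define a random number of images, picking one random image each from the 10 sets while keeping track of the indices.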
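
The idea in point 2 (one pool of indices per digit) could be sketched roughly as follows. This is only a minimal illustration under assumptions: `y` is a small synthetic label array standing in for `y_train`, and `indices_by_label` is a hypothetical name that does not appear in the original code.

```python
import numpy as np

# Synthetic stand-in for y_train (assumption: labels are integers 0-9).
rng = np.random.default_rng(0)
y = rng.integers(0, 10, size=1000)

# One array of dataset indices per digit label.
indices_by_label = {int(d): np.flatnonzero(y == d) for d in np.unique(y)}

# Draw one random index from the pool of a given digit, here 5.
idx = int(rng.choice(indices_by_label[5]))
print(idx, y[idx])
```

Because every pool holds indices rather than images, the index of the image being classified can later be filtered out of its pool before drawing.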

Here is my code so far:

import keras as k
import tensorflow as tf  # needed for tf.keras.datasets.mnist.load_data() below
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Conv2D, Dropout, Flatten, MaxPooling2D
import matplotlib.pyplot as plt

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()


print('')
print('x_train shape:', x_train.shape)

# Reshaping the array to 4-dims so that it can work with the Keras API
x_train = x_train.reshape(x_train.shape[0], 28, 28, 1)
x_test = x_test.reshape(x_test.shape[0], 28, 28, 1)
input_shape = (28, 28, 1)
# Making sure that the values are float so that we can get decimal points after division
x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
# Normalizing the pixel values by dividing by the maximum value (MNIST images are grayscale, 0-255).
x_train /= 255
x_test /= 255
print('x_train shape reshaped:', x_train.shape)
print('Number of images in x_train', x_train.shape[0])
print('Number of images in x_test', x_test.shape[0])

classes = set(y_train)
print('Number of classes =', len(classes),'\n')
print('Classes: \n', classes, '\n')

print('y_train shape:', y_train.shape)
print('y_test shape:', y_test.shape)


import random
import warnings
warnings.filterwarnings("ignore")
from random import randint

#select a random image from the training data
def digit_select():
    for j in range(1):
        j = np.random.choice(np.arange(0, len(y_train)), size = (1,))
        digit = (x_train[j] * 255).reshape((28, 28)).astype("uint8")
        imgplot = plt.imshow(digit)
        plt.title(y_train[j])
        imgplot.set_cmap('Greys')
        plt.show()

# return a random count from 0 to 9 (randint includes both endpoints)
def random_with_N_digits():
    range_start = 0
    range_end = 9
    return randint(range_start, range_end)

# show a random number (0 to 9) of random images
def digit_randomizer():  
    for i in range(random_with_N_digits()):
        i = np.random.choice(np.arange(0, len(y_train)), size = (1,))
        image = (x_train[i] * 255).reshape((28, 28)).astype("uint8")
        imgplot = plt.imshow(image)
        imgplot.set_cmap('Greys')
        plt.title(y_train[i])
        plt.show()

digit_select should somehow be excluded from digit_randomizer, and digit_randomizer should pick only one image per class from y_train.

Any ideas are greatly appreciated.

Code edit:

def digit_label_randselect():
    j = np.random.choice(np.arange(0, len(y_train)), size=(1,))
    return int(y_train[j])
print('Randomly selected label:', digit_label_randselect())

Output: Randomly selected label: 4

def n_reference_digits(input_digit_label):
    other_digits = list(np.unique(y_train)) #create list with all digits
    other_digits.remove(input_digit_label) #remove the input digit label
    sample = random.sample(other_digits, len(np.unique(y_train))-1) #Take a sample of size n of the digits
    sample.append(input_digit_label)
    random.shuffle(sample)
    return sample
print('Randomly shuffled labels:', n_reference_digits(digit_label_randselect()))

Output: Randomly shuffled labels: [8, 0, 6, 2, 7, 4, 3, 5, 9, 1]


'''returns a list of 10 random indices.
necessary to choose random 10 digits as a set, which will be used to train the NN.
the set needs to contain a single identical y_train value (label),
meaning that a random digit definitely has the same random digit in the set.
however their indices have to differ. moreover all y_train values (labels) have to be different,
meaning that the set contains a single representation of every digit.'''
def digit_indices_randselect():
    listi = []
    for i in range(10):
        i = np.random.choice(np.arange(0, len(y_train)), size = (1,))
        listi.append(i)
    return listi
listindex = digit_indices_randselect()
print('Random list of indices:', listindex)

Output: Random list of indices: [array([53451]), array([31815]), array([4519]), array([21354]), array([14855]), array([45147]), array([42903]), array([37681]), array([1386]), array([9584])]

'''for every index in listindex return the corresponding index, pixel array and label'''
#TO DO: One Hot Encode the labels
def array_and_label_for_digit_indices_randselect():
    listi = []
    digit_data = []
    labels = []
    for i in listindex:
        digit_array = x_train[i] #digit data (image array) is the data from index i
        label = y_train[i] #corresponding label
        listi.append(i)
        digit_data.append(digit_array)
        labels.append(label)
    list3 = list(zip(listi, digit_data, labels))
    return list3
array_and_label_for_digit_indices_randselect()


Output:[(array([5437]),
  array([[[  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
             0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
             0,   0,   0,   0],
          [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
             0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
             0,   0,   0,   0],
          [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
             0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
             0,   0,   0,   0],
          [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
             0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
             0,   0,   0,   0],
          [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
             0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
             0,   0,   0,   0],
          [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
             0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
             0,   0,   0,   0],
          [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
             0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
             0,   0,   0,   0],
          [  0,   0,   0,   0,   0,   0,   0,   0,   4,  29,  29,  29,
            29,  29,  29,  29,  92,  91, 141, 241, 255, 228,  94,   0,
             0,   0,   0,   0],
          [  0,   0,   0,   0,   0,   0,  45, 107, 179, 252, 252, 252,
           253, 252, 252, 252, 253, 252, 252, 252, 253, 252, 224,  19,
             0,   0,   0,   0],
          [  0,   0,   0,   0,   0,  45, 240, 252, 253, 252, 252, 252,
           253, 252, 252, 252, 253, 252, 252, 252, 253, 252, 186,   6,
             0,   0,   0,   0],
          [  0,   0,   0,   0,   0, 157, 252, 252, 253, 252, 252, 252,
           253, 252, 252, 252, 241, 215, 252, 252, 253, 202,  19,   0,
             0,   0,   0,   0],
          [  0,   0,   0,   0,  41, 253, 253, 253, 255, 234, 100,   0,
             0,   0,   0,   0,   0,  70, 253, 253, 251, 125,   0,   0,
             0,   0,   0,   0],
          [  0,   0,   0,   0,  66, 252, 252, 252, 253, 133,   0,   0,
             0,   0,   0,   0,   0, 169, 252, 252, 200,   0,   0,   0,
             0,   0,   0,   0],
          [  0,   0,   0,   0,   7, 130, 168, 168, 106,  19,   0,   0,
             0,   0,   0,   0,  10, 197, 252, 252, 113,   0,   0,   0,
             0,   0,   0,   0],
          [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
             0,   0,   0,   0, 128, 252, 252, 252,  63,   0,   0,   0,
             0,   0,   0,   0],
          [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
             0,   0,   0,  13, 204, 253, 253, 241,   0,   0,   0,   0,
             0,   0,   0,   0],
          [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
             0,   0,   0,  88, 253, 252, 233, 109,   0,   0,   0,   0,
             0,   0,   0,   0],
          [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
             0,   0,   0, 225, 253, 252, 234,  22,   0,   0,   0,   0,
             0,   0,   0,   0],
          [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
             0,   0,  38, 237, 253, 252, 164,  15,   0,   0,   0,   0,
             0,   0,   0,   0],
          [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
             0,  26, 172, 253, 254, 228,  31,   0,   0,   0,   0,   0,
             0,   0,   0,   0],
          [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
             0, 114, 234, 252, 253, 139,  19,   0,   0,   0,   0,   0,
             0,   0,   0,   0],
          [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
           111, 234, 252, 252,  94,  19,   0,   0,   0,   0,   0,   0,
             0,   0,   0,   0],
          [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
           241, 252, 252, 202,  13,   0,   0,   0,   0,   0,   0,   0,
             0,   0,   0,   0],
          [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,  76,
           254, 253, 253,  78,   0,   0,   0,   0,   0,   0,   0,   0,
             0,   0,   0,   0],
          [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0, 225,
           253, 252, 233,  22,   0,   0,   0,   0,   0,   0,   0,   0,
             0,   0,   0,   0],
          [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0, 225,
           253, 233,  62,   0,   0,   0,   0,   0,   0,   0,   0,   0,
             0,   0,   0,   0],
          [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,  38, 187,
           241,  59,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
             0,   0,   0,   0],
          [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
             0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
             0,   0,   0,   0]]], dtype=uint8),
  array([7], dtype=uint8)),...


'''for every index in x_train return a random index, its pixel array, label.
also return a list of 10 random indices (digit_indices_randselect())'''
def digit_with_set():
    for i in x_train:
        i = random.randrange(len(y_train)) #returns a random index of a digit between 0 - 60000
        digit_data = x_train[i] #digit data (image array) is the data from index i
        label = y_train[i] #corresponding label
        #digit_randomizer() #returns a random set of 10 images
        print("Index of digit to classify:", i)
        print("Digit to classify:", label)
        print("Corresponding array:", digit_data)
        print("Set of 10 Images:", array_and_label_for_digit_indices_randselect())
        print("")
        print("Next:")
digit_with_set()



***PURPOSE Edit:*** The purpose of this approach is to research whether a neural network can devise a model that not only classifies the input but also recognizes the possibility of choosing a label from the optional set. In other words, the model not only classifies the "5" as a "5", but also looks at its options and finds a match there as well.

This may not make much sense for an image classification problem. However, I am working on a sequence-to-sequence problem in another project. The input sequences are spread over multiple columns in a .csv file. The output is another sequence. The issue lies in the very heterogeneous input, so the accuracy is low and the loss is very high.

This is how the data is structured:

**Input**: | AA_10| 31.05.2019 | CW20 | Project1 |   **Output**: AA_Project1_[11]

**Input**: |      | CW19       |      | Project2 |   **Output**: AA_Project2_[3]

**Input**: | y550 | 01.06.2019 | AA12 | Project1 |   **Output**: AA_Project1_[12]

The AA_ProjectX_[Value] within the output is the main issue since its range varies from project to project. Project1 can have [0-12], Project 2 can have [0-20], ProjectX [0-N].

By adding a range of values to the input data I hope to restrict the network from learning values which are not part of the project.

Input: | AA_10| 31.05.2019 | CW20 | Project1 | [0,1,2,3,4,5,6,7,8,9,10,11,12] |  Output: AA_Project1_[11]

So when I want to classify the digit 5, I give the machine a range of possible classes and corresponding images to derive the output class from.
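
One way the "range of allowed values" could be turned into a fixed-length network input is a multi-hot vector. The sketch below is an assumption, not something stated in the question: `candidates_to_multi_hot` is a hypothetical helper, and the total of 21 possible values is taken only from the example ranges above (Project 2 with [0-20]).

```python
import numpy as np

def candidates_to_multi_hot(candidates, n_classes):
    """Encode a set of allowed class values as a 0/1 vector of length n_classes."""
    mask = np.zeros(n_classes, dtype=np.float32)
    mask[list(candidates)] = 1.0
    return mask

# Project1 allows the values 0..12; assume at most 21 distinct values overall.
mask = candidates_to_multi_hot(range(13), n_classes=21)
print(mask)
```

Such a vector can be concatenated to the other input features, so every sample has the same input width regardless of how many values its project allows.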

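【Question discussion】:

Tags: python image-processing random mnist digits

【Solution 1】:

You are asking several questions at once. This is discouraged, because the question as a whole becomes very specific to you and therefore will not help others. Try to break your problem down into more basic questions, which you can then search for on SO (as they have probably been answered already) or ask separately. You will find that people are more likely to respond if your questions are more modular.

Before I get to your question specifically, I have a few comments on your code, since you mention that you are a novice coder. I try to be explicit; the code could be shorter or more elegant, but I chose expressiveness as my main goal.

A function is a modular piece of code that should preferably have a single job. So I would not put the visualization code in your digit_select; I would create a separate function for visualization. Maybe something like this: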

def vis_digit(index):
    digit = (x_train[index] * 255).reshape((28, 28)).astype("uint8")
    imgplot = plt.imshow(digit)
    plt.title(y_train[index])
    imgplot.set_cmap('Greys')
    plt.show()

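Now we can refactor digit_select further. I don't think you need the for loop here. As I understand it, the method only selects one random image, so there is no need to repeat the behaviour. As written, the loop does not repeat anything anyway, because range(1) yields an iterable containing only 0. Also, j is the index you use to select a training image, and it can be a plain integer. You can therefore use random.randrange or random.randint; I prefer randrange (read the docs to see the difference between the two). You will want to remember which image you used, because that image must not appear in your reference set, so I suggest returning that index. The digit_select method could look like this: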

def digit_select():
    digit_index = random.randrange(len(y_train))
    digit_data = x_train[digit_index]
    label = y_train[digit_index]
    return digit_index, digit_data, label
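
The difference between randrange and randint mentioned above can be checked with a small experiment. This is an illustrative sketch only; `n`, `from_randrange` and `from_randint` are names made up for the example.

```python
import random

n = 5
# randrange(n) draws from 0..n-1, so the result is always a valid index
# into a sequence of length n.
from_randrange = {random.randrange(n) for _ in range(1000)}
# randint(a, b) includes both endpoints, so randint(0, n) can return n itself.
from_randint = {random.randint(0, n) for _ in range(1000)}

print(sorted(from_randrange))
print(sorted(from_randint))
```

For the same reason, randint(0, 9) in the question's random_with_N_digits() returns a value from 0 to 9, so it can produce 0 images and never 10.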

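Now, as far as I can tell, I will answer one aspect of your compound question: "How do I select a random list of unique digits that must contain a specific digit?". This can be googled, e.g. this. I use the accepted answer in my answer.

I would use a function that returns a list of digit labels which includes the desired input digit label.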

def n_reference_digits(input_digit_label):
    other_digits = list(range(10)) #create list with all digits
    other_digits.remove(input_digit_label) #remove the input digit label
    n = random.randrange(10) #pick a random n [0,10)
    sample = random.sample(other_digits, n) #Take a sample of size n of the digits
    sample.append(input_digit_label)
    return sample

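Now, I know this is not complete yet, but try to figure out what the next step is. Try googling that small step, and if you cannot find an answer, just ask a new (more specific) question. :)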
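
One possible next step, turning the chosen labels into image indices, might look like the sketch below. It is not part of the original answer: `build_reference_set` is a hypothetical helper, `y` is a synthetic stand-in for `y_train`, and the code assumes every label occurs at least twice so that a pool is never empty after the query image is removed.

```python
import numpy as np

def build_reference_set(y, query_index, rng):
    """Return reference indices: unique labels, query label included,
    the query image itself excluded."""
    query_label = int(y[query_index])
    others = [d for d in range(10) if d != query_label]
    n_extra = int(rng.integers(0, len(others) + 1))   # 0..9 additional labels
    extra = rng.permutation(others)[:n_extra]         # distinct labels, no repeats
    labels = rng.permutation(np.append(extra, query_label))
    reference = []
    for label in labels:
        pool = np.flatnonzero(y == label)
        pool = pool[pool != query_index]              # never reuse the query image
        reference.append(int(rng.choice(pool)))       # fails if the pool is empty
    return reference

rng = np.random.default_rng(0)
y = rng.integers(0, 10, size=1000)   # synthetic stand-in for y_train
query_index = 42
reference = build_reference_set(y, query_index, rng)
print(int(y[query_index]), [(i, int(y[i])) for i in reference])
```

Picking labels first and indices second guarantees that no label repeats, and filtering the pool covers the requirement that the reference "5" is a different image from the "5" being classified.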

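【Discussion】:

Hi, thanks for your feedback. I am able to select a label and shuffle it with the other labels without repeating labels: Output: Randomly shuffled labels: [8, 0, 6, 2, 7, 4, 3, 5, 9, 1]. I have also managed to select 10 indices, add the corresponding pixel arrays and labels, and zip them: Output: [(array([5437]), array([[[0, ... 0]]], dtype=uint8), array([7], dtype=uint8)) and so on. What I do not understand now is how to make sure that the 10 selected indices have no repeated labels. Each index should belong to a distinct label.

【Solution 2】:

Although I do not quite understand the purpose of what you are trying to do, you can randomly pick some indices in your data/labels while taking care not to pick the unwanted digit.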

import random

data = [["digit1"],["digit3"],["digit1"],["digit2"],["digit3"],["digit1"],["digit2"],        
     ["digit1"],["digit2"],["digit3"]]

labels = [1,3,1,2,3,1,2,1,2,3]

unwanted_label = 1
nb_samples = 3

samples = random.sample([(i, j) for i, j in zip(data, labels) if j!=unwanted_label],nb_samples)

print(list(zip(*samples)))

You will get your data and the associated labels at random, like this:

[(['digit2'], ['digit3'], ['digit3']), (2, 3, 3)]
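
Note that the output above contains label 3 twice. If the labels must not repeat, one option is to sample labels first and then one item per label. The following is a sketch on the same toy data, not the answerer's code; `positions`, `nb_labels` and `chosen_labels` are hypothetical names.

```python
import random

data = [["digit1"], ["digit3"], ["digit1"], ["digit2"], ["digit3"],
        ["digit1"], ["digit2"], ["digit1"], ["digit2"], ["digit3"]]
labels = [1, 3, 1, 2, 3, 1, 2, 1, 2, 3]
unwanted_label = 1

# Group dataset positions by label, skipping the unwanted label.
positions = {}
for pos, label in enumerate(labels):
    if label != unwanted_label:
        positions.setdefault(label, []).append(pos)

# Pick a random number of distinct labels, then one position per chosen label.
nb_labels = random.randint(1, len(positions))
chosen_labels = random.sample(sorted(positions), nb_labels)
samples = [(data[random.choice(positions[lab])], lab) for lab in chosen_labels]
print(samples)
```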

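【Discussion】:

Thank you very much. I am currently adapting your suggestion to see whether I can apply it. I have also edited my question and added the purpose of this approach.
Feel free to ask if anything in my small piece of code is unclear!

【Solution 3】: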

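You can combine the data (join train and test) and convert it into a pandas DataFrame, and then just run these two lines:

bad_labels = df[df['labels'] == X].sample(amount).index
df = df[~df.index.isin(bad_labels)]

df['labels'] is the last column of the DataFrame (your labels), X is the label you wish to exclude, and amount is the number of randomly sampled rows with that label; note that, as written, the sampled rows are the ones removed from df.

Here is how to join the data:

import numpy as np
data = np.concatenate((x_train, x_test), axis=0)

Convert to pandas:

import pandas as pd
df = pd.DataFrame(data.reshape(len(data), -1))  # flatten each image; DataFrame needs 2-D input

How to add the y labels to the DataFrame: df['labels'] = y (where y holds the labels joined in the same order, e.g. np.concatenate((y_train, y_test), axis=0))

To get the train and test parts of the data back, you can use sklearn's train_test_split() function.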
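
For the "one image per label, excluding a given label" requirement from the question, the DataFrame approach could be extended roughly as below. This is a sketch under assumptions: the arrays are small random stand-ins for the MNIST data, `excluded` and `one_per_label` are hypothetical names, and `groupby(...).sample` requires pandas 1.1 or newer.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Random stand-ins for the joined images and labels (assumption, not real MNIST).
images = rng.integers(0, 256, size=(200, 28, 28, 1), dtype=np.uint8)
y = rng.integers(0, 10, size=200)

# pd.DataFrame expects 2-D input, so flatten each image into one row of 784 pixels.
df = pd.DataFrame(images.reshape(len(images), -1))
df["labels"] = y

excluded = 5
candidates = df[df["labels"] != excluded]
# One random row per remaining label; the row index is the position in `images`.
one_per_label = candidates.groupby("labels").sample(n=1, random_state=0)
print(one_per_label["labels"].tolist())
```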

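【Discussion】:

Sorry if my solution is too confusing, especially if you are new to coding. It is the best I could come up with; be sure to ask if there is anything you did not get.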