Answers: High loss for an unknown reason when training a CNN

【Question title】: High loss for an unknown reason when training a CNN
【Posted】: 2022-01-29 17:15:25
【Question】:

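I have been stuck on this task for three days and have gone through everything I could find online, but my model's loss will not go down. The model is just guessing at random on the validation set.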

Data source: https://www.kaggle.com/datamunge/sign-language-mnist

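Here is what I have tried and confirmed does not help:

Increased the batch size, but batch size seems unrelated to the high loss and low accuracy.
Checked the format of the input data; found nothing wrong, everything looks normal.
Removed image augmentation; the loss did not change.
Changed the optimizer; I tried Adam, RMSprop, and SGD.
Added more neurons and trained for more epochs; this only raised training accuracy, not validation accuracy.
Checked my environment; other CNN example code runs as expected.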

Here are my code and output.

import matplotlib.pyplot as plt
import csv
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from os import getcwd
import sys


def progressbar(it, prefix="", size=29, file=sys.stdout):
    # This def is made by: https://stackoverflow.com/users/1207193/iambr
    # it is the list you are going to iterate
    # prefix is the title of your progress bar
    # size is the length of your progress bar
    count = len(it)

    def show(j):
        x = int(size*j/count)
        file.write("%s[%s%s%s] %i/%i\r" %
                   (prefix, "="*x, ">", "."*(size-x), j, count))
        file.flush()
    show(0)
    for i, item in enumerate(it):
        yield item
        show(i+1)
    file.write("\n")
    file.flush()


def get_data(filename):
    with open(filename) as training_file:
        images = np.empty((0, 28, 28), dtype=float)
        labels = np.empty((0), dtype=float)
        # Your code starts here
        raw_file = np.loadtxt(training_file.readlines()[
                              :-1], dtype=float, skiprows=1, delimiter=',')
        for row in progressbar(raw_file, "Loading data: "):
            if(len(row) == 785):
                labels = np.append(labels, row[0])
                image = np.reshape(row[1:785], (1, 28, 28))
                images = np.append(image, images, axis=0)
        print(f'read file:{filename} complete')
        return images, labels


# full data set
# path_sign_mnist_train = f'{getcwd()}/tmp2/sign_mnist_train.csv'
# path_sign_mnist_test = f'{getcwd()}/tmp2/sign_mnist_test.csv'

# reduce training set
path_sign_mnist_train = f'{getcwd()}/tmp2/sign_mnist_train_a.csv'
path_sign_mnist_test = f'{getcwd()}/tmp2/sign_mnist_test_a.csv'

training_images, training_labels = get_data(path_sign_mnist_train)
testing_images, testing_labels = get_data(path_sign_mnist_test)

training_images=training_images/255.
testing_images=testing_images/255.

# Keep these
print(training_images.shape)
print(training_labels.shape)
print(testing_images.shape)
print(testing_labels.shape)
print(testing_labels)

# Testing code
plt.imshow(training_images[1], interpolation='nearest')
plt.show()
print(training_labels[1])

train_datagen = ImageDataGenerator(
    featurewise_center=False,  # set input mean to 0 over the dataset
    samplewise_center=False,  # set each sample mean to 0
    featurewise_std_normalization=False,  # divide inputs by std of the dataset
    samplewise_std_normalization=False,  # divide each input by its std
    zca_whitening=False,  # apply ZCA whitening
    rotation_range=14,  # randomly rotate images in the range (degrees, 0 to 180)
    zoom_range=0.09,  # randomly zoom image
    width_shift_range=0.14,  # randomly shift images horizontally (fraction of total width)
    height_shift_range=0.14,  # randomly shift images vertically (fraction of total height)
    horizontal_flip=False,  # randomly flip images
    vertical_flip=False,  # randomly flip images
    brightness_range=(0.8, 1.0),  # brightness of image
    rescale=1. / 255.)

validation_datagen = ImageDataGenerator(rescale=1./255.)

training_images = np.reshape(training_images, (-1,28,28,1))
train_datagen.fit(training_images)
testing_images = np.reshape(testing_images,(-1,28,28,1))

training_labels=tf.keras.utils.to_categorical(training_labels,num_classes=25)
testing_labels=tf.keras.utils.to_categorical(testing_labels, num_classes=25)

batch_size = 16

train_generator = train_datagen.flow(
    training_images,
    training_labels, batch_size=batch_size)

validation_generator = validation_datagen.flow(
    testing_images,
    testing_labels, batch_size=batch_size)
# Keep These
print(training_images.shape)
print(testing_images.shape)

# Their output should be:
# (27455, 28, 28, 1)
# (7172, 28, 28, 1)

# Define the model
# Use no more than 2 Conv2D and 2 MaxPooling2D
model = tf.keras.models.Sequential([
    # Your Code Here
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu',
                           input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(25, activation='softmax')
])

# Compile Model.
model.compile(optimizer=tf.keras.optimizers.Adam(lr=0.005),loss='categorical_crossentropy',metrics=['accuracy'])

model.summary()

# Train the Model
history = model.fit_generator(train_generator,
                              validation_data=validation_generator,
                              steps_per_epoch=len(training_images)//batch_size,
                              epochs=10,
                              validation_steps=len(testing_images)//batch_size
                              )

# model.evaluate(testing_images/255., testing_labels, verbose=0)

# Plot the chart for accuracy and loss on both training and validation
acc = history.history['accuracy']
val_acc = history.history['val_accuracy']
loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(len(acc))

plt.plot(epochs, acc, 'r', label='Training accuracy')
plt.plot(epochs, val_acc, 'b', label='Validation accuracy')
plt.title('Training and validation accuracy')
plt.legend()
plt.figure()

plt.plot(epochs, loss, 'r', label='Training Loss')
plt.plot(epochs, val_loss, 'b', label='Validation Loss')
plt.title('Training and validation loss')
plt.legend()

plt.show()

But the loss barely changes...

Epoch 1/10
WARNING:tensorflow:AutoGraph could not transform <function Model.make_train_function.<locals>.train_function at 0x0000026B4B18F948> and will run it as-is.
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: unsupported operand type(s) for -: 'NoneType' and 'int'
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
2022-01-27 09:40:05.564400: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_10.dll
2022-01-27 09:40:05.743540: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2022-01-27 09:40:06.492580: W tensorflow/stream_executor/gpu/redzone_allocator.cc:314] Internal: Invoking GPU asm compilation is supported on Cuda non-Windows platforms only
Relying on driver to perform ptx compilation.
Modify $PATH to customize ptxas location.
This message will be only logged once.
430/437 [============================>.] - ETA: 0s - loss: 3.1891 - accuracy: 0.0461WARNING:tensorflow:AutoGraph could not transform <function Model.make_test_function.<locals>.test_function at 0x0000026B490A4F78> and will run it as-is.
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: unsupported operand type(s) for -: 'NoneType' and 'int'
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
437/437 [==============================] - 3s 7ms/step - loss: 3.1890 - accuracy: 0.0463 - val_loss: 3.2067 - val_accuracy: 0.0230
Epoch 2/10
437/437 [==============================] - 3s 7ms/step - loss: 3.1828 - accuracy: 0.0425 - val_loss: 3.1952 - val_accuracy: 0.0333
Epoch 3/10
437/437 [==============================] - 3s 7ms/step - loss: 3.1802 - accuracy: 0.0401 - val_loss: 3.2006 - val_accuracy: 0.0230
Epoch 4/10
437/437 [==============================] - 3s 7ms/step - loss: 3.1789 - accuracy: 0.0434 - val_loss: 3.2012 - val_accuracy: 0.0348
Epoch 5/10
437/437 [==============================] - 3s 7ms/step - loss: 3.1782 - accuracy: 0.0448 - val_loss: 3.2109 - val_accuracy: 0.0345
Epoch 6/10
437/437 [==============================] - 3s 7ms/step - loss: 3.1784 - accuracy: 0.0454 - val_loss: 3.2056 - val_accuracy: 0.0230
Epoch 7/10
437/437 [==============================] - 3s 7ms/step - loss: 3.1782 - accuracy: 0.0407 - val_loss: 3.2032 - val_accuracy: 0.0230
Epoch 8/10
437/437 [==============================] - 3s 7ms/step - loss: 3.1780 - accuracy: 0.0391 - val_loss: 3.2080 - val_accuracy: 0.0230
Epoch 9/10
437/437 [==============================] - 3s 7ms/step - loss: 3.1775 - accuracy: 0.0417 - val_loss: 3.2033 - val_accuracy: 0.0230
Epoch 10/10
418/437 [===========================>..] - ETA: 0s - loss: 3.1773 - accuracy: 0.0460Traceback (most recent call last):

【Question comments】:

Tags: python tensorflow keras conv-neural-network

【Solution 1】:

The only problem with this network is that it learns too fast. If you change the learning rate from 0.005 to 0.0005, the model works.

# Compile Model.
# `learning_rate` is the current argument name; `lr` is an older alias.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0005),
              loss='categorical_crossentropy', metrics=['accuracy'])
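
To see why the step size matters at all, here is a toy sketch. It is not the CNN and not Adam: it assumes plain gradient descent on the one-variable function f(x) = x**2, and the function name `gradient_descent` and both learning rates are made up for illustration. It shows only the general effect that a step size that is too large stops the loss from going down.

```python
def gradient_descent(learning_rate, steps=20, start=1.0):
    """Minimize f(x) = x**2 with plain gradient descent; return the final x."""
    x = start
    for _ in range(steps):
        grad = 2 * x                      # derivative of x**2
        x = x - learning_rate * grad      # one update step
    return x


# With 0.1 every step multiplies x by 0.8, so x shrinks toward the minimum at 0.
small = gradient_descent(learning_rate=0.1)

# With 1.1 every step multiplies x by -1.2, so x overshoots and grows in size.
large = gradient_descent(learning_rate=1.1)

print(abs(small), abs(large))
```

On a real network the failure usually looks different (the loss plateaus instead of blowing up), so treat this only as intuition for trying a smaller rate first.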

Don't learn too fast, or you can get stuck in a local minimum and never get out.

Epoch 2/10
437/437 [==============================] - 3s 6ms/step - loss: 2.5773 - accuracy: 0.2133 - val_loss: 2.2050 - val_accuracy: 0.3542
Epoch 3/10
437/437 [==============================] - 3s 6ms/step - loss: 2.1190 - accuracy: 0.3262 - val_loss: 1.6197 - val_accuracy: 0.5278
Epoch 4/10
437/437 [==============================] - 3s 7ms/step - loss: 1.7566 - accuracy: 0.4223 - val_loss: 1.3985 - val_accuracy: 0.5492
Epoch 5/10
437/437 [==============================] - 3s 6ms/step - loss: 1.5062 - accuracy: 0.4929 - val_loss: 1.1146 - val_accuracy: 0.7000
Epoch 6/10
437/437 [==============================] - 3s 6ms/step - loss: 1.3736 - accuracy: 0.5323 - val_loss: 1.0778 - val_accuracy: 0.6756
Epoch 7/10
437/437 [==============================] - 3s 6ms/step - loss: 1.2198 - accuracy: 0.5836 - val_loss: 0.8912 - val_accuracy: 0.7650
Epoch 8/10
437/437 [==============================] - 3s 6ms/step - loss: 1.1396 - accuracy: 0.6066 - val_loss: 0.8298 - val_accuracy: 0.7486
Epoch 9/10
437/437 [==============================] - 3s 6ms/step - loss: 1.1084 - accuracy: 0.6182 - val_loss: 0.9152 - val_accuracy: 0.6830
Epoch 10/10
437/437 [==============================] - 3s 6ms/step - loss: 1.0196 - accuracy: 0.6525 - val_loss: 0.8014 - val_accuracy: 0.7307
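
A quick way to read the two logs: in the question's run the training loss sits at about 3.178 and accuracy at about 0.04, which is what a classifier that always outputs a uniform guess would score. The sketch below checks that arithmetic. It assumes 24 classes actually occur in the labels (the Kaggle dataset page describes 24 static letters, while the question's code one-hot encodes into 25 slots); the variable names are only for illustration.

```python
import math

# Assumption: 24 letters occur in the data, encoded into 25 softmax outputs.
classes_present = 24
output_slots = 25

# Cross-entropy of a uniform guess is ln(number of equally likely classes).
chance_loss_present = math.log(classes_present)
chance_loss_slots = math.log(output_slots)

# Accuracy of a uniform guess over the classes that occur.
chance_accuracy = 1 / classes_present

print(f"{chance_loss_present:.4f}")  # 3.1781
print(f"{chance_loss_slots:.4f}")    # 3.2189
print(f"{chance_accuracy:.4f}")      # 0.0417
```

So a loss stuck near 3.178 suggests the network has learned nothing beyond the class frequencies, while the log above drops well below that level within a few epochs.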

By the way, the loading function is not very Pythonic or efficient. This works better:

def get_data(filename):
    with open(filename) as training_file:
        raw_file = np.loadtxt(training_file.readlines()[
                              :-1], dtype=float, skiprows=1, delimiter=',')
        labels=np.array([i[0] for i in raw_file])
        images=np.array([i[1:785] for i in raw_file])
        images=images.reshape(-1,28,28)
        print(f'read file:{filename} complete')
        return images, labels
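
The rewrite is not only faster. The loop in the question calls `np.append(image, images, axis=0)`, which puts each new image in front of the ones already collected, while labels are appended at the end, so images and labels end up in opposite orders. The sketch below reproduces both approaches on made-up toy rows (3 rows, 4 pixels each instead of 784) to make the difference visible; it also uses plain slicing, which is an assumed alternative to the list comprehensions above, not the answerer's exact code.

```python
import numpy as np

# Hypothetical toy rows shaped like the CSV: a label, then the pixel values.
rows = np.array([
    [0, 10, 10, 10, 10],
    [1, 11, 11, 11, 11],
    [2, 12, 12, 12, 12],
], dtype=float)

# Loop as in the question: labels are appended, images are prepended.
loop_labels = np.empty((0,), dtype=float)
loop_images = np.empty((0, 2, 2), dtype=float)
for row in rows:
    loop_labels = np.append(loop_labels, row[0])
    image = np.reshape(row[1:], (1, 2, 2))
    loop_images = np.append(image, loop_images, axis=0)  # new image goes first

# Slicing keeps labels and images in the same order as the file.
sliced_labels = rows[:, 0]
sliced_images = rows[:, 1:].reshape(-1, 2, 2)

print(loop_labels)               # [0. 1. 2.]
print(loop_images[:, 0, 0])      # [12. 11. 10.]  reversed relative to labels
print(sliced_images[:, 0, 0])    # [10. 11. 12.]  matches the labels
```

A check like this on a few rows (or plotting an image next to its label) is a cheap way to confirm that the loader pairs each image with the right label before training.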

【Comments】: