Data augmentation in Keras for large datasets: answers

【Question title】: Data augmentation in Keras for large datasets
【Posted】: 2019-01-02 16:55:30
【Question】:

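I am using Keras to train an image classification model on about 50k images. Each image has three channels and is 150x150 pixels. Because the intensity differences between the three channels are small, I have to store the images as floats. I am training on a GPU, but my graphics card does not have much memory and I cannot afford to upgrade it. I also have to augment my dataset, because my training images do not cover all the rotations and translations that occur in my test dataset. I have written my own generator that splits the input images and labels into chunks, which are then fed to Keras's data augmentation routines and model.fit(). Here is my code: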

from __future__ import print_function
from keras.preprocessing.image import ImageDataGenerator
from keras.utils import np_utils
from keras.callbacks import Callback
from keras.callbacks import ModelCheckpoint
from keras.callbacks import ReduceLROnPlateau
from keras.callbacks import CSVLogger
from keras.callbacks import EarlyStopping, TensorBoard, LearningRateScheduler
from keras.optimizers import SGD, Adam, RMSprop
from keras import backend as K
import tensorflow as tf
from sklearn.model_selection import train_test_split

import numpy as np
import math
import myCNN # my own convolutional neural network

def myBatchGenerator(X_train_large, y_train_large, chunk_size):
    number_of_images = len(y_train_large)

    while True:
        batch_start = 0
        batch_end = chunk_size

        while batch_start < number_of_images:
            limit = min(batch_end, number_of_images)
            X = X_train_large[batch_start:limit,:,:,:]
            y = y_train_large[batch_start:limit,:]
            yield(X,y)

            batch_start += chunk_size
            batch_end += chunk_size

if __name__ == '__main__':
    input_image_shape = (150,150,3)
    # read input images and labels
    # X_train_large is an array of type float16          
    # y_train_large is an array of size number of images x number of classes 
    X_train_large, y_train_large = myFunctionToReadTrainingImagesAndLabels()

    # validation images: about 5000 images
    X_validation_large, y_validation_large = myFunctionToReadValidationImagesAndLabels()
    # create a stratified sample from the large training set: 100 samples from each class
    num_classes = y_train_large.shape[1]
    samples_per_class = 100
    y_train_large_vectors = [np.where(r == 1)[0][0] for r in y_train_large]
    unique, counts = np.unique(y_train_large_vectors, return_counts=True)

    X_train_sample = np.empty((samples_per_class*num_classes, 150, 150, 3))
    y_train_sample = np.empty((samples_per_class*num_classes, num_classes))

    # NOTE: this indexing assumes the images are sorted by class and that every
    # class occupies a block of np.max(counts) consecutive rows
    for idx in range(num_classes):
        start_idx_for_sample = samples_per_class*idx
        end_idx_for_sample = start_idx_for_sample+samples_per_class  # slice end is exclusive
        start_idx_for_large = np.max(counts)*idx
        end_idx_for_large = start_idx_for_large+samples_per_class

        X_train_sample[start_idx_for_sample:end_idx_for_sample,:,:,:] = X_train_large[start_idx_for_large:end_idx_for_large,:,:,:]
        y_train_sample[start_idx_for_sample:end_idx_for_sample,:] = y_train_large[start_idx_for_large:end_idx_for_large,:]

    # define augmentation needed for image data generator
    train_datagen = ImageDataGenerator(featurewise_center=False,  
                                       samplewise_center=False,  
                                       featurewise_std_normalization=False, 
                                       samplewise_std_normalization=False,  
                                       zca_whitening=False,  
                                       rotation_range=90,  
                                       width_shift_range=0.1,  
                                       height_shift_range=0.1, 
                                       horizontal_flip=True, 
                                       vertical_flip=True)  
                                       
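    # fit() is only required when featurewise_center, featurewise_std_normalization
    # or zca_whitening is enabled; with all three set to False it does not change the augmentation
    train_datagen.fit(X_train_sample)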
    
    # load my model
    model = myCNN.build_model(input_image_shape)
    sgd = SGD(lr=0.05,decay=10e-4,momentum=0.9)
    model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])
    
    for e in range(number_of_epochs):
       print('*********************epoch',e)
       # get 1000 images at a time from the input image set
       for X_train, y_train in myBatchGenerator(X_train_large, y_train_large,chunk_size=1000):
           # split it into batches of 32 images/labels and augment on the fly
           for X_batch, y_batch in train_datagen.flow(X_train_large,y_train_large,batch_size=32):
               # train
               model.fit(X_batch,y_batch,validation_data=(X_validation_large,y_validation_large))

    model.save('myCNN_trained_on_largedataset.h5')

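In short:
1. I create a stratified sample of the input images to fit the image data generator.
2. I split the input images into chunks of 1000 images and feed each chunk of 1000 images to the model in batches of 32.

So I train my model on 32 images at a time, augmenting them on the fly, and validate the model on about 5000 images.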
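Step 1 can be sketched with NumPy alone, without relying on the images being sorted by class. This is only an illustration: the helper name `stratified_sample`, its parameters, and the small synthetic arrays are assumptions made for the example, not part of the original code.

```python
import numpy as np

def stratified_sample(X, y_onehot, per_class, seed=0):
    """Pick `per_class` random examples of every class, given one-hot labels."""
    rng = np.random.default_rng(seed)
    labels = np.argmax(y_onehot, axis=1)            # one-hot rows -> class ids
    picked = []
    for cls in np.unique(labels):
        candidates = np.flatnonzero(labels == cls)  # row indices of this class
        picked.append(rng.choice(candidates, size=per_class, replace=False))
    idx = np.concatenate(picked)
    return X[idx], y_onehot[idx]

# Synthetic stand-in for the real data: 3 classes, 20 images each, 8x8x3 pixels.
labels = np.repeat(np.arange(3), 20)
y_demo = np.eye(3, dtype=np.float32)[labels]
X_demo = np.random.default_rng(1).random((60, 8, 8, 3)).astype(np.float16)

X_sample, y_sample = stratified_sample(X_demo, y_demo, per_class=5)
print(X_sample.shape)          # (15, 8, 8, 3)
print(y_sample.sum(axis=0))    # 5 examples in each of the 3 classes
```

Because the indices are looked up per class, the sample stays balanced even when the classes have different sizes in the full dataset, provided each class has at least `per_class` images.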

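My model is still running, but each batch of 32 images currently takes 30 seconds to process. That means a single epoch takes a very long time. I must be missing something here.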

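I have tested my CNN code on a smaller dataset and it works. So I know the problem is neither my function for reading the input images nor my CNN. I think it lies in how I split my data into chunks and batch it, but I cannot figure out where I am going wrong. Could you please guide me?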
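One cost in the loop above can be estimated with simple arithmetic. Each model.fit() call runs one epoch by default and, when validation_data is passed, evaluates the whole validation set at the end of that epoch. The sketch below assumes the rounded figures from the question (50,000 training images, 5,000 validation images, batches of 32) and a single pass over the training data. It is an estimate, not a measurement.

```python
import math

n_train = 50_000     # "about 50k" training images, per the question
n_val = 5_000        # "about 5000" validation images, per the question
batch_size = 32

# One model.fit() call per 32-image batch, each followed by a full validation run.
fit_calls_per_pass = math.ceil(n_train / batch_size)
val_images_per_pass = fit_calls_per_pass * n_val

print(fit_calls_per_pass)               # 1563
print(val_images_per_pass)              # 7815000
print(val_images_per_pass / n_train)    # 156.3
```

Under these assumptions, the network processes roughly 156 validation images for every training image, which would be consistent with a single 32-image batch taking far longer than expected.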

Thanks in advance for your time.

【Question comments】:

Tags: keras classification large-data

【Solution 1】:

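Why not use flow_from_directory() from the ImageDataGenerator class? It is built into Keras and is well suited to problems like yours.
Specifically, flow_from_directory pulls batches straight from your directory, and you can perform data augmentation on the fly. I can also point you to a couple of examples: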

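Building powerful image classification models using very little data. A Keras blog post about a problem much like yours, and very easy to read.
cifar10_cnn_tfaugment2d.py. A more advanced, ad hoc solution on TensorFlow that defines a dedicated augmentation layer. Very interesting, though!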
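A minimal sketch of the flow_from_directory() suggestion, reusing the augmentation settings from the question. It assumes a Keras version that still ships ImageDataGenerator (as the question's does) and a hypothetical folder layout with one sub-directory per class. The helper name `make_train_flow` and the `AUGMENTATION` dict are made up for this example.

```python
# Augmentation settings copied from the question, kept in a plain dict.
AUGMENTATION = dict(
    rotation_range=90,
    width_shift_range=0.1,
    height_shift_range=0.1,
    horizontal_flip=True,
    vertical_flip=True,
)

def make_train_flow(train_dir, batch_size=32):
    """Return an iterator that reads images from disk batch by batch.

    Assumed (hypothetical) layout: train_dir/<class_name>/<image files>.
    """
    # Imported here so the settings above can be inspected without Keras installed.
    from keras.preprocessing.image import ImageDataGenerator

    datagen = ImageDataGenerator(**AUGMENTATION)
    return datagen.flow_from_directory(
        train_dir,
        target_size=(150, 150),     # images are resized to this size on load
        batch_size=batch_size,
        class_mode='categorical',   # one-hot labels inferred from folder names
    )
```

The returned iterator loops indefinitely, so it is normally passed to model.fit_generator() (or model.fit() in later Keras versions) together with steps_per_epoch. Only one batch is held in memory at a time, which is the main attraction when the dataset does not fit in RAM.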

I think this is enough to get your network running ;).
Hope it helps, and good luck!

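【Comments】:

Thanks. I looked at the links you posted. They are good examples, but they would not work for me without major modification. Although my images have 3 channels, they are not RGB images. I read each channel separately and create a 3-D matrix. So my options are either to modify Keras's flow_from_directory to read 3 images and build the matrix myself, or to read the images, augment them, and save the augmented images to a folder (as in the CIFAR example you linked). I would like to avoid both options, so I chose the approach suggested in github.com/keras-team/keras/issues/68.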
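For data like this, where each channel is read separately, the array construction and batching can be sketched in NumPy. This is not the method from the linked issue, only an illustration under stated assumptions: the helper names `stack_channels` and `one_pass_batches` and the synthetic arrays are hypothetical. Unlike ImageDataGenerator.flow(), which loops indefinitely, the generator below stops after one pass, so it can sit directly inside an epoch loop.

```python
import numpy as np

def stack_channels(ch_a, ch_b, ch_c, dtype=np.float16):
    """Combine three (N, H, W) single-channel arrays into one (N, H, W, 3) array."""
    return np.stack([ch_a, ch_b, ch_c], axis=-1).astype(dtype)

def one_pass_batches(X, y, batch_size=32, seed=None):
    """Yield shuffled (X_batch, y_batch) pairs that cover the data exactly once."""
    order = np.random.default_rng(seed).permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], y[idx]

# Synthetic stand-in: 70 images of 150x150 per channel, 12 classes.
rng = np.random.default_rng(0)
channels = [rng.random((70, 150, 150)) for _ in range(3)]
X = stack_channels(*channels)
y = np.eye(12, dtype=np.float32)[rng.integers(0, 12, size=70)]

batch_shapes = [xb.shape for xb, _ in one_pass_batches(X, y, batch_size=32, seed=0)]
print(batch_shapes)   # two full batches of 32 and a final batch of 6
```

Each yielded batch could then be augmented image by image (for example with ImageDataGenerator.random_transform) and passed to model.train_on_batch(), with validation run once per epoch rather than once per batch.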