Keras flowFromDirectory: getting file names as they are generated (answers)

【Question title】: Keras flowFromDirectory get file names as they are being generated
【Posted】: 2017-06-02 13:43:55
【Question】:

Is it possible to get the names of the files loaded by flow_from_directory? I have:

datagen = ImageDataGenerator(
    rotation_range=3,
#     featurewise_std_normalization=True,
    fill_mode='nearest',
    width_shift_range=0.2,
    height_shift_range=0.2,
    horizontal_flip=True
)

train_generator = datagen.flow_from_directory(
        path+'/train',
        target_size=(224, 224),
        batch_size=batch_size,)

My multi-output model uses a custom generator, for example:

a = np.arange(8).reshape(2, 4)
# print(a)

print(train_generator.filenames)

def generate():
    while 1:
        x,y = train_generator.next()
        yield [x] ,[a,y]

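Note: for now I am generating dummy numbers for a, but for real training I want to load a JSON file containing the bounding-box coordinates of my images. To do that, I need the names of the files that produced the batch returned by train_generator.next(). With those I can load the file, parse the JSON, and pass the result instead of a. The order of the images in x must also match the order of the file names I get.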
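The JSON-loading step described above could look roughly like the sketch below. It assumes one annotation file per image, named after the image, with a top-level "bbox" key. The helper name load_boxes, the annotation_dir layout, and the "bbox" key are all hypothetical; the question does not specify any of them.

```python
import json
import os


def load_boxes(filenames, annotation_dir):
    """Return one bounding box per file name, in the same order as `filenames`.

    Assumption (hypothetical layout): for an image "cats/001.jpg" there is a
    file "<annotation_dir>/001.json" containing {"bbox": [x, y, w, h]}.
    """
    boxes = []
    for name in filenames:
        # Strip the class sub-directory and the image extension.
        stem = os.path.splitext(os.path.basename(name))[0]
        with open(os.path.join(annotation_dir, stem + ".json")) as fh:
            boxes.append(json.load(fh)["bbox"])
    return boxes
```

Once the file names of the current batch are known, the returned list (converted with np.array) could take the place of a in generate().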

【Comments】:

With stock Keras alone this is not possible, but you can modify the Keras code to do it.
Have you seen my answer?

Tags: python machine-learning neural-network keras

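【Solution 1】:

Yes, it is possible, at least in version 2.0.4 (I don't know about earlier versions).

The object returned by ImageDataGenerator().flow_from_directory(...) has an attribute filenames, a list of all files in the order the generator yields them, and an attribute batch_index. So you can do this: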

datagen = ImageDataGenerator()
gen = datagen.flow_from_directory(...)

On each iteration of the generator you can get the corresponding file names like this:

for i in gen:
    idx = (gen.batch_index - 1) * gen.batch_size
    print(gen.filenames[idx : idx + gen.batch_size])

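This gives you the file names of the images in the current batch.

【Discussion】:

Note that this does not work when shuffle is True (the default). You will always get the file names in the order in which they were first read from disk, not in the order in which the generator returns them.
@AlexGuth What should be done when shuffle=True?
The generator call that returns the last batch resets batch_index to 0, so (gen.batch_index - 1) becomes -1, idx is negative, and the last batch is filtered out completely.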
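The wrap-around described in the last comment can be illustrated without Keras. The sketch below is a simplified pure-Python model, not Keras code: it assumes only that batch_index is incremented after each batch and reset to 0 after the last one, as the comment states. The function name and the sample file names are made up for illustration.

```python
def batch_index_after_each_call(n_samples, batch_size):
    """Simplified model of how batch_index evolves over one epoch.

    Returns the value batch_index holds right after each batch is produced.
    """
    values = []
    batch_index = 0
    while True:
        start = batch_index * batch_size
        if n_samples > start + batch_size:
            batch_index += 1      # more batches follow
        else:
            batch_index = 0       # last batch: the counter wraps around
        values.append(batch_index)
        if batch_index == 0:
            return values


filenames = ["img%d.jpg" % i for i in range(10)]  # hypothetical file names
batch_size = 4
slices = []
for batch_index in batch_index_after_each_call(len(filenames), batch_size):
    idx = (batch_index - 1) * batch_size          # formula from the answer
    slices.append(filenames[idx: idx + batch_size])
```

With 10 files and a batch size of 4, the third slice comes out empty because idx is negative while idx + batch_size is 0, which is the lost last batch the comment refers to.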

【Solution 2】:

You can create a very small subclass of DirectoryIterator that returns a tuple of (image batch, file paths):

import numpy as np
from keras.preprocessing.image import ImageDataGenerator, DirectoryIterator

class ImageWithNames(DirectoryIterator):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.filenames_np = np.array(self.filepaths)
        self.class_mode = None # so that we only get the images back

    def _get_batches_of_transformed_samples(self, index_array):
        return (super()._get_batches_of_transformed_samples(index_array),
                self.filenames_np[index_array])

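In the initializer I added an attribute that is a NumPy version of self.filepaths, so that we can index into that array directly to get the paths for each generated batch.

The only change to the base class is that it returns a tuple: the image batch, super()._get_batches_of_transformed_samples(index_array), and the file paths, self.filenames_np[index_array].

With that, you can build your generator like this: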

imagegen = ImageDataGenerator()
datagen = ImageWithNames('/data/path', imagegen, target_size=(224,224))

Then check it with:

next(datagen)

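【Discussion】:

Excellent answer. A few small suggestions: the class name in the example does not match; it should be "ImageWithNames". The example could also include subset="validation", shuffle=False, in case it is not clear to readers where those should go. For those using the Keras bundled with TensorFlow, the import would be from tensorflow.keras.preprocessing.... And perhaps show data_batch, filenames = next(datagen) for checking, in case that is not obvious.
This is the right (or more Pythonic) way to do it, IMO. Thanks!

【Solution 3】:

Here is an example that also works with shuffle=True and handles the last batch correctly. One pass over the data: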

datagen = ImageDataGenerator().flow_from_directory(...)    
batches_per_epoch = datagen.samples // datagen.batch_size + (datagen.samples % datagen.batch_size > 0)
for i in range(batches_per_epoch):
    batch = next(datagen)
    current_index = ((datagen.batch_index-1) * datagen.batch_size)
    if current_index < 0:
        if datagen.samples % datagen.batch_size > 0:
            current_index = max(0,datagen.samples - datagen.samples % datagen.batch_size)
        else:
            current_index = max(0,datagen.samples - datagen.batch_size)
    index_array = datagen.index_array[current_index:current_index + datagen.batch_size].tolist()
    img_paths = [datagen.filepaths[idx] for idx in index_array]
    #batch[0] - x, batch[1] - y, img_paths - absolute path
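The index arithmetic in this answer can be checked in isolation. The sketch below restates the two calculations as plain functions; the function names are made up for illustration and are not part of Keras.

```python
import math


def batches_per_epoch(samples, batch_size):
    # Full batches, plus one partial batch if the division leaves a remainder.
    return samples // batch_size + (samples % batch_size > 0)


def last_batch_start(samples, batch_size):
    # Start offset of the final batch, mirroring the `current_index < 0` branch.
    remainder = samples % batch_size
    if remainder > 0:
        return max(0, samples - remainder)
    return max(0, samples - batch_size)


# The integer expression is a ceiling division.
same_as_ceil = batches_per_epoch(10, 4) == math.ceil(10 / 4)
```

For 10 samples and a batch size of 4 there are 3 batches and the last one starts at offset 8; for 8 samples it starts at offset 4.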

【Discussion】:

【Solution 4】:

At least in version 2.2.4, you can do this:

datagen = ImageDataGenerator()
gen = datagen.flow_from_directory(...)
for file in gen.filenames:
    print(file)

Or, to get the file paths:

for filepath in gen.filepaths:
    print(filepath)

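【Discussion】:

This solution has the problem mentioned above: with shuffle=True, the file names do not match the files in the batch.

【Solution 5】:

The code below may help. It overrides flow_from_directory: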

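from keras.preprocessing.image import ImageDataGenerator

class AugmentingDataGenerator(ImageDataGenerator):
    def flow_from_directory(self, directory, *args, **kwargs):
        # class_mode=None makes the underlying iterator return image batches only.
        generator = super().flow_from_directory(directory, *args, class_mode=None, **kwargs)
        while True:
            # batch_index still points at the batch that next() is about to return.
            # The slice below matches the images only when shuffle=False.
            start = generator.batch_index * generator.batch_size
            # Get augmented image samples
            images = next(generator)
            image_paths = generator.filepaths[start:start + len(images)]
            yield images, image_paths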

# Create training generator
train_datagen = AugmentingDataGenerator(  
    rotation_range=10,
    width_shift_range=0.1,
    height_shift_range=0.1,
    rescale=1./255,
    horizontal_flip=True
)
train_generator = train_datagen.flow_from_directory(
    TRAIN_DIRECTORY_PATH, 
    target_size=(256, 256),
    shuffle = False,
    batch_size=BATCH_SIZE
)

# Create testing generator
test_datagen = AugmentingDataGenerator(rescale=1./255)
test_generator = test_datagen.flow_from_directory(
    TEST_DIRECTORY_PATH,  
    target_size=(256, 256),
    shuffle = False, # in order to get the image path of the same image
    batch_size=BATCH_SIZE 
)

Then check the images and file paths that come back:

image,file_path = next(test_generator)
# print(file_path)
# plt.imshow(image)

【Discussion】:

【Solution 6】:

I needed exactly this, so I wrote a simple function that works with either shuffle=True or shuffle=False.

def get_indices_from_keras_generator(gen, batch_size):
    """
    Given a Keras data generator, return the indices and the file names
    corresponding to the current batch.
    :param gen: Keras generator.
    :param batch_size: size of the last batch generated, e.g. len(x).
    :return: tuple with indices and filenames
    """

    idx_left = (gen.batch_index - 1) * batch_size
    idx_right = idx_left + gen.batch_size if idx_left >= 0 else None
    indices = gen.index_array[idx_left:idx_right]
    filenames = [gen.filenames[i] for i in indices]
    return indices, filenames

You would then use it as follows:

for x, y in gen:
    indices, filenames = get_indices_from_keras_generator(gen, len(x))

【Discussion】:

You need to supply a batch_size when calling it. Something like: for x, y in gen: indices, filenames = get_indices_from_keras_generator(gen, gen.batch_size)
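Which size is passed matters when the last batch of an epoch is smaller than batch_size. The sketch below reproduces the answer's slicing against a stand-in object rather than a real Keras iterator; fake_gen, its values, and the function name current_batch_indices are made up for illustration.

```python
from types import SimpleNamespace


def current_batch_indices(gen, last_batch_len):
    # Same slicing idea as the answer's function.
    left = (gen.batch_index - 1) * gen.batch_size
    if left >= 0:
        return list(gen.index_array[left:left + gen.batch_size])
    # batch_index wrapped to 0: the batch just returned is the tail of index_array.
    return list(gen.index_array[-last_batch_len:])


# Stand-in for the state right after the final, partial batch of 10 shuffled samples.
fake_gen = SimpleNamespace(
    batch_index=0,
    batch_size=4,
    index_array=[7, 2, 9, 0, 5, 1, 8, 3, 6, 4],
)
with_len_x = current_batch_indices(fake_gen, 2)  # size of the 2-image batch
with_batch_size = current_batch_indices(fake_gen, fake_gen.batch_size)
```

In this stand-in, passing the actual batch length selects the two indices of the final batch, while passing gen.batch_size selects four, two of which belong to the previous batch. When the number of samples is a multiple of batch_size the two calls agree.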