Why does my model, which uses TensorFlow and Keras on the GPU, get an OOM error?

【Question title】: Why does my model, which uses TensorFlow and Keras on the GPU, get an OOM error?
【Posted】: 2021-06-02 22:24:25
【Question】:

I am trying to run my model, but I get the following error at runtime:

2021-06-03 01:20:42.015864: W tensorflow/core/common_runtime/bfc_allocator.cc:467] **************************************************************************__________________________
2021-06-03 01:20:42.015984: W tensorflow/core/framework/op_kernel.cc:1767] OP_REQUIRES failed at concat_op.cc:158 : Resource exhausted: OOM when allocating tensor with shape[8938,46080] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc

tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[8938,46080] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:ConcatV2] name: concat

My code:

import numpy as np
import tensorflow as tf
from cv2 import cv2
from keras.applications.densenet import preprocess_input
from tensorflow import keras
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.optimizers import Adam, SGD, RMSprop
from tensorflow.keras.metrics import categorical_crossentropy
from tensorflow.keras.preprocessing.image import ImageDataGenerator
import matplotlib.pyplot as plt
from tensorflow.keras.preprocessing import image
from tensorflow.keras.models import Model
from tensorflow.keras.regularizers import l2
from tensorflow.keras.layers import MaxPool2D, MaxPool3D, GlobalAveragePooling2D, Reshape, GlobalMaxPooling2D, MaxPooling2D, Flatten, AveragePooling2D

# physical_devices = tf.config.experimental.list_physical_devices('GPU')
# print("Num GPU Available", len(physical_devices))
# tf.config.experimental.set_memory_growth(physical_devices[0], True)

train_path = 'data/train'
test_path = 'data/test'
batch_size = 16
image_size = (360, 360)

train_batches = ImageDataGenerator(
    preprocessing_function=preprocess_input,
    # rescale=1./255,
    horizontal_flip=True,
    rotation_range=.3,
    width_shift_range=.2,
    height_shift_range=.2,
    zoom_range=.2
).flow_from_directory(directory=train_path,
                      target_size=image_size,
                      color_mode='rgb',
                      batch_size=batch_size,
                      shuffle=True)

test_batches = ImageDataGenerator(
    preprocessing_function=preprocess_input
    # rescale=1./255
).flow_from_directory(directory=test_path,
                      target_size=image_size,
                      color_mode='rgb',
                      batch_size=batch_size,
                      shuffle=True)

# mobile = tf.keras.applications.mobilenet.MobileNet()
mobile = tf.keras.applications.mobilenet_v2.MobileNetV2(include_top=False, weights='imagenet', input_shape=(360, 360, 3))

x = MaxPool2D()(mobile.layers[-1].output)
x = Flatten()(x)
model = Model(inputs=mobile.input, outputs=x)

train_features = model.predict(train_batches, train_batches.labels)
test_features = model.predict(test_batches, test_batches.labels)

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

train_scaled = scaler.fit_transform(train_features)
test_scaled = scaler.fit_transform(test_features)

from sklearn.svm import SVC
svm = SVC()

svm.fit(train_scaled, train_batches.labels)

print('train accuracy:')
print(svm.score(train_scaled, train_batches.labels))
print('test accuracy:')
print(svm.score(test_scaled, test_batches.labels))

【Comments】:

This link may help: github.com/tensorflow/tensorflow/issues/16768

Tags: python tensorflow machine-learning keras deep-learning

【Answer 1】:

If reducing batch_size does not solve the problem, you can enable CUDA unified memory (see here).

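【Comments】:

I tried to enable CUDA unified memory by uncommenting these lines: # physical_devices = tf.config.experimental.list_physical_devices('GPU') # print("Num GPU Available", len(physical_devices)) # tf.config.experimental.set_memory_growth(physical_devices[0], True). That did not solve the problem; it keeps crashing. Running on the CPU does not work either.
Setting batch_size to 8 does not work either.
@bSwizzle Can you try an input shape that matches your pretrained model? I have not tried this myself, so please take a look. The input shape to specify should be (224, 224, 3) rather than input_shape=(360, 360, 3). tensorflow.org/api_docs/python/tf/keras/applications/…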
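
The effect of the input shape on memory can be estimated with some arithmetic. The sketch below assumes that MobileNetV2 at its default width downsamples by a factor of 32 (rounding up) and ends with 1280 channels, and that the default MaxPool2D() then halves each side (rounding down). These assumptions reproduce the 46080 columns reported in the error for a 360x360 input. The helper name feature_size is made up for illustration.

```python
import math

def feature_size(side, stride=32, channels=1280, pool=2):
    """Length of the flattened feature vector for a square input of `side` pixels."""
    conv_side = math.ceil(side / stride)   # assumed MobileNetV2 output side (rounds up)
    pooled_side = conv_side // pool        # MaxPool2D() default: 2x2 window, 'valid' padding
    return pooled_side * pooled_side * channels

n_images = 8938          # first dimension of the tensor in the error message
bytes_per_float = 4      # the log says "type float", i.e. float32

for side in (360, 224):
    n_features = feature_size(side)
    size_gb = n_images * n_features * bytes_per_float / 1e9
    print(f"{side}x{side}: {n_features} features per image, {size_gb:.2f} GB for all images")
```

Under these assumptions, a 224x224 input would shrink the combined feature matrix to a quarter of its size, roughly 0.41 GB instead of 1.65 GB.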

【Answer 2】:

This error means you have run out of memory. Try lowering the value of batch_size.
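
One possible reason lowering batch_size may not help here: the tensor in the error message has shape [8938, 46080], which contains no batch dimension. It looks like the full feature matrix (presumably one row per training image, about 1.65 GB as float32) being concatenated on the GPU after prediction. A possible workaround, sketched below, is to predict one batch at a time and join the results in host memory with NumPy. The function name extract_features is hypothetical; it only assumes that model has a predict_on_batch method, as Keras models do, and that batches supports len() and indexing and returns (images, labels) pairs, as the iterator returned by flow_from_directory does.

```python
import numpy as np

def extract_features(model, batches):
    """Run `model` one batch at a time and stack the results in host (CPU) memory."""
    features, labels = [], []
    for i in range(len(batches)):
        x, y = batches[i]                                        # one (images, labels) pair
        features.append(np.asarray(model.predict_on_batch(x)))   # copy this batch to the host
        labels.append(np.asarray(y))
    # The concatenation happens in NumPy, so the big matrix is never built on the GPU.
    return np.concatenate(features), np.concatenate(labels)
```

With the objects from the question this might be called as train_features, train_labels = extract_features(model, train_batches). Because the labels are collected alongside the images, they stay aligned with the features even when shuffle=True. With the default class_mode='categorical' they are one-hot, so train_labels.argmax(axis=1) would give the class indices that SVC expects. The result still takes the same amount of memory, only in host RAM rather than on the GPU.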

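【Comments】:

I have already tried that, and it does not seem to work. I do not really understand the problem, but a friend tried running the model on their laptop and it works, even though their specs are lower than mine. I just hope my GPU is not faulty or something.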