【Question title】: Weird NaN loss for custom Keras loss
【Posted】: 2018-06-25 11:19:38
【Question】:

I am trying to implement a custom loss in Keras, but I can't get it to work.

I have implemented it both in numpy and with keras.backend:

import numpy as np
from keras import backend as K

def log_rmse_np(y_true, y_pred):
    d_i = np.log(y_pred) -  np.log(y_true)
    loss1 = (np.sum(np.square(d_i))/np.size(d_i))
    loss2 = ((np.square(np.sum(d_i)))/(2 * np.square(np.size(d_i))))
    loss = loss1 - loss2
    print('np_loss =  %s - %s = %s'%(loss1, loss2, loss))
    return loss

def log_rmse(y_true, y_pred):
    d_i = (K.log(y_pred) -  K.log(y_true))
    loss1 = K.mean(K.square(d_i))
    loss2 = K.square(K.sum(K.flatten(d_i),axis=-1))/(K.cast_to_floatx(2) * K.square(K.cast_to_floatx(K.int_shape(K.flatten(d_i))[0])))
    loss = loss1 - loss2
    return loss
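
Written out, the two functions above are meant to compute the following quantity, where d_i is the element-wise log difference and n is the total number of elements in d_i. The symbols L and n are introduced here only for illustration; they do not appear in the original post.

```latex
d_i = \log \hat{y}_i - \log y_i, \qquad
L = \frac{1}{n}\sum_{i=1}^{n} d_i^{2} \;-\; \frac{1}{2n^{2}}\Bigl(\sum_{i=1}^{n} d_i\Bigr)^{2}
```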

When I test and compare the two losses with the following function, everything seems to work fine.

def check_loss(_shape):
    if _shape == '2d':
        shape = (6, 7)
    elif _shape == '3d':
        shape = (5, 6, 7)
    elif _shape == '4d':
        shape = (8, 5, 6, 7)
    elif _shape == '5d':
        shape = (9, 8, 5, 6, 7)

    y_a = np.random.random(shape)
    y_b = np.random.random(shape)

    out1 = K.eval(log_rmse(K.variable(y_a), K.variable(y_b)))
    out2 = log_rmse_np(y_a, y_b)

    print('shapes:', str(out1.shape), str(out2.shape))
    print('types: ', type(out1), type(out2))
    print('log_rmse:    ', np.linalg.norm(out1))
    print('log_rmse_np: ', np.linalg.norm(out2))
    print('difference:  ', np.linalg.norm(out1-out2))
    assert out1.shape == out2.shape
    #assert out1.shape == shape[-1]

def test_loss():
    shape_list = ['2d', '3d', '4d', '5d']
    for _shape in shape_list:
        check_loss(_shape)
        print ('======================')

test_loss()

The code above prints:

np_loss =  1.34490449177 - 0.000229461787517 = 1.34467502998
shapes: () ()
types:  <class 'numpy.float32'> <class 'numpy.float64'>
log_rmse:     1.34468
log_rmse_np:  1.34467502998
difference:   3.41081509703e-08
======================
np_loss =  1.68258448859 - 7.67580654591e-05 = 1.68250773052
shapes: () ()
types:  <class 'numpy.float32'> <class 'numpy.float64'>
log_rmse:     1.68251
log_rmse_np:  1.68250773052
difference:   1.42057615005e-07
======================
np_loss =  1.99736933814 - 0.00386228512295 = 1.99350705302
shapes: () ()
types:  <class 'numpy.float32'> <class 'numpy.float64'>
log_rmse:     1.99351
log_rmse_np:  1.99350705302
difference:   2.53924863358e-08
======================
np_loss =  1.95178217182 - 1.60006871892e-05 = 1.95176617114
shapes: () ()
types:  <class 'numpy.float32'> <class 'numpy.float64'>
log_rmse:     1.95177
log_rmse_np:  1.95176617114
difference:   3.78277884572e-08
======================

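When I compile and fit my model with this loss I never get an exception, and when I run the model with 'adam' everything works fine. With this loss, however, Keras keeps reporting a NaN loss: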

Epoch 1/10000
 17/256 [>.............................] - ETA: 124s - loss: nan

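I'm kind of stuck here... Am I doing something wrong?

Using TensorFlow 1.4 on Ubuntu 16.04.

Update:

Following Marcin Możejko's suggestion I updated the code, but unfortunately the training loss is still NaN: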

def get_log_rmse(normalization_constant):
    def log_rmse(y_true, y_pred):
        d_i = (K.log(y_pred) -  K.log(y_true))
        loss1 = K.mean(K.square(d_i))
        loss2 = K.square(K.sum(K.flatten(d_i),axis=-1))/K.cast_to_floatx(2 * normalization_constant ** 2)
        loss = loss1 - loss2
        return loss
    return log_rmse

The model is then compiled with:

model.compile(optimizer='adam', loss=get_log_rmse(batch_size))

Update 2:

The model summary looks like this:

Layer (type)                 Output Shape              Param #   
=================================================================
input_2 (InputLayer)         (None, 160, 256, 3)       0         
_________________________________________________________________
block1_conv1 (Conv2D)        (None, 160, 256, 64)      1792      
_________________________________________________________________
block1_conv2 (Conv2D)        (None, 160, 256, 64)      36928     
_________________________________________________________________
block1_pool (MaxPooling2D)   (None, 80, 128, 64)       0         
_________________________________________________________________
block2_conv1 (Conv2D)        (None, 80, 128, 128)      73856     
_________________________________________________________________
block2_conv2 (Conv2D)        (None, 80, 128, 128)      147584    
_________________________________________________________________
block2_pool (MaxPooling2D)   (None, 40, 64, 128)       0         
_________________________________________________________________
block3_conv1 (Conv2D)        (None, 40, 64, 256)       295168    
_________________________________________________________________
block3_conv2 (Conv2D)        (None, 40, 64, 256)       590080    
_________________________________________________________________
block3_conv3 (Conv2D)        (None, 40, 64, 256)       590080    
_________________________________________________________________
block3_conv4 (Conv2D)        (None, 40, 64, 256)       590080    
_________________________________________________________________
block3_pool (MaxPooling2D)   (None, 20, 32, 256)       0         
_________________________________________________________________
block4_conv1 (Conv2D)        (None, 20, 32, 512)       1180160   
_________________________________________________________________
block4_conv2 (Conv2D)        (None, 20, 32, 512)       2359808   
_________________________________________________________________
block4_conv3 (Conv2D)        (None, 20, 32, 512)       2359808   
_________________________________________________________________
block4_conv4 (Conv2D)        (None, 20, 32, 512)       2359808   
_________________________________________________________________
block4_pool (MaxPooling2D)   (None, 10, 16, 512)       0         
_________________________________________________________________
conv2d_transpose_5 (Conv2DTr (None, 10, 16, 128)       1048704   
_________________________________________________________________
up_sampling2d_5 (UpSampling2 (None, 20, 32, 128)       0         
_________________________________________________________________
conv2d_transpose_6 (Conv2DTr (None, 20, 32, 64)        131136    
_________________________________________________________________
up_sampling2d_6 (UpSampling2 (None, 40, 64, 64)        0         
_________________________________________________________________
conv2d_transpose_7 (Conv2DTr (None, 40, 64, 32)        32800     
_________________________________________________________________
up_sampling2d_7 (UpSampling2 (None, 80, 128, 32)       0         
_________________________________________________________________
conv2d_transpose_8 (Conv2DTr (None, 80, 128, 16)       8208      
_________________________________________________________________
up_sampling2d_8 (UpSampling2 (None, 160, 256, 16)      0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 160, 256, 1)       401       
=================================================================
Total params: 11,806,401
Trainable params: 11,806,401
Non-trainable params: 0

Update 3:

Sample y_true: (image not available)

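【Comments】:

  • This is probably due to the log function: if y_pred or y_true is 0, you are trying to compute log(0), which is -inf, and np.log(0) - np.log(0) gives nan.
  • Good point, but I don't think that is the root of the problem: the data lies between 0 and 1, and the NaN loss persists after adding 1 to both y_true and y_pred.

Tags: python python-3.x tensorflow machine-learning keras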
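
The point made in the first comment can be reproduced in plain NumPy. The snippet below is only an illustration. It also shows the related case of a negative argument, which is an addition here and was not raised in the comments.

```python
import numpy as np

# Silence the expected divide-by-zero / invalid-value warnings for this demo.
with np.errstate(divide="ignore", invalid="ignore"):
    log_zero = np.log(0.0)                      # -inf
    diff_of_zeros = np.log(0.0) - np.log(0.0)   # -inf - (-inf) is nan
    log_negative = np.log(-0.5)                 # log of a negative number is nan

print(log_zero, diff_of_zeros, log_negative)
```

Once a single element of d_i is nan, every sum or mean that includes it is nan as well, so one bad pixel is enough to make the whole batch loss nan.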


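【Solution 1】:

The problem is in this part:

K.cast_to_floatx(K.int_shape(K.flatten(d_i))[0])

The loss function is compiled before any shape is provided, so this expression evaluates to None, which is the source of your error. I tried setting batch_input_shape instead of input_shape, but that did not work either (probably because of the way Keras compiles models). I suggest setting this number as a constant in the following way: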

def get_log_rmse(normalization_constant):
    def log_rmse(y_true, y_pred):
        d_i = (K.log(y_pred) -  K.log(y_true))
        loss1 = K.mean(K.square(d_i))
        loss2 = K.square(
            K.sum(
                K.flatten(d_i), axis=-1)) / K.cast_to_floatx(
                    2 * normalization_constant ** 2)
        loss = loss1 - loss2
        return loss
    return log_rmse

Then compile:

model.compile(..., loss=get_log_rmse(normalization_constant))

I guess normalization_constant equals batch_size, but I'm not sure, so I made it generic.
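
As a rough check of what the constant should be, here is a NumPy sketch of the same closure pattern. The names get_log_rmse_np and log_rmse_reference and the shapes are hypothetical and used only for illustration. In the NumPy version from the question, the denominator uses np.size(d_i), the total number of elements, so for image-shaped targets the constant would have to be batch_size * height * width rather than batch_size alone if the two versions are meant to agree.

```python
import numpy as np

def get_log_rmse_np(normalization_constant):
    """Return a NumPy loss that uses a fixed element count."""
    def log_rmse(y_true, y_pred):
        d_i = np.log(y_pred) - np.log(y_true)
        loss1 = np.mean(np.square(d_i))
        loss2 = np.square(np.sum(d_i)) / (2.0 * normalization_constant ** 2)
        return loss1 - loss2
    return log_rmse

def log_rmse_reference(y_true, y_pred):
    """Same formula with n taken from the data, as in log_rmse_np."""
    d_i = np.log(y_pred) - np.log(y_true)
    n = d_i.size
    return np.sum(np.square(d_i)) / n - np.square(np.sum(d_i)) / (2.0 * n ** 2)

batch_size, height, width = 4, 6, 7          # hypothetical shapes
n_elements = batch_size * height * width
y_true = np.linspace(0.1, 1.0, n_elements).reshape(batch_size, height, width)
y_pred = np.ones_like(y_true)

full = get_log_rmse_np(n_elements)(y_true, y_pred)      # constant = all elements
batch_only = get_log_rmse_np(batch_size)(y_true, y_pred)  # constant = batch size only
reference = log_rmse_reference(y_true, y_pred)

print(full, batch_only, reference)
```

With the full element count the closure matches the reference. With batch_size alone the second term is far too large and the loss goes strongly negative, which is a different problem from NaN but worth ruling out.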

Update 2 (from the question):

The model definition looks like this:

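# Imports assumed here; they are not shown in the original post.
# Deconv is presumably Conv2DTranspose, matching the conv2d_transpose_* layers in the summary.
import keras
from keras.models import Model
from keras.layers import Conv2D, UpSampling2D
from keras.layers import Conv2DTranspose as Deconv

input_shape = (160, 256, 3)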
print('Input_shape: %s'%str(input_shape))
base_model = keras.applications.vgg19.VGG19(include_top=False, weights='imagenet', 
                               input_tensor=None, input_shape=input_shape, 
                               pooling=None, # None, 'avg', 'max'
                               classes=1000)
for i in range(5):
    base_model.layers.pop()
base_model = Model(inputs=base_model.input, outputs=base_model.get_layer('block4_pool').output)
print('VGG19 output_shape: ' + str(base_model.output_shape))

x = Deconv(128, kernel_size=(4, 4), strides=1, padding='same', activation='relu')(base_model.output)
x = UpSampling2D((2, 2))(x)
x = Deconv(64, kernel_size=(4, 4), strides=1, padding='same', activation='relu')(x)
x = UpSampling2D((2, 2))(x)
x = Deconv(32, kernel_size=(4, 4), strides=1, padding='same', activation='relu')(x)
x = UpSampling2D((2, 2))(x)
x = Deconv(16, kernel_size=(4, 4), strides=1, padding='same', activation='relu')(x)
x = UpSampling2D((2, 2))(x)
x = Conv2D(1, kernel_size=(5, 5), strides=1, padding='same')(x)
model = Model(inputs=base_model.input, outputs=x)
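
The final Conv2D layer above has no activation, so nothing constrains its output to be positive, and the log of a non-positive prediction is not finite. Whether this is what happens in the original training run is not confirmed anywhere in the thread. The NumPy sketch below only illustrates one possible guard: clipping both arguments to a small positive eps before taking the log. The function name safe_log_rmse_np and the eps value are assumptions. In Keras the same idea could be expressed with K.clip and K.epsilon().

```python
import numpy as np

def safe_log_rmse_np(y_true, y_pred, eps=1e-7):
    """Log-space loss with both inputs clipped to at least eps (assumed guard)."""
    y_true = np.clip(y_true, eps, None)
    y_pred = np.clip(y_pred, eps, None)
    d_i = np.log(y_pred) - np.log(y_true)
    n = d_i.size
    return np.mean(np.square(d_i)) - np.square(np.sum(d_i)) / (2.0 * n ** 2)

y_true = np.array([[0.2, 0.5], [0.0, 1.0]])     # contains an exact zero
y_pred = np.array([[0.3, -0.1], [0.4, 0.9]])    # contains a negative value

# Without the guard, log(-0.1) is nan and poisons the whole loss.
with np.errstate(divide="ignore", invalid="ignore"):
    unguarded = np.mean(np.square(np.log(y_pred) - np.log(y_true)))

guarded = safe_log_rmse_np(y_true, y_pred)
print(unguarded, guarded)
```

Clipping keeps the loss finite, but clipped elements contribute no gradient, so a positive output activation on the last layer may be a cleaner alternative.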

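【Comments】:

  • Thanks, that sounds reasonable! Unfortunately, I still get the NaN loss after implementing your suggestion. I updated the question to show the code you suggested.
  • Are you sure that neither y_true nor y_pred is equal to 0? In my case it works for random data.
  • If I replace the direct method call with out1 = K.eval(get_log_rmse(batch_size)(K.variable(y_a), K.variable(y_b))), it also works for the randomly generated data in the test method. The Keras fit() function still shows me NaN values... :( Also, if I add 1 (or 1000) to the inputs (y_true, y_pred), I still get NaN.
  • That's strange, because the input is a grayscale image scaled to 0-1.
  • Could you provide the model definition?

【Solution 2】:

Try fitting your model with a built-in loss for a few epochs, then compile your model again with your own loss. That might help.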
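
A minimal sketch of that two-stage procedure is shown below. The helper name warm_start_then_switch, its parameters, and the default values are assumptions made for illustration. It relies only on the Keras-style compile() and fit() methods of the model passed in. Recompiling a Keras model keeps its current weights, which is what makes the warm-up stage useful.

```python
def warm_start_then_switch(model, x, y, custom_loss,
                           warmup_loss="mean_squared_error",
                           warmup_epochs=3, epochs=10, batch_size=32):
    """Train briefly with a built-in loss, then recompile with the custom one.

    `model` is expected to expose Keras-style compile() and fit() methods.
    Recompiling keeps the learned weights but creates a fresh optimizer.
    """
    # Stage 1: a few epochs with a well-behaved built-in loss.
    model.compile(optimizer="adam", loss=warmup_loss)
    model.fit(x, y, epochs=warmup_epochs, batch_size=batch_size)

    # Stage 2: switch to the custom loss and continue from the current weights.
    model.compile(optimizer="adam", loss=custom_loss)
    return model.fit(x, y, epochs=epochs, batch_size=batch_size)
```

With the loss factory from Solution 1, a call could look like warm_start_then_switch(model, x_train, y_train, get_log_rmse(normalization_constant)), where x_train and y_train stand for the training data.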

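【Solution 3】:

I ran into the same error with root mean square percentage error = K.sqrt(K.mean(K.square((y_true - y_pred) / y_true)))

Solution:
I removed the denominator and ran a few epochs. Then I stopped and ran again with the original equation. It started giving finite loss values.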
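
The answer does not say why the denominator caused trouble. One plausible reason, shown here only as an assumption, is a zero somewhere in y_true. The NumPy sketch below uses a hypothetical helper rmspe_np and made-up values to show how a single zero target makes the error non-finite.

```python
import numpy as np

def rmspe_np(y_true, y_pred):
    """Root mean square percentage error, as written in the answer above."""
    return np.sqrt(np.mean(np.square((y_true - y_pred) / y_true)))

y_pred = np.array([0.4, 0.6, 0.9])

# One target is exactly zero, so one term divides by zero.
with np.errstate(divide="ignore", invalid="ignore"):
    with_zero_target = rmspe_np(np.array([0.5, 0.0, 1.0]), y_pred)

# All targets are strictly positive, so every term is finite.
without_zero = rmspe_np(np.array([0.5, 0.5, 1.0]), y_pred)

print(with_zero_target, without_zero)
```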
