Tensorflow model.fit doesn't start

【Question title】: Tensorflow model.fit doesn't start
【Posted】: 2021-04-28 07:21:20
【Question】:

We have the following TensorFlow model-fitting code.

import tensorflow as tf

# get_data_and_labels() is defined elsewhere (not shown in the question)
data, labels, data_test, labels_test = get_data_and_labels()
model = tf.keras.models.Sequential(
    [
        tf.keras.layers.InputLayer(input_shape=(data.shape[1],)),
        tf.keras.layers.Dense(64),
        tf.keras.layers.Activation(tf.keras.activations.relu, name=f"relu1"),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(64),
        tf.keras.layers.Activation(tf.keras.activations.relu, name=f"relu2"),
        tf.keras.layers.Dropout(0.2),
        tf.keras.layers.Dense(1),
        tf.keras.layers.Activation(tf.keras.activations.tanh, name=f"tanh"),
    ]
)

optimizer = tf.optimizers.Adam(learning_rate=1e-5)
model.compile(
    optimizer,
    tf.keras.losses.mean_squared_error,
    metrics=['cosine_similarity', 'logcosh']
)

cp_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath="model_checkpoints/cp-{epoch:04d}.ckpt", save_weights_only=True, verbose=1
)

model.fit(
    data,
    labels,
    # batch_size=64,
    epochs=200,
    callbacks=[cp_callback],
    validation_data=(data_test, labels_test),
)

I'm trying to fit it without a GPU (we don't have one at the moment), but when I run the code, the only output I get is:

2021-04-28 10:06:58.123786: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-04-28 10:06:58.126582: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.

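It stays stuck like this for hours, with no other errors. We have about 1.4 million records and are running on Windows 10.

Is it stuck simply because there is no GPU? Or should we be doing something else? Any help would be much appreciated.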

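【Comments】:

Have you tested it in Colab?
Have you tested the code on a small sample of the data?
@jhmt, yes, I tried with 2K samples and it ran fine.
@M.Innat, I haven't used any cloud platform yet. We would like to avoid that.
Test your setup on your own system with a sample dataset such as mnist or cifar10 and see what it gives.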

Tags: python tensorflow keras neural-network

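【Solution 1】:

So the problem was not the actual training, but the splitting of the records into training and test sets.

This is how we did it originally: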

def split_train_test(data: pd.DataFrame, label: pd.DataFrame, test_part: float = 0.3):
    size = data.shape[0]
    test_size = int(size * test_part)

    all_indices = list(range(size))
    test_indices = list(np.random.choice(all_indices, size=test_size))
    test_indices.sort()
    train_indices = [i for i in all_indices if i not in test_indices]

    return (
        data.iloc[train_indices, :].to_numpy(),
        label.iloc[train_indices].to_numpy(),
        data.iloc[test_indices, :].to_numpy(),
        label.iloc[test_indices].to_numpy(),
    )
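
The slow step in the function above is most likely the list comprehension: `i not in test_indices` scans a Python list on every iteration, so the work grows roughly with the number of records times the number of test indices. Note also that `np.random.choice` samples with replacement by default, so the test indices can contain duplicates. The following is a minimal, stdlib-only sketch of the membership cost. The helper names and the size of 5,000 are hypothetical and chosen only so the comparison finishes quickly.

```python
import random
import time


def train_indices_list(size, test_indices):
    # Mirrors the original approach: each `not in` is a linear scan of a list.
    return [i for i in range(size) if i not in test_indices]


def train_indices_set(size, test_indices):
    # Same result, but each `not in` is a hash lookup in a set.
    test_set = set(test_indices)
    return [i for i in range(size) if i not in test_set]


size = 5_000  # hypothetical; far smaller than the 1.4 million records in the question
rng = random.Random(0)
# Sampling with replacement, like the default behaviour of np.random.choice.
test_indices = sorted(rng.choices(range(size), k=int(size * 0.3)))

start = time.perf_counter()
slow = train_indices_list(size, test_indices)
list_seconds = time.perf_counter() - start

start = time.perf_counter()
fast = train_indices_set(size, test_indices)
set_seconds = time.perf_counter() - start

assert slow == fast  # both variants select the same training indices
print(f"list membership: {list_seconds:.4f}s, set membership: {set_seconds:.4f}s")
```

Because the list-based cost scales with the product of the two sizes, going from a 2K sample to 1.4 million records multiplies the work by several hundred thousand, which would be consistent with the small test finishing quickly while the full run appeared to hang.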

This is how we do it now:

def split_train_test(df: pd.DataFrame, test_part: float = 0.3):
    df = df.sample(frac=1)
    size = df.shape[0]
    test_size = int(size * test_part)
    data, label = df.drop(PREDICTION_FIELD, axis=1), df[PREDICTION_FIELD]

    return (
        data.iloc[:-test_size, :].to_numpy(),
        label.iloc[:-test_size].to_numpy(),
        data.iloc[-test_size:, :].to_numpy(),
        label.iloc[-test_size:].to_numpy(),
    )

This solved the performance problem: the split no longer checks every index against a list of test indices.
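
The same shuffle-once-then-slice idea can also be expressed on index arrays with NumPy, which keeps `data` and `label` as separate objects if that is preferred. This is only a sketch under assumptions: the function name `split_indices` and its `seed` parameter are hypothetical and do not appear in the answer.

```python
import numpy as np


def split_indices(size, test_part=0.3, seed=None):
    """Return (train_idx, test_idx) as disjoint arrays of row positions."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(size)  # every index appears exactly once
    test_size = int(size * test_part)
    # Slice from the front so that test_size == 0 gives an empty test set.
    return shuffled[test_size:], shuffled[:test_size]


# Size taken from the question (about 1.4 million records); seed is arbitrary.
train_idx, test_idx = split_indices(1_400_000, test_part=0.3, seed=0)
print(len(train_idx), len(test_idx))
```

The resulting arrays could then be passed to `data.iloc[train_idx]` and `label.iloc[train_idx]` (and likewise for `test_idx`). Unlike sampling with replacement, a permutation guarantees that no record lands in both sets or appears twice in the test set.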
