【Question title】: Zombie processes while using use_multiprocessing=True in Keras model.fit()
【Posted】: 2022-07-04 18:00:57
【Question description】:

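I am getting zombie processes while training a neural network with the Keras model.fit() method. Because of the <defunct> processes, training never finishes, and all affected processes have to be killed with SIGKILL. Restarting the training script does not reliably reproduce the problem, and sometimes the run completes. The problem does not occur when multiprocessing is disabled: model.fit(use_multiprocessing=False)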

Here is the output of the ps aufx command:

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
...
root      774690  0.1  0.0  79032 70048 ?        Ss   Mai23  17:16 /usr/bin/python3 /usr/bin/tm legacy-worker run mlworker
root     1607844  0.0  0.0   2420   524 ?        SNs  Jun02   0:00  \_ /bin/sh -c /usr/bin/classifier-train
root     1607845 38.5  4.7 44686436 12505168 ?   SNl  Jun02 551:05      \_ /opt/venvs/classifier-training-repo/bin/python /usr/bin/classifier-train
root     1639337  0.0  3.7 43834076 10005208 ?   SN   Jun02   0:00          \_ /opt/venvs/classifier-training-repo/bin/python /usr/bin/classifier-train
root     1639339  0.0  0.0      0     0 ?        ZN   Jun02   0:00          \_ [classifier-train] <defunct>
root     1639341  0.0  0.0      0     0 ?        ZN   Jun02   0:00          \_ [classifier-train] <defunct>
root     1639343  0.0  0.0      0     0 ?        ZN   Jun02   0:00          \_ [classifier-train] <defunct>
root     1639345  0.0  0.0      0     0 ?        ZN   Jun02   0:00          \_ [classifier-train] <defunct>
root     1639347  0.0  0.0      0     0 ?        ZN   Jun02   0:00          \_ [classifier-train] <defunct>
root     1639349  0.0  0.0      0     0 ?        ZN   Jun02   0:00          \_ [classifier-train] <defunct>
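
In the STAT column above, Z marks a zombie (defunct) process, i.e. a child that has exited but has not yet been reaped by its parent, and N marks a low-priority (niced) process. The following is a minimal, Linux-only sketch of how such an entry arises and disappears. It assumes /proc is available, and the function names are illustrative and not part of the training script:

```python
import os
import time


def process_state(pid):
    """Return the one-letter state from /proc/<pid>/stat (Linux only), e.g. 'Z'."""
    with open(f"/proc/{pid}/stat") as stat_file:
        fields = stat_file.read()
    # The command name is wrapped in parentheses and may contain spaces,
    # so the state is the first field after the last ')'.
    return fields.rsplit(")", 1)[1].split()[0]


def make_zombie_and_reap(timeout=5.0):
    """Fork a child that exits at once; report its state before and after reaping."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately without running cleanup handlers

    # Parent: poll until the exited child shows up as a zombie.
    deadline = time.monotonic() + timeout
    state = process_state(pid)
    while state != "Z" and time.monotonic() < deadline:
        time.sleep(0.01)
        state = process_state(pid)

    os.waitpid(pid, 0)  # reaping removes the zombie from the process table
    still_listed = os.path.exists(f"/proc/{pid}")
    return state, still_listed


if __name__ == "__main__":
    print(make_zombie_and_reap())
```

A zombie holds no memory (VSZ and RSS are 0 in the listing) and goes away once its parent waits on it or the parent itself exits.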

Here are the relevant code snippets:

import numpy as np
from tensorflow import keras


def get_keras_model():
    # some code here
    model = keras.models.Model(
        inputs=(input_layer_1, input_layer_2),
        outputs=prediction_layer,
    )
    model.compile(loss=..., optimizer=..., metrics=...)
    return model


def preprocess(data):
    # Some code here to convert strings values into numpy arrays of dtype=np.uint32
    return X, y


class DataSequence(keras.utils.Sequence):
    def __init__(self, data, preprocess_func, keys, batch_size=4096):
        self.keys = keys
        self.data = data
        self.batch_size = batch_size
        self.preprocess_func = preprocess_func

    def __len__(self):
        # returns the number of batches
        return int(np.ceil(len(self.keys) / float(self.batch_size)))

    def __getitem__(self, idx):
        keys = self.keys[idx * self.batch_size : (idx + 1) * self.batch_size]
        return self.preprocess_func([self.data[key] for key in keys])


def train(model, data, preprocess):
    train_sequence = DataSequence(data, preprocess, list(data.keys()))

    history = model.fit(
        x=train_sequence,
        epochs=15,
        steps_per_epoch=len(train_sequence),
        verbose=2,
        workers=8,
        use_multiprocessing=True,
    )

    return model, history


data = {
    "key_1": {"name": "black", "y": 0},
    "key_2": {"name": "white", "y": 1},
    # up to 70M docs in this dictionary
}
model = get_keras_model()

model, history = train(model, data, preprocess)  # model training hangs

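Log output:

The log shows multiple "Caught signal 15. Terminating." messages. These also appear when the training script runs to completion without hitting any zombie processes. The "Exception in thread Thread-##" output behaves the same way: it also occurs when model training is unaffected by zombie processes and finishes normally.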

Jun 09 14:16:22 mlworker tm[575915]: 2022-06-09 14:16:22,024 - MainThread - INFO - Start working on fold 1/5
Jun 09 14:16:22 mlworker tm[575915]: 2022-06-09 14:16:22.725522: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instruc>
Jun 09 14:16:22 mlworker tm[575915]: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Jun 09 14:16:23 mlworker tm[575915]: 2022-06-09 14:16:23.439638: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 6882 MB memory:  -> device: 0, name: Tesla P4, p>
Jun 09 14:16:23 mlworker tm[575915]: 2022-06-09 14:16:23,709 - MainThread - INFO - Fitting model ...
Jun 09 14:16:24 mlworker tm[575915]: Epoch 1/15
Jun 09 14:16:31 mlworker tm[575915]: 3/3 - 7s - loss: 6.9878 - acc: 1.0908e-04 - 7s/epoch - 2s/step
Jun 09 14:16:31 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:31 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:31 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:31 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:31 mlworker tm[575915]: Epoch 2/15
Jun 09 14:16:34 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:34 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:34 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:34 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:34 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:34 mlworker tm[575915]: 3/3 - 3s - loss: 6.9392 - acc: 0.0055 - 3s/epoch - 1s/step
...
Jun 09 14:16:48 mlworker tm[575915]: Epoch 7/15
Jun 09 14:16:51 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:51 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:51 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:51 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:51 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:51 mlworker tm[575915]: Exception in thread Thread-87:
Jun 09 14:16:51 mlworker tm[575915]: Traceback (most recent call last):
Jun 09 14:16:51 mlworker tm[575915]:   File "/usr/lib/python3.9/threading.py", line 954, in _bootstrap_inner
Jun 09 14:16:51 mlworker tm[575915]:     self.run()
Jun 09 14:16:51 mlworker tm[575915]:   File "/usr/lib/python3.9/threading.py", line 892, in run
Jun 09 14:16:51 mlworker tm[575915]:     self._target(*self._args, **self._kwargs)
Jun 09 14:16:51 mlworker tm[575915]:   File "/opt/venvs/classifier-training-repo/lib/python3.9/site-packages/keras/utils/data_utils.py", line 759, in _run
Jun 09 14:16:51 mlworker tm[575915]:     with closing(self.executor_fn(_SHARED_SEQUENCES)) as executor:
Jun 09 14:16:51 mlworker tm[575915]:   File "/opt/venvs/classifier-training-repo/lib/python3.9/site-packages/keras/utils/data_utils.py", line 736, in pool_fn
Jun 09 14:16:51 mlworker tm[575915]:     pool = get_pool_class(True)(
Jun 09 14:16:51 mlworker tm[575915]:   File "/usr/lib/python3.9/multiprocessing/context.py", line 119, in Pool
Jun 09 14:16:51 mlworker tm[575915]:     return Pool(processes, initializer, initargs, maxtasksperchild,
Jun 09 14:16:51 mlworker tm[575915]:   File "/usr/lib/python3.9/multiprocessing/pool.py", line 212, in __init__
Jun 09 14:16:51 mlworker tm[575915]:     self._repopulate_pool()
Jun 09 14:16:51 mlworker tm[575915]:   File "/usr/lib/python3.9/multiprocessing/pool.py", line 303, in _repopulate_pool
Jun 09 14:16:51 mlworker tm[575915]:     return self._repopulate_pool_static(self._ctx, self.Process,
Jun 09 14:16:51 mlworker tm[575915]:   File "/usr/lib/python3.9/multiprocessing/pool.py", line 326, in _repopulate_pool_static
Jun 09 14:16:51 mlworker tm[575915]:     w.start()
Jun 09 14:16:51 mlworker tm[575915]:   File "/usr/lib/python3.9/multiprocessing/process.py", line 121, in start
Jun 09 14:16:51 mlworker tm[575915]:     self._popen = self._Popen(self)
Jun 09 14:16:51 mlworker tm[575915]:   File "/usr/lib/python3.9/multiprocessing/context.py", line 277, in _Popen
Jun 09 14:16:51 mlworker tm[575915]:     return Popen(process_obj)
Jun 09 14:16:51 mlworker tm[575915]:   File "/usr/lib/python3.9/multiprocessing/popen_fork.py", line 19, in __init__
Jun 09 14:16:51 mlworker tm[575915]:     self._launch(process_obj)
Jun 09 14:16:51 mlworker tm[575915]:   File "/usr/lib/python3.9/multiprocessing/popen_fork.py", line 73, in _launch
Jun 09 14:16:51 mlworker tm[575915]:     os._exit(code)
Jun 09 14:16:51 mlworker tm[575915]:   File "/usr/lib/python3/dist-packages/solute/click.py", line 727, in raiser
Jun 09 14:16:51 mlworker tm[575915]:     raise Termination(128 + signo)
Jun 09 14:16:51 mlworker tm[575915]: solute.click.Termination: 143
Jun 09 14:16:52 mlworker tm[575915]: 3/3 - 3s - loss: 5.7624 - acc: 0.0726 - 3s/epoch - 1s/step
Jun 09 14:16:52 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:52 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:52 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:52 mlworker tm[575915]: Epoch 8/15
Jun 09 14:16:55 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:55 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:55 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:55 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:55 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:55 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:55 mlworker tm[575915]: 3/3 - 3s - loss: 5.6978 - acc: 0.1000 - 3s/epoch - 1s/step
...
Jun 09 14:17:02 mlworker tm[575915]: Epoch 11/15
Jun 09 14:17:05 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:17:05 mlworker tm[575915]: 3/3 - 3s - loss: 5.5029 - acc: 0.0804 - 3s/epoch - 1s/step
Jun 09 14:17:06 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:17:06 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:17:06 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:17:06 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:17:06 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:17:06 mlworker tm[575915]: Epoch 12/15
Jun 09 14:17:09 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:17:09 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:17:09 mlworker tm[575915]: Caught signal 15. Terminating.

No further log output appears after the last message. The processes have to be terminated with sudo kill -SIGKILL, and model training has to be restarted.
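
The "Termination: 143" at the end of the traceback is consistent with the "raise Termination(128 + signo)" line in it: SIGTERM is signal 15, and 128 + 15 = 143, the conventional shell exit status for a process ended by SIGTERM. A small sketch of that arithmetic follows; the helper name is made up for illustration:

```python
import signal


def signal_from_exit_status(status):
    """Map a shell-style exit status above 128 back to the signal name."""
    if status <= 128:
        return None  # ordinary exit code, not a signal
    return signal.Signals(status - 128).name


if __name__ == "__main__":
    print(128 + signal.SIGTERM)
    print(signal_from_exit_status(143))
```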

System information:

I ran into the same problem on different machines with different GPUs and different Python versions.

  • OS platform and distribution: Debian GNU/Linux 11 (bullseye), Ubuntu 20.04.4 LTS
  • TensorFlow version: v2.9.0-18-gd8ce9f9c301 2.9.1 (Debian 11), v2.9.0-18-gd8ce9f9c301 2.9.1 (Ubuntu LTS)
  • Python version: Python 3.9.2 (Debian 11), Python 3.8.10 (Ubuntu LTS)
  • GPU model and memory: Tesla T4 (16 GB) on Debian 11, Tesla P4 (8 GB) on another Debian 11 machine, GeForce GTX 1080 Ti (12 GB) on Ubuntu LTS

【Discussion】:

    Tags: tensorflow keras python-multiprocessing python-multithreading


    【Solution 1】:

    We solved the problem by adding the following line at the start of the script:

    import signal

    signal.signal(signal.SIGTERM, signal.SIG_DFL)
    

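    Explanation: Our script had a custom SIGTERM handler that interfered with the SIGTERM sent to the threads. This one line restores Python's default handling of SIGTERM and avoids the unresponsive child processes.

    There is no bug in the TensorFlow or Keras code :)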
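
To illustrate why the reset helps: with the fork start method (the traceback shows popen_fork), a child process inherits whatever SIGTERM handler the parent had installed at fork time. The sketch below is a rough demonstration under that assumption. custom_sigterm_handler is a hypothetical stand-in for the script's own handler, not the actual solute.click code, and the other function names are illustrative as well:

```python
import multiprocessing as mp
import signal


def custom_sigterm_handler(signo, frame):
    # Hypothetical stand-in: raises instead of letting the process die.
    raise SystemExit(128 + signo)


def _report_handler(conn):
    """Runs in the child: send back the name of the inherited SIGTERM handler."""
    handler = signal.getsignal(signal.SIGTERM)
    if handler == signal.SIG_DFL:
        conn.send("default")
    else:
        conn.send(getattr(handler, "__name__", str(handler)))
    conn.close()


def handler_seen_by_child():
    """Fork one child and return the SIGTERM handler it reports."""
    ctx = mp.get_context("fork")  # assumption: fork, as in the traceback
    recv_conn, send_conn = ctx.Pipe(duplex=False)
    proc = ctx.Process(target=_report_handler, args=(send_conn,))
    proc.start()
    send_conn.close()  # the parent only reads
    result = recv_conn.recv()
    proc.join()
    return result


if __name__ == "__main__":
    signal.signal(signal.SIGTERM, custom_sigterm_handler)
    print(handler_seen_by_child())  # child inherits the custom handler

    signal.signal(signal.SIGTERM, signal.SIG_DFL)  # the fix from the answer
    print(handler_seen_by_child())  # child now has the default handler
```

Because the handler has to be back at its default before the workers are forked, the reset belongs before the call to model.fit().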
