【Question title】: Google Colaboratory session abruptly ends when filling up shuffle buffer
【Posted】: 2021-02-01 17:58:02
【Question description】:

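I am using Google Colaboratory to train an image recognition algorithm with TensorFlow 1.15. I have uploaded all the required files to Google Drive, and the code runs up to the point where the shuffle buffer is being filled. Then a "^C" appears in the output, and I cannot figure out what is happening.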

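Note: I previously tried to train the algorithm on my PC and did not delete the checkpoint files generated by that earlier training run. Could that be the problem?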

Code:

!pip install --upgrade pip
!pip install --upgrade protobuf

!pip install tensorflow-gpu==1.15
import tensorflow as tf
print(tf.__version__)

device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at {}'.format(device_name))

!ln -sf /opt/bin/nvidia-smi /usr/bin/nvidia-smi
!pip install gputil
!pip install psutil
!pip install humanize
import psutil
import humanize
import os
import GPUtil as GPU
GPUs = GPU.getGPUs()
gpu = GPUs[0]
def printm():
  process = psutil.Process(os.getpid())
  print("Gen RAM Free: " + humanize.naturalsize(psutil.virtual_memory().available ), " | Proc size: " + humanize.naturalsize( process.memory_info().rss))
  print("GPU RAM Free: {0:.0f}MB | Used: {1:.0f}MB | Util {2:3.0f}% | Total {3:.0f}MB".format(gpu.memoryFree, gpu.memoryUsed, gpu.memoryUtil*100, gpu.memoryTotal))
printm()

from google.colab import drive
#Mount the drive
drive.mount('/content/gdrive')

#Change to working tensorflow directory on the drive
%cd '/content/gdrive/My Drive/weeds/tensorflow_models/models/research/object_detection/'

!apt-get install protobuf-compiler python-pil python-lxml python-tk
!pip install Cython
%cd /content/gdrive/My Drive/weeds/tensorflow_models/models/research/
!protoc object_detection/protos/*.proto --python_out=.
import os
os.environ['PYTHONPATH'] += ':/content/gdrive/My Drive/weeds/tensorflow_models/models/research/:/content/gdrive/My Drive/weeds/tensorflow_models/models/research/slim'
!python setup.py build
!python setup.py install

import time, psutil
Start = time.time() - psutil.boot_time()
Left = 12*3600 - Start
print('Time remaining for this session is: ', Left/3600)

!pip install tf_slim
%cd /content/gdrive/My Drive/weeds/tensorflow_models/models/research/object_detection/
os.environ['PYTHONPATH'] += ':/content/gdrive/My Drive/weeds/tensorflow_models/models/research/:/content/gdrive/My Drive/weeds/tensorflow_models/models/research/slim'

!python train.py --train_dir=training/ --pipeline_config_path=training/ssd_mobilenet_v1_coco.config --logtostderr

The output ends here, but at this point the model should start training and reporting "global step" values.

2020-10-18 22:42:45.587477: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 168 of 2048
2020-10-18 22:42:55.668973: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 334 of 2048
2020-10-18 22:43:06.067869: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 379 of 2048
2020-10-18 22:43:15.705090: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 503 of 2048
2020-10-18 22:43:26.781151: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 576 of 2048
2020-10-18 22:43:38.120069: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 640 of 2048
2020-10-18 22:43:45.813089: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 708 of 2048
2020-10-18 22:43:58.071040: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 752 of 2048
2020-10-18 22:44:07.506961: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 828 of 2048
2020-10-18 22:44:16.355753: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 908 of 2048
2020-10-18 22:44:25.922348: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 960 of 2048
INFO:tensorflow:global_step/sec: 0
I1018 22:44:34.783342 140291121678080 supervisor.py:1099] global_step/sec: 0
2020-10-18 22:44:36.327813: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1036 of 2048
2020-10-18 22:44:45.651473: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1151 of 2048
2020-10-18 22:44:55.554234: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1186 of 2048
2020-10-18 22:45:05.648568: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1242 of 2048
2020-10-18 22:45:15.644396: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1313 of 2048
2020-10-18 22:45:25.551708: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1386 of 2048
2020-10-18 22:45:35.549003: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1458 of 2048
2020-10-18 22:45:45.648835: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1531 of 2048
2020-10-18 22:45:55.643920: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1602 of 2048
2020-10-18 22:46:05.559702: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1674 of 2048
2020-10-18 22:46:15.547609: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1746 of 2048
2020-10-18 22:46:25.645939: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1819 of 2048
INFO:tensorflow:global_step/sec: 0
I1018 22:46:35.052108 140291121678080 supervisor.py:1099] global_step/sec: 0
2020-10-18 22:46:35.645583: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1891 of 2048
2020-10-18 22:46:45.553851: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:145] Filling up shuffle buffer (this may take a while): 1962 of 2048
^C

What can I do to fix this? The training process works fine on my PC (NVIDIA GeForce RTX), but I need the extra computing power that Google Colab provides.

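【Comments】:

  • Have you tried reducing the shuffle buffer size in train.py? It is hard to help without seeing any code.
  • I cannot run your code because I do not have the directory '/content/gdrive/My Drive/weeds/tensorflow_models/models/research/object_detection/'.
  • The message "Filling up shuffle buffer (this may take a while)" is not an error; it is just a log message.
  • Perhaps you could train on your dataset with a smaller buffer.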
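
The suggestion to use a smaller shuffle buffer can be sketched as a small edit to the pipeline config rather than to train.py. This is only a sketch under assumptions: it assumes the installed Object Detection API version reads a `shuffle_buffer_size` field from the `train_input_reader` block (the 2048 in the log matches that field's usual default), and the helper name `set_shuffle_buffer_size` is hypothetical, not part of any library.

```python
import re


def set_shuffle_buffer_size(config_text, size):
    """Return config_text with shuffle_buffer_size set in train_input_reader.

    Hypothetical helper: plain text editing, no TensorFlow import needed.
    Assumes braces in the config are balanced and do not occur inside strings.
    """
    match = re.search(r"train_input_reader\s*:?\s*\{", config_text)
    if match is None:
        raise ValueError("no train_input_reader block found")

    # Walk forward to the brace that closes the train_input_reader block.
    start = match.end()
    depth = 1
    pos = start
    while pos < len(config_text) and depth > 0:
        if config_text[pos] == "{":
            depth += 1
        elif config_text[pos] == "}":
            depth -= 1
        pos += 1
    if depth != 0:
        raise ValueError("unbalanced braces in config")

    block = config_text[start:pos - 1]
    setting = "shuffle_buffer_size: {}".format(size)

    # The lookbehind avoids touching filenames_shuffle_buffer_size.
    new_block, count = re.subn(
        r"(?<![A-Za-z_])shuffle_buffer_size\s*:\s*\d+", setting, block)
    if count == 0:
        # Field not present yet: add it right after the opening brace.
        new_block = "\n  " + setting + block

    return config_text[:start] + new_block + config_text[pos - 1:]


if __name__ == "__main__":
    sample = (
        'train_input_reader: {\n'
        '  tf_record_input_reader {\n'
        '    input_path: "train.record"\n'
        '  }\n'
        '  label_map_path: "labelmap.pbtxt"\n'
        '}\n'
    )
    print(set_shuffle_buffer_size(sample, 256))
```

In the notebook from the question, the text would be read from and written back to training/ssd_mobilenet_v1_coco.config before launching train.py. Whether a smaller buffer actually stops the "^C" depends on the cause of the termination, which the thread does not establish.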

Tags: python tensorflow google-colaboratory


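【Answer 1】:

I cannot run your code because it depends on files I do not have. What I can tell you is that this may be because you are using TF 1 with a GPU, and on Colab it is not easy to downgrade the GPU stack.

For example, I do not see anywhere in your code that you have downgraded CUDA (to the version you need), like this: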

!wget https://developer.nvidia.com/compute/cuda/9.0/Prod/local_installers/cuda-repo-ubuntu1604-9-0-local_9.0.176-1_amd64-deb
!dpkg -i cuda-repo-ubuntu1604-9-0-local_9.0.176-1_amd64-deb
!apt-key add /var/cuda-repo-9-0-local/7fa2af80.pub
!apt-get update
!apt-get install cuda=9.0.176-1

You can check the CUDA version with !nvcc --version.
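
As a rough illustration of that check, the version can also be read from Python so a notebook cell can stop early when the toolkit does not match. This is a sketch under assumptions: the helper names are hypothetical, the parsing relies on the usual "release X.Y" wording in the nvcc banner, and the expected value "10.0" reflects the CUDA version listed for TensorFlow 1.15 in TensorFlow's tested build configurations, so adjust it if your build differs.

```python
import re
import subprocess


def parse_cuda_release(nvcc_output):
    """Extract the 'X.Y' release string from nvcc --version output, or None."""
    match = re.search(r"release\s+(\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def installed_cuda_release():
    """Run nvcc --version and return the release string, or None if unavailable."""
    try:
        result = subprocess.run(
            ["nvcc", "--version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
    except FileNotFoundError:
        # nvcc is not on PATH in this environment.
        return None
    return parse_cuda_release(result.stdout)


if __name__ == "__main__":
    expected = "10.0"  # assumption: CUDA version TensorFlow 1.15 was built against
    found = installed_cuda_release()
    if found != expected:
        print("CUDA release is {}, expected {}".format(found, expected))
    else:
        print("CUDA release matches: {}".format(found))
```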

Colab is also not quick about downgrading the TensorFlow version. You may need to restart the runtime several times.

I suggest you migrate your code to TensorFlow 2.
