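【Posted】: 2021-07-20 14:55:54
【Question】:
I am using the TensorFlow Object Detection API and have run into a problem with model training. In particular, CPU and RAM usage are very high, while the GPU is barely used (according to Windows Task Manager):
I installed the TF Object Detection API by following this guide and verified that the GPU is recognized: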
python -c "import tensorflow as tf;print(tf.reduce_sum(tf.random.normal([1000, 1000])))"
2021-07-20 15:36:37.630320: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cudart64_110.dll
2021-07-20 15:36:49.683811: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library nvcuda.dll
2021-07-20 15:36:49.990907: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: NVIDIA GeForce RTX 3080 Laptop GPU computeCapability: 8.6
coreClock: 1.605GHz coreCount: 48 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 417.29GiB/s
2021-07-20 15:36:50.017685: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cudart64_110.dll
2021-07-20 15:36:50.142257: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cublas64_11.dll
2021-07-20 15:36:50.158525: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cublasLt64_11.dll
2021-07-20 15:36:50.173970: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cufft64_10.dll
2021-07-20 15:36:50.183516: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library curand64_10.dll
2021-07-20 15:36:50.196516: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cusolver64_11.dll
2021-07-20 15:36:50.213625: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cusparse64_11.dll
2021-07-20 15:36:50.231417: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library cudnn64_8.dll
2021-07-20 15:36:50.234253: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2021-07-20 15:36:50.238133: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-07-20 15:36:50.245602: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: NVIDIA GeForce RTX 3080 Laptop GPU computeCapability: 8.6
coreClock: 1.605GHz coreCount: 48 deviceMemorySize: 8.00GiB deviceMemoryBandwidth: 417.29GiB/s
2021-07-20 15:36:50.265550: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2021-07-20 15:36:54.162700: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-07-20 15:36:54.168506: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264] 0
2021-07-20 15:36:54.169910: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0: N
2021-07-20 15:36:54.176538: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 5177 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 3080 Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.6)
tf.Tensor(1222.1837, shape=(), dtype=float32)
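The check above shows that TensorFlow can see the GPU, but not how busy the GPU is while a training job runs. As a cross-check on Task Manager's figures, the following is a minimal sketch that polls `nvidia-smi` for utilization and memory. It assumes `nvidia-smi` is on the PATH; the helper names `parse_gpu_stats` and `read_gpu_stats` are hypothetical and not part of any library.

```python
import subprocess

# Standard nvidia-smi query flags; assumes the tool is installed and on PATH.
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text):
    """Turn nvidia-smi CSV rows (one per GPU) into a list of dicts.

    Assumes every field is numeric; some drivers report placeholders
    such as "[N/A]" instead, which this sketch does not handle.
    """
    stats = []
    for line in csv_text.strip().splitlines():
        util, used, total = (int(field) for field in line.split(","))
        stats.append(
            {"util_percent": util, "mem_used_mib": used, "mem_total_mib": total}
        )
    return stats


def read_gpu_stats():
    """Run nvidia-smi once and return the parsed statistics."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_stats(result.stdout)
```

Calling `read_gpu_stats()` every second or so from a second terminal while the training script runs would give a reading that does not depend on which graph Task Manager happens to display.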
Edit: I only run into this problem with centernet_hg104_512x512_coco17_tpu-8 (I am using the pipeline.config file shown below), whereas other models (ssd_resnet or efficientdet) do actually use the GPU.
model {
  center_net {
    num_classes: 1
    feature_extractor {
      type: "hourglass_104"
      channel_means: 104.01361846923828
      channel_means: 114.03422546386719
      channel_means: 119.91659545898438
      channel_stds: 73.60276794433594
      channel_stds: 69.89082336425781
      channel_stds: 70.91507720947266
      bgr_ordering: true
    }
    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 512
        max_dimension: 512
        pad_to_max_dimension: true
      }
    }
    object_detection_task {
      task_loss_weight: 1.0
      offset_loss_weight: 1.0
      scale_loss_weight: 0.10000000149011612
      localization_loss {
        l1_localization_loss {
        }
      }
    }
    object_center_params {
      object_center_loss_weight: 1.0
      classification_loss {
        penalty_reduced_logistic_focal_loss {
          alpha: 2.0
          beta: 4.0
        }
      }
      min_box_overlap_iou: 0.6
      max_box_predictions: 50
    }
  }
}
train_config {
  batch_size: 2
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    random_crop_image {
      min_aspect_ratio: 0.5
      max_aspect_ratio: 1.7000000476837158
      random_coef: 0.25
    }
  }
  data_augmentation_options {
    random_adjust_hue {
    }
  }
  data_augmentation_options {
    random_adjust_contrast {
    }
  }
  data_augmentation_options {
    random_adjust_saturation {
    }
  }
  data_augmentation_options {
    random_adjust_brightness {
    }
  }
  data_augmentation_options {
    random_absolute_pad_image {
      max_height_padding: 200
      max_width_padding: 200
      pad_color: 0.0
      pad_color: 0.0
      pad_color: 0.0
    }
  }
  optimizer {
    adam_optimizer {
      learning_rate {
        manual_step_learning_rate {
          initial_learning_rate: 0.0010000000474974513
          schedule {
            step: 1000
            learning_rate: 9.999999747378752e-05
          }
          schedule {
            step: 5000
            learning_rate: 9.999999747378752e-06
          }
        }
      }
      epsilon: 1.0000000116860974e-07
    }
    use_moving_average: false
  }
  fine_tune_checkpoint: "pre-trained-models/centernet_hg104_512x512_coco17_tpu-8/checkpoint/ckpt-0"
  num_steps: 5000
  max_number_of_boxes: 50
  unpad_groundtruth_tensors: false
  fine_tune_checkpoint_type: "detection"
  fine_tune_checkpoint_version: V2
}
train_input_reader {
  label_map_path: "annotations/label_map.pbtxt"
  tf_record_input_reader {
    input_path: "annotations/train.record"
  }
}
eval_config {
  metrics_set: "coco_detection_metrics"
  use_moving_averages: false
  batch_size: 1
}
eval_input_reader {
  label_map_path: "annotations/label_map.pbtxt"
  shuffle: false
  num_epochs: 1
  tf_record_input_reader {
    input_path: "annotations/test.record"
  }
}
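For readers unfamiliar with `manual_step_learning_rate`, here is a small sketch of the schedule this config describes. It is plain Python, not the API's own implementation, and it assumes that each scheduled rate takes effect from its boundary step onward. The values are rounded from the float32 numbers in the config, and `learning_rate_at` is a hypothetical helper name.

```python
from bisect import bisect_right

# Values copied from the pipeline.config above, rounded from float32.
INITIAL_LR = 1e-3
SCHEDULE = [(1000, 1e-4), (5000, 1e-5)]  # (step, learning_rate) pairs
NUM_STEPS = 5000


def learning_rate_at(step, initial_lr=INITIAL_LR, schedule=SCHEDULE):
    """Return the learning rate in effect at `step`.

    Assumption: a scheduled rate applies once step >= its boundary.
    """
    boundaries = [boundary for boundary, _ in schedule]
    rates = [initial_lr] + [rate for _, rate in schedule]
    # bisect_right counts how many boundaries are <= step.
    return rates[bisect_right(boundaries, step)]


if __name__ == "__main__":
    for step in (0, 999, 1000, NUM_STEPS - 1):
        print(step, learning_rate_at(step))
```

Under that assumption, the second schedule entry (step 5000) coincides with `num_steps: 5000`, so the smallest rate would be reached only at the very end of training, if at all.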
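A batch size of 2 is required, because with any larger value the training script stalls for a bit and then exits (this also surprised me, but I have only just started playing with these things, so maybe this is actually normal).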
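One possible but unconfirmed explanation for the batch size limit is GPU memory: the verification log above reports that TensorFlow created the device with 5177 MB, out of the 8.00 GiB listed for the card. The sketch below only extracts that figure from a log line so it can be compared between runs; it does not prove that memory is the cause. `reserved_memory_mb` is a hypothetical helper name.

```python
import re

# Log line copied from the GPU verification output above (timestamp dropped).
LOG_LINE = (
    "Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 "
    "with 5177 MB memory) -> physical GPU (device: 0, "
    "name: NVIDIA GeForce RTX 3080 Laptop GPU, pci bus id: 0000:01:00.0, "
    "compute capability: 8.6)"
)


def reserved_memory_mb(line):
    """Return the MB figure from a 'Created TensorFlow device' line, or None."""
    match = re.search(r"with (\d+) MB memory", line)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    print(reserved_memory_mb(LOG_LINE))
```

Running the same extraction on the training script's own log would show whether the trainer gets a similar amount of memory before it exits.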
I am using:
Windows 10
CPU: i9-10980HK
RAM: 32 GB
GPU: RTX 3080, 8 GB dedicated memory
TensorFlow = 2.5
CUDA = 11.3.1
cuDNN = 8.2.1.32
Is this low GPU/high CPU usage expected? Am I missing something here? Thanks for your help, and please let me know if I can provide any other useful information.
【Discussion】:
Tags: python tensorflow object-detection-api