GluonCV - Object detection, set mx.ctx to GPU, but still using all CPU cores

【Question title】: GluonCV - Object detection, set mx.ctx to GPU, but still using all CPU cores
【Posted】: 2020-04-09 16:14:46
【Question description】:

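I am running an object detection routine on a server.
I set the context to GPU and load the model, the parameters, and the data on the GPU. The program reads from a video file or an RTSP stream using OpenCV.

With nvidia-smi I see the selected GPU at 20% utilization, which is reasonable. However, the object detection routine still uses 750-1200% CPU (basically all available cores on the server).

Here is the code: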

import cv2
import gluoncv as gcv
import mxnet as mx


def main():

    ctx = mx.gpu(3)

    # -------------------------
    # Load a pretrained model
    # -------------------------
    net = gcv.model_zoo.get_model('ssd_512_mobilenet1.0_coco', pretrained=True)
    # Move the parameters to the GPU; by default they are loaded on the CPU
    net.collect_params().reset_ctx(ctx)

    # Open the video source (a file here; an RTSP URL can be passed instead)
    cap = cv2.VideoCapture("video/video_01.mp4")

    count_frame = 0
    while True:
        print(f"Frame: {count_frame}")

        # Read the next frame
        ret, frame = cap.read()

        # Stop at the end of the stream or when 'q' is pressed
        if (cv2.waitKey(25) & 0xFF == ord('q')) or not ret:
            print("Done!!!")
            break

        # Image pre-processing (runs on the CPU)
        frame = mx.nd.array(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)).astype('uint8')
        frame_nd, frame_np = gcv.data.transforms.presets.ssd.transform_test(frame, short=512, max_size=700)
        if isinstance(frame_nd, mx.ndarray.ndarray.NDArray):
            frame_nd.wait_to_read()

        # Copy the frame to the GPU and run it through the network
        frame_nd = frame_nd.as_in_context(ctx)
        class_IDs, scores, bounding_boxes = net(frame_nd)
        # Block until the asynchronous results are actually computed
        if isinstance(class_IDs, mx.ndarray.ndarray.NDArray):
            class_IDs.wait_to_read()
        if isinstance(scores, mx.ndarray.ndarray.NDArray):
            scores.wait_to_read()
        if isinstance(bounding_boxes, mx.ndarray.ndarray.NDArray):
            bounding_boxes.wait_to_read()

        count_frame += 1

    # Release resources once the loop has ended
    cv2.destroyAllWindows()
    cap.release()


if __name__ == "__main__":
    main()

Here is the output of nvidia-smi: [image not available]

And here is the output of top: [image not available]

The pre-processing operations run on the CPU:

frame = mx.nd.array(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)).astype('uint8')
frame_nd, frame_np = gcv.data.transforms.presets.ssd.transform_test(frame, short=512, max_size=700)

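But is that enough to explain such high CPU usage? If so, can I run them on the GPU as well?

EDIT: I modified the post and copied in the whole code, in response to Olivier_Cruchant's comment (thanks!)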

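【Comments】:

This is possibly due to work happening on the CPU, such as decompression and pre-processing. How do you obtain the "frame" object?
Thanks for your reply. I edited the original post to describe the situation better.

Tags: mxnet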

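【Solution 1】:

Your CPU is probably busy because of the pre-processing load and the frequent back-and-forth between host memory and the GPU, since inference appears to run frame by frame. I suggest trying the following:

Run batched inference (send a batch of N frames to the network at once) to increase GPU utilization and reduce communication
Try NVIDIA DALI to make better use of the GPU for data ingestion and pre-processing (DALI MXNet reference, DALI mp4 ingestion pytorch example)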
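
The batching suggestion can be sketched roughly as follows. This is only an illustration of the grouping logic, not code from the answer: the helper names batched_frames and run_batched are hypothetical, and infer_batch stands in for whatever function stacks the pre-processed frames and calls the network once per batch.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TypeVar

Frame = TypeVar("Frame")
Result = TypeVar("Result")


def batched_frames(frames: Iterable[Frame], batch_size: int) -> Iterator[List[Frame]]:
    """Group an iterable of frames into lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    it = iter(frames)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return  # source exhausted
        yield batch


def run_batched(
    frames: Iterable[Frame],
    infer_batch: Callable[[List[Frame]], List[Result]],
    batch_size: int,
) -> List[Result]:
    """Call infer_batch once per batch instead of once per frame.

    infer_batch is a hypothetical callable: it receives a list of frames
    and must return one result per frame, in the same order.
    """
    results: List[Result] = []
    for batch in batched_frames(frames, batch_size):
        results.extend(infer_batch(batch))
    return results
```

With MXNet, infer_batch could, for example, concatenate the tensors returned by transform_test along the batch axis with mx.nd.concat(*tensors, dim=0), copy the result to the GPU once with as_in_context(ctx), and call net on it. This assumes every frame in a batch has the same shape after resizing, which should hold for frames from a single video. For a live RTSP stream, a larger batch size also adds latency, so N is a trade-off.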

【Discussion】: