Can TFLite's CoreMLDelegate use GPU and CPU simultaneously in iOS?

【Question title】: Can TFLite's CoreMLDelegate use GPU and CPU simultaneously in iOS?
【Posted】: 2020-12-09 16:04:51
【Question】:

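I have been using TFLite's MetalDelegate successfully in my app. When I switch to CoreMLDelegate, it runs my (float) TFLite model (MobileNet) entirely on the CPU and shows 0 GPU usage. I am running it on an iPhone 11 Pro Max, which is a compatible device. During initialization I noticed the following line: "CoreML delegate: 29 nodes delegated out of 31 nodes, with 2 partitions." Any idea why? How can I make CoreMLDelegate use both the GPU and the CPU on iOS? I downloaded the mobilenet_v1_1.0_224.tflite model file from here.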

import AVFoundation
import UIKit
import SpriteKit
import Metal

var device: MTLDevice!
var commandQueue: MTLCommandQueue!

private var total_latency:Double = 0
private var total_count:Double = 0
private var sstart = TimeInterval(NSDate().timeIntervalSince1970)

class ViewController: UIViewController {
...
}

// MARK: CameraFeedManagerDelegate Methods
extension ViewController: CameraFeedManagerDelegate {

  func didOutput(pixelBuffer: CVPixelBuffer) {
    let currentTimeMs = Date().timeIntervalSince1970 * 1
    guard (currentTimeMs - previousInferenceTimeMs) >= delayBetweenInferencesMs else { return }
    previousInferenceTimeMs = currentTimeMs



    //  1. First create the Metal device and command queue in viewDidLoad():
    device = MTLCreateSystemDefaultDevice()
    commandQueue = device.makeCommandQueue()

    var timestamp = NSDate().timeIntervalSince1970
    let start = TimeInterval(timestamp)

    // 2. Access the shared MTLCaptureManager and start capturing
    let capManager = MTLCaptureManager.shared()
    let myCaptureScope = capManager.makeCaptureScope(device: device)
    myCaptureScope.begin()
    let commandBuffer = commandQueue.makeCommandBuffer()!
    // Do Metal work


    // Pass the pixel buffer to TensorFlow Lite to perform inference.
    result = modelDataHandler?.runModel(onFrame: pixelBuffer)


    // 3.
    // encode your kernel
    commandBuffer.commit()
    myCaptureScope.end()

    timestamp = NSDate().timeIntervalSince1970
    let end = TimeInterval(timestamp)
    //var end = NSDate(timeIntervalSince1970: TimeInterval(myTimeInterval))

    total_latency += (end - start)
    total_count += 1;
    let rfps = total_count/(end - sstart)
    let fps = total_count/(end - start)
    let stri = "Time: " + String(end - start) + " avg: " + String(total_latency/total_count)+" count: " + String(total_count)+" rfps: "+String(rfps)+" fps: "+String(fps)
    print(stri)


    // Display results by handing off to the InferenceViewController.
    DispatchQueue.main.async {
      guard let finalInferences = self.result?.inferences else {
        self.resultLabel.text = ""
        return
      }
     let resultStrings = finalInferences.map({ (inference) in
        return String(format: "%@ %.2f",inference.label, inference.confidence)
      })
      self.resultLabel.text = resultStrings.joined(separator: "\n")
    }

  }
}

2020-08-22 07:09:39.783215-0400 ImageClassification[3039:645963] coreml_version must be 2 or 3. Setting to 3.
2020-08-22 07:09:39.785103-0400 ImageClassification[3039:645963] Created TensorFlow Lite delegate for Metal.
2020-08-22 07:09:39.785505-0400 ImageClassification[3039:645963] Metal GPU Frame Capture Enabled
2020-08-22 07:09:39.786110-0400 ImageClassification[3039:645963] Metal API Validation Enabled
2020-08-22 07:09:39.927854-0400 ImageClassification[3039:645963] Initialized TensorFlow Lite runtime.
2020-08-22 07:09:39.928928-0400 ImageClassification[3039:645963] CoreML delegate: 29 nodes delegated out of 31 nodes, with 2 partitions.

【Comments】:

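If you are using an iPhone 11 Pro Max, it has a Neural Engine, and the CoreML delegate uses that by default. In that case it is expected that your GPU is not used (the Neural Engine is a separate accelerator, distinct from the GPU). Have you tried comparing latency benchmarks with and without the CoreML delegate?
Thanks for the clarification, @yyoon. Yes, I tried both. MetalDelegate gives me 120 FPS with CPU 1.5 ms and GPU 0.4 ms per frame; energy impact: very high. With CoreMLDelegate I get 26 FPS, CPU 38.6 ms, GPU 0 ms; energy impact: high. Without any acceleration, the model runs at 20 FPS with a frame time of 51.2 ms, and the energy impact is very high. So it is hard to decide. First, I want to find out the maximum achievable FPS in my case; second, I want to find the "most efficient" way to run it in terms of energy use and FPS.
This is not what is expected. For MobileNetV1, the CoreML delegate should actually give FPS similar to the GPU, and should be much faster than plain TFLite. Could you try measuring the average latency with the official benchmark app for iOS? tensorflow.org/lite/performance/measurement#ios_benchamark_app
@yyoon, in the benchmark app: startup latency: 397.347 ms. Inference: number of runs: 312, avg: 3.16502 ms, min: 2.857 ms, max: 3.575 ms, std: 0.116 ms. Warmup: number of runs: 149, avg: 3.31269 ms, min: 2.78 ms, max: 7.151 ms, std: 0.616 ms.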

Tags: ios swift gpu coreml tensorflow-lite

【Solution 1】:

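Thanks for trying out the Core ML delegate. Could you share the TFLite version you used and the code you used to initialize the Core ML delegate? Also, can you confirm that you are trying to run a float model, not a quantized one?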

Latency can vary depending on what you measure, but when measuring inference time only, my iPhone 11 Pro shows 11 ms on the CPU and 5.5 ms with the Core ML delegate.

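The profiler does not capture Neural Engine utilization, but if you see high latency together with high CPU utilization, it may indicate that your model is running only on the CPU. You can also try the Time Profiler to see which part consumes the most resources.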
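Because the numbers depend on what is measured, it can help to collect per-inference timings in one place and summarize them the same way the benchmark app does (run count, average, minimum, maximum, standard deviation). The following is a minimal sketch, not code from this thread: `LatencyStats` and all of its members are hypothetical names chosen for illustration.

```swift
import Foundation

/// Hypothetical helper (not part of TensorFlow Lite): collects per-inference
/// latencies in milliseconds and reports summary statistics.
struct LatencyStats {
  private(set) var samples: [Double] = []

  /// Records one latency sample, in milliseconds.
  mutating func add(_ milliseconds: Double) {
    samples.append(milliseconds)
  }

  var count: Int { samples.count }

  /// Arithmetic mean of all samples, or 0 when there are none.
  var mean: Double {
    samples.isEmpty ? 0.0 : samples.reduce(0.0, +) / Double(samples.count)
  }

  var minimum: Double { samples.min() ?? 0.0 }
  var maximum: Double { samples.max() ?? 0.0 }

  /// Population standard deviation of the samples.
  var std: Double {
    guard !samples.isEmpty else { return 0.0 }
    let m = mean
    let sumOfSquares = samples.reduce(0.0) { $0 + ($1 - m) * ($1 - m) }
    return (sumOfSquares / Double(samples.count)).squareRoot()
  }

  /// Throughput implied by the mean latency (inferences per second).
  var framesPerSecond: Double { mean > 0 ? 1000.0 / mean : 0.0 }
}
```

With one sample added per inference, a mean of 5.5 ms corresponds to roughly 182 inferences per second (1000 / 5.5). Deriving FPS from the mean latency avoids mixing a cumulative frame count with the duration of a single frame.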

【Discussion】:

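Hi @Taehee Jeong, I used pod 'TensorFlowLiteSwift', '~> 0.0.1-nightly', :subspecs => ['Metal', 'CoreML']. The initialization code in Swift is: var options = CoreMLDelegate.Options(), options.enabledDevices = .all, let coreMLDelegate = CoreMLDelegate(options: options)!, interpreter = try Interpreter(modelPath: modelPath, delegates: [coreMLDelegate])
Thanks for confirming. The benchmark app measures only the Swift equivalent of Interpreter.invoke(). Could you check whether your app does the same, or whether your time measurement includes other code?
Hi @Taehee Jeong, during initialization I noticed this line: "CoreML delegate: 29 nodes delegated out of 31 nodes, with 2 partitions." As for your earlier question, the mobilenet_v1_1.0_224.tflite file comes from [here](tensorflow.org/lite/guide/hosted_models). I measure the duration of the result = modelDataHandler?.runModel(onFrame: pixelBuffer) call in the didOutput(pixelBuffer: CVPixelBuffer) function (see the code sample in the updated question). I also used Instruments and saw no Neural Engine calls during inference with the Core ML delegate.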
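To get a number comparable to the benchmark app, the timer would have to wrap only the invoke() call, so that pixel-buffer preprocessing and output parsing stay outside the measurement. Below is a hedged sketch of that idea, assuming the TensorFlowLiteSwift pod with the CoreML subspec is installed. `measureMilliseconds`, `makeInterpreter`, and `timedInvoke` are hypothetical helper names, and falling back to the CPU with an empty delegate list is an assumption for illustration, not something proposed in this thread.

```swift
import Foundation
#if canImport(TensorFlowLite)
import TensorFlowLite
#endif

/// Hypothetical helper: runs `body` once and returns its wall-clock
/// duration in milliseconds, using a monotonic clock.
func measureMilliseconds(_ body: () throws -> Void) rethrows -> Double {
  let start = DispatchTime.now().uptimeNanoseconds
  try body()
  let end = DispatchTime.now().uptimeNanoseconds
  return Double(end - start) / 1_000_000
}

#if canImport(TensorFlowLite)
/// Hypothetical helper: builds an interpreter that uses the Core ML delegate
/// when it can be created, and plain CPU execution otherwise (assumption).
func makeInterpreter(modelPath: String) throws -> Interpreter {
  var options = CoreMLDelegate.Options()
  options.enabledDevices = .all

  // CoreMLDelegate's initializer is failable, so avoid force-unwrapping it.
  var delegates: [Delegate] = []
  if let coreMLDelegate = CoreMLDelegate(options: options) {
    delegates.append(coreMLDelegate)
  }

  let interpreter = try Interpreter(modelPath: modelPath, delegates: delegates)
  try interpreter.allocateTensors()
  return interpreter
}

/// Hypothetical helper: copies the already-preprocessed input, then times
/// only invoke(). `input` must match the byte size of input tensor 0.
func timedInvoke(_ interpreter: Interpreter, input: Data) throws -> Double {
  try interpreter.copy(input, toInputAt: 0)  // excluded from the timing
  return try measureMilliseconds { try interpreter.invoke() }
}
#endif
```

Timing runModel(onFrame:) as a whole, as in the question, also counts any image conversion and result handling done inside it, so its result is not directly comparable to the benchmark app's inference latency.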