【Question title】: Prometheus query for GPU utilization percentage
【Posted】: 2021-11-01 23:12:41
【Question】:

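Can I work out the GPU utilization percentage from the Prometheus metrics below? I don't know how to query for it. I also don't have a dcgm-exporter image for a ppc64le environment. Could you also share a link that explains how to build a dcgm-exporter Docker image for ppc64le?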

# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 0
go_gc_duration_seconds{quantile="0.25"} 0
go_gc_duration_seconds{quantile="0.5"} 0
go_gc_duration_seconds{quantile="0.75"} 0
go_gc_duration_seconds{quantile="1"} 0
go_gc_duration_seconds_sum 0
go_gc_duration_seconds_count 0
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 8
# HELP go_info Information about the Go environment.
# TYPE go_info gauge
go_info{version="go1.17"} 1
# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
# TYPE go_memstats_alloc_bytes gauge
go_memstats_alloc_bytes 2.499048e+06
# HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed.
# TYPE go_memstats_alloc_bytes_total counter
go_memstats_alloc_bytes_total 2.499048e+06
# HELP go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table.
# TYPE go_memstats_buck_hash_sys_bytes gauge
go_memstats_buck_hash_sys_bytes 4593
# HELP go_memstats_frees_total Total number of frees.
# TYPE go_memstats_frees_total counter
go_memstats_frees_total 761
# HELP go_memstats_gc_cpu_fraction The fraction of this program's available CPU time used by the GC since the program started.
# TYPE go_memstats_gc_cpu_fraction gauge
go_memstats_gc_cpu_fraction 0
# HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata.
# TYPE go_memstats_gc_sys_bytes gauge
go_memstats_gc_sys_bytes 4.368032e+06
# HELP go_memstats_heap_alloc_bytes Number of heap bytes allocated and still in use.
# TYPE go_memstats_heap_alloc_bytes gauge
go_memstats_heap_alloc_bytes 2.499048e+06
# HELP go_memstats_heap_idle_bytes Number of heap bytes waiting to be used.
# TYPE go_memstats_heap_idle_bytes gauge
go_memstats_heap_idle_bytes 4.13696e+06
# HELP go_memstats_heap_inuse_bytes Number of heap bytes that are in use.
# TYPE go_memstats_heap_inuse_bytes gauge
go_memstats_heap_inuse_bytes 3.760128e+06
# HELP go_memstats_heap_objects Number of allocated objects.
# TYPE go_memstats_heap_objects gauge
go_memstats_heap_objects 5731
# HELP go_memstats_heap_released_bytes Number of heap bytes released to OS.
# TYPE go_memstats_heap_released_bytes gauge
go_memstats_heap_released_bytes 4.13696e+06
# HELP go_memstats_heap_sys_bytes Number of heap bytes obtained from system.
# TYPE go_memstats_heap_sys_bytes gauge
go_memstats_heap_sys_bytes 7.897088e+06
# HELP go_memstats_last_gc_time_seconds Number of seconds since 1970 of last garbage collection.
# TYPE go_memstats_last_gc_time_seconds gauge
go_memstats_last_gc_time_seconds 0
# HELP go_memstats_lookups_total Total number of pointer lookups.
# TYPE go_memstats_lookups_total counter
go_memstats_lookups_total 0
# HELP go_memstats_mallocs_total Total number of mallocs.
# TYPE go_memstats_mallocs_total counter
go_memstats_mallocs_total 6492
# HELP go_memstats_mcache_inuse_bytes Number of bytes in use by mcache structures.
# TYPE go_memstats_mcache_inuse_bytes gauge
go_memstats_mcache_inuse_bytes 153600
# HELP go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system.
# TYPE go_memstats_mcache_sys_bytes gauge
go_memstats_mcache_sys_bytes 163840
# HELP go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures.
# TYPE go_memstats_mspan_inuse_bytes gauge
go_memstats_mspan_inuse_bytes 58752
# HELP go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system.
# TYPE go_memstats_mspan_sys_bytes gauge
go_memstats_mspan_sys_bytes 65536
# HELP go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place.
# TYPE go_memstats_next_gc_bytes gauge
go_memstats_next_gc_bytes 4.473924e+06
# HELP go_memstats_other_sys_bytes Number of bytes used for other system allocations.
# TYPE go_memstats_other_sys_bytes gauge
go_memstats_other_sys_bytes 1.037183e+06
# HELP go_memstats_stack_inuse_bytes Number of bytes in use by the stack allocator.
# TYPE go_memstats_stack_inuse_bytes gauge
go_memstats_stack_inuse_bytes 491520
# HELP go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator.
# TYPE go_memstats_stack_sys_bytes gauge
go_memstats_stack_sys_bytes 491520
# HELP go_memstats_sys_bytes Number of bytes obtained from system.
# TYPE go_memstats_sys_bytes gauge
go_memstats_sys_bytes 1.4027792e+07
# HELP go_threads Number of OS threads created.
# TYPE go_threads gauge
go_threads 9
# HELP nvidia_gpu_duty_cycle Percent of time over the past sample period during which one or more kernels were executing on the GPU device
# TYPE nvidia_gpu_duty_cycle gauge
nvidia_gpu_duty_cycle{minor_number="0",name="Tesla V100-SXM2-32GB",uuid="GPU-5481fdc1-1b2c-381d-90d9-2df35fc8cecf"} 0
nvidia_gpu_duty_cycle{minor_number="1",name="Tesla V100-SXM2-32GB",uuid="GPU-af66b351-1498-c103-f39e-7592b645dc80"} 0
nvidia_gpu_duty_cycle{minor_number="2",name="Tesla V100-SXM2-32GB",uuid="GPU-95887069-482a-9a95-d02a-7c6e79c47893"} 0
nvidia_gpu_duty_cycle{minor_number="3",name="Tesla V100-SXM2-32GB",uuid="GPU-a38af12e-e2f7-ee15-b064-4628cf1fc5da"} 0
# HELP nvidia_gpu_memory_total_bytes Total memory of the GPU device in bytes
# TYPE nvidia_gpu_memory_total_bytes gauge
nvidia_gpu_memory_total_bytes{minor_number="0",name="Tesla V100-SXM2-32GB",uuid="GPU-5481fdc1-1b2c-381d-90d9-2df35fc8cecf"} 3.4089730048e+10
nvidia_gpu_memory_total_bytes{minor_number="1",name="Tesla V100-SXM2-32GB",uuid="GPU-af66b351-1498-c103-f39e-7592b645dc80"} 3.4089730048e+10
nvidia_gpu_memory_total_bytes{minor_number="2",name="Tesla V100-SXM2-32GB",uuid="GPU-95887069-482a-9a95-d02a-7c6e79c47893"} 3.4089730048e+10
nvidia_gpu_memory_total_bytes{minor_number="3",name="Tesla V100-SXM2-32GB",uuid="GPU-a38af12e-e2f7-ee15-b064-4628cf1fc5da"} 3.4089730048e+10
# HELP nvidia_gpu_memory_used_bytes Memory used by the GPU device in bytes
# TYPE nvidia_gpu_memory_used_bytes gauge
nvidia_gpu_memory_used_bytes{minor_number="0",name="Tesla V100-SXM2-32GB",uuid="GPU-5481fdc1-1b2c-381d-90d9-2df35fc8cecf"} 4.470079488e+09
nvidia_gpu_memory_used_bytes{minor_number="1",name="Tesla V100-SXM2-32GB",uuid="GPU-af66b351-1498-c103-f39e-7592b645dc80"} 2.588934144e+09
nvidia_gpu_memory_used_bytes{minor_number="2",name="Tesla V100-SXM2-32GB",uuid="GPU-95887069-482a-9a95-d02a-7c6e79c47893"} 0
nvidia_gpu_memory_used_bytes{minor_number="3",name="Tesla V100-SXM2-32GB",uuid="GPU-a38af12e-e2f7-ee15-b064-4628cf1fc5da"} 5.640290304e+09
# HELP nvidia_gpu_num_devices Number of GPU devices
# TYPE nvidia_gpu_num_devices gauge
nvidia_gpu_num_devices 4
# HELP nvidia_gpu_power_usage_milliwatts Power usage of the GPU device in milliwatts
# TYPE nvidia_gpu_power_usage_milliwatts gauge
nvidia_gpu_power_usage_milliwatts{minor_number="0",name="Tesla V100-SXM2-32GB",uuid="GPU-5481fdc1-1b2c-381d-90d9-2df35fc8cecf"} 68088
nvidia_gpu_power_usage_milliwatts{minor_number="1",name="Tesla V100-SXM2-32GB",uuid="GPU-af66b351-1498-c103-f39e-7592b645dc80"} 56426
nvidia_gpu_power_usage_milliwatts{minor_number="2",name="Tesla V100-SXM2-32GB",uuid="GPU-95887069-482a-9a95-d02a-7c6e79c47893"} 38826
nvidia_gpu_power_usage_milliwatts{minor_number="3",name="Tesla V100-SXM2-32GB",uuid="GPU-a38af12e-e2f7-ee15-b064-4628cf1fc5da"} 71068
# HELP nvidia_gpu_temperature_celsius Temperature of the GPU device in celsius
# TYPE nvidia_gpu_temperature_celsius gauge
nvidia_gpu_temperature_celsius{minor_number="0",name="Tesla V100-SXM2-32GB",uuid="GPU-5481fdc1-1b2c-381d-90d9-2df35fc8cecf"} 45
nvidia_gpu_temperature_celsius{minor_number="1",name="Tesla V100-SXM2-32GB",uuid="GPU-af66b351-1498-c103-f39e-7592b645dc80"} 46
nvidia_gpu_temperature_celsius{minor_number="2",name="Tesla V100-SXM2-32GB",uuid="GPU-95887069-482a-9a95-d02a-7c6e79c47893"} 37
nvidia_gpu_temperature_celsius{minor_number="3",name="Tesla V100-SXM2-32GB",uuid="GPU-a38af12e-e2f7-ee15-b064-4628cf1fc5da"} 51
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 0.02
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1.048576e+06
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 22
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 1.6646144e+07
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.63059385687e+09
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 1.264910336e+09
# HELP process_virtual_memory_max_bytes Maximum amount of virtual memory available in bytes.
# TYPE process_virtual_memory_max_bytes gauge
process_virtual_memory_max_bytes 1.8446744073709552e+19
# HELP promhttp_metric_handler_requests_in_flight Current number of scrapes being served.
# TYPE promhttp_metric_handler_requests_in_flight gauge
promhttp_metric_handler_requests_in_flight 1
# HELP promhttp_metric_handler_requests_total Total number of scrapes by HTTP status code.
# TYPE promhttp_metric_handler_requests_total counter
promhttp_metric_handler_requests_total{code="200"} 1
promhttp_metric_handler_requests_total{code="500"} 0
promhttp_metric_handler_requests_total{code="503"} 0

【Discussion】:

    Tags: prometheus grafana nvidia ppc64le


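    【Solution 1】:

    Based on the metrics you shared, the following ones give you information about GPU utilization:

    nvidia_gpu_duty_cycle - percent of time over the past sample period during which one or more kernels were executing on the GPU device

    nvidia_gpu_memory_total_bytes - total memory of the GPU device, in bytes

    nvidia_gpu_memory_used_bytes - memory used by the GPU device, in bytes

    nvidia_gpu_num_devices - number of GPU devices

    nvidia_gpu_power_usage_milliwatts - power usage of the GPU device, in milliwatts

    nvidia_gpu_temperature_celsius - temperature of the GPU device, in degrees Celsius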

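    From the Prometheus UI, or from Grafana with Prometheus as the data source, you can use these metric names in your query expressions to retrieve the relevant GPU metrics. For example, a simple query such as nvidia_gpu_memory_total_bytes returns every time series that matches that metric name.

    Also note that the metrics you shared contain 4 entries for each of the values above, one per available GPU device, with minor_number running from 0 to 3. If you only want the metric for a specific device, say device 2, the query needs to look like this: nvidia_gpu_memory_total_bytes{minor_number="2"}. Pay attention to the comma-separated labels between the {} after each metric name, because you can use them to filter your queries as needed. More information on Prometheus queries is available here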
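To make the label filtering concrete, here is a minimal Python sketch of what a selector such as nvidia_gpu_memory_total_bytes{minor_number="2"} picks out of the exporter output. It only illustrates the matching idea and is not how Prometheus implements it. The helper names parse_samples and select are made up for this example, the sample lines are shortened copies of the output in the question (the uuid label is left out), and only exact-equality matchers are handled.

```python
import re

# Shortened copies of lines from the question's exporter output.
# Assumption for brevity: the uuid label is omitted.
SAMPLE = """\
# TYPE nvidia_gpu_memory_total_bytes gauge
nvidia_gpu_memory_total_bytes{minor_number="0",name="Tesla V100-SXM2-32GB"} 3.4089730048e+10
nvidia_gpu_memory_total_bytes{minor_number="2",name="Tesla V100-SXM2-32GB"} 3.4089730048e+10
nvidia_gpu_memory_used_bytes{minor_number="0",name="Tesla V100-SXM2-32GB"} 4.470079488e+09
nvidia_gpu_memory_used_bytes{minor_number="2",name="Tesla V100-SXM2-32GB"} 0
"""

# metric_name, optional {labels}, then the sample value
LINE_RE = re.compile(r'^(\w+)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Return a list of (metric_name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # ignore anything this simple pattern cannot read
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def select(samples, name, **matchers):
    """Keep samples with this metric name whose labels equal every matcher."""
    return [
        sample for sample in samples
        if sample[0] == name
        and all(sample[1].get(key) == want for key, want in matchers.items())
    ]


samples = parse_samples(SAMPLE)
device_2 = select(samples, "nvidia_gpu_memory_total_bytes", minor_number="2")
```

With the sample above, device_2 holds exactly one entry, the total-memory series for the GPU whose minor_number is "2". Calling select without any matchers returns every series for that metric name, which mirrors the bare nvidia_gpu_memory_total_bytes query.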

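    As for DCGM itself, you can build a Docker image specifically for ppc64le from the source code in the official GitHub repo. The instructions first have you create a separate Docker image that is then used to produce the DCGM build. When generating the DCGM build, you need to pass the --arch ppc option to the ./build.sh script in order to target ppc64le.

    For dcgm-exporter (github), NVIDIA provides a number of prebuilt images in its Docker Hub repo, and the official documentation can be found here.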

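    【Comments】:

    • I need a query that tells me the GPU utilization. I already know the GPU memory utilization, but is it possible to work out the GPU utilization from the nvidia_gpu_duty_cycle metric?
    • @ajnair95, yes, this metric can be used to find the GPU utilization percentage. Just enter nvidia_gpu_duty_cycle as your query string.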
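
As a rough illustration of the two kinds of utilization discussed in these comments, the Python sketch below takes the values from the question's exporter output and computes a memory percentage per GPU plus an average duty cycle. The HELP text describes nvidia_gpu_duty_cycle as a percent already, so it needs no scaling. The dictionaries and the helper names memory_percent and mean are assumptions made for this example. In Prometheus itself the corresponding expressions would be 100 * nvidia_gpu_memory_used_bytes / nvidia_gpu_memory_total_bytes and avg(nvidia_gpu_duty_cycle).

```python
# Values copied from the exporter output in the question, keyed by minor_number.
duty_cycle = {"0": 0.0, "1": 0.0, "2": 0.0, "3": 0.0}  # already a percent
memory_used = {
    "0": 4.470079488e9,
    "1": 2.588934144e9,
    "2": 0.0,
    "3": 5.640290304e9,
}
# Every GPU in the question reports the same total memory.
memory_total = {gpu: 3.4089730048e10 for gpu in memory_used}


def memory_percent(used, total):
    """Per-GPU memory utilization as a percentage of total memory."""
    return {gpu: 100.0 * used[gpu] / total[gpu] for gpu in used}


def mean(values):
    """Arithmetic mean, or 0.0 for an empty input."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0


mem_pct = memory_percent(memory_used, memory_total)
avg_duty = mean(duty_cycle.values())

for gpu in sorted(mem_pct):
    print(f"GPU {gpu}: duty cycle {duty_cycle[gpu]:.0f}%, memory {mem_pct[gpu]:.1f}%")
print(f"average duty cycle across GPUs: {avg_duty:.0f}%")
```

In the snapshot from the question every duty cycle reads 0, so the GPUs were idle at scrape time even though three of them had memory allocated. That is why memory usage alone is a poor stand-in for compute utilization.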