The PyTorch Profiler and Layer-by-Layer Profiling of PyTorch Models

PyTorch's autograd module includes a profiler that lets you inspect the cost of the different operators in your model, on both CPU and GPU.

Two modes are currently implemented: a CPU-only mode using `profile`, and an nvprof-based mode (which registers both CPU and GPU activity) using `emit_nvtx`.

`torch.autograd.profiler.profile`(enabled=True, use_cuda=False, record_shapes=False)

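Context manager that manages autograd profiler state and holds a summary of the results. Under the hood, it only records events for functions executed in C++ and exposes those events to Python. You can wrap any code in it, and it will report only the runtime of PyTorch functions.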

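Parameters:

enabled (bool, optional) – Setting this to False makes this context manager a no-op. Default: True.

use_cuda (bool, optional) – Enables timing of CUDA events as well, using the cudaEvent API. Adds approximately 4us of overhead to each tensor operation. Default: False.

record_shapes (bool, optional) – If shape recording is enabled, information about input dimensions is collected. This lets you see which dimensions were used under the hood and group events by them with prof.key_averages(group_by_input_shape=True). Note that shape recording may skew your profiling data. For the bottom-most events (in the case of nested function calls) the skew is most likely negligible, but for higher-level functions the total self CPU time may be artificially inflated by the shape collection.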

Example

import torch

x = torch.randn((1, 1), requires_grad=True)
with torch.autograd.profiler.profile() as prof:
    for _ in range(100):  # any normal python code, really!
        y = x ** 2
        y.backward()
# NOTE: some columns were removed for brevity
print(prof.key_averages().table(sort_by="self_cpu_time_total"))

Result (without GPU):

------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------  
Name                                        Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     Number of Calls  Input Shapes                         
------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------  
pow                                         64.76%           3.096ms          64.76%           3.096ms          3.096ms          1                []                                   
struct torch::autograd::GraphRoot           0.37%            17.700us         0.37%            17.700us         17.700us         1                []                                   
PowBackward0                                23.10%           1.104ms          23.10%           1.104ms          1.104ms          1                []                                   
pow                                         1.37%            65.700us         1.37%            65.700us         65.700us         1                []                                   
mul                                         10.11%           483.100us        10.11%           483.100us        483.100us        1                []                                   
mul                                         0.13%            6.200us          0.13%            6.200us          6.200us          1                []                                   
struct torch::autograd::AccumulateGrad      0.14%            6.500us          0.14%            6.500us          6.500us          1                []                                   
detach                                      0.03%            1.500us          0.03%            1.500us          1.500us          1                []                                   
------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------  
Self CPU time total: 4.780ms

Result (with GPU):

------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------  
Name                                        Self CPU total %  Self CPU total   CPU total %      CPU total        CPU time avg     CUDA total %     CUDA total       CUDA time avg    Number of Calls  Input Shapes                         
------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------  
pow                                         29.13%           3.246ms          29.13%           3.246ms          3.246ms          31.62%           2.866ms          2.866ms          1                []                                   
struct torch::autograd::GraphRoot           0.09%            9.600us          0.09%            9.600us          9.600us          0.02%            2.048us          2.048us          1                []                                   
PowBackward0                                34.12%           3.803ms          34.12%           3.803ms          3.803ms          32.89%           2.982ms          2.982ms          1                []                                   
pow                                         8.53%            950.500us        8.53%            950.500us        950.500us        2.63%            238.592us        238.592us        1                []                                   
mul                                         16.06%           1.789ms          16.06%           1.789ms          1.789ms          19.44%           1.762ms          1.762ms          1                []                                   
mul                                         8.94%            996.700us        8.94%            996.700us        996.700us        10.73%           972.864us        972.864us        1                []                                   
struct torch::autograd::CopyBackwards       1.47%            163.900us        1.47%            163.900us        163.900us        1.31%            118.688us        118.688us        1                []                                   
to                                          1.40%            155.900us        1.40%            155.900us        155.900us        1.27%            114.944us        114.944us        1                []                                   
empty_strided                               0.09%            10.300us         0.09%            10.300us         10.300us         0.01%            1.023us          1.023us          1                []                                   
struct torch::autograd::AccumulateGrad      0.13%            15.000us         0.13%            15.000us         15.000us         0.06%            5.281us          5.281us          1                []                                   
detach                                      0.04%            4.700us          0.04%            4.700us          4.700us          0.02%            1.760us          1.760us          1                []                                   
------------------------------------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  ---------------  -----------------------------------  
Self CPU time total: 11.144ms
CUDA time total: 9.066ms
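The record_shapes option described above can be tried with a short sketch like the following. It runs on CPU only; the tensor sizes and the variable names (`small`, `large`, `mm_events`) are arbitrary choices for illustration, and the operator key is matched against both `mm` and `aten::mm` because the reported name differs between PyTorch versions.

```python
import torch
from torch.autograd import profiler

# Two inputs with different shapes, so the grouped summary has distinct rows.
small = torch.randn(8, 8)
large = torch.randn(64, 64)

with profiler.profile(record_shapes=True) as prof:
    for _ in range(10):
        torch.mm(small, small)
        torch.mm(large, large)

# Group events by (operator name, input shapes) instead of by name alone.
grouped = prof.key_averages(group_by_input_shape=True)
print(grouped.table(sort_by="self_cpu_time_total", row_limit=10))

# Pick out the matrix-multiply rows; one row is expected per distinct shape.
mm_events = [evt for evt in grouped if evt.key in ("mm", "aten::mm")]
for evt in mm_events:
    print(evt.key, evt.input_shapes, evt.count)
```

Without `group_by_input_shape=True`, both matrix sizes would be merged into a single row, which hides the fact that the larger multiplication dominates the cost.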

`torch.autograd.profiler.record_function`(name)

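Context manager/function decorator that adds a label to a block of Python code (or a function) when running the autograd profiler. It is useful when tracing the profile of your code.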

>>> x = torch.randn((1, 1), requires_grad=True)
>>> with torch.autograd.profiler.profile() as prof:
...     y = x ** 2
...     with torch.autograd.profiler.record_function("label-z"): # label the block
...         z = y ** 3
...     y.backward()
...
>>> # NOTE: some columns were removed for brevity
>>> print(prof.key_averages().table(sort_by="self_cpu_time_total"))
-----------------------------------  ---------------  ---------------  ---------------
Name                                 Self CPU total %  CPU time avg     Number of Calls
-----------------------------------  ---------------  ---------------  ---------------
pow                                  60.77%           47.470us         3
mul                                  21.73%           25.465us         2
PowBackward0                         12.03%           121.891us        1
torch::autograd::AccumulateGrad      2.70%            6.324us          1
label-z                              2.13%            12.421us         1
torch::autograd::GraphRoot           0.64%            1.503us          1
-----------------------------------  ---------------  ---------------  ---------------
Self CPU time total: 234.344us
CUDA time total: 0.000us
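The example above uses record_function as a context manager. A minimal sketch of the decorator form might look like this; the function `square_and_sum` and the label "square-and-sum" are hypothetical names chosen for illustration.

```python
import torch
from torch.autograd import profiler


# Every call to this function is recorded under the label "square-and-sum".
@profiler.record_function("square-and-sum")
def square_and_sum(t):
    return (t ** 2).sum()


x = torch.randn(4, 4, requires_grad=True)
with profiler.profile() as prof:
    loss = square_and_sum(x)
    loss.backward()

# The label appears as its own row next to the built-in operators.
labels = [evt.key for evt in prof.key_averages()]
print("square-and-sum" in labels)
print(prof.key_averages().table(sort_by="self_cpu_time_total"))
```

The decorator form suits a function that is called from many places, since the label then travels with the function rather than with each call site.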

`torch.autograd.profiler.emit_nvtx`(enabled=True, record_shapes=False)

Context manager that makes every autograd operation emit an NVTX range.

It is useful when running the program under nvprof:

nvprof --profile-from-start off -o trace_name.prof -- <regular command here>

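Unfortunately, there is no way to force nvprof to flush the data it has collected to disk, so for CUDA profiling you have to use this context manager to annotate the nvprof trace and wait for the process to exit before inspecting it. You can then visualize the timeline with the NVIDIA Visual Profiler (nvvp), or load the results for inspection with torch.autograd.profiler.load_nvprof(), for example in a Python REPL.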

>>> with torch.cuda.profiler.profile():
...     model(x) # Warmup CUDA memory allocator and profiler
...     with torch.autograd.profiler.emit_nvtx():
...         model(x)

`torch.autograd.profiler.load_nvprof`(path)

Opens an nvprof trace file and parses the autograd annotations.
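Putting the emit_nvtx pieces together, a complete script meant to be launched under nvprof might look roughly like the sketch below. The helper `profile_with_nvtx` and the small `Linear` model are hypothetical stand-ins, and the early return on machines without CUDA is an added safeguard, not part of the API.

```python
import torch
from torch.autograd import profiler


def profile_with_nvtx(model, x):
    """Run a warm-up pass and an NVTX-annotated pass; return False without CUDA."""
    if not torch.cuda.is_available():
        return False  # emit_nvtx needs a CUDA device
    model, x = model.cuda(), x.cuda()
    with torch.cuda.profiler.profile():
        model(x)  # warm up the CUDA memory allocator and the profiler
        with profiler.emit_nvtx(record_shapes=True):
            model(x)  # only this pass is annotated with NVTX ranges
    return True


# Hypothetical stand-in model and input.
ran = profile_with_nvtx(torch.nn.Linear(16, 4), torch.randn(2, 16))
print("annotated pass ran:", ran)
```

Assuming the script is started with the nvprof command shown earlier and writes trace_name.prof, the trace could then be inspected after the process exits with `torch.autograd.profiler.load_nvprof("trace_name.prof")`.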

Layer-by-Layer Profiling of PyTorch Models

The torchprof library can be used to profile a PyTorch model layer by layer.

pip install torchprof

import torch
import torchvision
import torchprof

model = torchvision.models.alexnet(pretrained=False).cuda()
x = torch.rand([1, 3, 224, 224]).cuda()

with torchprof.Profile(model, use_cuda=True) as prof:
    model(x)

print(prof.display(show_events=False)) # equivalent to `print(prof)` and `print(prof.display())`

Module         | Self CPU total | CPU total | CUDA total | Occurrences
---------------|----------------|-----------|------------|------------
AlexNet        |                |           |            |
├── features   |                |           |            |
│├── 0         |        1.671ms |   6.589ms |    6.701ms |           1
│├── 1         |       62.430us |  62.430us |   63.264us |           1
│├── 2         |       62.909us | 109.948us |  112.640us |           1
│├── 3         |      225.389us | 858.376us |    1.814ms |           1
│├── 4         |       18.999us |  18.999us |   19.456us |           1
│├── 5         |       29.560us |  52.720us |   54.272us |           1
│├── 6         |      136.959us | 511.216us |  707.360us |           1
│├── 7         |       18.480us |  18.480us |   18.624us |           1
│├── 8         |       84.380us | 300.700us |  590.688us |           1
│├── 9         |       18.249us |  18.249us |   17.632us |           1
│├── 10        |       81.289us | 289.946us |  470.016us |           1
│├── 11        |       17.850us |  17.850us |   18.432us |           1
│└── 12        |       29.350us |  52.260us |   52.288us |           1
├── avgpool    |       41.840us |  70.840us |   76.832us |           1
└── classifier |                |           |            |
 ├── 0         |       66.400us | 122.110us |  125.920us |           1
 ├── 1         |      293.658us | 293.658us |  664.704us |           1
 ├── 2         |       17.600us |  17.600us |   18.432us |           1
 ├── 3         |       27.920us |  49.030us |   51.168us |           1
 ├── 4         |       40.590us |  40.590us |  208.672us |           1
 ├── 5         |       17.570us |  17.570us |   18.432us |           1
 └── 6         |       40.489us |  40.489us |   81.920us |           1
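To get a feel for what a per-layer breakdown measures, the sketch below times each top-level child of a model with forward hooks and a wall clock. This is a swapped-in, simplified technique and not how torchprof is implemented; `time_layers` and the three-layer model are hypothetical, and the timings cover CPU only, with no CUDA synchronization.

```python
import time

import torch
from torch import nn


def time_layers(model, x):
    """Return {child_name: seconds} for one forward pass (CPU wall clock)."""
    timings, starts, handles = {}, {}, []
    for name, layer in model.named_children():
        # Bind `name` as a default argument so each hook keeps its own label.
        def pre_hook(module, inputs, name=name):
            starts[name] = time.perf_counter()

        def post_hook(module, inputs, output, name=name):
            timings[name] = time.perf_counter() - starts[name]

        handles.append(layer.register_forward_pre_hook(pre_hook))
        handles.append(layer.register_forward_hook(post_hook))
    try:
        with torch.no_grad():
            model(x)
    finally:
        for handle in handles:
            handle.remove()  # leave the model without leftover hooks
    return timings


model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
timings = time_layers(model, torch.randn(4, 32))
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e6:.1f}us")
```

A dedicated tool remains the better choice for real measurements, since it reports self CPU time and CUDA time per layer as in the table above, which plain wall-clock hooks cannot separate.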
