【发布时间】:2022-01-06 14:18:40
【问题描述】:
我正在尝试比较 CUDA 内存管理的托管和非托管版本所花费的总执行时间。
在下面的示例代码中,我有两个功能完全相同。唯一不同的是它们的内存管理。一个函数使用cudaMalloc()/cudaMemcpy(),另一种方法只使用cudaMallocManaged()。
我使用nvprof计算不同的时间,得到以下输出:
托管版本nvprof 输出:
== 29028 == Profiling result :
Type Time(%) Time Calls Avg Min Max Name
GPU activities : 100.00 % 59.425us 1 59.425us 59.425us 59.425us add(int, float*, float*)
API calls : 78.08 % 296.49ms 2 148.24ms 1.7127ms 294.78ms cudaMallocManaged
19.61 % 74.451ms 1 74.451ms 74.451ms 74.451ms cuDevicePrimaryCtxRelease
1.55 % 5.8705ms 1 5.8705ms 5.8705ms 5.8705ms cudaLaunchKernel
0.67 % 2.5547ms 2 1.2774ms 974.40us 1.5803ms cudaFree
0.07 % 280.60us 1 280.60us 280.60us 280.60us cudaDeviceSynchronize
0.01 % 28.300us 3 9.4330us 3.0000us 13.300us cuModuleUnload
0.01 % 26.800us 1 26.800us 26.800us 26.800us cuDeviceTotalMem
0.00 % 17.700us 101 175ns 100ns 900ns cuDeviceGetAttribute
0.00 % 10.100us 3 3.3660us 300ns 8.8000us cuDeviceGetCount
0.00 % 3.2000us 1 3.2000us 3.2000us 3.2000us cuDeviceGetName
0.00 % 3.0000us 2 1.5000us 300ns 2.7000us cuDeviceGet
0.00 % 500ns 1 500ns 500ns 500ns cuDeviceGetLuid
0.00 % 200ns 1 200ns 200ns 200ns cuDeviceGetUuid
== 29028 == Unified Memory profiling result :
Device "GeForce GTX 1070 (0)"
Count Avg Size Min Size Max Size Total Size Total Time Name
64 128.00KB 128.00KB 128.00KB 8.000000MB 3.279000ms Host To Device
146 84.164KB 32.000KB 1.0000MB 12.00000MB 64.50870ms Device To Host
非托管版本nvprof 输出:
== 23864 == Profiling result :
Type Time(%) Time Calls Avg Min Max Name
GPU activities : 56.30 % 1.5032ms 2 751.60us 751.44us 751.76us[CUDA memcpy HtoD]
41.48 % 1.1075ms 1 1.1075ms 1.1075ms 1.1075ms[CUDA memcpy DtoH]
2.23 % 59.457us 1 59.457us 59.457us 59.457us add(int, float*, float*)
API calls : 78.92 % 270.08ms 2 135.04ms 656.40us 269.43ms cudaMalloc
19.79 % 67.730ms 1 67.730ms 67.730ms 67.730ms cuDevicePrimaryCtxRelease
1.05 % 3.5796ms 3 1.1932ms 1.0106ms 1.4341ms cudaMemcpy
0.10 % 346.20us 2 173.10us 3.4000us 342.80us cudaFree
0.09 % 314.30us 1 314.30us 314.30us 314.30us cudaDeviceSynchronize
0.02 % 74.200us 1 74.200us 74.200us 74.200us cudaLaunchKernel
0.01 % 34.700us 3 11.566us 2.5000us 29.100us cuModuleUnload
0.01 % 24.100us 1 24.100us 24.100us 24.100us cuDeviceTotalMem
0.00 % 17.100us 101 169ns 100ns 900ns cuDeviceGetAttribute
0.00 % 9.0000us 3 3.0000us 300ns 8.0000us cuDeviceGetCount
0.00 % 3.2000us 1 3.2000us 3.2000us 3.2000us cuDeviceGetName
0.00 % 1.5000us 2 750ns 200ns 1.3000us cuDeviceGet
0.00 % 300ns 1 300ns 300ns 300ns cuDeviceGetUuid
0.00 % 300ns 1 300ns 300ns 300ns cuDeviceGetLuid
我的代码:
int RunManagedVersion()
{
int N = 1 << 20;
float* x, * y;
// Allocate Unified Memory -- accessible from CPU or GPU
cudaMallocManaged(&x, N * sizeof(float));
cudaMallocManaged(&y, N * sizeof(float));
// initialize x and y arrays on the host
for (int i = 0; i < N; i++) {
x[i] = 1.0f;
y[i] = 2.0f;
}
// Launch kernel on 1M elements on the GPU
int blockSize = 256;
int numBlocks = (N + blockSize - 1) / blockSize;
add << <numBlocks, blockSize >> > (N, x, y);
// Wait for GPU to finish before accessing on host
cudaDeviceSynchronize();
// Check for errors (all values should be 3.0f)
float maxError = 0.0f;
for (int i = 0; i < N; i++)
maxError = fmax(maxError, fabs(y[i] - 3.0f));
std::cout << "Max error: " << maxError << std::endl;
// Free memory
cudaFree(x);
cudaFree(y);
return 0;
}
int RunUnmanagedVersion()
{
int N = 1 << 20;
//Declare pointers for input and output arrays
float* x = (float*)calloc(N, sizeof(float));
float* y = (float*)calloc(N, sizeof(float));
// initialize x and y arrays on the host
for (int i = 0; i < N; i++) {
x[i] = 1.0f;
y[i] = 2.0f;
}
//Allocate device memory for input and output images
float* d_pX = 0;
float* d_pY = 0;
cudaMalloc(&d_pX, N * sizeof(float));
cudaMalloc(&d_pY, N * sizeof(float));
//Copy INPUT ARRAY data from host to device
cudaMemcpy(d_pX, x, N * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_pY, y, N * sizeof(float), cudaMemcpyHostToDevice);
// Launch kernel on 1M elements on the GPU
int blockSize = 256;
int numBlocks = (N + blockSize - 1) / blockSize;
add << <numBlocks, blockSize >> > (N, d_pX, d_pY);
// Wait for GPU to finish before accessing on host
cudaDeviceSynchronize();
//Copy Results - Device to Host
cudaMemcpy(y, d_pY, N * sizeof(float), cudaMemcpyDeviceToHost);
// Check for errors (all values should be 3.0f)
float maxError = 0.0f;
for (int i = 0; i < N; i++)
maxError = fmax(maxError, fabs(y[i] - 3.0f));
std::cout << "Max error: " << maxError << std::endl;
// device memory free
cudaFree(d_pX);
cudaFree(d_pX);
//host memory free
free(x);
free(y);
return 0;
}
int main()
{
RunUnmanagedVersion();
//RunManagedVersion();
return 0;
}
问题:我多次使用上述代码,并注意到在托管版本(即统一内存)的情况下,来自DeviceToHost 的数据传输时间要长得多。这是正常的(为什么?)还是我在代码中做错了什么?
【问题讨论】:
-
我不相信任何这些都被记录或指定,所以关于“为什么”的权威答案是不可能的,我不相信。你没有做错什么。您对 D->H 时间的解释与非托管情况不能直接比较,因为这些传输是与主机代码同时执行的。这些细节可以从探查器中观察到,但确切的特征没有记录,AFAIK。在 H->D 的情况下,内核启动会触发传输。在 H->D 情况下,传输由
cudaDeviceSynchronize()启用,但由主机代码活动驱动
标签: cuda