Possible to speed up this gpuArray calculation with arrayfun() (or otherwise)?

【Question title】: Possible to speed up this gpuArray calculation with arrayfun() (or otherwise)?
【Posted】: 2021-04-23 00:00:12
【Question】:

I have a complex matrix A and want to update it Nt times according to A = exp( -1i*(A + abs(A).^2) ). A is typically 1000x1000, and the number of iterations is around 10000.

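I want to reduce the time these operations take. For 1000 iterations on the CPU I measure about 6.4 seconds. Following the Matlab documentation, I was able to move the calculation to the GPU, which reduced the time to 0.07 seconds (an incredible x91 improvement!). So far so good.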

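However, I have also read this link in the documentation, which describes how element-wise calculations can sometimes be sped up further by also using arrayfun(). When I follow that tutorial, the time taken is actually worse, at 0.47 seconds. My test is shown below: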

Nt = 1000; % Number of times to run each method
test_functionFcn = @test_function;

A = rand( 500, 600, 'double' ) + rand( 500, 600, 'double' )*1i; % Define an initial complex matrix
    
gpu_A = gpuArray(A); % Transfer matrix to a GPU array

%%%%%%%%%%%%%%%%%%%% Run the calculation Nt times on CPU only %%%%%%%%%%%%%%%%%%%%
cpu_data_out = A;
tic
for k = 1:Nt 
    cpu_data_out = test_function( cpu_data_out );
end
tcpu = toc;

%%%%%%%%%%%%%%%%% Run the calculation Nt times on GPU directly %%%%%%%%%%%%%%%%%%%%
gpu_data_out = gpu_A;
tic
for k = 1:Nt
    gpu_data_out = test_function(gpu_data_out);
end
tgpu = toc;

%%%%%%%%%%%%%% Run the calculation Nt times on GPU using arrayfun() %%%%%%%%%%%%%%
gpuarrayfun_data_out = gpu_A;
tic
for k = 1:Nt
    gpuarrayfun_data_out = arrayfun( test_functionFcn, gpuarrayfun_data_out );
end
tgpu_arrayfun = toc;

%%% Print results %%%
fprintf( 'Time taken using only CPU: %g\n', tcpu );
fprintf( 'Time taken using gpuArray directly: %g\n', tgpu );
fprintf( 'Time taken using GPU + arrayfun(): %g\n', tgpu_arrayfun );

%%% Function to operate on matrices %%%
function y = test_function(x)
y = exp(-1i*(x + abs(x).^2));
end

The results are:

Time taken using only CPU: 6.38785
Time taken using gpuArray directly: 0.0680587
Time taken using GPU + arrayfun(): 0.474612

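My questions are:

Am I using arrayfun() correctly here, and is arrayfun() expected to be slower in this case?
If so, and it really is expected to be slower than the direct gpuArray approach, is there any simple (i.e. non-MEX) way to speed up a calculation like this? (For example, I see that pagefun is also mentioned.)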

Thanks in advance for any advice.

(The graphics card is an Nvidia Quadro M4000, and I am running Matlab R2017a.)

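Edit

After reading @Edric's answer, I think it is important to show more of the wider code. One thing I did not mention in the OP is that in my actual main code there is an additional operation inside the k=1:Nt loop: a matrix multiplication with the transpose of a sparse tridiagonal matrix. Below is a more fleshed-out MWE of what is really going on: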

Nt = 1000; % Number of times to run each method
N_rows = 500;
N_cols = 600;
test_functionFcn = @test_function;
A = rand( N_rows, N_cols, 'double' ) + rand( N_rows, N_cols, 'double' )*1i; % Define an initial complex matrix
%%% Generate a sparse, tridiagonal, square transformation matrix %%%%%%%%
mm = 10*ones(N_cols,1); % Subdiagonal elements
dd = 20*ones(N_cols,1); % Main diagonal elements
pp = 30*ones(N_cols,1); % Superdiagonal elements
M = spdiags([mm dd pp],-1:1,N_cols,N_cols);
M(1,1) = 6; % Set a couple of other entries
M(2,1) = 3;
%%%%%%%%%%%%%%%%%%%% Run the calculation Nt times on CPU only %%%%%%%%%%%%
cpu_data_out = A;
for k = 1:Nt 
    cpu_data_out = test_function( cpu_data_out );
    cpu_data_out = cpu_data_out*M.';
end
%%% Function to operate on matrices %%%
function y = test_function(x)
y = exp(-1i*(x + abs(x).^2));
end

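I apologise for not including this in the OP. I did not realise at the time that it might be relevant to the solution. Does this change things? Is there still something to gain from using arrayfun() on the GPU, or is this no longer suitable for conversion to arrayfun()?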

【Comments】:

Tags: performance matlab gpu gpuarray

【Solution 1】:

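There are a few points here. First (and most important), to time code on the GPU you need to either use gputimeit or insert a call to wait(gpuDevice) before calling toc. This is because work is launched asynchronously on the GPU, and you only get accurate timings by waiting for it to finish. With those minor modifications, on my GPU the gpuArray approach takes 0.09 seconds and the arrayfun version takes 0.18 seconds.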
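As a rough sketch of the two timing approaches just described: the variable names follow the question, the anonymous function f is introduced here purely for illustration, and the snippet assumes Parallel Computing Toolbox with a supported GPU.

```matlab
% Sketch: two ways to time GPU work without being misled by asynchronous launches.
% Assumes Parallel Computing Toolbox and a supported GPU.
Nt = 1000;
A = rand(500, 600) + 1i*rand(500, 600);
gpu_A = gpuArray(A);
f = @(x) exp(-1i*(x + abs(x).^2));   % illustrative handle; same update as test_function

% Option 1: tic/toc with explicit synchronization.
dev = gpuDevice;
gpu_data_out = gpu_A;
wait(dev);                 % let earlier GPU work (e.g. the transfer) finish first
tic
for k = 1:Nt
    gpu_data_out = f(gpu_data_out);
end
wait(dev);                 % block until all queued GPU work has finished
tgpu = toc;

% Option 2: gputimeit synchronizes and repeats the measurement itself.
t_single = gputimeit(@() f(gpu_A));   % time for ONE application of f

fprintf('Loop of %d updates (tic/toc + wait): %g s\n', Nt, tgpu);
fprintf('Single update (gputimeit): %g s\n', t_single);
```

Note that gputimeit here times a single update rather than the whole loop, so the two numbers are not directly comparable.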

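Running a loop of GPU operations is generally inefficient, so the main benefit you can get here comes from pushing the loop into the body of the arrayfun function, so that the loop runs directly on the GPU. Like this: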

%%% Function to operate on matrices %%%
function x = test_function(x,Nt)
for ii = 1:Nt
    x = exp(-1i*(x + abs(x).^2));
end
end

You need to call it like A = arrayfun(@test_function, A, Nt). On my GPU this brings the arrayfun time down to 0.05 seconds, so it is roughly twice as fast as the plain gpuArray version.
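Put together as a complete script, the call might look like the sketch below. The matrix size is taken from the question; the warm-up call, the synchronization around tic/toc, and the small check count n_check are additions made here for illustration. It assumes Parallel Computing Toolbox, a supported GPU, and R2016b or later for local functions in scripts.

```matlab
% Sketch: run all Nt updates inside a single arrayfun call on the GPU.
Nt = 1000;
A = rand(500, 600) + 1i*rand(500, 600);
gpu_A = gpuArray(A);
dev = gpuDevice;

% Warm-up call, so that one-time setup is less likely to be counted in the timing.
arrayfun(@test_function, gpu_A, 1);

wait(dev);
tic
out = arrayfun(@test_function, gpu_A, Nt);   % scalar Nt is expanded to every element
wait(dev);
t_fused = toc;
fprintf('Fused arrayfun loop (%d updates): %g s\n', Nt, t_fused);

% Sanity check against the host-side loop over a few iterations only,
% since rounding differences can grow when the update is repeated many times.
n_check = 2;                                 % illustrative value
ref = gpu_A;
for k = 1:n_check
    ref = exp(-1i*(ref + abs(ref).^2));
end
chk = arrayfun(@test_function, gpu_A, n_check);
max_diff = gather(max(abs(chk(:) - ref(:))));
fprintf('Max difference after %d updates: %g\n', n_check, max_diff);

%%% Function to operate on matrices %%%
function x = test_function(x, Nt)
for ii = 1:Nt
    x = exp(-1i*(x + abs(x).^2));
end
end
```

This only works while every step is element-wise. On the GPU, the arrayfun body sees one element at a time, so a step that mixes elements, such as a matrix multiplication, cannot be placed inside it.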

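【Comments】:

Hi @Edric, thank you very much for your answer. 1) Thanks for the suggestion about gputimeit. I had not realised how important it is for getting a proper benchmark, and I will use it in future. 2) I think I can now see better from your example how arrayfun() can bring a gain. However, I don't think this will work in my actual code. I have updated the question so that you can see more of it. Sorry for not including this earlier. Given these extra details, do you have any thoughts on how to speed things up? Thanks for your support.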