在 Julia 中有效地迭代自定义函数答案

【问题标题】：Iterating a custom function efficiently in Julia在 Julia 中有效地迭代自定义函数
【发布时间】：2021-04-04 04:00:57
【问题描述】：

我在 Julia 中非常有效地实现了一个运算符 T_，我想使用 while 循环进行迭代。我的操作员由以下人员给出：

% parameters
β = 0.987
δ = 0.012;

% grids
Kss = 48.1905148382166
kgrid = range(0.75*Kss, stop=1.25*Kss, length=500);
zgrid = [-0.06725382459813659, -0.044835883065424395, -0.0224179415327122, 0 , 0.022417941532712187, 0.04483588306542438, 0.06725382459813657]


% auxiliary functions to build my operator
F_(z,k) = exp(z) * (k^(1/3));  
u_(c) = (c^(1-2) - 1)/(1-2)

% T_operator
function T_(V, P, kgrid, zgrid, β, δ)
    E = V * P'
    T1 = similar(V)
    for i in axes(T1, 2)
        for j in axes(T1, 1)
            temp = F_(zgrid[i], kgrid[j]) + (1-δ)*kgrid[j]
            aux = -Inf
            for l in eachindex(kgrid)
                c = max(0.0, temp - kgrid[l])
                aux = max(aux, u_(c) + β * E[l, i])
            end
            T1[j,i] = aux
        end
    end
    return T1
end

简要说明。该运算符作为输入

V 是一个 500x7 矩阵，P 是一个 7x7 转换矩阵（即每一行加一个）
kgrid 是长度为 500 的网格，zgrid 是长度为 7 的网格
β 和 δ 特定参数

T_ 返回一个 T1 (500x7) 矩阵。有关此运算符的更多详细信息以及运行此运算符的正确方法可以在我提出的另一个问题中找到：Tricks to improve the performance of a cunstom function in Julia

只运行一次这个操作符，它花费的时间很短，几乎是瞬间的。但是，我需要迭代此运算符，直到获得可接受的容差错误，但我的实现导致效率低下的过程需要很长时间：

max_it = 1000
it = 1
tol = 1e-3
dist = tol +1
V0 = repeat(sqrt.(a_grid), outer = [1,7]);
while it < max_it && dist > tol
    TV= T_(V0,P,kgrid, zgrid, β, δ)
    dist = maximum(abs.(TV - V0)) % Computing distance or error
 
    V0 = TV % update
    it = it + 1 % Updating iterations
    
    % Some information about the state of the iteration 
    if rem(it, 100) == 0
        println("Current iteration:")
        println(it)
        println("Current norm:")
        println(dist)
    end
end

我认为更有效的解决方案是将while循环直接合并到T_运算符的实现中，但是我花了一整天的时间尝试这个并无法做到。帮助。

更新

这是 MATLAB 版本。效率更高

V0 = repmat(sqrt(kgrid), 1, 7);     % Concave and increasing guess
max_it = 1000;
tol = 1e-3;

%% Iteration
tic
norm = tol + 1;
it = 1;
tic;
[K, Z, new_K] = meshgrid(kgrid, zgrid, kgrid);
K = permute(K, [2, 1, 3]);
Z = permute(Z, [2, 1, 3]);
new_K = permute(new_K, [2, 1, 3]);

% Computing consumption on each possible state and choice
C = max(f(Z,K) + (1-delta)*K - new_K,0);
% All possible utilities
U = u(C);

disp('Starting value function iteration through the good and old brute force...')
while it < max_it & norm > tol
    EV = V0 * P';
    EV = permute(repmat(EV, 1, 1, nk), [3, 2, 1]);
    H = U + beta*EV;
    [TV, index] = max(H, [], 3);
    it = it + 1;        % Updating iterations
    norm = max(max(abs(TV - V0)));       % Computing error
    V0 = TV;
    
    if rem(it, 100) == 0
        disp('Current iteration:')
        disp(it)
        disp('Current norm:')
        disp(norm)
    end
end

V = TV;
toc;

【问题讨论】：

你是否将整个 while 循环包装在一个函数中？
比T_慢多少。它明显比1000x 慢吗？如果没有，将不会有太多可用的加速。
是的，您应该将F_(zgrid[i], kgrid[j]) + (1-δ)*kgrid[j] 拉到函数T_ 之外，以避免在每次迭代时重做。
当前 Julia 代码在我尝试运行时抛出（例如，a_grid 和 P 未定义）。
是的，我想通了，可以参考另一个问题...但是您应该通过确保代码可复制粘贴并按原样运行来使这个问题自成一体！

标签： performance iterator julia

【解决方案1】：

只是为了了解我们从哪里开始，让我们将您的初始实现包装在一个函数中

function iterate_T_firstattempt(; max_it=1000, it=1, tol=1e-3, dist=tol+1)
    V0 = repeat(sqrt.(kgrid), outer = [1,7]) # Assuming the `a_grid` was a typo from your comments

    while it < max_it && dist > tol
        TV = T_(V0, P, kgrid, zgrid, β, δ)
        dist = maximum(abs.(TV - V0)) # Computing distance or error

        V0 = TV # update
        it += 1 # Updating iterations

        # Some information about the state of the iteration
        if rem(it, 100) == 0
            println("Current iteration:")
            println(it)
            println("Current norm:")
            println(dist)
        end
    end
end

并使用BenchmarkTools.jl对其进行基准测试

julia> @benchmark iterate_T_firstattempt()
 sample with 1 evaluation.
 Single result which took 7.056 s (0.00% GC) to evaluate,
 with a memory estimate of 52.33 MiB, over 5875 allocations.

哎呀，分配太多了。其中一些来自全局变量的使用，另一些来自类型不稳定性，还有一些来自函数设计。具体几点：

编译器可能已经做出了正确的调用，但我们不妨在u_(c) 和F_(z,k) 的定义中添加@inline 以确保它们被内联。为什么不在T_ 本身上呢？
您在嵌套的for 循环中进行了大量索引，不妨在其中抛出一个@inbounds，因为不应该有越界索引。
更好的是：T_ 中的循环看起来可以安全地重新排序，因此我们可以继续将@inbounds 从LoopVectorization.jl 升级到@turbo 或@tturbo，以获得更大的加速使用 CPU 的 SIMD 指令 /Advanced Vector Extensions。
dist = maximum(abs.(TV - V0)) 的计算至少涉及到两个大的分配，我们可以通过简单的mapreduce 来避免那些。或者再次使用这些 SIMD 指令，vmapreduce，来自 LoopVectorization.jl
TV = T_(V0, P, kgrid, zgrid, β, δ) 行也在分配，让我们把它换成就地版本T_!。
如上所述，全局变量是个坏消息。不过，我们可以轻松地将它们移动到 iterate_T 的函数签名中，这应该可以解决这个问题。

在此过程中，让我们从LinearAlgebra 标准库中拆分出三个参数mul!，以进行E = V * P' 的非分配计算。为了摆脱类型不稳定的最后一个偷偷摸摸的来源（导致最终的 ~2k 分配），我们应该将 outer=[1,7] 更改为 outer=(1,7) -- 一个很好的稳定元组而不是数组。

把它们放在一起：

using LinearAlgebra, LoopVectorization

# parameters
β = 0.987
δ = 0.012

# grids
Kss = 48.1905148382166
kgrid = range(0.75*Kss, stop=1.25*Kss, length=500)
zgrid = [-0.06725382459813659, -0.044835883065424395, -0.0224179415327122, 0 , 0.022417941532712187, 0.04483588306542438, 0.06725382459813657]
P = rand(7,7)
P ./= sum(P,dims=2) # Rows sum to one

# auxiliary functions to build operator
@inline F_(z,k) = exp(z) * (k^(1/3))
@inline u_(c) = (c^(1-2) - 1)/(1-2)

# T_operator, in-place version
@inline function T_!(TV, E, V, P, kgrid, zgrid, β, δ)
    mul!(E, V, P')
    @tturbo for i in axes(TV, 2)
        for j in axes(TV, 1)
            temp = F_(zgrid[i], kgrid[j]) + (1-δ)*kgrid[j]
            aux = -Inf
            for l in eachindex(kgrid)
                c = max(0.0, temp - kgrid[l])
                aux = max(aux, u_(c) + β * E[l, i])
            end
            TV[j,i] = aux
        end
    end
    return TV
end

function iterate_T(P, kgrid, zgrid, β, δ; max_it=1000, it=1, tol=1e-3, dist=tol+1)

    V0 = repeat(sqrt.(kgrid), outer=(1,7))

    # Preallocate temporary arrays
    TV = similar(V0)
    E = similar(V0)

    # Iterate
    for it = 1:max_it
        # Non-allocating in-place T_!
        TV = T_!(TV, E, V0, P, kgrid, zgrid, β, δ)
        # Compute distance or error
        dist = vmapreduce((a,b)->abs(a-b), max, TV, V0)

        copyto!(V0, TV) # update

        # # Some information about the state of the iteration
        # if rem(it, 100) == 0
        #     println("Current iteration:")
        #     println(it)
        #     println("Current norm:")
        #     println(dist)
        # end
        (dist < tol) &&  break
    end
    return V0
end

我们得到

julia> @benchmark iterate_T($P, $kgrid, $zgrid, $β, $δ)
BenchmarkTools.Trial: 11 samples with 1 evaluation.
 Range (min … max):  460.246 ms … 599.820 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     474.826 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   486.661 ms ±  40.359 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  █     █                                                        
  █▁▁▇▁▇█▁▁▁▁▁▁▇▁▁▁▁▁▁▁▁▇▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇ ▁
  460 ms           Histogram: frequency by time          600 ms <

 Memory estimate: 86.42 KiB, allocs estimate: 9.

这有点像！

【讨论】：