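【Question title】: Thread performance in Julia
【Posted】: 2023-03-13 16:53:01
【Question description】:

My attempt at parallel Julia code does not get faster as I increase the number of threads.

The following code takes about the same time to run whether I set JULIA_NUM_THREADS to 2 or to 32.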

using Random
using Base.Threads

rmax = 10
dr = 1
Ngal = 100000000

function bin(id, Njobs, x, y, z, w)
    bin_array = zeros(10)
    for i in (id-1)*Njobs + 1:id*Njobs
        r = sqrt(x[i]^2 + y[i]^2 + z[i]^2)
        i_bin = floor(Int, r/dr) + 1
        if i_bin < 10
            bin_array[i_bin] += w[i]
        end
    end
    bin_array
end

Nthreads = nthreads()

x = rand(Ngal)*5
y = rand(Ngal)*5
z = rand(Ngal)*5
w = ones(Ngal)

V = let
    VV = [zeros(10) for _ in 1:Nthreads]
    jobs_per_thread = fill(div(Ngal, Nthreads),Nthreads)
    for i in 1:Ngal-sum(jobs_per_thread)
        jobs_per_thread[i] += 1
    end
    @threads for i = 1:Nthreads
        tid = threadid()
        VV[tid] = bin(tid, jobs_per_thread[tid], x, y, z, w)
    end
    reduce(+, VV)
end

Am I doing something wrong?

【Question comments】:

    Tags: julia


    【Solution 1】:

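    The time spent in the threaded loop is small compared with the rest of the program, which mostly allocates and fills the input arrays. You also allocate arrays whose size depends on the number of threads, so with more threads you spend (slightly) more time on memory allocation.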
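One way to check where the time goes is to time the two phases separately. The sketch below is a minimal, single-threaded illustration under stated assumptions: `bin_all` and `time_phases` are hypothetical names that do not appear in the original code, and the smaller `N` is chosen only to keep the run short.

```julia
using Random

# Hypothetical helper: sums the weights w into radial bins of width dr.
function bin_all(x, y, z, w, dr, nbins)
    bins = zeros(nbins)
    for i in eachindex(x, y, z, w)
        r = sqrt(x[i]^2 + y[i]^2 + z[i]^2)
        b = floor(Int, r / dr) + 1
        if b <= nbins
            bins[b] += w[i]
        end
    end
    return bins
end

# Hypothetical helper: times data generation and binning separately (in seconds).
function time_phases(N; dr = 1.0, nbins = 10)
    t0 = time_ns()
    x = rand(N) * 5
    y = rand(N) * 5
    z = rand(N) * 5
    w = ones(N)
    t_setup = (time_ns() - t0) / 1e9

    t1 = time_ns()
    bins = bin_all(x, y, z, w, dr, nbins)
    t_bin = (time_ns() - t1) / 1e9

    return (setup = t_setup, binning = t_bin, total = sum(bins))
end

time_phases(1_000)              # warm-up call so compilation is not timed
println(time_phases(5_000_000)) # compare the `setup` and `binning` fields
```

The absolute numbers depend on the machine. What matters is the ratio: only the `binning` part can benefit from more threads.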


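    If you care about performance, have a look at https://docs.julialang.org/en/v1/manual/performance-tips/. In particular, avoid non-constant global variables (they hurt performance badly) and put everything inside functions, which also makes the code easier to test and debug. For example, I rewrote your code as: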

    using Random
    using Base.Threads
    
    function bin(id, Njobs, x, y, z, w)
        dr = 1
    
        bin_array = zeros(10)
        for i in (id-1)*Njobs + 1:id*Njobs
            r = sqrt(x[i]^2 + y[i]^2 + z[i]^2)
            i_bin = floor(Int, r/dr) + 1
            if i_bin < 10
                bin_array[i_bin] += w[i]
            end
        end
        bin_array
    end
    
    function test()
        Ngal = 100000000
        x = rand(Ngal)*5
        y = rand(Ngal)*5
        z = rand(Ngal)*5
        w = ones(Ngal)
    
        Nthreads = nthreads()
        VV = [zeros(10) for _ in 1:Nthreads]
        jobs_per_thread = fill(div(Ngal, Nthreads),Nthreads)
        for i in 1:Ngal-sum(jobs_per_thread)
            jobs_per_thread[i] += 1
        end
        @threads for i = 1:Nthreads
            # Index by the loop variable: threadid() is not guaranteed to differ between iterations
            VV[i] = bin(i, jobs_per_thread[i], x, y, z, w)
        end
        reduce(+, VV)
    end
    
    test()
    

    Single-thread performance:

    julia> @time test();
      3.054144 seconds (33 allocations: 5.215 GiB, 11.03% gc time)
    

    Performance with 4 threads:

    julia> @time test();
      2.602698 seconds (65 allocations: 5.215 GiB, 9.92% gc time)
    

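    If I comment out the threaded for loop in test(), I get the timings below. Compared with the timings above, the loop accounts for about 0.6 s with one thread and about 0.1 s with four; the rest is array allocation and random number generation. One thread: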

    julia> @time test();
      2.444296 seconds (21 allocations: 5.215 GiB, 10.54% gc time)
    

    4 threads:

    julia> @time test();
      2.481054 seconds (27 allocations: 5.215 GiB, 12.08% gc time)
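
Separately from the timing question, here is a sketch of a variant that stores partial results by chunk index rather than by thread id, and that takes its chunk ranges from `Iterators.partition`, so it also covers the case where the number of points is not divisible by the number of chunks. As far as I know, `@threads` has defaulted to dynamic scheduling since Julia 1.8, so `threadid()` is not guaranteed to be distinct for each iteration. The names `bin_chunk` and `threaded_bin` are hypothetical, `dr` and `nbins` are passed as arguments instead of being read from globals, and the test `b <= nbins` keeps the last bin, which differs from the `i_bin < 10` check in the code above.

```julia
using Base.Threads

# Hypothetical helper: bins only the points whose indices are in `idx`.
function bin_chunk(idx, x, y, z, w, dr, nbins)
    bins = zeros(nbins)
    for i in idx
        r = sqrt(x[i]^2 + y[i]^2 + z[i]^2)
        b = floor(Int, r / dr) + 1
        if b <= nbins
            bins[b] += w[i]
        end
    end
    return bins
end

# Hypothetical driver: splits 1:N into contiguous chunks, bins each chunk
# in the threaded loop, then sums the per-chunk histograms.
function threaded_bin(x, y, z, w; dr = 1.0, nbins = 10, nchunks = nthreads())
    N = length(x)
    N == 0 && return zeros(nbins)
    chunk_len = cld(N, nchunks)  # the last chunk may be shorter
    chunks = collect(Iterators.partition(1:N, chunk_len))
    partial = Vector{Vector{Float64}}(undef, length(chunks))
    @threads for c in eachindex(chunks)
        # Each iteration writes only to its own slot, so no locking is needed
        partial[c] = bin_chunk(chunks[c], x, y, z, w, dr, nbins)
    end
    return reduce(+, partial)
end

N = 1_000_003  # deliberately not divisible by typical thread counts
x = rand(N) * 5
y = rand(N) * 5
z = rand(N) * 5
w = ones(N)
println(threaded_bin(x, y, z, w))
```

Because the chunks do not overlap and together cover `1:N`, the result should match a serial pass over all points regardless of how the tasks are scheduled.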
    

    【Comments】:
