C# OpenCL GPU implementation for double array math: answers

【Question title】: C# OpenCL GPU implementation for double array math
【Posted】: 2018-03-05 15:38:01
【Question description】:

How can I make the for loop in this function run on the GPU using OpenCL?

    public static double[] Calculate(double[] num, int period)
    {          
        var final = new double[num.Length];
        double sum = num[0];
        double coeff = 2.0 / (1.0 + period);

        for (int i = 0; i < num.Length; i++)
        {
            sum += coeff * (num[i] - sum);
            final[i] = sum;
        }

        return final;
    }

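【Comments】:

codeproject.com/Articles/1116907/How-to-Use-Your-GPU-in-NET
Thanks! I read it, but I don't understand it. I don't understand kernels and global functions.
To be honest, you don't really need to understand them. Your function is almost identical to the one in the example. Just make sure you pass the arguments in the right order and it should work. I don't have a suitable test environment set up.
Why is this question being downvoted? It sounds like a valid question from a newcomer. Why are beginner questions always so unwelcome here?
What are the values of num and period?

Tags: c# .net opencl gpu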

【Solution 1】:

As commenter Cory said, see this link for setup.

How to use your GPU in .NET

How to use this project:

Add the NuGet package Cloo
Add a reference to OpenCLlib.dll
Download OpenCLLib.zip

Add using OpenCL

static void Main(string[] args)
{
    int[] Primes = { 1,2,3,4,5,6,7 };
    EasyCL cl = new EasyCL();
    cl.Accelerator = AcceleratorDevice.GPU;
    cl.LoadKernel(IsPrime);
    // The kernel declares period as int, so pass an int rather than a double.
    cl.Invoke("GetIfPrime", 0, Primes.Length, Primes, 1);
}

static string IsPrime
{
    get
    {
        return @"
        kernel void GetIfPrime(global int* num, int period)
        {
            int index = get_global_id(0);

            int sum = (2.0 / (1.0 + period)) * (num[index] - num[0]);
            printf("" %d \n"",sum);
        }";
    }
}

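【Discussion】:

I know this isn't exactly what you need, but it's a start?
The problem is that it returns a string and I need a double array. If I have to convert from a string back to a double array, I gain nothing by running this function on the GPU, because the double array is very large. It also doesn't compute the right thing, since there is no for loop any more.
The kernel is executed as many times as the length of Primes, which is why Primes.Length is passed as an argument. That is the looping mechanism. As for the returned string, you will need to do some extra work to figure that part out. I've done most of the work for you.
But let me see if I can tweak my answer further to fit your needs. Please give me some valid data for your parameters and what the expected result should be.
period = 14, num[] = 0.241, 0.220, 0.532, 0.455, 0.778, 0.243, 0.882, 0.442, 0.990, 0.124, 0.550, 0.552, 0.500, 0.995, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.720, 0.744, 0.550, 0.850, 0.200, 0.444, 0.560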

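【Solution 2】:

The problem as written is not well suited to the GPU. You cannot parallelize the operation on a single array (in a way that improves performance), because the value of the nth element depends on elements 1 through n. You can, however, use the GPU to process multiple arrays, with each GPU core working on a separate array.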
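To make the dependency concrete, here is a minimal sketch (not part of the original answer) that compares the loop from the question with an expanded form of the same recurrence. Rewriting the update as sum = (1 - coeff) * sum + coeff * num[i] shows that output n is a weighted sum over every input from 0 to n. The class and method names (EmaDependency, Sequential, Direct) are made up for illustration.

```csharp
using System;

static class EmaDependency
{
    // Sequential form, as in the question: each output needs the previous one.
    public static double[] Sequential(double[] num, int period)
    {
        var result = new double[num.Length];
        double coeff = 2.0 / (1.0 + period);
        double sum = num[0];
        for (int i = 0; i < num.Length; i++)
        {
            sum += coeff * (num[i] - sum);
            result[i] = sum;
        }
        return result;
    }

    // Expanded form for one output n: a weighted sum over inputs 0..n.
    // Input k carries the weight coeff * (1 - coeff)^(n - k); the seed
    // sum = num[0] contributes (1 - coeff)^(n + 1) * num[0].
    public static double Direct(double[] num, int period, int n)
    {
        double coeff = 2.0 / (1.0 + period);
        double decay = 1.0 - coeff;
        double total = Math.Pow(decay, n + 1) * num[0];
        for (int k = 0; k <= n; k++)
            total += coeff * Math.Pow(decay, n - k) * num[k];
        return total;
    }
}
```

Direct reaches the same value as Sequential for any index, but it needs n + 1 multiply-adds for a single output, so computing every output this way costs O(N*N) instead of O(N). That is why splitting one array across GPU threads does not pay off here.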

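The full code for the solution is at the end of this answer. The test, which computes 10,000 arrays of up to 10,000 elements each, produced the following (on a GTX1080M and an i7 7700k with 32GB RAM):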

Task Generating Data: 1096.4583ms
Task CPU Single Thread: 596.2624ms
Task CPU Parallel: 179.1717ms
GPU CPU->GPU: 89ms
GPU Execute: 86ms
GPU GPU->CPU: 29ms
Task Running GPU: 921.4781ms
Finished

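In this test we measure how quickly the results can be produced into managed C# arrays using the CPU with one thread, the CPU with all threads, and finally the GPU with all cores. The AreTheSame function verifies that every test produces the same results.

The fastest time is processing the arrays on the CPU using all threads (Task CPU Parallel: 179ms).

The GPU is actually the slowest (Task Running GPU: 922ms), but that is because of the time it takes to reformat the C# arrays so they can be transferred to the GPU.

If this bottleneck is removed (which is quite possible, depending on your use case), the GPU could potentially be the fastest. If the data were already laid out so that it could be transferred to the GPU immediately, the total GPU processing time would be 204ms (CPU->GPU: 89ms + Execute: 86ms + GPU->CPU: 29ms = 204ms). That is still slower than the parallel CPU option, but on a different kind of data set it might be faster.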
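As a rough illustration of "already laid out for transfer", the sketch below keeps every array back to back in one double[] with offset and length tables. These are the same three inputs the Calc kernel in the code below takes. FlatSets and its members are hypothetical names, and the in-place method is a CPU stand-in for the kernel, so the example runs without Cloo or a GPU.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical container: all arrays stored contiguously in one buffer.
sealed class FlatSets
{
    public double[] Doubles { get; }  // every value of every array, concatenated
    public int[] Offsets { get; }     // start index of array i inside Doubles
    public int[] Lengths { get; }     // element count of array i

    private FlatSets(double[] doubles, int[] offsets, int[] lengths)
    {
        Doubles = doubles;
        Offsets = offsets;
        Lengths = lengths;
    }

    // Build the layout once, for example while loading the data.
    public static FlatSets From(double[][] sets)
    {
        var offsets = new int[sets.Length];
        var lengths = new int[sets.Length];
        int total = 0;
        for (int i = 0; i < sets.Length; i++)
        {
            offsets[i] = total;
            lengths[i] = sets[i].Length;
            total += sets[i].Length;
        }
        var doubles = new double[total];
        for (int i = 0; i < sets.Length; i++)
            Array.Copy(sets[i], 0, doubles, offsets[i], lengths[i]);
        return new FlatSets(doubles, offsets, lengths);
    }

    // CPU stand-in for the kernel: one "work item" per array, updated in place.
    public void CalculateInPlace(int period)
    {
        double periodFactor = 2.0 / (1.0 + period);
        Parallel.For(0, Lengths.Length, id =>
        {
            int start = Offsets[id];
            int end = start + Lengths[id];
            if (end == start) return;          // skip empty arrays
            double sum = Doubles[start];
            for (int i = start; i < end; i++)
            {
                sum += periodFactor * (Doubles[i] - sum);
                Doubles[i] = sum;
            }
        });
    }
}
```

If the data source can produce this layout directly, the per-batch merge and the copy back into jagged arrays both disappear, and only the transfer and execute times remain. Whether that is feasible depends on how the arrays are generated, which the question does not say.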

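To get the data back from the GPU (the most important part of actually using the GPU), we use the function ComputeCommandQueue.Read. This transfers the altered array on the GPU back to the CPU.

To run the following code, reference the Cloo NuGet package (I used 0.9.1), and make sure to compile for x64 (you will need the memory). You may also need to update your graphics card driver if no OpenCL device is found.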

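using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Runtime.InteropServices;
using System.Threading.Tasks;
using Cloo;

class Program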
{
    static string CalculateKernel
    {
        get
        {
            return @"
            kernel void Calc(global int* offsets, global int* lengths, global double* doubles, double periodFactor) 
            {
                int id = get_global_id(0);
                int start = offsets[id];
                int length = lengths[id];
                int end = start + length;
                double sum = doubles[start];

                for(int i = start; i < end; i++)
                {
                    sum = sum + periodFactor * ( doubles[i] - sum );
                    doubles[i] = sum;
                }
            }";
        }
    }

    public static double[] Calculate(double[] num, int period)
    {
        var final = new double[num.Length];
        double sum = num[0];
        double coeff = 2.0 / (1.0 + period);

        for (int i = 0; i < num.Length; i++)
        {
            sum += coeff * (num[i] - sum);
            final[i] = sum;
        }

        return final;
    }


    static void Main(string[] args)
    {

        int maxElements = 10000;
        int numArrays = 10000;
        int computeCores = 2048;

        double[][] sets = new double[numArrays][];

        using (Timer("Generating Data"))
        {
            Random elementRand = new Random(1);
            for (int i = 0; i < numArrays; i++)
            {
                sets[i] = GetRandomDoubles(elementRand.Next((int)(maxElements * 0.9), maxElements), randomSeed: i);
            }
        }

        int period = 14;

        double[][] singleResults;
        using (Timer("CPU Single Thread"))
        {
            singleResults = CalculateCPU(sets, period);
        }

        double[][] parallelResults;
        using (Timer("CPU Parallel"))
        {
            parallelResults = CalculateCPUParallel(sets, period);
        }

        if (!AreTheSame(singleResults, parallelResults)) throw new Exception();

        double[][] gpuResults;
        using (Timer("Running GPU"))
        {
            gpuResults = CalculateGPU(computeCores, sets, period);
        }

        if (!AreTheSame(singleResults, gpuResults)) throw new Exception();


        Console.WriteLine("Finished");
        Console.ReadKey();
    }

    public static bool AreTheSame(double[][] a1, double[][] a2)
    {
        if (a1.Length != a2.Length) return false;
        for (int i = 0; i < a1.Length; i++)
        {
            var ar1 = a1[i];
            var ar2 = a2[i];
            if (ar1.Length != ar2.Length) return false;
            for (int j = 0; j < ar1.Length; j++)
                if (Math.Abs(ar1[j] - ar2[j]) > 0.0000001) return false;

        }
        return true;
    }

    public static double[][] CalculateGPU(int partitionSize, double[][] sets, int period)
    {
        ComputeContextPropertyList cpl = new ComputeContextPropertyList(ComputePlatform.Platforms[0]);
        ComputeContext context = new ComputeContext(ComputeDeviceTypes.Gpu, cpl, null, IntPtr.Zero);


        ComputeProgram program = new ComputeProgram(context, new string[] { CalculateKernel });
        program.Build(null, null, null, IntPtr.Zero);

        ComputeCommandQueue commands = new ComputeCommandQueue(context, context.Devices[0], ComputeCommandQueueFlags.None);

        ComputeEventList events = new ComputeEventList();

        ComputeKernel kernel = program.CreateKernel("Calc");


        double[][] results = new double[sets.Length][];

        double periodFactor = 2d / (1d + period);

        Stopwatch sendStopWatch = new Stopwatch();
        Stopwatch executeStopWatch = new Stopwatch();
        Stopwatch recieveStopWatch = new Stopwatch();


        int offset = 0;
        while (true)
        {
            int first = offset;
            int last = Math.Min(offset + partitionSize, sets.Length);
            int length = last - first;

            var merged = Merge(sets, first, length);

            sendStopWatch.Start();

            ComputeBuffer<int> offsetBuffer = new ComputeBuffer<int>(
                context,
                ComputeMemoryFlags.ReadWrite | ComputeMemoryFlags.UseHostPointer,
                merged.Offsets);

            ComputeBuffer<int> lengthsBuffer = new ComputeBuffer<int>(
                context,
                ComputeMemoryFlags.ReadWrite | ComputeMemoryFlags.UseHostPointer,
                merged.Lengths);

            ComputeBuffer<double> doublesBuffer = new ComputeBuffer<double>(
                context,
                ComputeMemoryFlags.ReadWrite | ComputeMemoryFlags.UseHostPointer,
                merged.Doubles);



            kernel.SetMemoryArgument(0, offsetBuffer);
            kernel.SetMemoryArgument(1, lengthsBuffer);
            kernel.SetMemoryArgument(2, doublesBuffer);
            kernel.SetValueArgument(3, periodFactor);

            sendStopWatch.Stop();

            executeStopWatch.Start();

            commands.Execute(kernel, null, new long[] { merged.Lengths.Length }, null, events);

            executeStopWatch.Stop();

            using (var pin = Pinned(merged.Doubles))
            {
                recieveStopWatch.Start();
                commands.Read(doublesBuffer, false, 0, merged.Doubles.Length, pin.Address, events);
                commands.Finish();
                recieveStopWatch.Stop();
            }

            for (int i = 0; i < merged.Lengths.Length; i++)
            {
                int len = merged.Lengths[i];
                int off = merged.Offsets[i];

                var res = new double[len];
                Array.Copy(merged.Doubles,off,res,0,len);

                results[first + i] = res;
            }


            offset += partitionSize;
            if (offset >= sets.Length) break;
        }

        // sendStopWatch times buffer setup (CPU->GPU); recieveStopWatch times the read-back (GPU->CPU).
        Console.WriteLine("GPU CPU->GPU: " + sendStopWatch.ElapsedMilliseconds + "ms");
        Console.WriteLine("GPU Execute: " + executeStopWatch.ElapsedMilliseconds + "ms");
        Console.WriteLine("GPU GPU->CPU: " + recieveStopWatch.ElapsedMilliseconds + "ms");


        return results;
    }

    public static PinnedHandle Pinned(object obj) => new PinnedHandle(obj);
    public class PinnedHandle : IDisposable
    {
        public IntPtr Address => handle.AddrOfPinnedObject();
        private GCHandle handle;
        public PinnedHandle(object val)
        {
            handle = GCHandle.Alloc(val, GCHandleType.Pinned);
        }
        public void Dispose()
        {
            handle.Free();
        }
    }

    public class MergedResults
    {
        public double[] Doubles { get; set; }
        public int[] Lengths { get; set; }
        public int[] Offsets { get; set; }
    }



    public static MergedResults Merge(double[][] sets, int offset, int length)
    {
        List<int> lengths = new List<int>(length);
        List<int> offsets = new List<int>(length);

        for (int i = 0; i < length; i++)
        {
            var arr = sets[i + offset];
            lengths.Add(arr.Length);
        }
        var totalLength = lengths.Sum();

        double[] doubles = new double[totalLength];
        int dataOffset = 0;
        for (int i = 0; i < length; i++)
        {
            var arr = sets[i + offset];
            Array.Copy(arr, 0, doubles, dataOffset, arr.Length);
            offsets.Add(dataOffset);
            dataOffset += arr.Length;
        }

        return new MergedResults()
        {
            Doubles = doubles,
            Lengths = lengths.ToArray(),
            Offsets = offsets.ToArray(),
        };
    }


    public static IDisposable Timer(string name)
    {
        return new SWTimer(name);
    }

    public class SWTimer : IDisposable
    {
        private Stopwatch _sw;
        private string _name;
        public SWTimer(string name)
        {
            _name = name;
            _sw = Stopwatch.StartNew();
        }
        public void Dispose()
        {
            _sw.Stop();
            Console.WriteLine("Task " + _name + ": " + _sw.Elapsed.TotalMilliseconds + "ms");
        }

    }

    public static double[][] CalculateCPU(double[][] arrays, int period)
    {
        double[][] results = new double[arrays.Length][];
        for (var index = 0; index < arrays.Length; index++)
        {
            var arr = arrays[index];
            results[index] = Calculate(arr, period);
        }
        return results;
    }

    public static double[][] CalculateCPUParallel(double[][] arrays, int period)
    {
        double[][] results = new double[arrays.Length][];
        Parallel.For(0, arrays.Length, i =>
         {
             var arr = arrays[i];
             results[i] = Calculate(arr, period);
         });
        return results;
    }


    static double[] GetRandomDoubles(int num, int randomSeed)
    {
        Random r = new Random(randomSeed);
        var res = new double[num];
        for (int i = 0; i < num; i++)
            res[i] = r.NextDouble() * 0.9 + 0.05;
        return res;
    }
}

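【Discussion】:

This takes hours, because it runs the calculation 2 million times on 100k values, with different values each time. I have profiled the application and the bottleneck is this function; it takes 60% of the time.
For those 2M runs, are all the arrays the same length? What is the data source for the arrays - can you batch 1000 at a time?
The arrays are not equal in length, because it takes several days of data and calculates daily prices using different time frames (5 minutes, 30 minutes, 1 hour).
I have made this function run on multiple CPU threads using Tasks, and on my i7 6700K CPU it runs more than 5 times faster than on a single thread. You did something wrong; on a single thread it cannot be faster than multiple threads filling all CPU cores.
It's not faster on a single thread - it's 5 times faster on multiple threads. 5 seconds to process single, 1 second to process multi...

【Solution 3】: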

    for (int i = 0; i < num.Length; i++)
    {
        sum += coeff * (num[i] - sum);
        final[i] = sum;
    }

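can be rewritten as sum = (1 - coeff) * sum + coeff * num[i]. This means every output is a weighted sum of all earlier inputs: the current element is multiplied by coeff, the previous one by coeff * (1 - coeff), the one before that by coeff * (1 - coeff) squared, and so on. Because sum starts at num[0], the first element also contributes a term with (1 - coeff) to the power i + 1.

Writing c for coeff, d for 1 - c, e for the inputs and f for the outputs, it goes like this:

f0 = c*e0 + d*e0
f1 = c*e1 + c*d*e0 + d*d*e0
f2 = c*e2 + c*d*e1 + c*d*d*e0 + d*d*d*e0
f3 = c*e3 + c*d*e2 + c*d*d*e1 + c*d*d*d*e0 + d*d*d*d*e0

For every element, scan all elements with a smaller or equal id and compute:

Take the difference between the two ids (call it k), multiply the earlier value by coeff times (1 - coeff) to the power k, and add it to a running total. Finally, add num[0] multiplied by (1 - coeff) to the power i + 1 to account for the initial value of sum. The total is final[i].

This is O(N*N) and looks like an all-pairs compute kernel. An example using an open-source C# OpenCL project:

ClNumberCruncher cruncher = new ClNumberCruncher(ClPlatforms.all().gpus(), @"
    __kernel void foo(__global double * num, __global double * final, __global int *parameters)
    {
        int threadId           = get_global_id(0);
        int period             = parameters[0];
        double coeff           = 2.0 / (1.0 + period);
        double decay           = 1.0 - coeff;
        // contribution of the initial value sum = num[0]
        double sumOfElements   = pown(decay, threadId + 1) * num[0];
        for(int i=0;i<=threadId;i++)
        {
            // element i is weighted by coeff * decay^(distance to the current element)
            double powKofDecay =  pown(decay, threadId - i);
            sumOfElements     +=  coeff * powKofDecay * num[i];
        }
        final[threadId]        =  sumOfElements;
    }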
");
cruncher.performanceFeed = true; // getting benchmark feedback on console
double[] numArray = new double[10000];
double[] finalArray = new double[10000];
int[] parameters = new int[10];
int period = 15;
parameters[0] = period;
ClArray<double> numGpuArray = numArray;
numGpuArray.readOnly = true; // gpus read this from host
ClArray<double> finalGpuArray = finalArray; // finalArray will have results
finalGpuArray.writeOnly = true; // gpus write this to host
ClArray<int> parametersGpu = parameters;
parametersGpu.readOnly = true;

// calculate kernels with exact same ordering of parameters
// num(double),final(double),parameters(int)
// finalGpuArray points to __global double * final
numGpuArray.nextParam(finalGpuArray, parametersGpu).compute(cruncher, 1, "foo", 10000, 100);

// first compute always lags because of compiling the kernel so here are repeated computes to get actual performance
numGpuArray.nextParam(finalGpuArray, parametersGpu).compute(cruncher, 1, "foo", 10000, 100);
numGpuArray.nextParam(finalGpuArray, parametersGpu).compute(cruncher, 1, "foo", 10000, 100);

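The results are in the finalArray array, which has 10000 elements, using 100 work items per work group.

The GPGPU part takes 82 ms on an RX550 GPU, which has a very low ratio of 64-bit to 32-bit compute performance (because consumer gaming cards in the newer series are not good at double precision). An Nvidia Tesla or AMD Vega would compute this kernel easily without that performance penalty. An FX8150 (8 cores) completes it in 683 ms. If you specifically need to select only an integrated GPU and its CPU, you can use

ClPlatforms.all().gpus().devicesWithHostMemorySharing() + ClPlatforms.all().cpus() when creating the ClNumberCruncher instance.

Binaries of the API: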

https://www.codeproject.com/Articles/1181213/Easy-OpenCL-Multiple-Device-Load-Balancing-and-Pip

or the source code, to compile on your own PC:

https://github.com/tugrul512bit/Cekirdekler

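If you have multiple GPUs, it uses them without any extra code. Including a CPU in the computation decreases GPU efficiency in the first iteration for this example (repeated runs complete in 76 ms with CPU+GPU), so it is better to use 2-3 GPUs than CPU+GPU.

I did not check numerical stability (you should use Kahan summation when adding millions of values or more into the same variable, but I left it out here for readability, and I also don't know whether 64-bit values need it as much as 32-bit ones) or the correctness of any values; you should do that. The foo kernel is also not optimized. It leaves 50% of core time idle, so the work should be scheduled better, like this: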

thread-0: compute element 0 and element N-1
thread-1: compute element 1 and element N-2
thread-m: compute element N/2-1 and element N/2

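so that all work items get a similar amount of work. On top of this, using 100 as the work group size is not optimal. It should be something like 128, 256, 512 or 1024 (for Nvidia), but that means the array size should also be an integer multiple of it. The kernel then needs extra control logic to avoid going out of array bounds. For even more performance, the for loop could use multiple partial sums to do "loop unrolling".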
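The scheduling and padding ideas above can be checked on the CPU before touching the kernel. The sketch below is only an illustration: WorkSchedule and its methods are made-up names, and it assumes the cost of one output element is its index plus one, since the kernel's loop length grows linearly with the element index.

```csharp
using System;

static class WorkSchedule
{
    // Assumed cost model: element e needs about e + 1 units of work.
    static long Cost(int element) => element + 1L;

    // Naive schedule: work item t computes element t only.
    public static long[] NaiveCosts(int n)
    {
        var costs = new long[n];
        for (int t = 0; t < n; t++)
            costs[t] = Cost(t);
        return costs;
    }

    // Paired schedule: work item t computes elements t and n - 1 - t,
    // so a cheap element is always combined with an expensive one.
    public static long[] PairedCosts(int n)
    {
        int items = (n + 1) / 2;               // the middle element stays alone when n is odd
        var costs = new long[items];
        for (int t = 0; t < items; t++)
        {
            int other = n - 1 - t;
            costs[t] = Cost(t) + (other != t ? Cost(other) : 0);
        }
        return costs;
    }

    // Smallest multiple of groupSize that is >= n, usable as the global work size.
    public static int RoundUpToMultiple(int n, int groupSize)
    {
        return (n + groupSize - 1) / groupSize * groupSize;
    }
}
```

With 10000 elements the naive costs range from 1 to 10000, while every paired work item costs 10001. For a work group size of 128 the padded global size is 10112, so the kernel would need a guard such as if (id >= n) return; for the 112 extra work items.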

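【Discussion】:

Performance on 1 GPU core (assigned to the last element, writing the others along the way) or on 10000 GPU cores is the same with what you wrote... it is just waiting for the last element to finish on a single core. There is no point in parallelizing it this way...
You are right about the dependency in the computation. A single CPU core would finish in 10000 cycles, which should take no more than a millisecond. This is just a parallel version. I also pointed out the O(N*N) in the answer. But again, this is a parallel version and every final element is computed independently, so if he asks for the Jth element, he does not need any other output.
The CPU is idle while the GPU computes, so the CPU can work on other tasks in the meantime, which means more work gets done but of course wastes electricity.
Thanks for your answer, it is very detailed. It is hard to choose the best answer between yours and MineR's.
I tried to find a way to do a "reduction" but couldn't. Even an O(Log(N)*N) reduction would still be slower than the simple single-core O(N) addition, so I just made an embarrassingly parallel version with some scalability, nothing more :)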