改进感知器神经网络的 OpenCL 内核答案

【问题标题】：Improving an OpenCL kernel for a Perceptron neural network改进感知器神经网络的 OpenCL 内核
【发布时间】：2014-03-06 06:43:15
【问题描述】：

我之前做过很多 OpenGL 和着色器，现在，我决定尝试一下 OpenCL。我看了一些在线教程，并开始阅读有关该主题的书籍。为了更好地理解，并且因为我相信最好的学习方式是明智地尝试并从这样做时出现的问题中学习，我决定开始为全连接感知器实现内核。

对于那些不知道那是什么的人，我将解释基本概念。它是一个神经网络，其中一层的每个神经元都连接到下一层的每个神经元。每个神经元只有一个动作要执行：执行上一层所有神经元的总和，每个神经元按不同的值加权。

这看起来很容易实现，在阅读了论文“Parallel Neural Network Training with OpenCL”后，我通过以下方式实现了它

每一层都依赖于前一层，它们由主机按顺序运行
为了计算一个层，我使用层内神经元数量的全局工作大小（可能非常大，例如数万个）运行我的内核。这使得所有神经元都相互独立地执行求和。
每个神经元（由其 global_work_id 标识）与前一层的所有神经元进行加权求和。

这是我功能齐全的 opencl 内核：

/**
* @brief Computes one layer of the perceptron given the previous one and the
* weights
* The kernel is run once for each layer.
* The work items are each tasked with computing the output of a single neuron
* of the out layer.
*
* @param out_layer_size
*   Size of the output layer (number of elements in the output array that will
*   contain the result for each neuron).
* @param in_layer_size
*   Number of elements of the input layer
* @param in_value
*   Values of the neuron in the previous layer
* @param in_weights
*   Array containing the weights for each input neuron. It is organised as a
*   two dimensional matrix, written by concatenating each line in the array
*   [ w11, w12, w13, ...
*     w21, w22, w23, ...
*     ..., ..., ..., ...
*   ]
*   Where wij is the weight linking the neuron i of the input layer to the
*   neuron j of the output layer
* @param out_values
*   Computed values for the current layer
*/
void kernel perceptron(global const int* in_layer_size, global const int* out_layer_size, global const float *in_value, global const float* in_weights, global float* out_values)
{
    private const int global_id = get_global_id(0);
    private const int out_layer_s = *out_layer_size;
    private const int in_layer_s = *in_layer_size;
    private const int offset = out_layer_s * global_id;

    private float sum = 0.;
    for(int i=0; i < in_layer_s; i++) {
        sum += in_weights[i*out_layer_s+global_id] * in_value[i];
    }
    //out_values[global_id] = sigma(sum);
    out_values[global_id] = sum;
}

这是我调用它的方式：

queue.enqueueNDRangeKernel(kernel, cl::NullRange,cl::NDRange(number of neurons within layer),cl::NullRange);

我意识到这个内核的瓶颈是加权和的实现。如果有人能解释我如何改进它以使其更快，那将非常有帮助。

我可能没有正确使用不同的内存区域，我主要考虑的是我什至不使用的本地内存。

只是为了让您了解性能（即在 Nvidia GTX 660M 上），我将向您展示我取得的一些成绩。每个值是每层的神经元数量：

2500、10 000、2500：0.018s ~ 60FPS。它比我的处理器（运行在 2.40GHz 的 Intel Core i7）上快 4 到 5 倍
100 000, 100 000, 500: 140s -> 我想这并不奇怪，因为第二层中的每个神经元都必须执行 100 000 个元素的加权和。在我的处理器上运行它会产生大致相同的结果。

【问题讨论】：

您是在寻找针对 100k、100k、500 情况的优化，还是一般的性能提升？第一种情况（2500,10k,1500），第二种情况，还是其他一些输入大小范围，哪一种更常见？
这个问题比较笼统。我想第一种情况要常见得多。很少需要比这更多的神经元。这个想法更多是为了了解如何改进内核本身，也许可以更好地利用内存，优化循环......

标签： performance opencl

【解决方案1】：

您可以通过在本地内存中缓存 in_values 来进行重大改进。从全局内存中读取 in_values 的每个元素的次数越少越好。

我想出了一个解决方案，它可以缓存最大数量的输入值，并且每个工作组只从全局内存中读取每个元素一次。这是通过一次复制一个 in_values 块，针对所有 out_values 处理它，然后移动到下一个块来完成的。还有一个本地浮点数组用于减少每个块的工作项的总和。

伪代码：

  output elements assumed to be set to 0 already
  for each block of input values:
    cache the input block
    for each target output value:
      reset local sum to 0
      for each element this work item is responsible for:
        read the weight, multiply, and add to sum
      reduce sums to a single value, ADD value to output element

我还没有机会通过分析器或调试器运行它，但当我回到我的家用电脑时我会尝试一下。（我的办公室工作站上没有 opencl 工具）。确保将组大小等于 GROUP_SIZE 常量的内核排队。此外，您的设备上的每个计算单元只能创建一个组。

真实代码：

//experiment with GROUP_SIZE to discover the optimal value for your device
//this needs to be equal to local_work_size passed into clEnqueueNDRangeKernel
//use a multiple of CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE
//max. for most devices is 256
#define GROUP_SIZE = 64;

// IN_VALUE_CACHE_SIZE is the number of floats from in_value to copy to local memory at a time
//assuming GROUP_SIZE can be up to 256, sizeof(float)=4,  and local memory size is 32kb, full saturation can be achieved with the following:
//(32768 - (256 * 4)) /4 = 7936
//try another multiple of 1024 (6144, 4096... )if there is trouble with this value
#define IN_VALUE_CACHE_SIZE = 7936;

void kernel perceptron(global const int* in_layer_size, global const int* out_layer_size, global const float *in_value, global const float* in_weights, global float* out_values)
{
    private const int global_id = get_global_id(0);
    private const int out_layer_s = *out_layer_size;
    private const int in_layer_s = *in_layer_size;
    private const int offset = out_layer_s * global_id;

    private const int item_id = get_local_id(0);    
    private const int group_id = get_group_id(0);   
    private const int group_count = get_num_groups(0);  


    local float result_buffer[GROUP_SIZE];

    local float in_value_cache[IN_VALUE_CACHE_SIZE];
    int i,j,k;

    //init the block to 0, in case there are fewer than IN_VALUE_CACHE_SIZE values in total
    for(i=item_id; i<IN_VALUE_CACHE_SIZE; i+= GROUP_SIZE){
        in_value_cache[i] = 0.0;
    }
    barrier(CL_LOCAL_MEM_FENCE);


    private float sum = 0.0;
    event_t e;
    int copy_total = 0;
    int copy_offset;

    for(i=0; i<in_layer_s; i+=IN_VALUE_CACHE_SIZE){
        //cap the number of values to copy to local memory if loop is near the end of the input data
        copy_total = IN_VALUE_CACHE_SIZE;
        if((copy_total + i*IN_VALUE_CACHE_SIZE) > in_layer_s){
            copy_total = in_layer_s - i*IN_VALUE_CACHE_SIZE;
        }           
        //copy the next block of values
        e = async_work_group_copy(in_value_cache, in_value + i * 4, copy_total, 0);
        wait_group_events(1, &e);

        for(j=group_id; j<out_layer_s; j+=group_count){
            sum = 0.0;

            //need to reset result_buffer[item_id] as well
            //this is in case there are fewer than GROUP_SIZE input values remaining  ie copy_total < GROUP_SIZE
            result_buffer[item_id] = 0.0;

            for(k=item_id; k<copy_total; k+=GROUP_SIZE){
                sum += in_value_cache[k] * in_weights[(k+i) + j * out_layer_s];
            }
            result_buffer[item_id] = sum;

            //simple O(n) reduction can be optimized further
            if(item_id == 0){
                for(k=1;k<GROUP_SIZE;k++){
                    sum += result_buffer[k];
                }
                out_values[j] += sum;
            }
            barrier(CL_LOCAL_MEM_FENCE);
        }

    }
}

这将处理任何大小的输入，因此您可以尝试使用与全局内存一样多的元素。

【讨论】：

【解决方案2】：

您可以通过多种方式做到这一点。但是，在不改变内核行为方式的情况下，最通用的方法是重用您的工作组大小（无论您选择什么，或默认）并重用组中的内存访问。

我会建议这样的事情：

注意：我删除了单个值的丑陋指针。 OpenCL 支持这一点，而且要容易得多。不需要创建内存区，直接做clSetKernelArg(kernel, arg_index, sizeof(cl_float), &size);cl_float size = the_size;即可。

#define IN_LOCAL_SIZE 4096 //Because 16KB/4B (for each float)

void kernel perceptron(global const int in_layer_size, global const int out_layer_size, global const float *in_value, global const float* in_weights, global float* out_values)
{
    const int global_id = get_global_id(0);
    __local float in_buffer[IN_LOCAL_SIZE];

    float sum = 0.0f;
    event_t ev;
    int j;
    //For each full buffer
    for(j=0; j < (in_layer_size/IN_LOCAL_SIZE)-1; i++) {
        ev = async_work_group_copy(in_buffer, in_value+j*IN_LOCAL_SIZE, IN_LOCAL_SIZE, ev);
        wait_group_events(1,&ev);
        barrier(CLK_LOCAL_MEM_FENCE);
        for(int i=0; i < IN_LOCAL_SIZE; i++) {
            sum += in_weights[(i+j*IN_LOCAL_SIZE)*out_layer_size+global_id] * in_buffer[i];
        }
    }
    //Last one
    ev = async_work_group_copy(in_buffer, in_value+j*IN_LOCAL_SIZE, in_layer_size%IN_LOCAL_SIZE, ev);
    wait_group_events(1,&ev);
    barrier(CLK_LOCAL_MEM_FENCE);
    for(int i=0; i < in_layer_size%IN_LOCAL_SIZE; i++) {
        sum += in_weights[(i+j*IN_LOCAL_SIZE)*out_layer_size+global_id] * in_buffer[i];
    }
    out_values[global_id] = sum;
}

但是，如果输出大小很小（100k、250k、500），那么您将只有 500 个工作项，这不是最优的。在这种情况下，您应该重塑算法。

一种可能的方法是，每个工作项都在内层工作，执行求和，整个工作组从所有工作项中创建一个输出。这很容易，因为您可以轻松控制工作组内的总和。

但也许其他方法更适合您的问题。

【讨论】：

在一般情况下，您不能确定您的设备是否可以执行具有固定本地组大小的特定内核。因此，在代码中硬编码本地 WG 大小是 IMO，除了小型本地组之外，这不是最佳实践
我没有修复本地组的大小。我正在修复内核使用的本地内存。至少为 16KB。我正在正确回答 OP 问题，因为他使用的是默认工作尺寸。

【解决方案3】：

正如您所说，瓶颈是加权和。这并不难，因为在每一层，与算术运算的数量相比，每个 WI（工作项）都在进行大量的 IO 操作。我在神经网络方面没有经验，但对我来说，问题看起来像是 GPU 上糟糕的内存访问模式。

这可以通过将您的 WI 组织到本地 WG（工作组）来解决。因为每个 WI 都需要处理 prev 中的所有数据。层，我猜WG中的所有WI都可以将一些数据加载到本地内存中，处理它们而不是下一组数据。这将使您的算法对缓存更加友好。内核伪代码如下：

void kernel Kernel(
__global const int  in_layer_size, 
__global const int  out_layer_size, 
__global const float    *in_value, 
__global const float    *in_weights, 
__global float      *out_values){

__local float buffer[SOME_SIZE];
__global const float* p_in  = in_value;
__global float* p_out = out_values;

const int 
    global_id   = get_global_id(0),
    local_id    = get_local_id(0),
    num_buffers = in_layer_size / SOME_SIZE,
    offset      = out_layer_size * global_id;

float sum = 0.0f;
for(int i=0; i < num_buffers; i++){
    buffer[local_id] = p_in[local_id];
    barrier(CLK_LOCAL_MEM_FENCE);

    //Process all data inside buffer by every WI in WG
    //...

    p_in += SOME_SIZE;
    out_values += SOME_SIZE;
    }

//...
return;

}

因此，您正在使用固定大小的窗口滑动并在其中计算数据，然后转到下一个窗口。所有数据操作都是独立完成的，工作项仅在同一时间使用相同的数据。本地组的最佳大小取决于设备和内核。

【讨论】：