C++: Aligning 5-byte structs to a cache line

【Question title】: C++ Align 5 byte structs to cacheline
【Posted】: 2019-07-14 17:50:29
【Question】:

I'm developing a CPU-FPGA co-processing framework, so I need complete control over the alignment of my data. I have a data structure that only requires 5 bytes:

typedef struct __attribute__ ((packed))  {
    uint32_t dst;
    uint8_t weight;
} edg_t;

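My FPGA interface can read 1 cache line (64 bytes) per cycle (200 million reads/second). Cramming as many elements as possible into one cache line is crucial for my performance, so padding the struct is out of the question.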

5 bytes: 12 elements/read
8 bytes: 8 elements/read (padded)
Padding -> 1.5x performance reduction
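
The packing arithmetic above can be checked at compile time. This is a minimal sketch; `kCacheLine`, `elems_per_line` and `wasted_bytes` are illustrative names and do not appear in the original post.

```cpp
#include <cstddef>

// Illustrative names, not taken from the original post.
constexpr std::size_t kCacheLine = 64;  // bytes delivered per FPGA read

// Whole elements of a given size that fit in one cache line
// without any element straddling the boundary.
constexpr std::size_t elems_per_line(std::size_t elem_size) {
    return kCacheLine / elem_size;
}

// Bytes left unused at the end of each cache line.
constexpr std::size_t wasted_bytes(std::size_t elem_size) {
    return kCacheLine - elems_per_line(elem_size) * elem_size;
}

static_assert(elems_per_line(5) == 12, "packed: 12 elements per read");
static_assert(elems_per_line(8) == 8, "padded: 8 elements per read");
static_assert(wasted_bytes(5) == 4, "4 trailing bytes to skip per line");
// 12 / 8 == 1.5, written without floating point.
static_assert(elems_per_line(5) * 2 == elems_per_line(8) * 3, "1.5x ratio");
```

The 4 leftover bytes per line are what the buffer-filling loop has to skip.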

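However, I can't have structs straddle the boundary between cache lines, since that would require me to build logic on the FPGA to constantly shift the data that was read.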

My current solution when building the buffer looks like this:

int num_elements = 1000;
int num_cachelines = num_elements / 12 + 1;

uint8_t* buffer = new uint8_t[num_cachelines * 64];
uint8_t* buf_ptr = buffer; // start at the buffer itself; buffer - 4 would point outside the array

for (int i = 0; i < num_elements; i++) {
    if (i != 0 && i % 12 == 0) buf_ptr += 4; // skip the last 4 bytes of each cache line

    edg_t* edg_ptr = (edg_t*) buf_ptr;
    edg_ptr->dst = i; // example, I have random generators here
    edg_ptr->weight = i % 256;
    buf_ptr += sizeof(edg_t); // advance by one 5-byte element
}

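This was fine while the FPGA did all the work on its own, but now I would like the FPGA and the CPU to cooperate. This means the CPU now has to read the buffer too.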

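I was wondering whether there is a better way to let the compiler handle the padding automatically, or whether I have to skip the bytes manually every time, as I did in the buffer creation code above.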

【Comments】:

The memory you allocate for buffer is not guaranteed to be 64-byte aligned unless you provide a custom ::operator new that does so. That may affect what you are doing.
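
One way to address the alignment concern in this comment, assuming a C++17 compiler, is the aligned form of `operator new[]`. The sketch below is not from the original thread; `kCacheLine`, `AlignedDelete`, `CachelineBuffer` and `make_cacheline_buffer` are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <memory>
#include <new>

// Hypothetical names for illustration; requires C++17 aligned new/delete.
constexpr std::size_t kCacheLine = 64;

// Deleter that pairs with the aligned operator new[] used below.
struct AlignedDelete {
    void operator()(std::uint8_t* p) const noexcept {
        ::operator delete[](p, std::align_val_t{kCacheLine});
    }
};

using CachelineBuffer = std::unique_ptr<std::uint8_t[], AlignedDelete>;

// Allocate num_cachelines * 64 zeroed bytes starting on a 64-byte boundary.
inline CachelineBuffer make_cacheline_buffer(std::size_t num_cachelines) {
    const std::size_t bytes = num_cachelines * kCacheLine;
    void* raw = ::operator new[](bytes, std::align_val_t{kCacheLine});
    auto* p = static_cast<std::uint8_t*>(raw);
    std::memset(p, 0, bytes);
    // Sanity check: the first byte sits on a cache-line boundary.
    assert(reinterpret_cast<std::uintptr_t>(p) % kCacheLine == 0);
    return CachelineBuffer(p);
}
```

A call such as `auto buf = make_cacheline_buffer(84);` would then stand in for the plain `new uint8_t[...]`, with `buf.get()` as the address handed to the FPGA. Whether the FPGA interface needs additional guarantees (for example pinned or physically contiguous memory) is outside what this sketch covers.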

Tags: c++ memory data-structures alignment

【Solution 1】:

I assume you'll create this buffer structure once and then populate it over and over for the FPGA to read (or vice versa). If so, this layout should work:

#include <array>
#include <cstddef>
#include <cstdint>
#include <iostream>

constexpr size_t cacheline_size = 64;
constexpr size_t num_elements = 1000;

struct __attribute__ ((packed)) edg_t  {
    /*volatile*/ uint32_t dst;   // volatile if the FPGA writes too
    /*volatile*/ uint8_t weight;
};

constexpr size_t elements_per_cachline = cacheline_size/sizeof(edg_t);
constexpr size_t num_cachelines = num_elements / elements_per_cachline + 1;

struct alignas(cacheline_size) cacheline_t {
    std::array<edg_t, elements_per_cachline> edg;
    inline auto begin() { return edg.begin(); }
    inline auto end() { return edg.end(); }
};

struct cacheline_collection_t {
    std::array<cacheline_t, num_cachelines> cl;
    inline void* address_for_fpga() { return this; }
    inline auto begin() { return cl.begin(); }
    inline auto end() { return cl.end(); }
};

int main() {
    cacheline_collection_t clc{}; // value-initialized so the reads below are well-defined
    std::cout << "edg_t                 : "
       << alignof(edg_t) << " " << sizeof(clc.cl[0].edg[0]) << "\n";
    std::cout << "cacheline_t           : "
       << alignof(cacheline_t) << " " << sizeof(clc.cl[0]) << "\n";
    std::cout << "cacheline_collection_t: "
       << alignof(cacheline_collection_t) << " " << sizeof(clc) << "\n";

    // access
    for(auto& cl : clc) {
        for(auto& edg : cl) {
            std::cout << edg.dst << " " << (unsigned)edg.weight << "\n";
        }
    }
}

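The assembly on godbolt looks good. The inner loop has been completely unrolled into 12 code blocks, with the rax offset increasing by 5 for each block. It then moves on to the next cache line (conditionally) in 3 instructions: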

    add     rax, 64
    cmp     rax, rcx
    jne     .LBB0_1
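
If the CPU side wants to address edges by a flat index instead of nested loops, the cache line and slot can be derived by division and remainder. The following is a sketch that repeats the layout from this answer so it is self-contained; `edge_at` is a hypothetical helper that is not part of the original answer, and the `static_assert`s spell out layout assumptions that hold with the GCC/Clang `packed` attribute.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t cacheline_size = 64;

struct __attribute__((packed)) edg_t {
    std::uint32_t dst;
    std::uint8_t weight;
};

constexpr std::size_t elements_per_cachline = cacheline_size / sizeof(edg_t);

struct alignas(cacheline_size) cacheline_t {
    std::array<edg_t, elements_per_cachline> edg;
};

// Layout assumptions the FPGA side relies on.
static_assert(sizeof(edg_t) == 5, "edg_t must be packed to 5 bytes");
static_assert(elements_per_cachline == 12, "12 edges per cache line");
static_assert(sizeof(cacheline_t) == cacheline_size, "one struct per cache line");
static_assert(alignof(cacheline_t) == cacheline_size, "lines start on 64-byte boundaries");

// Hypothetical helper: map a flat edge index to its (cache line, slot) position.
template <std::size_t N>
edg_t& edge_at(std::array<cacheline_t, N>& lines, std::size_t i) {
    assert(i < N * elements_per_cachline);
    return lines[i / elements_per_cachline].edg[i % elements_per_cachline];
}
```

With this, `edge_at(clc.cl, 13)` would refer to slot 1 of the second cache line, 64 + 5 bytes from the start of the buffer, and the 4 padding bytes at the end of each line never need to be skipped by hand.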
