Why is memset slow? Answers

【Question title】: Why is memset slow?
【Posted】: 2014-06-15 23:01:06
【Question description】:

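My CPU's spec sheet says it should get 5.336GB/s of memory bandwidth. To test this, I wrote a simple program that runs memset (or memcpy) over a large array and reports the time. I'm seeing 3.8GB/s on memset and 1.9GB/s on memcpy. http://en.wikipedia.org/wiki/Intel_Core_(microarchitecture) says my Q9400 should reach 5.336GB/s. What's wrong?

I tried replacing memset or memcpy with an assignment loop. I've searched around trying to understand memory alignment. I've tried different compiler flags. I've spent an embarrassing number of hours on this. Thanks for any help you can offer!

I'm using Ubuntu 12.04 with libc-dev version 2.15-0ubuntu10.5 and kernel 3.8.0-37-generic.

Code: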

#include <stdio.h>
#include <time.h>
#include <string.h>
#include <stdlib.h>

#define numBytes ((long)(1024*1024*1024))
#define numTransfers ((long)(8))

int main(int argc,char**argv){
    if(argc!=3){
        printf("Usage: %s BLOCK_SIZE_IN_BYTES NUMBER_OF_BLOCKS_TO_TRANSFER\n",argv[0]);
        return -1;
    }
    char*__restrict__ source=(char*)malloc(numBytes);
    char*__restrict__ dest=(char*)malloc(numBytes);
    struct timespec start,end;
    long totalTimeMs;
    int i;

    clock_gettime(CLOCK_MONOTONIC_RAW,&start);
    for(i=0;i<numTransfers;++i)
        memset(source,0,numBytes);
    clock_gettime(CLOCK_MONOTONIC_RAW,&end);
    totalTimeMs=(end.tv_nsec-start.tv_nsec)*.000001+1000*(end.tv_sec-start.tv_sec);
    printf("memset %ld bytes %ld times (%.2fGB total) in %ldms (%.3fGB/s). ",numBytes,numTransfers,numBytes/1024.0/1024/1024*numTransfers,totalTimeMs,numBytes/1024.0/1024/1024*1000*numTransfers/totalTimeMs);

    clock_gettime(CLOCK_MONOTONIC_RAW,&start);
    for(i=0;i<numTransfers;++i)
        memcpy( dest, source, numBytes);
    clock_gettime(CLOCK_MONOTONIC_RAW,&end);
    totalTimeMs=(end.tv_nsec-start.tv_nsec)*.000001+1000*(end.tv_sec-start.tv_sec);
    printf("memcpy %ld bytes %ld times (%.2fGB total) in %ldms (%.3fGB/s).\n",numBytes,numTransfers,numBytes/1024.0/1024/1024*numTransfers,totalTimeMs,numBytes/1024.0/1024/1024*1000*numTransfers/totalTimeMs);

    free(source);
    free(dest);

    return EXIT_SUCCESS;
}

Compile commands:

gcc -O3 -DNDEBUG -o memcpyStackOverflowNoParameters.c.o -c memcpyStackOverflowNoParameters.c
gcc -O3 -DNDEBUG memcpyStackOverflowNoParameters.c.o -o memcpy -rdynamic -lrt

Example output:

memset 1073741824 bytes 8 times (8.00GB total) in 2214ms (3.880GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4466ms (1.923GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2218ms (3.873GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4557ms (1.885GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2222ms (3.866GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4433ms (1.938GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2216ms (3.876GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4521ms (1.900GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2217ms (3.875GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4520ms (1.900GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2218ms (3.873GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4430ms (1.939GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2226ms (3.859GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4444ms (1.933GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2225ms (3.861GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4485ms (1.915GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2620ms (3.279GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4855ms (1.769GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2535ms (3.389GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4870ms (1.764GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2423ms (3.545GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4905ms (1.751GB/s).

My hardware, according to lshw:

  product: OptiPlex 960 ()
  vendor: Winbond Electronics
  width: 64 bits
*-core
     description: Motherboard
     product: 0Y958C
     vendor: Winbond Electronics
   *-firmware
        description: BIOS
        capabilities: pci pnp apm upgrade shadowing escd cdboot bootselect edd int13floppytoshiba int13floppy720 int5printscreen int9keyboard int14serial int17printer acpi usb biosbootspecification netboot
   *-cpu
        product: Intel(R) Core(TM)2 Quad CPU    Q9400  @ 2.66GHz
        physical id: 400
        size: 2666MHz
        width: 64 bits
        clock: 1333MHz
        capabilities: x86-64 fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 xsave lahf_lm dtherm tpr_shadow vnmi flexpriority
        configuration: cores=4 enabledcores=4 threads=4
      *-cache:0
           description: L1 cache
           physical id: 700
           size: 256KiB
           capacity: 256KiB
           capabilities: internal write-back unified
      *-cache:1
           description: L2 cache
           physical id: 701
           size: 6MiB
           capacity: 6MiB
           capabilities: internal varies unified
   *-memory
        description: System Memory
        physical id: 1000
        slot: System board or motherboard
        size: 4GiB
      *-bank:0
           description: DIMM DDR2 Synchronous 667 MHz (1.5 ns)
           product: CT51264AA667.M16FC
           vendor: 7F7F7F7F7F9B0000
           slot: DIMM_1
           size: 4GiB
           width: 64 bits
           clock: 667MHz (1.5ns)
      *-bank:1
           description: DIMM DDR2 Synchronous 667 MHz (1.5 ns) [empty]
      *-bank:2
           description: DIMM DDR2 Synchronous 667 MHz (1.5 ns) [empty]
      *-bank:3
           description: DIMM DDR2 Synchronous 667 MHz (1.5 ns) [empty]

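【Question discussion】:

Jeff Guy, what is your memcpy? Is it the default one in your Linux? We need the exact version of your glibc (or eglibc) library and the exact version of your OS distribution. Where does the "32GB/s" memory bandwidth spec come from? That sounds too high for a Q9400, which probably has at most two DDR2-667 channels, each with a theoretical peak of 5.3 GBytes/s. In practice, even with faster DDR3, it only reaches 7.4 GB/s: legitreviews.com/intel-core-2-quad-q9400-processor-review_939/3
DDR2 at 667 MHz will never give you more than you measured. Nowhere near 32 GB/s in any case.
You're right. I'm running this test on two systems: a high-end server and my desktop. The server should reach 32GB/s (it has two E5645 Xeons ark.intel.com/products/48768/…), but my desktop should only get 5.336GB/s. I apologize for the typo. I'm not sure what kind of memory the server has; you may be right that the memory is the bottleneck. I'll edit the question to fix the typo.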

Tags: optimization memcpy memset memory-bandwidth

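【Solution 1】:

Memory addresses are "virtualized": the addresses a program uses are translated into real addresses. This translation lets the system build what your program sees as contiguous memory out of whatever physical parts are convenient at the time. Every general-purpose CPU does this. The translation takes table lookups, which take memory accesses. The CPU has a cache for them, the "TLB" ("translation lookaside buffer"), but a long run of virtual addresses easily blows that cache. So every 4KB (or every 2MB, on Linux systems where huge pages are in use) the CPU stalls while it finds out where the memory traffic should actually go. Those stalls can take quite a while. You could try running two copies of your benchmark at once; if TLB misses are the cause, which seems plausible, you will get aggregate bandwidth closer to the rated capacity.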

(Edit: hmm, you probably want to replace your #defines with

size_t numBytes=atoi(argv[1]);
size_t numTransfers=atoi(argv[2]);

in the body of main...)

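Edit: By the way, the bandwidth I saw from this test on my own box (reported in the comments) was so far below my CPU's rated capacity that I investigated my own system. The manufacturer of my box had put really junky memory in those slots. I have since replaced it with a well-known brand; the reported throughput more than doubled, and the machine's performance improved very noticeably.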

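【Discussion】:

TLB stalls are short. What is long is page faults. When you malloc memory, the system does not back the allocation with physical pages. It only sets up a mapping from the malloced virtual space to a special "empty" (zero) page. When the program tries to write anything to such a page, a page fault occurs. The page fault handler allocates a real physical page, zeroes it, and returns (a slow process, much longer than a TLB miss).
Jeff Guy, so to get good numbers, add a memset loop over all of the memory before the measurement. After that loop all of the memory will be pre-faulted, and no page faults will occur during the measurement.
@osgx On my 3570k with DDR3/1333 on Linux 3.14.1, running 8 laps over 128MB gives 7.35GB/s without pre-zeroing and 7.87GB/s with pre-zeroing. Four runs in parallel give an aggregate of 8.48GB/s without pre-zeroing and 8.79GB/s with it. So the aggregate gain from pre-zeroing is about 0.5GB/s, and the gain from filling the TLB-miss shadow is about 1GB/s.
Exactly right! Now I'm getting 5.1GB/s, which is close enough to the Q9400's 5.336GB/s spec. Thanks for such a clear explanation!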

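【Solution 2】:

Last time I checked, memcpy and memset were not optimized in GCC. That was still true in 2012. See section 2.6, "Choice of function libraries", and table 2.1 in Agner Fog's Optimizing software in C++. He compares several different compilers and operating systems.

GCC has built-in functions for memcpy. Apparently they are even worse than the memcpy in glibc. As I understand it, the GCC developers and the glibc developers work independently. To get the library version from glibc you need to use -fno-builtin. However, although glibc is (or at least was) better, it is still not optimal. For the best results, use Agner Fog's asmlib. He has optimized memcpy, memset, and many other common functions in assembly to take advantage of SSE and AVX, among other optimizations.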

【Discussion】: