【Question title】: CPU cache understanding using a C program
【Posted】: 2013-06-12 21:07:43
【Question】:

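I am trying to understand CPU caches and cache lines using a C program, as I do for most C concepts. The program I am using is shown below. I got the idea from a blog post.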

http://igoro.com/archive/gallery-of-processor-cache-effects/

The output of the program below on my machine is as follows. This is the output with CFLAGS="-g -O0 -Wall".

./cache
CPU time for loop 1 0.460000 secs.
CPU time for loop 2 (j = 8) 0.050000 secs.
CPU time for loop 2 (j = 9) 0.050000 secs.
CPU time for loop 2 (j = 10) 0.050000 secs.
CPU time for loop 2 (j = 11) 0.050000 secs.
CPU time for loop 2 (j = 12) 0.040000 secs.
CPU time for loop 2 (j = 13) 0.050000 secs.
CPU time for loop 2 (j = 14) 0.050000 secs.
CPU time for loop 2 (j = 15) 0.040000 secs.
CPU time for loop 2 (j = 16) 0.050000 secs.
CPU time for loop 2 (j = 17) 0.040000 secs.
CPU time for loop 2 (j = 18) 0.050000 secs.
CPU time for loop 2 (j = 19) 0.040000 secs.
CPU time for loop 2 (j = 20) 0.040000 secs.
CPU time for loop 2 (j = 21) 0.040000 secs.
CPU time for loop 2 (j = 22) 0.040000 secs.
CPU time for loop 2 (j = 23) 0.040000 secs.
CPU time for loop 2 (j = 24) 0.030000 secs.
CPU time for loop 2 (j = 25) 0.040000 secs.
CPU time for loop 2 (j = 26) 0.030000 secs.
CPU time for loop 2 (j = 27) 0.040000 secs.
CPU time for loop 2 (j = 28) 0.030000 secs.
CPU time for loop 2 (j = 29) 0.040000 secs.
CPU time for loop 2 (j = 30) 0.030000 secs.
CPU time for loop 2 (j = 31) 0.030000 secs.

Output with optimization (CFLAGS="-g -O3 -Wall"):

CPU time for loop 1 0.130000 secs.
CPU time for loop 2 (j = 8) 0.040000 secs.
CPU time for loop 2 (j = 9) 0.050000 secs.
CPU time for loop 2 (j = 10) 0.050000 secs.
CPU time for loop 2 (j = 11) 0.040000 secs.
CPU time for loop 2 (j = 12) 0.040000 secs.
CPU time for loop 2 (j = 13) 0.050000 secs.
CPU time for loop 2 (j = 14) 0.050000 secs.
CPU time for loop 2 (j = 15) 0.040000 secs.
CPU time for loop 2 (j = 16) 0.040000 secs.
CPU time for loop 2 (j = 17) 0.050000 secs.
CPU time for loop 2 (j = 18) 0.040000 secs.
CPU time for loop 2 (j = 19) 0.050000 secs.
CPU time for loop 2 (j = 20) 0.040000 secs.
CPU time for loop 2 (j = 21) 0.040000 secs.
CPU time for loop 2 (j = 22) 0.040000 secs.
CPU time for loop 2 (j = 23) 0.030000 secs.
CPU time for loop 2 (j = 24) 0.040000 secs.
CPU time for loop 2 (j = 25) 0.030000 secs.
CPU time for loop 2 (j = 26) 0.040000 secs.
CPU time for loop 2 (j = 27) 0.030000 secs.
CPU time for loop 2 (j = 28) 0.030000 secs.
CPU time for loop 2 (j = 29) 0.030000 secs.
CPU time for loop 2 (j = 30) 0.030000 secs.
CPU time for loop 2 (j = 31) 0.030000 secs.

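The blog states:

"The first loop multiplies every value in the array by 3, and the second loop multiplies only every 16th. The second loop only does about 6% of the work of the first loop, but on modern machines, the two for-loops take about the same time: 80 and 78 ms respectively on my machine."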

That does not seem to be the case on my machine. As you can see, the execution time for

loop 1 is 0.46 seconds.

while for

loop 2 is 0.03 seconds or 0.04 seconds or 0.05 seconds

for different values of j.

Why is this happening?

#include <stdio.h>
#include <sys/time.h>
#include <time.h>
#include <unistd.h>
#include <stdlib.h>

#define MAX_SIZE (64*1024*1024)

int main()
{
    clock_t start, end;
    double cpu_time;
    int i = 0;
    int j = 0;
    /* MAX_SIZE array is too big for stack. This is an unfortunate rough edge of the way the stack works.
     It lives in a fixed-size buffer, set by the program executable's configuration according to the
     operating system, but its actual size is seldom checked against the available space. */
    /* int arr[MAX_SIZE]; */

    int *arr = (int*)malloc(MAX_SIZE * sizeof(int));

    /* CPU clock ticks count start */
    start = clock();

    /* Loop 1 */
    for (i = 0; i < MAX_SIZE; i++)
        arr[i] *= 3;

    /* CPU clock ticks count stop */
    end = clock();

    cpu_time = ((double) (end - start)) / CLOCKS_PER_SEC;

    printf("CPU time for loop 1 %.6f secs.\n", cpu_time);

    for (j = 8 ; j < 32 ; j++)
    {
        /* CPU clock ticks count start */
        start = clock();

        /* Loop 2 */
        for (i = 0; i < MAX_SIZE; i += j)
            arr[i] *= 3;

        /* CPU clock ticks count stop*/
        end = clock();

        cpu_time = ((double) (end - start)) / CLOCKS_PER_SEC;

        printf("CPU time for loop 2 (j = %d) %.6f secs.\n", j, cpu_time);
    }

    return 0;
}

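【Comments】:

  • Not really a duplicate. I posted another question about a segfault. I fixed that problem, and this question is about explaining the results.
  • @hit: Although the code in the two questions is the same, the questions actually being asked are very different...
  • Could you report which compiler you are using, and which flags you pass to it?
  • Wow, that is a very old compiler. Also, when you benchmark, you should compile with optimization enabled. Dare I ask what hardware you are running this on?

Tags: c performance caching time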


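【Answer 1】:

I modified the code a little. First, a summary of the changes:

  1. Made MAX_SIZE significantly larger, so that any change in behaviour shows up as a real difference. (It now uses a full 2 GB of memory, so don't do this on a 32-bit OS.)
  2. Run loop 1 a few times. On my machine this makes a difference, because the first run is slower, probably because malloc does not actually map the memory into the process's address space, so the first loop pays some extra memory-mapping overhead. It also ensures that the CPU is running at "full speed" rather than a "power saving" speed when the other loops execute.
  3. Change the j value faster in the second loop by multiplying by 2 (<<= 1 is the same as *= 2 in this case; using shift is an old habit).
  4. Use += 3 instead of *= 3. (Multiplication is a bit slower than +=, but it makes little difference in this case.)
  5. Added a loop 3, which performs exactly the same number of operations as loop 2, but within a smaller range of memory [using & with a 2^n-1 value to limit the range].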

I compiled the code with gcc -Wall -O3 -std=c99, using version 4.6.3, and ran it on a quad-core Athlon 965 with Fedora Core 16 x86-64 and 16 GB of RAM.

The code is as follows:

#include <stdio.h>
#include <sys/time.h>
#include <time.h>
#include <unistd.h>
#include <stdlib.h>

#define MAX_SIZE (512*1024*1024)

int main()
{
    clock_t start, end;
    double cpu_time;
    int i = 0;
    int j = 0;
    /* MAX_SIZE array is too big for stack. This is an unfortunate rough edge of the way the stack works.
       It lives in a fixed-size buffer, set by the program executable's configuration according to the
       operating system, but its actual size is seldom checked against the available space. */
    /* int arr[MAX_SIZE]; */

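    int *arr = (int*)malloc(MAX_SIZE * sizeof(int));
    /* A 2 GB request can fail; stop here rather than dereference NULL */
    if (arr == NULL)
    {
        perror("malloc");
        return 1;
    }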

    /* CPU clock ticks count start */

    for(int k = 0; k < 3; k++)
    {
        start = clock();

        /* Loop 1 */
        for (i = 0; i < MAX_SIZE; i++)
            arr[i] += 3;

        /* CPU clock ticks count stop */
        end = clock();

        cpu_time = ((double) (end - start)) / CLOCKS_PER_SEC;

        printf("CPU time for loop 1 %.6f secs.\n", cpu_time);
    }

    for (j = 1 ; j <= 1024 ; j <<= 1)
    {
        /* CPU clock ticks count start */
        start = clock();

        /* Loop 2 */
        for (i = 0; i < MAX_SIZE; i += j)
            arr[i] += 3;

        /* CPU clock ticks count stop */
        end = clock();

        cpu_time = ((double) (end - start)) / CLOCKS_PER_SEC;

        printf("CPU time for loop 2 (j = %d) %.6f secs.\n", j, cpu_time);
    }


    // Third loop, performing the same operations as loop 2,
    // but only touching 16KB of memory
    for (j = 1 ; j <= 1024 ; j <<= 1)
    {
        /* CPU clock ticks count start */
        start = clock();

        /* Loop 3 */
        for (i = 0; i < MAX_SIZE; i += j)
            arr[i & 0xfff] += 3;

        /* CPU clock ticks count stop */
        end = clock();

        cpu_time = ((double) (end - start)) / CLOCKS_PER_SEC;

        printf("CPU time for loop 3 (j = %d) %.6f secs.\n", j, cpu_time);
    }
    return 0;
}

Results:

CPU time for loop 1 2.950000 secs.
CPU time for loop 1 0.630000 secs.
CPU time for loop 1 0.630000 secs.
CPU time for loop 2 (j = 1) 0.780000 secs.
CPU time for loop 2 (j = 2) 0.700000 secs.
CPU time for loop 2 (j = 4) 0.610000 secs.
CPU time for loop 2 (j = 8) 0.540000 secs.
CPU time for loop 2 (j = 16) 0.560000 secs.
CPU time for loop 2 (j = 32) 0.280000 secs.
CPU time for loop 2 (j = 64) 0.140000 secs.
CPU time for loop 2 (j = 128) 0.090000 secs.
CPU time for loop 2 (j = 256) 0.060000 secs.
CPU time for loop 2 (j = 512) 0.030000 secs.
CPU time for loop 2 (j = 1024) 0.040000 secs.
CPU time for loop 3 (j = 1) 0.470000 secs.
CPU time for loop 3 (j = 2) 0.240000 secs.
CPU time for loop 3 (j = 4) 0.120000 secs.
CPU time for loop 3 (j = 8) 0.050000 secs.
CPU time for loop 3 (j = 16) 0.030000 secs.
CPU time for loop 3 (j = 32) 0.020000 secs.
CPU time for loop 3 (j = 64) 0.010000 secs.
CPU time for loop 3 (j = 128) 0.000000 secs.
CPU time for loop 3 (j = 256) 0.000000 secs.
CPU time for loop 3 (j = 512) 0.000000 secs.
CPU time for loop 3 (j = 1024) 0.000000 secs.

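As you can see, the first few lines of loop 2 take roughly the same time. Once we reach 32, the time starts to drop, because the processor no longer needs every cache line. In the loop 3 case, by contrast, the number of operations in each loop directly affects the total time.

Edit:

Multiplication (*= 3) versus addition (+= 3) does not make much difference, except in the loop 3 case, where it adds about 30% to the loop time.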

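【Comments】:

  • You can solve the first-loop problem by memsetting the array after mallocing it.
  • @SoapBox: Yes, I suppose so, but that does not prevent other things, such as the CPU not running at full speed (because of power-saving mode) for the first few tenths of a second [assuming the memset is fast enough that it does not bring the CPU speed up to maximum; not sure whether it is].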