Comparing performance of various pthread constructs: answers

【Question title】: Comparing performance of various pthread constructs
【Posted】: 2019-04-01 07:55:58
【Question】:

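I need to design some experiments to compare the performance of various pthread constructs (mutexes, semaphores, read-write locks) against each other and against the corresponding serial program. The main problem is deciding how to measure the execution time of the code under analysis.

I have read about some C functions such as clock() and gettimeofday(). As far as I understand, clock() gives the actual number of CPU cycles used by the program (by subtracting the values it returns at the start and at the end of the code we want to time), and gettimeofday() returns the wall-clock time of the program's execution.

The problem is that total CPU cycles do not seem like a good criterion to me, because they add up the CPU time spent by all threads running in parallel (so I think clock() is not good). Wall-clock time is not good either, because other processes may be running in the background, so the time ends up depending on how the threads are scheduled (so gettimeofday() does not look good to me either).

The other functions I know of most likely behave like one of the two above. So I would like to know whether there is some function I can use for my analysis, or whether my conclusions above are wrong.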

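【Question comments】:

How long do your runs take? What is your OS? If you want to compare single-threaded and multi-threaded code, compare real time rather than CPU time.
I am using Linux.
How long is the execution time? How many CPUs/cores do you have?
I have to compare for various input sizes. For example, I have to sum an array and then vary its size: 10^7, 10^8, 10^9.
You should show the code you tried. As for how to measure time, you should use clock_gettime() or __rdtsc(). Do not forget to disable CPU frequency scaling. Always compile with at least -O2. Take several measurements and use a statistical method to remove outliers: a trimmed mean, or even the minimum, which is simpler and gives more stable results.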
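
The advice in the last comment about reducing repeated measurements could be sketched roughly as follows. The helper names `min_sample` and `trimmed_mean` are made up for this illustration, and the samples are assumed to be durations in seconds collected elsewhere (for instance with clock_gettime()).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Comparison callback for qsort: ascending order of doubles. */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Smallest sample: a simple estimate of the least-disturbed run. */
static double min_sample(const double *samples, size_t n)
{
    assert(n > 0);
    double best = samples[0];
    for (size_t i = 1; i < n; ++i)
        if (samples[i] < best)
            best = samples[i];
    return best;
}

/* Mean after discarding the `trim` smallest and `trim` largest samples.
   Note: sorts the array in place. */
static double trimmed_mean(double *samples, size_t n, size_t trim)
{
    assert(n > 2 * trim);
    qsort(samples, n, sizeof samples[0], cmp_double);
    double sum = 0.0;
    for (size_t i = trim; i < n - trim; ++i)
        sum += samples[i];
    return sum / (double)(n - 2 * trim);
}
```

For the samples {5, 1, 3, 100, 2}, the minimum is 1, and trimming one value from each end averages 2, 3 and 5, so the outlier 100 no longer affects the result.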

Tags: c performance parallel-processing pthreads execution-time

【Solution 1】:

From the Linux clock_gettime man page:

   CLOCK_PROCESS_CPUTIME_ID (since Linux 2.6.12)
          Per-process CPU-time clock (measures CPU time consumed by all
          threads in the process).

   CLOCK_THREAD_CPUTIME_ID (since Linux 2.6.12)
          Thread-specific CPU-time clock.

I believed clock() was implemented somewhere as clock_gettime(CLOCK_PROCESS_CPUTIME_ID, ...), but I see that in glibc it is implemented using times().

So, if you want to measure the CPU time of a specific thread, you can use clock_gettime(CLOCK_THREAD_CPUTIME_ID, ...) on GNU/Linux systems.
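
As a rough illustration of the per-thread clock, the sketch below has a worker thread read CLOCK_THREAD_CPUTIME_ID before and after a busy loop. The `cpu_job` struct, `cpu_worker` and `run_cpu_job` are hypothetical names introduced only for this example, and the busy loop stands in for whatever code is actually being measured.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <time.h>

/* Hypothetical job record: iteration count in, measured CPU seconds out. */
struct cpu_job {
    long iterations;
    double cpu_seconds;
};

/* Thread body: times its own busy loop on the calling thread's CPU clock. */
static void *cpu_worker(void *p)
{
    struct cpu_job *job = p;
    struct timespec c0, c1;
    volatile unsigned long sink = 0;  /* volatile keeps the loop from being optimized away */

    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &c0);
    for (long i = 0; i < job->iterations; ++i)
        sink += (unsigned long)i;
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &c1);

    job->cpu_seconds = (double)(c1.tv_sec - c0.tv_sec)
                     + (double)(c1.tv_nsec - c0.tv_nsec) / 1e9;
    return NULL;
}

/* Runs the job in a new thread; returns 0 on success, -1 on failure. */
static int run_cpu_job(struct cpu_job *job)
{
    pthread_t th;
    assert(job != NULL && job->iterations >= 0);
    if (pthread_create(&th, NULL, cpu_worker, job) != 0)
        return -1;
    return pthread_join(th, NULL) == 0 ? 0 : -1;
}
```

A caller would fill in `iterations`, call `run_cpu_job(&job)`, and read `job.cpu_seconds` afterwards. Time the thread spends blocked or descheduled is not counted by this clock.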

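Never use gettimeofday or clock_gettime(CLOCK_REALTIME, ...) to measure the execution of a program. Do not even think about it. gettimeofday is a "wall clock": something you can display on the wall of your room. If you want to measure the passage of time, forget gettimeofday.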

If you want, you can even stay fully POSIX-compliant by calling pthread_getcpuclockid in your thread and passing the clock ID it returns to clock_gettime.
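
A minimal sketch of that POSIX route, assuming the thread looks up its own clock with pthread_self(). The names `self_timed` and `measure_self_timed` and the fixed loop count are assumptions made for the example, not part of the answer.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <time.h>

/* Thread body: obtains its own CPU-time clock ID and times a busy loop.
   Stores the CPU seconds in *p, or leaves -1.0 there if a call fails. */
static void *self_timed(void *p)
{
    double *cpu_seconds = p;
    clockid_t cid;
    struct timespec c0, c1;
    volatile unsigned long sink = 0;

    assert(cpu_seconds != NULL);
    *cpu_seconds = -1.0;
    if (pthread_getcpuclockid(pthread_self(), &cid) != 0)
        return NULL;
    if (clock_gettime(cid, &c0) != 0)
        return NULL;
    for (unsigned long i = 0; i < 1000000UL; ++i)
        sink += i;
    if (clock_gettime(cid, &c1) != 0)
        return NULL;

    *cpu_seconds = (double)(c1.tv_sec - c0.tv_sec)
                 + (double)(c1.tv_nsec - c0.tv_nsec) / 1e9;
    return NULL;
}

/* Starts one self-timing thread and returns its CPU seconds (-1.0 on error). */
static double measure_self_timed(void)
{
    pthread_t th;
    double cpu_seconds = -1.0;
    if (pthread_create(&th, NULL, self_timed, &cpu_seconds) != 0)
        return -1.0;
    if (pthread_join(th, NULL) != 0)
        return -1.0;
    return cpu_seconds;
}
```

pthread_getcpuclockid also accepts the pthread_t of another thread that is still running, so a monitoring thread could read a worker's CPU time the same way.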

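【Discussion】:

Everything depends on what you want to measure. For me it is real time, because that is the time I perceive: if my program takes 1 minute to load, I do not care that it takes 1 second to execute; for me the time is 1min1sec, not 1sec ;-)
Then use CLOCK_MONOTONIC, not gettimeofday. gettimeofday is a wall clock, not an "interval measurement clock". It can jump. If you use gettimeofday to measure the execution of a program, do not be surprised to see negative intervals, or wrong ones. gettimeofday is only for nice user-facing clock time synchronized to UTC. As soon as a leap second occurs, your measurement will be wrong. Or ntp kicks in and synchronizes the system, and your measurement will be wrong.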
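
To make the CLOCK_MONOTONIC suggestion concrete, here is a small sketch that times one call and returns the interval in whole nanoseconds. `elapsed_ns`, `time_call_ns` and `busy_work` are illustrative names, and `busy_work` is only a placeholder for the code under test.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Interval t1 - t0 in nanoseconds. The tv_nsec difference may be negative;
   adding it to the scaled tv_sec difference handles the borrow. */
static int64_t elapsed_ns(const struct timespec *t0, const struct timespec *t1)
{
    return (int64_t)(t1->tv_sec - t0->tv_sec) * 1000000000LL
         + (int64_t)(t1->tv_nsec - t0->tv_nsec);
}

/* Placeholder workload standing in for the code being measured. */
static void busy_work(void)
{
    volatile unsigned long sink = 0;
    for (unsigned long i = 0; i < 1000000UL; ++i)
        sink += i;
}

/* Times one call of fn() on the monotonic clock; returns nanoseconds. */
static int64_t time_call_ns(void (*fn)(void))
{
    struct timespec t0, t1;
    assert(fn != NULL);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    fn();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return elapsed_ns(&t0, &t1);
}
```

Because the monotonic clock is not set backwards by ntp or by manual clock changes, `time_call_ns(busy_work)` should never come out negative, which is exactly the property gettimeofday cannot guarantee.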

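【Solution 2】:

I am not sure that summing an array is a good test. You do not need a mutex or anything similar to sum an array with several threads: each thread just sums its own dedicated part of the array, and the work consists of many memory accesses with very little CPU computation. Here is an example (the values of SZ and NTHREADS are given at compile time); the measured time is real time (monotonic):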

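#include <time.h>
#include <stdlib.h>
#include <stdio.h>
#include <pthread.h>

/* SZ (array size) and NTHREADS are supplied at compile time with -D */
static int Arr[SZ];

/* Sums SZ/NTHREADS elements starting at index *a, then stores the partial
   sum back into *a. Unsigned arithmetic wraps around on overflow instead
   of being undefined behavior. */
void * thSum(void * a)
{
  unsigned s = 0;
  int i;
  int sup = *((int *) a) + SZ/NTHREADS;

  for (i = *((int *) a); i != sup; ++i)
    s += Arr[i];

  *((int *) a) = (int) s;
  return NULL;
}

int main()
{
  int i;

  for (i = 0; i != SZ; ++i)
    Arr[i] = rand();

  struct timespec t0, t1;

  /* single-threaded reference */
  clock_gettime(CLOCK_MONOTONIC, &t0);

  unsigned s = 0;

  for (i = 0; i != SZ; ++i)
    s += Arr[i];

  clock_gettime(CLOCK_MONOTONIC, &t1);
  printf("mono thread : %d %lf\n", (int) s,
         (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec)/1000000000.0);

  /* multi-threaded version; thread creation and join are part of the timing */
  clock_gettime(CLOCK_MONOTONIC, &t0);

  int n[NTHREADS];
  pthread_t ths[NTHREADS];

  for (i = 0; i != NTHREADS; ++i) {
    n[i] = SZ / NTHREADS * i;   /* start index of this thread's slice */
    if (pthread_create(&ths[i], NULL, thSum, &n[i])) {
      printf("cannot create thread %d\n", i);
      return -1;
    }
  }

  unsigned s2 = 0;

  for (i = 0; i != NTHREADS; ++i) {
    pthread_join(ths[i], NULL);
    s2 += n[i];                 /* n[i] now holds the partial sum */
  }

  /* elements left over when SZ is not a multiple of NTHREADS */
  for (i = SZ / NTHREADS * NTHREADS; i != SZ; ++i)
    s2 += Arr[i];

  clock_gettime(CLOCK_MONOTONIC, &t1);
  printf("%d threads : %d %lf\n", NTHREADS, (int) s2,
         (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec)/1000000000.0);
  return 0;
}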

Compilation and execution:

(array of 100,000,000 elements)

/tmp % gcc -DSZ=100000000 -DNTHREADS=2 -O3 s.c -lpthread -lrt
/tmp % ./a.out
mono thread : 563608529 0.035217
2 threads : 563608529 0.020407
/tmp % ./a.out
mono thread : 563608529 0.034991
2 threads : 563608529 0.022659
/tmp % gcc -DSZ=100000000 -DNTHREADS=4 -O3 s.c -lpthread -lrt
/tmp % ./a.out
mono thread : 563608529 0.035212
4 threads : 563608529 0.014234
/tmp % ./a.out
mono thread : 563608529 0.035184
4 threads : 563608529 0.014163
/tmp % gcc -DSZ=100000000 -DNTHREADS=8 -O3 s.c -lpthread -lrt
/tmp % ./a.out
mono thread : 563608529 0.035229
8 threads : 563608529 0.014971
/tmp % ./a.out
mono thread : 563608529 0.035142
8 threads : 563608529 0.016248

(array of 1,000,000,000 elements)

/tmp % gcc -DSZ=1000000000 -DNTHREADS=2 -O3 s.c -lpthread -lrt
/tmp % ./a.out
mono thread : -1471389927 0.343761
2 threads : -1471389927 0.197303
/tmp % ./a.out
mono thread : -1471389927 0.346682
2 threads : -1471389927 0.197669
/tmp % gcc -DSZ=1000000000 -DNTHREADS=4 -O3 s.c -lpthread -lrt
/tmp % ./a.out
mono thread : -1471389927 0.346859
4 threads : -1471389927 0.130639
/tmp % ./a.out
mono thread : -1471389927 0.346506
4 threads : -1471389927 0.130751
/tmp % gcc -DSZ=1000000000 -DNTHREADS=8 -O3 s.c -lpthread -lrt
/tmp % ./a.out
mono thread : -1471389927 0.346954
8 threads : -1471389927 0.123572
/tmp % ./a.out
mono thread : -1471389927 0.349652
8 threads : -1471389927 0.127059

As you can see, the execution time is not simply divided by the number of threads; the bottleneck is probably memory access.
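
Since the question is about comparing locking constructs, a deliberately contended variant of the same sum can serve as a contrast to the lock-free partial sums above. The following is only a sketch under stated assumptions: `locked_sum`, `struct slice` and `MAX_THREADS` are hypothetical names, and taking the mutex once per element is a worst case chosen to make the locking cost visible, not a recommended design.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 64  /* arbitrary limit chosen for this sketch */

/* One thread's share of the array plus the shared accumulator and its lock. */
struct slice {
    const int *data;
    size_t begin, end;
    long long *total;
    pthread_mutex_t *lock;
};

/* Adds every element of the slice to the shared total, locking per element. */
static void *sum_slice_locked(void *p)
{
    struct slice *s = p;
    for (size_t i = s->begin; i < s->end; ++i) {
        pthread_mutex_lock(s->lock);
        *s->total += s->data[i];
        pthread_mutex_unlock(s->lock);
    }
    return NULL;
}

/* Sums data[0..n) with nthreads threads sharing one mutex-protected total.
   Returns 0 and stores the sum in *out on success, -1 on failure. */
static int locked_sum(const int *data, size_t n, int nthreads, long long *out)
{
    assert(nthreads > 0 && nthreads <= MAX_THREADS);
    pthread_t th[MAX_THREADS];
    struct slice sl[MAX_THREADS];
    pthread_mutex_t lock;
    long long total = 0;
    int started = 0;

    if (pthread_mutex_init(&lock, NULL) != 0)
        return -1;
    for (int t = 0; t < nthreads; ++t) {
        sl[t].data = data;
        sl[t].begin = n * (size_t)t / (size_t)nthreads;
        sl[t].end = n * (size_t)(t + 1) / (size_t)nthreads;
        sl[t].total = &total;
        sl[t].lock = &lock;
        if (pthread_create(&th[t], NULL, sum_slice_locked, &sl[t]) != 0)
            break;
        ++started;
    }
    for (int t = 0; t < started; ++t)
        pthread_join(th[t], NULL);
    pthread_mutex_destroy(&lock);

    if (started != nthreads)
        return -1;
    *out = total;
    return 0;
}
```

Wrapping a call to `locked_sum` between two clock_gettime(CLOCK_MONOTONIC, ...) readings, as the program above does, would give a figure to set against the partial-sum version for the same SZ and NTHREADS.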

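【Discussion】:

You should not use gettimeofday() for performance measurements. Any ntp synchronization will ruin your measurements.
@AlainMerigot There are variations in the measured execution times, but probably not because of ntp: the clock is good enough and the catch-up rate is small. For me, it is real time that must be measured, not CPU time.
@AlainMerigot Anyway, I have switched to monotonic time.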