std::tan() extremely slow after updating glibc

【Question title】: std::tan() extremely slow after updating glibc
【Posted】: 2016-09-27 00:42:45
【Question description】:

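I have a C++ program that calls a lot of trigonometric functions. It has run fine for more than a year. I recently installed gcc-4.8 and updated glibc at the same time. This made my program almost 1000 times slower. Using gdb, I found that the slowdown is caused by calls to std::tan(). When the argument is pi or pi/2, the function takes a very long time to return.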

Here is an MWE that reproduces the problem when compiled without optimization (the real program has the same problem with and without the -O2 flag).

#include <cmath>

int main() {
    double pi = 3.141592653589793;
    double approxPi = 3.14159;
    double ret = 0.;

    for(int i = 0; i < 100000; ++i) ret = std::tan(pi); //Very slow
    for(int i = 0; i < 100000; ++i) ret = std::tan(approxPi); //Not slow
}

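Here is a sample backtrace from gdb (obtained after interrupting the program at a random point with Ctrl+C). From the call to tan onward, the backtraces in the MWE and in my real program are identical.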

#0  0x00007ffff7b1d048 in __mul (p=32, z=0x7fffffffc740, y=0x7fffffffcb30, x=0x7fffffffc890) at ../sysdeps/ieee754/dbl-64/mpa.c:458
#1  __mul (x=0x7fffffffc890, y=0x7fffffffcb30, z=0x7fffffffc740, p=32) at ../sysdeps/ieee754/dbl-64/mpa.c:443
#2  0x00007ffff7b1e348 in cc32 (p=32, y=0x7fffffffc4a0, x=0x7fffffffbf60) at ../sysdeps/ieee754/dbl-64/sincos32.c:111
#3  __c32 (x=<optimized out>, y=0x7fffffffcf50, z=0x7fffffffd0a0, p=32) at ../sysdeps/ieee754/dbl-64/sincos32.c:128
#4  0x00007ffff7b1e170 in __mptan (x=<optimized out>, mpy=0x7fffffffd690, p=32) at ../sysdeps/ieee754/dbl-64/mptan.c:57
#5  0x00007ffff7b45b46 in tanMp (x=<optimized out>) at ../sysdeps/ieee754/dbl-64/s_tan.c:503
#6  __tan_avx (x=<optimized out>) at ../sysdeps/ieee754/dbl-64/s_tan.c:488
#7  0x00000000004005b8 in main ()

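I tried running the code (both the MWE and the real program) on four different systems. Two of them are on the clusters where I run my code; the other two are my laptops. The MWE runs without problems on one of the clusters and on one of the laptops. I checked which version of libm.so.6 each system uses, in case that matters. The list below gives, for each system, its description (taken from cat /etc/*-release), whether it is 32-bit or 64-bit, whether the MWE is slow, and finally the output of running /lib/libc.so.6 and of cat /proc/cpuinfo.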

SUSE Linux Enterprise Server 11 (x86_64), 64-bit, using libm-2.11.1.so (MWE is fast)

GNU C Library stable release version 2.11.1 (20100118), by Roland McGrath et al.
Copyright (C) 2009 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Configured for x86_64-suse-linux.
Compiled by GNU CC version 4.3.4 [gcc-4_3-branch revision 152973].
Compiled on a Linux 2.6.32 system on 2012-04-12.
Available extensions:
        crypt add-on version 2.1 by Michael Glad and others
        GNU Libidn by Simon Josefsson
        Native POSIX Threads Library by Ulrich Drepper et al
        BIND-8.2.3-T5B
For bug reporting instructions, please see:
<http://www.gnu.org/software/libc/bugs.html>.

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 63
model name      : Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
stepping        : 2
microcode       : 53
cpu MHz         : 1200.000
cache size      : 30720 KB
physical id     : 0
siblings        : 24
core id         : 0
cpu cores       : 12
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 15
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm ida arat epb xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid fsgsbase bmi1 avx2 smep bmi2 erms invpcid
bogomips        : 5000.05
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

CentOS release 6.7 (Final), 64-bit, using libm-2.12.so (MWE is slow)

GNU C Library stable release version 2.12, by Roland McGrath et al.
Copyright (C) 2010 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 4.4.7 20120313 (Red Hat 4.4.7-16).
Compiled on a Linux 2.6.32 system on 2015-09-22.
Available extensions:
        The C stubs add-on version 2.1.2.
        crypt add-on version 2.1 by Michael Glad and others
        GNU Libidn by Simon Josefsson
        Native POSIX Threads Library by Ulrich Drepper et al
        BIND-8.2.3-T5B
        RT using linux kernel aio
libc ABIs: UNIQUE IFUNC
For bug reporting instructions, please see:
<http://www.gnu.org/software/libc/bugs.html>.

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 26
model name      : Intel(R) Xeon(R) CPU           E5507  @ 2.27GHz
stepping        : 5
cpu MHz         : 1596.000
cache size      : 4096 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 11
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt lahf_lm tpr_shadow vnmi flexpriority ept vpid
bogomips        : 4533.16
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

Ubuntu precise (12.04.5 LTS), 64-bit, using libm-2.15.so (my first laptop, MWE is slow)

GNU C Library (Ubuntu EGLIBC 2.15-0ubuntu10.15) stable release version 2.15, by Roland McGrath et al.
Copyright (C) 2012 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 4.6.3.
Compiled on a Linux 3.2.79 system on 2016-05-26.
Available extensions:
    crypt add-on version 2.1 by Michael Glad and others
    GNU Libidn by Simon Josefsson
    Native POSIX Threads Library by Ulrich Drepper et al
    BIND-8.2.3-T5B
libc ABIs: UNIQUE IFUNC
For bug reporting instructions, please see:
<http://www.debian.org/Bugs/>.

processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model       : 42
model name  : Intel(R) Core(TM) i7-2620M CPU @ 2.70GHz
stepping    : 7
microcode   : 0x1a
cpu MHz     : 800.000
cache size  : 4096 KB
physical id : 0
siblings    : 4
core id     : 0
cpu cores   : 2
apicid      : 0
initial apicid  : 0
fpu     : yes
fpu_exception   : yes
cpuid level : 13
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips    : 5387.59
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

Ubuntu precise (12.04.5 LTS), 32-bit, using libm-2.15.so (my second laptop, MWE is fast)

GNU C Library (Ubuntu EGLIBC 2.15-0ubuntu10.12) stable release version 2.15, by Roland McGrath et al.
Copyright (C) 2012 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 4.6.3.
Compiled on a Linux 3.2.68 system on 2015-03-26.
Available extensions:
    crypt add-on version 2.1 by Michael Glad and others
    GNU Libidn by Simon Josefsson
    Native POSIX Threads Library by Ulrich Drepper et al
    BIND-8.2.3-T5B
libc ABIs: UNIQUE IFUNC
For bug reporting instructions, please see:
<http://www.debian.org/Bugs/>.

processor    : 0
vendor_id    : GenuineIntel
cpu family    : 6
model        : 15
model name    : Intel(R) Core(TM)2 Duo CPU     T5800  @ 2.00GHz
stepping    : 13
microcode    : 0xa3
cpu MHz        : 800.000
cache size    : 2048 KB
physical id    : 0
siblings    : 2
core id        : 0
cpu cores    : 2
apicid        : 0
initial apicid    : 0
fdiv_bug    : no
hlt_bug        : no
f00f_bug    : no
coma_bug    : no
fpu        : yes
fpu_exception    : yes
cpuid level    : 10
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc arch_perfmon pebs bts aperfmperf pni dtes64 monitor ds_cpl est tm2 ssse3 cx16 xtpr pdcm lahf_lm dtherm
bogomips    : 3989.79
clflush size    : 64
cache_alignment    : 64
address sizes    : 36 bits physical, 48 bits virtual
power management:

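I hope I have managed to provide enough background information. These are my questions:

Why did std::tan() become slow?
Is there a way to get back to normal speed?

I would strongly prefer a solution that does not require installing or replacing a bunch of libraries. That might work on my laptops, but I don't have the necessary permissions on the cluster nodes.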

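Update #1: I removed my observation about passing constants to tan, as explained by Sam Varshavchik. I added the output of running /lib/libc.so.6 to my list of systems, and also added a fourth system. As for timings, here is the output of time ./mwe with the pi loop (approxPi loop commented out).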

real    0m11.483s
user    0m11.465s
sys 0m0.004s

And here it is with the approxPi loop (pi loop commented out).

real    0m0.011s
user    0m0.008s
sys 0m0.000s

Update #2: For each system, I added whether it is 32-bit or 64-bit and the output of cat /proc/cpuinfo for the first core.

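【Comments】:

I have doubts about your timing statements ("very slow" / "not slow"). For the first 2 loops you pass a double, so the results can't differ! I just can't imagine how the values of the bits would affect the timing... Can you explain how you timed these loops? Maybe add some timers before and after, to get something reliable?
Passing a constant to tan() does cause the compiler to perform the computation at compile time.
"Fast" and "slow" are not very descriptive. Do you have hard numbers from a profiling tool? Can you show that the problem lies with the libm version on the same system? How did you upgrade glibc: through the package manager, or by compiling from source (a big no-no)? Also, do you believe the problem is related to GCC 4.8 or to the libm version? There are too many variables in the question; I doubt you can get a useful answer.
Please paste the full output of running /lib/libc.so.6 on each system (glibc is magical: it is both a library and a program).
@sleeptightpupper I updated the compiler (and glibc, since the compiler depends on it) using sudo add-apt-repository ppa:ubuntu-toolchain-r/test and sudo apt-get install gcc-4.8 g++-4.8. That is indeed what caused the problem. I don't think it is related to the compiler; I get the same slowdown with gcc-4.4, gcc-4.6 and gcc-4.8. I added a fourth system (my other laptop) to the list. It uses libc-2.15.so but does not suffer from any slowdown. If the problem lies with libc, it is not a matter of the plain version number. Maybe the subtle differences between the libc.so.6 outputs hold a clue.

Tags: c++ linux performance gcc glibc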

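【Solution 1】:

The accuracy of transcendental functions (such as trigonometric and exponential functions) has always been a problem¹.

Why some trig function calls are slower than others

For most arguments to a trig function, there is a fast approximation that produces highly accurate results. For some arguments, however, the approximation can be badly wrong, so a more precise method has to be used, and that method takes much longer (as you have seen).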

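Why a new library might be slower now

For a long time, Intel made misleading claims about the accuracy of its floating-point trig instructions, claiming they were far more accurate than they really are². So much so that glibc used to implement sin(double) simply as a wrapper around the x87 fsin instruction³. You have probably upgraded to a glibc version that corrected this. I can't speak for AMD's libm, but it may still rely on the erroneous accuracy claims for the floating-point trig functions⁴,⁵.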

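What to do

If you want speed and don't care much about accuracy, use the float version of tan (tanf, or std::tan with a float argument). Otherwise, if you need accuracy, you are stuck with the slower method. Your best bet is to cache the results of tan(pi) and tan(pi/2) and use the precomputed values whenever you think you might need them.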

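【Discussion】:

While it is nice to quote Bruce@randomascii, this is not relevant to the problem at hand. Frame #6 of the GDB stack trace shows that __tan_avx is being called. This means glibc is configured to use the SSE2 unit (plus AVX and AVX2 where available) for FP math. The SSE2 unit has no fsin or similar instructions, only the IEEE-mandated +, -, *, / and sqrt. Trig functions are normally computed by range reduction followed by a Padé approximation.
Well, that is all fine and good, but tan(0) = 0, tan(±π) = 0, tan(→±π/2) → ±∞ (the sign depends on which side you approach the limit from).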