Increasing allocation performance for strings (answers)

【Question title】: Increasing allocation performance for strings
【Posted】: 2014-04-20 22:35:14
【Question】:

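I ported a Java GC test program to C++ (see the code below) and to Python. Java and Python perform far better than C++, which I assume is because new has to be called every time a string is created. I tried Boost's fast_pool_allocator, but that actually made performance worse, going from 700 ms to 1200 ms. Am I using the wrong allocator, or is there something else I should be doing?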

EDIT: Compiled with g++ -O3 -march=native --std=c++11 garbage.cpp -lboost_system. g++ is version 4.8.1. One iteration takes about 300 ms in Python and about 50 ms in Java. With std::allocator it takes about 700 ms, and with boost::fast_pool_allocator about 1200 ms.

#include <string>
#include <vector>
#include <chrono>
#include <list>
#include <iostream>
#include <boost/pool/pool_alloc.hpp>
#include <memory>
//#include <gc/gc_allocator.h>


using namespace std;
#include <sstream>
typedef boost::fast_pool_allocator<char> c_allocator;
//typedef std::allocator<char> c_allocator;
typedef basic_string<char, char_traits<char>, c_allocator> pool_string;
namespace patch {
    template <typename T> pool_string to_string(const T& in) {
        std::basic_stringstream<char, char_traits<char>, c_allocator> stm;
        stm << in;
        return stm.str();
    }
}


#include "mytime.hpp"

class Garbage {
public:
    vector<pool_string> outer;
    vector<pool_string> old;
    const int nThreads = 1;
    //static auto time = chrono::high_resolution_clock();

    void go() {
//        outer.resize(1000000);
        //old.reserve(1000000);
        auto tt = mytime::msecs();
        for (int i = 0; i < 10; ++i) {
            if (i % 100 == 0) {
                cout << "DOING AN OLD" << endl;
                doOld();
                tt = mytime::msecs();
            }

            for (int j = 0; j < 1000000/nThreads; ++j)
                outer.push_back(patch::to_string(j));

            outer.clear();
            auto t = mytime::msecs();
            cout << (t - tt) << endl;
            tt = t;
        }
    }

    void doOld() {
        old.clear();
        for (int i = 0; i < 1000000/nThreads; ++i)
            old.push_back(patch::to_string(i));
    }
};

int main() {
    Garbage().go();
}

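【Question comments】:

The fast_pool_allocator docs seem to indicate that you are indeed using the wrong one: pool_allocator is for contiguous chunks (e.g. new char[n]), while fast_pool_allocator is for single objects (e.g. new char).
Thanks. I just tried that, but I got tired of waiting for it to print a number (i.e. it took a very long time).
@user315118 - Your post doesn't mention which compiler you used, which optimization options you used when building the application, and so on. If you are going to post code and claim it runs in a certain amount of time, you must also post the compiler and options you used. Otherwise we can rightly assume that you are using an old compiler, a broken compiler, or compiling without optimizations fully enabled.
Are you saying each loop takes 0.7-1.2 s? On my machine it takes about 0.19 s with the standard allocator. Either you are running on fairly slow hardware or you haven't enabled optimizations.
@PaulMcKenzie Thanks. I've updated the question with that information.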

Tags: c++ linux memory-management boost memory-pool

【Solution 1】:

The problem is that you construct a new stringstream every time you convert an integer.

The fix:

namespace patch {
    template <typename T> pool_string to_string(const T& in) {
        return boost::lexical_cast<pool_string>(in);
    }
}

The timings (in seconds) are now:

DOING AN OLD
0.175462
0.0670085
0.0669926
0.0687969
0.0692518
0.0669318
0.0669196
0.0669187
0.0668962
0.0669185

real    0m0.801s
user    0m0.784s
sys     0m0.016s

See it Live On Coliru

Full code for reference:

#include <boost/pool/pool_alloc.hpp>
#include <chrono>
#include <iostream>
#include <list>
#include <memory>
#include <sstream>
#include <string>
#include <vector>
#include <boost/lexical_cast.hpp>
//#include <gc/gc_allocator.h>

using string = std::string;

namespace patch {
    template <typename T> string to_string(const T& in) {
        return boost::lexical_cast<string>(in);
    }
}

class Timer
{
    typedef std::chrono::high_resolution_clock clock;
    clock::time_point _start;
  public:
    Timer() { reset(); }
    void reset() { _start = now(); }
    double elapsed()
    {
        using namespace std::chrono;
        auto e = now() - _start;
        return duration_cast<nanoseconds>(e).count()*1.0e-9;
    }
    clock::time_point now()
    {
        return clock::now();
    }
};


class Garbage {
    public:
        std::vector<string> outer;
        std::vector<string> old;
        const int nThreads = 1;

        void go() {
            outer.resize(1000000);
            //old.reserve(1000000);
            Timer timer;

            for (int i = 0; i < 10; ++i) {
                if (i % 100 == 0) {
                    std::cout << "DOING AN OLD" << std::endl;
                    doOld();
                }

                for (int j = 0; j < 1000000/nThreads; ++j)
                    outer.push_back(patch::to_string(j));

                outer.clear();
                std::cout << timer.elapsed() << std::endl;
                timer.reset();
            }
        }

        void doOld() {
            old.clear();
            for (int i = 0; i < 1000000/nThreads; ++i)
                old.push_back(patch::to_string(i));
        }
};

int main() {
    Garbage().go();
}

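【Comments】:

As usual, this is a case of "it's not spending the time where you think it is"... Good spot.
@MatsPetersson Indeed. As usual, it's "profile, profile, profile" (although I didn't use a profiler for this one) :)
@sehe Awesome! I hadn't considered the stringstream's allocations. Declaring it static gives similar results. Even so, on my computer the best time for one iteration was still only around 175 ms. I replaced my timer with yours and now it is around 100 ms. So double, double emphasis on Mats Petersson's point.
@MatsPetersson Absolutely... I've been spoiled by my GCed languages, haha. I actually did generate a call graph, but admittedly I took one look at all the templated functions it listed and closed it :/
This is like saying front-wheel-drive cars are faster than rear-wheel-drive cars. What really determines speed is not which wheels drive the car but the overall design, the power of the engine and, if corners are involved, the tyres, suspension and so on. Front- or rear-wheel drive is a small factor in the overall scheme. I believe the analogy holds up pretty well for "garbage collected vs. non-garbage collected". In the compiler project I'm working on, I just lazily ignore all memory freeing and rely on the whole thing being cleaned up at exit. Obviously that won't work forever, but...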

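【Solution 2】:

Since I don't have boost installed on my machine, I simplified the code to use the standard C++11 to_string (and thereby accidentally "fixed" the problem found in the other answer), and got this: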

#include <string>
#include <vector>
#include <chrono>
#include <list>
#include <iostream>
#include <memory>
//#include <gc/gc_allocator.h>
#include <sstream>
using namespace std;


class Timer
{
    typedef std::chrono::high_resolution_clock clock;
    clock::time_point _start;
    public:
    Timer() { reset(); }
    void reset() { _start = now(); }
    double elapsed()
    {
        using namespace std::chrono;
        auto e = now() - _start;
        return duration_cast<nanoseconds>(e).count()*1.0e-9;
    }
    clock::time_point now()
    {
        return clock::now();
    }
};


class Garbage {
public:
    vector<string> outer;
    vector<string> old;
    const int nThreads = 1;
    Timer timer;

    void go() {
//        outer.resize(1000000);
        //old.reserve(1000000);
        for (int i = 0; i < 10; ++i) {
            if (i % 100 == 0) {
                cout << "DOING AN OLD" << endl;
                doOld();
            }

            for (int j = 0; j < 1000000/nThreads; ++j)
                outer.push_back(to_string(j));

            outer.clear();
            cout << timer.elapsed() << endl;
            timer.reset();
        }
    }

    void doOld() {
        old.clear();
        for (int i = 0; i < 1000000/nThreads; ++i)
            old.push_back(to_string(i));
    }
};

int main() {
    Garbage().go();
}

Compile and run (timings in seconds):

$ g++ -O3 -std=c++11 gc.cpp
$ ./a.out
DOING AN OLD
0.414637
0.189082
0.189143
0.186336
0.184449
0.18504
0.186302
0.186055
0.183123
0.186835

clang 3.5, built from the sources as of Friday, 18 April 2014, gives similar results with the same compiler options.

My processor is an AMD Phenom(tm) II X4 965, running at 3.6 GHz (if I remember correctly).

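【Comments】:

Thanks for the STL answer. When I was actually writing and debugging the code I was using cygwin, whose to_string implementation is broken, hence the stringstream.
Yes, not using Windows has its advantages... ;)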