Why is splitting a string slower in C++ than Python? Answers

【Question title】: Why is splitting a string slower in C++ than Python?
【Posted】: 2012-03-11 19:39:18
【Question】:

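I'm trying to convert some code from Python to C++ to gain a little speed and sharpen my rusty C++ skills. Yesterday I was shocked when a naive implementation of reading lines from stdin was much faster in Python than in C++ (see this). Today, I finally figured out how to split a string in C++ with merging delimiters (similar semantics to Python's split()), and I am now experiencing deja vu! My C++ code takes much longer to do the work (though not an order of magnitude longer, as was the case in yesterday's lesson).

Python code: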

#!/usr/bin/env python
from __future__ import print_function                                            
import time
import sys

count = 0
start_time = time.time()
dummy = None

for line in sys.stdin:
    dummy = line.split()
    count += 1

delta_sec = int(time.time() - start_time)
print("Python: Saw {0} lines in {1} seconds. ".format(count, delta_sec), end='')
if delta_sec > 0:
    lps = int(count/delta_sec)
    print("  Crunch Speed: {0}".format(lps))
else:
    print('')

C++ code:

#include <iostream>                                                              
#include <string>
#include <sstream>
#include <time.h>
#include <vector>

using namespace std;

void split1(vector<string> &tokens, const string &str,
        const string &delimiters = " ") {
    // Skip delimiters at beginning
    string::size_type lastPos = str.find_first_not_of(delimiters, 0);

    // Find the first delimiter after the token start
    string::size_type pos = str.find_first_of(delimiters, lastPos);

    while (string::npos != pos || string::npos != lastPos) {
        // Found a token, add it to the vector
        tokens.push_back(str.substr(lastPos, pos - lastPos));
        // Skip delimiters
        lastPos = str.find_first_not_of(delimiters, pos);
        // Find the next delimiter
        pos = str.find_first_of(delimiters, lastPos);
    }
}

void split2(vector<string> &tokens, const string &str, char delim=' ') {
    stringstream ss(str); //convert string to stream
    string item;
    while(getline(ss, item, delim)) {
        tokens.push_back(item); //add token to vector
    }
}

int main() {
    string input_line;
    vector<string> spline;
    long count = 0;
    int sec, lps;
    time_t start = time(NULL);

    cin.sync_with_stdio(false); //disable synchronous IO

    while(cin) {
        getline(cin, input_line);
        spline.clear(); //empty the vector for the next line to parse

        //I'm trying one of the two implementations, per compilation, obviously:
//        split1(spline, input_line);  
        split2(spline, input_line);

        count++;
    };

    count--; //subtract for final over-read
    sec = (int) time(NULL) - start;
    cerr << "C++   : Saw " << count << " lines in " << sec << " seconds." ;
    if (sec > 0) {
        lps = count / sec;
        cerr << "  Crunch speed: " << lps << endl;
    } else
        cerr << endl;
    return 0;
}

//compiled with: g++ -Wall -O3 -o split1 split_1.cpp

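Note that I tried two different split implementations. One (split1) uses string methods to search for tokens; it can merge runs of delimiters and handle several delimiter characters (it was adapted from another post). The second (split2) uses getline to read the string as a stream, does not merge delimiters, and supports only a single delimiter character (that one was posted by several StackOverflow users in answers to string-splitting questions).

I ran this multiple times in various orders. My test machine is a Macbook Pro (2011, 8GB, quad core), not that it matters much. I'm testing with a 20M-line text file with three space-separated columns, where each line looks similar to this: "foo.bar 127.0.0.1 home.foo.bar"

Results: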

$ /usr/bin/time cat test_lines_double | ./split.py
       15.61 real         0.01 user         0.38 sys
Python: Saw 20000000 lines in 15 seconds.   Crunch Speed: 1333333
$ /usr/bin/time cat test_lines_double | ./split1
       23.50 real         0.01 user         0.46 sys
C++   : Saw 20000000 lines in 23 seconds.  Crunch speed: 869565
$ /usr/bin/time cat test_lines_double | ./split2
       44.69 real         0.02 user         0.62 sys
C++   : Saw 20000000 lines in 45 seconds.  Crunch speed: 444444

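What am I doing wrong? Is there a better way to split strings in C++ that does not rely on external libraries (i.e. no Boost), supports merging sequences of delimiters (like Python's split), is thread safe (so no strtok), and performs at least on par with Python?

Edit 1 / Partial solution?:

I tried to make the comparison fairer by having Python reset the dummy list and append to it each time, as the C++ code does. This still isn't exactly what the C++ code is doing, but it's a bit closer. Basically, the loop is now: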

for line in sys.stdin:
    dummy = []
    dummy += line.split()
    count += 1

Python's performance is now about the same as the split1 C++ implementation.

/usr/bin/time cat test_lines_double | ./split5.py
       22.61 real         0.01 user         0.40 sys
Python: Saw 20000000 lines in 22 seconds.   Crunch Speed: 909090

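I am still surprised that, even if Python is so optimized for string processing (as Matt Joiner suggested), these C++ implementations are not faster. If anyone has ideas about how to do this more efficiently in C++, please share your code. (I think my next step will be to implement this in pure C, although I'm not going to trade off programmer productivity to re-implement my whole project in C, so this will just be an experiment in string-splitting speed.)

Thanks to all for your help.

Final edit / solution:

Please see Alf's accepted answer. Since Python deals with strings strictly by reference and STL strings are often copied, the vanilla Python implementation performs better. For comparison, I compiled Alf's code and ran my data through it. Here is the performance on the same machine as all the other runs; it is essentially identical to the naive Python implementation (and faster than the Python implementation that resets and appends to the list, shown in the edit above):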

$ /usr/bin/time cat test_lines_double | ./split6
       15.09 real         0.01 user         0.45 sys
C++   : Saw 20000000 lines in 15 seconds.  Crunch speed: 1333333

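My only small remaining gripe is the amount of code needed to get C++ to perform in this case.

One lesson from this issue and yesterday's stdin line-reading issue (linked above) is that one should always benchmark instead of making naive assumptions about languages' relative "default" performance. I appreciate the education.

Thanks again to all for your suggestions!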
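
On the benchmarking point: the programs above measure elapsed time with time(NULL), which only resolves whole seconds. A minimal sketch of a finer-grained timer follows, assuming a C++11 compiler; the helper name seconds_taken is hypothetical and not part of the original programs.

```cpp
#include <chrono>

// Runs `work` once and returns the elapsed wall-clock time in seconds.
// steady_clock is monotonic, so the result is never negative.
template <class F>
double seconds_taken(F &&work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

Wrapping the whole read-and-split loop in a lambda and passing it to this helper would give fractional seconds, which makes the lines-per-second figure less coarse on short runs.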

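【Question comments】:

How did you compile the C++ program? Were optimizations turned on?
@interjay: It's in the last comment in his source: g++ -Wall -O3 -o split1 split_1.cpp @JJC: How does your benchmark fare when you actually use dummy and spline respectively? Maybe Python removes the call to line.split() because it has no side effects?
What results do you get if you remove the splitting and leave only the reading of lines from stdin?
Python is written in C. That means there is an efficient way of doing it in C. Maybe there is a better way to split a string than using the STL?
Possible duplicate of Why do std::string operations perform poorly?

Tags: c++ python string split benchmarking

【Solution 1】:

I think the following code is better, using some C++17 features: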

// This code was untested when I wrote this post. I will test it when I
// have time, and I sincerely welcome others to test and modify it.

// C++17
#include <istream>     // For std::basic_istream.
#include <string_view> // New in C++17; sizeof(std::string_view) == 16 in libc++ on my x86-64 Debian 9.4 machine.
#include <string>
#include <utility>     // For std::move (C++11).

template <template <class...> class Container, class Allocator>
void split1(Container<std::string_view, Allocator> &tokens, 
            std::string_view str,
            std::string_view delimiter = " ") 
{
    /* 
     * The model of the input string:
     *
     * (optional) delimiter | content | delimiter | content | delimiter | 
     * ... | delimiter | content 
     *
     * Using std::string::find_first_not_of or 
     * std::string_view::find_first_not_of is a bad idea here, because it 
     * does the following:
     * 
     *     Finds the first character not equal to any of the characters 
     *     in the given character sequence.
     * 
     * That is, it does not treat your delimiter as a whole, but as a set
     * of characters.
     * 
     * This has two effects:
     *
     *  1. When your delimiter is longer than one character, the function
     *     won't behave as you expect.
     *
     *  2. When your delimiter is a single character, the function may
     *     carry extra overhead, because it still has to compare every
     *     character against a set of characters (of size one) in an
     *     inner loop.
     *
     * So, as a solution, I wrote the following code.
     *
     * The code below skips one leading delimiter. However, if there is
     * nothing between two delimiters, it still emits an (empty) token.
     *
     * Note: 
     * Here I use the standard library's substring search, but you can
     * switch to Boyer-Moore, KMP (takes additional memory), Rabin-Karp
     * or another algorithm to speed up your code.
     */

    // Establish loop invariant 1.
    typename std::string_view::size_type 
        next, 
        delimiter_size = delimiter.size(),  
        pos = str.find(delimiter) ? 0 : delimiter_size;

    // The loop invariants:
    //  1. pos is the start of the content that should be saved next.
    //  2. next is the position of the following delimiter, or
    //     std::string_view::npos if there is none.

    do {
        // Find the next delimiter, maintaining loop invariant 2.
        next = str.find(delimiter, pos);

        // Found a token, add it to the container. substr takes a count,
        // not an end position, and clamps it to the end of the string,
        // so next == npos yields the rest of the string.
        tokens.push_back(str.substr(pos, next - pos));

        // Skip the delimiter, maintaining loop invariant 1.
        //
        // When next == std::string_view::npos this sum wraps around
        // (unsigned arithmetic is well defined), but the loop terminates
        // before pos is used again.
        pos = next + delimiter_size;
    } while(next != std::string_view::npos);
}   

template <template <class...> class Container, class traits, class Allocator2, class Allocator>
void split2(Container<std::basic_string<char, traits, Allocator2>, Allocator> &tokens, 
            std::basic_istream<char, traits> &stream,
            char delimiter = ' ')
{
    std::basic_string<char, traits, Allocator2> item;

    // Unfortunately, std::getline can only accept a single-character 
    // delimiter.
    while(std::getline(stream, item, delimiter))
        // Move item into tokens. The moved-from string is left in a
        // valid but unspecified state, and std::getline erases it before
        // reading, so reusing item here is fine.
        tokens.push_back(std::move(item));
}

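Choice of container:

1. std::vector.

Assuming the internal array starts at size 1 and ends at size N, you will allocate and deallocate log2(N) times, and you will copy elements (2 ^ (log2(N) + 1) - 1) = (2N - 1) times. As pointed out in Is the poor performance of std::vector due to not calling realloc a logarithmic number of times?, this can perform poorly when the size of the vector is unpredictable and could be very large. If you can estimate its size, though, this is less of a problem.

2. std::list.

Each push_back takes constant time, but it will probably take longer than a single push_back on std::vector. Using a per-thread memory pool and a custom allocator can ease this problem.

3. std::forward_list.

Same as std::list, but uses less memory per element. It needs a wrapper class to work, because it has no push_back.

4. std::array.

If you know the limit of growth, you can use std::array. You can't use it directly, since it has no push_back, but you can define a wrapper. I think this is the fastest option, and it can save some memory if your estimate is quite accurate.

5. std::deque.

This option lets you trade memory for performance. There won't be the (2 ^ (N + 1) - 1) element copies of a doubling vector (here N counts the doublings, so the final size is 2 ^ N), only N allocations and no deallocations. You also get constant random-access time and the ability to add new elements at both ends.

According to std::deque on cppreference:

On the other hand, deques typically have large minimal memory cost; a deque holding just one element has to allocate its full internal array (e.g. 8 times the object size on 64-bit libstdc++; 16 times the object size or 4096 bytes, whichever is larger, on 64-bit libc++)

Or you can use a combination of these:

1. std::vector< std::array<T, 2 ^ M> >

This is similar to std::deque, except that this container does not support adding elements at the front. It is still faster than a plain vector, because it does not copy the underlying std::array elements (2 ^ (N + 1) - 1) times; it only copies the array of pointers (2 ^ (N - M + 1) - 1) times, and it allocates a new array only when the current one is full, without deallocating anything. You also get constant random-access time.

2. std::list< std::array<T, ...> >

Greatly eases the pressure of memory fragmentation. It allocates a new array only when the current one is full and never needs to copy anything. You still pay for an additional pointer compared to combination 1.

3. std::forward_list< std::array<T, ...> >

Same as 2, but uses the same amount of memory as combination 1.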

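【Comments】:

If you use std::vector with some reasonable initial size, like 128 or 256 (assuming a growth factor of 2), you avoid any copying at all for sizes up to that limit. You can then shrink the allocation to fit the number of elements you actually used, so it's not terrible for small inputs. This doesn't help much with the total number of copies in the very large N case, though. It's too bad std::vector can't use realloc to potentially allow mapping more pages at the end of the current allocation, so it's about 2x slower.
Is stringview::remove_prefix as cheap as just keeping track of your current position in a normal string? std::basic_string::find has an optional second argument pos = 0 that lets you start searching from an offset.
@Peter Cordes That is correct. I checked the libcxx implementation.
I also checked the libstdc++ implementation, and it is the same.
Your analysis of vector performance is off. Consider a vector that has an initial capacity of 1 on the first insert and doubles each time it needs more capacity. If you need to put in 17 items, the allocations are for 1, 2, 4, 8, 16, and finally 32 elements. That is 6 allocations in total (log2(size - 1) + 2, using integer log). The first allocation moves 0 strings, the second moves 1, then 2, then 4, then 8, and finally 16, for a total of 31 moves (2^(log2(size - 1) + 1) - 1). This is O(n), not O(2^n). This will greatly outperform std::list.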
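
The arithmetic in the last comment can be checked with a small simulation. This is only a sketch of the model the comment describes (capacity starts at 1 and doubles when full); real standard libraries may use a different growth factor, and the names GrowthCost and simulate_doubling are hypothetical.

```cpp
#include <cstddef>

struct GrowthCost {
    std::size_t allocations; // number of buffers allocated
    std::size_t moves;       // elements moved into a new buffer
};

// Simulates appending n elements to a vector whose capacity starts at 1
// and doubles whenever it is full. This models the assumption in the
// comment above, not any particular standard library.
GrowthCost simulate_doubling(std::size_t n) {
    GrowthCost cost{0, 0};
    std::size_t capacity = 0, size = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (size == capacity) {
            capacity = (capacity == 0) ? 1 : capacity * 2;
            ++cost.allocations;
            cost.moves += size; // existing elements move to the new buffer
        }
        ++size;
    }
    return cost;
}
```

For n = 17 this yields 6 allocations and 31 moves, matching the figures in the comment, and the move count stays below 2n for any n.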

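【Solution 2】:

As a guess, Python strings are reference-counted immutable strings, so no strings are copied around in the Python code, while C++ std::string is a mutable value type and is copied at the smallest opportunity.

If the goal is fast splitting, one would use constant-time substring operations, which means only referring to parts of the original string, as in Python (and Java, and C#...).

The C++ std::string class has one redeeming feature, though: it is standard, so it can be used to pass strings around safely and portably where efficiency is not a main consideration. But enough chat. Code follows, and on my machine this is of course faster than Python, since Python's string handling is implemented in C, which is a subset of C++ (he he):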

#include <iostream>                                                              
#include <string>
#include <sstream>
#include <time.h>
#include <vector>

using namespace std;

class StringRef
{
private:
    char const*     begin_;
    int             size_;

public:
    int size() const { return size_; }
    char const* begin() const { return begin_; }
    char const* end() const { return begin_ + size_; }

    StringRef( char const* const begin, int const size )
        : begin_( begin )
        , size_( size )
    {}
};

vector<StringRef> split3( string const& str, char delimiter = ' ' )
{
    vector<StringRef>   result;

    enum State { inSpace, inToken };

    State state = inSpace;
    char const*     pTokenBegin = 0;    // Init to satisfy compiler.
    for( auto it = str.begin(); it != str.end(); ++it )
    {
        State const newState = (*it == delimiter? inSpace : inToken);
        if( newState != state )
        {
            switch( newState )
            {
            case inSpace:
                result.push_back( StringRef( pTokenBegin, &*it - pTokenBegin ) );
                break;
            case inToken:
                pTokenBegin = &*it;
            }
        }
        state = newState;
    }
    if( state == inToken )
    {
        result.push_back( StringRef( pTokenBegin, str.data() + str.size() - pTokenBegin ) );
    }
    return result;
}

int main() {
    string input_line;
    vector<string> spline;
    long count = 0;
    int sec, lps;
    time_t start = time(NULL);

    cin.sync_with_stdio(false); //disable synchronous IO

    while(cin) {
        getline(cin, input_line);
        //spline.clear(); //empty the vector for the next line to parse

        //I'm trying one of the two implementations, per compilation, obviously:
//        split1(spline, input_line);  
        //split2(spline, input_line);

        vector<StringRef> const v = split3( input_line );
        count++;
    };

    count--; //subtract for final over-read
    sec = (int) time(NULL) - start;
    cerr << "C++   : Saw " << count << " lines in " << sec << " seconds." ;
    if (sec > 0) {
        lps = count / sec;
        cerr << "  Crunch speed: " << lps << endl;
    } else
        cerr << endl;
    return 0;
}

//compiled with: g++ -Wall -O3 -o split1 split_1.cpp -std=c++0x

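Disclaimer: I hope there aren't any bugs. I haven't tested the functionality, only checked the speed. But I think that even if there is a bug or two, correcting it won't significantly affect the speed.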
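
For readers on C++17, the same non-copying idea can be sketched with std::string_view in place of the hand-written StringRef. This is not the answer's code: the function name split_views is hypothetical, and it swaps the state machine for find / find_first_not_of while keeping the behavior of merging runs of a single delimiter character.

```cpp
#include <string_view>
#include <vector>

// Splits on runs of `delimiter` without copying any characters. The
// returned views point into the caller's buffer, so that buffer must
// outlive them.
std::vector<std::string_view> split_views(std::string_view str,
                                          char delimiter = ' ') {
    std::vector<std::string_view> result;
    std::string_view::size_type start = str.find_first_not_of(delimiter);
    while (start != std::string_view::npos) {
        const auto stop = str.find(delimiter, start);
        // substr clamps the count, so stop == npos takes the rest.
        result.push_back(str.substr(start, stop - start));
        start = str.find_first_not_of(delimiter, stop);
    }
    return result;
}
```

As with StringRef, a token can be turned into an owning string when needed, for example std::string(view), at the cost of the copy this approach otherwise avoids.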

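【Comments】:

Yes, Python strings are reference-counted objects, so Python does far less copying. They still contain null-terminated C strings under the hood, though, not (pointer, size) pairs like your code.
In other words: for higher-level work such as text manipulation, stick to a higher-level language, where tens of developers have cumulatively put in the effort to do it efficiently over tens of years. Or be prepared to work as much as all those developers did to have something comparable at a lower level.
@JJC: For the StringRef, you can copy the substring to a std::string very easily, just string( sr.begin(), sr.end() ).
I wish CPython strings were copied less. Yes, they are reference counted and immutable, but str.split() allocates new strings for each item using PyString_FromStringAndSize(), which calls PyObject_MALLOC(). So there is no optimization with a shared representation that exploits the immutability of strings in Python.
Maintainers: please don't introduce bugs by trying to fix perceived bugs (especially not with reference to cplusplus.com). TIA.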

【Solution 3】:

If you take the split1 implementation and change its signature to match that of split2 more closely, by changing this:

void split1(vector<string> &tokens, const string &str, const string &delimiters = " ")

to this:

void split1(vector<string> &tokens, const string &str, const char delimiters = ' ')

you get a more dramatic difference between split1 and split2, and a fairer comparison:

split1  C++   : Saw 10000000 lines in 41 seconds.  Crunch speed: 243902
split2  C++   : Saw 10000000 lines in 144 seconds.  Crunch speed: 69444
split1' C++   : Saw 10000000 lines in 33 seconds.  Crunch speed: 303030
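
The answer gives only the changed signature. A minimal sketch of what the full single-character variant might look like follows; the body is an assumption (the question's loop, using the char overloads of find_first_not_of and find_first_of), and it is named split1_char here only to keep it distinct from the original.

```cpp
#include <string>
#include <vector>

// Hypothetical full body for the single-character variant; the answer
// above only gives the signature. Runs of the delimiter are merged.
void split1_char(std::vector<std::string> &tokens, const std::string &str,
                 const char delimiter = ' ') {
    // Skip delimiters at the beginning.
    std::string::size_type lastPos = str.find_first_not_of(delimiter, 0);
    // Find the first delimiter after the token start.
    std::string::size_type pos = str.find_first_of(delimiter, lastPos);

    while (std::string::npos != pos || std::string::npos != lastPos) {
        // Found a token, add it to the vector.
        tokens.push_back(str.substr(lastPos, pos - lastPos));
        // Skip the run of delimiters, then find the next one.
        lastPos = str.find_first_not_of(delimiter, pos);
        pos = str.find_first_of(delimiter, lastPos);
    }
}
```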

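【Comments】:

【Solution 4】:

I'm not providing a better solution (at least performance-wise), but some additional data that may be interesting.

Using strtok_r (the reentrant variant of strtok):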

void splitc1(vector<string> &tokens, const string &str,
        const string &delimiters = " ") {
    char *saveptr;
    char *cpy, *token;

    cpy = (char*)malloc(str.size() + 1);
    strcpy(cpy, str.c_str());

    for(token = strtok_r(cpy, delimiters.c_str(), &saveptr);
        token != NULL;
        token = strtok_r(NULL, delimiters.c_str(), &saveptr)) {
        tokens.push_back(string(token));
    }

    free(cpy);
}

Additionally using C character strings for the parameters, and fgets for input:

void splitc2(vector<string> &tokens, const char *str,
        const char *delimiters) {
    char *saveptr;
    char *cpy, *token;

    cpy = (char*)malloc(strlen(str) + 1);
    strcpy(cpy, str);

    for(token = strtok_r(cpy, delimiters, &saveptr);
        token != NULL;
        token = strtok_r(NULL, delimiters, &saveptr)) {
        tokens.push_back(string(token));
    }

    free(cpy);
}

And, for cases where destroying the input string is acceptable:

void splitc3(vector<string> &tokens, char *str,
        const char *delimiters) {
    char *saveptr;
    char *token;

    for(token = strtok_r(str, delimiters, &saveptr);
        token != NULL;
        token = strtok_r(NULL, delimiters, &saveptr)) {
        tokens.push_back(string(token));
    }
}

The timings for these are as follows (including my results for the other variants from the question and for the accepted answer):

split1.cpp:  C++   : Saw 20000000 lines in 31 seconds.  Crunch speed: 645161
split2.cpp:  C++   : Saw 20000000 lines in 45 seconds.  Crunch speed: 444444
split.py:    Python: Saw 20000000 lines in 33 seconds.  Crunch Speed: 606060
split5.py:   Python: Saw 20000000 lines in 35 seconds.  Crunch Speed: 571428
split6.cpp:  C++   : Saw 20000000 lines in 18 seconds.  Crunch speed: 1111111

splitc1.cpp: C++   : Saw 20000000 lines in 27 seconds.  Crunch speed: 740740
splitc2.cpp: C++   : Saw 20000000 lines in 22 seconds.  Crunch speed: 909090
splitc3.cpp: C++   : Saw 20000000 lines in 20 seconds.  Crunch speed: 1000000

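As we can see, the solution from the accepted answer is still the fastest.

For anyone who wants to do further tests, I also put up a GitHub repository with all the programs from the question, the accepted answer, and this answer, plus a Makefile and a script to generate test data: https://github.com/tobbez/string-splitting

【Comments】:

I made a pull request (github.com/tobbez/string-splitting/pull/2) that makes the test a bit more realistic by "using" the data (counting the number of words and characters). With this change, all the C/C++ versions beat the Python versions (except the one based on Boost's tokenizer, which I added), and the real value of the "string view" based approaches (like split6) shines.
You should use memcpy, not strcpy, in case the compiler fails to notice that optimization. strcpy typically uses a slower startup strategy that strikes a balance between being fast for short strings and ramping up to full SIMD for long strings. memcpy knows the size right away and doesn't need any SIMD tricks to check for the end of an implicit-length string. (Not a big deal on modern x86.) Creating std::string objects with the (char*, len) constructor might be faster too, if you can get the length from saveptr-token. Obviously it would be fastest to just store the char* tokens :P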
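
The (char*, len) constructor suggestion can be sketched as follows. Note that this swaps strtok_r for strspn / strcspn, which is a different technique from the answer's: it never writes to the input, so the malloc and strcpy copy is not needed at all. The function name split_span is hypothetical.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Tokenizes without modifying or copying the input buffer: strspn skips
// a run of delimiters, strcspn measures the token that follows.
void split_span(std::vector<std::string> &tokens, const char *str,
                const char *delimiters = " ") {
    const char *p = str + std::strspn(str, delimiters);
    while (*p != '\0') {
        const std::size_t len = std::strcspn(p, delimiters);
        tokens.emplace_back(p, len); // (char*, length) constructor
        p += len;
        p += std::strspn(p, delimiters);
    }
}
```

Because it keeps no hidden state, this is thread safe in the same sense as strtok_r. Whether it is actually faster than the variants timed above would need to be measured.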

【Solution 5】:

void split5(vector<string> &tokens, const string &str, char delim=' ') {

    enum { do_token, do_delim } state = do_delim;
    int idx = 0, tok_start = 0;
    for (string::const_iterator it = str.begin() ; ; ++it, ++idx) {
        switch (state) {
            case do_token:
                if (it == str.end()) {
                    tokens.push_back (str.substr(tok_start, idx-tok_start));
                    return;
                }
                else if (*it == delim) {
                    state = do_delim;
                    tokens.push_back (str.substr(tok_start, idx-tok_start));
                }
                break;

            case do_delim:
                if (it == str.end()) {
                    return;
                }
                if (*it != delim) {
                    state = do_token;
                    tok_start = idx;
                }
                break;
        }
    }
}

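【Comments】:

Thanks n.m.! Unfortunately, this seems to run at about the same speed as the original (split1) implementation on my dataset and machine: $ /usr/bin/time cat test_lines_double | ./split8 21.89 real 0.01 user 0.47 sys C++   : Saw 20000000 lines in 22 seconds.  Crunch speed: 909090
On my machine: split1: 54s, split.py: 35s, split5: 16s. I have no idea why.
Hmm, does your data match the format I noted above? I assume you ran each one several times to eliminate transient effects such as initial disk cache population?

【Solution 6】:

I suspect this is because of the way std::vector is resized during push_back() calls. If you try using std::list, or std::vector::reserve() to reserve enough space for the sentences, you should get much better performance. Or you could combine both, as below for split1():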

// Needs #include <list> in addition to the question's headers.
void split1(vector<string> &tokens, const string &str,
        const string &delimiters = " ") {
    // Skip delimiters at beginning
    string::size_type lastPos = str.find_first_not_of(delimiters, 0);

    // Find the first delimiter after the token start
    string::size_type pos = str.find_first_of(delimiters, lastPos);
    list<string> token_list;

    while (string::npos != pos || string::npos != lastPos) {
        // Found a token, add it to the list
        token_list.push_back(str.substr(lastPos, pos - lastPos));
        // Skip delimiters
        lastPos = str.find_first_not_of(delimiters, pos);
        // Find the next delimiter
        pos = str.find_first_of(delimiters, lastPos);
    }
    tokens.assign(token_list.begin(), token_list.end());
}

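EDIT: The other obvious thing I see is that the Python variable dummy gets assigned each time but never modified, so it is not a fair comparison against C++. You should try modifying your Python code to initialize it with dummy = [] and then do dummy += line.split(). Can you report the runtime after this?

EDIT2: To make it even fairer, you can modify the while loop in the C++ code to be: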

    while(cin) {
        getline(cin, input_line);
        std::vector<string> spline; // create a new vector

        //I'm trying one of the two implementations, per compilation, obviously:
//        split1(spline, input_line);  
        split2(spline, input_line);

        count++;
    };

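【Comments】:

Thanks for the idea. I implemented it, and unfortunately this implementation is actually slower than the original split1. I also tried spline.reserve(16) before the loop, but this had no impact on the speed of my split1. There are only three tokens per line, and the vector is cleared after each line, so I didn't expect that to help much.
I tried your edit as well. Please see the updated question. Performance is now on par with split1.
I tried your EDIT2. The performance was a bit worse: $ /usr/bin/time cat test_lines_double | ./split7 33.39 real 0.01 user 0.49 sys C++   : Saw 20000000 lines in 33 seconds.  Crunch speed: 606060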
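
One reason reserve() makes no difference in the question's loop may be that clear() already keeps the vector's buffer, so after the first line there is nothing left to reallocate for three tokens. A small sketch that checks this behavior follows; the function name clear_keeps_capacity is hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Fills a vector with one line's worth of tokens, clears it, and reports
// whether the capacity survived the clear().
bool clear_keeps_capacity() {
    std::vector<std::string> spline;
    spline.push_back("foo.bar");
    spline.push_back("127.0.0.1");
    spline.push_back("home.foo.bar");

    const std::size_t before = spline.capacity();
    spline.clear(); // destroys the strings, keeps the buffer
    return spline.empty() && spline.capacity() == before;
}
```

Creating a fresh vector inside the loop, as in EDIT2, gives up this reuse, which is consistent with the slightly worse timing reported in the last comment.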

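【Solution 7】:

You're mistakenly assuming that your chosen C++ implementation is necessarily faster than Python's. String handling in Python is highly optimized. See this question for more: Why do std::string operations perform poorly?

【Comments】:

I'm not making any claims about overall language performance, only about my particular code. So, no assumptions here. Thanks for the good pointer to the other question. I'm not sure whether you're saying that this particular C++ implementation is suboptimal (your first sentence) or that C++ is just slower than Python at string processing (your second sentence). Also, if you know of a fast way to do what I'm trying to do in C++, please share it for everyone's benefit. Thanks. Just to clarify, I love Python, but I'm no blind fanboy, which is why I'm trying to learn the fastest way to do this.
@JJC: Given that Python's implementation is faster, I'd say yours is suboptimal. Keep in mind that language implementations can cut corners for you, but ultimately algorithmic complexity and hand optimization win out. In this case, Python has the upper hand for this use case by default.

【Solution 8】:

I suspect this is related to buffering on sys.stdin in Python, with no buffering in the C++ implementation.

See this post for details on how to change the buffer size, then try the comparison again: Setting smaller buffer size for sys.stdin?

【Comments】:

Hmm... I don't follow. Just reading lines (without the split) is faster in C++ than in Python (after including the cin.sync_with_stdio(false); line). That was the issue I had yesterday, referenced above.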