How to safely read a line from an std::istream? Answers

【Question title】: How to safely read a line from an std::istream?
【Posted】: 2014-01-05 21:05:27
【Question】:

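I want to safely read a line from an std::istream. The stream could be anything, e.g., a connection on a Web server or something processing files submitted by unknown sources. Many answers start out with the moral equivalent of this code: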

void read(std::istream& in) {
    std::string line;
    if (std::getline(in, line)) {
        // process the line
    }
}

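Given the possibly dubious source of in, using the above code leads to a vulnerability: a malicious agent can mount a denial-of-service attack against it by sending a huge line. I would therefore like to limit the line length to some rather high value, say 4 million chars. Although a few large lines may be encountered, it isn't viable to allocate a buffer of that size for each file and use std::istream::getline().

How can the maximum size of a line be limited, ideally without distorting the code too badly and without allocating large chunks of memory up front?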

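【Comments】:

How about a custom allocator that throws when asked to allocate more than a threshold? Construct a basic_string object using that allocator and read into it.
Maybe inherit from std::string and provide a max_size() function that returns something small?
@Praetorian: I guess using an allocator is an option. Sadly, it changes the type of the std::string.
You could replace the streambuf of in with your own implementation that wraps the original streambuf and sends a '\n' once a certain number of characters has been read.
@DietmarKühl: Maybe you could simply check the number of characters in the buffer before extracting: if (in.rdbuf()->in_avail() > max_size) { /* end */ }...

Tags: c++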

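【Solution 1】:

You could write your own version of std::getline that takes the maximum number of characters to read as a parameter, called getline_n or something similar.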

#include <string>
#include <iostream>

template<typename CharT, typename Traits, typename Alloc>
auto getline_n(std::basic_istream<CharT, Traits>& in, std::basic_string<CharT, Traits, Alloc>& str, std::streamsize n) -> decltype(in) {
    std::ios_base::iostate state = std::ios_base::goodbit;
    bool extracted = false;
    const typename std::basic_istream<CharT, Traits>::sentry s(in, true);
    if(s) {
        try {
            str.erase();
            typename Traits::int_type ch = in.rdbuf()->sgetc();
            for(; ; ch = in.rdbuf()->snextc()) {
                if(Traits::eq_int_type(ch, Traits::eof())) {
                    // eof spotted, quit
                    state |= std::ios_base::eofbit;
                    break;
                }
                else if(str.size() == n) {
                    // maximum number of characters met, quit
                    extracted = true;
                    in.rdbuf()->sbumpc();
                    break;
                }
                else if(str.max_size() <= str.size()) {
                    // string too big
                    state |= std::ios_base::failbit;
                    break;
                }
                else {
                    // character valid
                    str += Traits::to_char_type(ch);
                    extracted = true;
                }
            }
        }
        catch(...) {
            in.setstate(std::ios_base::badbit);
        }
    }

    if(!extracted) {
        state |= std::ios_base::failbit;
    }

    in.setstate(state);
    return in;
}

int main() {
    std::string s;
    getline_n(std::cin, s, 10); // maximum of 10 characters
    std::cout << s << '\n';
}

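Might be a bit overkill though.

【Comments】:

Writing a version of getline() may be an option (especially as I have implemented the whole IOStreams library in the past). I don't know why I didn't think of it: maybe I was too focused on the two other solutions (only one of which has been mentioned so far).
+1. The only thing I'd question is the call to reserve, since the OP uses the 4 MB size as a safeguard but will probably handle much smaller strings. It might be better to let users do the reserve themselves.
@DaveS Funny, the original version I wrote didn't have the reserve call, but I added it for good measure. If it were up to me, I wouldn't do it either. I think I'll remove it.
I'm confused: where does this code actually check for a newline?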

【Solution 2】:

Replace std::getline by creating a wrapper around std::istream::getline:

std::istream& my::getline( std::istream& is, std::streamsize n, std::string& str, char delim )
    {
    try
       {
       str.resize(n);
       is.getline(&str[0],n,delim);
       str.resize(is.gcount());
       return is;
       }
    catch(...) { str.resize(0); throw; }
    }

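If you want to avoid excessive temporary memory allocations, you can use a loop that grows the allocation as needed (probably doubling in size on each pass). Don't forget that exceptions may or may not be enabled on the istream object.

Here is a version with a more efficient allocation strategy: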

std::istream& my::getline( std::istream& is, std::streamsize n, std::string& str, char delim )
    {
    std::streamsize base=0;
    do {
       try
          {
          is.clear();
          std::streamsize chunk=std::min(n-base,std::max(static_cast<std::streamsize>(2),base));
          if ( chunk == 0 ) break;
          str.resize(base+chunk);
          is.getline(&str[base],chunk,delim);
          }
       catch( std::ios_base::failure ) { if ( !is.gcount () ) str.resize(0), throw; }
       base += is.gcount();
       } while ( is.fail() && is.gcount() );
    str.resize(base);
    return is;
    }

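【Comments】:

As implemented, both versions leave a terminating '\0' in the resulting string. That is not normal for a C++ string, so an improvement is to pop the last character before returning. Note that resizing based on scanning the string for '\0' could be considered wrong, since '\0' may be a valid character in the string (this is not C). Also, I don't know exactly how this interacts with Microsoft's "text" mode, where lines of text are usually terminated by two characters. If I understand the documentation, a '\r' would be left in the string because is.getline() is "unformatted".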
...if ( !str.empty() ) str.resize(str.size()-1);
try { my::getline( is, 4096, s, '\n' ); } catch ( std::ios::failure const & ) {}

【Solution 3】:

There is already such a getline function as a member function of istream; you just need to wrap it for buffer management.

#include <assert.h>
#include <istream>
#include <stddef.h>         // ptrdiff_t
#include <string>           // std::string, std::char_traits

typedef ptrdiff_t Size;

namespace my {
    using std::istream;
    using std::string;
    using std::char_traits;

    istream& getline(
        istream& stream, string& s, Size const buf_size, char const delimiter = '\n'
        )
    {
        s.resize( buf_size );  assert( s.size() > 1 );
        stream.getline( &s[0], buf_size, delimiter );
        if( !stream.fail() )
        {
            Size const n = char_traits<char>::length( &s[0] );
            s.resize( n );      // Downsizing.
        }
        return stream;
    }
}  // namespace my

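【Comments】:

&s[0] makes me uneasy
@Inverse: There is nothing to be uneasy about. You might as well be uneasy about division and reason that it might turn into a division by zero. In this code the relevant constraint on the value (the string length must be > 0) is expressed by an assert, which is generally good practice and makes the code safer than it would be without it. With the assert, one has to work hard to produce the nasal demons of UB, namely by calling the function with an invalid argument for buf_size and doing so with NDEBUG defined so that the assert is suppressed. That is why you should use assert.
Oh, what I meant is that, as I understand it, the data in a std::string is not guaranteed to be contiguous, only .c_str() is. So &v[0] is fine for std::vector, but not for std::string.
@Inverse: The std::string buffer has been guaranteed to be contiguous since the Lillehammer meeting in 2005. Sure, that wasn't formal until C++11, but no compiler vendor would have been able to sell a compiler that did otherwise, given how much existing code relied on it and the known wording of the upcoming C++11 standard. At the time, one way to identify fake language lawyers was to discuss whether this could be relied on. ;-)
Would a copy-on-write std::string be a problem? (There certainly were such implementations in the early 2000s, although C++11 prohibits them.)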

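【Solution 4】:

Based on the comments and answers, there seem to be three approaches:

Write a custom version of getline(), possibly using the std::istream::getline() member internally to get the actual characters.
Use a filtering stream buffer to limit the amount of data potentially received.
Instead of reading into a std::string, use a string instantiation with a custom allocator that limits the amount of memory stored in the string.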

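Not all of the suggestions came with code. This answer provides code for all approaches and a bit of discussion of all three. Before going into implementation details, it is worth pointing out that there are multiple choices for what should happen when an overly long input is received:

Reading an overlong line could result in a successful read of a partial line, i.e., the resulting string contains the content read and the stream has no error flags set. Doing so means, however, that it isn't possible to distinguish between a line hitting exactly the limit and one being too long. Since the limit is somewhat arbitrary, that probably doesn't really matter.
Reading an overlong line could be considered a failure (i.e., setting std::ios_base::failbit and/or std::ios_base::badbit) and, since the read failed, yield an empty string. Yielding an empty string obviously prevents looking at the string read so far to see what is going on.
Reading an overlong line could provide the partial line read and also set error flags on the stream. This seems reasonable behavior: it both detects that something is wrong and makes the input available for potential inspection.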

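Although there are already multiple code examples implementing a limited version of getline(), here is another one! I think it is simpler (albeit possibly slower; performance can be dealt with when necessary) and it also retains the interface of std::getline(): it uses the stream's width() to communicate the limit (arguably, taking width() into account is a reasonable extension to std::getline()):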

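#include <istream>
#include <iterator>  // std::istreambuf_iterator
#include <limits>
#include <string>

template <typename cT, typename Traits, typename Alloc>
std::basic_istream<cT, Traits>&
safe_getline(std::basic_istream<cT, Traits>& in,
             std::basic_string<cT, Traits, Alloc>& value,
             cT delim)
{
    typedef std::basic_string<cT, Traits, Alloc> string_type;
    typedef typename string_type::size_type size_type;

    // noskipws = true: like std::getline(), do not skip leading whitespace
    typename std::basic_istream<cT, Traits>::sentry cerberos(in, true);
    if (cerberos) {
        value.clear();
        size_type width(in.width(0));  // fetch the limit and reset it to 0
        if (width == 0) {
            width = std::numeric_limits<size_type>::max();
        }
        bool found_delim = false;
        std::istreambuf_iterator<cT, Traits> it(in), end;
        for (; value.size() != width && it != end; ++it) {
            if (!Traits::eq(delim, *it)) {
                value.push_back(*it);
            }
            else {
                ++it;  // consume the delimiter without storing it
                found_delim = true;
                break;
            }
        }
        if (value.size() == width) {
            // limit reached: keep the partial line and report failure
            in.setstate(std::ios_base::failbit);
        }
        else if (!found_delim) {
            // input ended first; this is a failure only if nothing was read
            in.setstate(value.empty()
                ? std::ios_base::eofbit | std::ios_base::failbit
                : std::ios_base::eofbit);
        }
    }
    return in;
}

// Overload using '\n' as the delimiter, mirroring std::getline().
template <typename cT, typename Traits, typename Alloc>
std::basic_istream<cT, Traits>&
safe_getline(std::basic_istream<cT, Traits>& in,
             std::basic_string<cT, Traits, Alloc>& value)
{
    return safe_getline(in, value, in.widen('\n'));
}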

This version of getline() is used just like std::getline(), but when it seems reasonable to limit the amount of data read, width() is set, e.g.:

std::string line;
if (safe_getline(in >> std::setw(max_characters), line)) {
    // do something with the input
}

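Another approach is to just use a filtering stream buffer to limit the amount of input: the filter simply counts the number of characters processed and limits the amount to a suitable number of characters. This approach is actually easier to apply to an entire stream than to an individual line: when processing just one line, the filter can't simply fetch buffers full of characters from the underlying stream, because there is no reliable way to put the characters back. Implementing an unbuffered version is still simple but probably not particularly efficient: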

template <typename cT, typename Traits = std::char_traits<cT> >
class basic_limitbuf
    : std::basic_streambuf <cT, Traits> {
public:
    typedef Traits                    traits_type;
    typedef typename Traits::int_type int_type;

private:
    std::streamsize                   size;
    std::streamsize                   max;
    std::basic_istream<cT, Traits>*   stream;
    std::basic_streambuf<cT, Traits>* sbuf;

    int_type underflow() {
        if (this->size < this->max) {
            return this->sbuf->sgetc();
        }
        else {
            this->stream->setstate(std::ios_base::failbit);
            return traits_type::eof();
        }
    }
    int_type uflow()     {
        if (this->size < this->max) {
            ++this->size;
            return this->sbuf->sbumpc();
        }
        else {
            this->stream->setstate(std::ios_base::failbit);
            return traits_type::eof();
        }
    }
public:
    basic_limitbuf(std::streamsize max,
                   std::basic_istream<cT, Traits>& stream)
        : size()
        , max(max)
        , stream(&stream)
        , sbuf(this->stream->rdbuf(this)) {
    }
    ~basic_limitbuf() {
        std::ios_base::iostate state = this->stream->rdstate();
        this->stream->rdbuf(this->sbuf);
        this->stream->setstate(state);
    }
};

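This stream buffer is set up to insert itself upon construction and to remove itself upon destruction. That is, it can be used simply like this: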

std::string line;
basic_limitbuf<char> sbuf(max_characters, in);
if (std::getline(in, line)) {
    // do something with the input
}

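It would also be easy to add a manipulator that sets the limit. One advantage of this approach is that none of the reading code needs to be touched if the total size of the stream can be limited: the filter can be set up right after creating the stream. When there is no need to back out the filter, the filter can also use a buffer, which would greatly improve performance.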

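The third approach suggested is to use a std::basic_string with a custom allocator. Two aspects of the allocator approach are a bit awkward:

The string being read actually has a type that isn't immediately convertible to std::string (although the conversion isn't hard to do).
The maximum array size can be limited easily, but the string ends up with some more or less random size below that limit: when an allocation fails, an exception is thrown and no attempt is made to grow the string by a smaller amount.

Here is the necessary code for an allocator that limits the allocated size: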

template <typename T>
struct limit_alloc
{
private:
    std::size_t max_;
public:
    typedef T value_type;
    limit_alloc(std::size_t max): max_(max) {}
    template <typename S>
    limit_alloc(limit_alloc<S> const& other): max_(other.max()) {}
    std::size_t max() const { return this->max_; }
    T* allocate(std::size_t size) {
        // size counts elements, so convert to bytes for operator new[]
        return size <= max_
            ? static_cast<T*>(operator new[](size * sizeof(T)))
            : throw std::bad_alloc();
    }
    void  deallocate(T* ptr, std::size_t) {
        operator delete[](ptr);
    }
};

template <typename T0, typename T1>
bool operator== (limit_alloc<T0> const& a0, limit_alloc<T1> const& a1) {
    return a0.max() == a1.max();
}
template <typename T0, typename T1>
bool operator!= (limit_alloc<T0> const& a0, limit_alloc<T1> const& a1) {
    return !(a0 == a1);
}

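The allocator would be used something like this (the code compiles with a recent version of clang but not with gcc):

std::basic_string<char, std::char_traits<char>, limit_alloc<char> >
    tmp((limit_alloc<char>(max_chars)));  // extra parentheses: without them this declares a function
if (std::getline(in, tmp)) {
    std::string line(tmp.begin(), tmp.end());
    // do something with the input
}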

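In summary, there are multiple approaches, each with its own small drawback, but each reasonably viable for the stated goal of limiting denial-of-service attacks based on overlong lines:

Using a custom version of getline() means the reading code needs to be changed.
Using a custom stream buffer is slow unless the size of the entire stream can be limited.
Using a custom allocator gives less control and requires some changes to the reading code.

【Comments】: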