multipart/form-data losing bytes on large files

【Question title】: multipart/form-data losing bytes on large files
【Posted】: 2018-12-28 05:59:48
【Question】:

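I am writing a multipart/form-data parser in C++, because the available options seem to be very scarce.

My initial approach was to use istream::getline to buffer one line (or partial line) at a time so that I could detect the boundary. This works for smaller files but not for larger ones. For large (>50MB) files, cin's badbit occasionally gets set, and after clearing the istream I noticed that I had lost bytes. I don't know why, which is the point of this question.

However, if I increase the buffer size to 4MB and use istream::read to dump the entire multipart/form-data request to a file, I lose no bytes and cin's badbit is never set. I can then reopen the dump file as an ifstream in place of cin, and my original small-buffer getline approach works perfectly.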

Any insight into what is going on here? Could it be a side effect of FastCGI or Lighttpd?

EDIT:

Here are the relevant code snippets:

#include <fcgio.h>
//...

int main()
{
    //...
    FCGX_Request request;

    FCGX_Init();
    FCGX_InitRequest(&request, 0, 0);

    const size_t LEN = 1024;
    vector<char> v(LEN); // Workaround for getting duplicates of every byte?
    while (FCGX_Accept_r(&request) == 0) {
        fcgi_streambuf cin_fcgi_streambuf(request.in, &v[0], v.size());
        //... (eventually calls _parseMultipartFormFieldFile)
    }

    //...
}

/*
    Extract a file from a multipart form section

    istream should already have boundary and headers removed up through the final "\r\n"

    Note that there are a lot of potential off-by-one errors here. Need to pay special attention
    to gcount() and what is present in the buffer in each given scenario. Hence why you see:

    gcount
    gcount-1
    gcount-2

    These offsets are due to null terminator sometimes being appended, sometimes not, and/or '\r' being present or not.

    It is possible for a few rare things to happen that will break this function:

    1. Malicious content length

    Client could lie about content length and send much more than we have room for. Should count bytes eventually, but easy enough to configure webserver to protect us.
*/
bool _parseMultipartFormFieldFile(
    Request & req,
    istream & input,
    const string & name,
    const string & upload_dir,
    const string & boundary,
    const string & end_boundary
)
{
    static unsigned int file_id = 0; //used to generate unique file names

    //Need fixed buffer size to prevent running out of RAM (malicious or not)
    char buf[4096];

    string file_name = upload_dir + ECPP_TMP_FILE + to_string(file_id++);

    ofstream f(file_name, std::ofstream::out | std::ofstream::binary);
    if (!f.is_open())
        return false;

    bool eof = false;
    while (!eof) {
        //Out of space in flash?
        if (!f.good())
            return false;

        f.flush();

        input.getline(buf, sizeof(buf));
        unsigned int gcount = input.gcount();

        if (input.bad()) {
            //Crap! If we're here, we have most likely lost a few bytes...
            input.clear();
            continue;
        }
        else if (input.eof()) {
            //If we are here, the multipart/form-data request was malformed
            f.close();
            remove(file_name.c_str()); //Delete malformed file
            return false;
        }
        else if (input.fail()) {
            //If we are in this condition, it means we encountered a line longer than our buffer
            //There is no null terminator in this case, so write out what we have
            f.write(buf, gcount);
            input.clear(); //clear fail flag
            continue;
        }

        if (gcount >= 2 && buf[gcount-2] == '\r') {
            string peek = peekLine(input); //uses putback - modifies gcount()
            if (peek == boundary || peek == end_boundary) {
                //If we are in here, it means we encountered the last line in the section
                //That means there is a trailing '\r' which we need to remove in addition to the null terminator
                f.write(buf, gcount-2); // Remove null terminator and \r before writing
                req.file[name] = file_name;
                eof = true;
                continue;
            }
        }

        //If we are here it means we read in the entire line.
        //Write out everything (minus the null terminator), and also add in the newline that was stripped by getline()
        f.write(buf, gcount-1);
        f.write("\n", 1);
    }

    return true;
}

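So, in short, the problem is that if I pass cin_fcgi_streambuf to _parseMultipartFormFieldFile I lose bytes (badbit is triggered), but if I indiscriminately dump cin_fcgi_streambuf to a file with char buf[4000000] + input.read(), and then pass an ifstream of that file to _parseMultipartFormFieldFile, everything works fine.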

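【Comments】:

Show your code. 50MB is not very large on a modern system, so I don't think you have a size-related problem. badbit is generally unrecoverable, so resetting it is not safe unless you can cope with reads that may have been lost.
Are you using a library to handle the FastCGI protocol, or just your own code?
@FireLancer I'm using the "official" FastCGI library from the (now defunct) fastcgi.com (I put a copy on GitHub: github.com/RPGillespie6/FastCGI)
@FireLancer I've updated the question with the (hopefully) relevant code. I can provide more context if you think it would help, but that would just turn the question into a giant wall of code.
@Gillespie Did you check fcgiapp.c, around line 2215? reqDataPtr->in = NewReader(reqDataPtr, 8192, 0); So I think you have to allocate a buffer of at most 8192 bytes.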

Tags: c++ multipartform-data fastcgi lighttpd

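【Answer 1】:

The main point is that input.getline returns the line without its CRLF. So what happens if you post a binary file? Also, your example source code cannot handle a request that posts multiple files, because it opens only one file stream. That is why you need to change the structure of your source code.

With this approach you can upload data or files of any size. Try this solution: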

const char* ctype = "multipart/form-data; boundary=----WebKitFormBoundaryfm9qwXVLSbFKKR88";
size_t content_length = 1459606;
http_payload* hp = new http_payload(ctype, content_length);
if (hp->is_multipart()) {
    int ret = hp->read_all("C:\\temp\\");
    if (ret < 0) {
        std::cout << hp->get_last_error() << std::endl;
        hp->clear();
    }
    else {
        std::string dir_str("C:\\upload_dir\\");
        ret = hp->read_files([&dir_str](http_posted_file* file) {
            std::string path(dir_str.c_str());
            path.append(file->get_file_name());
            file->save_as(path.c_str());
            file->clear(); path.clear();
            std::string().swap(path);
        });
        hp->clear();
        std::cout << "Total file uploaded :" << ret << std::endl;
    }
}
else {
    int ret = hp->read_all();
    if (ret < 0) {
        std::cout << hp->get_last_error() << std::endl;
        hp->clear();
    }
    else {
        std::cout << "Posted data :" << hp->get_body() << std::endl;
        hp->clear();

    }
}

https://github.com/safeonlineworld/web_jsx/blob/0d08773c95f4ae8a9799dbd29e0a4cd84413d108/src/web_jsx/core/http_payload.cpp#L402

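【Comments】:

This doesn't really address the question.
@LightnessRacesBY-SA3.0 Did you check the source code on Git?
No, but even if the answer is there, your answer should be here on the site.
@LightnessRacesBY-SA3.0 Here I wanted to explain how to read a posted file (multipart/form-data) from an istream without losing data. But you say this doesn't address the question. How so? Please explain.
It doesn't tell the OP what is wrong with their approach or how to fix it directly. A complete replacement solution is a blunt instrument here, especially when we haven't yet ruled out external factors.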