Encoding problem while processing a multipart request on Indy HTTP server: answers

【Question title】: Encoding problem while processing a multipart request on Indy HTTP server
【Posted】: 2022-01-07 07:28:36
【Question】:

I have a web server based on TIdHTTPServer. It is built in Delphi Sydney. From a web page I receive the following multipart/form-data post stream:

-----------------------------16857441221270830881532229640 
Content-Disposition: form-data; name="d"

83AAAFUaVVs4Q07z
-----------------------------16857441221270830881532229640 
Content-Disposition: form-data; name="dir"

Upload
-----------------------------16857441221270830881532229640 
Content-Disposition: form-data; name="file_name"; filename="ÄŤeskĂˇ teÄŤka.png"
Content-Type: image/png

PNG_DATA    
-----------------------------16857441221270830881532229640--

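The problem is that the text parts are not received correctly. I read "Indy MIME decoding of Multipart/Form-Data Requests returns trailing CR/LF" and changed the transfer encoding to 8bit, which helps to receive the file correctly, but the received values are still wrong (dir should be Upload and the file name should be česká tečka.png). This is what I get: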

d=83AAAFUaVVs4Q07z
dir=UploadW
??esk?? te??ka.png 75

To demonstrate the problem, I simplified my code to a console application (note that the MIME.txt file contains the same data as the post stream above):

program MIMEMultiPartTest;

{$APPTYPE CONSOLE}

{$R *.res}

uses
  System.Classes, System.SysUtils,
  IdGlobal, IdCoder, IdMessage, IdMessageCoder, IdGlobalProtocols, IdCoderMIME, IdMessageCoderMIME,
  IdCoderQuotedPrintable, IdCoderBinHex4;


procedure ProcessAttachmentPart(var Decoder: TIdMessageDecoder; var MsgEnd: Boolean);
var
  MS: TMemoryStream;
  Name: string;
  Value: string;
  NewDecoder: TIdMessageDecoder;
begin
  MS := TMemoryStream.Create;
  try
    // http://stackoverflow.com/questions/27257577/indy-mime-decoding-of-multipart-form-data-requests-returns-trailing-cr-lf
    TIdMessageDecoderMIME(Decoder).Headers.Values['Content-Transfer-Encoding'] := '8bit';
    TIdMessageDecoderMIME(Decoder).BodyEncoded := False;
    NewDecoder := Decoder.ReadBody(MS, MsgEnd);
    MS.Position := 0; // necessary?
    if Decoder.Filename <> EmptyStr then // it is an attachment
    begin
      try
        Writeln(Decoder.Filename + ' ' + IntToStr(MS.Size));
      except
        FreeAndNil(NewDecoder);
        Writeln('Error processing MIME');
      end;
    end
    else // it is a parameter
    begin
      Name := ExtractHeaderSubItem(Decoder.Headers.Text, 'name', QuoteHTTP);
      if Name <> EmptyStr then
      begin
        Value := string(PAnsiChar(MS.Memory));
        try
          Writeln(Name + '=' + Value);
        except
          FreeAndNil(NewDecoder);
          Writeln('Error processing MIME');
        end;
      end;
    end;
    Decoder.Free;
    Decoder := NewDecoder;
  finally
    MS.Free;
  end;
end;

function ProcessMultiPart(const ContentType: string; Stream: TStream): Boolean;
var
  Boundary: string;
  BoundaryStart: string;
  BoundaryEnd: string;
  Decoder: TIdMessageDecoder;
  Line: string;
  BoundaryFound: Boolean;
  IsStartBoundary: Boolean;
  MsgEnd: Boolean;
begin
  Result := False;
  Boundary := ExtractHeaderSubItem('multipart/form-data; boundary=---------------------------16857441221270830881532229640', 'boundary', QuoteHTTP);
  if Boundary <> EmptyStr then
  begin
    BoundaryStart := '--' + Boundary;
    BoundaryEnd := BoundaryStart + '--';
    Decoder := TIdMessageDecoderMIME.Create(nil);
    try
      TIdMessageDecoderMIME(Decoder).MIMEBoundary := Boundary;
      Decoder.SourceStream := Stream;
      Decoder.FreeSourceStream := False;
      BoundaryFound := False;
      IsStartBoundary := False;
      repeat
        Line := ReadLnFromStream(Stream, -1, True);
        if Line = BoundaryStart then
        begin
          BoundaryFound := True;
          IsStartBoundary := True;
        end
        else
        begin
          if Line = BoundaryEnd then
            BoundaryFound := True;
        end;
      until BoundaryFound;
      if BoundaryFound and IsStartBoundary then
      begin
        MsgEnd := False;
        repeat
          TIdMessageDecoderMIME(Decoder).MIMEBoundary := Boundary;
          Decoder.SourceStream := Stream;
          Decoder.FreeSourceStream := False;
          Decoder.ReadHeader;
          case Decoder.PartType of
            mcptText,
            mcptAttachment:
              begin
                ProcessAttachmentPart(Decoder, MsgEnd);
              end;
            mcptIgnore:
              begin
                Decoder.Free;
                Decoder := TIdMessageDecoderMIME.Create(nil);
              end;
            mcptEOF:
              begin
                Decoder.Free;
                MsgEnd := True;
              end;
          end;
        until (Decoder = nil) or MsgEnd;
        Result := True;
      end
    finally
      Decoder.Free;
    end;
  end;
end;

var
  Stream: TMemoryStream;
begin
  Stream := TMemoryStream.Create;
  try
    Stream.LoadFromFile('MIME.txt');
    ProcessMultiPart('multipart/form-data; boundary=---------------------------16857441221270830881532229640', Stream);
  finally
    Stream.Free;
  end;
  Readln;
end.

Can anybody help me find what is wrong with my code? Thank you.

【Comments】:

Tags: delphi multipartform-data indy mime

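【Answer 1】:

Your call to ExtractHeaderSubItem() in ProcessMultiPart() is wrong. It needs to pass in the ContentType string parameter, not a hard-coded string literal.

Your call to ExtractHeaderSubItem() in ProcessAttachmentPart() is also wrong. It needs to pass in only the content of the Content-Disposition header, not the entire Headers.Text. ExtractHeaderSubItem() is designed to operate on only 1 header at a time.

Regarding the dir MIME part, the body data ends up as 'UploadW' instead of 'Upload' because you are not taking MS.Size into account when assigning MS.Memory to your Value string. The TMemoryStream data is not null-terminated! So you need to use SetString() instead of the := operator, e.g.: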

var
  Value: AnsiString;
...
SetString(Value, PAnsiChar(MS.Memory), MS.Size);
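
The same over-read can be reproduced outside Delphi. The following is a minimal Python sketch, not Indy code: the buffer contents are made up to mimic the 'UploadW' output from the question, and `payload_size` is a hypothetical stand-in for MS.Size. It contrasts reading a buffer up to the first NUL byte with reading exactly the known number of bytes, which is what SetString() with an explicit length does.

```python
import ctypes

# Hypothetical buffer: 6 payload bytes ("Upload"), then a stale byte ("W")
# left over in the allocation, then a NUL that happens to follow it.
buffer = ctypes.create_string_buffer(b"UploadW", 8)
payload_size = 6  # plays the role of MS.Size

# Reading as a C string stops at the first NUL, not at the payload size,
# so the stale byte is included.
as_c_string = ctypes.string_at(buffer)

# Reading with an explicit length stops exactly where the payload ends.
with_length = ctypes.string_at(buffer, payload_size)

print(as_c_string)
print(with_length)
```

Which stale bytes follow the payload in a real TMemoryStream depends on what the allocation held before, so the extra characters can differ from run to run.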

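Regarding Decoder.FileName, that value is not affected by the Content-Transfer-Encoding header at all. MIME headers simply do not allow unencoded Unicode characters. Currently, Indy's MIME decoder supports RFC 2047-style encoding of Unicode characters in headers, per RFC 7578 Section 5.1.3, but your stream data is not using that format. It looks like your data is using raw UTF-8 octets ¹ (which 5.1.3 also mentions as a possible encoding, but the decoder does not currently look for it). So you will likely have to extract and decode the original filename manually.

If you know the filename will always be encoded as UTF-8, you could try setting Indy's global IdGlobal.GIdDefaultTextEncoding variable to encUTF8 (it defaults to encASCII), and then Decoder.FileName should be accurate. However, that is a global setting, so it may have unwanted side effects elsewhere in Indy, depending on context and data. I would therefore suggest setting GIdDefaultTextEncoding to enc8Bit instead, so that unwanted side effects are minimized and Decoder.FileName contains the original raw bytes as-is (just extended to 16-bit characters). That way, you can recover the original filename bytes by passing Decoder.FileName as-is to IndyTextEncoding_8Bit.GetBytes(), and then decode them as needed (for example with IndyTextEncoding_UTF8.GetString(), after validating that the bytes are valid UTF-8).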
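
The effect of the three interpretations (ASCII, 8-bit, UTF-8) can be sketched at byte level. This is a Python illustration, not Indy code. It assumes that an ASCII-only decode replaces every octet above 0x7F with "?" (which matches the output shown in the question) and uses Python's latin-1 codec as a stand-in for an 8-bit encoding that maps each octet to the character with the same code.

```python
# UTF-8 octets of the intended file name "česká tečka.png",
# as they would appear in the Content-Disposition header on the wire.
raw = "\u010desk\u00e1 te\u010dka.png".encode("utf-8")

# ASCII-only view: every octet above 0x7F is replaced by "?".
ascii_view = "".join(chr(b) if b < 0x80 else "?" for b in raw)

# 8-bit view: every octet becomes the character with the same code (0..255),
# so no information is lost even though the text looks wrong.
eight_bit_view = raw.decode("latin-1")

# Round trip: 8-bit characters back to octets, then decode them as UTF-8.
recovered = eight_bit_view.encode("latin-1").decode("utf-8")

print(ascii_view)
print(recovered)
```

The ASCII view loses the original octets for good, while the 8-bit view keeps them, which is why the 8-bit route allows the name to be recovered afterwards.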

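¹ However, ÄŤeskĂˇ teÄŤka.png is not the proper UTF-8 form of česká tečka.png. It looks like the data may have been double-encoded, i.e. česká tečka.png was UTF-8 encoded, and then the resulting bytes were UTF-8 encoded again.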

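【Comments】:

The stream data got malformed while I was editing my post on SO. Sorry about that. I fixed it in the original post. The post stream comes from RequestInfo.PostStream.SaveToFile. The call to ExtractHeaderSubItem() in ProcessMultiPart() is only there for the demonstration.
When I upload the same file using TIdHTTP + TIdMultiPartFormDataStream, it works, because it adds an encoding specification to every MIME part and encodes them. But I need it to work when the request comes from a web browser, too. How can I handle that on the server? Or how can I change the web form to be compatible with the server?
I have updated my answer regarding the dir part data. There is a bug in your code, and I explained how to fix it. As for the filename, if the browser is sending it in a format that Indy does not support yet (and I can't understand why a web browser would double-UTF-encode the filename the way you have shown), then you will just have to decode the filename manually, as I stated in my answer.
Remy, thanks for fixing the dir value. As for the filename, it seems to me that the data is UTF-8 encoded only once, not twice as you wrote. I can decode it, but it is quite complicated, because I need to exclude the binary part (PNG_DATA in my case) from the UTF-8 decoding, since it is not encoded. So I am very close to processing the whole message manually, without the Indy classes...
I have updated my answer regarding the FileName decoding.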
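
The single-versus-double encoding question raised in these comments can be checked at byte level. The sketch below is Python, not Indy code, and rests on one assumption: that the garbled name in the question is the result of UTF-8 octets being displayed as Windows-1250 text at some point while the post was edited. Under that assumption, encoding the intended name once as UTF-8 reproduces the garbled form exactly.

```python
# The intended file name, "česká tečka.png", written with escapes.
intended = "\u010desk\u00e1 te\u010dka.png"

# Encode once as UTF-8, then misread those octets as Windows-1250 text.
misread = intended.encode("utf-8").decode("cp1250")

# Undo the misreading: back to octets, then decode as UTF-8.
restored = misread.encode("cp1250").decode("utf-8")

print(misread)
print(restored == intended)
```

If the printed form matches the filename in the question's post stream, a single UTF-8 encoding plus a wrong display code page is enough to explain it. Which code page was actually involved is not stated anywhere in the thread.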

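【Answer 2】:

Nowadays the filename parameter should only be added as a fallback, while filename* should be added to state clearly which text encoding the file name uses. Otherwise every client just guesses and assumes, which can go wrong.

RFC 5987 §3.2 defines the format of the filename* parameter:

charset'[language]'value-chars

...where:

charset can be UTF-8 or ISO-8859-1 or any MIME charset

...and language is optional.

RFC 6266 §4.3 defines that filename* should be used and gives examples in §5: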

Content-Disposition: attachment; filename="EURO rates"; filename*=utf-8''%e2%82%ac%20rates

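Do you spot the asterisk *? Do you spot the text encoding utf-8? Do you spot the two apostrophes '', which specify no language (see RFC 5646 §2.1)? After that come the octets in the specified text encoding: either percent-encoded or (where allowed) plain ASCII.

Other examples: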

```
Content-Disposition: attachment; filename="green.jpg"; filename*=UTF-8''%e3%82%b0%e3%83%aa%e3%83%bc%e3%83%b3.jpg
```
would display "green.jpg" on older web browsers and "グリーン.jpg" on compliant web browsers.
```
Content-Disposition: attachment; filename="Gruesse.txt"; filename*=ISO-8859-1''Gr%fc%dfe.txt
```
would display "Gruesse.txt" on older web browsers and "Grüße.txt" on compliant web browsers.
```
Content-Disposition: attachment; filename="Hello.png"; filename*=Shift_JIS'en-US'Howdy.png; filename*=EUC-KR'de'Hallo.png
```
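would display "Hello.png" on older web browsers, "Howdy.png" on compliant web browsers whose preferred language is set to American English, and "Hallo.png" on compliant ones whose preferred language is German (Deutsch). Note that different text encodings are not tied to percent-encoding as long as the octets are within the allowed range (Latin letters and the dot are).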
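
As a rough illustration of how a receiver could read the charset'[language]'value-chars format, here is a minimal Python sketch. The function name `decode_ext_value` is hypothetical, the code does no validation of the charset or language tag, and it only handles the value after `filename*=`, not the whole header. As the comments under this answer point out, this format applies to HTTP Content-Disposition headers and must not be used inside multipart/form-data.

```python
from typing import Tuple
from urllib.parse import unquote


def decode_ext_value(ext_value: str) -> Tuple[str, str, str]:
    """Split charset'[language]'value-chars and percent-decode the value."""
    # The first two apostrophes delimit charset and language; anything
    # after them belongs to the value.
    charset, language, value_chars = ext_value.split("'", 2)
    # Percent-escapes are decoded as octets in the declared charset.
    text = unquote(value_chars, encoding=charset, errors="strict")
    return charset, language, text


print(decode_ext_value("utf-8''%e2%82%ac%20rates"))
print(decode_ext_value("ISO-8859-1''Gr%fc%dfe.txt"))
```

An empty second element in the result means that no language was specified, as in all but the last header example above.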

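In my experience nobody cares about this nice feature: everybody just shoves UTF-8 into filename, which still violates the standard, no matter how many clients silently support it. Related: "How to encode the filename parameter of Content-Disposition header in HTTP?" and "PHP: RFC-2231 How to encode UTF-8 String as Content-Disposition filename".

【Comments】:

RFC 7578 forbids the use of filename* in the Content-Disposition header: "NOTE: The encoding method described in [RFC5987], which would add a "filename*" parameter to the Content-Disposition header field, MUST NOT be used." And RFC 6266 applies only to the HTTP Content-Disposition header, not to MIME headers. It even says so: "Note: This document does not apply to Content-Disposition header fields appearing in payload bodies transmitted over HTTP, such as when using the media type "multipart/form-data" ([RFC2388])."
That is news to me, thanks for pointing it out. What I described was legitimate for a span of about 5 years. Should I delete my answer?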