Nginx: How to convert POST data encoding from UTF-8 to TIS-620 before passing it through a proxy

【Question title】: Nginx: How to convert post data encoding from UTF-8 to TIS-620 before pass them through proxy
【Posted】: 2018-06-04 19:19:41
【Question】:

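I want to take the POST data received in a request, convert it from UTF-8 to TIS-620, and then pass it to the backend via proxy_pass using the code below, but I'm not sure which approach to take.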

location / {
   proxy_pass http://targetwebsite;
}

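If I'm not mistaken, I have to use the Lua module to manipulate the request, but I don't know whether it supports any character conversion.

Could anyone help me with sample code that uses Lua to convert POST data from UTF-8 to TIS-620, and show how to verify that the POST data is UTF-8 before converting? Or is there a better way to manipulate/convert POST data in nginx?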

【Comments】:

Tags: nginx lua

【Solution 1】:

This solution works with Lua 5.1/5.2/5.3.

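-- Decode the UTF-8 sequence that starts at byte `pos` of `utf8str`.
-- Returns the code point and the sequence length in bytes,
-- or nothing if the bytes at `pos` are malformed.
local function utf8_to_unicode(utf8str, pos)
   local code, size = utf8str:byte(pos), 1
   if code >= 0xC0 and code < 0xFE then
      -- lead byte of a multi-byte sequence
      local mask = 64
      code = code - 128
      repeat
         local next_byte = utf8str:byte(pos + size)
         if next_byte and next_byte >= 0x80 and next_byte < 0xC0 then
            -- drop one length-marker bit and shift in the continuation byte's
            -- 6 payload bits (the "- 2" cancels its 0x80 tag after the * 64)
            code, size = (code - mask - 2) * 64 + next_byte, size + 1
         else
            return  -- missing or invalid continuation byte
         end
         mask = mask * 32
      until code < mask
   elseif code >= 0x80 then
      return  -- stray continuation byte or invalid lead byte (0xFE, 0xFF)
   end
   return code, size
end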

function utf8to620(utf8str)
   local pos, result_620 = 1, {}
   while pos <= #utf8str do
      local code, size = utf8_to_unicode(utf8str, pos)
      if code then
         pos = pos + size
         code =
            (code < 128 or code == 0xA0) and code
            or (code >= 0x0E01 and code <= 0x0E3A or code >= 0x0E3F and code <= 0x0E5B) and code - 0x0E5B + 0xFB
      end
      if not code then
         return utf8str  -- malformed UTF-8 or a character with no TIS-620 encoding: return the original string unchanged
      end
      table.insert(result_620, string.char(code))
   end
   return table.concat(result_620)  -- return converted string
end

Usage:

local utf8string = "UTF-8 Thai text here"
local tis620string = utf8to620(utf8string)
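
To tie this back to the question, a converter like this would normally run in a lua-nginx-module handler (for example inside access_by_lua_block) before proxy_pass takes over. The following is only a minimal sketch under some assumptions: the body fits in the in-memory buffer, and it is raw text rather than percent-encoded form data. `convert_request_body` is a hypothetical helper name, not part of any library, and the `ngx` table is passed in as a parameter so the function can also be exercised outside nginx.

```lua
-- Hypothetical helper (not part of lua-nginx-module).
-- `ngx` is the lua-nginx-module API table; `convert` is a function that
-- maps a string to a string, such as utf8to620 above.
local function convert_request_body(ngx, convert)
   ngx.req.read_body()                    -- make sure the request body has been read
   local body = ngx.req.get_body_data()   -- nil if the body is empty or was spooled to a temp file
   if not body then
      return false                        -- nothing in memory to convert
   end
   ngx.req.set_body_data(convert(body))   -- replace the body that will be proxied
   return true
end
```

Inside nginx, the call would look like `convert_request_body(ngx, utf8to620)`. When `get_body_data` returns nil for a large request, the body may have been written to a temporary file (see `ngx.req.get_body_file`), which this sketch does not handle.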

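【Comments】:

This looks better than my solution, because it checks whether each character has an encoding in TIS-620. It also recognizes the non-breaking space. It does not fully validate the UTF-8 (for example, it does not reject overlong sequences), but that may not be necessary.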
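
If stricter validation is wanted before converting, which the question also asks about, a separate check along the following lines could be run first. This is a sketch rather than part of either answer: `is_valid_utf8` is a hypothetical name, and it only uses string functions available in Lua 5.1 through 5.3.

```lua
-- Returns true if `s` is well-formed UTF-8. Rejects overlong forms,
-- UTF-16 surrogates, and code points above U+10FFFF.
local function is_valid_utf8(s)
   local i, n = 1, #s
   while i <= n do
      local b = s:byte(i)
      local len, min, code
      if b < 0x80 then
         len, min, code = 1, 0, b
      elseif b >= 0xC2 and b <= 0xDF then
         len, min, code = 2, 0x80, b - 0xC0
      elseif b >= 0xE0 and b <= 0xEF then
         len, min, code = 3, 0x800, b - 0xE0
      elseif b >= 0xF0 and b <= 0xF4 then
         len, min, code = 4, 0x10000, b - 0xF0
      else
         return false  -- stray continuation byte or invalid lead byte
      end
      for k = 1, len - 1 do
         local c = s:byte(i + k)
         if not c or c < 0x80 or c > 0xBF then
            return false  -- truncated sequence or bad continuation byte
         end
         code = code * 64 + (c - 0x80)
      end
      -- `min` is the smallest code point that needs `len` bytes
      if code < min or code > 0x10FFFF or (code >= 0xD800 and code <= 0xDFFF) then
         return false
      end
      i = i + len
   end
   return true
end
```

For example, `is_valid_utf8("\224\184\129")` (the bytes of "ก") is true, while the overlong sequence `"\192\128"` is rejected.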

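【Solution 2】:

I looked up the encodings on Wikipedia and came up with the following solution for converting from UTF-8 to TIS-620. It assumes that every code point in the UTF-8 string has a TIS-620 encoding. It works if the UTF-8 string contains only printable ASCII characters (code points " " through "~") or Thai characters (code points "ก" through "๛"). Otherwise it gives wrong and possibly very strange results.

This assumes you have the Lua 5.3 utf8 library or an equivalent. If you are using an earlier version of Lua, one possibility is the pure-Lua version of the ustring library from MediaWiki (used, for example, by Wikipedia and Wiktionary). It provides a function for validating UTF-8, and many of its other functions validate the string automatically. (That is, they throw an error if the string is not valid UTF-8.) If you use that library, you only need to replace utf8.codepoint with ustring.codepoint in the code below.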

-- Add this number to TIS-620 values above 0x80 to get the Unicode codepoint.
-- 0xE00 is the first codepoint of Thai block, 0xA0 is the corresponding byte used in TIS-620.
local difference = 0xE00 - 0xA0

function UTF8_to_TIS620(UTF8_string)
    local TIS620_string = UTF8_string:gsub(
      '[\194-\244][\128-\191]+',
      function (non_ASCII)
          return string.char(utf8.codepoint(non_ASCII) - difference)
      end)
    return TIS620_string
end
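
The offset arithmetic can be checked in isolation, and a range test can guard against the code points that have no TIS-620 encoding. The sketch below works on a numeric code point rather than on a string, so it does not need the utf8 library; `codepoint_to_tis620` is a hypothetical helper name, and the ranges are the same ones used in Solution 1 (without its special case for the non-breaking space).

```lua
local difference = 0xE00 - 0xA0  -- 0xD60, the same offset as in the code above

-- Hypothetical helper: maps one Unicode code point to its TIS-620 byte value,
-- or returns nil when the code point has no TIS-620 encoding.
local function codepoint_to_tis620(cp)
   if cp < 0x80 then
      return cp                 -- ASCII is unchanged
   end
   if (cp >= 0x0E01 and cp <= 0x0E3A) or (cp >= 0x0E3F and cp <= 0x0E5B) then
      return cp - difference    -- Thai block shifted down to 0xA1..0xFB
   end
   return nil                   -- e.g. U+00E9, or the unassigned gap U+0E3B..U+0E3E
end
```

In the gsub callback above, one could pass the result of `utf8.codepoint(non_ASCII)` through such a helper and decide what to do when it returns nil, instead of calling `string.char` on an out-of-range value.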

【Comments】: