C++ ShiftJIS to UTF8 conversion: answers

【Question title】: C++ ShiftJIS to UTF8 conversion
【Posted】: 2016-01-14 21:26:20
【Question】:

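I need to convert double-byte characters. In my particular case that means converting Shift-JIS into something that is easier to handle, preferably using only standard C++.

The following question ended up without a workaround: Doublebyte encodings on MSVC (std::codecvt): Lead bytes not recognized

So does anyone have a suggestion or a reference on how to handle this conversion with standard C++?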

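【Comments】:

"Easier to handle" for what purpose exactly? Only one direction? (ShiftJIS => something else, but not something else => ShiftJIS)
Sorry, for example to display it as UTF-8. Only one direction. Good to know.
@gabriel - for which platform/OS? This will be hard on any platform without ICU.
Basically I just want a mini function like the one in the accepted answer, but if I can't get it working before the bounty ends, I'll just use ICU.
Sorry, I don't really keep an eye on these old questions and didn't notice that people still want to use this. I can't find the original generator anymore, but I'll edit in a new one now...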

Tags: c++ utf-8 character-encoding shift-jis double-byte

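【Solution 1】:

Normally I would recommend the ICU library, but for this task alone it is far too much overhead.

First, a conversion function that takes a std::string with ShiftJIS data and returns a std::string with UTF8 (note from 2019: no idea anymore whether it works :))

It uses a uint8_t array of 25088 elements (25088 bytes), referred to as convTable in the code. The function does not fill this variable; you have to load it first, e.g. from a file. The second code section below is a program that can generate that file.

The conversion function does not check whether the input is valid ShiftJIS data.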

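#include <cstdint>
#include <string>

//25088-byte lookup table; must be defined and filled elsewhere before sj2utf8 is called
extern uint8_t convTable[25088];

std::string sj2utf8(const std::string &input)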
{
    std::string output(3 * input.length(), ' '); //ShiftJis won't give 4byte UTF8, so max. 3 byte per input char are needed
    size_t indexInput = 0, indexOutput = 0;

    while(indexInput < input.length())
    {
        char arraySection = ((uint8_t)input[indexInput]) >> 4;

        size_t arrayOffset;
        if(arraySection == 0x8) arrayOffset = 0x100; //these are two-byte shiftjis
        else if(arraySection == 0x9) arrayOffset = 0x1100;
        else if(arraySection == 0xE) arrayOffset = 0x2100;
        else arrayOffset = 0; //this is one byte shiftjis

        //determining real array offset
        if(arrayOffset)
        {
            arrayOffset += (((uint8_t)input[indexInput]) & 0xf) << 8;
            indexInput++;
            if(indexInput >= input.length()) break;
        }
        arrayOffset += (uint8_t)input[indexInput++];
        arrayOffset <<= 1;

        //unicode number is...
        uint16_t unicodeValue = (convTable[arrayOffset] << 8) | convTable[arrayOffset + 1];

        //converting to UTF8
        if(unicodeValue < 0x80)
        {
            output[indexOutput++] = unicodeValue;
        }
        else if(unicodeValue < 0x800)
        {
            output[indexOutput++] = 0xC0 | (unicodeValue >> 6);
            output[indexOutput++] = 0x80 | (unicodeValue & 0x3f);
        }
        else
        {
            output[indexOutput++] = 0xE0 | (unicodeValue >> 12);
            output[indexOutput++] = 0x80 | ((unicodeValue & 0xfff) >> 6);
            output[indexOutput++] = 0x80 | (unicodeValue & 0x3f);
        }
    }

    output.resize(indexOutput); //remove the unnecessary bytes
    return output;
}
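
Since the function above does not validate its input, a caller may want a pre-check. The following is only a minimal sketch, assuming the common Shift-JIS byte ranges (lead bytes 0x81 to 0x9F and 0xE0 to 0xEF, trail bytes 0x40 to 0x7E and 0x80 to 0xFC). The name looksLikeShiftJis is hypothetical, and the check is purely structural: it does not tell whether a code is actually assigned in the mapping table.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not part of the original answer): structural check only.
bool looksLikeShiftJis(const std::string &input)
{
    for (std::size_t i = 0; i < input.size(); ++i)
    {
        const uint8_t b = static_cast<uint8_t>(input[i]);

        // single-byte character: ASCII range or half-width katakana
        if (b <= 0x7F || (b >= 0xA1 && b <= 0xDF)) continue;

        // otherwise it has to be the lead byte of a two-byte character
        const bool lead = (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xEF);
        if (!lead) return false;
        if (++i >= input.size()) return false; // truncated two-byte character

        const uint8_t t = static_cast<uint8_t>(input[i]);
        const bool trail = (t >= 0x40 && t <= 0x7E) || (t >= 0x80 && t <= 0xFC);
        if (!trail) return false;
    }
    return true;
}
```

A caller could run this before sj2utf8 and reject or log input for which it returns false.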

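About the helper file: I used to have a download here, but nowadays I only know unreliable file hosters. So either http://s000.tinyupload.com/index.php?file_id=95737652978017682303 works for you, or:

First download the "original" data from ftp://ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/SHIFTJIS.TXT. It is too long to paste here, so we have to hope that at least unicode.org stays online.

Then run the following program with the text file above piped/redirected to its input, and redirect the binary output to a new file. (This needs a binary-safe shell; no idea whether it works on Windows.)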

#include <cstdint>
#include <cstdio>
#include <iostream>
#include <string>

using namespace std;

// pipe SHIFTJIS.txt in and pipe to (binary) file out
int main()
{
    string s;
    uint8_t *mapping; //same bigendian array as in converting function
    mapping = new uint8_t[2*(256 + 3*256*16)];

    //initializing with space for invalid value, and then ASCII control chars
    for(size_t i = 32; i < 256 + 3*256*16; i++)
    {
        mapping[2 * i] = 0;
        mapping[2 * i + 1] = 0x20;
    }
    for(size_t i = 0; i < 32; i++)
    {
        mapping[2 * i] = 0;
        mapping[2 * i + 1] = i;
    }

    while(getline(cin, s)) //pipe the file SHIFTJIS to stdin
    {
        if(s.substr(0, 2) != "0x") continue; //comment lines

        uint16_t shiftJisValue, unicodeValue;
        if(2 != sscanf(s.c_str(), "%hx %hx", &shiftJisValue, &unicodeValue)) //getting hex values
        {
            fputs("Error hex reading\n", stderr); //stdout carries the binary table
            continue;
        }

        size_t offset; //array offset
        if((shiftJisValue >> 8) == 0) offset = 0;
        else if((shiftJisValue >> 12) == 0x8) offset = 256;
        else if((shiftJisValue >> 12) == 0x9) offset = 256 + 16*256;
        else if((shiftJisValue >> 12) == 0xE) offset = 256 + 2*16*256;
        else
        {
            fputs("Error input values\n", stderr); //stdout carries the binary table
            continue;
        }

        offset = 2 * (offset + (shiftJisValue & 0xfff));
        if(mapping[offset] != 0 || mapping[offset + 1] != 0x20)
        {
            fputs("Error mapping not 1:1\n", stderr); //stdout carries the binary table
            continue;
        }

        mapping[offset] = unicodeValue >> 8;
        mapping[offset + 1] = unicodeValue & 0xff;
    }

    fwrite(mapping, 1, 2*(256 + 3*256*16), stdout);
    delete[] mapping;
    return 0;
}

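Notes on the file layout:
Each entry is a raw Unicode value stored as two bytes, big endian (more than two bytes are not needed here)
First come 256 entries (512 bytes) for the single-byte ShiftJIS characters, with the value 0x20 for invalid ones.
Then 3 * 256*16 entries for the groups 0x8???, 0x9??? and 0xE???
= 25088 bytes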
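
The size follows from the layout: 2 * (256 + 3 * 256 * 16) = 25088 bytes. As a minimal sketch of the loading step that the answer leaves to the reader, the generated file could be read like this. The names loadConvTable and kConvTableSize are illustrative and not from the original answer, and the file path is whatever the generator output was saved as.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// 256 single-byte slots plus three lead-byte groups of 16*256 slots, two bytes per slot
constexpr std::size_t kConvTableSize = 2 * (256 + 3 * 256 * 16);
static_assert(kConvTableSize == 25088, "size stated in the notes above");

// Hypothetical loader: returns an empty vector if the file is missing or has the wrong size.
std::vector<uint8_t> loadConvTable(const std::string &path)
{
    std::ifstream in(path, std::ios::binary);
    std::vector<uint8_t> table(kConvTableSize);

    // a missing or too-short file makes read() fail
    if (!in.read(reinterpret_cast<char *>(table.data()),
                 static_cast<std::streamsize>(table.size())))
        return {};

    // reject files with trailing data
    if (in.peek() != std::ifstream::traits_type::eof())
        return {};

    return table;
}
```

sj2utf8 as posted expects a global convTable, so the loaded bytes would either be copied into that array or the function adapted to take the table as a parameter.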

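【Comments】:

Thank you for sharing your code! I haven't managed to get it working, though. I pass in a std::string that should contain proper Japanese text, and the output is always 0. Any suggestions?
@deviantfan, the file link "filedropper.com/shiftjis" is not working. Could you please provide a working link? Thanks, I appreciate your help.
I'll echo what Rak said - can you fix the dead link? I'd love to be able to use this.
It works now! Thanks for coming back to this super old question.
This was perfect! Thank you so much! :D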

【Solution 2】:

For those looking for the Shift-JIS conversion table data, you can get the uint8_t array here: https://github.com/bucanero/apollo-ps3/blob/master/include/shiftjis.h

Also, here is a very simple function that converts basic Shift-JIS characters to ASCII:

#include <cstdint>
#include <cstring>

const char SJIS_REPLACEMENT_TABLE[] =
    " ,.,..:;?!\"*'`*^"
    "-_????????*---/\\"
    "~||--''\"\"()()[]{"
    "}<><>[][][]+-+X?"
    "-==<><>????*'\"CY"
    "$c&%#&*@S*******"
    "*******T><^_'='";

//Convert Shift-JIS characters to ASCII equivalent
void sjis2ascii(char* bData)
{
    uint16_t ch;
    int i, j = 0;
    int len = strlen(bData);
    
    for (i = 0; i < len; i += 2)
    {
        ch = ((uint8_t)bData[i] << 8) | (uint8_t)bData[i+1]; //cast first: plain char may be signed

        // 'A' .. 'Z'
        // '0' .. '9'
        if ((ch >= 0x8260 && ch <= 0x8279) || (ch >= 0x824F && ch <= 0x8258))
        {
            bData[j++] = (ch & 0xFF) - 0x1F;
            continue;
        }

        // 'a' .. 'z'
        if (ch >= 0x8281 && ch <= 0x829A)
        {
            bData[j++] = (ch & 0xFF) - 0x20;
            continue;
        }

        if (ch >= 0x8140 && ch <= 0x81AC)
        {
            bData[j++] = SJIS_REPLACEMENT_TABLE[(ch & 0xFF) - 0x40];
            continue;
        }

        if (ch == 0x0000)
        {
            //End of the string
            bData[j] = 0;
            return;
        }

        // Character not found
        bData[j++] = bData[i];
        bData[j++] = bData[i+1];
    }

    bData[j] = 0;
    return;
}
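
The offsets used above follow from the code layout: full-width digits start at 0x824F, uppercase letters at 0x8260 and lowercase letters at 0x8281. As a small sketch of the same arithmetic applied to a single character, the helper below combines the two bytes through unsigned casts so the result does not depend on whether plain char is signed. The name fullwidthToAscii is hypothetical, and returning 0 for anything outside those three ranges is an assumption made for this example.

```cpp
#include <cstdint>

// Hypothetical helper, not from the original answer.
char fullwidthToAscii(char lead, char trail)
{
    // combine the two bytes into one 16-bit Shift-JIS code
    const uint16_t ch = static_cast<uint16_t>(
        (static_cast<uint8_t>(lead) << 8) | static_cast<uint8_t>(trail));

    if (ch >= 0x824F && ch <= 0x8258) return static_cast<char>('0' + (ch - 0x824F));
    if (ch >= 0x8260 && ch <= 0x8279) return static_cast<char>('A' + (ch - 0x8260));
    if (ch >= 0x8281 && ch <= 0x829A) return static_cast<char>('a' + (ch - 0x8281));

    return 0; // not a full-width digit or Latin letter
}
```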

【Comments】: