Convert const char* to UTF16 from C on macOS and Windows? Answers

【Question title】: Convert const char* to UTF16 from C on macOS and Windows?
【Posted】: 2021-11-18 03:07:03
【Question description】:

My attempt looks hacky and overly complicated. Is there a simple way to convert ASCII to UTF16 on both Windows and macOS?

(Note: prUTF16Char is something I cannot change.)

Attempt (adapted from https://stackoverflow.com/a/54376330)

Prelude

#include <stdlib.h>
#include <stdio.h>
#include <string.h>

#if defined(__APPLE__) && defined(__MACH__)
#include <xcselect.h>
#include <wchar.h>
#include <CoreFoundation/CoreFoundation.h>
typedef unsigned short int prUTF16Char;
#else
typedef wchar_t prUTF16Char;
#endif

#define WIDEN2(x) L ## x
#define WIDEN(x) WIDEN2(x)
#define PROJECT_NAME "foo"

Functions

void copy2ConvertStringLiteralIntoUTF16(const wchar_t* inputString, prUTF16Char* destination) {
    size_t length = wcslen(inputString);
#if (defined(_WIN32) || defined(__WIN32__) || defined(__WINDOWS__)) && defined(PLUGIN_MODE)
    wcscpy_s(destination, length + 1, inputString);
#elif defined(__APPLE__) && defined(__MACH__)
    CFRange range = {0, 150}; range.length = length;
    CFStringRef inputStringCFSR = CFStringCreateWithBytes(
        kCFAllocatorDefault, reinterpret_cast<const UInt8 *>(inputString),
        length * sizeof(wchar_t), kCFStringEncodingUTF32LE, false);
    CFStringGetBytes( inputStringCFSR, range, kCFStringEncodingUTF16, 0, false,
                      reinterpret_cast<UInt8 *>(destination), length * (sizeof (prUTF16Char)), NULL);
    destination[length] = 0; // Set NULL-terminator
    CFRelease(inputStringCFSR);
#endif
}

const prUTF16Char * to_wchar(const char* message) {
    const size_t cSize = strlen(message);
    wchar_t *w_str = new wchar_t[cSize];
#if defined(_WIN32) || defined(__WIN32__) || defined(__WINDOWS__)
    size_t outSize;
    mbstowcs_s(&outSize, w_str, cSize, message, cSize-1);
    return w_str;
#else
    mbstowcs(w_str, message, cSize);
#endif
#if defined(__APPLE__) && defined(__MACH__)
    prUTF16Char *ut16str = new prUTF16Char[cSize];
    copy2ConvertStringLiteralIntoUTF16(w_str, ut16str);
    return ut16str;
#else
    return w_str;
#endif
}

Then I could define a global variable:

static const prUTF16Char* PROJECT_NAME_W =
#if defined(__APPLE__) && defined(__MACH__)
    to_wchar
#elif defined(_WIN32) || defined(__WIN32__) || defined(__WINDOWS__)
    WIDEN
#endif
        (PROJECT_NAME);

And the body of a generic print function that takes message:

#if WCHAR_UTF16
wprintf(L"%s",
#else
    printf("%ls\n",
#endif
    message);

Full attempt:

https://github.com/SamuelMarks/premiere-pro-cmake-plugin/blob/f0d2278/src/common/logger.cpp [being rewritten from C++ to C]

Error:

error: initializer element is not a compile-time constant

EDIT: Super hacky, but using @barmak-shemirani's solution I can do:

#if defined(__APPLE__) && defined(__MACH__)
extern
#elif defined(_WIN32) || defined(__WIN32__) || defined(__WINDOWS__)
static
#endif
const prUTF16Char* PROJECT_NAME_W
#if defined(__APPLE__) && defined(__MACH__)
    ;
#elif defined(_WIN32) || defined(__WIN32__) || defined(__WINDOWS__)
    WIDEN(PROJECT_NAME);
#endif

...and only initialize and free on the extern variant.

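【Question comments】:

This is tagged C but contains new; did you forget to replace it?
"Convert ASCII to [Unicode]" is confused; ASCII is already a subset of Unicode. Could you please edit to explain in more detail what the code should do? Simply put, the pure ASCII string "hello" corresponds to "h\x00e\x00l\x00l\x00o\x00" in little-endian UTF-16 (though the null bytes are obviously a problem in a regular C string... one of the many reasons to prefer utf8everywhere.org)
@BarmakShemirani - Yes, I was still converting the casts from C++ when I realized this all seemed too complicated. @tripleee Also, I'm working in C90, so I can't actually use UTF8 everywhere... not to mention that I'm conforming to someone else's API [Adobe's] and need to accept regular const char* input in some places (which I then need to convert to the Unicode variant the API uses)

Tags: c character-encoding wchar-t wchar multibyte-characters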

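【Solution 1】:

message includes a null terminator. strlen does not count that final character, so cSize must be increased by 1.

Normally you need to call setlocale, for example when message was typed on a non-English computer. If message is guaranteed to be ASCII, you can skip it.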
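To illustrate the setlocale point, here is a minimal sketch, assuming the input is encoded in the user's locale. The names use_user_locale and mb_to_wide are hypothetical and not part of the question or any SDK. The buffer is sized from strlen(message) + 1, which is always enough because every wide character consumes at least one input byte.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: adopt the user's locale so that mbstowcs decodes
   bytes using the environment's encoding. Call once at startup.
   Returns 1 on success, 0 if the locale could not be set. */
int use_user_locale(void)
{
    return setlocale(LC_ALL, "") != NULL;
}

/* Hypothetical helper: converts a multibyte string to a newly allocated
   wide string. Returns NULL on allocation failure or invalid input.
   The caller must free() the result. */
wchar_t *mb_to_wide(const char *message)
{
    size_t cap, converted;
    wchar_t *w_str;

    assert(message != NULL);

    /* Each wide character consumes at least one input byte, so
       strlen(message) + 1 elements is always enough. */
    cap = strlen(message) + 1;
    w_str = malloc(cap * sizeof *w_str);
    if (w_str == NULL)
        return NULL;

    converted = mbstowcs(w_str, message, cap);
    if (converted == (size_t)-1) { /* invalid sequence for this locale */
        free(w_str);
        return NULL;
    }
    w_str[cap - 1] = L'\0'; /* defensive: guarantees termination */
    return w_str;
}
```

Note that the result is a wchar_t string, so it is only UTF-16 on platforms where wchar_t is 16 bits wide.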

Windows example:

const wchar_t* to_wchar(const char* message)
{
    const size_t cSize = strlen(message) + 1; // +1 for the null terminator
    // wchar_t* w_str = new wchar_t[cSize]; if using C++
    wchar_t* w_str = malloc(cSize * sizeof(wchar_t));
    if (!w_str) return NULL;

    mbstowcs(w_str, message, cSize);
    // or: size_t outSize; mbstowcs_s(&outSize, w_str, cSize, message, cSize);

    return w_str; // caller must free()
}

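Note that wchar_t is 2 bytes on Windows and typically 4 bytes on POSIX systems (including Linux and macOS). UTF-16 also comes in two byte orders, little-endian and big-endian. UTF-16 uses 2 bytes per character for the ASCII range, but 4 bytes (a surrogate pair) for characters outside the Basic Multilingual Plane, which affects some non-Latin scripts.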
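As a rough illustration of the 2-byte versus 4-byte cases, the sketch below encodes a single Unicode code point into UTF-16 code units. The names utf16_unit and encode_utf16 are made up for this example, and it assumes unsigned short is 16 bits wide.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned short utf16_unit; /* hypothetical name; assumes 16 bits */

/* Encodes one code point into out[]. Returns the number of code units
   written (1 or 2), or 0 if cp is not a valid Unicode scalar value. */
size_t encode_utf16(unsigned long cp, utf16_unit out[2])
{
    assert(out != NULL);

    if (cp >= 0xD800UL && cp <= 0xDFFFUL)
        return 0; /* reserved for surrogates, not encodable on their own */
    if (cp > 0x10FFFFUL)
        return 0; /* beyond the Unicode range */

    if (cp < 0x10000UL) {
        out[0] = (utf16_unit)cp; /* one unit: 2 bytes */
        return 1;
    }

    /* Two units (4 bytes): split the 20-bit offset into a surrogate pair. */
    cp -= 0x10000UL;
    out[0] = (utf16_unit)(0xD800UL + (cp >> 10));    /* high surrogate */
    out[1] = (utf16_unit)(0xDC00UL + (cp & 0x3FFUL)); /* low surrogate */
    return 2;
}
```

For ASCII input only the one-unit branch is ever taken, which is why a simple widening loop is enough in that case.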

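You should consider UTF-8 output. Most Windows programs are prepared to read UTF-8 from a file or the network.

Windows byte output for "123" (byte values shown in decimal):

49 00 50 00 51 00 00 00 <- little-endian
00 49 00 50 00 51 00 00 <- big-endian

Linux byte output of the code above (Windows will not recognize this as UTF-16):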

49 00 00 00 50 00 00 00 51 00 00 00 00 00 00 00
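The byte dumps above can be reproduced on any host by writing the bytes in an explicit order instead of relying on wchar_t. This is only a sketch under the assumption that the input is pure ASCII; ascii_to_utf16_bytes is a hypothetical name.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: encodes an ASCII string as UTF-16 bytes in an
   explicit byte order, independent of the host CPU. Returns a malloc'd
   buffer of (strlen(message) + 1) * 2 bytes, including the 2-byte
   terminator, or NULL on allocation failure or non-ASCII input. */
unsigned char *ascii_to_utf16_bytes(const char *message, int big_endian,
                                    size_t *out_size)
{
    size_t len, i;
    unsigned char *buf;

    assert(message != NULL && out_size != NULL);

    len = strlen(message);
    buf = calloc(len + 1, 2); /* zero-filled: high bytes and terminator */
    if (buf == NULL)
        return NULL;

    for (i = 0; i < len; i++) {
        unsigned char c = (unsigned char)message[i];
        if (c > 0x7F) { /* not ASCII: refuse rather than guess */
            free(buf);
            return NULL;
        }
        /* The ASCII value goes in the low byte of each 2-byte unit. */
        buf[i * 2 + (big_endian ? 1 : 0)] = c;
    }

    *out_size = (len + 1) * 2;
    return buf;
}
```

Calling it with "123" and big_endian set to 0 yields the little-endian row shown above, and with big_endian set to 1 the big-endian row.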

If you are 100% sure the message is ASCII, you can write your own function:

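#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef unsigned short prUTF16Char; // remove this line later

prUTF16Char* to_wchar(const char* message)
{
    if (!message) return NULL;

    size_t len = strlen(message);
    size_t bufsize = (len + 1) * 2;
    char* buf = malloc(bufsize);
    if (!buf) return NULL;

    // Detect host byte order: the first byte of the int 1 is 1 on little-endian
    int little_endian = 1;
    little_endian = ((char*)&little_endian)[0];
    memset(buf, 0, bufsize);
    for (size_t i = 0; i < len; i++)
        buf[i * 2 + (little_endian ? 0 : 1)] = message[i]; // parentheses needed: ?: binds looser than +

    return (prUTF16Char*)buf;
}

prUTF16Char* wstr;
int main(void)
{
    wstr = to_wchar("ASCII");
    if (!wstr) return 1;
    // Print the code units; wprintf(L"%s\n", wstr) only works where wchar_t is 16 bits (Windows)
    for (size_t i = 0; wstr[i] != 0; i++)
        printf("%04X ", (unsigned)wstr[i]);
    printf("\n");
    free(wstr);
    return 0;
}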
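Since the function above writes the bytes in the host's own byte order, the same result can be had by assigning each character to a 16-bit code unit directly, with no endianness detection. The variant below is a sketch along those lines, written so that it also compiles as C90; ascii_to_utf16 is a hypothetical name, and the typedef is a stand-in for the SDK's prUTF16Char.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef unsigned short prUTF16Char; /* stand-in for the SDK typedef */

/* Hypothetical helper: widens an ASCII string to native-endian UTF-16.
   Returns NULL for NULL input, non-ASCII bytes, or allocation failure.
   The caller must free() the result. */
prUTF16Char *ascii_to_utf16(const char *message)
{
    size_t len, i;
    prUTF16Char *out;

    assert(sizeof(prUTF16Char) == 2); /* assumption this sketch relies on */

    if (message == NULL)
        return NULL;

    len = strlen(message);
    out = malloc((len + 1) * sizeof *out);
    if (out == NULL)
        return NULL;

    for (i = 0; i < len; i++) {
        unsigned char c = (unsigned char)message[i];
        if (c > 0x7F) { /* outside ASCII: widening would be wrong */
            free(out);
            return NULL;
        }
        out[i] = c; /* the compiler stores it in the host byte order */
    }
    out[len] = 0; /* 16-bit terminator */
    return out;
}
```

Rejecting bytes above 0x7F makes the "100% sure it is ASCII" assumption explicit instead of silently producing wrong code units.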

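【Comments】:

Aside: consider w_str = malloc(cSize * sizeof(wchar_t)); --> w_str = malloc(sizeof *w_str * cSize);. Easier to code correctly, review, and maintain.
Thanks, good find (the off-by-one error). Hmm, I didn't know about the whole byte-order issue with string encodings... but yes, I do need UTF16, because I have to comply with a very old API (Adobe's, which has the aforementioned typedef for UTF16). As for the solution provided, I believe it doesn't solve my global-assignment problem. Does that mean I have to make my global non-static, or is there another way?
I don't know what you mean by global non-static. You can assume Windows is little-endian (anything on x86/x64, especially older programs). The problem I mentioned is that when you use mbstowcs on different systems, the output string may be wrong; otherwise the C compiler doesn't care about this.
I'm also now confused about how your program works. If your program runs on Windows and communicates with another Windows application, that's fine.
static const prUTF16Char* PROJECT_NAME_W = to_wchar(PROJECT_NAME); is what I mean by global static (one that I don't need to free). My project has two modes, "plugin" and "standalone". "Plugin" requires the Adobe SDK and does things like output to the Adobe console. "Standalone" works without Adobe's SDK or even C++; the sample code is the API I'm trying to use. log_info(message) should output message using standard output in "standalone" mode and the Adobe-specific API in "plugin" mode. (This worked in C++; I'm rewriting it in C.)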
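On the global static question: in C, an object with static storage duration can only be initialized with constant expressions, so a call such as to_wchar(PROJECT_NAME) is rejected with the "initializer element is not a compile-time constant" error. One possible workaround, sketched below, is to spell out the code units in an array initializer, which is a constant expression even in C90 and needs no free. The typedef is a stand-in for the SDK's prUTF16Char, and utf16_len and PROJECT_NAME_PTR are hypothetical names. With a C11 compiler, a u"foo" literal of type char16_t[] from <uchar.h> may be another option, but that is not available in C90.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned short prUTF16Char; /* stand-in for the SDK typedef */

/* Character constants are integer constant expressions, so this
   initializer is valid at file scope. For ASCII characters the code
   unit value equals the character value. */
static const prUTF16Char PROJECT_NAME_W[] = { 'f', 'o', 'o', 0 };

/* The address of a static object is also a constant expression, so a
   pointer-typed global can refer to the array. */
static const prUTF16Char *const PROJECT_NAME_PTR = PROJECT_NAME_W;

/* Hypothetical helper: counts code units before the terminator. */
size_t utf16_len(const prUTF16Char *s)
{
    size_t n = 0;

    assert(s != NULL);
    while (s[n] != 0)
        n++;
    return n;
}
```

The downside is that the name is duplicated rather than derived from the PROJECT_NAME macro, so the two have to be kept in sync by hand.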