【Posted】: 2014-03-05 21:35:33
【Question】:
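I'm currently working on converting a UTF-8 string to a UCS-2 string using the ICU library. The library offers several ways to do this, but so far none of them has worked for me. Given how widely used the library is, I assume I'm doing something wrong.
First, the common code. In every case I create the string on an object and pass it along, but nothing is done to it before it reaches the conversion step.
The UTF-8 string currently in use is just "ĩ".
For simplicity, I'll refer to the string as uniString in this code.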
UErrorCode resultCode = U_ZERO_ERROR;
UConverter* m_pConv = ucnv_open("ISO-8859-1", &resultCode);
// Change the callback to error out instead of the default
const void* oldContext;
UConverterFromUCallback oldFromAction;
UConverterToUCallback oldToAction;
ucnv_setFromUCallBack(m_pConv, UCNV_FROM_U_CALLBACK_STOP, NULL, &oldFromAction, &oldContext, &resultCode);
ucnv_setToUCallBack(m_pConv, UCNV_TO_U_CALLBACK_STOP, NULL, &oldToAction, &oldContext, &resultCode);
int32_t outputLength = 0;
int bodySize = uniString.length();
int targetSize = bodySize * 4;
char* target = new char[targetSize];
printf("Body: %s\n", uniString.c_str());
if (U_SUCCESS(resultCode))
{
    // outputLength = ucnv_convert("ISO-8859-1", "UTF-8", target, targetSize, uniString.c_str(), bodySize, &resultCode);
    outputLength = ucnv_fromAlgorithmic(m_pConv, UCNV_UTF8, target, targetSize, uniString.c_str(),
                                        uniString.length(), &resultCode);
    ucnv_close(m_pConv);
}
printf("ISO-8859-1 DGF just tried to convert '%s' to '%s' with error '%i' and length '%i'", uniString.c_str(),
       outputLength ? target : "invalid_char", resultCode, outputLength);
if (resultCode == U_INVALID_CHAR_FOUND || resultCode == U_ILLEGAL_CHAR_FOUND || resultCode == U_TRUNCATED_CHAR_FOUND)
{
    if (resultCode == U_INVALID_CHAR_FOUND)
    {
        printf("Unmapped input character, cannot be converted to Latin1");
        m_pConv = ucnv_open("UCS-2", &resultCode);
        if (U_SUCCESS(resultCode))
        {
            // outputLength = ucnv_convert("UCS-2", "UTF-8", target, targetSize, uniString.c_str(), bodySize, &resultCode);
            outputLength = ucnv_fromAlgorithmic(m_pConv, UCNV_UTF8, target, targetSize, uniString.c_str(),
                                                uniString.length(), &resultCode);
            ucnv_close(m_pConv);
        }
        printf("UCS-2 DGF just tried to convert '%s' to '%s' with error '%i' and length '%i'", uniString.c_str(),
               outputLength ? target : "invalid_char", resultCode, outputLength);
        if (U_SUCCESS(resultCode))
        {
            pdus = SegmentText(target, pText, SEGMENT_SIZE_UNICODE_MAX, true);
        }
    }
    else
    {
        printf("DecodeText(): Text contents does not appear to be valid UTF-8");
    }
}
else
{
    printf("DecodeText(): Text successfully converted to Latin1");
    std::string newBody(target, outputLength);
    pdus = SegmentText(newBody, pPdu, SEGMENT_SIZE_MAX);
}
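The problem is that ucnv_fromAlgorithmic reports the error U_INVALID_CHAR_FOUND for the UCS-2 conversion. That makes sense for the ISO-8859-1 attempt, but not for UCS-2.
My other attempt used ucnv_convert, which you can see commented out. That function performs the conversion, but it does not fail on the ISO-8859-1 attempt the way I expect it to.
So the question is: does anyone with experience of these functions see what is wrong here, or is there something incorrect in my assumptions about how this character should convert?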
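The assumption about the character itself can be checked without ICU. "ĩ" is U+0129, encoded in UTF-8 as the two bytes 0xC4 0xA9. The sketch below hand-decodes the first code point of a UTF-8 string and tests which target encodings can hold it. The helper names decodeFirstCodePoint, fitsLatin1 and fitsUcs2 are hypothetical, and the decoder is deliberately minimal: it does not reject overlong forms.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: decode the first code point of a UTF-8 string.
// Returns 0xFFFFFFFF on empty, truncated, or malformed input.
std::uint32_t decodeFirstCodePoint(const std::string& s)
{
    const std::uint32_t kInvalid = 0xFFFFFFFFu;
    if (s.empty())
    {
        return kInvalid;
    }

    const unsigned char lead = static_cast<unsigned char>(s[0]);
    std::size_t extra = 0;   // number of continuation bytes expected
    std::uint32_t cp = 0;    // code point being assembled

    if (lead < 0x80)                { extra = 0; cp = lead; }
    else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; }
    else                            { return kInvalid; }

    if (s.size() < extra + 1)
    {
        return kInvalid;
    }
    for (std::size_t i = 1; i <= extra; ++i)
    {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if ((c & 0xC0) != 0x80)
        {
            return kInvalid;  // not a continuation byte
        }
        cp = (cp << 6) | (c & 0x3F);
    }
    return cp;
}

// ISO-8859-1 covers exactly U+0000..U+00FF.
bool fitsLatin1(std::uint32_t cp)
{
    return cp <= 0xFF;
}

// UCS-2 covers the Basic Multilingual Plane, excluding the surrogate range.
bool fitsUcs2(std::uint32_t cp)
{
    return cp <= 0xFFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}
```

For the input "\xC4\xA9", decodeFirstCodePoint yields 0x129, fitsLatin1 is false and fitsUcs2 is true. So the expectation in the question looks sound: the Latin-1 attempt should fail and the UCS-2 attempt should succeed, which points at the calling code rather than at the character.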
【Comments】:
- @KevinPanko I've updated the question and the problem description. Thanks.
Tags: c++ unicode utf-8 icu ucs2