[Posted]: 2018-09-08 16:15:06
[Question]:
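I'm very new to C++, and this is my first attempt at making something relatively large/useful, so if you spot any obvious mistakes unrelated to my question, pointing them out would be appreciated.
For some UTF-8 related operations, I am using the C library utf8proc.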
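The problem
When built for the debug target with the latest MSVC 15, a test program that runs this code (it does little more than print the result of this function) fails with this error dialog:
Debug Error!
[some information about which exe file failed]
Run-Time Check Failure #2 - Stack around the variable 'character' was corrupted.
No other compiler I tried, and no release target, gives this error; they all produce the correct output for whatever input I throw at them.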
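There are a few interesting things to note (these were tested with gcc):
First, the memory of codepoint and character sometimes seems to change at random (hence saving codepoint to codepointCopy, which is a hacky mitigation).
Second, character, once encoded, sometimes has strange trailing characters. I assume this is due to uninitialized memory, but manually setting the memory in character to 0 with memset did not help (is there something obvious I'm missing?). Hence the hacky .substr(0, charSize), which has worked fine so far.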
The code
#include <string>
#include "../include/utf8proc.h"

std::string calculateUnicodeNormalization(const std::string &in, const std::string &mode) {
    auto pString = (const utf8proc_uint8_t*) in.c_str();
    utf8proc_uint8_t* pOutString;
    // These two functions are from c and use malloc to allocate memory, so I free with free()
    if (mode == "NFC") {
        pOutString = utf8proc_NFC(pString);
    } else {
        pOutString = utf8proc_NFD(pString);
    }
    // Converts to a string
    std::string retString = std::string((const char*) pOutString);
    // Frees what was allocated by malloc
    free(pOutString);
    return retString;
}

std::string removeAccents(const std::string &in) {
    std::string decomposedString = calculateUnicodeNormalization(in, "NFD");
    auto pDecomposedString = (const utf8proc_uint8_t*) decomposedString.c_str();
    size_t offset = 0;
    std::string rebuiltString;
    // Iterates through all of the characters, adding to the "offset" each time so the next character can be found
    while (true) {
        utf8proc_int32_t codepoint;
        // This function takes a pointer to a uint8_t array and writes the next unicode character's codepoint into codepoint.
        // The -1 means it reads up to 4 bytes (the max length of a utf-8 character).
        utf8proc_iterate(pDecomposedString + offset, -1, &codepoint);
        // Null terminator, end of string
        if (codepoint == 0) {
            break;
        }
        const utf8proc_int32_t codepointCopy = codepoint;
        utf8proc_uint8_t character;
        // This function takes a codepoint and puts the encoded utf-8 character into "character". It returns the bytes written.
        auto charSize = (size_t) utf8proc_encode_char(codepointCopy, &character);
        // I had been having some problems with trailing random characters (random unicode), but this seemed to fix it.
        // Could that have been related to the error?
        std::string realChar = std::string((const char*) &character).substr(0, charSize);
        // God knows why this is needed, but the above function call seems to somehow alter codepoint
        // Could be to do with the error?
        codepoint = codepointCopy;
        // Increments offset so the next character now would be read
        offset += charSize;
        // The actual useful part of the function: gets the category of the codepoint, and if it is Mark, Nonspacing (and not an iota subscript),
        // does not add it to the rebuilt string
        if ((utf8proc_category(codepoint) == UTF8PROC_CATEGORY_MN) && (codepoint != 0x0345)) {
            continue;
        }
        rebuiltString += realChar;
    }
    // Returns the composed form of the rebuilt string
    return calculateUnicodeNormalization(rebuiltString, "NFC");
}
You can test this code by, for example, writing a main function:
#include <iostream>

int main() {
    std::cout << removeAccents("ᾤκεον") << std::endl;
}
and expecting the result "ῳκεον".
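I'm not really sure what is going on, and I can't see any obvious memory errors (after all, it otherwise seems to work perfectly well), but given my inexperience I may well have missed something.
Thanks for any answers, and as always, if anything is missing, please comment so I can add it.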
[Comments]:
@RichardCritten There is no need to add a terminating null to a std::string
Tags: c++ visual-c++ utf-8