【Question title】: Run-Time Check Failure #2 in MSVC debug builds only, with utf8proc
【Posted】: 2018-09-08 16:15:06
【Question description】:

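I am quite new to C++ and this is my first attempt at making something relatively large/useful, so pointers to any glaring mistakes, even ones unrelated to my question, would be appreciated.

For some UTF-8 related operations, I am using the C library utf8proc.

Problem

When I build for the debug target with the latest MSVC 15 and run a test program that uses this code (basically as simple as printing the result of this function), it produces an error dialog: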

Debug Error!

[Some information about which exe file failed]

Run-Time Check Failure #2 - Stack around the variable 'character' was corrupted.

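No other compiler I have tried, nor the release target, gives this error; they all produce the correct output for whatever input I throw at them.

A few interesting things to note (these were tested with gcc):

First, the memory of codepoint and character sometimes seems to change at random (hence saving codepoint to codepointCopy, which is a hacky mitigation).

Second, character, once encoded, sometimes has strange trailing characters. I assume this is due to uninitialized memory, but manually zeroing the memory of character with memset did not help (am I missing something obvious?). Hence the hacky .substr(0, charSize), which has worked fine so far.

Code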

#include <string>

#include "../include/utf8proc.h"

std::string calculateUnicodeNormalization(const std::string &in, const std::string &mode) {
    auto pString = (const utf8proc_uint8_t*) in.c_str();

    utf8proc_uint8_t* pOutString;
    // These two functions are from c and use malloc to allocate memory, so I free with free()
    if (mode == "NFC") {
        pOutString = utf8proc_NFC(pString);
    } else {
        pOutString = utf8proc_NFD(pString);
    }

    // Converts to a string
    std::string retString = std::string((const char*) pOutString);
    // Frees what was allocated by malloc
    free(pOutString);

    return retString;
}

std::string removeAccents(const std::string &in) {
    std::string decomposedString = calculateUnicodeNormalization(in, "NFD");
    auto pDecomposedString = (const utf8proc_uint8_t*) decomposedString.c_str();

    size_t offset = 0;
    std::string rebuiltString;
    // Iterates through all of the characters, adding to the "offset" each time so the next character can be found
    while (true) {
        utf8proc_int32_t codepoint;

        // This function takes a pointer to a uint8_t array and writes the next unicode character's codepoint into codepoint.
        // The -1 means it reads up to 4 bytes (the max length of a utf-8 character).
        utf8proc_iterate(pDecomposedString + offset, -1, &codepoint);

        // Null terminator, end of string
        if (codepoint == 0) {
            break;
        }

        const utf8proc_int32_t codepointCopy = codepoint;

        utf8proc_uint8_t character;
        // This function takes a codepoint and puts the encoded utf-8 character into "character". It returns the bytes written.
        auto charSize = (size_t) utf8proc_encode_char(codepointCopy, &character);

        // I had been having some problems with trailing random characters (random unicode), but this seemed to fix it.
        // Could that have been related to the error?
        std::string realChar = std::string((const char*) &character).substr(0, charSize);

        // God knows why this is needed, but the above function call seems to somehow alter codepoint
        // Could be to do with the error?
        codepoint = codepointCopy;

        // Increments offset so the next character now would be read
        offset += charSize;

        // The actual useful part of the function: gets the category of the codepoint, and if it is Mark, Nonspacing (and not an iota subscript),
        // does not add it to the rebuilt string
        if ((utf8proc_category(codepoint) == UTF8PROC_CATEGORY_MN) && (codepoint != 0x0345)) {
            continue;
        }

        rebuiltString += realChar;
    }

    // Returns the composed form of the rebuilt string
    return calculateUnicodeNormalization(rebuiltString, "NFC");
}

You can test this code by, for example, writing a main function:

#include <iostream>

int main() {
    std::cout << removeAccents("ᾤκεον") << std::endl;
}

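and expecting the result "ῳκεον".

I am not really sure what is going on, and I cannot see any obvious memory errors (I mean, it otherwise seems to work perfectly well), but of course, given my inexperience, I may well have missed something.

Thanks for any answers, and as always, if anything is missing, please leave a comment so I can add it.

【Discussion】:

@RichardCritten There is no need to add a terminating null to a std::string

Tags: c++ visual-c++ utf-8

【Solution 1】: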

utf8proc_uint8_t character;
// This function takes a codepoint and puts the encoded utf-8 character into "character". It returns the bytes written.
auto charSize = (size_t) utf8proc_encode_char(codepointCopy, &character);

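This writes up to 4 bytes into the single-byte variable character, which corrupts your stack. The write past the end of the variable is undefined behavior, so builds without MSVC's debug stack checks can appear to work, and it is likely why neighboring variables such as codepoint seemed to change.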
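To make the buffer requirement concrete, here is a minimal sketch that does not depend on utf8proc. `encodeUtf8Sketch` is a hypothetical stand-in written for this example; it only mimics the contract of `utf8proc_encode_char` (write 1 to 4 bytes to `dst`, return the count, write no terminator). `EncodedChar` and `encodeChar` are likewise illustrative names. With the real library, the equivalent change would be to declare `utf8proc_uint8_t character[4];` and pass `character` instead of `&character`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for utf8proc_encode_char (assumption, not the library's code):
// writes the UTF-8 encoding of `cp` to `dst` and returns the number of bytes written.
// `dst` must have room for 4 bytes; no terminating '\0' is written.
std::size_t encodeUtf8Sketch(std::uint32_t cp, unsigned char* dst) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x80) {
        dst[0] = static_cast<unsigned char>(cp);
        return 1;
    }
    if (cp < 0x800) {
        dst[0] = static_cast<unsigned char>(0xC0 | (cp >> 6));
        dst[1] = static_cast<unsigned char>(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        dst[0] = static_cast<unsigned char>(0xE0 | (cp >> 12));
        dst[1] = static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F));
        dst[2] = static_cast<unsigned char>(0x80 | (cp & 0x3F));
        return 3;
    }
    dst[0] = static_cast<unsigned char>(0xF0 | (cp >> 18));
    dst[1] = static_cast<unsigned char>(0x80 | ((cp >> 12) & 0x3F));
    dst[2] = static_cast<unsigned char>(0x80 | ((cp >> 6) & 0x3F));
    dst[3] = static_cast<unsigned char>(0x80 | (cp & 0x3F));
    return 4;
}

// Illustrative holder: a buffer large enough for any single UTF-8 sequence,
// plus the number of bytes that are actually in use.
struct EncodedChar {
    std::array<unsigned char, 4> bytes;
    std::size_t length;
};

// Encodes one code point into a 4-byte buffer, so the encoder never writes out of bounds.
EncodedChar encodeChar(std::uint32_t cp) {
    EncodedChar result{};  // zero-initialized
    result.length = encodeUtf8Sketch(cp, result.bytes.data());
    return result;
}
```

For example, `encodeChar(0x1FA4)` (the "ᾤ" from the question) fills three of the four bytes, which a one-byte destination could not hold.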

 std::string((const char*) &character).substr(0, charSize);

The following would be more efficient and less crash-prone (&character is not a null-terminated string):

 std::string((const char*) &character, charSize);

Or even better:

 rebuiltString.append((const char*) &character, charSize);
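As a small illustration of why the length-based forms matter, the sketch below appends bytes from buffers that contain no '\0' and whose unused tail holds junk. The helper names `appendEncoded` and `buildSample`, and the 0xAA filler bytes, are assumptions made up for this example; only `std::string::append(const char*, size_t)` comes from the standard library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Appends exactly `length` bytes starting at `bytes`.
// No '\0' is required in the buffer, and none is searched for.
void appendEncoded(std::string& out, const unsigned char* bytes, std::size_t length) {
    assert(length <= 4);  // a single UTF-8 sequence is at most 4 bytes
    out.append(reinterpret_cast<const char*>(bytes), length);
}

// Builds the UTF-8 string "ωκ" from two 4-byte buffers.
// The 0xAA bytes stand in for whatever stale data the unused tail might hold.
std::string buildSample() {
    const unsigned char omega[4] = {0xCF, 0x89, 0xAA, 0xAA};  // U+03C9 uses 2 bytes
    const unsigned char kappa[4] = {0xCE, 0xBA, 0xAA, 0xAA};  // U+03BA uses 2 bytes
    std::string rebuilt;
    appendEncoded(rebuilt, omega, 2);
    appendEncoded(rebuilt, kappa, 2);
    return rebuilt;
}
```

Because the copy is bounded by the explicit length, the junk bytes never reach the result. Constructing a std::string from the bare pointer would instead keep reading until it happened to hit a zero byte, which is where the stray trailing characters in the question most plausibly came from.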

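【Discussion】:

I'll accept this in a moment! Sorry for being so inexperienced!
Sorry, I must have gotten something wrong: even with utf8proc_uint8_t character[4]; auto charSize = (size_t) utf8proc_encode_char(codepoint, character); it still shows the same error. Any pointers?
@AttoAllas There may be a similar bug somewhere else; you will need to debug to find the cause. I am not familiar with utf8proc.
Thanks a lot. I had never really had to worry about memory errors being this troublesome before!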