Renaming a file with an en dash in the name in C++ (answers)

【Question title】: Renaming a file with an en dash in the name in C++
【Posted】: 2016-11-26 22:13:47
【Question】:

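In the project I'm working on, I work with files and check whether they exist before proceeding. Renaming, or even just using, a file with an en dash in its path seems to be impossible.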

std::string _old = "D:\\Folder\\This – by ABC.txt";
std::rename(_old.c_str(), "New.txt");

Here the _old variable gets interpreted as D:\Folder\This û by ABC.txt. I have tried

setlocale(LC_ALL, "");
//and
setlocale(LC_ALL, "C");
//or    
setlocale(LC_ALL, "en_US.UTF-8");

but none of them worked. What should be done?

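【Question comments】:

Use Unicode or UTF-8 hex constants for the dash in your C++ source. You are relying on your C++ code editor to decide what that – actually is, so if the compiler interprets the character differently, none of those setlocale calls will do anything for you.
What happens if you replace – with "\xe2\x80\x93" (UTF-8)?
Have you tried u8"This – by ABC.txt"?
I tried that, but I got those weird symbols.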

Tags: c++ file

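【Answer 1】:

It depends on the operating system. On Linux, file names are simple byte arrays: forget about encodings and just rename the file.

But it seems you are using Windows, where a file name is actually a null-terminated string of 16-bit characters. In that case the best approach is to use wstring instead of messing with encodings.

Don't try to write platform-independent code to solve a platform-specific problem. Windows uses Unicode for file names, so you have to write platform-specific code instead of using the standard function rename.

Just write L"D:\\Folder\\This \u2013 by ABC.txt" and call _wrename.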

【Comments】:

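【Answer 2】:

The Windows ANSI Western encoding has the Unicode n-dash, U+2013, “–”, as code point 150 (decimal). When you output that to a console with active codepage 437, the original IBM PC character set, or compatible, it is interpreted as “û”. So you have the right codepage 1252 character in your string literal, either because

you are using Visual C++, which defaults to the Windows ANSI codepage for encoding narrow string literals, or
you are using an old version of g++ that doesn't do the standard-mandated conversion and checking, but just passes narrow character bytes directly through its machinery, and your source code is encoded as Windows ANSI Western (or compatible), or
something I didn't think of.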

For either of the first two possibilities the rename call will work.
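To find out which of these cases applies, it can help to look at the bytes the compiler actually stored in the literal. The following is a small diagnostic sketch (hex_bytes is a hypothetical helper name): codepage 1252 stores the n-dash as the single byte 96 hex (150 decimal), while UTF-8 stores it as the three bytes e2 80 93.

```cpp
#include <cstdio>
#include <string>

// Returns the bytes of s as space-separated lowercase hex, e.g. "e2 80 93".
std::string hex_bytes(const std::string& s)
{
    std::string out;
    char buf[4];
    for (unsigned char c : s) {
        // Two hex digits per byte; the cast avoids sign extension surprises.
        std::snprintf(buf, sizeof buf, "%02x", static_cast<unsigned>(c));
        if (!out.empty()) out += ' ';
        out += buf;
    }
    return out;
}
```

Passing the literal from the question to hex_bytes and printing the result should show a 96 where the dash is if the literal was encoded as codepage 1252, and e2 80 93 if it was encoded as UTF-8.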

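I tested that it does work with Visual C++. I don't have an old version of g++ around, but I tested that it works with version 5.1. That is, I tested that the file really gets renamed to New.txt.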

// Source encoding: UTF-8
// Execution character set: Windows ANSI Western a.k.a. codepage 1252.
#include <stdio.h>      // rename
#include <stdlib.h>     // EXIT_SUCCESS, EXIT_FAILURE
#include <string>       // std::string
using namespace std;

auto main()
    -> int
{
    string const a = ".\\This – by ABC.txt";    // Literal encoded as CP 1252.
    return rename( a.c_str(), "New.txt" ) == 0? EXIT_SUCCESS : EXIT_FAILURE;
}

Example:

[C:\my\forums\so\265]
> dir /b *.txt
File Not Found

[C:\my\forums\so\265]
> g++ r.cpp -fexec-charset=cp1252

[C:\my\forums\so\265]
> type nul >"This – by ABC.txt"

[C:\my\forums\so\265]
> run a
Exit code 0

[C:\my\forums\so\265]
> dir /b *.txt
New.txt

[C:\my\forums\so\265]
> _

...where run is just a batch file that reports the exit code.

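If your Windows ANSI codepage is not codepage 1252, then you need to use your particular Windows ANSI codepage.

You can check the Windows ANSI codepage via the GetACP API function, or e.g. via this command:

[C:\my\forums\so\265]
> wmic os get codeset /value | find "="
CodeSet=1252

[C:\my\forums\so\265]
> _

The code will work if that codepage supports the n-dash character.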

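This model of coding is based on having one version of the executable for each relevant main locale (including character encoding).

An alternative is to do everything in Unicode. That can be done portably via Boost filesystem, which was adopted into the standard library in C++17. Or you can use the Windows API, or the de facto standard extensions to the standard library in Windows, i.e. _wrename.

Example using the experimental filesystem module with Visual C++ 2015: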

// Source encoding: UTF-8
// Execution character set: irrelevant (everything's done in Unicode).
#include <stdlib.h>     // EXIT_SUCCESS, EXIT_FAILURE

#include <filesystem>   // In C++17 and later, or Visual C++ 2015 and later.
using namespace std::tr2::sys;

auto main()
    -> int
{
    path const old_path = L".\\This – by ABC.txt";    // Literal encoded as wide string.
    path const new_path = L"New.txt";
    try
    {
        rename( old_path, new_path );
        return EXIT_SUCCESS;
    }
    catch( ... )
    {}
    return EXIT_FAILURE;
}

To do this properly for portable code you can use Boost, or you can create a wrapper header that uses whatever implementation is available.

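【Comments】:

If a program needs to run under a particular locale, then it is a bad program. Don't put that into production code. P.S. Why not just use int main()?
@AndreyNasonov: The model is bad; the programs are not necessarily bad (they are just very limited). Unfortunately it is the model inherent in the C++ standard library and the C standard library, and in the Visual C++ compiler's choice of execution character encoding. For historical reasons.
@AndreyNasonov: I finally added a “use Unicode!” remark at the end. I think we are roughly on the same page. ;-)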

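【Answer 3】:

It really is platform dependent, and Unicode is a headache. It also depends on which compiler you use. For older ones from MS (VS2010 or older), you need to use the API described in MSDN. This test example creates a file with the name you are having trouble with and then renames it: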

// #define _UNICODE // might be defined in project
#include <string>

#include <tchar.h>
#include <windows.h>

using namespace std;

// Convert a wide Unicode string to an UTF8 string
std::string utf8_encode(const std::wstring &wstr)
{
    if( wstr.empty() ) return std::string();
    int size_needed = WideCharToMultiByte(CP_UTF8, 0, &wstr[0], (int)wstr.size(), NULL, 0, NULL, NULL);
    std::string strTo( size_needed, 0 );
    WideCharToMultiByte                  (CP_UTF8, 0, &wstr[0], (int)wstr.size(), &strTo[0], size_needed, NULL, NULL);
    return strTo;
}

// Convert an UTF8 string to a wide Unicode String
std::wstring utf8_decode(const std::string &str)
{
    if( str.empty() ) return std::wstring();
    int size_needed = MultiByteToWideChar(CP_UTF8, 0, &str[0], (int)str.size(), NULL, 0);
    std::wstring wstrTo( size_needed, 0 );
    MultiByteToWideChar                  (CP_UTF8, 0, &str[0], (int)str.size(), &wstrTo[0], size_needed);
    return wstrTo;
}

int _tmain(int argc, _TCHAR* argv[] ) {
    std::string pFileName = "C:\\This \xe2\x80\x93 by ABC.txt";
    std::wstring pwsFileName = utf8_decode(pFileName);

    // can use CreateFile instead
    HANDLE hf = CreateFileW( pwsFileName.c_str() ,
                      GENERIC_READ | GENERIC_WRITE,
                      0,
                      0,
                      CREATE_NEW,
                      FILE_ATTRIBUTE_NORMAL,
                      0);
    CloseHandle(hf);
    MoveFileW(utf8_decode("C:\\This \xe2\x80\x93 by ABC.txt").c_str(), utf8_decode("C:\\This \xe2\x80\x93 by ABC 2.txt").c_str());
}

Those helpers can still be a problem, so you can work with null-terminated strings instead:

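// Convert a wide Unicode string to an UTF8 string
std::string utf8_encode(const std::wstring &wstr)
{
    // With -1 as the input length the string is treated as null-terminated, and
    // the size returned for a 0-sized buffer includes the terminating null.
    int size_needed = WideCharToMultiByte(CP_UTF8, 0, wstr.c_str(), -1, NULL, 0, NULL, NULL);
    if( size_needed <= 0 ) return std::string();
    std::string strTo( size_needed, '\0' );
    WideCharToMultiByte(CP_UTF8, 0, wstr.c_str(), -1, &strTo[0], size_needed, NULL, NULL);
    strTo.resize( size_needed - 1 );    // Drop the terminating null.
    return strTo;
}

// Convert an UTF8 string to a wide Unicode String
std::wstring utf8_decode(const std::string &str)
{
    // Same pattern; here the size is counted in wchar_t units.
    int size_needed = MultiByteToWideChar(CP_UTF8, 0, str.c_str(), -1, NULL, 0);
    if( size_needed <= 0 ) return std::wstring();
    std::wstring wstrTo( size_needed, L'\0' );
    MultiByteToWideChar(CP_UTF8, 0, str.c_str(), -1, &wstrTo[0], size_needed);
    wstrTo.resize( size_needed - 1 );   // Drop the terminating null.
    return wstrTo;
}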

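The problem is the size of the converted string. Calling WideCharToMultiByte with 0 as the size of the destination buffer lets you find out the size needed for the conversion: it then returns the number of bytes required for the destination buffer. All this juggling explains why frameworks like Qt have such intricate code to support Unicode-based file systems. In practice, the most cost-effective way to get rid of all these possible bugs is to use such a framework.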
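The size issue can be illustrated without the Windows API. The sketch below (utf8_size is a hypothetical helper, and well-formed UTF-16 input is assumed) counts the bytes a UTF-16 string needs in UTF-8, which is roughly what the sizing call with a 0-sized destination buffer reports. It shows why a buffer sized by the number of wide characters is too small as soon as a character such as the n-dash appears.

```cpp
#include <cstddef>
#include <string>

// Number of bytes needed to store a UTF-16 string as UTF-8 (no terminator).
// Assumes every high surrogate is followed by its low surrogate.
std::size_t utf8_size(const std::u16string& s)
{
    std::size_t bytes = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t c = s[i];
        if (c < 0x80) {
            bytes += 1;                       // ASCII
        } else if (c < 0x800) {
            bytes += 2;                       // e.g. Latin-1, Cyrillic
        } else if (c >= 0xD800 && c <= 0xDBFF) {
            bytes += 4;                       // surrogate pair: one code point
            ++i;                              // skip the low surrogate
        } else {
            bytes += 3;                       // rest of the BMP, e.g. U+2013
        }
    }
    return bytes;
}
```

For the file name from the question, the 17 UTF-16 code units of "This – by ABC.txt" need 19 bytes in UTF-8, because the n-dash alone takes three.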

For VS2015:

std::string _old = u8"D:\\Folder\\This \xe2\x80\x93 by ABC.txt"s;

according to their documentation. I can't check that one.

For MinGW:

std::string _old = u8"D:\\Folder\\This \xe2\x80\x93 by ABC.txt";
std::cout << _old.data();

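The output contains the proper file name... but for the file API you still need to do the proper conversion.

【Comments】:

You can't use a UTF-8 locale in Windows, so a UTF-8 encoded file name will be gibberish.
−1 because this approach will not match the file name when it is the source, or will produce a gibberish file name when it is the result (the answer was not fixed after I pointed this out).
@alf Are you sure? NTFS stores names in Unicode. In fact, being Russian, I deal with this problem often: naming a file in Russian results in a UTF-8 file name on an NTFS volume. By the way, I did not find the dash in the wide-char table. msdn.microsoft.com/en-us/library/windows/desktop/… It stores them in UTF-16 and, as I said, you need the API to convert. Apparently newer runtimes do that under the hood? It does not work with fopen and other POSIX functions; those work only with ANSI.
The rename operation that the OP uses is (in practice) implemented as a call to the Windows ANSI compatibility layer. I.e. it requires a Windows ANSI encoded string. A UTF-8 encoded string, or any char-based encoding that supports the n-dash, would be OK if there were a rename operation that expected that encoding, but Microsoft doesn't provide one.
You missed what I said, didn't you? Do not use the POSIX file access functions on Windows with Unicode. Use the Win API functions that take WCHAR, not the “POSIX” layer that includes fopen and the like. To move (rename) a file, that is MoveFile(). If you need portable code, you need some multi-platform framework like Qt (Qt's solution is nearly a one-liner).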