Storing plain text and byte information in the same file - Conversion problems

【Question title】: Storing plain text and byte information in the same file - Conversion problems
【Posted】: 2015-08-09 16:14:18
【Question description】:

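I am supposed to develop a subsystem that stores certain business data in a file, and I have run into a problem. First, the requirements:

All of the data must be in one file.
The data contains human-readable plain text as well as byte data.
The byte data can be large (and will grow in the future), so I should keep it as small as possible.

I thought I would just put everything into one String, encode it with UTF-8 (a format that will not disappear any time soon), and write it to a file. The problem is that UTF-8 does not allow certain byte combinations and changes them when I read the file back later.

Here is some sample code that shows the problem: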

    // The charset we use to encode the strings / file
    Charset charSet = StandardCharsets.UTF_8;

    // The byte data we want to store (as ints here because in the app it is used as ints)
    int idsToStore[] = new int[] {360, 361, 390, 391};

    // We transform our ints to bytes
    byte[] bytesToStore = new byte[idsToStore.length * 4];
    for (int i = 0; i < idsToStore.length; i++) {
        int id = idsToStore[i];
        bytesToStore[i * 4 + 0] = (byte) ((id >> 24) & 0xFF);
        bytesToStore[i * 4 + 1] = (byte) ((id >> 16) & 0xFF);
        bytesToStore[i * 4 + 2] = (byte) ((id >> 8) & 0xFF);
        bytesToStore[i * 4 + 3] = (byte) (id & 0xFF);
    }
    // We transform our bytes to a String
    String stringToStore = new String(bytesToStore, charSet);

    System.out.println("idsToStore="+Arrays.toString(idsToStore));
    System.out.println("BytesToStore="+Arrays.toString(bytesToStore));
    System.out.println("StringToStore="+stringToStore);
    System.out.println();

    // We load our bytes from the "file" (in this case a String, but the result is the same)
    byte[] bytesLoaded = stringToStore.getBytes(charSet);
    // Just to check we see if the resulting String is identical
    String stringLoaded = new String(bytesLoaded, charSet);

    // We transform our bytes back to ints
    int[] idsLoaded = new int[bytesLoaded.length / 4];
    int readPos = 0;
    for (int i = 0; i < idsLoaded.length; i++) {
        byte b1 = bytesLoaded[readPos++];
        byte b2 = bytesLoaded[readPos++];
        byte b3 = bytesLoaded[readPos++];
        byte b4 = bytesLoaded[readPos++];
        idsLoaded[i] = (b4 & 0xFF) | (b3 & 0xFF) << 8 | (b2 & 0xFF) << 16 | (b1 & 0xFF) << 24;
    }

    System.out.println("BytesLoaded="+Arrays.toString(bytesLoaded));
    System.out.println("StringLoaded="+stringLoaded);
    System.out.println("idsLoaded="+Arrays.toString(idsLoaded));
    System.out.println();

    // We check everything
    System.out.println("Bytes equal: "+Arrays.equals(bytesToStore, bytesLoaded));
    System.out.println("Strings equal: "+stringToStore.equals(stringLoaded));
    System.out.println("IDs equal: "+Arrays.equals(idsToStore, idsLoaded));

The output for UTF-8 is:

    idsToStore=[360, 361, 390, 391]
    BytesToStore=[0, 0, 1, 104, 0, 0, 1, 105, 0, 0, 1, -122, 0, 0, 1, -121]
    StringToStore=(can not be pasted into SO)

    BytesLoaded=[0, 0, 1, 104, 0, 0, 1, 105, 0, 0, 1, -17, -65, -67, 0, 0, 1, -17, -65, -67]
    StringLoaded=(can not be pasted into SO)
    idsLoaded=[360, 361, 495, -1078132736, 32489405]

    Bytes equal: false
    Strings equal: true
    IDs equal: false

If I change the charset to UTF16BE (

Any suggestions are appreciated. Thanks in advance.

【Comments】:

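Do not try to represent bytes directly as a String. If you want the file to be readable, use a binary-to-text encoding such as Base64: it takes more space, but the file can safely be opened in a text editor and sent over text-only channels. If you want actual binary data in the file, it will not be human-readable.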
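
A minimal sketch of the Base64 suggestion, assuming Java 8 or later (`java.util.Base64`). The class name `Base64RoundTrip` and its helper methods are made up for illustration and are not from the question:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

public class Base64RoundTrip {

    // Packs ints into big-endian bytes, the same layout the question builds by hand.
    static byte[] toBytes(int[] ids) {
        ByteBuffer buffer = ByteBuffer.allocate(ids.length * Integer.BYTES);
        for (int id : ids) {
            buffer.putInt(id);
        }
        return buffer.array();
    }

    // Reads big-endian ints back out of the byte array.
    static int[] toInts(byte[] bytes) {
        ByteBuffer buffer = ByteBuffer.wrap(bytes);
        int[] ids = new int[bytes.length / Integer.BYTES];
        for (int i = 0; i < ids.length; i++) {
            ids[i] = buffer.getInt();
        }
        return ids;
    }

    // Turns arbitrary bytes into ASCII-only text, which any charset-aware step can carry safely.
    static String encode(byte[] bytes) {
        return Base64.getEncoder().encodeToString(bytes);
    }

    // Recovers the original bytes from the Base64 text.
    static byte[] decode(String text) {
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        int[] idsToStore = {360, 361, 390, 391};
        String text = encode(toBytes(idsToStore));

        // Simulate writing the text to a UTF-8 file and reading it back.
        String loaded = new String(text.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);

        int[] idsLoaded = toInts(decode(loaded));
        System.out.println("Base64 text: " + loaded);
        System.out.println("IDs equal: " + Arrays.equals(idsToStore, idsLoaded));
    }
}
```

The cost is size: Base64 emits 4 characters for every 3 input bytes, which works against the requirement to keep the byte data small.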

Tags: java file utf-8 io

【Solution 1】:

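The only way to be sure that your charset always works is to test it against every byte value: write a byte array containing all 256 possible values and check whether it is read back unchanged.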
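
A rough sketch of such a check. The class and method names are hypothetical, and the two charsets tried in `main` are just examples:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CharsetRoundTripCheck {

    // Returns true if an array holding every byte value 0..255 survives
    // the trip bytes -> String -> bytes unchanged for the given charset.
    static boolean roundTripsAllBytes(Charset charset) {
        byte[] original = new byte[256];
        for (int i = 0; i < original.length; i++) {
            original[i] = (byte) i;
        }
        String decoded = new String(original, charset);
        byte[] reencoded = decoded.getBytes(charset);
        return Arrays.equals(original, reencoded);
    }

    public static void main(String[] args) {
        System.out.println("UTF-8 round-trips:      " + roundTripsAllBytes(StandardCharsets.UTF_8));
        System.out.println("ISO-8859-1 round-trips: " + roundTripsAllBytes(StandardCharsets.ISO_8859_1));
    }
}
```

UTF-8 fails this check because bytes that do not form a valid sequence are decoded to the replacement character U+FFFD, which is re-encoded as the three bytes -17, -65, -67 seen in the question's output. ISO-8859-1 passes because it maps each byte to exactly one character. Note that this single array only exercises one ordering of the bytes, so for a multi-byte charset a pass here would not prove that every byte combination is safe.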

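However, going back to the root of the problem, I doubt that encoding all the data into a single String will work. String is a Unicode structure intended to hold readable text (that is, it might not preserve some characters below ASCII code 32).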

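Instead, I would think of a BINARY, structured file. Being binary, it can transparently contain anything. Being structured, it can hold several kinds of data. For example, it would be good to design a structure made of segments, where each segment has a header giving the length of its data. Binary segments would be read through an InputStream, and text segments through a Reader (with the required encoding).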
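
One possible shape for such a segmented file, as a sketch only. The layout (a 1-byte type tag, a 4-byte length, then the payload), the tag values, and all names below are assumptions made for illustration; the answer does not specify them. For brevity, text segments are decoded with `new String(bytes, UTF_8)` rather than through a Reader:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class SegmentedFile {
    // Hypothetical segment type tags.
    static final byte TYPE_TEXT = 1;
    static final byte TYPE_BINARY = 2;

    // One segment: a type tag plus its raw payload bytes.
    static final class Segment {
        final byte type;
        final byte[] payload;

        Segment(byte type, byte[] payload) {
            this.type = type;
            this.payload = payload;
        }

        // Decodes the payload as UTF-8; only meaningful for TYPE_TEXT segments.
        String asText() {
            return new String(payload, StandardCharsets.UTF_8);
        }
    }

    // Writes the header (1-byte type, 4-byte big-endian length) followed by the payload.
    static void writeSegment(DataOutputStream out, byte type, byte[] payload) throws IOException {
        out.writeByte(type);
        out.writeInt(payload.length);
        out.write(payload);
    }

    // Text is encoded to UTF-8 bytes first, so the length header counts bytes, not chars.
    static void writeText(DataOutputStream out, String text) throws IOException {
        writeSegment(out, TYPE_TEXT, text.getBytes(StandardCharsets.UTF_8));
    }

    // Reads segments until the stream ends at a segment boundary.
    static List<Segment> readAll(DataInputStream in) throws IOException {
        List<Segment> segments = new ArrayList<>();
        while (true) {
            int type = in.read(); // -1 signals end of stream
            if (type < 0) {
                return segments;
            }
            byte[] payload = new byte[in.readInt()];
            in.readFully(payload);
            segments.add(new Segment((byte) type, payload));
        }
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream file = new ByteArrayOutputStream(); // stands in for a FileOutputStream
        try (DataOutputStream out = new DataOutputStream(file)) {
            writeText(out, "Customer: M\u00fcller");
            writeSegment(out, TYPE_BINARY, new byte[] {0, 0, 1, -122, 0, 0, 1, -121});
        }
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(file.toByteArray()))) {
            for (Segment s : readAll(in)) {
                System.out.println(s.type == TYPE_TEXT ? s.asText() : s.payload.length + " binary bytes");
            }
        }
    }
}
```

Because the binary payload never passes through a charset, the bytes 390 and 391 from the question are stored as-is, and the data takes no more space than the raw bytes plus a 5-byte header per segment.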

【Comments】: