Java: How can I get the encoding from an InputStream?

【Question title】: Java: How can I get the encoding from an InputStream?
【Posted】: 2011-11-29 03:55:49
【Question】:

I want to get the encoding from a stream.

First approach: use InputStreamReader.

But it always returns the operating system encoding.

InputStreamReader reader = new InputStreamReader(new FileInputStream("aa.rar"));
System.out.println(reader.getEncoding());

Output: GBK

Second approach: use UniversalDetector.

But it always returns null.

    FileInputStream input = new FileInputStream("aa.rar");

    UniversalDetector detector = new UniversalDetector(null);
    byte[] buf = new byte[4096];

    int nread;
    while ((nread = input.read(buf)) > 0 && !detector.isDone()) {
        detector.handleData(buf, 0, nread);
    }

    // (3)
    detector.dataEnd();

    // (4)
    String encoding = detector.getDetectedCharset();

    if (encoding != null) {
        System.out.println("Detected encoding = " + encoding);
    } else {
        System.out.println("No encoding detected.");
    }

    // (5)
    detector.reset();

Output: null

How can I get the correct encoding? :(

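【Comments】:

With no charset argument, InputStreamReader always uses the platform default encoding. It makes no attempt to detect the encoding of the file. What type of file are you running through UniversalDetector? In your example you used a RAR file, which is a compressed binary format. Try a simple ASCII text file first.
Hi, I changed the file type. With 'Fortunes.txt' the output is: No encoding detected.
It doesn't seem to detect "standard" UTF-8 or UTF-16 without a BOM, but it works for UTF-16 with a BOM. Maybe consider a different library for charset detection? This link may help.
Detecting the encoding by inspecting the text data is unreliable guesswork. You really need the encoding to be specified as metadata somewhere.
@Michael Borwardt: But in many cases you really do not have any metadata specifying the encoding, and you really do not have any spec telling you which encoding the text files you need to parse will use. In those cases, "guessing" (using letter frequencies and many other heuristics) such as www-archive.mozilla.org/projects/intl/… seems to be a fairly "scientific" guess. Things are not always black and white. When you have no metadata, you don't say "I need metadata"; you work hard and write (or reuse) a detector.

Tags: java encoding io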

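【Solution 1】:

Let's recap the situation:

InputStream delivers bytes
*Reader classes deliver characters, decoded using some encoding
new InputStreamReader(inputStream) uses the operating system (platform default) encoding
new InputStreamReader(inputStream, "UTF-8") uses the given encoding (here UTF-8)

So you need to know the encoding before you start reading. Using a charset detection class first was the right thing to do.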
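To make the recap concrete, here is a minimal sketch using only the JDK. The class name `ExplicitCharsetDemo` and the helper `decode` are hypothetical names chosen for illustration. It shows that the same bytes produce different text depending on the charset handed to InputStreamReader, and that `getEncoding()` only echoes how the reader was configured rather than anything detected from the data.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitCharsetDemo {

    // Decodes the bytes with the charset named by the caller; nothing is detected here.
    static String decode(byte[] bytes, Charset charset) {
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "h\u00e9llo" is "hello" with an e-acute; in UTF-8 that letter takes two bytes.
        byte[] bytes = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);

        // Same bytes, two different charsets, two different strings.
        System.out.println(decode(bytes, StandardCharsets.UTF_8).length());
        System.out.println(decode(bytes, StandardCharsets.ISO_8859_1).length());

        // getEncoding() reports the charset the reader was constructed with.
        InputStreamReader reader =
                new InputStreamReader(new ByteArrayInputStream(bytes), StandardCharsets.ISO_8859_1);
        System.out.println(reader.getEncoding());
    }
}
```

Decoding with the wrong charset does not fail; it silently yields different characters, which is why the encoding has to be known (or guessed) up front.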

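According to http://code.google.com/p/juniversalchardet/ it should handle UTF-8 and UTF-16. You could open the file in the JEdit editor to verify its encoding and see whether something is wrong with the file.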
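Besides checking in an editor, the first few bytes of the file can be inspected directly. One of the comments on the question reports that detection worked for UTF-16 with a BOM, so a byte order mark check is a cheap first test. The sketch below is not part of juniversalchardet; `BomSniffer` and `fromBom` are hypothetical names, and it deliberately ignores UTF-32.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class BomSniffer {

    // Returns the charset implied by a leading byte order mark, or null if there is none.
    // Assumption: UTF-32 is not considered, so FF FE 00 00 is reported as UTF-16LE.
    static Charset fromBom(byte[] head) {
        if (head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        if (head.length >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        return null; // no BOM: the encoding is still unknown
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        byte[] plain = {'h', 'i'};
        System.out.println(fromBom(withBom));
        System.out.println(fromBom(plain));
    }
}
```

A null result only means that no BOM is present; most UTF-8 files are written without one, so this check complements a statistical detector rather than replacing it.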

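【Discussion】:

We can do it with other tools, but I can't work out the specific handling; it seems it still has to be dealt with. :(
Juniversalchardet does not support ISO-8859-1, which is a very common charset.
@Thomas universalchardet comes from the browser world, where ISO-8859-1 is reinterpreted as Windows-1252 (officially so since HTML 5), so Windows-1252, also known as Cp1252, probably works. Yes, check.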

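【Solution 2】:

    import java.io.IOException;
    import java.io.InputStream;
    import org.mozilla.universalchardet.UniversalDetector;

    // Returns the charset name guessed by juniversalchardet, or null if
    // nothing was detected or the stream could not be read.
    // Note: this consumes and closes the stream.
    public String getDecoder(InputStream inputStream) {
        String encoding = null;

        try (InputStream in = inputStream) {
            byte[] buf = new byte[4096];
            UniversalDetector detector = new UniversalDetector(null);
            int nread;

            // Feed bytes until the detector is confident or the stream ends.
            while ((nread = in.read(buf)) > 0 && !detector.isDone()) {
                detector.handleData(buf, 0, nread);
            }

            detector.dataEnd();
            encoding = detector.getDetectedCharset();
            detector.reset();
        } catch (IOException e) {
            // Read failure: fall through and report "unknown" as null.
        }

        return encoding;
    }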
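When the detector returns null, the caller still has to pick something. One possible fallback, sketched here with the JDK only and not taken from the answer above, is to test whether the bytes are well-formed UTF-8 by decoding them strictly. `Utf8Check` and `isValidUtf8` are hypothetical names. This is a heuristic: pure ASCII also passes, and a buffer cut in the middle of a multi-byte character fails, so it assumes the whole content is passed in.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Check {

    // Returns true if the bytes decode as UTF-8 without any malformed sequence.
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)       // fail instead of substituting U+FFFD
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "h\u00e9llo".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(isValidUtf8(utf8));
        System.out.println(isValidUtf8(latin1));
    }
}
```

A caller might try the detector first, then this check, and only then fall back to a configured default such as the platform encoding.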

【Discussion】: