java.util.Scanner to read files with different character encodings: answers

【Question title】: java.util.Scanner to read files with different character encoding
【Posted】: 2018-11-06 12:12:15
【Question】:

I am using Java to read a list of files. Some of them have a different encoding: ANSI instead of UTF-8. java.util.Scanner fails to read these files and I get an empty output string. I tried another approach:

                FileInputStream fis = new FileInputStream(my_file);
                BufferedReader br = new BufferedReader(new InputStreamReader(fis));
                InputStreamReader isr = new InputStreamReader(fis);
                isr.getEncoding();

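I don't know how to change the character encoding in the ANSI case. UTF-8 and ANSI files are mixed in the same folder. I tried using Apache Tika for this. After detecting the file's encoding I pass it to Scanner, but the output is still empty.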

Scanner scanner = new Scanner(my_file, detector.getCharset().toString());
line = scanner.nextLine();

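【Comments】:

@Nirekin: In this case I have a mix of UTF-8 and ANSI encodings, so I can't hard-code a fixed solution.
Possible duplicate of Java : How to determine the correct charset encoding of a stream
@locus2k: I see, but how do I use the detected charset with Scanner?
Open an input stream with the desired charset; if that fails, try the next one until it works. The link has some solutions.
@locus2k: I got the charset and set it in the Scanner too, but the output string is empty.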

Tags: java arrays character-encoding java.util.scanner

【Solution 1】:

There is a library called juniversalchardet that helps you guess the correct encoding. It was updated recently and now lives on GitHub:

https://github.com/albfernandez/juniversalchardet

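However, there is no fail-safe tool for detecting an encoding, because there are too many unknowns:

Is the file plain text, or part of a PNG?
Is it stored in a (1,...,k,...,n)-bit encoding?
Which k-bit encoding was used?

You can make some guesses by counting unusual control characters. If a file contains many control characters after decoding, you have probably chosen the wrong encoding. (Then try the next one.)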
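The control-character idea can be sketched with the standard library alone. This is only a rough illustration, not how juniversalchardet works: the class name `ControlCharHeuristic` and its methods are made up for this example, and the scoring rule (count U+FFFD replacement characters plus control characters other than tab, CR and LF) is an assumption.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of any library.
final class ControlCharHeuristic {

    /** Counts decoded characters that rarely occur in plain text. */
    static int countSuspicious(byte[] bytes, Charset charset) {
        // new String(...) never throws: malformed input is replaced by U+FFFD
        String text = new String(bytes, charset);
        int suspicious = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean commonWhitespace = c == '\n' || c == '\r' || c == '\t';
            if (c == '\uFFFD' || (Character.isISOControl(c) && !commonWhitespace)) {
                suspicious++;
            }
        }
        return suspicious;
    }

    /** Returns the candidate with the fewest suspicious characters; ties go to the earlier candidate. */
    static Charset guess(byte[] bytes, List<Charset> candidates) {
        Charset best = null;
        int bestScore = Integer.MAX_VALUE;
        for (Charset candidate : candidates) {
            int score = countSuspicious(bytes, candidate);
            if (score < bestScore) {
                best = candidate;
                bestScore = score;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        byte[] ansi = {'n', 'a', (byte) 0xEF, 'v', 'e'}; // "naïve" as single-byte ANSI
        List<Charset> candidates =
                Arrays.asList(StandardCharsets.UTF_8, Charset.forName("windows-1252"));
        System.out.println(guess(ansi, candidates));
    }
}
```

Single-byte charsets accept almost any byte sequence, so the stricter candidate (UTF-8) should come first in the list; otherwise a tie would favour the single-byte charset.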

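juniversalchardet tries several more successful approaches to determine the encoding (even for Chinese encodings). It also offers a convenient way to open a reader on a file with the correct encoding already selected:

(Taken from https://github.com/albfernandez/juniversalchardet#creating-a-reader-with-correct-encoding and adapted)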

import org.mozilla.universalchardet.ReaderFactory;
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;

public class TestCreateReaderFromFile {

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: java TestCreateReaderFromFile FILENAME");
            System.exit(1);
        }

        File file = new File(args[0]);
        // readLine() needs a BufferedReader (a plain Reader has no such method);
        // try-with-resources closes the reader even if reading fails.
        try (BufferedReader reader = ReaderFactory.createBufferedReader(file)) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line); // Print each line to the console
            }
        }
    }
}

Edit: added ScannerFactory

/*
(C) Copyright 2016-2017 Alberto Fernández <infjaf@gmail.com>
Adapted by Fritz Windisch 2018-11-15
The contents of this file are subject to the Mozilla Public License Version
1.1 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.mozilla.org/MPL/
Software distributed under the License is distributed on an "AS IS" basis,
WITHOUT WARRANTY OF ANY KIND, either express or implied. See the License
for the specific language governing rights and limitations under the
License.
Alternatively, the contents of this file may be used under the terms of
either the GNU General Public License Version 2 or later (the "GPL"), or
the GNU Lesser General Public License Version 2.1 or later (the "LGPL"),
in which case the provisions of the GPL or the LGPL are applicable instead
of those above. If you wish to allow use of your version of this file only
under the terms of either the GPL or the LGPL, and not to allow others to
use your version of this file under the terms of the MPL, indicate your
decision by deleting the provisions above and replace them with the notice
and other provisions required by the GPL or the LGPL. If you do not delete
the provisions above, a recipient may use your version of this file under
the terms of any one of the MPL, the GPL or the LGPL.
*/

import java.io.BufferedInputStream;
import java.io.File;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Objects;
import java.util.Scanner;
import org.mozilla.universalchardet.UniversalDetector;
import org.mozilla.universalchardet.UnicodeBOMInputStream;

/**
 * Create a scanner from a file with correct encoding
 */
public final class ScannerFactory {

    private ScannerFactory() {
        throw new AssertionError("No instances allowed");
    }
    /**
     * Create a scanner from a file with correct encoding
     * @param file The file to read from
     * @param defaultCharset defaultCharset to use if can't be determined
     * @return Scanner for the file with the correct encoding
     * @throws java.io.IOException if some I/O error occurs
     */

    public static Scanner createScanner(File file, Charset defaultCharset) throws IOException {
        Charset cs = Objects.requireNonNull(defaultCharset, "defaultCharset must be not null");
        String detectedEncoding = UniversalDetector.detectCharset(file);
        if (detectedEncoding != null) {
            cs = Charset.forName(detectedEncoding);
        }
        if (!cs.toString().contains("UTF")) {
            return new Scanner(file, cs.name());
        }
        Path path = file.toPath();
        return new Scanner(new UnicodeBOMInputStream(new BufferedInputStream(Files.newInputStream(path))), cs.name());
    }
    /**
     * Create a scanner from a file with correct encoding. If charset cannot be determined,
     * it uses the system default charset.
     * @param file The file to read from
     * @return Scanner for the file with the correct encoding
     * @throws java.io.IOException if some I/O error occurs
     */
    public static Scanner createScanner(File file) throws IOException {
        return createScanner(file, Charset.defaultCharset());
    }
}

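【Comments】:

I found juniversalchardet, but I'm not sure how to use it with Scanner.
@plaidshirt I just added a ScannerFactory that produces a Scanner for you.
Could you show me a usage example? For a start, I can't resolve UniversalDetector.detectCharset().

【Solution 2】:

Your approach will not give you the correct encoding.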

 FileInputStream fis = new FileInputStream(my_file);
 BufferedReader br = new BufferedReader(new InputStreamReader(fis));
 InputStreamReader isr = new InputStreamReader(fis);
 isr.getEncoding();

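This returns the encoding used by this InputStreamReader (read the javadoc), not the encoding of the characters written in the file (my_file in your case). If the encoding is wrong, Scanner will not be able to read the file correctly.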
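A small demonstration of this point, using in-memory bytes instead of a file so it is self-contained. The class `GetEncodingDemo` is invented for illustration. The question's code passes no charset to the reader, in which case the platform default is used; here the charset is passed explicitly to make the effect visible.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original answer.
final class GetEncodingDemo {

    /** Returns what getEncoding() reports for a reader opened on the given bytes. */
    static String reportedEncoding(byte[] content, Charset requested) throws IOException {
        try (InputStreamReader reader =
                     new InputStreamReader(new ByteArrayInputStream(content), requested)) {
            // Must be called before close(); a closed reader returns null here
            return reader.getEncoding();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] utf8Bytes = "caf\u00E9".getBytes(StandardCharsets.UTF_8);
        byte[] latin1Bytes = "caf\u00E9".getBytes(StandardCharsets.ISO_8859_1);
        // Both calls report the same name, although the bytes are encoded differently:
        // the reader only echoes the charset it was configured with.
        System.out.println(reportedEncoding(utf8Bytes, StandardCharsets.UTF_8));
        System.out.println(reportedEncoding(latin1Bytes, StandardCharsets.UTF_8));
    }
}
```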

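In fact, and correct me if I'm wrong, there is no way to determine the encoding used for a given file with 100% accuracy. A few projects have a better success rate at guessing the encoding, but none is 100% accurate. On the other hand, if you know the encoding used, you can read the file as follows: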

Scanner scanner = new Scanner(my_file, "charset");
scanner.nextLine();
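When only a handful of encodings are possible, the "try one, then the next" suggestion from the question's comments can be sketched with a strict decoder. This is an illustrative sketch, not the answerer's code: the class `StrictDecodeFallback` is hypothetical, and it assumes the file is small enough to hold in memory as a byte array.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of any library.
final class StrictDecodeFallback {

    /** Decodes with the first candidate that accepts every byte of the input. */
    static String decodeWithFallback(byte[] bytes, List<Charset> candidates) {
        for (Charset candidate : candidates) {
            try {
                return candidate.newDecoder()
                        .onMalformedInput(CodingErrorAction.REPORT)      // fail instead of replacing
                        .onUnmappableCharacter(CodingErrorAction.REPORT)
                        .decode(ByteBuffer.wrap(bytes))
                        .toString();
            } catch (CharacterCodingException e) {
                // These bytes are not valid in this charset; try the next candidate
            }
        }
        throw new IllegalArgumentException("No candidate charset could decode the input");
    }

    public static void main(String[] args) {
        byte[] ansi = {'c', 'a', 'f', (byte) 0xE9}; // "café" in a single-byte ANSI code page
        List<Charset> candidates =
                Arrays.asList(StandardCharsets.UTF_8, Charset.forName("windows-1252"));
        System.out.println(decodeWithFallback(ansi, candidates));
    }
}
```

The order matters: UTF-8 rejects most non-UTF-8 input, while a single-byte code page accepts nearly everything, so the single-byte charset belongs last. The bytes for a real file could come from `Files.readAllBytes`.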

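Also, find out the correct charset name to use in Java for ANSI. It may be US-ASCII or Cp1251; note that "ANSI" is not one specific encoding: on Windows it usually means the system code page, for example Cp1252 (windows-1252) on Western European systems and Cp1251 on Cyrillic ones.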

Whichever path you take, keep an eye out for any IOException; it may point you in the right direction.
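One caveat when looking for that exception: Scanner does not rethrow an IOException raised by its source. According to its javadoc it treats the failure as end of input, and the most recent exception is available through `ioException()`. The sketch below (hypothetical class `ScannerErrorDemo`, temporary file created only for the demo) shows how to inspect it; to my knowledge, a decoding failure with the `Scanner(File, String)` constructor surfaces this way, which would explain an empty result without any visible error.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Scanner;

// Hypothetical demo class, not part of the original answer.
final class ScannerErrorDemo {

    /** Reads the whole file and returns the IOException the Scanner swallowed, or null. */
    static IOException swallowedError(File file, String charsetName) throws FileNotFoundException {
        try (Scanner scanner = new Scanner(file, charsetName)) {
            while (scanner.hasNextLine()) {
                scanner.nextLine();
            }
            // Query before the scanner is closed by try-with-resources
            return scanner.ioException();
        }
    }

    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("ansi", ".txt");
        try {
            // "café" plus newline, encoded in a single-byte ANSI code page
            Files.write(path, new byte[] {'c', 'a', 'f', (byte) 0xE9, '\n'});
            System.out.println("as UTF-8:        " + swallowedError(path.toFile(), "UTF-8"));
            System.out.println("as windows-1252: " + swallowedError(path.toFile(), "windows-1252"));
        } finally {
            Files.delete(path);
        }
    }
}
```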

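【Comments】:

I tried Cp1252 and Cp1251 for these files, but there were no strings in the output.
@plaidshirt Could you share a sample of the text you are trying to read, along with the code?
It doesn't depend on the text, because it is formatted the same way every time (key: value). The only difference between these files is the encoding.
As the answer says, you can't determine the encoding just by looking at the file. If you know the expected result, since you say the format is always the same, you can try one encoding, check whether the output fits your pattern, and repeat for the other encodings. This is expensive and scales very poorly, but it is doable for a small number of files. But is it the right approach? Maybe look at it from a different angle and figure out how to avoid receiving the data in a different encoding in the first place.
@sezi80: The encoding of the files is predefined, so there is no other option for solving this.

【Solution 3】:

For Scanner to work with different encodings, you must pass the correct encoding to the Scanner constructor.

To determine a file's encoding, it is best to use an external library (e.g. https://github.com/albfernandez/juniversalchardet). But if you know for certain which encodings are possible, you can check manually, following Wikipedia: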

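import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedList;
import java.util.List;
import java.util.Scanner;

public static void main(String... args) throws IOException {
    List<String> lines = readLinesFromFile(new File("d:/utf8.txt"));
}

public static List<String> readLinesFromFile(File file) throws IOException {
    try (Scanner scan = new Scanner(file, getCharsetName(file))) {
        List<String> lines = new LinkedList<>();

        // hasNextLine() pairs with nextLine(); hasNext() would skip trailing blank lines
        while (scan.hasNextLine())
            lines.add(scan.nextLine());

        return lines;
    }
}

private static String getCharsetName(File file) throws IOException {
    try (InputStream in = new FileInputStream(file)) {
        // EF BB BF is the UTF-8 byte order mark (BOM)
        if (in.read() == 0xEF && in.read() == 0xBB && in.read() == 0xBF)
            return StandardCharsets.UTF_8.name();
        // No BOM found: assume US-ASCII (this is wrong for UTF-8 files saved without a BOM)
        return StandardCharsets.US_ASCII.name();
    }
}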
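The same manual check can be extended to the UTF-16 byte order marks. The sketch below is an assumption-laden variant, not the answerer's code: the class `BomSniffer` is hypothetical, it works on the first bytes of a file already read into an array, and it returns null when no BOM is present so the caller can pick its own fallback instead of assuming ASCII.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
final class BomSniffer {

    /**
     * Returns the charset indicated by a leading byte order mark, or null if
     * there is none. UTF-32 BOMs are deliberately not handled in this sketch.
     */
    static Charset charsetFromBom(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return StandardCharsets.UTF_8;
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return StandardCharsets.UTF_16BE;
        }
        if (startsWith(head, 0xFF, 0xFE)) {
            return StandardCharsets.UTF_16LE;
        }
        return null; // no BOM: the encoding is still unknown
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            // Mask to compare as unsigned values, since Java bytes are signed
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        byte[] withoutBom = {'h', 'i'};
        System.out.println(charsetFromBom(withBom));
        System.out.println(charsetFromBom(withoutBom));
    }
}
```

Java's UTF-8 decoder does not strip the BOM, so after detecting one the first decoded character (U+FEFF) still has to be skipped by the caller.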

【Comments】:

It returns "US_ASCII" for all files, but the output is incorrect.
That is because your UTF files have no marker (BOM) at the start.