Tika is not detecting plain ASCII input

【Question title】: Tika is not detecting plain ascii input
【Posted】: 2020-11-20 20:47:51
【Question】:

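We have a byte sequence as input, and we need to check whether it is UTF-8, plain ASCII, or something else. In other words, we must reject input in ISO-8859-X (Latin-X) or any other encoding.

Our first choice was Tika, but we ran into a problem: plain ASCII input (input with no accented characters at all) is often detected as ISO-8859-2 or ISO-8859-1!

This is the problematic part: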

    CharsetDetector detector = new CharsetDetector();
    String ascii = "Only ascii Visible:a;Invisible:GUID\nX;XXddd\n";
    detector.setText(ascii.getBytes());
    System.out.println("detected charset: " + detector.detect().getName());
    String ascii2 = "Only ascii plain english text";
    detector.setText(ascii2.getBytes());
    System.out.println("detected charset: " + detector.detect().getName());
    String ascii3 = "this is ISO-8859-2 do not know why";
    detector.setText(ascii3.getBytes());
    System.out.println("detected charset: " + detector.detect().getName());
    String ascii4 = "this is UTF-8 but tell me why o why maybe sdlkfjlksdjlkfjlksdjflkjlskdjflkjsdjkflkdsjlkfjldsjlkfjldkjkfljdlkjsdfhjshdkjfhjksdhjfkksdfksjdfhkjsdhj";
    detector.setText(ascii4.getBytes());
    System.out.println("detected charset: " + detector.detect().getName());

This is the output:

detected charset: ISO-8859-2
detected charset: ISO-8859-1
detected charset: ISO-8859-2
detected charset: UTF-8

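How should I use Tika to get sensible results?

PS: here is a mini demo: https://github.com/riskop/tikaproblem

【Comments】:

Use longer text strings? Detection is probabilistic, and these strings are very short.
The input is actually the content of CSV files. These files contain value lists for an application. Some of the files are short, less than 100 bytes. That is what I have.
Many encodings (e.g. the ISO-8859 family) share a common set of characters in the 7-bit range (English letters and so on). I suggest you look at the actual character tables and think about what your requirement really means...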
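The last comment can be illustrated without Tika. The sketch below uses only the JDK; the class name `AsciiCheck` and the method `isPureAscii` are hypothetical names chosen for this illustration. It shows that bytes in the 7-bit range decode to the same string under US-ASCII, ISO-8859-1 and UTF-8, which is why a detector cannot tell these encodings apart for such input, and that a "plain ASCII" test can be a simple range check.

```java
import java.nio.charset.StandardCharsets;

final class AsciiCheck {

    /** Returns true when every byte is in the 7-bit range 0x00..0x7F. */
    static boolean isPureAscii(byte[] data) {
        for (byte b : data) {
            // Java bytes are signed, so 0x80..0xFF appear as negative values.
            if (b < 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] data = "Only ascii plain english text".getBytes(StandardCharsets.US_ASCII);

        // The same 7-bit bytes, decoded with three different charsets.
        String asAscii = new String(data, StandardCharsets.US_ASCII);
        String asLatin1 = new String(data, StandardCharsets.ISO_8859_1);
        String asUtf8 = new String(data, StandardCharsets.UTF_8);

        System.out.println("same text in all three: "
                + (asAscii.equals(asLatin1) && asLatin1.equals(asUtf8)));
        System.out.println("pure ASCII: " + isPureAscii(data));
    }
}
```

For input like this, any of the reported charsets would decode the bytes correctly, so the detector's ranking reflects language statistics rather than a real difference in the bytes.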

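Tags: encoding detection apache-tika

【Solution 1】:

The detector has a detectAll() method, which returns all the encodings Tika considers a match for the input. I could solve my problem by following this rule: if UTF-8 is among the matching encodings, the input is accepted (because it is possibly UTF-8); otherwise the input is rejected as not UTF-8.

I understand that Tika must use heuristics, and that some inputs can be valid UTF-8 and valid text in another encoding at the same time.

For example: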

    bytes = "Only ascii plain english text".getBytes("UTF-8");
    printCharsetArray(new CharsetDetector().setText(bytes).detectAll());

Result:

Match of ISO-8859-1 in nl with confidence 40
Match of ISO-8859-2 in ro with confidence 30
Match of UTF-8 with confidence 15
Match of ISO-8859-9 in tr with confidence 10
Match of Big5 in zh with confidence 10
Match of EUC-KR in ko with confidence 10
Match of EUC-JP in ja with confidence 10
Match of GB18030 in zh with confidence 10
Match of Shift_JIS in ja with confidence 10
Match of UTF-16LE with confidence 10
Match of UTF-16BE with confidence 10

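This is usable in my case: although the two "best" matches are ISO-8859-1 and ISO-8859-2, the third best is UTF-8, so I can accept the input.

It also seems to work for invalid UTF-8 input.

For example 0xC3, 0xA9, 0xA9: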

    bytes = new byte[]{(byte)0xC3, (byte)0xA9, (byte)0xA9}; // illegal utf-8: Cx leading byte followed by two continuation bytes 
    printCharsetArray(new CharsetDetector().setText(bytes).detectAll());

Result:

Match of Big5 in zh with confidence 10
Match of EUC-KR in ko with confidence 10
Match of EUC-JP in ja with confidence 10
Match of GB18030 in zh with confidence 10

This is good: there is no UTF-8 among the matches.

A more likely input is text with accented characters that is not UTF-8 encoded:

    bytes = "this is somethingó not utf8 é".getBytes("ISO-8859-2");
    printCharsetArray(new CharsetDetector().setText(bytes).detectAll());

Result:

Match of ISO-8859-2 in hu with confidence 31
Match of ISO-8859-1 in en with confidence 31
Match of KOI8-R in ru with confidence 10
Match of UTF-16LE with confidence 10
Match of UTF-16BE with confidence 10

Which is good, because there is no UTF-8 in the results.
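Since the accept/reject rule only needs to separate "could be UTF-8" from everything else, the same three inputs can also be checked without heuristics. The sketch below is a swapped-in technique, not what the answer does: it uses the JDK's strict UTF-8 decoder instead of Tika, and the names `Utf8Check` and `isValidUtf8` are hypothetical. It reports whether the bytes are well-formed UTF-8; like the detectAll() rule, it cannot prove the author intended UTF-8, because some byte sequences are valid in several encodings.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8Check {

    /** Returns true when the bytes form a well-formed UTF-8 sequence (pure ASCII included). */
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    // Fail on bad input instead of substituting U+FFFD.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] ascii = "Only ascii plain english text".getBytes(StandardCharsets.UTF_8);
        // 0xC3 0xA9 is a valid two-byte sequence; the trailing 0xA9 is a stray continuation byte.
        byte[] illegal = {(byte) 0xC3, (byte) 0xA9, (byte) 0xA9};
        // ISO-8859-1 is used here because the JDK guarantees it; the two accented
        // letters have the same byte values (0xF3, 0xE9) in ISO-8859-2.
        byte[] latin = "this is something\u00f3 not utf8 \u00e9".getBytes(StandardCharsets.ISO_8859_1);

        System.out.println("ascii:   " + isValidUtf8(ascii));
        System.out.println("illegal: " + isValidUtf8(illegal));
        System.out.println("latin:   " + isValidUtf8(latin));
    }
}
```

A strict decode gives a yes/no answer that does not depend on input length, which may suit short CSV files better than a confidence ranking. Tika remains useful when the actual encoding of non-UTF-8 input has to be guessed.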

【Comments】: