Read a text stream code point by code point

【Question title】: Read text stream codepoint by codepoint
【Posted】: 2018-11-12 22:23:03
【Question】:

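I'm trying to read Unicode code points from a text file in Java. The InputStreamReader class returns the stream's contents int by int, which I hoped would do what I want, but it does not compose surrogate pairs.

My test program: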

import java.io.*;
import java.nio.charset.*;

class TestChars {
    public static void main(String args[]) {
        InputStreamReader reader =
            new InputStreamReader(System.in, StandardCharsets.UTF_8);
        try {
            System.out.print("> ");
            int code = reader.read();
            while (code != -1) {
                String s =
                    String.format("Code %x is `%s', %s.",
                                  code,
                                  Character.getName(code),
                                  new String(Character.toChars(code)));
                System.out.println(s);
                code = reader.read();
            }
        } catch (Exception e) {
            // Report failures instead of silently swallowing them.
            System.err.println("Error: " + e);
        }
    }
}

It behaves as follows:

$ java TestChars
> keyboard ⌨. pizza 🍕
Code 6b is `LATIN SMALL LETTER K', k.
Code 65 is `LATIN SMALL LETTER E', e.
Code 79 is `LATIN SMALL LETTER Y', y.
Code 62 is `LATIN SMALL LETTER B', b.
Code 6f is `LATIN SMALL LETTER O', o.
Code 61 is `LATIN SMALL LETTER A', a.
Code 72 is `LATIN SMALL LETTER R', r.
Code 64 is `LATIN SMALL LETTER D', d.
Code 20 is `SPACE',  .
Code 2328 is `KEYBOARD', ⌨.
Code 2e is `FULL STOP', ..
Code 20 is `SPACE',  .
Code 70 is `LATIN SMALL LETTER P', p.
Code 69 is `LATIN SMALL LETTER I', i.
Code 7a is `LATIN SMALL LETTER Z', z.
Code 7a is `LATIN SMALL LETTER Z', z.
Code 61 is `LATIN SMALL LETTER A', a.
Code 20 is `SPACE',  .
Code d83c is `HIGH SURROGATES D83C', ?.
Code df55 is `LOW SURROGATES DF55', ?.
Code a is `LINE FEED (LF)', 
.

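My problem is that the surrogate pair making up the pizza emoji is read as two separate values. I would like to read the symbol into a single int and be done with it.

Question: Is there a reader(-like) class that automatically composes surrogate pairs into code points while reading? (And, presumably, raises an exception if the input is malformed.)

I know I could combine the pairs myself, but I would rather avoid reinventing the wheel.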

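【Comments】:

The int value returned by read() is a UTF-16 char value, not a Unicode code point. The only reason it has type int is so that it can also return -1. The code is doing what it is supposed to do: it returns the two halves of the UTF-16 surrogate pair one at a time.
I know this class doesn't do what I want, which is why I'm asking whether there is another standard class that does.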
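The distinction drawn in the comment above can be seen on a plain String, without any stream involved. The following is a minimal sketch; the class and method names are illustrative and do not come from the thread.

```java
public class CodeUnitsVsCodePoints {
    // Number of UTF-16 code units, i.e. how many values Reader.read() would return.
    static int codeUnitCount(String s) {
        return s.length();
    }

    // Number of Unicode code points; a surrogate pair counts once.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+1F355 SLICE OF PIZZA written as its surrogate pair.
        String pizza = "\uD83C\uDF55";
        System.out.println(codeUnitCount(pizza));        // 2
        System.out.println(codePointCount(pizza));       // 1
        System.out.printf("%x%n", pizza.codePointAt(0)); // 1f355
    }
}
```

Once the text is in a String, the library can already see whole code points; the char-by-char view only appears when reading from the Reader directly.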

Tags: java unicode

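【Solution 1】:

If you take advantage of the fact that String has a method returning a stream of code points, you don't have to deal with surrogate pairs yourself: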

import java.io.*;

class cptest {
    public static void main(String[] args) {
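        // Decode stdin as UTF-8; lines() yields each line without its terminator.
        try (BufferedReader br =
                new BufferedReader(new InputStreamReader(System.in, "UTF-8"))) {
            // String::codePoints already merges surrogate pairs into single ints.
            br.lines().flatMapToInt(String::codePoints).forEach(cptest::print);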
        } catch (Exception e) {
            System.err.println("Error: " + e);
        }
    }
    private static void print(int cp) {
        String s = new String(Character.toChars(cp));
        System.out.println("Character " + cp + ": " + s);
    }
}

This produces:

$ java cptest <<< "keyboard ⌨. pizza 🍕"
Character 107: k
Character 101: e
Character 121: y
Character 98: b
Character 111: o
Character 97: a
Character 114: r
Character 100: d
Character 32:  
Character 9000: ⌨
Character 46: .
Character 32:  
Character 112: p
Character 105: i
Character 122: z
Character 122: z
Character 97: a
Character 32:  
Character 127829: 🍕

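【Comments】:

Thanks. This looks like a reasonable way to let the Java library handle the details. As it happens, my application also needs some lookahead within the same line, so reading the whole line into a String makes that easier too.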
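The lookahead mentioned in the comment could be done by decoding each line into an int array of code points, so that index i + 1 is always a whole code point. This is only a sketch under that assumption: LineLookahead, toCodePoints and peek are hypothetical names, and a StringReader stands in for System.in to keep the example self-contained.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class LineLookahead {
    // Decode one line into code points; surrogate pairs become single ints.
    static int[] toCodePoints(String line) {
        return line.codePoints().toArray();
    }

    // Returns the code point after position i, or -1 at the end of the line.
    static int peek(int[] cps, int i) {
        return i + 1 < cps.length ? cps[i + 1] : -1;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical input: 'a', the pizza emoji as a surrogate pair, 'b'.
        try (BufferedReader br =
                new BufferedReader(new StringReader("a\uD83C\uDF55b\n"))) {
            String line;
            while ((line = br.readLine()) != null) {
                int[] cps = toCodePoints(line);
                for (int i = 0; i < cps.length; i++) {
                    System.out.printf("current=%x next=%d%n", cps[i], peek(cps, i));
                }
            }
        }
    }
}
```

The array costs one extra copy per line, but indexing it is simpler than tracking char offsets with String.offsetByCodePoints.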

【Solution 2】:

You can wrap the Reader instance in a simple class that decodes surrogate pairs:

import java.io.Closeable;
import java.io.IOException;
import java.io.Reader;

public class CodepointStream implements Closeable {

    private Reader reader;

    public CodepointStream(Reader reader) {
        this.reader = reader;
    }

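    /** Returns the next code point, or -1 at end of stream. */
    public int read() throws IOException {
        int unit0 = reader.read();
        if (unit0 < 0)
            return unit0; // EOF

        // BMP characters are a single UTF-16 unit and pass through as-is.
        if (!Character.isSurrogate((char)unit0))
            return unit0;

        // A surrogate must be a high one followed by a low one.
        if (!Character.isHighSurrogate((char)unit0))
            throw new RuntimeException("Unpaired low surrogate");

        // EOF in the middle of a pair is malformed input, not a clean end.
        int unit1 = reader.read();
        if (unit1 < 0 || !Character.isLowSurrogate((char)unit1))
            throw new RuntimeException("Invalid surrogate pair");

        return Character.toCodePoint((char)unit0, (char)unit1);
    }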

    public void close() throws IOException {
        reader.close();
        reader = null;
    }
}

The main function needs only a slight modification:

import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public final class App {
    public static void main(String args[]) {
        CodepointStream reader = new CodepointStream(
                new InputStreamReader(System.in, StandardCharsets.UTF_8));
        try {
            System.out.print("> ");
            int code = reader.read();
            while (code != -1) {
                String s =
                        String.format("Code %x is `%s', %s.",
                                code,
                                Character.getName(code),
                                new String(Character.toChars(code)));
                System.out.println(s);
                code = reader.read();
            }
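        } catch (Exception e) {
            // Report failures (including invalid surrogate pairs) instead of hiding them.
            System.err.println("Error: " + e);
        }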
    }
}

Your output then becomes:

> keyboard ⌨. pizza 🍕
Code 6b is `LATIN SMALL LETTER K', k.
Code 65 is `LATIN SMALL LETTER E', e.
Code 79 is `LATIN SMALL LETTER Y', y.
Code 62 is `LATIN SMALL LETTER B', b.
Code 6f is `LATIN SMALL LETTER O', o.
Code 61 is `LATIN SMALL LETTER A', a.
Code 72 is `LATIN SMALL LETTER R', r.
Code 64 is `LATIN SMALL LETTER D', d.
Code 20 is `SPACE',  .
Code 2328 is `KEYBOARD', ⌨.
Code 2e is `FULL STOP', ..
Code 20 is `SPACE',  .
Code 70 is `LATIN SMALL LETTER P', p.
Code 69 is `LATIN SMALL LETTER I', i.
Code 7a is `LATIN SMALL LETTER Z', z.
Code 7a is `LATIN SMALL LETTER Z', z.
Code 61 is `LATIN SMALL LETTER A', a.
Code 20 is `SPACE',  .
Code 1f355 is `SLICE OF PIZZA', 🍕.
Code a is `LINE FEED (LF)', 
.

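【Comments】:

Thanks. I accepted the other answer because it offers a way to let the Java library do the work. Otherwise, this is similar to what I came up with. Note that you can use Character.toCodePoint and related methods to get rid of the magic constants.
Thanks for the tip about the Character class. I've updated the code accordingly.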
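For readers wondering what the "magic constants" in the comment refer to, the sketch below puts a hand-written surrogate combination next to Character.toCodePoint. The thread does not show the earlier version of the answer, so combineByHand is only an assumption about what such code typically looks like; SurrogateMath and both method names are illustrative.

```java
public class SurrogateMath {
    // Hand-written combination using the UTF-16 range constants directly.
    static int combineByHand(char high, char low) {
        return ((high - 0xD800) << 10) + (low - 0xDC00) + 0x10000;
    }

    // Same result via the standard library, with no constants in user code.
    static int combineWithLibrary(char high, char low) {
        return Character.toCodePoint(high, low);
    }

    public static void main(String[] args) {
        // The two halves of U+1F355 SLICE OF PIZZA.
        char high = '\uD83C';
        char low = '\uDF55';
        // Prints "1f355 1f355".
        System.out.printf("%x %x%n",
                combineByHand(high, low), combineWithLibrary(high, low));
    }
}
```

Neither method validates its arguments, so callers still need Character.isHighSurrogate and Character.isLowSurrogate checks like those in CodepointStream.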