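How to read a Unicode G-Clef (U+1D11E) from a file? Answers

【Question title】: How to read a Unicode G-Clef (U+1D11E) from a file?
【Posted】: 2013-06-28 09:14:17
【Question】:

The G-Clef (U+1D11E) is not part of the Basic Multilingual Plane (BMP), which means it requires more than 16 bits. Almost all of Java's read functions return only a char, or an int that also contains only 16 bits. Which function reads complete Unicode symbols, including SMP, SIP, TIP, SSP and PUA?

Update

I asked how to read a single Unicode symbol (or code point) from an input stream. I do not have an integer array, and I do not want to read a whole line.

It is possible to build a code point with Character.toCodePoint(), but this function requires chars. On the other hand, reading a char is not possible, because read() returns an int. My best workaround so far is this, but it still contains unsafe casts: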

public int read_code_point (Reader input) throws java.io.IOException
{
  int ch16 = input.read();
  if (Character.isHighSurrogate((char)ch16))
    return Character.toCodePoint((char)ch16, (char)input.read());
  else 
    return (int)ch16;
}

How can this be done better?

Update 2

Another version that returns a String but still uses casts:

public String readchar (Reader input) throws java.io.IOException
{
  int i16 = input.read(); // UTF-16 as int
  if (i16 == -1) return null;
  char c16 = (char)i16; // UTF-16
  if (Character.isHighSurrogate(c16)) {
    int low_i16 = input.read(); // low surrogate UTF-16 as int
    if (low_i16 == -1)
      throw new java.io.IOException ("Can not read low surrogate");
    char low_c16 = (char)low_i16;
    int codepoint = Character.toCodePoint(c16, low_c16);
    return new String (Character.toChars(codepoint));
  }
  else 
    return Character.toString(c16);
}

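Remaining question: are the casts safe? Or how can they be avoided?

【Comments】:

There is no need to add the main tag to the title.
Possible duplicate of Java reading in character streams with supplementary unicode characters
The possible duplicate does not contain an answer.
Both of your answers are "correct" (although the first one does not handle end of stream). There is nothing unsafe about your casts.

Tags: java unicode

【Answer 1】:

My best workaround so far is this, but it still contains unsafe casts

The only unsafe thing about the code you have given is that ch16 may be -1 if input has reached EOF. If you check for this condition first, then you can guarantee that the other (char) casts are safe, because Reader.read() is specified to return either -1 or a value within the range of char (0 - 0xFFFF).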

public int read_code_point (Reader input) throws java.io.IOException
{
  int ch16 = input.read();
  if (ch16 < 0 || !Character.isHighSurrogate((char)ch16))
    return ch16;
  else {
    int loSurr = input.read();
    if(loSurr < 0 || !Character.isLowSurrogate((char)loSurr)) 
      return ch16; // or possibly throw an exception
    else 
      return Character.toCodePoint((char)ch16, (char)loSurr);
  }
}
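To make the EOF point concrete, here is a minimal sketch of what an unchecked cast does to the -1 that Reader.read() returns at end of stream. The class EofCastDemo and the helper blindCast are hypothetical names used only for this illustration.

```java
class EofCastDemo {
    // Hypothetical helper: applies the same unchecked narrowing cast
    // that the question's code applies to the result of Reader.read().
    static char blindCast(int readResult) {
        return (char) readResult;
    }

    public static void main(String[] args) {
        char fromEof = blindCast(-1);
        // -1 narrows to 0xFFFF, so EOF can no longer be told apart
        // from a genuine U+FFFF char once the cast has happened.
        System.out.println(Integer.toHexString(fromEof));       // ffff
        System.out.println(Character.isHighSurrogate(fromEof)); // false
    }
}
```

This is why the check against -1 has to happen on the int, before any cast to char.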

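The read_code_point above still isn't ideal: you really need to handle the edge case where the first char read is a high surrogate but the second is not a matching low surrogate. In that case you probably want to return the first char as is and back up the reader, so that the next read gives you the next character. But that only works if input.markSupported() == true. If you can guarantee that, then how about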

public int read_code_point (Reader input) throws java.io.IOException
{
  int firstChar = input.read();
  if (firstChar < 0 || !Character.isHighSurrogate((char)firstChar)) {
    return firstChar;
  } else {
    input.mark(1);
    int secondChar = input.read();
    if(secondChar < 0) {
      // reached EOF
      return firstChar;
    } else if(!Character.isLowSurrogate((char)secondChar)) {
      // unpaired surrogates, un-read the second char
      input.reset();
      return firstChar;
    }
    else {
      return Character.toCodePoint((char)firstChar, (char)secondChar);
    }
  }
}

Alternatively, you could wrap the original reader in a PushbackReader and use unread(secondChar).
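A minimal sketch of that PushbackReader variant might look like the following. It mirrors the mark/reset version, so it does not depend on markSupported(). The class name PushbackCodePointReader and the method name readCodePoint are assumptions made for this example, not part of the original answer.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.Reader;
import java.io.StringReader;

class PushbackCodePointReader {
    // Reads one code point, or returns -1 at end of stream.
    // An unpaired high surrogate is returned as is; the char that
    // followed it is pushed back so the next call sees it again.
    static int readCodePoint(PushbackReader input) throws IOException {
        int firstChar = input.read();
        if (firstChar < 0 || !Character.isHighSurrogate((char) firstChar)) {
            return firstChar;
        }
        int secondChar = input.read();
        if (secondChar < 0) {
            return firstChar; // EOF right after a high surrogate
        }
        if (!Character.isLowSurrogate((char) secondChar)) {
            input.unread(secondChar); // not a pair: un-read the second char
            return firstChar;
        }
        return Character.toCodePoint((char) firstChar, (char) secondChar);
    }

    public static void main(String[] args) throws IOException {
        // "\uD834\uDD1E" is the UTF-16 surrogate pair for U+1D11E.
        Reader source = new StringReader("a\uD834\uDD1Eb");
        try (PushbackReader in = new PushbackReader(source)) {
            int cp;
            while ((cp = readCodePoint(in)) >= 0) {
                System.out.println("U+" + Integer.toHexString(cp).toUpperCase());
            }
        }
    }
}
```

The default one-char pushback buffer is enough here, because at most one char is ever un-read per call.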

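【Comments】:

What do you gain by converting this to a code point? If you want to do anything useful, you most likely want the data in a String.
@jtahlborn Every parser needs the next character, not the next string. Would you say parsers are useless?

【Answer 2】:

Full Unicode can be represented in UTF-8 and in UTF-16, as a sequence of bytes or of byte pairs ("Java chars") respectively. To extract the full Unicode code points from a String: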

int[] codePoints = { 0x1d11e };
String s = new String(codePoints, 0, codePoints.length);

for (int i = 0; i < s.length(); ) {
    int cp = s.codePointAt(i);
    i += Character.charCount(cp);
}
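On Java 8 or later, the same walk can also be expressed with String.codePoints(). The sketch below is an alternative to the loop above rather than part of the original answer; the class name CodePointWalk and the helper codePointsOf are hypothetical.

```java
class CodePointWalk {
    // Returns the code points of s, with each surrogate pair combined
    // into a single int.
    static int[] codePointsOf(String s) {
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        int[] input = { 0x68, 0x1d11e }; // 'h' followed by the G-Clef
        String s = new String(input, 0, input.length);

        // length() counts UTF-16 chars, so the G-Clef counts twice.
        System.out.println(s.length());              // 3
        System.out.println(codePointsOf(s).length);  // 2
    }
}
```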

For a file of mostly Latin characters, UTF-8 seems fine.

The following reads a full standard Unicode file (in UTF-8 format):

try (BufferedReader in = new BufferedReader(
        new InputStreamReader(new FileInputStream(file), "UTF-8"))) {
    for (;;) {
        String line = in.readLine();
        if (line == null) {
            break;
        }
        ... do some thing with a Unicode line ...
    }
} catch (FileNotFoundException e) {
    System.err.println("No file: " + file.getPath());
} catch (IOException e) {
    ...
}

A function that builds a Java String from one or more Unicode code points:

String s = unicodeToString(0x1d11e);
String s = unicodeToString(0x68, 0x65, 0x6c, 0x6c, 0x6f, 0x1d11e);

public static String unicodeToString(int... codePoints) {
    return new String(codePoints, 0, codePoints.length);
}

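【Comments】:

To elaborate: here I read from a file, via FileInputStream. What may be confusing is that Unicode itself is not a format but a standard numbering of symbols. UTF-8, UTF-16LE, UTF-16BE and UTF-16 are the actual binary formats. In fact, Java uses Unicode in two formats: while char is UTF-16, String constants in a .class file are stored as UTF-8. UTF-8 covers the full Unicode range. In the code above, the array codePoints holds Unicode numbers.
The question asks for a single symbol, not a whole line. Using readLine would require un-reading the rest of the line.
Aha, added that to the answer.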