【Question title】: Is there an iconv with //TRANSLIT equivalent in Java?
【Posted】: 2011-04-27 15:35:43
【Question description】:

Is there a way to transliterate characters between charsets in Java? Something similar to the Unix command (or the similar PHP function):

iconv -f UTF-8 -t ASCII//TRANSLIT < some_doc.txt  > new_doc.txt

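Preferably it should operate on strings and have nothing to do with files.

I know you can change encodings with the String constructor, but that does not handle transliteration of characters that do not exist in the resulting charset.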

【Comments】:

    Tags: java iconv


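    【Answer 1】:

    I'm not aware of any library that does exactly what iconv purports to do (which doesn't seem to be very well defined). However, you can use "normalization" in Java to do things like removing accents from characters. This process is well defined by the Unicode standard.

    I think NFKD (compatibility decomposition) followed by filtering out non-ASCII characters might get you close to what you want. Obviously, this is a lossy process; you can never recover all of the information in the original string, so be careful.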

    /* Decompose original "accented" string to basic characters. */
    String decomposed = Normalizer.normalize(accented, Normalizer.Form.NFKD);
    /* Build a new String with only ASCII characters. */
    StringBuilder buf = new StringBuilder();
    for (int idx = 0; idx < decomposed.length(); ++idx) {
      char ch = decomposed.charAt(idx);
      if (ch < 128)
        buf.append(ch);
    }
    String filtered = buf.toString();
    

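    With the filtering used here, you might render some strings unreadable. For example, a string of Chinese characters would be filtered out entirely, because none of them has an ASCII representation (this is more like iconv's //IGNORE).

    Overall, it would be safer to build your own lookup table of valid character substitutions, or at least of combining characters (accents and the like) that are safe to strip. The best solution depends on the range of input characters you expect to handle.

    【Comments】: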
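    As a rough illustration of the "strip only combining characters" idea, the sketch below decomposes with NFD and removes Unicode combining marks while leaving everything else in place. The class and method names (`StripAccents`, `stripAccents`) are made up for this example, and using the `\p{M}` category as the definition of "safe to strip" is an assumption you should check against your own data.

```java
import java.text.Normalizer;

public class StripAccents {

    /**
     * Decomposes the input canonically (NFD) and removes combining marks.
     * Characters without a decomposition (for example CJK ideographs) are kept as-is.
     */
    public static String stripAccents(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // \p{M} is the Unicode "Mark" category, which covers combining accents
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // "Mich\u00e8le" is "Michele" with a grave accent on the first e
        System.out.println(stripAccents("Mich\u00e8le"));
    }
}
```

    Unlike the `ch < 128` filter, this does not guarantee an ASCII-only result: anything that is not a combining mark survives, so a second step is still needed for characters the target charset cannot encode.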

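    • Thanks for the tips, Ericson. The characters giving me the most trouble are things like the ellipsis character, long dashes, directional quotes, etc. Also, UTF-8 to ASCII was just an example, as I need to convert Windows-1252 to ISO-8859-1 as well. Will this normalization technique work in those cases?
    • @Keith - Normalization will work, but the filtering will be less useful. Given the characters you are dealing with, it sounds like an explicit, "hand-crafted" replacement table might work best. There are probably libraries out there with such tables, but I'm not familiar with any.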
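    A hand-crafted table for the punctuation mentioned in this exchange could look roughly like the sketch below. The class name, the method name and the chosen replacements are all assumptions made for illustration; the entries are only a starting point and should be extended for the characters that actually occur in your input.

```java
import java.util.HashMap;
import java.util.Map;

public class PunctuationTable {

    // Hypothetical hand-made table; the replacement strings are a matter of taste.
    private static final Map<Character, String> TABLE = new HashMap<>();
    static {
        TABLE.put('\u2026', "...");  // horizontal ellipsis
        TABLE.put('\u2013', "-");    // en dash
        TABLE.put('\u2014', "--");   // em dash
        TABLE.put('\u2018', "'");    // left single quotation mark
        TABLE.put('\u2019', "'");    // right single quotation mark
        TABLE.put('\u201C', "\"");   // left double quotation mark
        TABLE.put('\u201D', "\"");   // right double quotation mark
    }

    /** Replaces every character found in the table and copies all others unchanged. */
    public static String replacePunctuation(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char ch = input.charAt(i);
            String replacement = TABLE.get(ch);
            if (replacement != null) {
                sb.append(replacement);
            } else {
                sb.append(ch);
            }
        }
        return sb.toString();
    }
}
```

    All keys here are in the Basic Multilingual Plane, so iterating over `char` values is sufficient. The replacements are plain ASCII, so the result of this step can be encoded in ISO-8859-1 as well as in US-ASCII, provided the rest of the input can.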
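    【Answer 2】:

    One solution is to run iconv as an external process. It will certainly offend purists, and it depends on iconv being present on the system, but it works and does exactly what you want: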

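    // Requires java.io.* and java.nio.charset.StandardCharsets
    public static String utfToAscii(String input) throws IOException {
        Process p = Runtime.getRuntime().exec(new String[] {"iconv", "-f", "UTF-8", "-t", "ASCII//TRANSLIT"});
        // Use explicit charsets: iconv is told to expect UTF-8, whatever the platform default is
        BufferedWriter bwo = new BufferedWriter(new OutputStreamWriter(p.getOutputStream(), StandardCharsets.UTF_8));
        BufferedReader bri = new BufferedReader(new InputStreamReader(p.getInputStream(), StandardCharsets.US_ASCII));
        // Note: all input is written before any output is read, so very large inputs may block on full pipe buffers
        bwo.write(input,0,input.length());
        bwo.flush();
        bwo.close();
        String line  = null;
        StringBuilder stringBuilder = new StringBuilder();
        String ls = System.getProperty("line.separator");
        // readLine() strips line terminators, so a separator is appended after every line (including the last)
        while( ( line = bri.readLine() ) != null ) {
            stringBuilder.append( line );
            stringBuilder.append( ls );
        }
        bri.close();
        try {
            p.waitFor();
        } catch ( InterruptedException e ) {
            // Restore the interrupt flag instead of silently swallowing the interruption
            Thread.currentThread().interrupt();
        }
        return stringBuilder.toString();
    }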
    

    【Comments】:

      【Answer 3】:

      Let's start with a slight variation of Ericson's answer and build more //TRANSLIT features on top of it:

      Decompose characters to get an ASCII-String

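      import java.nio.CharBuffer;
      import java.nio.charset.Charset;
      import java.nio.charset.CharsetEncoder;
      import java.text.Normalizer;

      public class Translit {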
      
          private static final Charset US_ASCII = Charset.forName("US-ASCII");
          private static String toAscii(final String input) {
              final CharsetEncoder charsetEncoder = US_ASCII.newEncoder();
              final char[] decomposed = Normalizer.normalize(input, Normalizer.Form.NFKD).toCharArray();
              final StringBuilder sb = new StringBuilder(decomposed.length);
      
              for (int i = 0; i < decomposed.length; ) {
                  final int codePoint = Character.codePointAt(decomposed, i);
                  final int charCount = Character.charCount(codePoint);
      
                  if(charsetEncoder.canEncode(CharBuffer.wrap(decomposed, i, charCount))) {
                      sb.append(decomposed, i, charCount);
                  }
      
                  i += charCount;
              }
              return sb.toString();
          }
      
      
          public static void main(String[] args) {
              final String a = "Michèleäöüß";
              System.out.println(a + " => " + toAscii(a));
              System.out.println(a.toUpperCase() + " => " + toAscii(a.toUpperCase()));
          }
      }
      

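      While this should behave the same for US-ASCII, this solution is easier to adapt to different target encodings. (Because the characters are decomposed first, it does not necessarily yield better results for other encodings.)

      The function is safe for supplementary code points (which is a bit of overkill with ASCII as the target, but may save headaches if another target encoding is chosen).

      Also note that a regular Java String is returned; if you need an ASCII byte[], you still have to convert it (but we have ensured that there are no offending characters...).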
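      The remaining conversion to bytes can be sketched as follows. The class and method names are made up for this example. The lenient variant relies on `String.getBytes(Charset)`, which silently substitutes unmappable characters, while the strict variant uses a `CharsetEncoder`, whose default behaviour is to report them as an exception.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

public class AsciiBytes {

    /** Lenient: unmappable characters are silently replaced (with '?' for US-ASCII). */
    public static byte[] toAsciiBytes(String text) {
        return text.getBytes(StandardCharsets.US_ASCII);
    }

    /** Strict: throws CharacterCodingException if any character cannot be encoded. */
    public static byte[] toAsciiBytesStrict(String text) throws CharacterCodingException {
        ByteBuffer buffer = StandardCharsets.US_ASCII.newEncoder().encode(CharBuffer.wrap(text));
        byte[] bytes = new byte[buffer.remaining()];
        buffer.get(bytes);
        return bytes;
    }
}
```

      After a filtering step like `toAscii` both variants should produce the same bytes; the strict one is simply a cheap way to verify that nothing slipped through.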

      This is how you could extend it to more character sets:

      Replace or decompose characters to get a String that is encodable in the supplied Charset

      import java.nio.CharBuffer;
      import java.nio.charset.Charset;
      import java.nio.charset.CharsetEncoder;
      import java.text.Normalizer;
      import java.util.Collections;
      import java.util.HashMap;
      import java.util.Map;
      
      /**
       * Created for http://stackoverflow.com/a/22841035/1266906
       */
      public class Translit {
          public static final Charset                  US_ASCII     = Charset.forName("US-ASCII");
          public static final Charset                  ISO_8859_1   = Charset.forName("ISO-8859-1");
          public static final Charset                  UTF_8        = Charset.forName("UTF-8");
          public static final HashMap<Integer, String> REPLACEMENTS = new ReplacementBuilder().put('„', '"')
                                                                                                    .put('“', '"')
                                                                                                    .put('”', '"')
                                                                                                    .put('″', '"')
                                                                                                    .put('€', "EUR")
                                                                                                    .put('ß', "ss")
                                                                                                    .put('•', '*')
                                                                                                    .getMap();
      
          private static String toCharset(final String input, Charset charset) {
              return toCharset(input, charset, Collections.<Integer, String>emptyMap());
          }
      
          private static String toCharset(final String input,
                                          Charset charset,
                                          Map<? super Integer, ? extends String> replacements) {
              final CharsetEncoder charsetEncoder = charset.newEncoder();
              return toCharset(input, charsetEncoder, replacements);
          }
      
          private static String toCharset(String input,
                                          CharsetEncoder charsetEncoder,
                                          Map<? super Integer, ? extends String> replacements) {
              char[] data = input.toCharArray();
              final StringBuilder sb = new StringBuilder(data.length);
      
              for (int i = 0; i < data.length; ) {
                  final int codePoint = Character.codePointAt(data, i);
                  final int charCount = Character.charCount(codePoint);
      
                  CharBuffer charBuffer = CharBuffer.wrap(data, i, charCount);
                  if (charsetEncoder.canEncode(charBuffer)) {
                      sb.append(data, i, charCount);
                  } else if (replacements.containsKey(codePoint)) {
                      sb.append(toCharset(replacements.get(codePoint), charsetEncoder, replacements));
                  } else {
                      // Only perform NFKD normalization after ensuring the original character is invalid, as this is an irreversible process
                      final char[] decomposed = Normalizer.normalize(charBuffer, Normalizer.Form.NFKD).toCharArray();
                      for (int j = 0; j < decomposed.length; ) {
                          int decomposedCodePoint = Character.codePointAt(decomposed, j);
                          int decomposedCharCount = Character.charCount(decomposedCodePoint);
      
                          if (charsetEncoder.canEncode(CharBuffer.wrap(decomposed, j, decomposedCharCount))) {
                              sb.append(decomposed, j, decomposedCharCount);
                          } else if (replacements.containsKey(decomposedCodePoint)) {
                              sb.append(toCharset(replacements.get(decomposedCodePoint), charsetEncoder, replacements));
                          }
      
                          j += decomposedCharCount;
                      }
                  }
      
                  i += charCount;
              }
              return sb.toString();
          }
      
      
          public static void main(String[] args) {
              final String a = "Michèleäöü߀„“”″•";
              System.out.println(a + " => " + toCharset(a, US_ASCII));
              System.out.println(a + " => " + toCharset(a, ISO_8859_1));
              System.out.println(a + " => " + toCharset(a, UTF_8));
      
              System.out.println(a + " => " + toCharset(a, US_ASCII, REPLACEMENTS));
              System.out.println(a + " => " + toCharset(a, ISO_8859_1, REPLACEMENTS));
              System.out.println(a + " => " + toCharset(a, UTF_8, REPLACEMENTS));
          }
      
          public static class MapBuilder<K, V> {
      
              private final HashMap<K, V> map;
      
              public MapBuilder() {
                  map = new HashMap<K, V>();
              }
      
              public MapBuilder<K, V> put(K key, V value) {
                  map.put(key, value);
                  return this;
              }
      
              public HashMap<K, V> getMap() {
                  return map;
              }
          }
      
          public static class ReplacementBuilder extends MapBuilder<Integer, String> {
              public ReplacementBuilder() {
                  super();
              }
      
              @Override
              public ReplacementBuilder put(Integer input, String replacement) {
                  super.put(input, replacement);
                  return this;
              }
      
              public ReplacementBuilder put(Integer input, char replacement) {
                  return this.put(input, String.valueOf(replacement));
              }
      
              public ReplacementBuilder put(char input, String replacement) {
                  return this.put((int) input, replacement);
              }
      
              public ReplacementBuilder put(char input, char replacement) {
                  return this.put((int) input, String.valueOf(replacement));
              }
          }
      }
      

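      I strongly recommend building an extensive replacement table, as this simple example already shows how you might otherwise lose desired information such as the € sign. For ASCII, this implementation is of course a bit slower, because decomposition is now only done on demand and the StringBuilder may need to grow to hold the replacements.

      GNU's iconv uses the replacements listed in translit.def to perform a //TRANSLIT conversion. If you want to use that file as your replacement map, you can use a method like this:

      Import the original //TRANSLIT replacements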

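      private static Map<Integer, String> readReplacements() {
          HashMap<Integer, String> map = new HashMap<>();
          // Expected line format: hex code point, TAB, replacement, TAB, "#" comment
          Pattern pattern = Pattern.compile("^([0-9A-Fa-f]+)\t(.?[^\t]*)\t#(.*)$");
          // translit.def must be on the classpath root; getResourceAsStream returns null if it is missing
          InputStream stream = Translit.class.getResourceAsStream("/translit.def");
          if (stream == null) {
              return map;
          }
          // try-with-resources closes the reader (and the underlying stream) when done
          try (BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(stream, UTF_8))) {
              String line;
              while ((line = bufferedReader.readLine()) != null) {
                  // Skip blank lines and comment lines
                  if (!line.isEmpty() && line.charAt(0) != '#') {
                      Matcher matcher = pattern.matcher(line);
                      if (matcher.find()) {
                          map.put(Integer.valueOf(matcher.group(1), 16), matcher.group(2));
                      }
                  }
              }
          } catch (IOException e) {
              e.printStackTrace();
          }
          return map;
      }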
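      To see what the regular expression extracts, here is a small self-contained sketch that applies the same pattern to a few in-memory lines instead of a file. The class name and the sample lines are made up for illustration; they only follow the tab-separated layout the pattern implies and are not copied from the real translit.def.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TranslitLineDemo {

    // Same pattern as in readReplacements(): hex code point, TAB, replacement, TAB, "#" comment
    private static final Pattern LINE = Pattern.compile("^([0-9A-Fa-f]+)\t(.?[^\t]*)\t#(.*)$");

    /** Parses lines in the assumed translit.def layout into a code point -> replacement map. */
    public static Map<Integer, String> parse(String... lines) {
        Map<Integer, String> map = new HashMap<>();
        for (String line : lines) {
            // Skip blank lines and comment lines
            if (line.isEmpty() || line.charAt(0) == '#') {
                continue;
            }
            Matcher matcher = LINE.matcher(line);
            if (matcher.find()) {
                map.put(Integer.valueOf(matcher.group(1), 16), matcher.group(2));
            }
        }
        return map;
    }

    public static void main(String[] args) {
        // Illustrative sample lines, not taken from the real file
        Map<Integer, String> map = parse(
                "# a comment line",
                "",
                "2026\t...\t# HORIZONTAL ELLIPSIS",
                "20AC\tEUR\t# EURO SIGN");
        System.out.println(map.get(0x20AC));
    }
}
```

      The resulting map has the same shape as `REPLACEMENTS` above, so it could presumably be passed to `toCharset` as the replacements argument.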
      

      【Comments】:
