【Question title】: Tokenize Devanagari words into letters
【Posted】: 2014-07-31 10:59:42
【Question】:

I have something like this:

a = "बिक्रम मेरो नाम हो"

In Java, I want to end up with something like this:

a[0] = बि
a[1] = क्र
a[2] = म

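【Comments】:

  • Are you working with the Hindi language?
  • Have you tried String.toCharArray()?
  • Do you want to replace these values, or just retrieve them?
  • @sneha ideone.com/vCqkKS We can get something like this, but I don't know how to group these Hindi letters :-/
  • I think you are looking for this: stackoverflow.com/a/25398990/222861

Tags: java split word hindi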


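【Answer 1】:

Java stores strings internally as UTF-16, so every character in the Devanagari block (U+0900 to U+097F) occupies a single 2-byte char and you can safely access the characters individually.

【Comments】:

  • The problem is that the "visual" letter बि is actually a combination of two Unicode characters, ब and ि. The question is not about accessing individual char values, but about how to group combining characters together.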
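
The point made in this comment can be checked directly. The following is a minimal sketch, not part of the original thread; the class name `CodePointDemo` and the method `describe` are made up for illustration. It lists the `char` values behind a string together with their Unicode names, using only `Character.getName` from the JDK.

```java
import java.util.ArrayList;
import java.util.List;

public class CodePointDemo {
    // Returns one entry per char, formatted as "U+XXXX UNICODE NAME".
    // Assumes the text contains no surrogate pairs (true for the main Devanagari block).
    public static List<String> describe(String text) {
        List<String> result = new ArrayList<>();
        for (char c : text.toCharArray()) {
            result.add(String.format("U+%04X %s", (int) c, Character.getName(c)));
        }
        return result;
    }

    public static void main(String[] args) {
        // The single visual letter is stored as a consonant followed by a vowel sign.
        for (String entry : describe("बि")) {
            System.out.println(entry);
        }
    }
}
```

For बि this reports two entries, a letter followed by a dependent vowel sign, which is why indexing by `char` alone cannot return the visual letter.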
【Answer 2】:

Try this:

             String a = "बिक्रम मेरो नाम हो";
             int strLen = a.length();
             char array[] = new char[strLen];
             String strArray1[] = new String[strLen];
             for (int i=0 ; i< strLen ; i++)
             {
                 array[i] = a.charAt(i);
                 strArray1[i] = Character.toString(a.charAt(i));
                 System.out.println ("Index = " + i + "* Char = " +array[i] + "** String =" +strArray1[i] );

             }

Output:

Index = 0* Char = ब** String =ब
Index = 1* Char = ि** String =ि
Index = 2* Char = क** String =क
Index = 3* Char = ्** String =्
Index = 4* Char = र** String =र
Index = 5* Char = म** String =म
Index = 6* Char =  ** String = 
Index = 7* Char = म** String =म
Index = 8* Char = े** String =े
Index = 9* Char = र** String =र
Index = 10* Char = ो** String =ो
Index = 11* Char =  ** String = 
Index = 12* Char = न** String =न
Index = 13* Char = ा** String =ा
Index = 14* Char = म** String =म
Index = 15* Char =  ** String = 
Index = 16* Char = ह** String =ह
Index = 17* Char = ो** String =ो
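
In this output the vowel signs (ि, े, ो, ा) and the virama (्) come out as separate `char` values. As a rough sketch of how they could be regrouped without a library, the code below attaches every combining mark to the preceding letter and keeps a letter that directly follows a virama in the same group. The names `AksharaSplitter`, `isMark` and `split` are hypothetical, and the rule is a simplification: it ignores ZWJ/ZWNJ and characters outside the Basic Multilingual Plane.

```java
import java.util.ArrayList;
import java.util.List;

public class AksharaSplitter {
    private static final char VIRAMA = '\u094D'; // Devanagari sign virama (halant)

    // True for Unicode combining marks (general categories Mn, Mc, Me).
    public static boolean isMark(char c) {
        int type = Character.getType(c);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK
            || type == Character.ENCLOSING_MARK;
    }

    // Groups chars into letters: marks join the previous char, and a letter
    // joins the previous char when that char is a virama (conjunct consonant).
    public static List<String> split(String text) {
        List<String> groups = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean afterVirama = i > 0 && text.charAt(i - 1) == VIRAMA;
            boolean joinsPrevious = isMark(c) || (afterVirama && Character.isLetter(c));
            if (!joinsPrevious && current.length() > 0) {
                groups.add(current.toString());
                current.setLength(0);
            }
            current.append(c);
        }
        if (current.length() > 0) {
            groups.add(current.toString());
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> letters = split("बिक्रम मेरो नाम हो");
        for (int i = 0; i < letters.size(); i++) {
            System.out.println("a[" + i + "] = " + letters.get(i));
        }
    }
}
```

Spaces are kept as their own groups here; filter them out if only letters are wanted.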

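Note:

To let Eclipse save a Java program that contains non-ASCII characters (such as Hindi letters), do the following:

Go to:
"Window > Preferences > General > Content Types > Text > {select the file type} > Default encoding > UTF-8" and click Update.

【Comments】: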

    【Answer 3】:

    Have you tried ICU4J?

    A BreakIterator character instance can split a string into user-perceived characters.

    【Comments】:

      【Answer 4】:

      My code is not optimized at all, sorry, but it works!

      Just change the path to the file that contains your Devanagari sentences and it should work.

      import java.io.BufferedReader;
      import java.io.IOException;
      import java.nio.charset.StandardCharsets;
      import java.nio.file.Files;
      import java.nio.file.Paths;
      import java.util.ArrayList;
      import java.util.Arrays;
      import java.util.List;

      public class DevanagariSplitter
      {
          // Independent vowels and consonants. Each one starts a new letter
          // unless it directly follows a virama (halant).
          private static final List<String> DEV_FULL = Arrays.asList(
                  "अ", "आ", "इ", "ई", "उ", "ऊ", "ऋ",
                  "ऌ", "ऍ", "ए", "ऐ", "ऑ", "ओ", "औ",
                  "क", "ख", "ग", "घ", "ङ",
                  "च", "छ", "ज", "झ", "ञ",
                  "ट", "ठ", "ड", "ढ", "ण",
                  "त", "थ", "द", "ध", "न",
                  "प", "फ", "ब", "भ", "म",
                  "य", "र", "ल", "ळ",
                  "व", "श", "ष", "स", "ह");

          private static final char VIRAMA = '\u094D';

          // Splits one line into letters: a full letter opens a new group; vowel
          // signs, the virama and a consonant right after a virama extend the group.
          static List<String> split(String line)
          {
              List<String> letters = new ArrayList<>();
              StringBuilder temp = new StringBuilder();
              for (int i = 0; i < line.length(); i++)
              {
                  char c = line.charAt(i);
                  boolean afterVirama = i > 0 && line.charAt(i - 1) == VIRAMA;
                  boolean startsLetter = DEV_FULL.contains(String.valueOf(c)) && !afterVirama;
                  if (startsLetter && temp.length() > 0)
                  {
                      letters.add(temp.toString());
                      temp.setLength(0);
                  }
                  temp.append(c);
              }
              if (temp.length() > 0)
              {
                  letters.add(temp.toString());
              }
              return letters;
          }

          public static void main(String[] args) throws IOException
          {
              String path = "/home/ubuntu/Documents/trainforjava.txt";   // PLEASE ENTER PATH HERE
              // Read as UTF-8 explicitly instead of relying on the platform default encoding.
              try (BufferedReader br = Files.newBufferedReader(Paths.get(path), StandardCharsets.UTF_8))
              {
                  String line;
                  while ((line = br.readLine()) != null)
                  {
                      line = line.replaceAll(" ", "");  // remove white spaces if any
                      System.out.println();
                      for (String letter : split(line))
                      {
                          System.out.print(" " + letter);
                      }
                  }
              }
          }
      }
      

      【Comments】:

        【Answer 5】:

        Try this for Hindi:

            import java.io.*;
            import java.text.BreakIterator;
            import java.util.Locale;
            
            public class Test {
                public static void main(String[] args) throws IOException
                {
            
                    String text = "बिक्रम मेरो नाम हो";
                    Locale hindi = new Locale("hi", "IN");
                    BreakIterator breaker = BreakIterator.getCharacterInstance(hindi);
                    breaker.setText(text);
                    int start = breaker.first();
                    for (int end = breaker.next();
                         end != BreakIterator.DONE;
                         start = end, end = breaker.next()) {
                        System.out.println(text.substring(start,end));
                    }
                }
            }
        

        Output:

        बि
        क्र
        म
         
        मे
        रो
         
        ना
        म
         
        हो
        

        BreakIterator Java documentation: https://docs.oracle.com/javase/tutorial/i18n/text/about.html
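
        The question asks for the letters in an array rather than printed, and the same `BreakIterator` loop can collect the substrings instead. This is a minimal sketch; the class name `GraphemeList` and the method `toLetters` are illustrative and not from the answer. The exact boundaries, especially for conjuncts such as क्र, depend on the break rules shipped with the JDK in use, so the result may differ between Java versions.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class GraphemeList {
    // Collects the pieces between consecutive character boundaries into a list.
    public static List<String> toLetters(String text, Locale locale) {
        BreakIterator breaker = BreakIterator.getCharacterInstance(locale);
        breaker.setText(text);
        List<String> letters = new ArrayList<>();
        int start = breaker.first();
        for (int end = breaker.next();
             end != BreakIterator.DONE;
             start = end, end = breaker.next()) {
            letters.add(text.substring(start, end));
        }
        return letters;
    }

    public static void main(String[] args) {
        Locale hindi = Locale.forLanguageTag("hi-IN");
        String[] a = toLetters("बिक्रम मेरो नाम हो", hindi).toArray(new String[0]);
        for (int i = 0; i < a.length; i++) {
            System.out.println("a[" + i + "] = " + a[i]);
        }
    }
}
```

        The spaces between words come back as their own entries, so drop them first if only letters are wanted.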

        【Comments】:

          【Answer 6】:

          To split the string by letters rather than by characters, as dvasanth suggested, you can try the following:

               String x = "बिक्रम मेरो नाम हो";
               x = x.replaceAll(" ", ""); // Remove all spaces
               int strLength = x.length();
               // Assumption: each letter is exactly 2 chars long (round up for an odd length).
               String[] letterArray = new String[(strLength + 1) / 2];
               for (int i = 0, j = 0; i < strLength; i = i + 2, j++)
               {
                   // Take two chars, or only one if a single char is left at the end.
                   letterArray[j] = x.substring(i, Math.min(i + 2, strLength));
                   System.out.println("Split string by letters is : " + letterArray[j]);
               }
          

          Output:

          Split string by letters is : बि
          Split string by letters is : क्
          Split string by letters is : रम
          Split string by letters is : मे
          Split string by letters is : रो
          Split string by letters is : ना
          Split string by letters is : मह
          Split string by letters is : ो
          

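          Note:

          To let Eclipse save a Java program that contains non-ASCII characters (such as Hindi letters), do the following:

          Go to:
          "Window > Preferences > General > Content Types > Text > {select the file type} > Default encoding > UTF-8" and click Update.

          【Comments】:

          • Thanks for your answer, but the output is only partially correct. I want to store each letter at a different index in an array.
          • @sneha: Change the variable combined from a String to an array and you will get that result.
          • @sneha: See my updated answer. I added an array called letterArray that holds the letters at different indexes. Hope it helps.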