【Question title】: How to eliminate spaces in a text file in text analysis?
【Posted】: 2015-03-24 04:48:52
【Question description】:

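I am trying to get my program to display the frequency of letters in a text file, but at the moment it displays the letter frequency for every word in the file separately. For example, if the text file contains "i am a man", it prints the letter frequencies 4 times, once for each of the words "i", "am", "a", "man"... I need it to analyse the whole text as one word, that is, remove the spaces and treat it as "iamaman".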

【Question comments】:

    Tags: java frequency analysis


    【Solution 1】:

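    The problem is not that the text contains spaces. In fact, by checking Character.isLetter() before incrementing a count, you already take care of ignoring whitespace.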

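    Mainly, you just need to move your for loops (the ones that total and print the counts) outside the main while loop that iterates over the tokens.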

    import java.util.*;
    import java.io.*;
    
    public class J_ {
    
        public static void main(String[] args)throws Exception {
            // open the file
            Scanner console = new Scanner(System.in);
            System.out.print("What is the name of the text file? ");
            String fileName = console.nextLine();
            Scanner input = new Scanner(new File(fileName));
    
            //initialize array with 26 elements
            int[] letterArray = new int[26]; 
    
            while (input.hasNext()) {
                String next = input.next().toLowerCase();
    
                //run loop for each token, incrementing per character
                for (int i = 0; i < next.length(); i++) {
                    char characters = next.charAt(i);
    
                    //ignore all characters which aren't alphabetic 
                    if (Character.isLetter(characters)) {
    
                        //no second toLowerCase needed: the token was lowercased above
                        //populate array 
                        int index = characters - 'a';
                        letterArray[index]++;
                    }
                }
            }
    
            int total = 0;
            for(int i = 0; i < letterArray.length; i ++) {
                total += letterArray[i];
            }
    
            for (char characters = 'a'; characters <= 'z'; characters++) {
                int index = characters - 'a';
                //print out the analysis
                System.out.println("'" + characters + "' entered " + (((double)letterArray[index] / (double)total) * 100) 
                                   + " percent");
            }
        }
    }
    
    
    $ cat abc.txt
    a b c
    
    $ java J_
    What is the name of the text file? abc.txt
    'a' entered 33.33333333333333 percent
    'b' entered 33.33333333333333 percent
    'c' entered 33.33333333333333 percent
    'd' entered 0.0 percent
    'e' entered 0.0 percent
    'f' entered 0.0 percent
    'g' entered 0.0 percent
    'h' entered 0.0 percent
    'i' entered 0.0 percent
    'j' entered 0.0 percent
    'k' entered 0.0 percent
    'l' entered 0.0 percent
    'm' entered 0.0 percent
    'n' entered 0.0 percent
    'o' entered 0.0 percent
    'p' entered 0.0 percent
    'q' entered 0.0 percent
    'r' entered 0.0 percent
    's' entered 0.0 percent
    't' entered 0.0 percent
    'u' entered 0.0 percent
    'v' entered 0.0 percent
    'w' entered 0.0 percent
    'x' entered 0.0 percent
    'y' entered 0.0 percent
    'z' entered 0.0 percent
    

    【Comments】:

      【Solution 2】:

      If I understand correctly, all you have to do is move the last for loop outside the while loop, like so:

      import java.io.File;
      import java.util.Scanner;
      
      public class JCountlettersfilereader {
        public static void main(String[] args) throws Exception {
          // open the file
          // Scanner console = new Scanner(System.in);
          // System.out.print("What is the name of the text file? ");
          String fileName = "file.txt";
          Scanner input = new Scanner(new File(fileName));
      
          // initialize array with 26 elements
          int[] letterArray = new int[26];
          int totalLetters = 0;
      
          while (input.hasNext()) {
              String next = input.next().toLowerCase();
      
              // run loop for each token, incrementing per character
              for (int i = 0; i < next.length(); i++) {
                  char characters = next.charAt(i);
      
                  // ignore all characters which aren't alphabetic
                  if (Character.isLetter(characters)) {
                      totalLetters++;
                      // no second toLowerCase needed: the token was already lowercased above
      
                      // populate array
                      int index = characters - 'a';
                      letterArray[index]++;
                  }
              }
          }

          // print the analysis once, after the whole file has been read
          for (char characters = 'a'; characters <= 'z'; characters++) {
              int index = characters - 'a';
              System.out.println("'" + characters + "' entered "
                      + (((double) letterArray[index] / (double) totalLetters) * 100)
                      + " percent" + "(" + letterArray[index] + " /" + totalLetters + ")");
          }
        }
      }
      

      Returns:

      'a' entered 42.857142857142854 percent(3 /7)
      ...
      'i' entered 14.285714285714285 percent(1 /7)
      ...
      'm' entered 28.57142857142857 percent(2 /7)
      'n' entered 14.285714285714285 percent(1 /7)

      Is this what you expected?

      【Comments】:

        【Solution 3】:

        One way to remove the spaces is:

        "i am a man".replaceAll(" ", "");
        
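As a minimal sketch of that idea, the spaces can be stripped once and the remaining string counted as a single word. The class name StripSpacesDemo and the helper countLetters are made up for illustration and do not come from the question. Note that replaceAll returns a new string, so its result has to be assigned or used directly.

```java
import java.util.Locale;

class StripSpacesDemo {

    // Hypothetical helper: counts 'a'..'z' in the text and ignores everything else.
    static int[] countLetters(String text) {
        int[] counts = new int[26];
        // replaceAll returns a new String; the original text is left untouched
        String joined = text.replaceAll(" ", "").toLowerCase(Locale.ROOT);
        for (int i = 0; i < joined.length(); i++) {
            char c = joined.charAt(i);
            if (c >= 'a' && c <= 'z') {
                counts[c - 'a']++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] counts = countLetters("i am a man");
        // prints "a=3 i=1 m=2 n=1"
        System.out.println("a=" + counts[0] + " i=" + counts['i' - 'a']
                + " m=" + counts['m' - 'a'] + " n=" + counts['n' - 'a']);
    }
}
```

The explicit range check is used here instead of Character.isLetter() so that a non-ASCII letter cannot produce an index outside the 26-element array.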

        【Comments】:

          【Solution 4】:

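          Move the code that prints the results outside the while loop. It only needs to run once, not once for every word in the file.

          Also, you don't need to convert to lowercase on two different lines.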
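Both points can be sketched together. This is a minimal illustration rather than the asker's actual program: the class PrintOnceDemo and the helper tally are hypothetical names, and the input comes from a string instead of a file.

```java
import java.util.Locale;
import java.util.Scanner;

class PrintOnceDemo {

    // Hypothetical helper: tallies letters over every token and returns the letter total.
    static int tally(Scanner input, int[] counts) {
        int total = 0;
        while (input.hasNext()) {
            // lowercase once per token; no second conversion per character is needed
            String word = input.next().toLowerCase(Locale.ROOT);
            for (int i = 0; i < word.length(); i++) {
                char c = word.charAt(i);
                if (c >= 'a' && c <= 'z') {
                    counts[c - 'a']++;
                    total++;
                }
            }
        }
        return total;
    }

    public static void main(String[] args) {
        int[] counts = new int[26];
        int total = tally(new Scanner("I am a man"), counts);

        // the report runs once, after the while loop has consumed all the input
        for (char c = 'a'; c <= 'z'; c++) {
            if (counts[c - 'a'] > 0) {
                System.out.println("'" + c + "' " + counts[c - 'a'] + "/" + total);
            }
        }
    }
}
```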

          【Comments】:

            【Solution 5】:

            Use replaceAll("[\\s]", "");

            This removes all whitespace (blank lines, tabs, spaces).

            【Comments】:

            • Use a for loop to take the file one line at a time, and put this code inside the for loop to remove the whitespace
            【Solution 6】:
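A rough sketch of that line-by-line approach follows. The class StripWhitespaceDemo and the helper joinWithoutWhitespace are assumed names, and a hard-coded list stands in for the lines of the file (in a real program they could come from Files.readAllLines).

```java
import java.util.Arrays;
import java.util.List;

class StripWhitespaceDemo {

    // Hypothetical helper: joins all lines into one string with every whitespace character removed.
    static String joinWithoutWhitespace(List<String> lines) {
        StringBuilder joined = new StringBuilder();
        for (String line : lines) {
            // "\\s" in Java source is the regex \s: spaces, tabs and other whitespace
            joined.append(line.replaceAll("\\s", ""));
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        // stand-in for the lines of a text file, including a tab and a blank line
        List<String> lines = Arrays.asList("i am\ta", "", "man");
        System.out.println(joinWithoutWhitespace(lines)); // prints "iamaman"
    }
}
```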

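            You can set the delimiter to \\W+ (runs of non-word characters), which means the tokens will not contain any spaces

            Call

            input.useDelimiter("\\W+");
            

            outside the while loop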
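For reference, the Scanner method that changes the delimiter is useDelimiter, and the pattern \\W+ (uppercase W, meaning runs of non-word characters) makes it split on spaces and punctuation alike. A minimal sketch, where the class DelimiterDemo, the helper tokens and the sample string are assumptions for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class DelimiterDemo {

    // Hypothetical helper: splits the text on runs of non-word characters.
    static List<String> tokens(String text) {
        List<String> result = new ArrayList<>();
        try (Scanner input = new Scanner(text)) {
            // set the delimiter once, before the loop that reads the tokens
            input.useDelimiter("\\W+");
            while (input.hasNext()) {
                result.add(input.next());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokens("i am, a man!")); // prints [i, am, a, man]
    }
}
```

Scanner already splits on whitespace by default, so a custom delimiter mainly helps when punctuation should be dropped from the tokens as well. A lowercase \\w would do the opposite and treat the letters themselves as separators.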

            【Comments】:
