【Question title】: Java: Insert newline after every 309th character
【Posted】: 2010-08-03 16:09:59
【Question】:

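Let me preface this by saying I'm new to Java.

I have a file that contains a single line. The file is about 200MB in size. I need to insert a newline character after every 309th character. I believe I have code that does this correctly, but I keep running into memory errors. I've tried increasing the heap space, to no avail.

Is there a less memory-intensive way to handle this?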

BufferedReader r = new BufferedReader(new FileReader(fileName));

String line;

while ((line=r.readLine()) != null) {
  System.out.println(line.replaceAll("(.{309})", "$1\n"));
}

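【Comments】:

  • Commenting on the regex part only (it's not the best way to solve this problem): you don't need group 1 in cases like this. You can refer to group 0 instead, e.g. replaceAll(".{309}", "$0\n").
  • There must be a standard Unix utility that does this, no? Something like columnify 309 text > out? In any case, I think Java is too verbose for something like this.
  • @poly: I actually took the regex from this sed command I've been using: sed 's/(.\{309\})/\1\n/g' file.txt > file_parsed.txt We've started using the Talend ETL tool, so I'd like to be able to do it in Java.
  • Also, thanks for the regex tip!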
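
To illustrate the group 0 tip from the first comment, here is a minimal sketch on a short string. The class name RegexGroupDemo, the method wrap and the width of 3 are made up for the demo; it still builds the whole result in memory, so it does not address the heap problem for a 200MB line.

```java
public class RegexGroupDemo {
    // Inserts a newline after every full run of `width` characters.
    // "$0" refers to the whole match, so no capturing group is needed.
    public static String wrap(String text, int width) {
        return text.replaceAll(".{" + width + "}", "$0\n");
    }

    public static void main(String[] args) {
        // The trailing "gh" is shorter than the width, so it gets no newline.
        System.out.print(wrap("abcdefgh", 3));
    }
}
```

Both "$0" and "(...)" with "$1" produce the same output here; dropping the group just makes the pattern shorter.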

Tags: java split newline


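【Answer 1】:

Your code has two problems:

  1. You load the entire file into memory at once, given that it is a single line, so you need at least 200MB of heap space; and

  2. Adding newlines with a regular expression like that is a really inefficient approach. A straightforward code solution will be an order of magnitude faster.

Both of these problems are easily fixed.

Use a FileReader and a FileWriter to load 309 characters at a time, append a newline and write that out.

Update: added tests for both character-by-character and buffered reads. The buffered read actually adds a fair amount of complexity, because you need to handle the possible (but usually very rare) case where read() returns fewer characters than you asked for while there is still more to read.

First, the simple version: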

private static void charRead(boolean verifyHash) {
  Reader in = null;
  Writer out = null;
  long start = System.nanoTime();
  long wrote = 0;
  MessageDigest md = null;
  try {
    if (verifyHash) {
      md = MessageDigest.getInstance("SHA1");
    }
    in = new BufferedReader(new FileReader(IN_FILE));
    out = new BufferedWriter(new FileWriter(CHAR_FILE));
    int count = 0;
    for (int c = in.read(); c != -1; c = in.read()) {
      if (verifyHash) {
        md.update((byte) c);
      }
      out.write(c);
      wrote++;
      if (++count >= COUNT) {
        if (verifyHash) {
          md.update((byte) '\n');
        }
        out.write("\n");
        wrote++;
        count = 0;
      }
    }
  } catch (IOException e) {
    throw new RuntimeException(e);
  } catch (NoSuchAlgorithmException e) {
    throw new RuntimeException(e);
  } finally {
    safeClose(in);
    safeClose(out);
    long end = System.nanoTime();
    System.out.printf("Created %s size %,d in %,.3f seconds. Hash: %s%n",
        CHAR_FILE, wrote, (end - start) / 1000000000.0d, hash(md, verifyHash));
  }
}

And the "block" version:

private static void blockRead(boolean verifyHash) {
  Reader in = null;
  Writer out = null;
  long start = System.nanoTime();
  long wrote = 0;
  MessageDigest md = null;
  try {
    if (verifyHash) {
      md = MessageDigest.getInstance("SHA1");
    }
    in = new BufferedReader(new FileReader(IN_FILE));
    out = new BufferedWriter(new FileWriter(BLOCK_FILE));
    char[] buf = new char[COUNT + 1]; // leave a space for the newline
    int lastRead = in.read(buf, 0, COUNT); // read in 309 chars at a time
    while (lastRead != -1) { // end of file
      // technically less than 309 characters may have been read
      // this is very unusual but possible so we need to keep
      // reading until we get all the characters we want
      int totalRead = lastRead;
      while (totalRead < COUNT) {
        lastRead = in.read(buf, totalRead, COUNT - totalRead);
        if (lastRead == -1) {
          break;
        } else {
          totalRead += lastRead; // read() may have returned more than one char
        }
      }

      // if we get -1, it'll eventually signal an exit but first
      // we must write any characters we have read
      // note: a newline is appended only after a full block of 309
      // characters; a shorter trailing block is written without one,
      // which may not be what you want
      if (totalRead == COUNT) {
        buf[totalRead++] = '\n';
      }
      if (totalRead > 0) {
        out.write(buf, 0, totalRead);
        if (verifyHash) {
          md.update(new String(buf, 0, totalRead).getBytes("UTF-8"));
        }
        wrote += totalRead;
      }

      // don't try and read again if we've already hit EOF
      if (lastRead != -1) {
        lastRead = in.read(buf, 0, COUNT);
      }
    }
  } catch (IOException e) {
    throw new RuntimeException(e);
  } catch (NoSuchAlgorithmException e) {
    throw new RuntimeException(e);
  } finally {
    safeClose(in);
    safeClose(out);
    long end = System.nanoTime();
    System.out.printf("Created %s size %,d in %,.3f seconds. Hash: %s%n",
        CHAR_FILE, wrote, (end - start) / 1000000000.0d, hash(md, verifyHash));
  }
}

And a method to create the test file:

private static void createFile() {
  Writer out = null;
  long start = System.nanoTime();
  try {
    out = new BufferedWriter(new FileWriter(IN_FILE));
    Random r = new Random();
    for (int i = 0; i < SIZE; i++) {
      out.write(CHARS[r.nextInt(CHARS.length)]);
    }
  } catch (IOException e) {
    throw new RuntimeException(e);
  } finally {
    safeClose(out);
    long end = System.nanoTime();
    System.out.printf("Created %s size %,d in %,.3f seconds%n",
      IN_FILE, SIZE, (end - start) / 1000000000.0d);
  }
}

These all assume:

private static final int SIZE = 200000000;
private static final int COUNT = 309;
private static final char[] CHARS;
private static final char[] BYTES = new char[]{'0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e', 'f'};
private static final String IN_FILE = "E:\\temp\\in.dat";
private static final String CHAR_FILE = "E:\\temp\\char.dat";
private static final String BLOCK_FILE = "E:\\temp\\block.dat";

static {
  char[] chars = new char[1000];
  int nchars = 0;
  for (char c = 'a'; c <= 'z'; c++) {
    chars[nchars++] = c;
    chars[nchars++] = Character.toUpperCase(c);
  }
  for (char c = '0'; c <= '9'; c++) {
    chars[nchars++] = c;
  }
  chars[nchars++] = ' ';
  CHARS = new char[nchars];
  System.arraycopy(chars, 0, CHARS, 0, nchars);
}
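
The methods above call safeClose() and hash(), which are not shown in the answer. The following is only a guess at what they might look like, inferred from how they are called and from the "0x..." and "(not calculated)" strings in the output; the class name BenchmarkHelpers and the HEX array are placeholders.

```java
import java.io.Closeable;
import java.io.IOException;
import java.security.MessageDigest;

public class BenchmarkHelpers {
    private static final char[] HEX = "0123456789abcdef".toCharArray();

    // Assumed behaviour: close quietly, ignoring null and close() failures.
    public static void safeClose(Closeable c) {
        if (c == null) {
            return;
        }
        try {
            c.close();
        } catch (IOException ignored) {
            // nothing useful to do in a benchmark
        }
    }

    // Assumed behaviour: format the digest as "0x" plus lowercase hex,
    // or a fixed marker string when hashing was skipped.
    public static String hash(MessageDigest md, boolean verifyHash) {
        if (!verifyHash || md == null) {
            return "(not calculated)";
        }
        StringBuilder sb = new StringBuilder("0x");
        for (byte b : md.digest()) {
            sb.append(HEX[(b >> 4) & 0xF]).append(HEX[b & 0xF]);
        }
        return sb.toString();
    }
}
```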

Running this test:

public static void main(String[] args) {
  if (!new File(IN_FILE).exists()) {
    createFile();
  }
  charRead(true);
  charRead(true);
  charRead(false);
  charRead(false);
  blockRead(true);
  blockRead(true);
  blockRead(false);
  blockRead(false);
}

Gives this result (Intel Q9450, Windows 7 64-bit, 8GB RAM, test run on a 7200rpm 1.5TB drive):

Created E:\temp\char.dat size 200,647,249 in 29.690 seconds. Hash: 0x22ce9e17e17a67e5ea6f8fe929d2ce4780e8ffa4
Created E:\temp\char.dat size 200,647,249 in 18.177 seconds. Hash: 0x22ce9e17e17a67e5ea6f8fe929d2ce4780e8ffa4
Created E:\temp\char.dat size 200,647,249 in 7.911 seconds. Hash: (not calculated)
Created E:\temp\char.dat size 200,647,249 in 7.867 seconds. Hash: (not calculated)
Created E:\temp\char.dat size 200,647,249 in 8.018 seconds. Hash: 0x22ce9e17e17a67e5ea6f8fe929d2ce4780e8ffa4
Created E:\temp\char.dat size 200,647,249 in 7.949 seconds. Hash: 0x22ce9e17e17a67e5ea6f8fe929d2ce4780e8ffa4
Created E:\temp\char.dat size 200,647,249 in 3.958 seconds. Hash: (not calculated)
Created E:\temp\char.dat size 200,647,249 in 3.909 seconds. Hash: (not calculated)

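Conclusion: SHA1 hash verification is really expensive, which is why I ran versions with and without it. Basically, after warm-up, the "efficient" version is only about 2x as fast. I guess by that point the file is effectively in memory.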

If I reverse the order of the block and character reads, the result is:

Created E:\temp\char.dat size 200,647,249 in 8.071 seconds. Hash: 0x22ce9e17e17a67e5ea6f8fe929d2ce4780e8ffa4
Created E:\temp\char.dat size 200,647,249 in 8.087 seconds. Hash: 0x22ce9e17e17a67e5ea6f8fe929d2ce4780e8ffa4
Created E:\temp\char.dat size 200,647,249 in 4.128 seconds. Hash: (not calculated)
Created E:\temp\char.dat size 200,647,249 in 3.918 seconds. Hash: (not calculated)
Created E:\temp\char.dat size 200,647,249 in 18.020 seconds. Hash: 0x22ce9e17e17a67e5ea6f8fe929d2ce4780e8ffa4
Created E:\temp\char.dat size 200,647,249 in 17.953 seconds. Hash: 0x22ce9e17e17a67e5ea6f8fe929d2ce4780e8ffa4
Created E:\temp\char.dat size 200,647,249 in 7.879 seconds. Hash: (not calculated)
Created E:\temp\char.dat size 200,647,249 in 8.016 seconds. Hash: (not calculated)

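Interestingly, the character-by-character version takes a much bigger initial hit on the first read of the file.

So, as usual, it's a choice between efficiency and simplicity.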

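【Comments】:

    【Answer 2】:

    Open it, read one character at a time, and write that character where it needs to go. Keep a counter, and every time the counter gets large enough, write out a newline and reset the counter to zero.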
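
The counter approach described above could be sketched roughly as follows. The class and method names are illustrative, and the demo uses in-memory StringReader/StringWriter with a width of 3 rather than real files; for a file you would pass buffered file readers and writers instead.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;

public class CounterWrap {
    // Copies `in` to `out`, writing a newline after every `width` characters.
    // Only one character is held at a time, so memory use stays constant.
    public static void wrap(Reader in, Writer out, int width) throws IOException {
        int count = 0;
        int c;
        while ((c = in.read()) != -1) {
            out.write(c);
            if (++count == width) {
                out.write('\n');
                count = 0; // start counting the next line
            }
        }
    }

    public static void main(String[] args) throws IOException {
        StringWriter out = new StringWriter();
        wrap(new StringReader("abcdefgh"), out, 3);
        System.out.print(out); // three lines: abc, def, gh
    }
}
```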

    【Comments】:

    • Then wrap it in a BufferedReader. I was keeping it simple.
    【Answer 3】:

    Read into a byte array of length 309, then write out the number of bytes that were read:

       import java.io.*;
    
    
    
       public class Test {
          public static void main(String[] args) throws Exception  {
             InputStream in = null;
             byte[] chars = new byte[309];
             try   {
                in = new FileInputStream(args[0]);
                int read = 0;
    
                while((read = in.read(chars)) != -1)   {
                   System.out.write(chars, 0, read);
                   System.out.println("");
                }
             }finally {
                if(in != null)  {
                   in.close();
                }
             }
          }
    
       }
    

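    【Comments】:

    • Working on bytes may corrupt data in multi-byte encodings such as utf-8 or utf-16. The encoding isn't specified in the original question, but still. If the 309th byte is the first byte of a multi-byte character, goodbye.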
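
A small demonstration of the concern raised in this comment, assuming UTF-8 input. The class and method names are made up for the demo. Cutting the byte stream in the middle of a two-byte character leaves a fragment that cannot be decoded on its own.

```java
import java.nio.charset.StandardCharsets;

public class SplitMultibyteDemo {
    // Decodes only the first `n` bytes of the UTF-8 encoding of `text`,
    // mimicking a byte-based split that ignores character boundaries.
    public static String decodePrefix(String text, int n) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, 0, n, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String text = "a\u00e9"; // 'a' is 1 byte, U+00E9 is 2 bytes in UTF-8
        String cut = decodePrefix(text, 2);
        // The lone lead byte decodes to the replacement character U+FFFD.
        System.out.println(cut.charAt(1) == '\uFFFD');
    }
}
```

Reading through a Reader with the correct charset, as the other answers do, splits on characters rather than bytes and avoids this.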
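    【Answer 4】:

    Not sure how good this solution is, but you could always read it character by character.

    1. Read in 309 characters and write them to a file. Not sure whether you can do this in one go or whether you have to do it one character at a time
    2. After writing the 309th character, output a newline to the file
    3. Repeat

    For example (using this site):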

    FileInputStream fis = new FileInputStream(file);
    char current;
    int counter = 0;
    // note: read() returns a single byte, so the cast to char
    // assumes a single-byte encoding
       while (fis.available() > 0) {
          current = (char) fis.read();
          counter++;
          // output current to file
          if ((counter % 309) == 0) {
             // output newline character
          }
       }
    

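    【Comments】:

      【Answer 5】:

      Don't use BufferedReader, which keeps much of the underlying file in memory. Use FileReader directly, and then use the read() method to fetch just as much data as you need: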

      FileReader reader = new FileReader(fileName);
      char[] buffer = new char[309];
      int charsRead = 0;
      
      while ((charsRead = reader.read(buffer, 0, buffer.length)) == buffer.length)
      {
          System.out.println(new String(buffer));
      }
      if (charsRead > 0)
      {
           // print any trailing chars
           System.out.println(new String(buffer, 0, charsRead));
      }
      

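      【Comments】:

      • You can set the size of the BufferedReader's buffer to keep it from reading the whole thing at once.
      • -1: You have no guarantee that reader.read() will fill the buffer.
      • BufferedReader does not keep the entire file in memory. The problem is that if the file is a single line, then readLine() will, by definition, read in the whole file.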
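
One way to address the short-read objection in the second comment is to loop until the buffer is full or the stream ends. The helper below is a sketch of that idea, not part of any answer; the names readFully and wrap are hypothetical, and wrap collects its output in a StringBuilder only to keep the demo small.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class ReadFully {
    // Keeps calling read() until `buf` is full or the stream ends.
    // Returns the number of chars stored, which is 0 at end of stream.
    public static int readFully(Reader in, char[] buf) throws IOException {
        int total = 0;
        while (total < buf.length) {
            int n = in.read(buf, total, buf.length - total);
            if (n == -1) {
                break;
            }
            total += n; // read() may return any count from 1 up to the request
        }
        return total;
    }

    // Appends a newline after each full block of `width` characters.
    public static String wrap(Reader in, int width) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[width];
        int n;
        while ((n = readFully(in, buf)) > 0) {
            sb.append(buf, 0, n);
            if (n == width) {
                sb.append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.print(wrap(new StringReader("abcdefgh"), 3));
    }
}
```

With this helper, a block shorter than the requested width can only be the last one in the stream, so the position of each newline no longer depends on how the underlying reader chunks its data.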
      【Answer 6】:

      Wrap your FileReader in a BufferedReader, then keep looping, reading 309 characters at a time.

      Something like this (untested):

      BufferedReader r = new BufferedReader(new FileReader("yourfile.txt"), 1024);
      boolean done = false;
      char[] buffer = new char[309];
      while(!done)
      {
         int read = r.read(buffer,0,309);
         if(read > 0)
         {
           // write the first `read` chars of buffer to the destination, appending a newline
         }
         else
         {
              done = true;
         }
      }
      

      【Comments】:

        【Answer 7】:

        You could change your program to something like this:

         BufferedReader r = new BufferedReader(new FileReader(fileName));
         char[] data = new char[309];
         int read;

         // print only the chars actually read; println() already adds the newline
         while ((read = r.read(data, 0, 309)) > 0) {
             System.out.println(new String(data, 0, read));
         }
        
        

        This is off the top of my head, untested.

        【Comments】:
