Line endings confusion: answers

【Question title】: Line endings confusion
【Posted】: 2012-01-02 12:02:25
【Question】:

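I wrote a simple parser in Java that reads one character at a time and builds words.

I tried to run it under Linux and noticed that looking for '\n' doesn't work, although comparing the character with the value 10 works as expected. According to the ASCII table, the value 10 is LF (line feed). I read somewhere (I don't remember where) that in Java looking for '\n' should be enough to find a newline.

I am using BufferedReader and its read method to read the characters.

EDIT

readLine cannot be used, because it would cause other problems.

The problem seems to appear when I use files with Mac/Windows line endings under Linux.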

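【Comments on the question】:

Please show the actual code.
See line.separator.
Possible duplicate of "Java: How do I get a platform independent new line character?"
Most likely you are doing something wrong. Perhaps you are using readLine() and then scanning the lines?
@trashgod I tried that, same result.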

Tags: java bufferedreader eol

【Solution 1】:

Use readLine() to read the text line by line.

Example

// Needs java.io.BufferedReader, java.io.FileInputStream and java.io.InputStreamReader
try {
  // Open the file (a DataInputStream wrapper is not needed for reading text)
  FileInputStream fstream = new FileInputStream("textfile.txt");
  BufferedReader br = new BufferedReader(new InputStreamReader(fstream));
  String strLine;
  // Read the file line by line; readLine() strips the \n, \r or \r\n terminator
  while ((strLine = br.readLine()) != null) {
    // Print the content on the console
    System.out.println(strLine);
  }
  // Close the reader, which also closes the underlying stream
  br.close();
} catch (Exception e) { // Catch exception if any
  System.err.println("Error: " + e.getMessage());
}

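【Discussion】:

【Solution 2】:

Here are two ways to do it:

1- Read line by line and split each line with a regular expression to get the individual words.

2- Write your own isDelimiter method and use it to check whether you have reached a split condition.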

package misctests;

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertNotNull;
import java.util.ArrayList;
import java.util.List;
import org.junit.Test;


public class SplitToWords {

    String someWords = "Lorem ipsum\r\n(dolor@sit)amet,\nconsetetur!\rsadipscing'elitr;sed~diam";
    String delimsRegEx = "[\\s;,\\(\\)!'@~]+";
    String delimsPlain = ";,()!'@~"; // without whitespaces

    String[] expectedWords = {
        "Lorem",
        "ipsum",
        "dolor",
        "sit",
        "amet",
        "consetetur",
        "sadipscing",
        "elitr",
        "sed",
        "diam"
    };

    private static final class StringReader {
        String input = null;
        int pos = 0;
        int len = 0;
        StringReader(String input) {
            this.input = input == null ? "" : input;
            len = this.input.length();
        }

        public boolean hasMoreChars() {
            return pos < len;
        }

        public int read() {
            return hasMoreChars() ? ((int) input.charAt(pos++)) : 0;
        }
    }

    @Test
    public void splitToWords_1() {
        String[] actual = someWords.split(delimsRegEx);
        assertEqualsWords(expectedWords, actual);
    }

    @Test
    public void splitToWords_2() {
        StringReader sr = new StringReader(someWords);
        List<String> words = new ArrayList<String>();
        StringBuilder sb = null;
        int c = 0;
        while(sr.hasMoreChars()) {
            c = sr.read();
            while(sr.hasMoreChars() && isDelimiter(c)) {
                c = sr.read();
            }
            sb = new StringBuilder();
            while(sr.hasMoreChars() && ! isDelimiter(c)) {
                sb.append((char)c);
                c = sr.read();
            }
            if(! isDelimiter(c)) {
                sb.append((char)c);
            }
            if(sb.length() > 0) { // skip the empty word that trailing delimiters would produce
                words.add(sb.toString());
            }
        }

        String[] actual = new String[words.size()];
        words.toArray(actual);

        assertEqualsWords(expectedWords, actual);
    }

    private boolean isDelimiter(int c) {
        return (Character.isWhitespace(c) ||
            delimsPlain.contains(new String(""+(char)c))); // this part is subject for optimization
    }

    private void assertEqualsWords(String[] expected, String[] actual) {
        assertNotNull(expected);
        assertNotNull(actual);
        assertEquals(expected.length, actual.length);
        for(int i = 0; i < expected.length; i++) {
            assertEquals(expected[i], actual[i]);
        }
    }
}

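【Discussion】:

I'll try to implement it. It will affect a lot of code, but that's my own fault.
All you need is the outer loop in splitToWords_2()... which you probably already have, since you read from the buffered reader one character at a time. The StringReader class is just a kind of simulation/replacement for the buffered reader... note that its read method returns an int, just like BufferedReader's does. For delimsPlain you could use a java.util.Set<Character> and initialize it in a static block, so that isDelimiter can use something like ... || delimsPlain.contains((char)c). Good luck!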
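
The Set<Character> idea from the comment above might look roughly like the sketch below. The class name DelimiterSet and the constant DELIMS are made up for illustration; the delimiter characters are copied from delimsPlain in the answer.

```java
import java.util.HashSet;
import java.util.Set;

public class DelimiterSet {
    // Hypothetical name; holds the same characters as delimsPlain in the answer
    private static final Set<Character> DELIMS = new HashSet<>();

    static {
        // Filled once in a static block, as the comment suggests
        for (char ch : ";,()!'@~".toCharArray()) {
            DELIMS.add(ch);
        }
    }

    // Takes an int because BufferedReader.read() returns an int
    static boolean isDelimiter(int c) {
        return Character.isWhitespace(c) || DELIMS.contains((char) c);
    }

    public static void main(String[] args) {
        System.out.println(isDelimiter('\r'));
        System.out.println(isDelimiter('@'));
        System.out.println(isDelimiter('a'));
    }
}
```

The set lookup replaces the temporary String that the original isDelimiter builds for every character.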

【Solution 3】:

If you read the file character by character, you have to take care of all 3 cases: '\n' for Linux, "\r\n" for Windows and '\r' for Mac.
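
As a rough sketch of handling the three cases by hand: in Java the literal '\n' always has the value 10 and '\r' the value 13, so a reader only has to remember whether the previous character was a '\r'. The class and method names below are invented for this example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class LineEndingScanner {
    // Hypothetical helper: splits the input into lines, accepting \n, \r\n and \r
    static List<String> readLines(Reader source) {
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean lastWasCr = false;
        try (BufferedReader in = new BufferedReader(source)) {
            int c;
            while ((c = in.read()) != -1) {          // read() returns -1 at end of input
                if (c == '\r') {                     // old Mac ending, or first half of \r\n
                    lines.add(current.toString());
                    current.setLength(0);
                    lastWasCr = true;
                } else if (c == '\n') {
                    if (!lastWasCr) {                // lone \n: Unix ending
                        lines.add(current.toString());
                        current.setLength(0);
                    }                                // else: second half of \r\n, already handled
                    lastWasCr = false;
                } else {
                    current.append((char) c);
                    lastWasCr = false;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (current.length() > 0) {
            lines.add(current.toString());           // last line without a terminator
        }
        return lines;
    }

    public static void main(String[] args) {
        System.out.println(readLines(new StringReader("a\r\nb\nc\rd")));
    }
}
```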

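Use the readLine method instead. It takes care of these things for you and returns just the line, without any terminator. After reading each line you can tokenize it to get the individual words.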
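
A minimal sketch of that approach, assuming the delimiters are the ones used in Solution 2 (the class name and the regular expression are illustrative, not part of the question):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class ReadLineTokenizer {
    // Assumed delimiter set: whitespace plus ; , ( ) ! ' @ ~
    private static final String DELIMS_REGEX = "[\\s;,()!'@~]+";

    static List<String> words(Reader source) {
        List<String> result = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            // readLine() accepts \n, \r and \r\n and returns the line without them
            while ((line = in.readLine()) != null) {
                for (String token : line.split(DELIMS_REGEX)) {
                    if (!token.isEmpty()) {   // split() yields "" when a line starts with a delimiter
                        result.add(token);
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words(new StringReader("Lorem ipsum\r\n(dolor@sit)amet,\nconsetetur!\rsadipscing")));
    }
}
```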

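Also consider using the system property "line.separator". It always holds the system-dependent line terminator, which at least makes your code (though not the files it produces) more portable.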
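
For illustration, the property can be read as sketched below (System.lineSeparator() is the equivalent shortcut available since Java 7; the class and helper names are made up). Note that the value describes what the current platform writes, not which terminator a particular input file contains, so on its own it does not help when a Windows file is read on Linux.

```java
public class LineSeparatorDemo {
    // Makes control characters visible, e.g. turns CR LF into the text \r\n
    static String escape(String s) {
        return s.replace("\r", "\\r").replace("\n", "\\n");
    }

    // Joins lines with the platform's own terminator, for writing output
    static String joinLines(String... lines) {
        return String.join(System.lineSeparator(), lines);
    }

    public static void main(String[] args) {
        // The printed value depends on the platform the program runs on
        System.out.println(escape(System.getProperty("line.separator")));
        System.out.println(escape(joinLines("first", "second")));
    }
}
```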

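【Discussion】:

Mac OS X uses \n; Mac OS 9 and earlier use \r.
I think this '\r' is what causes the problem, if it has the decimal value 13.
Good to know about the Mac guys and '\r'... @marcus - you could also use the static Character method isWhitespace(char ch) to throw away all whitespace. Loop (read characters into a string builder until you hit whitespace, build a word, keep reading while it is still whitespace) until there are no characters left to read.
@A4L I wish it were that simple. Words aren't always separated by whitespace.
What are the delimiters in your use case?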
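
The loop described in the comments could be sketched as follows. Since the delimiters of the actual use case are not stated in the thread, the sketch takes the delimiter test as a parameter; Character::isWhitespace is only the default suggested above, and all names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

public class WordReader {
    // Collects characters into a word until a delimiter is hit, then starts a new word
    static List<String> readWords(Reader source, IntPredicate isDelimiter) {
        List<String> words = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        try (BufferedReader in = new BufferedReader(source)) {
            int c;
            while ((c = in.read()) != -1) {
                if (isDelimiter.test(c)) {
                    if (word.length() > 0) {      // runs of delimiters produce no empty words
                        words.add(word.toString());
                        word.setLength(0);
                    }
                } else {
                    word.append((char) c);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (word.length() > 0) {
            words.add(word.toString());           // last word before end of input
        }
        return words;
    }

    public static void main(String[] args) {
        // \r and \n both count as whitespace, so any line ending separates words
        System.out.println(readWords(new StringReader("one\r\ntwo\rthree\nfour  five"),
                Character::isWhitespace));
    }
}
```

Passing a different predicate, for example one that also accepts punctuation, covers the case where words are not separated by whitespace alone.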