Remove Duplicate Lines from Text using Java: Answers

【Question title】: Remove Duplicate Lines from Text using Java
【Posted】: 2011-05-09 01:41:15
【Question description】:

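I would like to know if anyone has logic in Java that removes duplicate lines while keeping the order of the lines.

I would prefer not to use a regex solution.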

【Question comments】:

【Solution 1】:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.util.HashSet;
import java.util.Set;

public class UniqueLineReader extends BufferedReader {
    // Lines that have already been returned
    Set<String> lines = new HashSet<String>();

    public UniqueLineReader(Reader arg0) {
        super(arg0);
    }

    // Returns the next line if it has not been seen before, "" for a duplicate,
    // and null at end of input.
    @Override
    public String readLine() throws IOException {
        String uniqueLine = super.readLine();
        if (uniqueLine == null || lines.add(uniqueLine))
            return uniqueLine;
        return "";
    }

  // for testing

    public static void main(String args[]) {
        try {
            // Open the input file
            FileInputStream fstream = new FileInputStream(
                    "test.txt");
            UniqueLineReader br = new UniqueLineReader(new InputStreamReader(fstream));
            String strLine;
            // Read file line by line
            while ((strLine = br.readLine()) != null) {
                // Print the content on the console, skipping the "" returned for duplicates
                if (!strLine.isEmpty())
                    System.out.println(strLine);
            }
            // Close the reader, which also closes the underlying stream
            br.close();
        } catch (Exception e) {// Catch exception if any
            System.err.println("Error: " + e.getMessage());
        }
    }

}

Modified version:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.util.HashSet;
import java.util.Set;

public class UniqueLineReader extends BufferedReader {
    // Lines that have already been returned
    Set<String> lines = new HashSet<String>();

    public UniqueLineReader(Reader arg0) {
        super(arg0);
    }

    @Override
    public String readLine() throws IOException {
        String uniqueLine;
        // Read until a line not seen before, or end of input (null), is reached
        while ((uniqueLine = super.readLine()) != null && !lines.add(uniqueLine)) {
            // duplicate line: skip it
        }
        return uniqueLine;
    }

    public static void main(String args[]) {
        try {
            // Open the input file
            FileInputStream fstream = new FileInputStream(
                    "/home/emil/Desktop/ff.txt");
            UniqueLineReader br = new UniqueLineReader(new InputStreamReader(fstream));
            String strLine;
            // Read file line by line
            while ((strLine = br.readLine()) != null) {
                // Print the content on the console
                System.out.println(strLine);
            }
            // Close the reader, which also closes the underlying stream
            br.close();
        } catch (Exception e) {// Catch exception if any
            System.err.println("Error: " + e.getMessage());
        }

    }
}

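【Comments】:

【Solution 2】:

If you feed the lines into a LinkedHashSet, it ignores the repeated ones, since it is a set, but preserves the order, since it is linked. If you just want to know whether you have seen a given line before, feed the lines into a plain Set as you go, and ignore those the Set already contains.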
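The LinkedHashSet idea can be sketched in a few lines. This is only an illustration under stated assumptions: the class name `LinkedHashSetDedup` and the method `dedupe` are hypothetical and do not come from the answer, and the lines are assumed to be already in memory as a list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical helper class; the name is not from the original answer.
class LinkedHashSetDedup {

    // Returns the lines with duplicates dropped, keeping first-occurrence order.
    static List<String> dedupe(List<String> lines) {
        // LinkedHashSet ignores repeated elements and iterates in insertion order.
        return new ArrayList<>(new LinkedHashSet<>(lines));
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("a", "b", "c", "b", "d", "a");
        System.out.println(dedupe(input)); // prints [a, b, c, d]
    }
}
```

Copying the set back into a list is a design choice for the sketch; iterating the LinkedHashSet directly would work just as well.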

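【Comments】:

【Solution 3】:

With the Java Stream API it is easy to remove duplicate lines from text or from a file. Stream supports aggregate operations such as sorting and distinct, and it works with Java's existing data structures and their methods. The following example uses the Stream API to remove duplicates from, or optionally sort, the contents of a file.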

package removeword;

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;
import static java.nio.file.StandardOpenOption.*;
import static java.util.stream.Collectors.joining;

public class Java8UniqueWords {

public static void main(String[] args) throws IOException {
    Path sourcePath = Paths.get("C:/Users/source.txt");
    Path changedPath = Paths.get("C:/Users/removedDouplicate_file.txt");
    try (final Stream<String> lines = Files.lines(sourcePath)
            // .map(line -> line.toLowerCase()) /* optional: apply existing String methods */
            .distinct()
            // .sorted() /* optional: sort the distinct lines */
    ) {
        final String uniqueWords = lines.collect(joining("\n"));
        System.out.println("Final Output:" + uniqueWords);
        // Files.lines reads UTF-8, so write UTF-8 too; CREATE lets the write
        // succeed when the output file does not exist yet
        Files.write(changedPath, uniqueWords.getBytes(StandardCharsets.UTF_8),
                CREATE, WRITE, TRUNCATE_EXISTING);
    }
}
}

【Comments】:

【Solution 4】:

Read the text file with a BufferedReader and store the lines in a LinkedHashSet. Then print them out.

Here is an example:

import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

public class DuplicateRemover {

    public String stripDuplicates(String aHunk) {
        StringBuilder result = new StringBuilder();
        Set<String> uniqueLines = new LinkedHashSet<String>();

        String[] chunks = aHunk.split("\n");
        uniqueLines.addAll(Arrays.asList(chunks));

        for (String chunk : uniqueLines) {
            result.append(chunk).append("\n");
        }

        return result.toString();
    }

}

Here are some unit tests to verify it (ignore my evil copy-paste ;) ):

import org.junit.Test;
import static org.junit.Assert.*;

public class DuplicateRemoverTest {

    @Test
    public void removesDuplicateLines() {
        String input = "a\nb\nc\nb\nd\n";
        String expected = "a\nb\nc\nd\n";

        DuplicateRemover remover = new DuplicateRemover();

        String actual = remover.stripDuplicates(input);
        assertEquals(expected, actual);
    }

    @Test
    public void removesDuplicateLinesUnalphabetized() {
        String input = "z\nb\nc\nb\nz\n";
        String expected = "z\nb\nc\n";

        DuplicateRemover remover = new DuplicateRemover();

        String actual = remover.stripDuplicates(input);
        assertEquals(expected, actual);
    }

}

【Comments】:

Hmm, I see. I didn't know that.

【Solution 5】:

Here is another solution. Let's use UNIX!

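# uniq only drops adjacent duplicate lines; the output goes to a separate file
# because redirecting onto the input file would truncate it before it is read
uniq MyFile.java > MyFile.dedup.java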

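Edit: Oh wait, I re-read the topic. Is this a legit solution, since I managed to be language-agnostic?

【Comments】:

I think you could use a solution similar to the one here: stackoverflow.com/questions/1088113/…. Still, if you are on a UNIX system, I would try writing hooks to scripts.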

【Solution 6】:

For better performance, it is wise to use Java 8 API features, namely Streams and method references, with a LinkedHashSet as the collection, as follows:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.stream.Collectors;

public class UniqueOperation {

public static void main(String[] args) throws IOException {

    // try-with-resources closes both the reader and the writer
    try (BufferedReader reader = Files.newBufferedReader(Paths.get("C:/Users/as00465129/Desktop/FrontEndUdemyLinks.txt"));
         PrintWriter pw = new PrintWriter("abc.txt")) {

        // LinkedHashSet drops duplicates and keeps first-occurrence order
        Set<String> unique = reader.lines().
                       collect(Collectors.toCollection(LinkedHashSet::new));
        for (String p : unique)
            pw.println(p);
    }

    System.out.println("File operation performed successfully");
}
}

【Comments】:

【Solution 7】:

Here I use a hash set to store the lines already seen:

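import java.util.HashSet;
import java.util.Scanner;
import java.util.Set;

Scanner scan = new Scanner(System.in); // input; System.in is an assumption, any line source works
Set<String> lines = new HashSet<String>();
StringBuilder strb = new StringBuilder();
while(scan.hasNextLine()){
    String line = scan.nextLine();
    // add() returns false for a line already seen, so duplicates are skipped
    if(lines.add(line)) strb.append(line).append("\n");
}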

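【Comments】:

But can we be sure the order of the output lines stays the same as the input lines when using a hash?
I also add them to a StringBuilder, to use as the output once you have gone through the whole text; you discard the set and keep strb.toString().
When you add to a set, you don't need to check whether the element already exists. Also, HashSet does not guarantee order.
@Kal I am checking so that I don't add duplicates to the StringBuilder.
If you are adding the lines to the StringBuilder on the fly, you don't need a LinkedHashSet.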
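The point of the last few comments can be illustrated with a small sketch: when each line is appended to the output as it is read, the output order comes from the input, and the HashSet is only consulted to decide whether a line is new. The class name `OnTheFlyDedup` and the method `dedupe` are hypothetical, and the input is assumed to be a String rather than a file.

```java
import java.util.HashSet;
import java.util.Scanner;
import java.util.Set;

// Hypothetical class name, used only for this illustration.
class OnTheFlyDedup {

    // Appends each line the first time it is seen. The output order follows the
    // input order because lines are appended as they are read, not taken from the set.
    static String dedupe(String text) {
        Set<String> seen = new HashSet<>();
        StringBuilder out = new StringBuilder();
        try (Scanner scan = new Scanner(text)) {
            while (scan.hasNextLine()) {
                String line = scan.nextLine();
                // add() returns false when the line is already in the set
                if (seen.add(line)) {
                    out.append(line).append('\n');
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints z, b and c, one per line
        System.out.print(dedupe("z\nb\nc\nb\nz\n"));
    }
}
```

A LinkedHashSet would only be needed if the output were produced by iterating over the set afterwards.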