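Java 8, Streams to find the duplicate elements: answers

【Question title】: Java 8, Streams to find the duplicate elements
【Posted】: 2015-02-24 22:53:33
【Question description】:

I am trying to list the duplicate elements in a list of integers, for example,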

List<Integer> numbers = Arrays.asList(new Integer[]{1,2,1,3,4,4});

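using the Streams of JDK 8. Has anybody tried this? To remove the duplicates we can use the distinct() API. But what about finding the duplicated elements? Can anybody help me out?

【Question comments】:

Possible duplicate of Collect stream with grouping, counting and filtering operations
If you don't want to collect the stream, this essentially boils down to "how can I look at more than one item at once in a stream"?
Set<Integer> items = new HashSet<>(); numbers.stream().filter(n -> !items.add(n)).collect(Collectors.toSet());

Tags: java lambda java-8 java-stream

【Solution 1】:

You can use Collections.frequency: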

numbers.stream().filter(i -> Collections.frequency(numbers, i) >1)
                .collect(Collectors.toSet()).forEach(System.out::println);

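【Comments】:

Same O(n^2) performance as in @OussamaZoghlami's answer, although probably simpler. Nonetheless, here's an upvote. Welcome to StackOverflow!
As mentioned, this is an n^2 solution where a trivial linear solution exists. I wouldn't accept this in CR.
It may be slower than @Dave's option, but it's prettier, so I'll take the performance hit.
@jwilner is your point about the n^2 solution referring to the use of Collections.frequency in a filter?
@mancocapac yes, it's quadratic because the frequency call has to visit every element in numbers, and it's being called on every element. Thus, for each element, we visit every element - n^2 and needlessly inefficient.

【Solution 2】:

Basic example. The first half builds the frequency map, the second half reduces it to a filtered list. Probably not as efficient as Dave's answer, but more versatile (for example, if you want to detect elements that occur exactly twice).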

List<Integer> duplicates = IntStream.of( 1, 2, 3, 2, 1, 2, 3, 4, 2, 2, 2 )
   .boxed()
   .collect( Collectors.groupingBy( Function.identity(), Collectors.counting() ) )
   .entrySet()
   .stream()
   .filter( p -> p.getValue() > 1 )
   .map( Map.Entry::getKey )
   .collect( Collectors.toList() );
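
As a hedged illustration of the "more versatile" point, the same frequency map can be filtered on any count condition. The sketch below keeps the values that occur exactly a given number of times. The class name FrequencyFilter and the method occurringExactly are made up for this example and do not come from the answer.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical helper (not from the answer): filters a frequency map by an exact count.
final class FrequencyFilter {

    // Returns the distinct values that occur exactly `times` times in `values`.
    static <T> Set<T> occurringExactly(List<T> values, long times) {
        // First half: build the frequency map in one pass.
        Map<T, Long> counts = values.stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
        // Second half: keep only the keys whose count matches.
        return counts.entrySet().stream()
                .filter(e -> e.getValue() == times)
                .map(Map.Entry::getKey)
                .collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        List<Integer> numbers = Arrays.asList(1, 2, 3, 2, 1, 2, 3, 4, 2, 2, 2);
        // 1 and 3 each occur twice; 2 occurs six times; 4 occurs once.
        System.out.println(occurringExactly(numbers, 2));
    }
}
```

Changing the predicate to `e.getValue() > 1` gives back the original "all duplicates" behaviour.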

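【Comments】:

This answer is the correct one in my opinion, because it is linear and doesn't violate the "stateless predicate" rule.
@jwilner not true, Collectors.counting() does the same thing as the answer above. IMHO, for a small collection, the one above is simpler and cleaner.
@kidnan1991 It's not the same. In the answer above, each item is filtered by its frequency, which is recomputed for every item. That really is not the same as building a map.

【Solution 3】:

You need a set (allItems below) to hold the entire array contents, but this is O(n):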

Integer[] numbers = new Integer[] { 1, 2, 1, 3, 4, 4 };
Set<Integer> allItems = new HashSet<>();
Set<Integer> duplicates = Arrays.stream(numbers)
        .filter(n -> !allItems.add(n)) //Set.add() returns false if the item was already in the set.
        .collect(Collectors.toSet());
System.out.println(duplicates); // [1, 4]

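【Comments】:

filter() requires a stateless predicate. Your "solution" is strikingly similar to the example of a stateful predicate given in the javadoc: docs.oracle.com/javase/8/docs/api/java/util/stream/…
@MattMcHenry: does that mean this solution can produce unexpected behavior, or is it just bad practice?
@IcedDante In a localized case like this one, where you know for sure that the stream is sequential(), it's probably safe. In the more general case where the stream might be parallel(), it's pretty much guaranteed to break in weird ways.
In addition to producing unexpected behavior in some situations, this mixes paradigms, which Bloch argues against in the third edition of Effective Java. If you find yourself writing this, just use a for loop.
Found this in the wild, used by the Hibernate Validator UniqueElements constraint.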
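
For comparison, here is a minimal sketch of the plain for-loop alternative suggested in the comments. It keeps the same Set.add() trick but has no stateful lambda, so the stream predicate rule does not come into play. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper illustrating the "just use a for loop" suggestion.
final class LoopDuplicates {

    // Returns the values that occur more than once, ordered by their first repeat.
    static <T> Set<T> findDuplicates(Iterable<T> values) {
        Set<T> seen = new HashSet<>();
        Set<T> duplicates = new LinkedHashSet<>();
        for (T value : values) {
            // Set.add() returns false when the value was already present.
            if (!seen.add(value)) {
                duplicates.add(value);
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        List<Integer> numbers = Arrays.asList(1, 2, 1, 3, 4, 4);
        System.out.println(findDuplicates(numbers)); // prints [1, 4]
    }
}
```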

【Solution 4】:

An O(n) way is as follows:

List<Integer> numbers = Arrays.asList(1, 2, 1, 3, 4, 4);
Set<Integer> duplicatedNumbersRemovedSet = new HashSet<>();
Set<Integer> duplicatedNumbersSet = numbers.stream().filter(n -> !duplicatedNumbersRemovedSet.add(n)).collect(Collectors.toSet());

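This approach doubles the space used, but that space is not wasted: we end up with one Set holding only the duplicates and another Set holding all the values with duplicates removed.

【Comments】:

【Solution 5】:

My StreamEx library, which enhances Java 8 streams, provides a special operation distinct(atLeast) that retains only the elements appearing at least the specified number of times. So your problem can be solved like this: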

List<Integer> repeatingNumbers = StreamEx.of(numbers).distinct(2).toList();

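Internally it is similar to @Dave's solution: it counts objects so that other thresholds can be supported, and it is parallel-friendly (it uses a ConcurrentHashMap for parallel streams and a HashMap for sequential ones). For large amounts of data you can get a speed-up with .parallel().distinct(2).

【Comments】:

The question is about Java Streams, not third-party libraries.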
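
For readers who want to stay within the JDK, the counting idea can be approximated with the standard collectors. This is only a sketch under that assumption and says nothing about how StreamEx is implemented; the names AtLeast and appearingAtLeast are invented here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical JDK-only helper; not part of StreamEx.
final class AtLeast {

    // Returns the distinct values that occur at least `atLeast` times.
    static <T> Set<T> appearingAtLeast(List<T> values, long atLeast) {
        // groupingByConcurrent accumulates into a ConcurrentMap,
        // which suits a parallel stream.
        ConcurrentMap<T, Long> counts = values.parallelStream()
                .collect(Collectors.groupingByConcurrent(Function.identity(), Collectors.counting()));
        return counts.entrySet().stream()
                .filter(e -> e.getValue() >= atLeast)
                .map(Map.Entry::getKey)
                .collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        List<Integer> numbers = Arrays.asList(1, 2, 1, 3, 4, 4);
        System.out.println(appearingAtLeast(numbers, 2));
    }
}
```

Whether the parallel version is actually faster depends on the input size, so it is worth measuring before relying on it.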

【Solution 6】:

You can get the duplicates like this:

List<Integer> numbers = Arrays.asList(1, 2, 1, 3, 4, 4);
Set<Integer> duplicated = numbers
  .stream()
  .filter(n -> numbers
        .stream()
        .filter(x -> x == n)
        .count() > 1)
   .collect(Collectors.toSet());

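【Comments】:

Isn't this an O(n^2) operation?
Try it with numbers = Arrays.asList(400, 400, 500, 500);
Is this similar to writing a 2-deep loop? for(..) { for(..) } Just curious how it works internally.
Although this is a nice approach, having a stream inside a stream is costly.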
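
The 400/500 input suggested in the comments is a useful test because of how boxed Integer values compare. The == operator on two Integer objects compares references, and the JDK is only guaranteed to reuse boxed objects for values from -128 to 127, so equal values outside that range are usually different objects. A minimal sketch of a value-based comparison follows; the class name is hypothetical and the approach is still O(n^2).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical variant of the nested-stream approach using value equality.
final class NestedStreamDuplicates {

    static Set<Integer> findDuplicates(List<Integer> numbers) {
        return numbers.stream()
                .filter(n -> numbers.stream()
                        // Objects.equals compares values; == on Integer compares object identity.
                        .filter(x -> Objects.equals(x, n))
                        .count() > 1)
                .collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        System.out.println(findDuplicates(Arrays.asList(400, 400, 500, 500)));
    }
}
```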

【Solution 7】:

I think a basic solution to this question looks like this:

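// ls is assumed to be a List<Integer> that may contain duplicates.
Supplier<Set<Integer>> supplier = HashSet::new;
Set<Integer> distinct = ls.stream().collect(Collectors.toCollection(supplier)); // ls without duplicates

// Values that occur more than once, each listed once.
List<Integer> lst = ls.stream().filter(e -> Collections.frequency(ls, e) > 1).distinct().collect(Collectors.toList());

Filtering this way is not recommended, but I used it here to make the idea easier to follow; future versions may well offer some kind of custom filtering.

【Comments】:

【Solution 8】:

A multiset is a structure that maintains the number of occurrences of each element. Using the Guava implementation: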

Set<Integer> duplicated =
        ImmutableMultiset.copyOf(numbers).entrySet().stream()
                .filter(entry -> entry.getCount() > 1)
                .map(Multiset.Entry::getElement)
                .collect(Collectors.toSet());

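【Comments】:

【Solution 9】:

If you only need to detect the presence of duplicates (rather than list them, which is what the OP wanted), just put the elements into both a List and a Set, then compare the sizes: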

    List<Integer> list = ...;
    Set<Integer> set = new HashSet<>(list);
    if (list.size() != set.size()) {
      // duplicates detected
    }

I like this approach because there are fewer places for it to go wrong.
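
Since the question asks about streams, the same size comparison can also be written with distinct(). This is a small sketch of that idea rather than part of the answer, and hasDuplicates is a made-up name.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: stream-based version of the size comparison.
final class DuplicateCheck {

    // True when at least one value occurs more than once.
    static boolean hasDuplicates(List<?> list) {
        // distinct() drops repeats, so a smaller count means something was repeated.
        return list.stream().distinct().count() != list.size();
    }

    public static void main(String[] args) {
        System.out.println(hasDuplicates(Arrays.asList(1, 2, 1, 3, 4, 4))); // prints true
        System.out.println(hasDuplicates(Arrays.asList(1, 2, 3)));          // prints false
    }
}
```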

【Comments】:

【Solution 10】:

Creating an additional map or stream is time-consuming…

Set<Integer> duplicates = numbers.stream().collect( Collectors.collectingAndThen(
  Collectors.groupingBy( Function.identity(), Collectors.counting() ),
  map -> {
    map.values().removeIf( cnt -> cnt < 2 );
    return( map.keySet() );
  } ) );  // [1, 4]

...and for the question of which this one is claimed to be a [duplicate]:

public static int[] getDuplicatesStreamsToArray( int[] input ) {
  return( IntStream.of( input ).boxed().collect( Collectors.collectingAndThen(
      Collectors.groupingBy( Function.identity(), Collectors.counting() ),
      map -> {
        map.values().removeIf( cnt -> cnt < 2 );
        return( map.keySet() );
      } ) ).stream().mapToInt( i -> i ).toArray() );
}

【Comments】:

【Solution 11】:

What about checking the indexes?

        numbers.stream()
            .filter(integer -> numbers.indexOf(integer) != numbers.lastIndexOf(integer))
            .collect(Collectors.toSet())
            .forEach(System.out::println);

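【Comments】:

Should work fine, but it also has O(n^2) performance, like some of the other solutions here.

【Solution 12】:

I think I have a good solution for this kind of problem: List => List, grouped by Something.a and Something.b. Here is the full definition: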

public class Test {

    public static void test() {

        class A {
            private int a;
            private int b;
            private float c;
            private float d;

            public A(int a, int b, float c, float d) {
                this.a = a;
                this.b = b;
                this.c = c;
                this.d = d;
            }
        }


        List<A> list1 = new ArrayList<A>();

        list1.addAll(Arrays.asList(new A(1, 2, 3, 4),
                new A(2, 3, 4, 5),
                new A(1, 2, 3, 4),
                new A(2, 3, 4, 5),
                new A(1, 2, 3, 4)));

        Map<Integer, A> map = list1.stream()
                .collect(HashMap::new, (m, v) -> m.put(
                        Objects.hash(v.a, v.b, v.c, v.d), v),
                        HashMap::putAll);

        list1.clear();
        list1.addAll(map.values());

        System.out.println(list1);
    }

}

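class A and list1 are just the incoming data; the magic is in Objects.hash(...) :)

【Comments】:

Warning: if Objects.hash produces the same value for (v.a_1, v.b_1, v.c_1, v.d_1) and (v.a_2, v.b_2, v.c_2, v.d_2), the two objects are treated as equal and one is removed as a duplicate, without ever checking that the a's, b's, c's and d's actually match. This may be an acceptable risk, or you may want to use a function other than Objects.hash that is guaranteed to produce a unique result across your domain.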
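
To make the warning concrete: Objects.hash(1, 0) and Objects.hash(0, 31) both evaluate to 992, so an int key built that way would merge two different objects. One possible way around it, sketched below under the assumption that the grouping fields are known, is to use the field values themselves as a List key, which HashMap compares with equals(). The Point class is a hypothetical two-field stand-in for the answer's class A.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical sketch: de-duplicate by a composite key instead of a hash value.
final class CompositeKeyDedup {

    // Stand-in for the answer's class A, reduced to two fields.
    static final class Point {
        final int a;
        final int b;
        Point(int a, int b) { this.a = a; this.b = b; }
        @Override public String toString() { return "(" + a + ", " + b + ")"; }
    }

    // Keeps the first element seen for each distinct (a, b) pair.
    static List<Point> dedup(List<Point> points) {
        Map<List<Integer>, Point> byKey = new LinkedHashMap<>();
        for (Point p : points) {
            // A List key is compared with equals(), so colliding hash codes stay separate.
            byKey.putIfAbsent(Arrays.asList(p.a, p.b), p);
        }
        return new ArrayList<>(byKey.values());
    }

    public static void main(String[] args) {
        System.out.println(Objects.hash(1, 0) == Objects.hash(0, 31)); // prints true
        List<Point> points = Arrays.asList(new Point(1, 0), new Point(0, 31), new Point(1, 0));
        System.out.println(dedup(points)); // prints [(1, 0), (0, 31)]
    }
}
```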

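【Solution 13】:

Do you have to use the Java 8 idioms (streams)? Perhaps a simpler solution is to move the complexity into a map-like data structure that holds the numbers as keys (without repetition) and the number of times each one occurs as the value. You can then iterate over that map and act only on the numbers that occur more than once.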

import java.lang.Math;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.HashMap;
import java.util.Iterator;

public class RemoveDuplicates
{
  public static void main(String[] args)
  {
   List<Integer> numbers = Arrays.asList(new Integer[]{1,2,1,3,4,4});
   Map<Integer,Integer> countByNumber = new HashMap<Integer,Integer>();
   for(Integer n:numbers)
   {
     Integer count = countByNumber.get(n);
     if (count != null) {
       countByNumber.put(n,count + 1);
     } else {
       countByNumber.put(n,1);
     }
   }
   System.out.println(countByNumber);
   // Print only the numbers that occur more than once.
   for (Map.Entry<Integer, Integer> pair : countByNumber.entrySet()) {
     if (pair.getValue() > 1) {
       System.out.println(pair.getKey() + " = " + pair.getValue());
     }
   }
  }
}
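
The get/put counting loop can be shortened with Map.merge, which has been available since Java 8. The following is a minimal sketch of that variant, not the answer's own code; the class and method names are made up.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical variant of the counting loop using Map.merge.
final class CountWithMerge {

    // Counts how many times each number occurs, in first-seen order.
    static Map<Integer, Integer> countOccurrences(List<Integer> numbers) {
        Map<Integer, Integer> countByNumber = new LinkedHashMap<>();
        for (Integer n : numbers) {
            // Stores 1 for a new key, otherwise adds 1 to the existing count.
            countByNumber.merge(n, 1, Integer::sum);
        }
        return countByNumber;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> counts = countOccurrences(Arrays.asList(1, 2, 1, 3, 4, 4));
        // Report only the numbers that occur more than once.
        counts.forEach((number, count) -> {
            if (count > 1) {
                System.out.println(number + " = " + count);
            }
        });
    }
}
```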

【Comments】:

【Solution 14】:

Try this solution:

import java.util.HashMap;
import java.util.Map;
import java.util.Map.Entry;

public class Anagramm {

public static boolean isAnagramLetters(String word, String anagramm) {
    if (anagramm.isEmpty()) {
        return false;
    }

    Map<Character, Integer> mapExistString = CharCountMap(word);
    Map<Character, Integer> mapCheckString = CharCountMap(anagramm);
    return enoughLetters(mapExistString, mapCheckString);
}

private static Map<Character, Integer> CharCountMap(String chars) {
    HashMap<Character, Integer> charCountMap = new HashMap<Character, Integer>();
    for (char c : chars.toCharArray()) {
        if (charCountMap.containsKey(c)) {
            charCountMap.put(c, charCountMap.get(c) + 1);
        } else {
            charCountMap.put(c, 1);
        }
    }
    return charCountMap;
}

static boolean enoughLetters(Map<Character, Integer> mapExistString, Map<Character,Integer> mapCheckString) {
    for( Entry<Character, Integer> e : mapCheckString.entrySet() ) {
        Character letter = e.getKey();
        Integer available = mapExistString.get(letter);
        if (available == null || e.getValue() > available) return false;
    }
    return true;
}

}

【Comments】:

【Solution 15】:

If you are looking for performance, Set.add() is faster:

public class FindDuplicatedBySet {

public static void main(String[] args) {
    List<Integer> list = Arrays.asList(5, 3, 4, 1, 3, 7, 2,3,1, 9, 9, 4,1);
    Set<Integer> result = findDuplicatedBySetAdd(list);
    result.forEach(System.out::println);
  }

public static <T> Set<T> findDuplicatedBySetAdd(List<T> list) {
    Set<T> items = new HashSet<>();
    return list.stream()
            .filter(n -> !items.add(n))
            .collect(Collectors.toSet());
  }
}

【Comments】: