How can I improve my 2 sum algorithm for a range of numbers using a hash table? Answers

【Question title】: How can I improve my 2 sum algorithm for a range of numbers using a hash table?
【Posted】: 2015-05-11 04:23:36
【Question description】:

I developed an algorithm that solves the 2 sum problem using a hash table, but its performance is terrible for large inputs.

My goal is to find all distinct numbers x, y where -10000 <= x + y <= 10000.

Here is my code:

import com.google.common.base.Stopwatch;

import java.util.Scanner;
import java.util.HashMap;
import java.util.ArrayList;

import static com.google.common.collect.Lists.newArrayList;

public class TwoSum {

    private HashMap<Long, Long> map;
    private ArrayList<Long> Ts;
    private long result = 0L;


    public TwoSum() {
        Ts = newArrayList();
        for(long i = -10000; i < 10001; i++){
            Ts.add(i);
        }

        Scanner scan = new Scanner(System.in);
        map = new HashMap<>();
        while (scan.hasNextLong()) {
            long a = scan.nextLong();
            if (!map.containsKey(a)) {
                map.put(a, a);
            }
        }
    }

    private long count(){
        //long c = 0L;
        for (Long T : Ts) {
            long t = T;
            for (Long x : map.values()) {
                long y = t - x;
                if (map.containsValue(y) && y != x) {
                    result++;
                }
                //System.out.println(c++);
            }
        }
        return result / 2;
    }

    public static void main(String [] args) {
        TwoSum s = new TwoSum();
        Stopwatch stopwatch = Stopwatch.createStarted();
        System.out.println(s.count());
        stopwatch.stop();
        System.out.println("time:" + stopwatch);

    }
}

Sample input:

-7590801 -3823598 -5316263 -2616332 -7575597 -621530 -7469475 1084712 -7780489 -5425286 3971489 -57444 1371995 -5401074 2383653 1752912 7455615 3060706 613097 -1073084 7759843 7267574 -7483155 -2935176 -5128057 -7881398 -637647 -2607636 -3214997 -8253218 2980789 168608 3759759 -5639246 555129 -4489068 44019 2275782 -3506307 -8031288 -213609 -4524262 -1502015 -1040324 3258235 32686 1047621 -3376656 7601567 -7051390 6633993 -6245148 4994051 -4259178 856589 6047000 1785511 4449514 -1177519 4972172 8274315 7725694 -4923179 5076288 -876369 -7663790 1613721 4472116 -4587501 3194726 6195357 -3364248 -113737 6260410 1974241 3174620 3510171 7289166 4532581 -6650736 -3782721 7007010 6007081 -7661180 -1372125 -5967818 516909 -7625800 -2700089 -7676790 -2991247 2283308 1614251 -4619234 2741749 567264 4190927 5307122 -5810503 -6665772

Output: 6

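【Question discussion】:

Tags: java performance algorithm hashmap

【Solution 1】:

This problem is an assignment from Algorithms: Design and Analysis, an online course offered by Stanford University and taught by Professor Tim Roughgarden. I happen to be taking the same course.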

The usual solution, looking up t - i in a hash table, is O(n) for a single t, but doing that 20001 * 1000000 times results in roughly 20 billion lookups!
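To make the cost concrete, here is a minimal Java sketch of that per-target hash lookup. The class and method names (`HashLookupSketch`, `hasPair`, `countTargets`) are hypothetical and not from the question or the course. It assumes x and y must be distinct values, and it counts each reachable target once rather than counting pairs.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class HashLookupSketch {

    // True if two distinct values x, y in xs satisfy x + y == t.
    // One pass over xs with expected O(1) lookups, so O(n) per target.
    static boolean hasPair(Set<Long> xs, long t) {
        for (long x : xs) {
            long y = t - x;
            if (y != x && xs.contains(y)) {
                return true;
            }
        }
        return false;
    }

    // Counts the targets t in [lo, hi] that some pair sums to.
    // Total work is (hi - lo + 1) * n lookups in the worst case.
    static int countTargets(Set<Long> xs, long lo, long hi) {
        int count = 0;
        for (long t = lo; t <= hi; t++) {
            if (hasPair(xs, t)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Set<Long> xs = new HashSet<>(Arrays.asList(1L, 2L, 4L, 5L));
        // Reachable sums are 3, 5, 6, 7 and 9.
        System.out.println(countTargets(xs, -10, 10)); // prints 5
    }
}
```

The sketch uses `Set.contains`, which is a hashed lookup. `HashMap.containsValue`, as used in the question's code, has to scan the map's entries, so each call there is linear rather than constant time.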

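A better solution is to build a sorted set xs from the input file and, ∀ i ∈ xs, find all numbers in xs that fall in the range [-10000 - i, 10000 - i]. Since a sorted set by definition has no duplicates, we don't need to worry about any number in the range being equal to i. There is a catch, though, that the problem statement does not make clear. It is not enough to find unique (x, y) ∀ x, y ∈ xs; their sums must be unique too. Clearly, two distinct pairs can produce the same sum (for example 2 + 4 = 1 + 5 = 6). Therefore we need to keep track of the sums as well.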

Finally, we can stop once we are past 5000, because there cannot be any more numbers to the right that add up to less than 10000.

Here is a Scala solution:

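import scala.collection.immutable.SortedSet

// Assumed definition: TenThou is not shown in the original answer.
val TenThou = 10000L

def twoSumCount(xs: SortedSet[Long]): Int = {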
  xs
    .foldLeft(collection.mutable.Set.empty[Long]) { (sums, i) =>
      if (i < TenThou / 2) {
        xs
          // using from makes it slower
          .range(-TenThou - i, TenThou - i + 1)
          .map(_ + i)
          // using diff makes it slower
          .withFilter(y => !sums.contains(y))
          // adding individual elements is faster than using
          // diff/filter/filterNot and adding all using ++=
          .foreach(sums.add)
      }
      sums
    }
    .size
}
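For readers following the question's Java, the same idea can be sketched with `TreeSet.subSet`. This is a rough equivalent rather than a translation of the answer: the names `RangeTwoSumSketch` and `LIMIT` are hypothetical, and the `y != i` guard is an extra that the Scala version does not have, added on the assumption that x and y must be two different elements.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

final class RangeTwoSumSketch {

    // Hypothetical constant playing the role of TenThou in the Scala code.
    static final long LIMIT = 10_000L;

    // Counts the distinct sums t in [-LIMIT, LIMIT] with t = i + y, i != y.
    static int twoSumCount(TreeSet<Long> xs) {
        Set<Long> sums = new HashSet<>();
        for (long i : xs) {
            // Mirrors the `i < TenThou / 2` check: any partner of a larger i
            // is smaller than LIMIT / 2 and was already handled.
            if (i >= LIMIT / 2) {
                break;
            }
            // All y with -LIMIT - i <= y <= LIMIT - i, found by tree search
            // instead of scanning every target value.
            for (long y : xs.subSet(-LIMIT - i, true, LIMIT - i, true)) {
                if (y != i) { // extra guard, not in the Scala version
                    sums.add(i + y);
                }
            }
        }
        return sums.size();
    }

    public static void main(String[] args) {
        TreeSet<Long> xs = new TreeSet<>(Arrays.asList(1L, 2L, 4L, 5L));
        // Six pairs but only five distinct sums: 3, 5, 6, 7, 9.
        System.out.println(twoSumCount(xs)); // prints 5
    }
}
```

Each `subSet` call narrows the search to the few elements that can pair with `i`, so the work depends on how many values actually fall in the window rather than on the 20001 possible targets.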

Benchmark:

cores: 8
hostname: ***
name: OpenJDK 64-Bit Server VM
osArch: x86_64
osName: Mac OS X
vendor: Azul Systems, Inc.
version: 11.0.1+13-LTS
Parameters(file -> 2sum): 116.069441 ms

【Discussion】:

【Solution 2】:

The gist of your algorithm can be rewritten in pseudocode as:

for all integers t from -10k to 10k,
    for all map keys x,
        if t - x in map, and t is not 2*x,
            count ++
return count / 2

You can easily improve on this a little:

for all integers t from -10k to 10k,
    for the lower half of keys x in ascending order such that t is not 2*x
        if t - x in map,
            count ++

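This makes it run twice as fast (you no longer count everything twice). However, you need to sort the input to make sure the map keys are in ascending order. You can add them to a TreeSet and then move them into a LinkedHashSet. If you don't care about the values and all the information is in the keys, using Sets is better than using Maps.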
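A minimal Java sketch of this lower-half variant follows. The names `LowerHalfSketch` and `countPairs` are hypothetical, and it reads "lower half" as the keys with x < t - x, which is one interpretation of the pseudocode above.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.TreeSet;

final class LowerHalfSketch {

    // Counts pairs x < y from the input with lo <= x + y <= hi.
    static long countPairs(Collection<Long> input, long lo, long hi) {
        // TreeSet sorts and de-duplicates; LinkedHashSet keeps that order
        // for iteration while offering hashed contains().
        Set<Long> keys = new LinkedHashSet<>(new TreeSet<>(input));
        long count = 0;
        for (long t = lo; t <= hi; t++) {
            for (long x : keys) {
                // Keys ascend, so once x >= t - x every later key is in the
                // upper half too. This also skips the t == 2 * x case.
                if (x >= t - x) {
                    break;
                }
                if (keys.contains(t - x)) {
                    count++; // each pair is seen once, so no division by 2
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Pairs: (1,2) (1,4) (1,5) (2,4) (2,5) (4,5)
        System.out.println(countPairs(Arrays.asList(1L, 2L, 4L, 5L), -10, 10)); // prints 6
    }
}
```

The early `break` is what the ascending order buys: without sorted keys the inner loop could not stop at the midpoint and would have to test every key.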

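The running time is still O(inputs * range), because you have two nested loops, one with range iterations and the other with half of input. This is a fundamental flaw of the algorithm that no amount of optimization can fix.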

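【Discussion】:

Downvoter, care to explain why? I may be able to fix it.
I didn't downvote, but I think t is not 2*x comes from the requirement for distinct numbers, although the problem description does not make clear what that means. You assume x + y = t ∀ x, y ⊂ map and x ≠ y, but I can think of other interpretations. If x = 30000 and y = -20000, then x = -20000 and y = 30000 is not counted. Or perhaps x = 30000 and y = -15000 is not counted because we have already seen x once.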
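I agree that the problem statement is somewhat ambiguous, but only the OP can clarify what it means. Once that is confirmed, the issues you raise are easy to fix in the pseudocode: for example, multiply by 2 to count both x,y and y,x (where x[…] distinct numbers would forbid considering t=2*x.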