Find a missing number using limited memory

【Question title】: find a missing number using limited memory
【Posted】: 2016-03-09 21:11:24
【Question description】:

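The problem is: given an input file containing 4 billion unique integers, provide an algorithm that generates an integer not contained in the file, assuming only 10 MB of memory is available.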

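I searched for solutions and have posted the code below. One approach stores the integers in bit-vector blocks (each block represents a specific range of integers within the 4 billion range, and each bit in a block represents one integer) and keeps a separate counter per block that counts how many integers fall into that block. If a block's count is smaller than its capacity, the block's bit vector is scanned to find the missing integer.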

My question about this solution is why "the nearer to the middle that we pick, the less memory will be used at any given time". Here is more context:

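The array in the first pass can fit in 10 megabytes, or roughly 2^23 bytes, of memory. Since each element in the array is an int, and an int is 4 bytes, we can hold an array of at most about 2^21 elements. So there can be at most about 2^21 blocks, which bounds the number of integers covered by each block from below. The bit vector for one block must also fit in memory, and 2^23 bytes is 2^26 bits, which bounds it from above.

Therefore, we can conclude the following: 2^10 ≤ (number of integers covered by each block) ≤ 2^26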
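To see why a block range near the middle of that interval uses less memory, one can tabulate the size of the two structures for a few candidate ranges. The following is a minimal sketch, not the book's code: the class name `RangeSizeMemory`, the parameter name `rangeSize`, and the restriction to the 2^31 non-negative `int` values are all assumptions made for illustration.

```java
public class RangeSizeMemory {
    // Assumption: only the 2^31 non-negative int values are considered.
    static final long VALUE_COUNT = 1L << 31;

    /** Bytes for the first-pass array: one 4-byte int counter per block. */
    static long counterBytes(long rangeSize) {
        return (VALUE_COUNT / rangeSize) * 4;
    }

    /** Bytes for the second-pass bit vector: one bit per value in a block. */
    static long bitVectorBytes(long rangeSize) {
        return rangeSize / 8;
    }

    /** Sum of both structures, as if both were allocated at the same time. */
    static long totalBytes(long rangeSize) {
        return counterBytes(rangeSize) + bitVectorBytes(rangeSize);
    }

    public static void main(String[] args) {
        // Walk from the lower bound 2^10 to the upper bound 2^26.
        for (int exp = 10; exp <= 26; exp += 4) {
            long rangeSize = 1L << exp;
            System.out.printf("rangeSize=2^%d counters=%d B bitVector=%d B total=%d B%n",
                    exp, counterBytes(rangeSize), bitVectorBytes(rangeSize),
                    totalBytes(rangeSize));
        }
    }
}
```

Under these assumptions, both ends of the interval cost about 2^23 bytes (the counters dominate at 2^10, the bit vector dominates at 2^26), while the midpoint 2^18 needs only 2^15 bytes for each structure, 2^16 bytes in total.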

import java.io.FileNotFoundException;
import java.io.FileReader;
import java.util.Scanner;

public class QuestionB {
    public static int bitsize = 1048576; // 2^20 bits (2^17 bytes)
    public static int blockNum = 4096; // 2^12
    public static byte[] bitfield = new byte[bitsize/8];
    public static int[] blocks = new int[blockNum];

    public static void findOpenNumber() throws FileNotFoundException {
        int starting = -1;
        Scanner in = new Scanner (new FileReader ("Chapter 10/Question10_3/input_file_q10_3.txt"));
        while (in.hasNextInt()) {
            int n = in.nextInt();
            blocks[n / (bitfield.length * 8)]++;
        }

        for (int i = 0; i < blocks.length; i++) {
            if (blocks[i] < bitfield.length * 8){
                /* if value < 2^20, then at least 1 number is missing in
                 * that section. */
                starting = i * bitfield.length * 8;
                break;
            }
        }

        in = new Scanner(new FileReader("Chapter 10/Question10_3/input_file_q10_3.txt"));
        while (in.hasNextInt()) {
            int n = in.nextInt();
            /* If the number is inside the block that’s missing 
             * numbers, we record it */
            if (n >= starting && n < starting + bitfield.length * 8) {
                bitfield [(n-starting) / 8] |= 1 << ((n - starting) % 8);
            }
        }

        for (int i = 0 ; i < bitfield.length; i++) {
            for (int j = 0; j < 8; j++) {
                /* Retrieves the individual bits of each byte. When 0 bit 
                 * is found, finds the corresponding value. */
                if ((bitfield[i] & (1 << j)) == 0) {
                    System.out.println(i * 8 + j + starting);
                    return;
                }
            }
        }       
    }

    public static void main(String[] args) throws FileNotFoundException {
        findOpenNumber();
    }

}

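【Question comments】:

Is your question why the middle between 2¹⁰ and 2²⁶ uses less memory?
Does the file have 4 billion unique integers? Or are duplicates possible?
You forgot to state your question explicitly. After reading your post, all I can think is "this is interesting, but what are you asking?"
Are the numbers unique? Are they consecutive apart from the one that is missing?
Since int ranges from about -2 billion to +2 billion and you say the set is distinct, negative numbers must be present, right? So that would produce a negative array index in the first while loop.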
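The negative-index concern raised in the last comment could be addressed by shifting every `int` into the unsigned range before computing its block. The following is a minimal sketch under that assumption; the class `SignedBlockIndex` and its method names are hypothetical and are not part of the posted code.

```java
public class SignedBlockIndex {
    // 2^20 values per block, matching the bit field size in the posted code.
    static final int BLOCK_BITS = 20;
    // 4096 blocks of 2^20 values cover all 2^32 int values.
    static final int BLOCK_COUNT = 1 << (32 - BLOCK_BITS);

    /** Maps any int, negative or not, to a block index in [0, BLOCK_COUNT). */
    static int blockIndex(int n) {
        long shifted = (long) n - Integer.MIN_VALUE; // 0 .. 2^32 - 1
        return (int) (shifted >>> BLOCK_BITS);
    }

    /** Offset of n inside its block, in [0, 2^20). */
    static int offsetInBlock(int n) {
        long shifted = (long) n - Integer.MIN_VALUE;
        return (int) (shifted & ((1L << BLOCK_BITS) - 1));
    }

    /** Inverse mapping: recovers the int from its block index and offset. */
    static int valueOf(int block, int offset) {
        return (int) ((((long) block) << BLOCK_BITS) + offset + Integer.MIN_VALUE);
    }

    public static void main(String[] args) {
        int n = -12345;
        int block = blockIndex(n);
        int offset = offsetInBlock(n);
        System.out.println(block + " " + offset + " " + valueOf(block, offset));
    }
}
```

With this mapping, `blocks[blockIndex(n)]++` never sees a negative index, and a zero bit found at `offset` in block `block` translates back to a candidate answer through `valueOf(block, offset)`.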

Tags: java algorithm

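【Solution 1】:

If you form M blocks of size 2^32/M, the total memory needed is M + 2^27/M words (32-bit): M counters plus a bit vector of 2^32/M bits. This function reaches its minimum at M = √(2^27) = 2^13.5, which is halfway between 1 and 2^27 blocks on a logarithmic scale. The minimum is 2^14.5 words, about 92 KB.

This is very clear on a log-log plot.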
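The minimum quoted in this answer can be checked with a short derivation. In the sketch below, f(M) denotes the total memory in 32-bit words; the name f and the substitution x = log2 M are introduced here only for illustration.

```latex
\begin{aligned}
f(M) &= M + \frac{2^{27}}{M}, \\
f'(M) &= 1 - \frac{2^{27}}{M^{2}} = 0
  \;\Longrightarrow\; M = \sqrt{2^{27}} = 2^{13.5} \approx 11585, \\
f\!\left(2^{13.5}\right) &= 2^{13.5} + 2^{13.5} = 2^{14.5}
  \approx 23170 \text{ words} \approx 92\,682 \text{ bytes}, \\
f\!\left(2^{x}\right) &= 2^{x} + 2^{27 - x}
  \quad\text{(unchanged under } x \mapsto 27 - x\text{)}.
\end{aligned}
```

The last line is the symmetry that a log-log plot makes visible: the curve mirrors itself around x = 13.5, the midpoint between 2^0 and 2^27 blocks.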

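【Comments】:

Thanks Yves, how do you get M + 2^27/M words (32-bit)?
@LinMa: one 32-bit counter per group, plus 2^32/M bits.
@LinMa: yes, that is what I wrote.
@LinMa: for example.
@LinMa: sorry, I am not spending any more time on this.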

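【Solution 2】:

I like this question. I'll think about it some more, but if disk space and time are not a concern, you can split the numbers into chunks of 100k and then sort them within each file. Any chunk that does not have 100k entries will have a gap. It is not elegant at all, but it gets the ball rolling.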
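The chunk-and-sort idea could look roughly like the sketch below. It is an in-memory stand-in rather than a disk-based implementation: each list plays the role of one chunk file, and it assumes the chunks are formed by value range (the answer does not say so explicitly) and that the inputs are unique and non-negative. The class `ChunkAndSort` and its parameters are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ChunkAndSort {
    /**
     * Returns a value in [0, chunkCount * chunkSize) that is absent from the input,
     * or -1 if every value in that range is present.
     * Assumes the inputs are unique, non-negative, and below chunkCount * chunkSize.
     */
    static long findMissing(long[] numbers, int chunkSize, int chunkCount) {
        // Each list stands in for one on-disk chunk file.
        List<List<Long>> chunks = new ArrayList<>();
        for (int i = 0; i < chunkCount; i++) {
            chunks.add(new ArrayList<>());
        }
        // Route every number to the chunk that owns its value range.
        for (long n : numbers) {
            chunks.get((int) (n / chunkSize)).add(n);
        }

        for (int c = 0; c < chunkCount; c++) {
            List<Long> chunk = chunks.get(c);
            if (chunk.size() == chunkSize) {
                continue; // a full chunk cannot contain a gap
            }
            Collections.sort(chunk);
            long expected = (long) c * chunkSize;
            for (long v : chunk) {
                if (v != expected) {
                    return expected; // first gap inside the sorted chunk
                }
                expected++;
            }
            return expected; // gap at the end of the chunk, or the chunk is empty
        }
        return -1;
    }

    public static void main(String[] args) {
        long[] input = {0, 1, 2, 3, 5, 6, 7};
        System.out.println(findMissing(input, 4, 2));
    }
}
```

Only one under-full chunk has to be sorted, so the memory needed at any one time is bounded by the chunk size rather than by the size of the whole input.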

【Comments】:

Thanks ajacian81, I appreciate your insight on memory utilization as well. :))