Estimating max payload size on a compressed list of integers: answers

【Question title】: Estimating max payload size on a compressed list of integers
【Posted】: 2020-01-15 20:11:12
【Question body】:

I have one million rows in an application. It makes a request to a server such as the following:

/search?q=hello

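The search returns a sorted list of integers representing the rows that match in the input data set (which the user has in their browser). How would I go about estimating the maximum size of the payload it would return? For example, to start we have: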

# ~7 MB if we stored "all results" uncompressed
6888887

# ~ 3.5MB if we stored "all results" relative to 0 or ALL matches (cuts it down by two)
3444443

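Then we would want to use some sort of compression (Elias-Fano?) on these integers. For 1M sorted integers, what would the "worst-case" size be, and how would it be calculated?

The application has one million rows of data, so say R1 --> R1000000, or, if zero-indexed, range(int(1e6)). The server will respond with something like [1,2,3], indicating that (only) rows 1, 2, and 3 matched.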

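【Question comments】:

Since you didn't respond, I looked in the edit history and it answered my question. The numbers are in the range [0-999999] (although you seem to count from 1, but maybe that's a mistake? Not much of a difference anyway).
You seem to have chosen an ASCII representation of decimal-formatted numbers with a comma separator. As your measurement shows, that's a rather inefficient approach, since on average you need 6.89 bytes to represent a number. Even a simple binary array of 32-bit integers would take only 4 bytes per entry. But you only need 20 bits (2.5 bytes) to represent numbers below 1000000.
However, the relevant information we need to store is just whether each number in this fixed range is present or not. That means one bit per entry, hence 1 million bits (125000 bytes). That's probably the worst-case size.
You could do some entropy coding on the resulting bit stream to reduce the size further.
For cases with 50000 or fewer matches, you could switch to encoding just the matches (at a fixed width of 20 bits per integer, that becomes smaller than the bitmap).
To compress further, sort in reverse order and use range reduction to encode only the significant bits.
@DanMašek thanks for these comments. Yes, for a very rough start I just did range(1000000) and json.dumps() in Python. I was using one-indexed row numbers (like Excel) rather than normal array indexes. Would you want to write up an answer on how you'd suggest encoding and compressing this? I'll add a bounty to this question too!
@David542 Sure, but it will take some time to write an answer that is good enough by my standards (and any proof-of-concept code would be C++). I've thought about it some more and experimented a bit, and a bitmap with an entropy coder that can handle fractional bits (range, arithmetic, FSE: something that can get close to the entropy) is hard to beat for any number of matches. Huffman with grouped symbols can get close too. I'll try to cover some other options. However, running the bitmap through some existing lightweight compressor (like zlib/snappy/lz4...
@DanMašek perfect, yeah, C or C++ would be great!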

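Tags: search optimization compression huffman-code

【Solution 1】:

There are 2^(10^6) different sorted (duplicate-free) lists of integers < 10^6. Mapping each such list (e.g. [0, 4, ...]) to the corresponding bit array (say 10001....) yields 10^6 bits, i.e. 125 kB of information. Since each bit array corresponds to exactly one possible sorted list and vice versa, this is the most compact representation (in the sense of having the smallest maximum size).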
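
The mapping between a sorted list and its bit array can be sketched as follows. This is a minimal illustration, not code from the answer; the names `kNumRows`, `to_bit_array` and `from_bit_array` are made up here, and rows are assumed to be numbered 0 to 999999.

```cpp
#include <cstdint>
#include <vector>

// Assumed universe size: rows are numbered 0..999999, as in the question.
constexpr std::uint32_t kNumRows = 1000000;

// Pack a list of matching row numbers into a bit array, one bit per row.
// The output size depends only on kNumRows, never on the number of matches.
std::vector<std::uint8_t> to_bit_array(const std::vector<std::uint32_t>& rows)
{
    std::vector<std::uint8_t> bits((kNumRows + 7) / 8, 0);  // 125000 bytes
    for (std::uint32_t row : rows) {
        bits[row / 8] |= static_cast<std::uint8_t>(1u << (row % 8));
    }
    return bits;
}

// Recover the sorted list of row numbers by scanning the bit array in order.
std::vector<std::uint32_t> from_bit_array(const std::vector<std::uint8_t>& bits)
{
    std::vector<std::uint32_t> rows;
    for (std::uint32_t row = 0; row < kNumRows; ++row) {
        if (bits[row / 8] & (1u << (row % 8))) {
            rows.push_back(row);
        }
    }
    return rows;
}
```

Because every input produces exactly 125000 bytes, the worst case and the best case coincide for this representation.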

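Of course, if some results are more probable than others, there may be more efficient representations (in the sense of having a smaller average size). For example, if most result sets are small, a simple run-length encoding may generally yield smaller encodings.

Inevitably, in that case the maximum size of the encoding (the maximum payload size you asked about) will be more than 125 kB.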
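
To make that trade-off concrete, here is a hedged sketch of one simple scheme in the run-length family: store the gap to the previous match as a LEB128-style varint. The byte format and the names `encode_gaps` and `decode_gaps` are assumptions of this example, not something the answer specifies.

```cpp
#include <cstdint>
#include <vector>

// Encode a sorted, duplicate-free list as gaps between neighbours. Each gap is
// written as a varint: 7 payload bits per byte, least significant group first,
// with the high bit set on every byte except the last one of the gap.
std::vector<std::uint8_t> encode_gaps(const std::vector<std::uint32_t>& rows)
{
    std::vector<std::uint8_t> out;
    std::uint32_t previous = 0;
    for (std::uint32_t row : rows) {
        std::uint32_t gap = row - previous;  // the first gap is the row itself
        previous = row;
        while (gap >= 0x80) {
            out.push_back(static_cast<std::uint8_t>((gap & 0x7F) | 0x80));
            gap >>= 7;
        }
        out.push_back(static_cast<std::uint8_t>(gap));
    }
    return out;
}

// Inverse of encode_gaps: rebuild each gap and accumulate the row numbers.
std::vector<std::uint32_t> decode_gaps(const std::vector<std::uint8_t>& bytes)
{
    std::vector<std::uint32_t> rows;
    std::uint32_t previous = 0;
    std::uint32_t gap = 0;
    int shift = 0;
    for (std::uint8_t byte : bytes) {
        gap |= static_cast<std::uint32_t>(byte & 0x7F) << shift;
        if (byte & 0x80) {
            shift += 7;           // more bytes of this gap follow
        } else {
            previous += gap;      // gap complete: emit the row number
            rows.push_back(previous);
            gap = 0;
            shift = 0;
        }
    }
    return rows;
}
```

A handful of matches costs only a few bytes, but every match costs at least one byte, so a result set containing all 1000000 rows takes 1000000 bytes, eight times the size of the bit array.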

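Compressing the aforementioned 125 kB bit array with e.g. zlib will yield an acceptably compact encoding for small result sets. Furthermore, zlib has a function deflateBound() which, given the uncompressed size, calculates the maximum payload size (in your case it will definitely be larger than 125 kB, but not by much).

【Comments】:

【Solution 2】:

Input specification: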

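Row numbers between 0 and 999999 (if 1-based indexing is needed, an offset can be applied)
Each row number appears only once
The numbers are sorted in ascending order (useful, we would want to sort them anyway)

One good idea you had was to invert the meaning of the result when the number of matches is more than half of the possible values. Let's retain that, and assume we are given a flag and a list of matches/misses.

Your initial attempt encoded the numbers as comma-separated text. That means that for 90% of the possible values you need 6 characters + 1 separator, so about 7 bytes on average. However, since the maximum value is 999999, you really only need 20 bits to encode each entry.

Hence, the first idea for reducing the size is to use a binary encoding.

Binary Encoding

The simplest approach is to write the number of values sent, followed by a stream of 32-bit integers.

A more efficient approach is to pack two 20-bit values into every 5 bytes written. If the count is odd, you just pad the 4 excess bits with zeros.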
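
A minimal sketch of the 20-bit packing, under assumptions the answer does not spell out: the bit layout (first value in the high 20 bits of each 5-byte group, most significant byte first) and the helper names `pack20` and `unpack20` are chosen here purely for illustration, and the value count is assumed to be transmitted separately.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Pack values below 2^20 at 20 bits each: every pair (a, b) becomes 5 bytes.
std::vector<std::uint8_t> pack20(const std::vector<std::uint32_t>& values)
{
    std::vector<std::uint8_t> out;
    out.reserve((values.size() * 20 + 7) / 8);
    for (std::size_t i = 0; i < values.size(); i += 2) {
        std::uint64_t a = values[i] & 0xFFFFF;
        if (i + 1 < values.size()) {
            std::uint64_t b = values[i + 1] & 0xFFFFF;
            std::uint64_t group = (a << 20) | b;  // 40 bits in total
            for (int shift = 32; shift >= 0; shift -= 8) {
                out.push_back(static_cast<std::uint8_t>(group >> shift));
            }
        } else {
            // Odd count: 20 data bits followed by 4 zero padding bits (3 bytes).
            std::uint32_t tail = static_cast<std::uint32_t>(a << 4);
            out.push_back(static_cast<std::uint8_t>(tail >> 16));
            out.push_back(static_cast<std::uint8_t>(tail >> 8));
            out.push_back(static_cast<std::uint8_t>(tail));
        }
    }
    return out;
}

// Inverse of pack20. `count` is the number of values originally packed.
std::vector<std::uint32_t> unpack20(const std::vector<std::uint8_t>& bytes,
                                    std::size_t count)
{
    std::vector<std::uint32_t> values;
    values.reserve(count);
    std::size_t pos = 0;
    for (std::size_t i = 0; i < count; i += 2) {
        if (i + 1 < count) {
            std::uint64_t group = 0;
            for (int k = 0; k < 5; ++k) {
                group = (group << 8) | bytes[pos++];
            }
            values.push_back(static_cast<std::uint32_t>(group >> 20));
            values.push_back(static_cast<std::uint32_t>(group & 0xFFFFF));
        } else {
            std::uint32_t tail = (static_cast<std::uint32_t>(bytes[pos]) << 16)
                               | (static_cast<std::uint32_t>(bytes[pos + 1]) << 8)
                               | static_cast<std::uint32_t>(bytes[pos + 2]);
            pos += 3;
            values.push_back(tail >> 4);  // drop the 4 padding bits
        }
    }
    return values;
}
```

Three values therefore occupy 8 bytes (5 for the pair, 3 for the padded tail) instead of 12 bytes as 32-bit integers.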

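These approaches may be fine for a small number of matches (or misses). However, the important thing to note is that for each row we only need to track 1 bit of information: whether it is present or not. That means we can encode the result as a bitmap of 1000000 bits.

Combining the two approaches, we can use the bitmap when there are many matches or misses, and switch to the binary encoding when that is more efficient.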
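
The switch-over point can be worked out with a little arithmetic. The following is a sketch under the assumption of 20 bits per listed value and no header overhead; `packed_list_bytes` and `best_payload_bytes` are hypothetical helper names.

```cpp
#include <cstddef>

const std::size_t kRows = 1000000;           // rows in the data set
const std::size_t kBitmapBytes = kRows / 8;  // 125000 bytes for the bitmap

// Bytes needed for `count` values packed at 20 bits each, rounded up.
std::size_t packed_list_bytes(std::size_t count)
{
    return (count * 20 + 7) / 8;
}

// Payload size (headers excluded) when the encoder lists misses instead of
// matches once more than half the rows match, and then picks whichever of the
// packed list and the bitmap is smaller.
std::size_t best_payload_bytes(std::size_t match_count)
{
    std::size_t listed = match_count <= kRows / 2 ? match_count
                                                  : kRows - match_count;
    std::size_t list_bytes = packed_list_bytes(listed);
    return list_bytes < kBitmapBytes ? list_bytes : kBitmapBytes;
}
```

With these numbers the list wins only below 50000 listed values (50000 * 20 bits = 1000000 bits), and the payload never exceeds the 125000-byte bitmap.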

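Range Reduction

The next potential improvement when coding sorted sequences of integers is range reduction.

The idea is to code the values from largest to smallest, reducing the number of bits per value as the values get smaller.

First, we encode the number of bits N needed to represent the first value.
We encode the first value using N bits
For each following value
- Encode the value using N bits
- If the value requires fewer bits to encode, reduce N accordingly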
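
The steps above can be sketched as follows. The stream layout (a 20-bit value count, then a 5-bit N, then the values, all written most significant bit first) and every name in the block are assumptions made for this illustration; the answer describes only the general idea.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal MSB-first bit writer, used only by this sketch.
struct BitWriter {
    std::vector<std::uint8_t> bytes;
    std::size_t bit_count = 0;
    void put(std::uint32_t value, unsigned width) {
        for (unsigned i = width; i-- > 0;) {
            if (bit_count % 8 == 0) bytes.push_back(0);
            if ((value >> i) & 1u) {
                bytes.back() |= static_cast<std::uint8_t>(0x80u >> (bit_count % 8));
            }
            ++bit_count;
        }
    }
};

// Matching MSB-first bit reader.
struct BitReader {
    const std::vector<std::uint8_t>& bytes;
    std::size_t pos = 0;
    explicit BitReader(const std::vector<std::uint8_t>& b) : bytes(b) {}
    std::uint32_t get(unsigned width) {
        std::uint32_t value = 0;
        for (unsigned i = 0; i < width; ++i, ++pos) {
            value = (value << 1) | ((bytes[pos / 8] >> (7 - pos % 8)) & 1u);
        }
        return value;
    }
};

// Number of bits needed to represent `value` (0 needs no bits).
unsigned bits_needed(std::uint32_t value)
{
    unsigned bits = 0;
    while (value != 0) { ++bits; value >>= 1; }
    return bits;
}

// Input: unique values below 2^20, sorted ascending; coded largest first.
std::vector<std::uint8_t> encode_range_reduced(const std::vector<std::uint32_t>& ascending)
{
    BitWriter writer;
    writer.put(static_cast<std::uint32_t>(ascending.size()), 20);  // value count
    if (ascending.empty()) return writer.bytes;
    unsigned width = bits_needed(ascending.back());
    writer.put(width, 5);                    // N for the first (largest) value
    for (std::size_t i = ascending.size(); i-- > 0;) {
        writer.put(ascending[i], width);     // code the value using the current N
        width = bits_needed(ascending[i]);   // later values are smaller: N only shrinks
    }
    return writer.bytes;
}

std::vector<std::uint32_t> decode_range_reduced(const std::vector<std::uint8_t>& bytes)
{
    BitReader reader(bytes);
    std::size_t count = reader.get(20);
    std::vector<std::uint32_t> ascending(count);
    if (count == 0) return ascending;
    unsigned width = reader.get(5);
    for (std::size_t i = count; i-- > 0;) {
        ascending[i] = reader.get(width);    // same N the encoder used
        width = bits_needed(ascending[i]);
    }
    return ascending;
}
```

For the list {0, 3, 700, 999999} this writes 20 + 5 + 20 + 20 + 10 + 2 = 77 bits, i.e. 10 bytes: the saving comes only from the small values at the end of the descending sequence.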

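Entropy Encoding

Let's go back to the bitmap encoding. According to Shannon's entropy theory, the worst case is when 50% of the rows match. The more skewed the probabilities are, the fewer bits we need on average to code each entry. The table below gives the entropy, in bits, of the whole 1000000-entry bitmap for various numbers of matches.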

Matches | Bits
--------+-----------
0       | 0
1       | 22
2       | 41
3       | 60
4       | 78
5       | 96
10      | 181
100     | 1474
1000    | 11408
10000   | 80794
100000  | 468996
250000  | 811279
500000  | 1000000
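
Figures like these can be computed with the binary entropy formula H(p) = -p*log2(p) - (1-p)*log2(1-p), multiplied by the number of entries. A minimal sketch, assuming each of the 1000000 entries is treated as an independent event with probability matches/1000000 and the total is rounded up to a whole bit (`entropy_bits` is a name chosen here):

```cpp
#include <cmath>
#include <cstdint>

// Shannon entropy, in bits, of a bitmap with `total` entries of which
// `matches` are set, rounded up to a whole bit.
std::uint64_t entropy_bits(std::uint64_t matches, std::uint64_t total = 1000000)
{
    if (matches == 0 || matches == total) {
        return 0;  // the outcome is certain, so there is nothing to encode
    }
    double p = static_cast<double>(matches) / static_cast<double>(total);
    double per_entry = -p * std::log2(p) - (1.0 - p) * std::log2(1.0 - p);
    return static_cast<std::uint64_t>(
        std::ceil(per_entry * static_cast<double>(total)));
}
```

For example, a single match costs about 19.9 bits to identify one row out of a million, plus about 1.4 bits spread over the 999999 misses, which rounds up to 22.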

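To do this, we need an entropy coder that can code fractional bits, such as an arithmetic or range coder, or one of the newer ANS-based coders like FSE. Alternatively, we could group symbols together and use Huffman coding.

Prototypes And Measurements

I wrote a test using a 32-bit implementation of FastAC by Amir Said, which limits the model to 4 decimal places. (This is not really a problem, since we shouldn't be feeding such data to the codec directly. This is just a demonstration.)

First, some common code: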

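// Standard headers used by the snippets in this answer. The codec classes
// further down additionally need the FastAC, zlib (<zlib.h>) and
// bzip2 (<bzlib.h>) headers.
#include <algorithm>  // std::max, std::min
#include <cassert>
#include <cstddef>    // size_t
#include <cstdint>    // uint8_t, uint32_t
#include <set>
#include <stdexcept>  // std::runtime_error
#include <vector>

typedef std::vector<uint8_t> match_symbols_t;
typedef std::vector<uint32_t> match_list_t;
typedef std::set<uint32_t> match_set_t;
typedef std::vector<uint8_t> buffer_t;
// ----------------------------------------------------------------------------
static uint32_t const NUM_VALUES(1000000);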
// ============================================================================
size_t symbol_count(uint8_t bits)
{
    size_t count(NUM_VALUES / bits);
    if (NUM_VALUES % bits > 0) {
        return count + 1;
    }
    return count;
}
// ----------------------------------------------------------------------------
void set_symbol(match_symbols_t& symbols, uint8_t bits, uint32_t match, bool state)
{
    size_t index(match / bits);
    size_t offset(match % bits);
    if (state) {
        symbols[index] |= 1 << offset;
    } else {
        symbols[index] &= ~(1 << offset);
    }
}
// ----------------------------------------------------------------------------
bool get_symbol(match_symbols_t const& symbols, uint8_t bits, uint32_t match)
{
    size_t index(match / bits);
    size_t offset(match % bits);
    return (symbols[index] & (1 << offset)) != 0;
}
// ----------------------------------------------------------------------------
match_symbols_t make_symbols(match_list_t const& matches, uint8_t bits)
{
    assert((bits > 0) && (bits <= 8));

    match_symbols_t symbols(symbol_count(bits), 0);
    for (auto match : matches) {
        set_symbol(symbols, bits, match, true);
    }

    return symbols;
}
// ----------------------------------------------------------------------------
match_list_t make_matches(match_symbols_t const& symbols, uint8_t bits)
{
    match_list_t result;
    for (uint32_t i(0); i < NUM_VALUES; ++i) {
        if (get_symbol(symbols, bits, i)) {
            result.push_back(i);
        }
    }
    return result;
}

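First, the simpler variant: write the number of matches, determine the probability of a match/miss, and clamp it to the supported range. Then simply encode each value of the bitmap using this static probability model.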

class arithmetic_codec_v1
{
public:
    buffer_t compress(match_list_t const& matches)
    {
        uint32_t match_count(static_cast<uint32_t>(matches.size()));

        arithmetic_codec codec(static_cast<uint32_t>(NUM_VALUES / 4));
        codec.start_encoder();

        // Store the number of matches (1000000 needs only 20 bits)
        codec.put_bits(match_count, 20);

        if (match_count > 0) {
            // Initialize the model
            static_bit_model model;
            model.set_probability_0(get_probability_0(match_count));

            // Create a bitmap and code all the bitmap entries
            // NB: This is lazy and inefficient, but simple
            match_symbols_t symbols = make_symbols(matches, 1);
            for (auto entry : symbols) {
                codec.encode(entry, model);
            }
        }

        uint32_t compressed_size = codec.stop_encoder();
        return buffer_t(codec.buffer(), codec.buffer() + compressed_size);
    }

    match_list_t decompress(buffer_t& compressed)
    {
        arithmetic_codec codec(static_cast<uint32_t>(compressed.size()), &compressed[0]);
        codec.start_decoder();

        // Read number of matches (20 bits)
        uint32_t match_count(codec.get_bits(20));

        match_list_t result;
        if (match_count > 0) {
            static_bit_model model;
            model.set_probability_0(get_probability_0(match_count));

            result.reserve(match_count);
            for (uint32_t i(0); i < NUM_VALUES; ++i) {
                uint32_t entry = codec.decode(model);
                if (entry == 1) {
                    result.push_back(i);
                }
            }
        }

        codec.stop_decoder();
        return result;
    }

private:
    double get_probability_0(uint32_t match_count, uint32_t num_values = NUM_VALUES)
    {
        double probability_0(double(num_values - match_count) / num_values);
        // Limit probability to match FastAC limitations...
        return std::max(0.0001, std::min(0.9999, probability_0));
    }
};

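The second approach is to adapt the model based on the symbols we encode. After each match is encoded, reduce the probability of the next match. Once all the matches have been encoded, stop.

The second variant compresses slightly better, but at a noticeable performance cost.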

class arithmetic_codec_v2
{
public:
    buffer_t compress(match_list_t const& matches)
    {
        uint32_t match_count(static_cast<uint32_t>(matches.size()));
        uint32_t total_count(NUM_VALUES);

        arithmetic_codec codec(static_cast<uint32_t>(NUM_VALUES / 4));
        codec.start_encoder();

        // Store the number of matches (1000000 needs only 20 bits)
        codec.put_bits(match_count, 20);

        if (match_count > 0) {
            static_bit_model model;

            // Create a bitmap and code all the bitmap entries
            // NB: This is lazy and inefficient, but simple
            match_symbols_t symbols = make_symbols(matches, 1);
            for (auto entry : symbols) {
                model.set_probability_0(get_probability_0(match_count, total_count));
                codec.encode(entry, model);
                --total_count;
                if (entry) {
                    --match_count;
                }
                if (match_count == 0) {
                    break;
                }
            }
        }

        uint32_t compressed_size = codec.stop_encoder();
        return buffer_t(codec.buffer(), codec.buffer() + compressed_size);
    }

    match_list_t decompress(buffer_t& compressed)
    {
        arithmetic_codec codec(static_cast<uint32_t>(compressed.size()), &compressed[0]);
        codec.start_decoder();

        // Read number of matches (20 bits)
        uint32_t match_count(codec.get_bits(20));
        uint32_t total_count(NUM_VALUES);

        match_list_t result;
        if (match_count > 0) {
            static_bit_model model;
            result.reserve(match_count);
            for (uint32_t i(0); i < NUM_VALUES; ++i) {
                model.set_probability_0(get_probability_0(match_count, NUM_VALUES - i));
                if (codec.decode(model) == 1) {
                    result.push_back(i);
                    --match_count;
                }
                if (match_count == 0) {
                    break;
                }
            }
        }

        codec.stop_decoder();
        return result;
    }

private:
    double get_probability_0(uint32_t match_count, uint32_t num_values = NUM_VALUES)
    {
        double probability_0(double(num_values - match_count) / num_values);
        // Limit probability to match FastAC limitations...
        return std::max(0.0001, std::min(0.9999, probability_0));
    }
};
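
A rough way to see what the adaptive variant can gain, assuming an ideal coder and unclamped probabilities (neither of which holds exactly for the FastAC-based code above): when the probability of a match is always "remaining matches / remaining positions", every possible set of k matches among N rows ends up with the same total probability 1/C(N, k), so the ideal output is about log2 C(N, k) bits. The helper name `log2_binomial` is made up for this sketch.

```cpp
#include <cmath>
#include <cstdint>

// log2 of the binomial coefficient C(total, matches), computed through the
// log-gamma function to avoid overflowing on huge factorials.
double log2_binomial(std::uint32_t matches, std::uint32_t total = 1000000)
{
    double ln_choose = std::lgamma(total + 1.0)
                     - std::lgamma(matches + 1.0)
                     - std::lgamma(total - matches + 1.0);
    return ln_choose / std::log(2.0);
}
```

For one match this gives about 19.9 bits against roughly 21.4 bits for the static model, and for 500000 matches it falls only about ten bits short of 1000000, which fits the observation that the second variant is only slightly better.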

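Practical Approach

In practice, it is probably not worth designing a new compression format. In fact, it might not even be worth writing the results as bits: just make an array of bytes with values 0 or 1. Then use an existing compression library. zlib is very common, or you could try lz4, snappy, bzip2, lzma... there are plenty of choices.

ZLib Example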

class zlib_codec
{
public:
    zlib_codec(uint32_t bits_per_symbol) : bits_per_symbol(bits_per_symbol) {}

    buffer_t compress(match_list_t const& matches)
    {
        match_symbols_t symbols(make_symbols(matches, bits_per_symbol));

        z_stream defstream;
        defstream.zalloc = nullptr;
        defstream.zfree = nullptr;
        defstream.opaque = nullptr;

        deflateInit(&defstream, Z_BEST_COMPRESSION);
        size_t max_compress_size = deflateBound(&defstream, static_cast<uLong>(symbols.size()));

        buffer_t compressed(max_compress_size);

        defstream.avail_in = static_cast<uInt>(symbols.size());
        defstream.next_in = &symbols[0];
        defstream.avail_out = static_cast<uInt>(max_compress_size);
        defstream.next_out = &compressed[0];

        deflate(&defstream, Z_FINISH);
        deflateEnd(&defstream);

        compressed.resize(defstream.total_out);
        return compressed;
    }

    match_list_t decompress(buffer_t& compressed)
    {
        z_stream infstream;
        infstream.zalloc = nullptr;
        infstream.zfree = nullptr;
        infstream.opaque = nullptr;

        inflateInit(&infstream);

        match_symbols_t symbols(symbol_count(bits_per_symbol));

        infstream.avail_in = static_cast<uInt>(compressed.size());
        infstream.next_in = &compressed[0];
        infstream.avail_out = static_cast<uInt>(symbols.size());
        infstream.next_out = &symbols[0];

        inflate(&infstream, Z_FINISH);
        inflateEnd(&infstream);

        return make_matches(symbols, bits_per_symbol);
    }
private:
    uint32_t bits_per_symbol;
};

BZip2 Example

class bzip2_codec
{
public:
    bzip2_codec(uint32_t bits_per_symbol) : bits_per_symbol(bits_per_symbol) {}

    buffer_t compress(match_list_t const& matches)
    {
        match_symbols_t symbols(make_symbols(matches, bits_per_symbol));

        uint32_t compressed_size = symbols.size() * 2;
        buffer_t compressed(compressed_size);

        int err = BZ2_bzBuffToBuffCompress((char*)&compressed[0]
            , &compressed_size
            , (char*)&symbols[0]
            , symbols.size()
            , 9
            , 0
            , 30);
        if (err != BZ_OK) {
            throw std::runtime_error("Compression error.");
        }

        compressed.resize(compressed_size);
        return compressed;
    }

    match_list_t decompress(buffer_t& compressed)
    {
        match_symbols_t symbols(symbol_count(bits_per_symbol));

        uint32_t decompressed_size = symbols.size();
        int err = BZ2_bzBuffToBuffDecompress((char*)&symbols[0]
            , &decompressed_size
            , (char*)&compressed[0]
            , compressed.size()
            , 0
            , 0);
        if (err != BZ_OK) {
            throw std::runtime_error("Decompression error.");
        }
        if (decompressed_size != symbols.size()) {
            throw std::runtime_error("Size mismatch.");
        }

        return make_matches(symbols, bits_per_symbol);
    }
private:
    uint32_t bits_per_symbol;
};

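Comparison

The code repository, including dependencies for 64-bit Visual Studio 2015, is at https://github.com/dan-masek/bounded_sorted_list_compression.git

【Comments】:

Wow, what a great answer, terrific. Hopefully others will find it useful too!
@David542 Thanks :) One more thing came to mind (based on a little coding competition we had almost a decade ago...): using a bitmap is a nice O(N) way to sort a list of unique bounded integers. If you generate the bitmap before encoding, you don't even need to sort the input.
It might be useful to add a demonstration of the 20-bit integers or range reduction, along with some measurements of those, raw or compressed with a general-purpose compressor. I might hack that together when I have some time.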

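【Solution 3】:

Storing a compressed list of sorted integers is extremely common in data retrieval and database applications, and a variety of techniques have been developed.

I'm pretty sure that an incompressible random selection of about half of the items in your list is going to be your worst case.

Many popular integer-list compression techniques, such as Roaring bitmaps, fall back (for such worst-case input data) to a 1-bit-per-index bitmap.

So in your case, with 1 million rows, the maximum-size payload returned would be (in the worst case) a header with the "using a bitmap" flag set, followed by a bitmap of 1 million bits (125,000 bytes), where, for example, the 700th bit of the bitmap is set to 1 if the 700th row in the database is a match, or to 0 if the 700th row in the database is not a match. (Thank you, Dan Mašek!)

My understanding is that while quasi-succinct Elias-Fano compression and other techniques are very useful for compressing many "naturally occurring" sets of sorted integers, for this worst-case data set none of them gives better compression, and most of them give far worse "compression", than a simple bitmap.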
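
As a rough illustration of that claim, the commonly quoted space bound for Elias-Fano is at most 2 + ceil(log2(u / n)) bits per element, for n sorted values drawn from a universe of size u. The sketch below evaluates that bound only; it is not an Elias-Fano encoder, and `elias_fano_bits` is a hypothetical helper name.

```cpp
#include <cstdint>

// Upper bound on the Elias-Fano size in bits for n sorted values from [0, u),
// using the bound n * (2 + ceil(log2(u / n))).
std::uint64_t elias_fano_bits(std::uint64_t n, std::uint64_t u = 1000000)
{
    if (n == 0) {
        return 0;
    }
    unsigned low_bits = 0;
    while ((n << low_bits) < u) {
        ++low_bits;  // ends at the smallest l with n * 2^l >= u
    }
    return n * (2 + low_bits);
}
```

For 500000 matches out of 1000000 rows the bound is 1500000 bits, well above the 1000000-bit bitmap, while for 1000 matches it is 12000 bits, far below it.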

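(This is analogous to the way most general-purpose data compression algorithms, such as DEFLATE, when fed "worst-case" data such as encrypted data that is indistinguishable from random, create "compressed" files consisting of a few bytes of overhead with the "stored/raw/literal" flag set, followed by a plain copy of the uncompressed file.)

Jianguo Wang; Chunbin Lin; Yannis Papakonstantinou; Steven Swanson. "An Experimental Study of Bitmap Compression vs. Inverted List Compression"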
https://en.wikipedia.org/wiki/Bitmap_index#Compression
https://en.wikipedia.org/wiki/Inverted_index#Compression

【Comments】: