Bit counting in a contiguous memory chunk: answers

【Question title】: Bit counting in a contiguous memory chunk
【Posted】: 2011-11-04 23:57:04
【Question】:

I was asked the following question in an interview.

int countSetBits(void *ptr, int start, int end);

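Synopsis: Assume ptr points to a large chunk of memory. Viewing this memory as a contiguous sequence of bits, start and end are bit positions. Assume start and end have proper values and ptr points to an initialized chunk of memory.

Question: Write C code that counts the number of set bits from start to end [inclusive] and returns the count.

Just to make it clearer: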

 ptr---->+-------------------------------+
         | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
         +-------------------------------+
         | 8 | 9 |                   |15 |
         +-------------------------------+
         |                               |
         +-------------------------------+
              ...
              ...
         +-------------------------------+
         |               | S |           |
         +-------------------------------+
              ...
              ...
         +-------------------------------+
         |    | E |                      |
         +-------------------------------+
              ...
              ...

My solution:

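int countSetBits(void *ptr, int start, int end)
{
    int count = 0, idx;
    /* standard C does not allow arithmetic on void*, so index through a byte pointer */
    const unsigned char *bytes = ptr;

    for (idx = start; idx <= end; idx++)
    {
        /* bit 0 is the MSB of byte 0, so the mask starts at 128 (0x80) */
        if ((128 >> (idx % 8)) & bytes[idx / 8])
        {
            count++;
        }
    }

    return count;
}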

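I gave a very lengthy and inefficient piece of code in the interview. I worked on it afterwards and came up with the solution above.

I am quite sure the SO community can provide a more elegant solution. I am just curious to see their responses.

PS: The code above has not been compiled. It is more like pseudocode and may contain errors.

【Question discussion】: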

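I think you may have missed the point. It would be far more efficient if your main loop worked in bytes rather than bits, so you would have for (int i = ...; ...; ++i) count += bits_per_byte[((unsigned char*)ptr)[i]], where bits_per_byte is a precomputed array of 256 values holding the number of set bits for each possible byte value. At the start and end of the loop you have some extra work to do, where you don't have a full byte to play with.
I think they wanted you to demonstrate the use of a lookup table. Assuming 8-bit chars, you would create a 256-entry array that maps a char to the number of 1 bits in it. Of course, you can do the same for 16-bit quantities, or for any quantity that is not too large and is handled efficiently.
I'm not sure about the start/end values. Do they refer to bit offsets, byte offsets, or something else?
Are the start and end indexes counted from MSB to LSB within a byte, or the other way around?
@harold: That was not stated explicitly; I would expect this to be a question you have to ask in the interview. The example countSetBits() function in the question counts from MSB to LSB.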
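The 256-entry table suggested in the comments above does not have to be typed in by hand. A minimal sketch, assuming 8-bit bytes: the table name bits_per_byte comes from the comment, while init_bits_per_byte and count_bits_in_bytes are hypothetical helper names. It only covers whole bytes; partial bytes at either end of a bit range still need separate handling.

```c
#include <assert.h>
#include <stddef.h>

/* bits_per_byte[v] = number of 1 bits in the byte value v. */
static unsigned char bits_per_byte[256];

/* Fill the table once at startup. Entry i reuses entry i >> 1:
   i has the same 1 bits as i >> 1, plus its own lowest bit. */
static void init_bits_per_byte(void)
{
    bits_per_byte[0] = 0;
    for (int i = 1; i < 256; i++)
        bits_per_byte[i] = (unsigned char)(bits_per_byte[i >> 1] + (i & 1));
}

/* Count the set bits in n whole bytes (hypothetical helper;
   partial first/last bytes are not handled here). */
static size_t count_bits_in_bytes(const void *ptr, size_t n)
{
    const unsigned char *bytes = ptr;
    size_t count = 0;

    assert(ptr != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        count += bits_per_byte[bytes[i]];
    return count;
}
```

Call init_bits_per_byte() once before the first count; after that each byte costs a single table lookup.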

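Tags: c++ c algorithm bit-manipulation

【Solution 1】:

In my opinion, the fastest and most efficient way is to use a table with 256 entries, where each element holds the number of set bits in its index. The index is the next byte from the memory location.

Something like this: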

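int bit_table[256] = {0, 1, 1, 2, 1, ...};
/* note: start and end are used as byte offsets here, and end is excluded */
const unsigned char *p = (const unsigned char *)ptr + start;
const unsigned char *stop = (const unsigned char *)ptr + end;
int count = 0;
for (; p != stop; p++)
    count += bit_table[*p];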

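【Discussion】:

But an important part of the question is that start and end are bit indexes, not byte indexes. And they might not fall on byte boundaries (and they might lie within the same byte, which adds another boundary condition).
Yes, this is a good optimization for the bulk of the range, but you have to use something else for the boundaries.
Accepted this reply, as it is definitely a very fast solution. However, there are some optimizations we can make, even in an interview situation. For example, use a lookup array of size 16 holding the set-bit counts for each 4-bit half of a byte. @Michael Burr and Jiri use that below.
@dimitri, did you actually benchmark this? On modern machines a RAM access costs 200-300 cycles, a cache access costs a few dozen (depending on the level), and registers are essentially instantaneous. A lookup table puts pressure on the processor's memory hierarchy, which may wipe out any performance gained from doing less computation. On modern processors with deep memory hierarchies, doing more computation and fewer memory accesses is often faster than the other way around. I'm not saying that is the case here, but it is hard to know without a benchmark!
The question clearly states that start and end are BIT positions. This code is wrong, and whoever flagged it as such is correct.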
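To make the "more computation, fewer memory accesses" alternative from the comments above concrete, here is a minimal table-free sketch. popcount_byte and count_bits_no_table are hypothetical names, and nothing here says which variant is faster; only a benchmark on the target machine can answer that.

```c
#include <assert.h>
#include <stddef.h>

/* Count set bits in one byte without any table: each pass of
   v &= v - 1 clears the lowest set bit, so the loop runs once per 1 bit. */
static int popcount_byte(unsigned char v)
{
    int count = 0;
    while (v) {
        v &= (unsigned char)(v - 1);
        count++;
    }
    return count;
}

/* Sum over n whole bytes (hypothetical helper; partial first/last
   bytes of a bit range are not handled here). */
static size_t count_bits_no_table(const void *ptr, size_t n)
{
    const unsigned char *bytes = ptr;
    size_t total = 0;

    assert(ptr != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        total += (size_t)popcount_byte(bytes[i]);
    return total;
}
```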

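【Solution 2】:

You might find this page interesting; it contains several alternative solutions to your problem.

【Discussion】:

If it is interesting, take the time to write it up again here, crediting where and by whom it was first created.
I don't believe I have the right to copy-paste the solutions here (except for the code, which is explicitly public domain), and explaining them in my own words would be laborious. Why do you think I need to put in that effort? I don't think duplicating work that has already been done would make my answer any more helpful.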
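Sure you can, just mention your sources. Everyone likes plain old plagiarism. Don't underestimate the positive results of having to explain something to others, and of others reading or hearing a fresh take on something that has been explained before or is available in a textbook.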

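【Solution 3】:

There are many ways to solve this problem. This is a good post comparing the performance of the commonly used options.

【Discussion】:

【Solution 4】:

@dimitri's version is probably the fastest. But in an interview it is hard to build the bit-count table for all 256 8-bit byte values. You can get a very fast version with a table for just the 16 hex digits 0x0, 0x1, ..., 0xF, which you can build easily: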

int countBits(void *ptr, int start, int end) {
    // start, end are byte indexes (both inclusive)
    // hexCounts[n] = number of set bits in the 4-bit value n
    static const int hexCounts[16] = {0, 1, 1, 2,   1, 2, 2, 3,
                                      1, 2, 2, 3,   2, 3, 3, 4};
    unsigned char * pstart = (unsigned char *) ptr + start;
    unsigned char * pend = (unsigned char *) ptr + end;
    int count = 0;
    for (unsigned char * p = pstart; p <= pend; ++p) {
        unsigned char b = *p;
        count += hexCounts[b & 0x0F] + hexCounts[(b >> 4) & 0x0F];
    }
    return count;
}

Edit: If start and end are bit indexes, the bits in the first and last bytes are counted first, before calling the function above:

int countBits2(void *ptr, int start, int end) {
    // start, end are bit indexes; bit 0 is the MSB of byte 0
    if (start > end) return 0;
    int count = 0;
    int first = start / 8;                                   // index of first byte
    int last = end / 8;                                      // index of last byte
    int istart = start % 8;                                  // index in first byte
    int iend = end % 8;                                      // index in last byte
    const unsigned char *bytes = (const unsigned char *) ptr;
    unsigned char b = bytes[first];                          // byte
    if (first == last) {                                     // count in 1 byte only
        b = b << istart;
        for (int i = istart; i <= iend; ++i) {               // between istart, iend
            if (b & 0x80) ++count;
            b = b << 1;
        }
        return count;
    }
    for (int i = istart; i < 8; ++i) {                       // from istart to 7
        if (b & 1) ++count;
        b = b >> 1;
    }
    b = bytes[last];
    for (int i = 0; i <= iend; ++i) {                        // from 0 to iend
        if (b & 0x80) ++count;
        b = b << 1;
    }
    if (last - first > 1)                                    // whole bytes in between
        count += countBits(ptr, first + 1, last - 1);
    return count;
}

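【Discussion】:

countBits2() doesn't pass the correct start and end indexes to countBits(). Even if it did, it would overcount when the start and end range is contained within a single byte (e.g. start == 4 and end == 5).
@Michael: Thanks for the hint. Corrected now, hopefully it is OK. That would be a long interview! :)
Handling these corner cases is more of a pain than you would think, right? (It certainly was for me.) Now you have left a couple of small bugs/typos in the count += ... line of the original countBits() function.
I see, corrected. Well, it would still need some testing before going into production. But as an idea it should be good enough. Thanks!
Indeed, I think that in an actual interview, pointing out the issues that need to be handled is the main thing. But if this were a "homework" question in an interview (which seems to be increasingly popular), I would definitely look for all the corner cases being handled correctly.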
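One way to flush out the corner cases discussed above is to compare any fast version against the slow bit-by-bit loop for every possible range in a small buffer. A minimal sketch, assuming the question's MSB-first bit order; count_fn, count_reference, count_whole_bytes and count_mismatches are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Signature shared by the question's function and any candidate. */
typedef int (*count_fn)(void *ptr, int start, int end);

/* Slow reference: one bit at a time, bit 0 = MSB of byte 0. */
static int count_reference(void *ptr, int start, int end)
{
    const unsigned char *bytes = ptr;
    int count = 0;
    for (int idx = start; idx <= end; idx++)
        if (bytes[idx / 8] & (0x80 >> (idx % 8)))
            count++;
    return count;
}

/* Deliberately naive candidate: counts every bit of every byte touched,
   ignoring the partial first and last bytes. Used to show the check failing. */
static int count_whole_bytes(void *ptr, int start, int end)
{
    return count_reference(ptr, (start / 8) * 8, (end / 8) * 8 + 7);
}

/* Compare a candidate against the reference for every inclusive range
   [start, end] inside a buffer of nbytes bytes; return the mismatch count. */
static int count_mismatches(count_fn candidate, void *buf, int nbytes)
{
    int mismatches = 0;
    int nbits = nbytes * 8;

    assert(candidate != NULL && buf != NULL && nbytes > 0);
    for (int start = 0; start < nbits; start++)
        for (int end = start; end < nbits; end++)
            if (candidate(buf, start, end) != count_reference(buf, start, end))
                mismatches++;
    return mismatches;
}
```

A buffer of three or four bytes already covers the single-byte, adjacent-byte and interior-byte cases; a correct candidate should report zero mismatches, while count_whole_bytes does not.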

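【Solution 5】:

Boundary conditions, they get no respect...

Everyone here seems to be focusing on lookup tables to count the bits. That is fine, but I think that when answering an interview question it is more important to make sure you handle the boundary conditions.

The lookup table is just an optimization. Getting the right answer is more important than getting an answer quickly. If this were my interview, going straight to the lookup table without even mentioning that there are some tricky details in handling the first and last few bits that are not on full-byte boundaries would be worse than coming up with a solution that counts every bit ploddingly but gets the boundary conditions right.

So I think Bhaskar's solution in his question is probably superior to most of the answers mentioned here: it appears to handle the boundary conditions.

Here is a solution that uses a lookup table and still tries to handle the boundaries (it has only been lightly tested, so I won't claim it is 100% correct). It is also uglier than I would like, but it's late: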

#include <stddef.h>   /* size_t */
#include <stdint.h>   /* uint8_t */

static
size_t bits_in_byte( uint8_t val)
{
    static int const half_byte[] = { 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4 };

    int result1 = half_byte[val & 0x0f];
    int result2 = half_byte[(val >> 4) & 0x0f];

    return result1 + result2;
}


int countSetBits( void* ptr, int start, int end) 
{
    uint8_t*    first;
    uint8_t*    last;
    int         bits_first;
    int         bits_last;
    uint8_t     mask_first;
    uint8_t     mask_last;

    size_t count = 0;

    // get bits from the first byte
    first = ((uint8_t*) ptr) + (start / 8);
    bits_first = 8 - start % 8;
    mask_first = (1 << bits_first) - 1;
    mask_first = mask_first << (8 - bits_first);


    // get bits from last byte
    last = ((uint8_t*) ptr) + (end / 8);
    bits_last = 1 + (end % 8);
    mask_last = (1 << bits_last) - 1;

    if (first == last) {
        // we only have a range of bits in  the first byte
        count = bits_in_byte( (*first) & mask_first & mask_last);        
    }
    else {
        // handle the bits from the first and last bytes specially
        count += bits_in_byte((*first) & mask_first);
        count += bits_in_byte((*last) & mask_last);

        // now we've collected the odds and ends from the start and end of the bit range
        // handle the full bytes in the interior of the range

        for (first = first+1; first != last; ++first) {
            count += bits_in_byte(*first);
        }
    }

    return count;
}

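Note that one detail that would have to be resolved as part of the interview is whether the bits within a byte are indexed starting from the least significant bit (lsb) or the most significant bit (msb). In other words, if the start index is specified as 0, would a byte with the value 0x01 or a byte with the value 0x80 have the bit at that index set? It is a bit like deciding whether the indexes treat the bit order within a byte as big-endian or little-endian.

There is no "right" answer to this; the interviewer has to specify what the behavior should be. I will also note that my example solution handles this the opposite way from the OP's example code (I went by the way I interpreted the diagram, with the indexes also read as "bit numbers"). The OP's solution treats the bit order as big-endian, my function treats it as little-endian. So even though both handle partial bytes at the start and end of the range, they give different answers. Which one is the right answer depends on what the actual spec of the problem is.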
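The two conventions can be put side by side in a few lines. A minimal sketch; bit_set_msb_first and bit_set_lsb_first are hypothetical names used only for illustration.

```c
#include <assert.h>

/* Is bit idx set, if index 0 is the most significant bit of byte 0?
   (The convention used by the code in the question.) */
static int bit_set_msb_first(const void *ptr, int idx)
{
    const unsigned char *bytes = ptr;
    assert(idx >= 0);
    return (bytes[idx / 8] >> (7 - idx % 8)) & 1;
}

/* Is bit idx set, if index 0 is the least significant bit of byte 0?
   (The convention used by this answer's countSetBits().) */
static int bit_set_lsb_first(const void *ptr, int idx)
{
    const unsigned char *bytes = ptr;
    assert(idx >= 0);
    return (bytes[idx / 8] >> (idx % 8)) & 1;
}
```

For a single byte with the value 0x01, the lsb-first reading reports index 0 as set, while the msb-first reading reports index 7 as set; for 0x80 it is the other way around.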

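【Discussion】:

+1: there is no point in running faster when you are heading straight into a wall; at best you will just die faster...

【Solution 6】:

Disclaimer: no attempt was made to compile the following code.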

/*
 * Table counting the number of set bits in a byte.
 * The byte is the index to the table.
 */
uint8_t  table[256] = {...};

/***************************************************************************
 *
 * countBits - count the number of set bits in a range
 *
 * The most significant bit in the byte is considered to be bit 0.
 *
 * RETURNS: 0 on success, -1 on failure
 */
int countBits (
    uint8_t *  buffer,
    int        startBit,  /* starting bit */
    int        endBit,    /* End-bit (inclusive) */
    unsigned * pTotal     /* Output: number of set bits in the range */
    ) {
    int      numBits;     /* number of bits left to check */
    int      mask;        /* mask to apply to byte from <buffer> */
    int      bits;        /* # of bits to end of byte */
    int      index;       /* index of the current byte in <buffer> */
    unsigned count = 0;   /* total number of bits set */
    uint8_t  value;       /* value read from the buffer */

    /* Return -1 if parameters fail sanity check (skipped) */

    numBits   = (endBit - startBit) + 1;

    index  = startBit >> 3;
    bits   = 8 - (startBit & 7);
    mask   = (1 << bits) - 1;

    value = buffer[index] & mask;  /* mask-out any bits preceding <startBit> */
    numBits -= bits;

    while (numBits > 0) {          /* Note: if <startBit> and <endBit> are in */
        count += table[value];     /* same byte, this loop gets skipped. */
        index++;
        value = buffer[index];
        numBits -= 8;
    }

    if (numBits < 0) {             /* mask-out any bits following <endBit> */
        bits   = 7 - (endBit & 7); /* # of bits after <endBit> in its byte */
        mask   = 0xff << bits;
        value &= mask;
    }

    count += table[value];

    *pTotal = count;
    return 0;
}

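Edit: function header updated.

【Discussion】:

I'm not sure I fully understood the code, but I still think it could be more concise. I suspect you misunderstood the question. It is not about "consecutive" set bits, but simply about the set bits in "contiguous" memory.
@Bhaskar: The function header description was wrong. Thanks for catching that. :) The idea of the code is to use a lookup table to count the bits set in the range. Special care has to be taken at the start and the end, since they might not be full bytes. The bits before the start bit and the bits after the end bit are masked out before the value is looked up in the table. The algorithm also covers the case where the start and end bits are in the same byte.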
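The masking described above can also be written with two explicit masks. A minimal sketch under the same assumption that bit 0 is the most significant bit of byte 0; count_bits_masked and bits_in_byte are hypothetical names, and a plain loop stands in for the 256-entry table so the block is self-contained.

```c
#include <assert.h>

/* Portable per-byte count, used here instead of the 256-entry table. */
static int bits_in_byte(unsigned char v)
{
    int n = 0;
    for (; v; v >>= 1)
        n += v & 1;
    return n;
}

/* Count set bits in [start_bit, end_bit], inclusive; bit 0 is the MSB of byte 0. */
static int count_bits_masked(const unsigned char *buffer, int start_bit, int end_bit)
{
    assert(buffer != 0 && start_bit >= 0 && start_bit <= end_bit);

    int first = start_bit / 8;
    int last = end_bit / 8;
    /* keeps bits start_bit % 8 .. 7 of the first byte */
    unsigned char first_mask = (unsigned char)(0xFF >> (start_bit % 8));
    /* keeps bits 0 .. end_bit % 8 of the last byte */
    unsigned char last_mask = (unsigned char)(0xFF << (7 - end_bit % 8));

    if (first == last)   /* both masks apply to the same byte */
        return bits_in_byte(buffer[first] & first_mask & last_mask);

    int count = bits_in_byte(buffer[first] & first_mask);
    for (int i = first + 1; i < last; i++)   /* whole bytes in between */
        count += bits_in_byte(buffer[i]);
    return count + bits_in_byte(buffer[last] & last_mask);
}
```

Handling the single-byte case with both masks at once avoids the double counting that the comments on the other answers warn about.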

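【Solution 7】:

Depending on the industry you are applying to, lookup tables might not be an acceptable way to optimize, whereas platform/compiler-specific optimizations are. Knowing that most compilers and CPU instruction sets have a population count instruction, I would go with that. It is a trade-off of simplicity versus performance, since for now I am still iterating over a list of chars.

Also note that, contrary to most of the answers, I assume start and end are byte offsets, since the question does not specify that they are not, and in most cases that is the default.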

#include <assert.h>

int countSetBits(void *ptr, int start, int end )
{
    assert(start < end);

    unsigned char *s = ((unsigned char*)ptr + start);
    unsigned char *e = ((unsigned char*)ptr + end);

    int r = 0;

    while(s != e)
    {
        // __builtin_popcount (GCC/Clang) counts the set bits; it is defined for 0 input too.
        r += __builtin_popcount(*s);
        s++;
    }

    return r;
}
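Since the loop above still visits one char at a time, a natural next step is to feed the builtin eight bytes per call. A minimal sketch, assuming GCC or Clang (the __builtin_popcount family is a compiler extension, not standard C); count_bits_wordwise is a hypothetical name, and it counts n whole bytes rather than a bit range.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Count set bits in n whole bytes, eight bytes at a time where possible.
   __builtin_popcountll and __builtin_popcount are GCC/Clang builtins. */
static size_t count_bits_wordwise(const void *ptr, size_t n)
{
    const unsigned char *bytes = ptr;
    size_t count = 0;
    size_t i = 0;

    assert(ptr != NULL || n == 0);

    for (; i + 8 <= n; i += 8) {
        uint64_t word;
        memcpy(&word, bytes + i, 8);             /* safe for unaligned input */
        count += (size_t)__builtin_popcountll(word);
    }
    for (; i < n; i++)                           /* remaining 0..7 bytes */
        count += (size_t)__builtin_popcount(bytes[i]);
    return count;
}
```

The total does not depend on byte order within the 64-bit word, because the population count only looks at how many bits are set, not where they are.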

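【Discussion】:

【Solution 8】:

An excellent recent study comparing several state-of-the-art techniques for counting the number of set bits in a range of memory (aka Hamming weight, bitset cardinality, sideways sum, population count or popcnt, etc.) can be found in Muła, Kurz and Lemire (2017), Faster population counts using AVX2 instructions¹

Below is a complete, tested and fully working C# adaptation of the "Harley-Seal" algorithm from that paper, which the authors found to be the fastest method that uses general-purpose bitwise operations (i.e. that does not require special hardware).

1. Managed array entry point
(Optional) Provides access to the block-optimized bit counting for a managed array ulong[].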

/// <summary> Returns the total number of 1-valued bits in the array </summary>
[DebuggerStepThrough]
public static int OnesCount(ulong[] rg) => OnesCount(rg, 0, rg.Length);

/// <summary> Finds the total number of '1' bits in an array or its subset </summary>
/// <param name="rg"> Array of ulong values to scan </param>
/// <param name="index"> Starting index in the array </param>
/// <param name="count"> Number of ulong values to examine, starting at 'index' </param>
public static unsafe int OnesCount(ulong[] rg, int index, int count)
{
    if ((index | count) < 0 || index > rg.Length - count)
        throw new ArgumentException();

    fixed (ulong* p = &rg[index])
        return OnesCount(p, count);
}

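2. Scalar API
Used by the block-optimized counter to aggregate the results from the carry-save adders, and to finish off the remainder for any block size that is not divisible by the optimized chunk size of 16 x 8 bytes/ulong = 128 bytes. Also suitable for general-purpose use.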

/// <summary> Finds the Hamming Weight or ones-count of a ulong value </summary>
/// <returns> The number of 1-bits that are set in 'x' </returns>
public static int OnesCount(ulong x)
{
    x -= (x >> 1) & 0x5555555555555555;
    x = ((x >> 2) & 0x3333333333333333) + (x & 0x3333333333333333);
    return (int)((((x + (x >> 4)) & 0x0F0F0F0F0F0F0F0F) * 0x0101010101010101) >> 56);
}

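3. "Harley-Seal" block-optimized 1-bit counter
Processes blocks of 128 bytes at a time, i.e. 16 ulong values per block. Uses the carry-save adder (shown below) to add up single bits across adjacent ulongs, and aggregates the totals upwards as powers of 2.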

/// <summary> Count the number of 'set' (1-valued) bits in a range of memory. </summary>
/// <param name="p"> Pointer to an array of 64-bit ulong values to scan </param>
/// <param name="cq"> Size of the memory block as a count of 64-bit ulongs </param>
/// <returns> The total number of 1-bits </returns>
public static unsafe int OnesCount(ulong* p, int cq)
{
    ulong z, y, x, w;
    int c = 0;

    for (w = x = y = z = 0UL; cq >= 16; cq -= 16)
        c += OnesCount(CSA(ref w,
                            CSA(ref x,
                                CSA(ref y,
                                    CSA(ref z, *p++, *p++),
                                    CSA(ref z, *p++, *p++)),
                                CSA(ref y,
                                    CSA(ref z, *p++, *p++),
                                    CSA(ref z, *p++, *p++))),
                            CSA(ref x,
                                CSA(ref y,
                                    CSA(ref z, *p++, *p++),
                                    CSA(ref z, *p++, *p++)),
                                CSA(ref y,
                                    CSA(ref z, *p++, *p++),
                                    CSA(ref z, *p++, *p++)))));

    c <<= 4;
    c += (OnesCount(w) << 3) + (OnesCount(x) << 2) + (OnesCount(y) << 1) + OnesCount(z);

    while (--cq >= 0)
        c += OnesCount(*p++);

    return c;
}

4. Carry-save adder (CSA)

/// <summary> carry-save adder </summary>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
static ulong CSA(ref ulong a, ulong b, ulong c)
{
    ulong v = a & b | (a ^ b) & c;
    a ^= b ^ c;
    return v;
}
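The carry-save step is easier to follow in isolation. The sketch below is a C rendering of the same arithmetic as the C# OnesCount(ulong) and CSA above, not code from the paper; ones_count, csa and ones_count3 are hypothetical names. It shows the identity the block counter relies on: the bits of three words can be counted as twice the bits of the carry word plus the bits of the sum word.

```c
#include <assert.h>
#include <stdint.h>

/* 64-bit population count using only shifts, masks and a multiply. */
static int ones_count(uint64_t x)
{
    x -= (x >> 1) & 0x5555555555555555ULL;
    x = ((x >> 2) & 0x3333333333333333ULL) + (x & 0x3333333333333333ULL);
    return (int)((((x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL) * 0x0101010101010101ULL) >> 56);
}

/* Carry-save adder: adds three inputs bit position by bit position.
   *a receives the sum bits (weight 1); the carry bits (weight 2) are returned. */
static uint64_t csa(uint64_t *a, uint64_t b, uint64_t c)
{
    uint64_t carry = (*a & b) | ((*a ^ b) & c);
    *a ^= b ^ c;
    return carry;
}

/* Count the bits of three words with two ones_count calls instead of three. */
static int ones_count3(uint64_t a, uint64_t b, uint64_t c)
{
    uint64_t carry = csa(&a, b, c);
    return 2 * ones_count(carry) + ones_count(a);
}
```

The block counter chains this step so that sixteen input words need only one full count per block, plus four counts at the very end.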

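Remarks

Because the approach shown here counts the total number of 1-bits by processing 128-byte chunks at a time, it only becomes optimal for larger memory block sizes, probably at least some (small) multiple of the sixteen-qword (16-ulong) chunk size. For counting 1-bits in smaller memory ranges this code will still work correctly, but it will drastically underperform a more naive approach. See the paper for details.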

The paper includes a diagram that summarizes how the carry-save adder works.

References

[1] Muła, Wojciech, Nathan Kurz, and Daniel Lemire. "Faster population counts using AVX2 instructions." The Computer Journal 61, no. 1 (2017): 111-120.

【Discussion】: