Huffman algorithm inverse matching: answers

【Question title】: Huffman algorithm inverse matching
【Posted】: 2015-06-24 14:37:30
【Question description】:

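I would like to know whether, given a binary sequence, we can check if it matches a string encoded with the Huffman algorithm.

For example, given the string "abdcc" and several binary sequences, can we work out which of them is a possible Huffman encoding of "abdcc"?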
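
To make "possible representation" concrete, here is a minimal sketch that encodes "abdcc" with one fixed prefix-code table. The table is the one used in the answer's example in this thread; the names example_table and encode_with_table are made up for illustration, and the bitstream is written as a string of '0' and '1' characters for readability.

```c
#include <stdio.h>
#include <string.h>

/* Illustrative table (hypothetical name): index 0 is 'a', 1 is 'b',
   2 is 'c', 3 is 'd'. */
static const char *const example_table[4] = { "01", "0001", "1", "0010" };

/* Encode text (letters 'a'..'d' only) into out as a string of '0'/'1'.
   Returns 0 on success, -1 on an unknown character or if out is too small. */
int encode_with_table(const char *text, char *out, size_t out_size)
{
    size_t used = 0;
    if (out_size == 0)
        return -1;
    out[0] = '\0';
    for (; *text != '\0'; text++) {
        if (*text < 'a' || *text > 'd')
            return -1;
        const char *code = example_table[*text - 'a'];
        size_t n = strlen(code);
        if (used + n + 1 > out_size)
            return -1;
        memcpy(out + used, code, n + 1); /* copies the terminator too */
        used += n;
    }
    return 0;
}
```

With this table, "abdcc" becomes 01 0001 0010 1 1, that is, the 12-bit sequence 010001001011. The question is the reverse: given only such a sequence, decide whether some table could have produced it.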

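【Question comments】:

You can certainly use a backtracking search, though there may not be any efficient approach. For example: try encoding "a" in the Huffman tree with the first 1 bit, or the first 2 bits, or ..., or the entire bit string except the last 4 bits. Each of these choices leads to a new node in the search tree, where you consider the rest of the binary string and the rest of the text, and each choice constrains the search tree in some way, producing contradictions that rule out possible encodings.
For example, when only the final "c" is left to match, the remaining binary string must be identical to the code we assumed for "c" in an earlier step.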

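Tags: algorithm encoding huffman-code

【Solution 1】:

Interesting puzzle. As j_random_hacker mentioned in a comment, this can be done with a backtracking search. A valid Huffman encoding of a string obeys a few constraints that we can use to narrow the search: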

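Two Huffman codes of lengths n and m cannot be identical in their first n or m bits (whichever is shorter). Otherwise a Huffman decoder could not tell whether it had reached the longer or the shorter code while decoding. In particular, two codes of the same length cannot be identical. (1)
If at any point fewer bits remain in the bitstream than characters remain in the string being matched, the string cannot match. (2)
If we reach the end of the string and bits still remain in the bitstream, the string does not match. (3)
If we encounter a character for the second time, so that we have already assumed a Huffman code for it, the same code must appear at that position in the bitstream; otherwise the string cannot match. (4)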
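
Rule (1) is the prefix-free property. As a small illustration, separate from the implementation in this answer, the check can be sketched on codes written as strings of '0' and '1'. The function name codes_conflict is hypothetical.

```c
#include <stdbool.h>
#include <string.h>

/* Returns true if the two codes agree on the first min(len(a), len(b))
   bits, i.e. one is a prefix of the other or they are equal (rule 1).
   An empty code conflicts with everything. */
bool codes_conflict(const char *a, const char *b)
{
    size_t la = strlen(a), lb = strlen(b);
    size_t shorter = la < lb ? la : lb;
    return strncmp(a, b, shorter) == 0;
}
```

For instance, "0" and "0001" conflict because a decoder that has read a single 0 cannot tell which code it is in, while "01" and "0001" already differ in their second bit and can coexist.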

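We can define a function matchHuffmanString that matches a string against a Huffman-encoded bitstream, with the Huffman code table kept as part of the global state. Initially the code table is empty, and we call matchHuffmanString with the start of the string and the start of the bitstream.

When the function is called, it checks whether enough bits remain in the stream to match the string, and returns if not. (2)

If the string is empty and the bitstream is also empty, there is a match and the code table is output. If the string is empty but the bitstream is not, there is no match, so the function returns. (3)

If characters remain in the string, the first one is read. The function checks whether the code table already has an entry for that character; if so, the same code must appear next in the bitstream. If it does not, there is no match and the function returns (4). If it does, the function calls itself, moving to the next character and past the matching code in the bitstream.

If the character has no code yet, the function considers representing it with a code of every possible length n from 1 to 32 bits (an arbitrary limit). It reads n bits from the bitstream and checks, according to rule (1), whether such a code would conflict with any existing code. If there is no conflict, the code is added to the code table and the function recurses, moving to the next character and past the assumed n-bit code. On return, it backtracks by removing the code from the table.

A simple implementation in C: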

#include <stdio.h>

// Huffman table:

// a 01
// b 0001
// c 1
// d 0010

char* string = "abdcc";

// 01 0001 0010 1 1

// the bits are stored in reverse order (the first bit of the stream is the least significant bit),
// with an extra 0 word as padding so that getBits never reads past the end of the array:
#define MESSAGE_LENGTH  (12)
unsigned int message[] = {0xD22u /* binary 110100100010 */, 0};

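// can handle messages of >32 bits, even though the above message is only 12 bits long
// (assumes 32-bit unsigned int; n must be between 1 and 31)
unsigned int getBits(int start, int n)
{
    unsigned int shift = start & 31;
    unsigned int bits = message[start >> 5] >> shift;
    if(shift != 0) // shifting a 32-bit value by 32 would be undefined behaviour
        bits |= message[(start >> 5) + 1] << (32 - shift);
    return bits & ((1u << n) - 1);
}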

unsigned int codes[26];
int code_lengths[26];
int callCount = 0;

void outputCodes()
{
    // output the codes:
    int i, j;
    for(i = 0; i < 26; i++)
    {
        if(code_lengths[i] != 0)
        {
            printf("%c ", i + 'a');
            for(j = 0; j < code_lengths[i]; j++)
                printf("%s", codes[i] & (1u << j) ? "1" : "0");
            printf("\n");
        }
    }
}

void matchHuffmanString(char* s, int len, int startbit)
{
    callCount++;

    if(len > MESSAGE_LENGTH - startbit)
        return;  // not enough bits left to encode the rest of the message even at 1 bit per char (2)

    if(len == 0) // no more characters to match
    {
        if(startbit == MESSAGE_LENGTH)
        {
            // (3) we exactly used up all the bits, this stream matches.
            printf("match!\n\n");
            outputCodes();
            printf("\nCall count: %d\n", callCount);
        }
        return;
    }

    // read a character from the string (assume 'a' to 'z'):
    int c = s[0] - 'a';

    // is there already a code for this character?
    if(code_lengths[c] != 0)
    {
        // check if the code in the bit stream matches:
        int length = code_lengths[c];
        if(startbit + length > MESSAGE_LENGTH)
            return; // ran out of bits in stream, no match
        unsigned int bits = getBits(startbit, length);
        if(bits != codes[c])
            return; // bits don't match (4)

        matchHuffmanString(s + 1, len - 1, startbit + length);
    }
    else
    {
        // this character doesn't have a code yet, consider every possible length
        int i, j;
        for(i = 1; i < 32; i++)
        {
            // are there enough bits left for a code this long?
            if(startbit + i > MESSAGE_LENGTH)
                continue;

            unsigned int bits = getBits(startbit, i);
            // does this code conflict with an existing code?
            for(j = 0; j < 26; j++)
            {
                if(code_lengths[j] != 0) // check existing codes only
                {
                    // do the two codes match in the first i or code_lengths[j] bits, whichever is shorter?
                    int length = code_lengths[j] < i ? code_lengths[j] : i;
                    if((bits & ((1u << length)-1)) == (codes[j] & ((1u << length)-1)))
                        break; // there's a conflict (1)
                }
            }
            if(j != 26)
                continue; // there was a conflict

            // add the new code to the codes array and recurse:
            codes[c] = bits; code_lengths[c] = i;
            matchHuffmanString(s + 1, len - 1, startbit + i);
            code_lengths[c] = 0; // clear the code (backtracking)
        }
    }
}

int main(void) {
    int i;
    for(i = 0; i < 26; i++)
        code_lengths[i] = 0;

    matchHuffmanString(string, 5, 0);

    return 0;
}

Output:

match!

a 01
b 0001
c 1
d 0010

Call count: 42

Ideone.com Demo

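The code above could be improved by iterating over the string for as long as it encounters characters that already have codes, and recursing only when it finds one that does not. It also only works for lowercase letters a-z with no spaces, and does no validation. I would have to test it to be sure, but I think the problem is tractable even for long strings, because any combinatorial explosion can only occur when a new character with no code in the table is encountered, and even then it is subject to the constraints above.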
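
The iteration idea might look roughly like the sketch below. It is not part of the code above: skip_known and its parameters are hypothetical, codes and bits are written as strings of '0' and '1' for readability, and the text is assumed to contain only 'a' to 'z'.

```c
#include <string.h>

/* Hypothetical helper: known[c - 'a'] is the code assumed so far for c,
   or NULL if c has no code yet.
   Starting at *s_pos / *bit_pos, consumes characters of s that already
   have codes, checking each code against bits (rule 4).
   Returns 1 and updates both positions on success, 0 on a mismatch. */
int skip_known(const char *s, const char *bits, const char *const known[26],
               size_t *s_pos, size_t *bit_pos)
{
    size_t i = *s_pos, b = *bit_pos;
    size_t nbits = strlen(bits);
    while (s[i] != '\0') {
        const char *code = known[s[i] - 'a'];
        if (code == NULL)
            break; /* new character: the caller has to branch here */
        size_t n = strlen(code);
        if (b + n > nbits || strncmp(bits + b, code, n) != 0)
            return 0; /* stream ran out or the code does not match */
        b += n;
        i++;
    }
    *s_pos = i;
    *bit_pos = b;
    return 1;
}
```

A backtracking function would call this first and then recurse only at the position it stops at, so the recursion depth is bounded by the number of distinct characters rather than by the length of the string.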

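【Comments】:

Nice. Rule #2 could probably be tightened a bit: first multiply the number of remaining "known" characters by their code sizes, then bound the size of the unknown characters. For example, if k distinct unknown characters remain, at most 2 of them can be represented with 1 bit, at most 4 with 2 bits, and so on. For this you need to pessimistically assume that the most common remaining unknown character is represented by 1 bit. I doubt this would make a huge difference, though, since I think most of the pruning is already done by your rules #1 and #4.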
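
One way to read this suggestion is as a lower bound on the bits still required, sketched below. The function name min_bits_needed is hypothetical, the text is assumed to contain only 'a' to 'z', and the bound deliberately ignores prefix conflicts between codes, so it is optimistic but safe for pruning.

```c
#include <stdlib.h>

/* qsort comparator: descending order of int. */
static int cmp_desc(const void *x, const void *y)
{
    int a = *(const int *)x, b = *(const int *)y;
    return (a < b) - (a > b);
}

/* Lower bound on the bits needed to encode the remaining text s.
   code_lengths[c] is the length of the code assumed for character c,
   or 0 if c has no code yet. */
int min_bits_needed(const char *s, const int code_lengths[26])
{
    int unknown[26] = {0};
    int total = 0;
    for (; *s != '\0'; s++) {
        int c = *s - 'a';
        if (code_lengths[c] != 0)
            total += code_lengths[c]; /* known: exact cost */
        else
            unknown[c]++;             /* unknown: count occurrences */
    }
    /* most frequent unknown characters get the shortest codes */
    qsort(unknown, 26, sizeof unknown[0], cmp_desc);
    int len = 1, slots = 2; /* 2 codes of 1 bit, 4 of 2 bits, 8 of 3 bits... */
    for (int i = 0; i < 26 && unknown[i] > 0; i++) {
        if (slots == 0) {
            len++;
            slots = 1 << len;
        }
        total += unknown[i] * len;
        slots--;
    }
    return total;
}
```

The search can give up on a branch as soon as this value exceeds the number of bits left in the stream. For the remaining text "abdcc" with only c known at 1 bit, the bound is 6 bits, compared with 5 under the original rule (2).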