In 1953, David A. Huffman published his paper "A Method for the Construction of Minimum-Redundancy Codes", and hence printed his name in the history of computer science. As a professor who gives the final exam problem on Huffman codes, I am encountering a big problem: the Huffman codes are NOT unique. For example, given a string "aaaxuaxz", we can observe that the frequencies of the characters 'a', 'x', 'u' and 'z' are 4, 2, 1 and 1, respectively. We may either encode the symbols as {'a'=0, 'x'=10, 'u'=110, 'z'=111}, or in another way as {'a'=1, 'x'=01, 'u'=001, 'z'=000}, both compress the string into 14 bits. Another set of code can be given as {'a'=0, 'x'=11, 'u'=100, 'z'=101}, but {'a'=0, 'x'=01, 'u'=011, 'z'=001} is NOT correct since "aaaxuaxz" and "aazuaxax" can both be decoded from the code 00001011001001. The students are submitting all kinds of codes, and I need a computer program to help me determine which ones are correct and which ones are not.
Input Specification:
Each input file contains one test case. For each case, the first line gives an integer N (2), then followed by a line that contains all the N distinct characters and their frequencies in the following format:
c[1] f[1] c[2] f[2] ... c[N] f[N]
where c[i] is a character chosen from {'0' - '9', 'a' - 'z', 'A' - 'Z', '_'}, and f[i] is the frequency of c[i] and is an integer no more than 1000. The next line gives a positive integer M (≤), then followed by M student submissions. Each student submission consists of N lines, each in the format:
c[i] code[i]
where c[i] is the i-th character and code[i] is an non-empty string of no more than 63 '0's and '1's.
Output Specification:
For each test case, print in each line either "Yes" if the student's submission is correct, or "No" if not.
Note: The optimal solution is not necessarily generated by Huffman algorithm. Any prefix code with code length being optimal is considered correct.
Sample Input:
7
A 1 B 1 C 1 D 3 E 3 F 6 G 6
4
A 00000
B 00001
C 0001
D 001
E 01
F 10
G 11
A 01010
B 01011
C 0100
D 011
E 10
F 11
G 00
A 000
B 001
C 010
D 011
E 100
F 101
G 110
A 00000
B 00001
C 0001
D 001
E 00
F 10
G 11
Sample Output:
Yes
Yes
No
No
解题思路
要判断编码是否为最优编码,需要对编码进行两个方面的检验:对每一组编码来判断WPL是否为最小的,以及是否为前缀码。
下面给出第一种思路,需要构建Huffman Tree的方法,同时也是用堆来实现的。
树节点的定义如下:
1 struct Data { 2 char letter; 3 int freq; 4 }; 5 6 struct TNode { 7 Data data; 8 TNode *left, *right; 9 };
首先我们要根据给出的字符频率来构造出一颗对应的Huffman Tree,同时计算出WPL。
又因为我们是用堆来实现的,所以要构造一颗Huffman Tree我们先要把给定的频率来构建一个最小堆。
这里我们需要对堆进行定义,同时还要定义堆的相关操作。这里就直接上代码,不解释了。
1 struct Heap { 2 TNode *H; // 堆的每一个元素的数据类型是树节点TNode 3 int size; 4 int capacity; 5 }; 6 7 Heap *createMinHeap(int n) { 8 Heap *minHeap = new Heap; 9 minHeap->size = 0; 10 minHeap->capacity = n; 11 minHeap->H = new TNode[n + 1]; 12 minHeap->H[0].data.freq = -1; 13 14 for (int i = 0; i < minHeap->capacity; i++) { 15 TNode *tmp = new TNode; 16 tmp->left = tmp->right = NULL; 17 getchar(); 18 scanf("%c %d", &tmp->data.letter, &tmp->data.freq); 19 insertHeap(minHeap, tmp); 20 } 21 22 return minHeap; 23 } 24 25 void insertHeap(Heap *minHeap, TNode *treeNode) { 26 int pos = ++minHeap->size; 27 for ( ; treeNode->data.freq < minHeap->H[pos / 2].data.freq; pos /= 2) { 28 minHeap->H[pos] = minHeap->H[pos / 2]; 29 } 30 minHeap->H[pos] = *treeNode; 31 } 32 33 TNode *deleteMin(Heap *minHeap) { 34 TNode *minTreeNode = new TNode; 35 *minTreeNode = minHeap->H[1]; 36 TNode tmp = minHeap->H[minHeap->size--]; 37 38 int parent = 1, child; 39 for ( ; parent * 2 <= minHeap->size; parent = child) { 40 child = parent * 2; 41 if (child != minHeap->size && minHeap->H[child].data.freq > minHeap->H[child + 1].data.freq) child++; 42 if (tmp.data.freq < minHeap->H[child].data.freq) break; 43 else minHeap->H[parent] = minHeap->H[child]; 44 } 45 minHeap->H[parent] = tmp; 46 47 return minTreeNode; 48 }
构建好最小堆后,下一步我们需要通过这个堆来构造出对应的Huffman Tree。
就是每次从堆中弹出最小频率的那两个节点,然后把这两个节点分别插在新节点的左右边,作为左右孩子。再把新节点压入堆中。如此循环n-1次后(其中n代表节点的个数),堆中就只剩下一个元素,那个元素就是Huffman Tree的根节点,我们直接返回即可。
按照上面构造Huffman Tree的思路,相应的代码如下:
1 TNode *createHuffmanTree(Heap *minHeap) { 2 int n = minHeap->size - 1; 3 while (n--) { 4 TNode *tmp = new TNode; 5 tmp->left = deleteMin(minHeap); 6 tmp->right = deleteMin(minHeap); 7 tmp->data.freq = tmp->left->data.freq + tmp->right->data.freq; 8 9 insertHeap(minHeap, tmp); 10 } 11 12 return deleteMin(minHeap); 13 }
然后,我们要根据这颗Huffman Tree来计算WPL。我们用递归来实现计算WPL。
如果这个节点是叶子节点,那么就用当前深度乘以对应的频率,然后返回。如果不是叶子节点,就递归来计算左右子树的WPL,相加后返回。
1 int WPL(TNode *T, int depth) { 2 if (T->left == NULL && T->right == NULL) return depth * T->data.freq; // 叶子节点直接返回结果 3 else return WPL(T->left, depth + 1) + WPL(T->right, depth + 1); // 不是叶子节点,计算左右子树的WPL并返回,同时由于左右子树的深度加深一层,记得depth+1 4 }
好了,折腾了这么久,终于计算出给定频率的WPL了。
接下来我们先对每一组编码来检测其WPL是不是最小的,也就是每组编码的WPL是否与给定频率的WPL相等。
计算方法很简单,每一组编码的WPL计算公式为:
再判断codeLen是否与上面求出的给定频率的WPL相等,如果不相等,就说明这个编码不是最优编码,就不需要再判断是否为前缀码了。如果相等再去判断是否为前缀码。
这里还有个陷阱。首先我们要知道,一个最优编码的长度是不会超过n-1的。所以如果某个编码的长度大于n-1也说明该编码不是最优编码。
这里同时给出计算编码长度和判断是否是前缀码的函数:
1 bool check(TNode *huffmanTree, int n) { 2 int wpl = WPL(huffmanTree, 0); // 计算给定频率构成的Huuffman Tree的WPL 3 4 std::string code[n]; // 存放每一个字符的编码 5 int codeLen = 0; 6 bool ret = true; // 用来标记该组编码是否为最优编码 7 8 for (int i = 0; i < n; i++) { 9 char letter; 10 getchar(); 11 scanf("%c", &letter); 12 getchar(); 13 std::cin >> code[i]; 14 15 if (ret) { // 如果已经知道该组编码不是最优编码就不需要再计算编码长度了,但仍要继续输入 16 if (code[i].size() > n - 1) ret = false; // 如果某个字符的编码长度大于n-1,说明该组编码不是最优编码 17 codeLen += code[i].size() * findFreq(huffmanTree, letter); // 计算编码长度 18 } 19 } 20 21 if (ret && codeLen == wpl) { // 如果ret == true并且编码长度与WPL相同,接着判断该组编码是否为前缀码 22 TNode *T = new TNode; // 为这组编码构造一颗Huffman Tree,初始化Huffman Tree的根节点 23 T->data.freq = 0; 24 T->left = T->right = NULL; 25 26 for (int i = 0; i < n; i++) { // 有n个节点,需要判断n次 27 TNode *pre = T; // 每次判断一个字符都从根节点开始 28 29 for (std::string::iterator it = code[i].begin(); it != code[i].end(); it++) { // 对该字符的每一个编码进行判断 30 if (*it == '0') { // 如果编码是0 31 if (pre->left == NULL) { // 如果当前节点的左子树为空 32 TNode *tmp = new TNode; // 就为当前节点生成一颗左子树 33 tmp->data.freq = 0; // 该节点的频率标记为0,表示该节点还没有字符占用 34 tmp->left = tmp->right = NULL; 35 pre->left = tmp; 36 } 37 pre = pre->left; // pre指针指向左子树 38 } 39 else { // 如果编码是1 40 if (pre->right == NULL) { // 如果当前节点的右子树为空 41 TNode *tmp = new TNode; // 就为当前节点生成一颗右子树 42 tmp->data.freq = 0; // 该节点的频率标记为0,表示该节点还没有字符占用 43 tmp->left = tmp->right = NULL; 44 pre->right = tmp; 45 } 46 pre = pre->right; // pre指针指向左子树 47 } 48 } 49 50 // 读完了字符的编码后,pre指针就指向这个字符应该占用的位置 51 // 这时需要判断pre指向的这个节点是否为叶子节点,并且该节点有没有被其他字符占用 52 if (pre->left == NULL && pre->right == NULL && pre->data.freq == 0) { 53 pre->data.freq = 1; // 如果是叶子节点并且没有被占用,该字符就占用了这个节点,并把这个节点的频率标记为1 54 } 55 else { // 否则,如果这些条件中有一个不满足 56 ret = false; // 就说明该组字符不满足前缀码的要求,ret赋值为false 57 break; // 后面的字符不需要判断了,直接退出退出判断前缀码的循环 58 } 59 } 60 } 61 else { // 如果ret == false并且编码长度不等于WPL,就说明该组编码不是最优编码 62 ret = false; 63 } 64 65 return ret; 66 }
这里是通过构造一颗Huffman Tree来判断该组编码是否符合前缀码。
判断的过程如下:
有一个指向Huffman Tree根节点的指针。
- 如果编码是'0',先判断当前节点的左子树是否存在,如果不存在先生成左子树,再让指针移到左子树的节点。如果存在那么直接让指针移到左子树的节点即可。
- 如果编码是'1',先判断当前节点的右子树是否存在,如果不存在先生成右子树,再让指针移到右子树的节点。如果存在那么直接让指针移到右子树的节点即可。
读完该字符的编码后,那么此时字符应该放入这个指针指向的节点。这个节点要满足两个条件才可以放入:
- 该节点的左右孩子都为空,也就是该节点为叶子节点。
- 该节点每有被标记过,也就是说该节点没有存放其他的字符。
如果有一个条件不满足,就说明该组编码不是前缀码。
最后,给出这种方法的完整AC代码,代码量有点多。
#include <cstdio> #include <iostream> #include <string> struct Data { char letter; int freq; }; struct TNode { Data data; TNode *left, *right; }; struct Heap { TNode *H; int size; int capacity; }; Heap *createMinHeap(int n); void insertHeap(Heap *minHeap, TNode *treeNode); TNode *deleteHeap(Heap *minHeap); TNode *createHuffmanTree(Heap *minHeap); bool check(TNode *huffmanTree, int n); int WPL(TNode *T, int depth); int findFreq(TNode *huffmanTree, char letter); int main() { int n; scanf("%d", &n); Heap *minHeap = createMinHeap(n); TNode *huffmanTree = createHuffmanTree(minHeap); int m; scanf("%d", &m); for (int i = 0; i < m; i++) { bool ret = check(huffmanTree, n); printf("%s\n", ret ? "Yes" : "No"); } return 0; } Heap *createMinHeap(int n) { Heap *minHeap = new Heap; minHeap->size = 0; minHeap->capacity = n; minHeap->H = new TNode[n + 1]; minHeap->H[0].data.freq = -1; for (int i = 0; i < minHeap->capacity; i++) { TNode *tmp = new TNode; tmp->left = tmp->right = NULL; getchar(); scanf("%c %d", &tmp->data.letter, &tmp->data.freq); insertHeap(minHeap, tmp); } return minHeap; } void insertHeap(Heap *minHeap, TNode *treeNode) { int pos = ++minHeap->size; for ( ; treeNode->data.freq < minHeap->H[pos / 2].data.freq; pos /= 2) { minHeap->H[pos] = minHeap->H[pos / 2]; } minHeap->H[pos] = *treeNode; } TNode *deleteHeap(Heap *minHeap) { TNode *minTreeNode = new TNode; *minTreeNode = minHeap->H[1]; TNode tmp = minHeap->H[minHeap->size--]; int parent = 1, child; for ( ; parent * 2 <= minHeap->size; parent = child) { child = parent * 2; if (child != minHeap->size && minHeap->H[child].data.freq > minHeap->H[child + 1].data.freq) child++; if (tmp.data.freq < minHeap->H[child].data.freq) break; else minHeap->H[parent] = minHeap->H[child]; } minHeap->H[parent] = tmp; return minTreeNode; } TNode *createHuffmanTree(Heap *minHeap) { int n = minHeap->size - 1; while (n--) { TNode *tmp = new TNode; tmp->left = deleteHeap(minHeap); tmp->right = deleteHeap(minHeap); tmp->data.freq = tmp->left->data.freq + tmp->right->data.freq; insertHeap(minHeap, tmp); } return deleteHeap(minHeap); } bool check(TNode *huffmanTree, int n) { int wpl = WPL(huffmanTree, 0); std::string code[n]; int codeLen = 0; bool ret = true; for (int i = 0; i < n; i++) { char letter; getchar(); scanf("%c", &letter); getchar(); std::cin >> code[i]; if (ret) { if (code[i].size() > n - 1) ret = false; codeLen += code[i].size() * findFreq(huffmanTree, letter); } } if (ret && codeLen == wpl) { TNode *T = new TNode; T->data.freq = 0; T->left = T->right = NULL; for (int i = 0; i < n; i++) { TNode *pre = T; for (std::string::iterator it = code[i].begin(); it != code[i].end(); it++) { if (*it == '0') { if (pre->left == NULL) { TNode *tmp = new TNode; tmp->data.freq = 0; tmp->left = tmp->right = NULL; pre->left = tmp; } pre = pre->left; } else { if (pre->right == NULL) { TNode *tmp = new TNode; tmp->data.freq = 0; tmp->left = tmp->right = NULL; pre->right = tmp; } pre = pre->right; } } if (pre->left == NULL && pre->right == NULL && pre->data.freq == 0) { pre->data.freq = 1; } else { ret = false; break; } } } else { ret = false; } return ret; } int WPL(TNode *T, int depth) { if (T->left == NULL && T->right == NULL) return depth * T->data.freq; else return WPL(T->left, depth + 1) + WPL(T->right, depth + 1); } int findFreq(TNode *huffmanTree, char letter) { int ret = 0; if (huffmanTree) { if (huffmanTree->left == NULL && huffmanTree->right == NULL && huffmanTree->data.letter == letter) ret = huffmanTree->data.freq; if (ret == 0) ret = findFreq(huffmanTree->left, letter); if (ret == 0) ret = findFreq(huffmanTree->right, letter); } return ret; }