Why is my Huffman Compression making the Compressed output a larger size than my original text?

【Question title】: Why is my Huffman Compression making the Compressed output a larger size than my original text?
【Posted】: 2022-01-12 02:52:35
【Question description】:

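I am trying to write a program in Python that compresses text using Huffman compression. The problem I have is that the compressed text ends up larger than the original text when it is saved to a text file, and I am not sure why. I have implemented a class called HeapNode to help me build a priority queue, and I use heapq to build my binary tree. In the HuffmanCoding class I have implemented methods to get the frequency of each character, create the priority queue, use it to merge the "nodes" into a binary tree, and traverse that tree to build the Huffman code for each character.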



import heapq


class HeapNode:
    def __init__(self, char, freq):
        self.char = char
        self.freq = freq
        self.left = None
        self.right = None

    def __lt__(self, other):  # if the frequency of one character is lower than the frequency of another one
        return self.freq < other.freq

    def __eq__(self, other):  # if two characters have the same frequencies
        if other is None:
            return False
        if not isinstance(other, HeapNode):  # checks if the character is a node or not
            return False
        return self.freq == other.freq


class HuffmanCoding:
    def __init__(self, text_to_compress):
        self.text_to_compress = text_to_compress  # text that will be compressed
        self.heap = []
        self.codes = {}  # will store the Huffman code of each character
        self.decompress_map = {}

    def get_frequency(self):  # method to find the frequency of each character in the text
        frequency_Dictionary = {}  # creates an empty dictionary where frequency of each character will be stored

        for character in self.text_to_compress:  # Iterates through the text to be compressed
            if character in frequency_Dictionary:
                # character already exists in the dictionary: increase its count by 1
                frequency_Dictionary[character] = frequency_Dictionary[character] + 1
            else:
                frequency_Dictionary[character] = 1  # first occurrence: start its count at 1

        return frequency_Dictionary

    def make_queue(self, frequency):  # creates the priority queue of each character and its associated frequency
        for key in frequency:
            node = HeapNode(key, frequency[key])  # create node (character) and store its frequency alongside it
            heapq.heappush(self.heap, node)  # Push the node into the heap

    def merge_nodes(self):
        # creates the Huffman tree by popping the two minimum nodes and merging them,
        # until there is only one node left
        while len(self.heap) > 1:
            node1 = heapq.heappop(self.heap)  # pop node from top of heap
            node2 = heapq.heappop(self.heap)  # pop next node which is now at the top of heap

            merged = HeapNode(None, node1.freq + node2.freq)  # merge the two nodes we popped out from heap
            merged.left = node1
            merged.right = node2

            heapq.heappush(self.heap, merged)  # push merged node back into the heap

    def make_codes(self, root, current_code):  # Creates Huffman code for each character
        if root is None:
            return

        if root.char is not None:
            self.codes[root.char] = current_code
            self.decompress_map[current_code] = root.char

        self.make_codes(root.left, current_code + "0")  # Every time you traverse left, add a 0 - Recursive Call
        self.make_codes(root.right, current_code + "1")  # Every time you traverse right, add a 1 - Recursive Call

    def assignCodes(self):  # Assigns codes to each character
        root = heapq.heappop(self.heap)  # extracts root node from heap
        current_code = ""
        self.make_codes(root, current_code)

    def get_compressed_text(self, text):  # Replaces characters in original text with codes
        compressed_text = ""
        for character in text:
            compressed_text += self.codes[character]
        return compressed_text

    def show_compressed_text(self):

        frequency = self.get_frequency()
        self.make_queue(frequency)
        self.merge_nodes()
        self.assignCodes()

        compressed_text = self.get_compressed_text(self.text_to_compress)
        return compressed_text


print(HuffmanCoding('This sentence will get compressed').show_compressed_text())

【Comments】:

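The message your Huffman encoding produces is a string of '0'/'1' characters, which is wasteful. You need to pack each 8-character chunk of that string into a single byte, and do the reverse when decoding. I'm not a Python person, so I can't say what the most efficient way to do that is.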

Tags: python compression huffman-code

【Solution 1】:

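Your codes are strings containing the ASCII characters "0" and "1". Each character takes up 8 bits, so you are expanding your compressed data by a factor of eight.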
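A quick way to see the factor of eight, as a minimal sketch. The bit string below is made up for illustration and is not the actual output of the question's code:

```python
# Hypothetical 12-bit Huffman output, stored the way the question stores it.
bit_string = "011010011101"

# Writing this str to a text file stores one ASCII character per bit.
as_text = bit_string.encode("ascii")

bits_of_information = len(bit_string)  # 12 bits of actual code
bits_on_disk = len(as_text) * 8        # 12 bytes, i.e. 96 bits

print(bits_of_information, bits_on_disk)
```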

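You need to instead make variable-length binary codes, where a bit takes up one bit of space. You will then need to be able to concatenate those variable-length codes to make a sequence of bytes (a bytes object, written with a b'' literal rather than a "" string literal), and pad the last byte with zero bits as needed. Then you have a sequence of bytes, each containing eight of the bits from your codes. Each byte can have any of the 256 possible byte values.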
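One possible way to do the packing, as a sketch rather than the only approach: take the '0'/'1' string the question already produces, pad it with zero bits to a multiple of eight, and convert each 8-character chunk to one byte. The function name `pack_bits` and the returned padding count are assumptions made for this example; a decoder would need that count (or the original length) to know which trailing bits to ignore.

```python
def pack_bits(bit_string):
    """Pack a string of '0'/'1' characters into a bytes object.

    Returns (packed_bytes, padding), where padding is the number of zero
    bits appended to fill the last byte.
    """
    padding = (8 - len(bit_string) % 8) % 8
    padded = bit_string + "0" * padding
    # int(chunk, 2) parses eight '0'/'1' characters as one integer in 0..255.
    packed = bytes(int(padded[i:i + 8], 2) for i in range(0, len(padded), 8))
    return packed, padding


# 10 bits of code become 2 bytes, with 6 padding bits in the last byte.
packed, padding = pack_bits("0100000101")
print(packed, padding)
```

The result should be written to a file opened in binary mode (`"wb"`), not text mode.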

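You can use the bit operators on integers to construct this, in particular shift (<<, >>), or (|), and and (&). You can convert integers to bytes with bytes(). For example: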

>>> bytes([65,66,67])
b'ABC'
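The shift and or operators can be used to collect bits in an integer and emit a byte every time eight have accumulated. The following is a sketch under assumptions: `pack_codes` is a hypothetical helper, and it takes the codes as '0'/'1' strings only because that is what the question's `self.codes` holds.

```python
def pack_codes(codes):
    """Concatenate variable-length codes into bytes using integer bit operations.

    `codes` is an iterable of '0'/'1' strings, one per encoded character.
    """
    out = bytearray()
    buffer = 0  # bits collected so far, held in an int
    count = 0   # how many bits are currently in `buffer`
    for code in codes:
        for bit in code:
            buffer = (buffer << 1) | int(bit)  # shift left, put new bit in the low position
            count += 1
            if count == 8:                     # a full byte is ready
                out.append(buffer)
                buffer = 0
                count = 0
    if count:
        out.append(buffer << (8 - count))      # left-align the leftover bits, zero-padded
    return bytes(out)


print(pack_codes(["010", "00001", "01"]))
```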

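Also note that you are compressing a very short string, which will not end up smaller even if you write the output correctly as bits, especially once you send the code table along with it. You need to test with much more text, so that the compression can take advantage of the differing frequencies of letters in English.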
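To get a feel for this, the size of the bit-packed payload can be estimated without building the codes: the total coded length equals the sum of the weights of all merged nodes. This is a rough sketch; `huffman_payload_bits` is a hypothetical helper, and the figure of 2 bytes per code-table entry is a crude assumption, not a real file format.

```python
import heapq
from collections import Counter


def huffman_payload_bits(text):
    """Return the number of bits in the Huffman-coded payload for `text`
    (code table not included)."""
    freq = Counter(text)
    if len(freq) == 1:
        return len(text)  # a lone symbol still needs one bit per occurrence
    heap = list(freq.values())
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        merged = heapq.heappop(heap) + heapq.heappop(heap)
        total += merged  # each merge adds one bit to every character below it
        heapq.heappush(heap, merged)
    return total


text = "This sentence will get compressed"
payload_bytes = (huffman_payload_bits(text) + 7) // 8
table_bytes = 2 * len(set(text))  # assumption: about 2 bytes per table entry
print(len(text), payload_bytes, payload_bytes + table_bytes)
```

The question's sentence is 33 characters long with 17 distinct characters, so under this assumption the table alone takes 34 bytes, more than the original text, even though the payload itself is smaller.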

【Discussion】:

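That makes sense; I now realise I was storing each '0' and '1' as an 8-bit ASCII character. Silly question, but why would I use the bit operators on integers? And why does the byte sequence need to start with b''?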
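Python 3 distinguishes between strings of characters (e.g. 'åbc') and sequences of bytes (e.g. b'abc'). They are two different things: strings can contain characters from other languages, such as Chinese, which take more than one byte each when encoded (for example as UTF-8).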
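A small illustration of the difference, using only built-in behaviour (the `\u00e5` escape is just 'å' written unambiguously):

```python
text = "\u00e5bc"              # str: a sequence of Unicode characters ('åbc')
data = text.encode("utf-8")    # bytes: a sequence of integers in 0..255

print(len(text))     # counts characters
print(len(data))     # counts bytes; 'å' takes two bytes in UTF-8
print(list(b"abc"))  # iterating over bytes yields integers
```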
Integers are what the bit operators work on.
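For instance, each element of a bytes object is an integer, so a decoder can pull the individual bits back out with a right shift and a mask. This is a sketch only; `unpack_bits` and its `padding` parameter are hypothetical names, with `padding` being the number of filler bits at the end that should be dropped.

```python
def unpack_bits(data, padding=0):
    """Turn packed bytes back into a '0'/'1' string, dropping `padding`
    trailing filler bits."""
    bits = []
    for byte in data:                    # iterating over bytes yields ints
        for shift in range(7, -1, -1):   # most significant bit first
            bits.append(str((byte >> shift) & 1))
    if padding:
        del bits[-padding:]
    return "".join(bits)


print(unpack_bits(b"A@", padding=6))
```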