Adaptive Huffman Coding

In the "Three Little Pigs," the wolf had to huff and huff to blow those houses down. He should have just adaptive huffed.




Overview

Huffman coding suffers from the fact that the decompressor needs some knowledge of the probabilities of the characters in the compressed file. Not only can this add somewhat to the bits needed to encode the file, but, if this crucial piece of knowledge is unavailable, compressing the file requires two passes: one pass to find the frequency of each character and construct the Huffman tree, and a second pass to actually compress the file. Expanding on the Huffman algorithm, Faller and Gallager, and later Knuth and Vitter, developed a way to perform the Huffman algorithm as a one-pass procedure. (Sayood 55)

While the methods of each of these eminent gentlemen differ slightly, the discrepancies do not affect the basic "adaptive Huffman" algorithm. The differences between the methods will be explored in greater depth later in this document.

The fact that the file is encoded dynamically has significant implications for the effectiveness of the tree as an encoder and decoder. Because a static Huffman tree is constructed from the character counts of the file as a whole, it works effectively at a universal level. Adaptive Huffman coding also works at a universal level, but it is far more effective than static Huffman coding at a local level because the tree is constantly evolving. Say, for example, a file starts out with a run of a character that is not repeated again in the file. In static Huffman coding, that character will sit low in the tree because of its low overall count, thus taking many bits to encode. In adaptive Huffman coding, the character will be inserted at the highest leaf possible, before eventually getting pushed down the tree by higher-frequency characters.

At the heart of the adaptive Huffman algorithm is the sibling property, which states that "each node (except the root) has a sibling and the nodes can be listed in order of non-increasing weight with each node adjacent to its siblings" (Data Compression Library). To maintain this property, we will keep track of each node's order and its weight. The order is simply a numbering system for the nodes, increasing left to right within a level and from the bottom of the tree to the top, so that the root holds the highest order. The weight keeps track of the number of times the value contained in the node has occurred so far in the file; nodes that do not contain values (like the root and other internal nodes) have a weight equal to the sum of their two child nodes. Nodes with higher orders will also have correspondingly higher weights. Below is a declaration of the struct Tree used for the adaptive Huffman algorithm, as well as an example tree.

    struct Tree
    {
        int val;      // 8-bit character contained in this tree node
        int weight;   // number of times val has occurred in the file so far
        int order;    // ordering system used to track the weights

        Tree* left;
        Tree* right;
        Tree* parent;

        Tree(int value, int wei, int num, Tree* l, Tree* r, Tree* p)
            : val(value), weight(wei), order(num),
              left(l), right(r), parent(p)
        {}
        Tree()
            : val(0), weight(0), order(0), left(0), right(0), parent(0)
        {}
    };

Although you do not yet know how the tree was set up, it should be obvious that the orders are assigned in a numbering structure similar to a heap and that the order is determined in reverse in-order. The ordering starts at 512 because there are 256 characters that can be represented with 8 bits and a maximum of 255 internal nodes, with an additional node granted for the NYT node, which will be dealt with later. For this assignment, the numbering is only important in that the ordering should remain positive. More advanced versions of the program may use mathematical principles to prove how many bits are needed and thereby decrease the number of bits needed to recognize a character.
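To make the numbering concrete, here is a minimal initialization sketch in C++, assuming the struct Tree above. The constant names and the -1 sentinel marking the NYT node are illustrative assumptions, not part of any prescribed interface.

    const int MAX_ORDER = 512;   // highest order; the root starts here
    const int NYT_VAL   = -1;    // hypothetical sentinel marking the NYT node

    // The tree begins as a single NYT node with weight 0 that is also the
    // root. Every node added later receives a lower order than this one.
    Tree* root = new Tree(NYT_VAL, 0, MAX_ORDER, 0, 0, 0);
    Tree* nyt  = root;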

This algorithm is called adaptive Huffman coding because the tree is adaptive: it is created simultaneously with either the compressed or uncompressed file as it reads in the other. The tree is dynamic and changes after the reading of every character. Both the compressor and decompressor use the same method to construct the tree and therefore can decode what the other has written.




Tree Manipulation

The tree is manipulated as the file is read to maintain the sibling property described above.


Now to get into the nitty-gritty of the actual manipulation. Every tree contains a root and a NYT (Not Yet Transmitted) node, where the NYT node is the node with the lowest order in the tree. When a character is read in from a file, the tree is first checked to see if it already contains that character. If it doesn't, the NYT node spawns two new nodes: the node to its right is a new node containing the character, and the new left node is the new NYT node. If the character is already in the tree, you simply update the weight of that particular tree node. In some cases, when the node is not the highest-ordered node in its weight class, you will need to swap this node so that it fulfills the property that nodes with higher weight have higher orders. To do this, before you update the node's weight, search the tree for all nodes of equal weight and swap the soon-to-be-updated node with the highest-ordered node of equal weight. Finally, update the weight.
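Here is a rough sketch of the spawning step, assuming the struct Tree and the nyt pointer from the initialization sketch above. The function name is my own, and the new leaf deliberately starts at weight 0 so that the upward update described next can do all of the incrementing.

    // The old NYT becomes an internal node. Its right child holds the new
    // character and its left child becomes the new NYT node; the children
    // take the two orders directly below their parent's.
    Tree* spawnFromNYT(Tree*& nyt, int value)
    {
        nyt->right = new Tree(value,   0, nyt->order - 1, 0, 0, nyt);
        nyt->left  = new Tree(NYT_VAL, 0, nyt->order - 2, 0, 0, nyt);

        Tree* newLeaf = nyt->right;  // created with weight 0; the upward
        nyt = nyt->left;             // update will increment it to 1
        return newLeaf;              // the caller starts the update here
    }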

In both cases, however, inserting a value changes the weight of a leaf, and this change will affect all nodes above it. Therefore, after you insert a node, you must check each parent above it, following the same procedure you followed when updating already-seen values. Check whether the node in question is the highest-ordered node in its weight class prior to updating. If not, swap it with the node that is the highest-ordered, making sure to reassign only the pointers to the two nodes being swapped.
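Here is one way the whole leaf-to-root pass might look in C++, assuming the struct Tree and the root pointer above. The helper names are hypothetical, and the full-tree search shown is the slow-but-simple version; the Implementations section discusses how to speed it up.

    // Hypothetical helper: find the highest-ordered node whose weight is w
    // by searching the whole tree. Slow, but easy to get right.
    Tree* findHighestInWeightClass(Tree* t, int w)
    {
        if (t == 0) return 0;
        Tree* best = (t->weight == w) ? t : 0;
        Tree* kids[2] = { findHighestInWeightClass(t->left,  w),
                          findHighestInWeightClass(t->right, w) };
        for (int i = 0; i < 2; i++)
            if (kids[i] != 0 && (best == 0 || kids[i]->order > best->order))
                best = kids[i];
        return best;
    }

    // Swap two subtrees by re-aiming only the pointers to the two nodes.
    // The orders are swapped as well, because an order belongs to a
    // position in the tree, not to a particular node.
    void swapNodes(Tree* a, Tree* b)
    {
        Tree*& aSlot = (a->parent->left == a) ? a->parent->left
                                              : a->parent->right;
        Tree*& bSlot = (b->parent->left == b) ? b->parent->left
                                              : b->parent->right;
        aSlot = b; bSlot = a;
        Tree* tp = a->parent; a->parent = b->parent; b->parent = tp;
        int   to = a->order;  a->order  = b->order;  b->order  = to;
    }

    // Walk from the changed leaf up to the root, swapping before each
    // increment whenever the node is not already the highest-ordered
    // member of its weight class (see the note below for the exceptions).
    void updateTree(Tree* node)
    {
        while (node != 0)
        {
            Tree* leader = findHighestInWeightClass(root, node->weight);
            if (leader != 0 && leader != node && leader != node->parent)
                swapNodes(node, leader);
            node->weight++;
            node = node->parent;
        }
    }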

NOTE: There are several important checks that need to be in place prior to any swapping being done:
  1. Never swap a node with its own parent. If the highest-ordered node in the weight class is the node's parent, skip the swap and simply update the weight.
  2. The root has no sibling and is never swapped; once you reach the root, update its weight and stop.

Below is a flowchart detailing the tree manipulation process as carried out by both the encoder and decoder:

Figure 2: Tree Manipulation Procedure: Be sure to notice the key verbs here: insert new value, give birth to new nodes, update weight, check if max in weight class, swap, isRoot, move to parent. Not all of these will be functions, but these actions will form the basis of a tree manipulation class for encoding and decoding.
Chart taken in near entirety from Sayood, 57

For help relating this chart to real-world experience, take a look at this example, which depicts the tree manipulations done when adding the string "aardv." Or, for a quick refresher, you can check out this animated version of the process.





Encoding Procedure

Once you have the functions of your tree manipulation working correctly, it is relatively easy to complete the encoding and decoding parts of adaptive Huffman coding. To encode, you simply read through the file to be compressed one character at a time. If you have seen the character before, you write to the output file the root-to-leaf path, with 1 marking a move right and 0 marking a move left, the same as you would in static Huffman coding. If the character is new, write out the root-to-leaf path of the NYT node to alert the decoder that a new character follows. Then write out the new character itself, using nine bits in anticipation of the PSEUDO_EOF. Finally, update the tree by calling the appropriate insert function with the new value. Read through the entire file in this manner, and when you are done, manually write out the final root-to-NYT path followed by the PSEUDO_EOF character. This procedure is outlined below.


Flowchart of the encoding procedure, taken from Sayood, 59.
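For those who prefer code to charts, here is a hedged sketch of the same loop in C++, reusing the tree machinery sketched earlier. The BitWriter type with its writeBit and writeBits methods, the leafFor lookup, and the PSEUDO_EOF value of 256 are all assumptions for illustration, not a prescribed interface.

    #include <cstdio>    // for EOF
    #include <istream>
    #include <string>

    const int PSEUDO_EOF = 256;    // assumed 9-bit end-of-file marker

    // Hypothetical helper: write the root-to-leaf path for a node, with
    // 0 marking a move left and 1 marking a move right.
    void writePath(BitWriter& bout, Tree* leaf)
    {
        std::string path;
        for (Tree* t = leaf; t->parent != 0; t = t->parent)
            path.insert(path.begin(), (t->parent->left == t) ? '0' : '1');
        for (size_t i = 0; i < path.size(); i++)
            bout.writeBit(path[i] - '0');
    }

    void encode(std::istream& in, BitWriter& bout)
    {
        int ch;
        while ((ch = in.get()) != EOF)
        {
            Tree* leaf = leafFor(ch);  // hypothetical lookup; 0 if new
            if (leaf != 0)
            {
                writePath(bout, leaf); // path to the known character...
                updateTree(leaf);      // ...then update the counts
            }
            else
            {
                writePath(bout, nyt);  // NYT path announces a new character
                bout.writeBits(ch, 9); // nine bits, leaving room for EOF
                updateTree(spawnFromNYT(nyt, ch));
            }
        }
        writePath(bout, nyt);          // finish with the root-to-NYT path
        bout.writeBits(PSEUDO_EOF, 9); // followed by the PSEUDO_EOF
    }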




Decoding Procedure

The decoding procedure is very similar to the encoding procedure and should be easy to figure out based on the information in the previous section. To uncompress the compressed file, read it in one bit at a time, traversing the tree as it exists up to that point. Eventually, you will come to a leaf. If that leaf has a character value, write out the eight-bit translation of the character to the uncompressed file and then update the count of that character in the tree, making sure all necessary changes are made to the tree as a whole. If the leaf is the NYT node, read in the next nine bits and write out the eight-bit translation of that character. Then insert the new character in the tree. It is extremely important to remember that the compressor and decompressor, although reading in characters in different manners, construct their trees exactly the same way. At any given point in a file, in either operation, the trees would be identical if compared. Do not confuse this with saying that the compressor and decompressor run simultaneously. They don't. But they do construct the same trees after reading the same information. You might even say that they use the same "Adaptive Huffman Tree class"- but that might be just a tad presumptuous.

Below is a diagram of the decoding procedure:


Flowchart of the decoding procedure, taken from Sayood, 62.
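And here is the matching decode loop, again as a hedged sketch. The BitReader type with its readBit and readBits methods is assumed; everything else reuses the machinery sketched earlier.

    #include <ostream>

    void decode(BitReader& bin, std::ostream& out)
    {
        while (true)
        {
            Tree* t = root;
            while (t->left != 0)             // walk down to a leaf, reading
                t = (bin.readBit() == 0)     // one bit at a time
                        ? t->left : t->right;

            if (t == nyt)                    // NYT: a 9-bit literal follows
            {
                int ch = bin.readBits(9);
                if (ch == PSEUDO_EOF) break; // end of the compressed data
                out.put((char)ch);           // write the 8-bit character
                updateTree(spawnFromNYT(nyt, ch));
            }
            else                             // an already-seen character
            {
                out.put((char)t->val);
                updateTree(t);               // same update as the encoder
            }
        }
    }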




Implementations

There are several methods for implementing the adaptive Huffman algorithm. The FGK algorithm works reasonably well and is easy to implement, but it is far inferior to the Vitter algorithm in minimizing the height of the tree to shorten the time taken to find a root-to-leaf path. The Vitter algorithm keeps the height of the tree to an absolute minimum but is hard to implement so that the program runs at a reasonable speed. Below is a short outline of both methods; keep in mind that the method described in the above sections follows the Vitter pattern.

FGK

The basic difference between FGK and Vitter is that FGK doesn't check the entire tree when updating a weight class or use any numbered ordering scheme. It only checks a node against its right sibling or the right siblings of its parents. This stops the tree from being of minimum length but speeds up the leaf-to-root update process and allows the user to keep track of the path for encoding at the same time. A rough sketch of this check appears after the list below.
  1. Insert element in the tree
  2. Increment weight of node, keep track of path and weight
  3. Check weight against right siblings, swap if necessary
  4. Move to parent
  5. Check weight against parent's children, swap if necessary
  6. Keep track of path from leaf to root so far, keep track of new node's weight
  7. Continue up tree keeping track of path, swapping weights with right siblings if lower
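Below is a rough sketch of the right-sibling comparison the outline above describes, reusing swapNodes from earlier. Treat it as an illustration of the idea only; a complete FGK implementation also handles swaps with nodes farther away in the same weight class.

    // Simplified FGK-style update: increment, compare against the right
    // sibling, swap if this node has become heavier, then move up.
    void fgkUpdate(Tree* node)
    {
        while (node->parent != 0)
        {
            node->weight++;                  // step 2: increment the weight
            Tree* sib = node->parent->right; // right sibling, if we are left
            if (node != sib && sib->weight < node->weight)
                swapNodes(node, sib);        // step 3: swap if necessary
            node = node->parent;             // step 4: move to the parent
        }
        node->weight++;                      // finally, update the root
    }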

Vitter

The Vitter method is the product of Jeffrey Vitter, former chair of Duke's Computer Science department, and is the method that has been described on this page. The difficulty with this method is getting it to work fast, because it is a lengthy procedure to update the tree by examining every node to find those of the same weight class. There are a couple of tricks to get around this. The simplest is to implement the program with trees and then keep a vector of all the elements in the tree, indexed by order. This works because, for any node, all the nodes of the same weight will be at consecutive orders. You can also keep another vector of just the leaf nodes, indexed by the character (int) contained in the node. This is especially helpful for seeing whether a value has already been inserted in the tree. These measures only help so much, though. To achieve the fastest running times from adaptive huff, fast enough to make it comparable to programs like LZW and huff, you need to use a floating tree data structure. The floating tree uses only vectors and maps to implement the algorithm, and it maintains pointers to right children and parents only. The vector implementation of the tree is set up like a heap, with the children of node n at 2n and 2n-1. For more information on this implementation, I would refer you to the outstanding paper by Mr. Vitter, as it is somewhat difficult to condense.
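Here is a sketch of the two auxiliary vectors just described; the names and sizes are my own. Because all nodes of the same weight sit at consecutive orders, the full-tree search from earlier can be replaced with a short upward scan through the order-indexed vector. (Both vectors would also need to be kept current inside swapNodes and spawnFromNYT.)

    #include <vector>

    std::vector<Tree*> byOrder(MAX_ORDER + 1, (Tree*)0); // index: order, 1..512
    std::vector<Tree*> leafOf(256, (Tree*)0);            // index: character

    // Hypothetical faster leader search: every order above this node's is
    // already occupied, so walk upward while the weight stays the same.
    Tree* leaderOf(Tree* node)
    {
        int k = node->order;
        while (k + 1 <= MAX_ORDER && byOrder[k + 1] != 0 &&
               byOrder[k + 1]->weight == node->weight)
            k++;
        return byOrder[k];
    }

With leafOf in place, the encoder's leafFor lookup becomes a single vector access, and checking whether a character is already in the tree is just a test against 0.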


Jonathan Low
Last modified: Wed Jan 2 17:45:33 EST 2008