The basic idea of Huffman coding is to encode symbols that have a higher
probability of appearing with fewer bits of output.
Since each symbol can be represented by a different number of bits, it is
important that no symbol's code is a prefix of another symbol's code.
A code with this property is also called a Prefix Code.
There is a wide range of possible ways to implement Huffman coding.
I will try to cover the most popular and important ones.
Shannon-Fano Coding.
Shannon-Fano coding differs in the way the prefix code is
created:
Sort the symbols by their probabilities of occurrence in the source
stream.
Start the recursion:
Divide the list of sorted symbols into two parts, so that an arbitrary
symbol from the source stream (if it falls in the relevant range) is
equally likely to be found in the left or the right part of the list;
in other words, the total probabilities of the two parts should be as
close to equal as possible.
Create node A of the Shannon-Fano tree.
Recursively create the nodes corresponding to the left and right parts
of the list.
Assign them as the left and right children of node A.
Return node A as the result of the recursive function.
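For illustration, here is a minimal C sketch of this recursion under stated
assumptions: the symbols come pre-sorted by weight, the split point is taken
where the left part's weight first reaches half of the total, and all names
(sf_node, sf_build) are illustrative rather than from any particular
implementation.

    #include <stdlib.h>

    typedef struct sf_node {
        int symbol;                    /* meaningful only for leaves */
        struct sf_node *left, *right;  /* NULL for leaves */
    } sf_node;

    /* symbols[lo..hi] are sorted by weight; weights[] holds their counts */
    static sf_node *sf_build(const int *symbols, const long *weights,
                             int lo, int hi)
    {
        sf_node *node = malloc(sizeof *node);
        if (lo == hi) {                /* a single symbol becomes a leaf */
            node->symbol = symbols[lo];
            node->left = node->right = NULL;
            return node;
        }
        long total = 0, acc = 0;
        for (int i = lo; i <= hi; i++)
            total += weights[i];
        /* split where the left part's weight is closest to half */
        int split = lo;
        for (int i = lo; i < hi; i++) {
            acc += weights[i];
            split = i;
            if (2 * acc >= total)
                break;
        }
        node->symbol = -1;             /* internal node */
        node->left  = sf_build(symbols, weights, lo, split);
        node->right = sf_build(symbols, weights, split + 1, hi);
        return node;
    }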
Static Huffman Coding.
The tree for Static Huffman Coding is proved [1] to
provide an optimal Prefix Code for any given
input stream.
Create a sorted list of nodes corresponding to the probabilities of
each symbol.
The probability of a symbol is assigned as the weight of its
node.
Start of loop:
Find and remove the two nodes with the smallest weights; call these
nodes A and B.
Create a new node with weight[node] = weight[A] +
weight[B].
Assign the left and right children of the new node to be A and B.
Insert the new node back into the sorted list.
Repeat the loop until the list consists of only one last node.
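A minimal C sketch of this loop follows; for brevity it scans a flat array
for the two lightest nodes instead of maintaining a sorted list, and all
names are illustrative:

    #include <stdlib.h>

    typedef struct hnode {
        long weight;
        int symbol;                    /* -1 for internal nodes */
        struct hnode *left, *right;
    } hnode;

    /* nodes[0..n-1] are leaves with weights already filled; returns the root */
    static hnode *huffman_build(hnode **nodes, int n)
    {
        while (n > 1) {
            /* find the indices a, b of the two smallest weights */
            int a = 0, b = 1;
            if (nodes[b]->weight < nodes[a]->weight) { a = 1; b = 0; }
            for (int i = 2; i < n; i++) {
                if (nodes[i]->weight < nodes[a]->weight) { b = a; a = i; }
                else if (nodes[i]->weight < nodes[b]->weight) { b = i; }
            }
            hnode *parent = malloc(sizeof *parent);
            parent->weight = nodes[a]->weight + nodes[b]->weight;
            parent->symbol = -1;
            parent->left  = nodes[a];
            parent->right = nodes[b];
            nodes[a] = parent;         /* the new node takes A's slot */
            nodes[b] = nodes[--n];     /* B's slot is filled by the last node */
        }
        return nodes[0];
    }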
On each pass of the loop that creates the Huffman Tree, it is possible to
decide which child will be the left one and which the right.
Moreover, if some symbols or sums of symbols have equal weights, any of
them may be selected when the minimal-weight node is searched for.
Therefore it is possible to create dozens of different Huffman Trees.
Each tree will be a valid Huffman Tree and can therefore be used for
compression.
Now we have a problem: statistics data must be passed along so that the
Decoder can build exactly the same tree, of the same shape, as the
Encoder did.
It can be proved that a special shape of Huffman Tree, called Canonical,
is entirely defined by the lengths of the codes for each symbol.
For a typical 256-symbol alphabet a code length can never exceed 255
bits, so each length value fits into a single 8-bit byte.
This is in contrast to counts of occurrences or floating-point
probabilities, which can be arbitrarily big or can require arbitrarily
high precision.
The list of code lengths is itself redundant data that can be compressed
too.
In the common case it can be RLE, but more sophisticated compression
schemes can be used as well.
Canonical Huffman Tree:
A Canonical Huffman Tree follows two basic rules:
Shorter codes, when padded with zeros to the right, have a
numerically higher value than longer codes.
Within the same length, code values increase in the alphabetical
order of the symbols.
In order to obtain the lengths of the codes, there is no way other than
building the Huffman tree.
I will be glad to be corrected if this statement of mine is wrong.
Given the lengths of each symbol's code, it is possible to get the actual
codes for Canonical Huffman Coding like this:
Denote by L(n, k) code number k of length n.
k_n is the number of symbols coded by n bits, so the codes of length n
are L(n, 0) through L(n, k_n - 1).
Let m be the length of the longest code.

    L(m, 0) = 0  (m zero bits)
    L(m, 1) = L(m, 0) + 1
    ...
    L(m, k_m - 1) = L(m, 0) + k_m - 1

    L(m-1, 0) = (L(m, k_m - 1) >> 1) + 1
    ...
    L(m-1, k_{m-1} - 1) = L(m-1, 0) + k_{m-1} - 1

    ...

    L(1, 0) = (L(2, k_2 - 1) >> 1) + 1
    ...
    L(1, k_1 - 1) = L(1, 0) + k_1 - 1

That is, within one length the codes are consecutive values, and the first
code of each shorter length is the last code of the previous length shifted
right by one bit, plus one.
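Here is a small C sketch of this recurrence under stated assumptions:
count[l] holds k_l, first[] receives the L(l, 0) values, and the array
names are illustrative:

    #define MAX_LEN 32

    /* count[l] = number of symbols with code length l (k_l above)
       first[l] = L(l, 0), the first canonical code of length l */
    void canonical_first_codes(const int count[MAX_LEN + 1],
                               unsigned first[MAX_LEN + 1], int m)
    {
        first[m] = 0;                   /* L(m, 0) = 0 */
        for (int l = m - 1; l >= 1; l--) {
            /* the last code of length l+1 is first[l+1] + count[l+1] - 1;
               levels with no codes fall through the recurrence naturally */
            unsigned last = first[l + 1] + count[l + 1] - 1;
            first[l] = (last >> 1) + 1;
        }
    }

The code for the k-th symbol of length l is then simply first[l] + k.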
Encoding Static Huffman Code.
Regardless of how the Huffman Codes are calculated, there is a common
scheme for encoding a Static Huffman Code.
Build the Huffman Tree and calculate the codes.
Start the encoding loop here.
Read the next input symbol s.
Find or calculate the code for symbol s: code[s].
Output code[s].
Continue the encoding loop.
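A minimal sketch of the output step in C, assuming the code[] and
code_len[] tables computed earlier and a put_bit() output routine (both
assumptions, not a fixed API):

    extern void put_bit(int bit);      /* assumed: appends one bit to output */

    /* emit the code for symbol s, most significant bit first */
    void encode_symbol(int s, const unsigned code[], const int code_len[])
    {
        for (int i = code_len[s] - 1; i >= 0; i--)
            put_bit((code[s] >> i) & 1);
    }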
Decoding Static Huffman Code.
Decoding is symmetric to encoding; nevertheless, here it is:
Build the Huffman Tree and/or calculate the codes.
Start the decoding loop here.
Find the code corresponding to the given bit stream,
starting from the current position.
Suppose the found code corresponds to symbol s, and the length of
code[s] is l bits.
Output symbol s.
Advance the current position in the input stream by l bits.
Continue the decoding loop.
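One simple way to implement the lookup is to walk the Huffman tree bit by
bit; the sketch below reuses the hnode type from the tree-building sketch
and assumes a get_bit() input routine:

    extern int get_bit(void);          /* assumed: returns the next input bit */

    int decode_symbol(const hnode *root)
    {
        const hnode *n = root;
        while (n->left != NULL)        /* internal nodes always have two children */
            n = get_bit() ? n->right : n->left;
        return n->symbol;              /* reaching a leaf consumes exactly l bits */
    }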
Adaptive Huffman Coding.
The main disadvantage of Static Huffman Coding is the need to carry
statistics information together with the encoded stream of symbols.
The adaptive scheme allows us to avoid transmitting statistics data.
Statistics information is gathered in the same pass, and the Huffman tree
is updated accordingly.
This way both Encoder and Decoder build the Huffman tree while reading the
stream of data.
The advantages of Adaptive Huffman Coding, as always with adaptive
codings, come at the cost of optimality and therefore at the cost of
compression ratio.
This algorithm was discovered by Gallager and improved by Knuth [5].
Start with a fixed Huffman Tree. Two initial variants are common:
A tree with a single special node O.
This variant is preferred by most authors for simplicity and
efficiency.
The weight of node O is subject to discussion; some implementations
use the value zero so that O is encoded with the longest
code.
A full balanced tree, providing equal probability for each
symbol.
This variant is thought to give less compression, but I did not find a
well-founded proof of whether that is right; the opposite can be true.
For sure, this variant requires more wasted memory and CPU time
for reordering the tree.
Adaptive Huffman Coding requires additional properties of the
tree nodes (a sketch of a suitable node record follows this list):
All nodes are indexed.
Indexes are sorted by node weights.
The first index is for the leaf node of the symbol with the longest
code.
The last index is for the root node.
The nodes of each tree level are sorted by weight.
The tree is a valid Huffman tree with all the corresponding
properties.
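A minimal sketch of a node record carrying this extra bookkeeping; the
name anode and the exact fields are illustrative:

    typedef struct anode {
        long weight;
        int index;                     /* position in the sorted index list */
        int symbol;                    /* symbol for leaves, or a flag for node O */
        struct anode *parent, *left, *right;
    } anode;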
Start of main loop:
Read the next input symbol A.
Check if A already exists in the tree (if we started with the
balanced tree, it always exists).
If A is not found in the tree:
Output the code for the pseudo symbol O.
Output the uncompressed code for symbol A.
Create a tree node for symbol A with a weight corresponding to
zero occurrences.
Create a new node P.
Set weight(P) = weight(A) + weight(O).
Place node P where node O was, including the parent pointer and
the place in the index list.
Assign the left and right children of node P to be nodes O and A;
decide which node is left and which is right so as to keep the
nodes sorted.
Update the list of indexes to include the new nodes O and A.
Else:
Output the current code for node A.
Update the Adaptive Huffman Tree:
Start of loop:
Increase the weight of node A.
If the weight of A is now bigger than that of the next node by index:
Find the first node N with weight(N) >
weight(A).
Denote by B the predecessor of N in the index list.
Exchange nodes A and B, including their places in the list of
indexes and their parent nodes, but not their children.
Set A to its parent, or NULL if it does not exist.
Continue the loop while A is not NULL.
Continue the main loop while input symbols are available.
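A condensed sketch of the update loop in C, reusing the anode record
above; next_by_index(), prev_by_index() and swap_nodes() are assumed
helpers, the last one exchanging two nodes' tree positions and index-list
places while leaving their children attached:

    extern anode *next_by_index(anode *n);
    extern anode *prev_by_index(anode *n);
    extern void swap_nodes(anode *a, anode *b);

    /* propagate a weight increase from node a up to the root,
       restoring the sorted-by-weight index order on the way */
    void update_tree(anode *a)
    {
        while (a != NULL) {
            a->weight++;
            anode *next = next_by_index(a);
            if (next != NULL && a->weight > next->weight) {
                /* find the first node N (by index) with weight(N) > weight(A);
                   the root's total weight bounds this walk */
                anode *n = next;
                while (n->weight <= a->weight)
                    n = next_by_index(n);
                /* B is N's predecessor; a full implementation must also
                   avoid swapping A with its own parent */
                anode *b = prev_by_index(n);
                if (b != a)
                    swap_nodes(a, b);
            }
            a = a->parent;             /* NULL above the root ends the loop */
        }
    }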
Decoding Adaptive Huffman Code.
The decoding process is generally symmetric to encoding.
Start with a fixed Huffman Tree.
The tree must have the same special properties as defined for
encoding, and its initialization must match the one used for
encoding.
Start the decoding loop.
Find the code corresponding to the bit stream starting from the
current input position.
Denote it code[s], of length l.
If s is the special symbol O:
Read a single uncoded symbol A.
Output symbol A.
Create a tree node for symbol A with a weight
corresponding to zero occurrences.
Create a new node P.
Set weight(P) = weight(A) + weight(O).
Place node P where node O was, including the parent
pointer and the place in the index list.
Assign the left and right children of node P to be nodes
O and A; decide which node is left and which is right so as
to keep the nodes sorted.
Update the list of indexes to include the new nodes O and
A.
Else:
Output the found symbol s.
Assign A to be the tree node of the same symbol s.
Update the Adaptive Huffman Tree exactly as it was done
for encoding.
Continue the decoding loop.
Once the Huffman Tree is built or updated, it is time to start the
actual encoding or decoding.
But first it is useful to decide how the code for a symbol will be
found: create a CodeBook with the codes listed.
CodeBook for Encoding.
It is possible to maintain an array, Hash-Map, list, tree, Heap, or
whatever structure holds codes ready for use for each symbol. The key is
a symbol.
CodeBook for Decoding.
For decoding there is a need for something more sophisticated.
The key must be a variable-length code.
It can be one of the previously mentioned storage structures, but with a
special comparison function like this:

    int compare_stream_to_code(byte *code, int code_bits_size,
                               byte *stream, int stream_bits_offset,
                               int stream_bits_size);
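One possible implementation, assuming bits are packed most significant bit
first within each byte; it returns 0 when the next code_bits_size bits of
the stream match the code exactly:

    typedef unsigned char byte;

    static int get_bit_at(const byte *buf, int bit_pos)
    {
        return (buf[bit_pos >> 3] >> (7 - (bit_pos & 7))) & 1;
    }

    int compare_stream_to_code(byte *code, int code_bits_size,
                               byte *stream, int stream_bits_offset,
                               int stream_bits_size)
    {
        if (stream_bits_offset + code_bits_size > stream_bits_size)
            return -1;                 /* not enough bits left in the stream */
        for (int i = 0; i < code_bits_size; i++) {
            int c = get_bit_at(code, i);
            int s = get_bit_at(stream, stream_bits_offset + i);
            if (c != s)
                return c - s;          /* first differing bit decides the order */
        }
        return 0;                      /* the code matches the stream here */
    }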
offs[i] is an index into the symb[] array: symb[offs[i]] is the first
symbol with code length i, that is, the symbol whose code is L(i, 0).
Likewise, base[i] is the first (numerically smallest) code of length i,
that is, L(i, 0) itself.
Start of the loop:

    int code = next_input_bit();
    int length = 1;
    while (code < base[length]) {      /* too small: a prefix of a longer code */
        code = (code << 1) | next_input_bit();
        length = length + 1;
    }
    symbol = symb[offs[length] + code - base[length]];

Continue the loop for the next symbol.
This code uses the special feature of the Canonical Huffman Tree: the
actual codes can be calculated rather than explicitly passed along with
the data.
The main problem with a CodeBook is deciding how to keep the code bits.
The maximal length of a code is N-1, where N is the size of the alphabet.
In the case of a typical alphabet with 256 symbols, the maximal symbol
code length is 255 bits.
Therefore we need 32 bytes (255 bits rounded up to whole bytes) to keep a
single symbol's code.
Multilevel Structure of CodeBook.
We can keep the CodeBook in a special Multilevel
Structure that can be described like this:
On the first level we look up the first, say up to 8, bits of the
code.
On the second level we look up the next 8 bits, and thus handle codes
up to 16 bits long overall.
The first part of the code is taken from the first-level node, and the
last part is taken from the current-level node.
The deeper levels are not likely to ever appear with the 256 symbols of
an 8-bit alphabet.
But nobody guarantees that longer codes do not exist: they do, and they
must be handled.
In this structure each node on each level must be marked as containing
either the ending of a code or a middle part. A node with an ending must
keep the length of the last part of the code it contains.
For an 8-bit alphabet I would propose keeping the first level of the
CodeBook in a balanced tree, a Hash Table, or something similar for fast
search. On the other hand, a search among tens of items can be linear
if time is not the main issue.
The second and further levels are more likely to be kept in simple linked
lists or the like.
For a list of items that is too short, the overhead of keeping a complex
structure can eat any possible benefits.
In the case of encoding with a multilevel CodeBook structure, the only
benefit we gain is a minimal amount of memory that is still able to keep
codes of arbitrary lengths.
The key for searching is the symbol, and it is of known length.
In order to construct a code from a symbol and the multilevel structure,
the CodeBook structure has to be enhanced with a "parent" pointer for each
node.
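A sketch of one possible node layout for this multilevel CodeBook; the
field names and the per-level width are illustrative choices, not a fixed
format:

    #define LEVEL_BITS 8               /* bits consumed per level */

    typedef struct cb_node {
        unsigned char chunk;           /* the next LEVEL_BITS bits of the code */
        int is_ending;                 /* 1 if a code ends in this node */
        int last_bits;                 /* length of the final code part (ending only) */
        int symbol;                    /* decoded symbol (ending only) */
        struct cb_node *next_level;    /* deeper nodes for longer codes */
        struct cb_node *next;          /* sibling in this level's list */
        struct cb_node *parent;        /* for encoding: walk back to rebuild the code */
    } cb_node;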
There is a corner case: when we select the maximal width of a level to be
the biggest possible, we do not need the multilevel approach at all; we
have one single level.
The previously mentioned decoding scheme
with the Canonical Huffman Tree is rather intended for this case.
The other far corner of the same idea is the Huffman Tree itself:
in this case we have a single bit per level,
and not 8 as I suggested for a 256-symbol alphabet.
And of course, I don't believe anybody will suggest a
B-Tree or Hash-Table for a maximum of 2 items :)
Limitation of Maximal Code Length.
The opposite of maintaining structures with full Huffman Codes is
to limit the code length so that a code can be kept as an integer of
limited value.
This can be achieved at the price of some loss of compression ratio.
But the possible simplicity of the structures and the speed of
calculations can be good reasons to go this way.
The optimal Huffman Tree limited to some constant height is promised to
be found by an algorithm called "The Warm-up Algorithm" [4].
I will update this document when a proper description, digested to my
understanding, becomes available.
Thanks to Simakov Alexander [3] I know a somewhat less
effective but simple variant.
Denote by L' the maximal allowed length of a Huffman code.
Denote by L the current maximal length of a Huffman code, and therefore
the height of the Huffman Tree.
Denote by RN(i) the rightmost leaf node on level i, and by LN(i) the
leftmost leaf node on level i.
Our task arises when L' < L, so let us start from level
L.
Start the main loop here.
Place RN(L) in place of its parent. Denote by S the sibling node of
RN(L).
Find a level j that has leaf nodes, with j < L-1.
Find LN(j).
Replace LN(j) with a new node P.
Make S and LN(j) the children of node P.
Repeat until the required height of the Huffman Tree is reached.
Notice that the loop removes leaves from the bottom level
first and then continues to the next level; a condensed sketch follows
below.
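A condensed C sketch of this loop under stated assumptions: the helpers
below (rightmost_leaf, leftmost_leaf, sibling, replace_with, make_internal,
deepest_leaf_level) are hypothetical and stand for ordinary tree surgery on
the hnode type from earlier:

    extern hnode *rightmost_leaf(int level);
    extern hnode *leftmost_leaf(int level);   /* NULL if the level has no leaves */
    extern hnode *sibling(hnode *n);
    extern void replace_with(hnode *old_node, hnode *new_node);
    extern hnode *make_internal(hnode *left, hnode *right);
    extern int deepest_leaf_level(void);

    void limit_height(int L, int L_limit)     /* L_limit plays the role of L' */
    {
        while (L > L_limit) {
            hnode *r = rightmost_leaf(L);
            hnode *s = sibling(r);
            replace_with(r->parent, r);        /* RN(L) takes its parent's place */
            int j = L - 2;
            while (leftmost_leaf(j) == NULL)   /* find a level j < L-1 with leaves */
                j--;
            hnode *ln = leftmost_leaf(j);
            hnode *p = make_internal(s, ln);   /* S and LN(j) become P's children */
            replace_with(ln, p);               /* P takes LN(j)'s old place */
            L = deepest_leaf_level();          /* the bottom level empties first */
        }
    }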
General notes:
The list of probabilities does not have to contain symbols that are
not in use.
Instead of probabilities, it is more practical to keep counts of
occurrences of each symbol in the source stream.
The only operation performed on a probability is the comparison of two
symbols' probabilities.
This simple operation can be done on the symbol counters without
calculating the actual probabilities.