Author: Arkadi Kagan
arkadi_kagan@hotmail.com
Document: Entropy Compression Methods.

Huffman coding.

The basic idea of Huffman coding is to encode symbols that have a higher probability of appearing with fewer bits of output. Since each symbol can be represented by a different number of bits, it is important that no symbol code is a prefix of any other symbol code. A code with this property is also called a Prefix Code.
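As a small illustration (my own example, not taken from the original sources), consider four symbols where A is the most frequent:

A -> 0
B -> 10
C -> 110
D -> 111

No code here is a prefix of another, so a bit stream like 010110111 splits unambiguously into 0|10|110|111 = A B C D.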

There is a wide range of possible ways to implement Huffman coding. I will try to gather the most popular and important ones.

Once the Huffman Tree is built or updated, it is time to start the actual encoding or decoding. But first it is useful to decide how the code for a symbol can be found.

Create a CodeBook with the codes listed.

CodeBook for Encoding.
It is possible to maintain an array, a Hash-Map, a list, a tree, a Heap or whatever structure holds the codes ready for use for each symbol. The key is the symbol.
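A minimal sketch of such an encoding CodeBook in C (the layout and names are mine, for illustration only):

typedef unsigned char byte;

/* One entry per symbol; the symbol value itself is the array index. */
typedef struct
{
    byte code[32];   /* up to 255 bits of code for a 256-symbol alphabet */
    int length;      /* how many bits of code[] are valid */
} EncodeEntry;

EncodeEntry encode_book[256];   /* encode_book[symbol] gives the code */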
CodeBook for Decoding.
For decoding, something more sophisticated is needed: the key must be a variable-length code. It can be one of the previously mentioned storage structures, but with a special comparison function like this:
int compare_stream_to_code(byte *code, int code_bits_size, byte *stream, int stream_bits_offset, int stream_bits_size);
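One possible implementation of this comparison (a sketch; it assumes bits are packed most-significant-bit first) could be:

typedef unsigned char byte;

/* Returns 0 if the next code_bits_size bits of the stream, starting at
   stream_bits_offset, match the given code; -1 otherwise. */
int compare_stream_to_code(byte *code, int code_bits_size,
                           byte *stream, int stream_bits_offset, int stream_bits_size)
{
    int i;
    if (code_bits_size > stream_bits_size - stream_bits_offset)
        return -1;   /* not enough bits left in the stream */
    for (i = 0; i < code_bits_size; i++)
    {
        int code_bit = (code[i / 8] >> (7 - (i % 8))) & 1;
        int pos = stream_bits_offset + i;
        int stream_bit = (stream[pos / 8] >> (7 - (pos % 8))) & 1;
        if (code_bit != stream_bit)
            return -1;   /* mismatch */
    }
    return 0;   /* match */
}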

One variant of a CodeBook for decoding with a Canonical Huffman Tree could look like the sketch below. It uses the special feature of the Canonical Huffman Tree: the actual codes can be calculated from the code lengths alone and need not be passed explicitly along with the data.
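A minimal sketch, assuming the standard canonical tables derived purely from the code lengths; all names here are mine, for illustration:

/* count[l]      - number of symbols whose code is l bits long
   first_code[l] - numeric value of the smallest l-bit code
   first_sym[l]  - index into symbols[] of that smallest code
   symbols[]     - symbols sorted by (code length, code value)   */
int decode_canonical(const int *count, const int *first_code,
                     const int *first_sym, const unsigned char *symbols,
                     int max_length, int (*read_bit)(void))
{
    int code = 0, length = 0;
    while (length < max_length)
    {
        code = (code << 1) | read_bit();   /* append the next stream bit */
        length++;
        /* If 'code' falls inside the range used by codes of this
           length, we have found a complete codeword. */
        if (count[length] > 0 &&
            code >= first_code[length] &&
            code <  first_code[length] + count[length])
            return symbols[first_sym[length] + (code - first_code[length])];
    }
    return -1;   /* no code matched: the stream is corrupt */
}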

The main problem with a CodeBook is the decision how to keep the code bits. The maximal length of a code is N-1, where N is the size of the alphabet. In the case of a typical alphabet with 256 symbols, the maximal symbol code length is 255 bits. Therefore we need ceil(255/8) = 32 bytes to keep a single symbol code.

Multilevel Structure of CodeBook.

We can keep the CodeBook in a special Multilevel Structure, sketched below. In this structure each node on each level must be marked as containing either the ending of a code or a middle part of it. A node with an ending must also keep the length of the last part of the code it contains.
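One possible node layout for this multilevel structure (the field names are my own, for illustration):

/* Each level consumes up to 8 bits of a code; codes longer than
   8 bits continue on the next level. */
typedef struct CodeNode
{
    unsigned char bits;        /* the (up to 8) code bits kept at this level */
    int is_ending;             /* 1 if a code ends at this node, 0 if this is
                                  a middle part of longer codes */
    int last_part_length;      /* used when is_ending: number of valid bits
                                  in the last part of the code */
    struct CodeNode *next_level;   /* continuation of longer codes */
    struct CodeNode *next;         /* next sibling on the same level */
} CodeNode;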
For an 8-bit alphabet I would propose keeping the first level of the CodeBook in a balanced tree, a Hash Table or a similar structure for fast search. On the other hand, a search among tens of items can be linear if time is not the main issue.
The second and following levels are better kept in simple linked lists or the like: for a list that is too short, the overhead of maintaining a complex structure can eat any possible benefit.

In the case of encoding with a multilevel CodeBook, the only benefit we gain is minimal memory use while still being able to keep codes of arbitrary length. The search key is a symbol, which has a known length. In order to reconstruct a code from a symbol, the structure of the CodeBook has to be enhanced with a "parent" pointer in each node.
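Such a reconstruction could look like this (a sketch, assuming the CodeNode layout above extended with a parent pointer):

/* Reconstruct the full code of a symbol from its ending node by
   walking the parent pointers, then emitting the parts from the
   first level down.  out[] must be zeroed by the caller; the code
   is written MSB-first and its length in bits is returned. */
int build_code(const CodeNode *ending, unsigned char *out)
{
    const CodeNode *chain[32];   /* 255-bit codes need at most 32 levels */
    const CodeNode *n;
    int depth = 0, pos = 0, i, b;

    for (n = ending; n != NULL; n = n->parent)
        chain[depth++] = n;            /* from the ending up to the root */

    for (i = depth - 1; i >= 0; i--)   /* now from the root down */
    {
        int nbits = chain[i]->is_ending ? chain[i]->last_part_length : 8;
        for (b = 0; b < nbits; b++, pos++)
            if ((chain[i]->bits >> (nbits - 1 - b)) & 1)
                out[pos / 8] |= (unsigned char)(0x80 >> (pos % 8));
    }
    return pos;   /* total code length in bits */
}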

There is a corner case when we select the width of a level to be the biggest possible, i.e. the full code length. In this case we do not need the multilevel approach at all: we have one single level. The previously mentioned decoding scheme with the Canonical Huffman Tree is intended mainly for this case.

The other far corner of the same idea is the Huffman Tree itself.
In this case we have a single bit per level, not 8 as I suggested for the 256-symbol alphabet. And of course, I don't believe anybody will suggest a B-Tree or a Hash Table for a maximum of 2 items :)
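With one bit per level, decoding degenerates into a plain walk down the tree (a minimal sketch, assuming a conventional node layout; the names are mine):

typedef struct TreeNode
{
    int symbol;                   /* valid only at a leaf */
    struct TreeNode *child[2];    /* child[0] for bit 0, child[1] for bit 1 */
} TreeNode;

/* Walk from the root, one bit per level, until a leaf is reached. */
int decode_with_tree(const TreeNode *root, int (*read_bit)(void))
{
    const TreeNode *node = root;
    while (node->child[0] != NULL)   /* internal nodes have both children */
        node = node->child[read_bit()];
    return node->symbol;
}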

Limitation of Maximal Code Length.


The opposite of maintaining structures with full Huffman codes is to limit the code length so that a code fits into an integer of limited size. This can be achieved at some loss of compression ratio, but the resulting simplicity of structures and speed of calculation can be good reasons to go this way.

The optimal Huffman Tree limited to some constant height is promised to be found by an algorithm called "The Warm-up Algorithm"
[4]. I will update this document when a proper description, digested to my understanding, becomes available.

Thanks to Simakov Alexander [3], I know a somewhat less effective but simple variant.
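Since [3] is available only in Russian, here is a rough sketch of the general idea behind such simple variants; this is my own illustration and not necessarily the exact algorithm from [3]. Over-long codes are clamped to the limit, and then shorter codes are lengthened until the Kraft inequality (the sum of 2^-length over all symbols is at most 1) holds again:

/* lengths[] holds the optimal Huffman code lengths of the used symbols
   (all >= 1); limit is the allowed maximum, assumed >= log2(n). */
void limit_code_lengths(int *lengths, int n, int limit)
{
    long long kraft = 0;
    long long one = 1LL << limit;    /* all sums are scaled by 2^limit */
    int i;

    for (i = 0; i < n; i++)
    {
        if (lengths[i] > limit)
            lengths[i] = limit;      /* clamp codes that are too long */
        kraft += one >> lengths[i];  /* contribution 2^(limit - length) */
    }
    /* The clamping may over-subscribe the code space.  Lengthening the
       deepest non-maximal code frees half of its contribution. */
    while (kraft > one)
    {
        int deepest = -1;
        for (i = 0; i < n; i++)
            if (lengths[i] < limit &&
                (deepest < 0 || lengths[i] > lengths[deepest]))
                deepest = i;
        kraft -= one >> (lengths[deepest] + 1);  /* old minus new share */
        lengths[deepest]++;
    }
}

The resulting lengths still describe a valid prefix code (a canonical code can be assigned from the lengths alone), but unlike the Warm-up Algorithm the result is not guaranteed to be optimal for the given limit.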

General notes:


References:

[1] A Method for the Construction of Minimum-Redundancy Codes.
David A. Huffman.
http://compression.graphicon.ru/download/articles/huff/huffman_1952_minimum-redundancy-codes.pdf

[2] Data Compression
Debra A. Lelewer and Daniel S. Hirschberg
http://www.ics.uci.edu/~dan/pubs/DataCompression.html

[3] Huffman Code. (Russian edition)
Simakov Alexander.
http://www.compression.ru/download/articles/huff/simakov_2002_huffcode.html

[4] The Warm-up Algorithm: A Lagrangean Construction of length restricted Huffman Codes.
Ruy Luis Milidiu and Eduardo Sany Laber. Departamento de Informatica, Puc-Rio.
http://www-di.inf.puc-rio.br/~laber/public/lagrange.ps

[5] Introduction to Information Theory and Data Compression.
Darrel Hankerson, Greg A. Harris and Peter D. Johnson, Jr.
http://www.dms.auburn.edu/compression/