The basic idea of Huffman coding is to encode symbols that have a higher
probability of appearing with fewer bits of output.
Since each symbol can be represented by a different number of bits, it is
important that no symbol's code is a prefix of another symbol's code.
A code with this property is also called a Prefix Code.
There is a wide range of possible ways to implement Huffman coding.
I will try to cover the most popular and important ones.
Shannon-Fano Coding.
Shannon-Fano coding differs in the way the prefix code is
created:
Sort the symbols by their probabilities of occurrence in the source
stream.
Start the recursion:
Divide the list of sorted symbols into two parts, so that an arbitrary
symbol from the source stream (if it falls in the relevant range) is
equally likely to be found in the left or the right part of the list;
in other words, the total probabilities of the two parts should be as
close to equal as possible.
Create node A of the Shannon-Fano tree.
Recursively create the nodes corresponding to the left and right parts
of the list.
Assign them as the left and right children of node A.
Return node A as the result of the recursive function.
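For illustration, here is a minimal C sketch of this recursion under stated
assumptions: the symbols come pre-sorted by weight, the split point is taken
where the left part's weight first reaches half of the total, and all names
(sf_node, sf_build) are illustrative rather than from any particular
implementation.

    #include <stdlib.h>

    typedef struct sf_node {
        int symbol;                    /* meaningful only for leaves */
        struct sf_node *left, *right;  /* NULL for leaves */
    } sf_node;

    /* symbols[lo..hi] are sorted by weight; weights[] holds their counts */
    static sf_node *sf_build(const int *symbols, const long *weights,
                             int lo, int hi)
    {
        sf_node *node = malloc(sizeof *node);
        if (lo == hi) {                /* a single symbol becomes a leaf */
            node->symbol = symbols[lo];
            node->left = node->right = NULL;
            return node;
        }
        long total = 0, acc = 0;
        for (int i = lo; i <= hi; i++)
            total += weights[i];
        /* split where the left part's weight is closest to half */
        int split = lo;
        for (int i = lo; i < hi; i++) {
            acc += weights[i];
            split = i;
            if (2 * acc >= total)
                break;
        }
        node->symbol = -1;             /* internal node */
        node->left  = sf_build(symbols, weights, lo, split);
        node->right = sf_build(symbols, weights, split + 1, hi);
        return node;
    }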
Static Huffman Coding.
The tree for Static Huffman Coding is proved [1] to
provide an optimal Prefix Code for any given
input stream.
Create a sorted list of nodes corresponding to the probabilities of
each symbol.
The probability of a symbol is assigned as the weight of its
node.
Start of loop:
Find and remove the two nodes with the smallest weights; call these
nodes A and B.
Create a new node with weight[node] = weight[A] +
weight[B].
Assign the left and right children of the new node to be A and B.
Insert the new node back into the sorted list.
Repeat the loop until the list consists of only one last node.
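A minimal C sketch of this loop follows; for brevity it scans a flat array
for the two lightest nodes instead of maintaining a sorted list, and all
names are illustrative:

    #include <stdlib.h>

    typedef struct hnode {
        long weight;
        int symbol;                    /* -1 for internal nodes */
        struct hnode *left, *right;
    } hnode;

    /* nodes[0..n-1] are leaves with weights already filled; returns the root */
    static hnode *huffman_build(hnode **nodes, int n)
    {
        while (n > 1) {
            /* find the indices a, b of the two smallest weights */
            int a = 0, b = 1;
            if (nodes[b]->weight < nodes[a]->weight) { a = 1; b = 0; }
            for (int i = 2; i < n; i++) {
                if (nodes[i]->weight < nodes[a]->weight) { b = a; a = i; }
                else if (nodes[i]->weight < nodes[b]->weight) { b = i; }
            }
            hnode *parent = malloc(sizeof *parent);
            parent->weight = nodes[a]->weight + nodes[b]->weight;
            parent->symbol = -1;
            parent->left  = nodes[a];
            parent->right = nodes[b];
            nodes[a] = parent;         /* the new node takes A's slot */
            nodes[b] = nodes[--n];     /* B's slot is filled by the last node */
        }
        return nodes[0];
    }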
On each pass of the loop that creates the Huffman Tree, it is possible to
decide which child will be the left one and which the right.
Moreover, if some symbols or sums of symbols have equal weights, any of
them may be selected when the minimal-weight node is searched for.
Therefore it is possible to create dozens of different Huffman Trees.
Each tree will be a valid Huffman Tree and can therefore be used for
compression.
Now we have a problem: statistics data must be passed along so that the
Decoder can build exactly the same tree, of the same shape, as the
Encoder did.
It can be proved that a special shape of Huffman Tree, called Canonical,
is entirely defined by the lengths of the codes for each symbol.
For a typical 256-symbol alphabet a code length can never exceed 255
bits, so each length value fits into a single 8-bit byte.
This is in contrast to counts of occurrences or floating-point
probabilities, which can be arbitrarily big or can require arbitrarily
high precision.
The list of code lengths is itself redundant data that can be compressed
too.
In the common case it can be RLE, but more sophisticated compression
schemes can be used as well.
Canonical Huffman Tree:
A Canonical Huffman Tree follows two basic rules:
Shorter codes, when padded with zeros to the right, have a
numerically higher value than longer codes.
Within the same length, code values increase in the alphabetical
order of the symbols.
In order to obtain the lengths of the codes, there is no way other than
building the Huffman tree.
I will be glad to be corrected if this statement of mine is wrong.
Given the lengths of each symbol's code, it is possible to get the actual
codes for Canonical Huffman Coding like this:
Denote by L(n, k) code number k of length n.
k_n is the number of symbols coded by n bits, so the codes of length n
are L(n, 0) through L(n, k_n - 1).
Let m be the length of the longest code.

    L(m, 0) = 0  (m zero bits)
    L(m, 1) = L(m, 0) + 1
    ...
    L(m, k_m - 1) = L(m, 0) + k_m - 1

    L(m-1, 0) = (L(m, k_m - 1) >> 1) + 1
    ...
    L(m-1, k_{m-1} - 1) = L(m-1, 0) + k_{m-1} - 1

    ...

    L(1, 0) = (L(2, k_2 - 1) >> 1) + 1
    ...
    L(1, k_1 - 1) = L(1, 0) + k_1 - 1

That is, within one length the codes are consecutive values, and the first
code of each shorter length is the last code of the previous length shifted
right by one bit, plus one.
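Here is a small C sketch of this recurrence under stated assumptions:
count[l] holds k_l, first[] receives the L(l, 0) values, and the array
names are illustrative:

    #define MAX_LEN 32

    /* count[l] = number of symbols with code length l (k_l above)
       first[l] = L(l, 0), the first canonical code of length l */
    void canonical_first_codes(const int count[MAX_LEN + 1],
                               unsigned first[MAX_LEN + 1], int m)
    {
        first[m] = 0;                   /* L(m, 0) = 0 */
        for (int l = m - 1; l >= 1; l--) {
            /* the last code of length l+1 is first[l+1] + count[l+1] - 1;
               levels with no codes fall through the recurrence naturally */
            unsigned last = first[l + 1] + count[l + 1] - 1;
            first[l] = (last >> 1) + 1;
        }
    }

The code for the k-th symbol of length l is then simply first[l] + k.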
Encoding Static Huffman Code.
Regardless of how the Huffman Codes are calculated, there is a common
scheme for encoding a Static Huffman Code.
Build the Huffman Tree and calculate the codes.
Start the encoding loop here.
Read the next input symbol s.
Find or calculate the code for symbol s: code[s].
Output code[s].
Continue the encoding loop.
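A minimal sketch of the output step in C, assuming the code[] and
code_len[] tables computed earlier and a put_bit() output routine (both
assumptions, not a fixed API):

    extern void put_bit(int bit);      /* assumed: appends one bit to output */

    /* emit the code for symbol s, most significant bit first */
    void encode_symbol(int s, const unsigned code[], const int code_len[])
    {
        for (int i = code_len[s] - 1; i >= 0; i--)
            put_bit((code[s] >> i) & 1);
    }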
Decoding Static Huffman Code.
Decoding is symmetric to encoding; nevertheless, here it is:
Build the Huffman Tree and/or calculate the codes.
Start the decoding loop here.
Find the code corresponding to the given bit stream,
starting from the current position.
Suppose the found code corresponds to symbol s, and the length of
code[s] is l bits.
Output symbol s.
Advance the current position in the input stream by l bits.
Continue the decoding loop.
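One simple way to implement the lookup is to walk the Huffman tree bit by
bit; the sketch below reuses the hnode type from the tree-building sketch
and assumes a get_bit() input routine:

    extern int get_bit(void);          /* assumed: returns the next input bit */

    int decode_symbol(const hnode *root)
    {
        const hnode *n = root;
        while (n->left != NULL)        /* internal nodes always have two children */
            n = get_bit() ? n->right : n->left;
        return n->symbol;              /* reaching a leaf consumes exactly l bits */
    }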
Adaptive Huffman Coding.
The main disadvantage of Static Huffman Coding is the need to carry
statistics information together with the encoded stream of symbols.
The adaptive scheme allows us to avoid transmitting statistics data.
Statistics information is gathered in the same pass, and the Huffman tree
is updated accordingly.
This way both Encoder and Decoder build the Huffman tree while reading the
stream of data.
The advantages of Adaptive Huffman Coding, as always with adaptive
codings, come at the cost of optimality and therefore at the cost of
compression ratio.
This algorithm was discovered by Gallager and improved by Knuth [5].
Start with a fixed Huffman Tree. Two initial variants are common:
A tree with a single special node O.
This variant is preferred by most authors for simplicity and
efficiency.
The weight of node O is subject to discussion; some implementations
use the value zero so that O is encoded with the longest
code.
A full balanced tree, providing equal probability for each
symbol.
This variant is thought to give less compression, but I did not find a
well-founded proof of whether that is right; the opposite can be true.
For sure, this variant requires more wasted memory and CPU time
for reordering the tree.
Adaptive Huffman Coding requires additional properties of the
tree nodes (a sketch of a suitable node record follows this list):
All nodes are indexed.
Indexes are sorted by node weights.
The first index is for the leaf node of the symbol with the longest
code.
The last index is for the root node.
The nodes of each tree level are sorted by weight.
The tree is a valid Huffman tree with all the corresponding
properties.
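A minimal sketch of a node record carrying this extra bookkeeping; the
name anode and the exact fields are illustrative:

    typedef struct anode {
        long weight;
        int index;                     /* position in the sorted index list */
        int symbol;                    /* symbol for leaves, or a flag for node O */
        struct anode *parent, *left, *right;
    } anode;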
Start of main loop:
Read the next input symbol A.
Check if A already exists in the tree (if we started with the
balanced tree, it always exists).
If A is not found in the tree:
Output the code for the pseudo symbol O.
Output the uncompressed code for symbol A.
Create a tree node for symbol A with a weight corresponding to
zero occurrences.
Create a new node P.
Set weight(P) = weight(A) + weight(O).
Place node P where node O was, including the parent pointer and
the place in the index list.
Assign the left and right children of node P to be nodes O and A;
decide which node is left and which is right so as to keep the
nodes sorted.
Update the list of indexes to include the new nodes O and A.
Else:
Output the current code for node A.
Update the Adaptive Huffman Tree:
Start of loop:
Increase the weight of node A.
If the weight of A is now bigger than that of the next node by index:
Find the first node N with weight(N) >
weight(A).
Denote by B the predecessor of N in the index list.
Exchange nodes A and B, including their places in the list of
indexes and their parent nodes, but not their children.
Set A to its parent, or NULL if it does not exist.
Continue the loop while A is not NULL.
Continue the main loop while input symbols are available.
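A condensed sketch of the update loop in C, reusing the anode record
above; next_by_index(), prev_by_index() and swap_nodes() are assumed
helpers, the last one exchanging two nodes' tree positions and index-list
places while leaving their children attached:

    extern anode *next_by_index(anode *n);
    extern anode *prev_by_index(anode *n);
    extern void swap_nodes(anode *a, anode *b);

    /* propagate a weight increase from node a up to the root,
       restoring the sorted-by-weight index order on the way */
    void update_tree(anode *a)
    {
        while (a != NULL) {
            a->weight++;
            anode *next = next_by_index(a);
            if (next != NULL && a->weight > next->weight) {
                /* find the first node N (by index) with weight(N) > weight(A);
                   the root's total weight bounds this walk */
                anode *n = next;
                while (n->weight <= a->weight)
                    n = next_by_index(n);
                /* B is N's predecessor; a full implementation must also
                   avoid swapping A with its own parent */
                anode *b = prev_by_index(n);
                if (b != a)
                    swap_nodes(a, b);
            }
            a = a->parent;             /* NULL above the root ends the loop */
        }
    }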
Decoding Adaptive Huffman Code.
The decoding process is generally symmetric to encoding.
Start with a fixed Huffman Tree.
The tree must have the same special properties as defined for
encoding, and its initialization must match the one used for
encoding.
Start the decoding loop.
Find the code corresponding to the bit stream starting from the
current input position.
Denote it code[s], of length l.
If s is the special symbol O:
Read a single uncoded symbol A.
Output symbol A.
Create a tree node for symbol A with a weight
corresponding to zero occurrences.
Create a new node P.
Set weight(P) = weight(A) + weight(O).
Place node P where node O was, including the parent
pointer and the place in the index list.
Assign the left and right children of node P to be nodes
O and A; decide which node is left and which is right so as
to keep the nodes sorted.
Update the list of indexes to include the new nodes O and
A.
Else:
Output the found symbol s.
Assign A to be the tree node of the same symbol s.
Update the Adaptive Huffman Tree exactly as it was done
for encoding.
Continue the decoding loop.
Once the Huffman Tree is built or updated, it is time to start the
actual encoding or decoding.
But first it is useful to decide how the code for a symbol will be
found: create a CodeBook with the codes listed.
CodeBook for Encoding.
It is possible to maintain an array, Hash-Map, list, tree, Heap, or
whatever structure holds codes ready for use for each symbol. The key is
a symbol.
CodeBook for Decoding.
For decoding there is a need for something more sophisticated.
The key must be a variable-length code.
It can be one of the previously mentioned storage structures, but with a
special comparison function like this:

    int compare_stream_to_code(byte *code, int code_bits_size,
                               byte *stream, int stream_bits_offset,
                               int stream_bits_size);
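One possible implementation, assuming bits are packed most significant bit
first within each byte; it returns 0 when the next code_bits_size bits of
the stream match the code exactly:

    typedef unsigned char byte;

    static int get_bit_at(const byte *buf, int bit_pos)
    {
        return (buf[bit_pos >> 3] >> (7 - (bit_pos & 7))) & 1;
    }

    int compare_stream_to_code(byte *code, int code_bits_size,
                               byte *stream, int stream_bits_offset,
                               int stream_bits_size)
    {
        if (stream_bits_offset + code_bits_size > stream_bits_size)
            return -1;                 /* not enough bits left in the stream */
        for (int i = 0; i < code_bits_size; i++) {
            int c = get_bit_at(code, i);
            int s = get_bit_at(stream, stream_bits_offset + i);
            if (c != s)
                return c - s;          /* first differing bit decides the order */
        }
        return 0;                      /* the code matches the stream here */
    }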
offs[i] is an index into the symb[] array: symb[offs[i]] is the first
symbol with code length i, that is, the symbol whose code is L(i, 0).
Likewise, base[i] is the first (numerically smallest) code of length i,
that is, L(i, 0) itself.
Start of the loop:

    int code = next_input_bit();
    int length = 1;
    while (code < base[length]) {      /* too small: a prefix of a longer code */
        code = (code << 1) | next_input_bit();
        length = length + 1;
    }
    symbol = symb[offs[length] + code - base[length]];

Continue the loop for the next symbol.
This code uses the special feature of the Canonical Huffman Tree: the
actual codes can be calculated rather than explicitly passed along with
the data.
The main problem with a CodeBook is deciding how to keep the code bits.
The maximal length of a code is N-1, where N is the size of the alphabet.
In the case of a typical alphabet with 256 symbols, the maximal symbol
code length is 255 bits.
Therefore we need 32 bytes (255 bits rounded up to whole bytes) to keep a
single symbol's code.
Multilevel Structure of CodeBook.
We can keep the CodeBook in a special Multilevel
Structure that can be described like this:
On the first level we look up the first, say up to 8, bits of the
code.
On the second level we look up the next 8 bits, and thus handle codes
up to 16 bits long overall.
The first part of the code is taken from the first-level node, and the
last part is taken from the current-level node.
The deeper levels are not likely to ever appear with the 256 symbols of
an 8-bit alphabet.
But nobody guarantees that longer codes do not exist: they do, and they
must be handled.
In this structure each node on each level must be marked as containing
either the ending of a code or a middle part. A node with an ending must
keep the length of the last part of the code it contains.
For an 8-bit alphabet I would propose keeping the first level of the
CodeBook in a balanced tree, a Hash Table, or something similar for fast
search. On the other hand, a search among tens of items can be linear
if time is not the main issue.
The second and further levels are more likely to be kept in simple linked
lists or the like.
For a list of items that is too short, the overhead of keeping a complex
structure can eat any possible benefits.
In the case of encoding with a multilevel CodeBook structure, the only
benefit we gain is a minimal amount of memory that is still able to keep
codes of arbitrary lengths.
The key for searching is the symbol, and it is of known length.
In order to construct a code from a symbol and the multilevel structure,
the CodeBook structure has to be enhanced with a "parent" pointer for each
node.
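A sketch of one possible node layout for this multilevel CodeBook; the
field names and the per-level width are illustrative choices, not a fixed
format:

    #define LEVEL_BITS 8               /* bits consumed per level */

    typedef struct cb_node {
        unsigned char chunk;           /* the next LEVEL_BITS bits of the code */
        int is_ending;                 /* 1 if a code ends in this node */
        int last_bits;                 /* length of the final code part (ending only) */
        int symbol;                    /* decoded symbol (ending only) */
        struct cb_node *next_level;    /* deeper nodes for longer codes */
        struct cb_node *next;          /* sibling in this level's list */
        struct cb_node *parent;        /* for encoding: walk back to rebuild the code */
    } cb_node;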
There is a corner case: when we select the maximal width of a level to be
the biggest possible, we do not need the multilevel approach at all; we
have one single level.
The previously mentioned decoding scheme
with the Canonical Huffman Tree is rather intended for this case.
The other far corner of the same idea is the Huffman Tree itself:
in this case we have a single bit per level,
and not 8 as I suggested for a 256-symbol alphabet.
And of course, I don't believe anybody will suggest a
B-Tree or Hash-Table for a maximum of 2 items :)
Limitation of Maximal Code Length.
The opposite of maintaining structures with full Huffman Codes is
to limit the code length so that a code can be kept as an integer of
limited value.
This can be achieved at the price of some loss of compression ratio.
But the possible simplicity of the structures and the speed of
calculations can be good reasons to go this way.
The optimal Huffman Tree limited to some constant height is promised to
be found by an algorithm called "The Warm-up Algorithm" [4].
I will update this document when a proper description, digested to my
understanding, becomes available.
Thanks to Simakov Alexander [3] I know a somewhat less
effective but simple variant.
Denote by L' the maximal allowed length of a Huffman code.
Denote by L the current maximal length of a Huffman code, and therefore
the height of the Huffman Tree.
Denote by RN(i) the rightmost leaf node on level i, and by LN(i) the
leftmost leaf node on level i.
Our task arises when L' < L, so let us start from level
L.
Start the main loop here.
Place RN(L) in place of its parent. Denote by S the sibling node of
RN(L).
Find a level j that has leaf nodes, with j < L-1.
Find LN(j).
Replace LN(j) with a new node P.
Make S and LN(j) the children of node P.
Repeat until the required height of the Huffman Tree is reached.
Notice that the loop removes leaves from the bottom level
first and then continues to the next level; a condensed sketch follows
below.
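A condensed C sketch of this loop under stated assumptions: the helpers
below (rightmost_leaf, leftmost_leaf, sibling, replace_with, make_internal,
deepest_leaf_level) are hypothetical and stand for ordinary tree surgery on
the hnode type from earlier:

    extern hnode *rightmost_leaf(int level);
    extern hnode *leftmost_leaf(int level);   /* NULL if the level has no leaves */
    extern hnode *sibling(hnode *n);
    extern void replace_with(hnode *old_node, hnode *new_node);
    extern hnode *make_internal(hnode *left, hnode *right);
    extern int deepest_leaf_level(void);

    void limit_height(int L, int L_limit)     /* L_limit plays the role of L' */
    {
        while (L > L_limit) {
            hnode *r = rightmost_leaf(L);
            hnode *s = sibling(r);
            replace_with(r->parent, r);        /* RN(L) takes its parent's place */
            int j = L - 2;
            while (leftmost_leaf(j) == NULL)   /* find a level j < L-1 with leaves */
                j--;
            hnode *ln = leftmost_leaf(j);
            hnode *p = make_internal(s, ln);   /* S and LN(j) become P's children */
            replace_with(ln, p);               /* P takes LN(j)'s old place */
            L = deepest_leaf_level();          /* the bottom level empties first */
        }
    }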
General notes:
The list of probabilities does not have to contain symbols that are
not in use.
Instead of probabilities, it is more practical to keep counts of
occurrences of each symbol in the source stream.
The only operation performed on a probability is the comparison of two
symbols' probabilities.
This simple operation can be done on the symbol counters without
calculating the actual probabilities.