The basic idea of Huffman coding is to encode symbols that appear with higher
probability using fewer output bits.
Since different symbols can be represented by different numbers of bits, it is
important that no symbol's code is a prefix of another symbol's code.
A code with this property is also called a prefix
code.
Create a sorted list of nodes, one per symbol.
The weight of each node is set to the probability of its symbol.
_{Start of loop:}
Find and remove the two nodes with the smallest weights; call them A
and B.
Create a new node with weight[node] = weight[A] +
weight[B].
Make A and B the left and right children of the new node.
Insert the new node back into the sorted list.
Repeat the loop until only the one last node remains in the list.
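The loop above can be sketched in Python. This is a minimal illustration, not the article's exact data structures: `heapq` plays the role of the sorted list, and all names here are my own.

```python
import heapq
import itertools

def build_huffman_tree(freqs):
    """freqs: dict symbol -> weight. Returns the tree: a symbol at each leaf,
    a (left, right) pair at each internal node."""
    tiebreak = itertools.count()  # makes heap entries comparable on equal weights
    heap = [(w, next(tiebreak), sym) for sym, w in freqs.items()]
    heapq.heapify(heap)           # the "sorted list" of nodes
    while len(heap) > 1:
        wa, _, a = heapq.heappop(heap)   # the two nodes with smallest weights
        wb, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (wa + wb, next(tiebreak), (a, b)))  # new parent
    return heap[0][2]

def code_lengths(tree, depth=0, out=None):
    """Walk the finished tree and record the code length of every symbol."""
    out = {} if out is None else out
    if isinstance(tree, tuple):          # internal node: recurse into children
        code_lengths(tree[0], depth + 1, out)
        code_lengths(tree[1], depth + 1, out)
    else:                                # leaf: code length equals depth
        out[tree] = depth
    return out
```

Note that ties between equal weights are broken by insertion order here, which is exactly the freedom of choice discussed below.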
On each iteration of the loop that builds the Huffman Tree it is possible to
decide which
child will be the left one and which the right.
Moreover, if some symbols or sums of symbols have equal weights, any
of them
may be selected when the minimal-weight node is searched for.
Therefore it is possible to create dozens of different Huffman Trees.
Each tree will be a valid Huffman Tree and can therefore be used for
compression.
Now we have a problem: statistics data must be passed to the Decoder so
that it can build
exactly the same tree, of the same shape, as the Encoder did.
It can be proved that a special shape of Huffman Tree, called Canonical,
is entirely defined by the code lengths of the symbols.
For a typical 256-symbol alphabet a single code length cannot
exceed 255, so each length fits in 8 bits.
This is in contrast to counts of occurrences or floating-point
probabilities,
which can be arbitrarily big or require arbitrarily fine precision.
The list of code lengths is redundant data that can itself be
compressed too.
In the common case simple RLE is enough, but more sophisticated
compression can be used as well.
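As a toy illustration of run-length coding a length list (not the scheme of any particular format; DEFLATE, for instance, uses a more elaborate one), assuming the convention that length 0 marks an unused symbol:

```python
def rle_encode(lengths):
    """Collapse runs of equal code lengths into (value, run_count) pairs."""
    runs = []
    for v in lengths:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([v, 1])       # start a new run
    return [(v, n) for v, n in runs]

def rle_decode(runs):
    """Expand (value, run_count) pairs back into the flat length list."""
    out = []
    for v, n in runs:
        out.extend([v] * n)
    return out
```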
Find the code that matches the bit stream starting at
the current
input bit stream position.
Denote it code[s], of length l.
If s is the special symbol O:
    Read a single uncoded symbol A.
    Output symbol A.
    Create a tree node for symbol A with a weight
    corresponding to zero occurrences.
    Create a new node P.
    Set weight(P) = weight(A) + weight(O).
    Put node P in place of node O, including the parent
    pointer and the place
    in the index list.
    Make nodes O and A the left and right children of
    node P.
    Decide which node goes left and which right in order to keep
    the nodes sorted.
    Update the list of nodes to include nodes O and
    A at their new positions.
else:
    Output the found symbol s.
    Assign variable A to be the same symbol s.
    Update the Adaptive Huffman Tree exactly as is done
    for encoding.
Continue the decoding loop.
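The "split the escape node" step above can be sketched in Python. This is a minimal sketch with names of my own (`Node`, `split_escape`); a real adaptive coder (FGK/Vitter style) also maintains the sorted node list and updates weights up to the root, which this fragment omits.

```python
class Node:
    """Minimal tree node for the sketch; a real coder also tracks the
    node's position in the sorted index list."""
    def __init__(self, weight, symbol=None, left=None, right=None, parent=None):
        self.weight, self.symbol = weight, symbol
        self.left, self.right, self.parent = left, right, parent

def split_escape(esc):
    """Replace escape leaf `esc` with a new parent P whose children are
    `esc` and a fresh leaf for the just-seen symbol (weight 0 occurrences)."""
    new_leaf = Node(weight=0)                     # new symbol: zero occurrences so far
    p = Node(esc.weight + new_leaf.weight, parent=esc.parent)
    if esc.parent is not None:                    # hook P where the escape node was
        if esc.parent.left is esc:
            esc.parent.left = p
        else:
            esc.parent.right = p
    p.left, p.right = esc, new_leaf               # which goes left/right should really
    esc.parent = new_leaf.parent = p              # be chosen to keep the nodes sorted
    return p, new_leaf
```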
When the Huffman Tree is built or updated, it is time to start the
actual encoding or decoding.
But first it is useful to decide how the code for a symbol can be
found.
offs[i] is an index into the symb[] array.
symb[offs[i]] = L^{i}_{0}
In other words, symb[offs[i]] is the first symbol with code length i.
_{Start of the loop:}
code = next input bit.
length = 1.
while code < base[length]:
    code = code << 1.
    code += next input bit.
    length = length + 1.
symbol = symb[offs[length] + code - base[length]].
Continue the loop.
This code relies on a special feature of the Canonical
Huffman Tree: the actual codes can
be calculated and need not be explicitly passed along with the data.
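The loop above can be made runnable in Python. One assumption here is how base[] is built, since the text does not spell it out: I assign codes of the maximal length starting at 0, so shorter codes get numerically larger values and the test `code < base[length]` means "keep reading bits". Table-construction details are my own sketch.

```python
def build_canonical(lengths):
    """lengths: dict symbol -> code length. Builds the base[], offs[] and
    symb[] tables used by the decoding loop in the text."""
    maxlen = max(lengths.values())
    count = [0] * (maxlen + 1)               # count[l]: how many codes of length l
    for l in lengths.values():
        count[l] += 1
    base = [0] * (maxlen + 1)                # base[maxlen] = 0; shorter codes are
    for l in range(maxlen - 1, 0, -1):       # numerically larger than longer ones
        base[l] = (base[l + 1] + count[l + 1]) >> 1
    symb = sorted(lengths, key=lambda s: (lengths[s], s))  # grouped by length
    offs = [0] * (maxlen + 1)                # offs[l]: first symbol of length l
    for l in range(2, maxlen + 1):
        offs[l] = offs[l - 1] + count[l - 1]
    return base, offs, symb

def decode(bits, base, offs, symb):
    """bits: iterable of 0/1 ints holding whole codes. Runs the decoding
    loop from the text until the input is exhausted."""
    out, it = [], iter(bits)
    for bit in it:
        code, length = bit, 1
        while code < base[length]:           # code still shorter than any valid one
            code = (code << 1) | next(it)    # assumes the stream holds whole codes
            length += 1
        out.append(symb[offs[length] + code - base[length]])
    return out
```

With lengths a=1, b=2, c=2 this yields codes a=1, b=00, c=01, and the bit stream 1 00 01 1 decodes back to a, b, c, a.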
The main problem with the CodeBook is deciding how to keep the code bits.
The maximal length of a code is N-1, where N is the size of the alphabet.
In the case of a typical alphabet with 256 symbols, the maximal symbol
code length is 255 bits.
Therefore we need 256/8 = 32 bytes to keep a single symbol's code.
On the first level we find the first, for example up to 8, bits of the
code.
On the second level we find the next 8 bits, and can therefore handle
codes
with lengths up to 16 bits per code.
The first part of the code is taken from the first-level node and the
last part from the current-level node.
Further levels are not likely to ever appear with 256 symbols of an
8-bit alphabet.
But nobody guarantees that longer codes do not exist; they do, and must
be handled.
In this structure each node on each level must be marked as containing
either the ending
of a code or its middle. A node with an ending must keep the length of the last
part of the code it contains.
For an 8-bit alphabet I would propose keeping the first level of the CodeBook
in a balanced tree,
hash table, or a similar structure for fast search. On the other hand, a search
among tens of items can be linear
if time is not the main issue.
The second and further levels are more likely to be kept in simple linked
lists or the like.
The overhead of keeping a complex structure can eat any possible benefit
for a list
of items that is too short.
In the case of encoding with a multilevel CodeBook structure, the only
benefit we gain
is a minimal amount of memory that is still able to keep codes of arbitrary
lengths.
The search key is the symbol, and it is of known length.
In order to construct a code from a symbol and the multilevel structure, the
structure
of the CodeBook has to be enhanced with a "parent" pointer for each
node.
There is a corner case when we select the maximal width of a level
to be the biggest
possible. In this case we do not need the multilevel approach at all; we
have one single level.
The previously mentioned decoding scheme
with the Canonical Huffman Tree is rather intended for this case.
The other far corner of the same idea
is the Huffman Tree itself.
In this case we have a single bit per level
and not 8, as I suggested for a 256-symbol alphabet.
And of course, I don't believe anybody will suggest a
B-Tree or Hash-Table for a maximum of 2 items :)
Thanks to Simakov Alexander [3] I know a somewhat less
effective but simple variant.
Denote by L` the maximal allowed length of a Huffman code.
Denote by L the current maximal length of a Huffman code, and therefore the
height of the Huffman Tree.
Denote by RN_{i} the rightmost leaf node on level i.
Denote by LN_{i} the leftmost leaf node on level i.
Our task is to reduce L down to at most L`, so let us start from level
L.
Start the main loop here.
Put RN_{L} in place of its parent. Denote by S the sibling node of
RN_{L}.
Find a level j that has leaf nodes, with j < (L-1).
Find LN_{j}.
Replace LN_{j} with a new node P.
Make S and LN_{j} the children of node P.
Repeat until the Huffman Tree reaches the required height.
Notice that the loop removes leaves from the bottom level
first and only then continues to the next level.
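A tree-free alternative to the procedure above can be sketched as follows. Note this is a swapped-in technique, not the tree rotation from [3]: it clamps the code-length array and then repairs the Kraft inequality sum(2^-l) <= 1, which guarantees a valid prefix code still exists for the resulting lengths. The result is valid but not necessarily optimal.

```python
def limit_code_lengths(lengths, max_len):
    """Clamp Huffman code lengths to max_len, then lengthen other codes
    until the Kraft sum is back at <= 1, so a prefix code still exists."""
    lengths = [min(l, max_len) for l in lengths]
    # Kraft sum scaled by 2**max_len to stay in integer arithmetic.
    kraft = sum(1 << (max_len - l) for l in lengths)
    limit = 1 << max_len
    while kraft > limit:
        # Lengthen the deepest code still shorter than max_len: raising l to
        # l+1 removes 2**(max_len-l-1) from the scaled sum. If no such code
        # exists the alphabet is too big for max_len and max() raises.
        i = max((i for i, l in enumerate(lengths) if l < max_len),
                key=lambda i: lengths[i])
        kraft -= 1 << (max_len - lengths[i] - 1)
        lengths[i] += 1
    return lengths
```

For example, limiting the valid length set [1, 2, 3, 4, 5, 5] to 3 bits yields [2, 3, 3, 3, 3, 3], whose Kraft sum is 7/8.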
General notes:
The list of probabilities does not have to contain symbols that are
not in use.
Instead of probabilities it is more practical to keep counts of
occurrences of each symbol in the source stream.
The only operation performed on probabilities is the comparison of two
symbols' probabilities.
This simple operation can be done on the symbol counters without calculating
the actual probabilities.