Lempel-Ziv Compression algorithms are easy divided in two main groups:
LZ77 and LZ78.
The bold difference between this two groups is that LZ77 do not need an
explicit dictionary
where LZ78 do need it.
LZ77.
The main idea of LZ77 is to find the longest match to the current part
of the input stream
in the already passed part of the input stream.
Lets try to build algorithm for encoding LZ77.
Define constant W to be "window size".
Loop the input stream until the end of the stream is reached.
Sign p the current position in the input stream.
Scan the input stream from symbol max(0, p-W) to max(0,
p-1).
Sign q the scan pointer.
Check the count of symbols that are equal in the
input stream,
starting from the points p and q.
The stream comparison must be stopped on first unmatched
symbol.
Check if the found length of equal peaces of the
stream is maximal.
The offset and length of this longest stream must be
saved.
Output the found backward offset and length.
Continue the loop.
This is the basic algorithm for LZ77.
Notice that uncoded yet symbols can be counted as well as already coded
ones.
The bold example is a sequence of equal symbols.
In this case the encoded stream could be like this:
Output the uncoded symbol.
Output offset one.
Output the length of the sequence minus one.
If for some reason it is impossible to encode the actual length of the
matched sequence,
it is possible to select the maximal length that is possible for
encoding.
The limitation for the sequence length can appear because of the way,
the output of the length is done.
All that nice, however there are at least three problems that must be
solved.
What must be the output if while scanning backward no match was
found ?
Here are some answers:
LZ77 proposes this solution:
Each output consist of three parts - backward
offset,
length and the next input symbol.
If the symbol is found first time, offset or length
are zero.
Notice that one symbol on each encoding step is
pulled out of the encoding sequence.
LZSS is doing it other way:
Uncoded symbol is output only if found length is
less then some constant or zero.
This constant is usually 2 or 3.
There is a need to sign each output if it is a pair
of offset-length or an explicit symbol.
This can be done one of this ways:
Manage special bit for each output.
If byte alignment is important it is possible to
gather
this special bits for several outputs.
This way first output the bit-mask and after
that the actual data.
Output zero offset and explicit symbol or
symbols
according to the value of length.
The actual output of LZ77 contain a lot of redundant data.
The simplest variant can choose a constant amount of bits
for offset and length.
A popular variant is 12 bits for offset and 4 bits for length.
In this case the output remain redundant, but due to simplicity
this
method is usefull for educational purposes and to gain some
advanced modularity.
The offset and length as well as a bit-mask can be encoded.
It is proposed to encode the offset and length separately with:
The search for longest match require a lot of CPU time.
It is possible, however, by cost of amount of memory used
to speed-up the search for longest matches.
Build the source tree.
The main property of this tree is that given any arbitrary point
in the input stream,
the presence of the stream in the earlier parts can be found
with smallest time.
In the root level there is a list of all already
found symbols
from the input stream.
This list can be keeped in one of quick-search
structures
like Hash-Map, Balanced Tree etc.
Each symbol in the tree have pointer to the node of
the next level.
Each symbol in the tree must contain reference to
the input stream.
It must point to the start of the stream that is
finished by this symbol.
Node of each next level contain all possible
symbols-continuations
for the previous symbol that is pointing on.
There is a need for special mark for symbol that is
not scanned yet.
On each occurrence of this mark the rest of the input
stream must be
checked - possible this path will be the longest
one.
Here are the basic operations that must be defined to
work with this tree.
Find the best match.
Keep the best match in variable M.
This includes the offset and length of the found
sequence.
Obviously the both are initialized with zero for
case if nothing is found at all.
Start recursion.
Find the next unencoded symbol in the
current node starting from the root.
Recursively try to find the best match by
continuing with the previously found symbol.
Find if the special mark is present -
possible matching in the unencoded part of stream.
This will not be a case in the root node.
If the mark is found, calculate the matched
length.
Compare the result of recursion and the
match found in the input stream.
If one of this two is not exist, assume its
result length zero.
The maximal length and its offset is the
result of the recursion.
Update the tree with fresh scanned symbols.
Provide advanced structure - linked list L
of nodes that contain
previously defined special mark for continue in
the unscanned input stream.
The stream of symbols to be updated is a
sequence of several symbols,
but we will update the tree by adding this
symbols one by one.
Sign the current symbol to update by s.
For each item in the list L
Create new node with single symbol -
the special mark.
The special mark will point on the same
place like the original one,
but the length is increased by one.
In the current node remove the
special mark with the value of s.
Make pointer from the new symbol s
to the new node created previously.
Check presence of the symbol s in the root
node.
If symbol s is present in the root node then
Find node N, the found symbol s in
the root is pointing on.
Check if node N have the symbol of
special mark.
If special mark is not found, add
one.
else
Create node N where the only symbol
is the special mark.
Add symbol s to the root node.
Line pointer from the newly added
symbol s in the root to the new node N.
Optional clean-up of the oldest branches.
This step is not mandatory.
It is possible to define some amount of nodes to
sign overflow.
In case of overflow the tree can be simple
destroyed.
This solution, however, reduces compression
ratio.
The best solution to clean-up the tree is to
maintain the linked
list of newly inserted symbols in the nodes.
Remain that each symbol in the tree
contain the offset and length
in advance to the pointer to a node with
continuations.
Loop over the list of symbols,
sorted by creation.
Mark the current symbol s.
Recursively destroy all the
nodes that are continuation of the current symbol s.
Remove symbol s from the
node N that contain it.
If node N become empty,
remove it and update the symbol from the previous
level, pointing on it.
Stop the loop after reaching the next
symbol that occurred
closer then the window
size disallow.
LZ78.
The main idea of LZ78 is to enhance dictionary with appearance of each
unencoded symbol.
Each time, if symbol found that can not be encoded, the dictionary is
enhanced with new
entity that is the last found entity plus one symbol.
Lets try to build algorithm for encoding by LZ78.
Initialize dictionary.
Define sequence w, initially empty.
Loop over the input stream.
For each symbol s,
If ws found in the dictionary
Assign w = ws.
else
Output dictionary code for sequence w.
Output explicit symbol s.
Add ws to the dictionary.
Set w = s or empty w that seems to be less
effective.
When implementing this algorithm there are at least three problems to be
solved.
How to initialize the dictionary ?
What values the initialized dictionary must contain ?
LZ78 proposes the empty dictionary with the only value -
NULL-value.
If nothing appropriate is found in the dictionary, the
NULL-value is output.
After NULL-value a single symbol is coming exactly as after any
other dictionary value.
LZW proposes the dictionary initialized with
all possible symbols.
This way there is no more any need in the NULL-value.
How to output the dictionary value ?
We can decide to encode dictionary value with amount of bits
corresponding
to the dictionary size.
The amount of bits must be big enough to represent any value
that is currently
present in the dictionary.
The dictionary values can be encoded by one of entropy
coders.
It can be Huffman, Arithmetic, Universal Integer Coding
etc.
How to store the dictionary the way to allow fast search for input
sequence ?
The obvious way is to keep simple linked list.
The result will be very slow algorithm, but for small dictionary
this method
can be preferred for its simplicity.
Standard fast-search storage like Hash-Map or Balanced Tree.
There is a need to provide special comparison routine, but this
method can be
effective by yet keeping to be simple if the desired standard
storage found and
is ready for use.
The amount of memory needed for this variant is even
bigger then for first variant.
Each node have a list of possible continuation
symbols.
Continuation symbol have pointer to the next
node.
There is a list of nodes, sorted by creation
order.
The basic idea here is quit similar to one proposed for LZ77, however,
there is no need for "special mark", since here it is
impossible to use
uncoded symbols as part of the dictionary.