Lempel-Ziv Compressions.

Author:	Arkadi Kagan
	arkadi_kagan@hotmail.com
Document:	Entropy Compression Methods.

Lempel-Ziv Compressions.

Lempel-Ziv Compression algorithms are easy divided in two main groups: LZ77 and LZ78. The bold difference between this two groups is that LZ77 do not need an explicit dictionary where LZ78 do need it.

LZ77.

The main idea of LZ77 is to find the longest match to the current part of the input stream in the already passed part of the input stream.
Lets try to build algorithm for encoding LZ77.

Define constant W to be "window size".

Loop the input stream until the end of the stream is reached.

Sign p the current position in the input stream.
Scan the input stream from symbol max(0, p-W) to max(0, p-1).
- Sign q the scan pointer.
- Check the count of symbols that are equal in the input stream, starting from the points p and q. The stream comparison must be stopped on first unmatched symbol.
- Check if the found length of equal peaces of the stream is maximal. The offset and length of this longest stream must be saved.
Output the found backward offset and length.
Continue the loop.

This is the basic algorithm for LZ77.
Notice that uncoded yet symbols can be counted as well as already coded ones. The bold example is a sequence of equal symbols. In this case the encoded stream could be like this:

Output the uncoded symbol.
Output offset one.
Output the length of the sequence minus one.

If for some reason it is impossible to encode the actual length of the matched sequence, it is possible to select the maximal length that is possible for encoding. The limitation for the sequence length can appear because of the way, the output of the length is done.
All that nice, however there are at least three problems that must be solved.

What must be the output if while scanning backward no match was found ?
Here are some answers:

LZ77 proposes this solution:

Each output consist of three parts - backward offset, length and the next input symbol.

If the symbol is found first time, offset or length are zero.

Notice that one symbol on each encoding step is pulled out of the encoding sequence.

LZSS is doing it other way:

Uncoded symbol is output only if found length is less then some constant or zero. This constant is usually 2 or 3.

There is a need to sign each output if it is a pair of offset-length or an explicit symbol.
This can be done one of this ways:

Manage special bit for each output. If byte alignment is important it is possible to gather this special bits for several outputs. This way first output the bit-mask and after that the actual data.

Output zero offset and explicit symbol or symbols according to the value of length.

The actual output of LZ77 contain a lot of redundant data.
- The offset and length as well as a bit-mask can be encoded. It is proposed to encode the offset and length separately with:
  - Huffman Coding.
    This is what was done in the Deflate [5] compression.
  - Arithmetic Coding.
  - Universal Integer Codings.
  - RLE - Run Length Encoding.
  - Other Entropy Coders.
The search for longest match require a lot of CPU time. It is possible, however, by cost of amount of memory used to speed-up the search for longest matches.
- Build the source tree.
  The main property of this tree is that given any arbitrary point in the input stream, the presence of the stream in the earlier parts can be found with smallest time.
  - In the root level there is a list of all already found symbols from the input stream. This list can be keeped in one of quick-search structures like Hash-Map, Balanced Tree etc.
  - Each symbol in the tree have pointer to the node of the next level.
  - Each symbol in the tree must contain reference to the input stream. It must point to the start of the stream that is finished by this symbol.
  - Node of each next level contain all possible symbols-continuations for the previous symbol that is pointing on.
  - There is a need for special mark for symbol that is not scanned yet. On each occurrence of this mark the rest of the input stream must be checked - possible this path will be the longest one.
  Here are the basic operations that must be defined to work with this tree.
  1. Optional clean-up of the oldest branches.
    - The best solution to clean-up the tree is to maintain the linked list of newly inserted symbols in the nodes.
      - Loop over the list of symbols, sorted by creation. Mark the current symbol s.
        
        Recursively destroy all the nodes that are continuation of the current symbol s.
        
        Remove symbol s from the node N that contain it.
        
        If node N become empty, remove it and update the symbol from the previous level, pointing on it.
        
        Stop the loop after reaching the next symbol that occurred closer then the window size disallow.

LZ78.

The main idea of LZ78 is to enhance dictionary with appearance of each unencoded symbol. Each time, if symbol found that can not be encoded, the dictionary is enhanced with new entity that is the last found entity plus one symbol.
Lets try to build algorithm for encoding by LZ78.

Initialize dictionary.
Define sequence w, initially empty.
Loop over the input stream.
- For each symbol s,
  If ws found in the dictionary
  - Assign w = ws.
  else
  - Output dictionary code for sequence w.
  - Output explicit symbol s.
  - Add ws to the dictionary.
  - Set w = s or empty w that seems to be less effective.

When implementing this algorithm there are at least three problems to be solved.

How to initialize the dictionary ?
What values the initialized dictionary must contain ?

LZ78 proposes the empty dictionary with the only value - NULL-value. If nothing appropriate is found in the dictionary, the NULL-value is output. After NULL-value a single symbol is coming exactly as after any other dictionary value.

LZW proposes the dictionary initialized with all possible symbols. This way there is no more any need in the NULL-value.

How to output the dictionary value ?

We can decide to encode dictionary value with amount of bits corresponding to the dictionary size. The amount of bits must be big enough to represent any value that is currently present in the dictionary.
The dictionary values can be encoded by one of entropy coders. It can be Huffman, Arithmetic, Universal Integer Coding etc.

How to store the dictionary the way to allow fast search for input sequence ?
- Create special tree.
  - Each node have a list of possible continuation symbols.
  - Continuation symbol have pointer to the next node.
  - There is a list of nodes, sorted by creation order.
  The basic idea here is quit similar to one proposed for LZ77, however, there is no need for "special mark", since here it is impossible to use uncoded symbols as part of the dictionary.

References:

[1] Fundamental Compression Algorithms.
Compression Team (compresswww@rasip.fer.hr).
http://oldwww.rasip.fer.hr/research/compress/algorithms/fund/

[2] Data Compression.
Debra A. Lelewer and Daniel S. Hirschberg.
http://www.ics.uci.edu/~dan/pubs/DataCompression.html

[3] Lossless Compression Algorithms (Entropy Encoding).
Dave Marshall.
http://www.cs.cf.ac.uk/Dave/Multimedia/node207.html

[4] Lossless Image Compression.
Dmitry Pechyonyj - Alexander Bogomjakov.
http://www.cs.technion.ac.il/Labs/Isl/Project/Projects_done/VisionClasses/DIP/Lossless_Compression/lossless.html

[5] DEFLATE Compressed Data Format Specification version 1.3.
L. Peter Deutsch, Jean-Loup Gailly and Mark Adler.
ftp://swrinde.nde.swri.edu/pub/png/documents/zlib/rfc-deflate.html