PPM - Prediction by Partial Matching.
Prediction by Partial Matching is a method of predicting the next symbol
from the n previous ones.
The method is also called prediction by a Markov model [3] of order n.
For text in a natural language like English it is intuitively clear, and
has been shown by researchers [1], that the probability of each next
symbol depends strongly on the preceding symbols.
The most obvious way to exploit this property is to combine it with one
of the known entropy coders that gather statistics.
The natural candidates are Huffman and Arithmetic coding, in either
their Adaptive or Static variants.
The key idea is to change the alphabet to an alphabet of sequences rather
than an alphabet of single symbols (a sketch in code follows this list):
- Assume a Markov model of order n.
- Let the input stream be the sequence x.
- For each symbol x_i at position i of x:
  - Update the probability of the sequence y = [x_{i-n} ... x_i].
  - Treat y as a symbol of the target alphabet.
  - Update all relevant structures of the basis algorithm.
  - Perform the remaining steps required by the basis algorithm.
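A minimal sketch of this loop in Python, assuming a plain frequency
counter stands in for whatever structures the basis algorithm really
maintains (the names are illustrative, not taken from any existing
library):

    from collections import Counter

    def update_order_n_model(x, n):
        """Count every sequence y = [x_{i-n} .. x_i]; each distinct y
        becomes a symbol of the target alphabet for the basis coder."""
        counts = Counter()
        for i in range(len(x)):
            if i < n:
                # Not enough history yet; the options for this case are
                # discussed right below.
                continue
            y = tuple(x[i - n : i + 1])   # the sequence [x_{i-n} ... x_i]
            counts[y] += 1                # update its probability estimate
        return counts

    # Example: an order-2 model over a short string.
    print(update_order_n_model("abracadabra", 2))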
Notice the definition of the sequence y = [x_{i-n} ... x_i].
If x_{i-n} or other members of the sequence are not available (near the
start of the stream), there are several options to handle the situation:
- Output x_i uncoded and skip to the next symbol.
- Add a shortened symbol to the alphabet.
The first option is obvious and needs no further clarification.
The second option can be enhanced and deserves a more detailed
explanation.
I will not treat it directly here, because it can be implemented as a
variant of Multilevel PPM.
Multilevel PPM.
Instead of constructing the compression alphabet from a constant-order
Markov model, it is possible to decide:
- Let n be the predefined highest order of the Markov model.
- Each sequence y = [x_0 ... x_k] is a symbol of the target alphabet,
  where:
  - 0 ≤ k ≤ n;
  - each x_i in y is a symbol of the source alphabet.
- A sequence that has not been found (yet?) in the source sequence is
  assigned probability zero.
  This is equivalent to excluding the symbol from the alphabet, or never
  including it in the first place.
- Recognition of an input symbol of the target alphabet is in general
  done as follows:
  - Assign a weight to each symbol of the target alphabet.
    The maximal weight goes to the longer symbols with the higher
    probability of occurrence; the minimal weight goes to the shorter
    symbols with the lower probability of occurrence.
  - The basis algorithm is encouraged to use this weight instead of the
    probability.
  - For each offset i in the source stream, find all symbols s_i of the
    target alphabet that can represent a sequence of source symbols
    ending at offset i (illustrated in the sketch after this list).
  - Now loop over the input and find the sequence of target-alphabet
    symbols that fits the source input with maximal total weight.
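Before that final loop, here is a hedged Python sketch of the per-offset
candidate search. The weight formula (length plus log of the count) is
only an assumed choice that satisfies the "longer and more probable means
heavier" requirement; the text does not prescribe it:

    import math
    from collections import Counter

    def build_target_alphabet(x, n):
        """Collect every run of 1..n+1 consecutive source symbols as a
        candidate symbol of the target alphabet, with its count."""
        counts = Counter()
        for i in range(len(x)):
            for k in range(min(n, i) + 1):     # k = 0..n, as far as history allows
                counts[tuple(x[i - k : i + 1])] += 1
        return counts

    def candidates_ending_at(x, i, alphabet, n):
        """All target symbols that can represent a run of source symbols
        ending at offset i, each paired with an illustrative weight."""
        found = []
        for k in range(min(n, i) + 1):
            y = tuple(x[i - k : i + 1])
            if y in alphabet:
                weight = (k + 1) + math.log2(alphabet[y])   # assumed weight rule
                found.append((y, weight))
        return found

    alpha = build_target_alphabet("abracadabra", 2)
    print(candidates_ending_at("abracadabra", 7, alpha, 2))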
The last step is a loop over the input, and here there are two very
probable dangers:
- Finding a sequence whose total weight is far from maximal and therefore
  far from optimal.
- Building a correct but wasteful algorithm that needs too much memory
  and CPU time.
The best solution I could think of is a modification of the Viterbi
algorithm [3] (a code sketch follows the list below):
- Require the following property of the weights:
  given two symbols A and B of the target alphabet, the weights must be
  calculated so that for the symbol S = AB:
  weight[S] = weight[A] + weight[B].
- Define a state machine whose number of states equals the number of
  possible symbols in the target alphabet.
- Denote the symbol corresponding to state k by s_k.
- Assign an initial accumulative weight of zero to each state.
- Denote the weight of the target-alphabet symbol corresponding to state
  k by weight[k].
- Loop over the input stream and count steps.
- On each step i, loop over the states.
  Denote state j on step i by s_{i,j}.
  - Denote the accumulative weight of state j on step i by
    weight[s_{i,j}].
  - Find the state s_{i-1,k} for which the sum of its accumulative weight
    and the weight of the corresponding target-alphabet symbol k is
    maximal.
  - Set the accumulative weight of state s_{i,j}:
    weight[s_{i,j}] = weight[s_{i-1,k}] + weight[k].
  - Assign to the state s_{i,j} a marker into the input sequence.
  - Record the backward direction inside the s_{i,j} structure.
    The backward direction backward[s_{i,j}] refers to the previously
    found state s_{i-1,k}:
    backward[s_{i,j}] = s_{i-1,k}.
- Continue the step loop as long as at least one of the states still has
  a way to continue through the source sequence.
- Find the state s of the last step with the maximal accumulative weight.
- Scan backward starting from the found state s, using the previously
  saved backward[s].
  For each state found, prepend its corresponding symbol to the output
  sequence.
- The resulting sequence of target-alphabet symbols is exactly the output
  of Multilevel PPM.
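A hedged Python sketch of this search, restated as a dynamic program over
input offsets, which is how I read the state machine above: best[i] is
the maximal accumulative weight of any parse of the first i source
symbols, and back[i] plays the role of the backward marker. Additive
weights are assumed, as required by the first item of the list, and it is
also assumed that every single source symbol is itself in the target
alphabet, so that a parse always exists:

    def max_weight_parse(x, alphabet, weight, n):
        """Split x into target-alphabet symbols (runs of 1..n+1 source
        symbols) so that the sum of their weights is maximal."""
        best = [float("-inf")] * (len(x) + 1)  # best[i]: max weight of a parse of x[:i]
        back = [None] * (len(x) + 1)           # back[i]: symbol whose match ends at offset i
        best[0] = 0.0
        for i in range(1, len(x) + 1):
            for length in range(1, min(n + 1, i) + 1):
                y = tuple(x[i - length : i])
                if y in alphabet and best[i - length] + weight[y] > best[i]:
                    best[i] = best[i - length] + weight[y]
                    back[i] = y                # backward marker, as in backward[s]
        out, i = [], len(x)                    # scan backward, prepending symbols
        while i > 0:
            out.append(back[i])
            i -= len(back[i])
        out.reverse()
        return out

    # Toy usage, with a weight that explicitly favours longer symbols
    # (an illustrative choice, not one prescribed above):
    text = "abcabcabc"
    alpha = build_target_alphabet(text, 2)     # from the previous sketch
    w = {y: len(y) ** 2 for y in alpha}
    print(max_weight_parse(text, alpha, w, 2)) # three ('a', 'b', 'c') symbols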
One far corner of Multilevel PPM is to set the maximal order of the
Markov model as large as the length of the input stream.
An implementation of this PPM modification could be the closest to an
optimal variant of the Lempel-Ziv compression scheme.
Unfortunately there is no existing implementation that would work with a
reasonable amount of memory and processing time.
The opposite far corner is setting the maximal Markov model order to
zero.
In this case we get the ordinary entropy coder that was taken as the
basis for PPM.
Some skipped details.
The weight.
Notice that the entity called weight was left undefined.
The only restrictions I put on it are those needed by the modification of
the Viterbi algorithm; if another algorithm is used, the restrictions
could be different.
The simplest variants of the weight can be chosen as follows:
- If s = s1·a (a symbol s1 extended by one source symbol a), then
  weight[s] = weight[s1] + weight[a].
  The rest of the rules for this variant are obvious.
- The weight of a symbol is equal to its length.
  Here we get a variant of LZ77.
Endless variants are possible, and I am not yet ready to classify them
all; a small sketch of the first rule follows.
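A small Python illustration of the first rule, assuming the base weights
of single source symbols are given externally (the default of 1.0 is my
assumption; with it the rule degenerates into the length weight of the
second variant):

    def additive_weight(y, base_weight):
        """weight[s] = weight[s1] + weight[a] for s = s1·a, unrolled over
        the whole sequence y."""
        total = 0.0
        for a in y:
            total += base_weight.get(a, 1.0)   # assumed default base weight
        return total

    # With an empty base_weight map every source symbol contributes 1.0,
    # so additive_weight(y, {}) == len(y): the LZ77-like length weight.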
The basis algorithm.
One entity that was not discussed is the basis algorithm that PPM is
built on.
Huffman or Arithmetic coders can be used to further improve the
compression rate.
However, simpler methods can be chosen for simplicity or as an attempt
to get higher modularity.
One of these simpler methods is to store the target symbol indexes as
plain integers of constant bit length.
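A hedged sketch of this simplest basis coder in Python: every
target-symbol index is written with the same bit width,
ceil(log2(alphabet size)) bits. The packing scheme is illustrative only:

    def encode_fixed_width(indexes, alphabet_size):
        """Pack target-symbol indexes into a constant number of bits each."""
        width = max(1, (alphabet_size - 1).bit_length())
        bits = "".join(format(i, "0{}b".format(width)) for i in indexes)
        bits += "0" * (-len(bits) % 8)         # pad to whole bytes
        packed = bytes(int(bits[j:j + 8], 2) for j in range(0, len(bits), 8))
        return width, packed

    # Example: indexes of a 5-symbol target alphabet take 3 bits each.
    print(encode_fixed_width([0, 1, 2, 3, 4], 5))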
The gathered statistics.
As with any compression based on gathered statistics, there are three
main ways to pass the statistics from the compressor to the
decompressor.
- Static coding.
  All the statistics must somehow be passed from the compressor to the
  decompressor.
  One variant is to encode the statistics data the way it is done for a
  canonical Huffman tree (see the sketch after this list):
  - Build the Huffman tree.
  - Transmit the lengths of the Huffman codes instead of the
    probabilities.
- Adaptive coding.
  The statistics are not passed at all.
  Instead they are gathered symmetrically during encoding and decoding.
  This requires a way to encode symbols seen for the first time.
  In the case of Multilevel PPM, unlike ordinary Huffman or Arithmetic
  coding, adaptive coding can reduce the compression rate dramatically.
- Fixed statistics.
  This variant is useful, if at all, only for very specific cases where
  the statistics of the source are constant and well investigated.
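As an illustration of the static variant, here is a hedged Python sketch
that computes Huffman code lengths from gathered counts; only these
lengths would have to be passed to the decompressor, which can rebuild a
canonical code from them. The heap-based construction is merely a
convenient implementation choice:

    import heapq

    def huffman_code_lengths(freqs):
        """Huffman code length per symbol from a {symbol: count} map."""
        if len(freqs) == 1:                    # degenerate one-symbol alphabet
            return {next(iter(freqs)): 1}
        heap = [(f, i, (sym,)) for i, (sym, f) in enumerate(freqs.items())]
        heapq.heapify(heap)
        lengths = dict.fromkeys(freqs, 0)
        tick = len(heap)                       # tie breaker for the heap
        while len(heap) > 1:
            f1, _, group1 = heapq.heappop(heap)
            f2, _, group2 = heapq.heappop(heap)
            for sym in group1 + group2:        # every merge adds one bit
                lengths[sym] += 1
            heapq.heappush(heap, (f1 + f2, tick, group1 + group2))
            tick += 1
        return lengths

    # Code lengths, not probabilities, are what would be transmitted.
    print(huffman_code_lengths({"a": 5, "b": 2, "r": 2, "c": 1, "d": 1}))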
References:
[1] Data Compression Using Adaptive Coding and Partial
String Matching.
John G. Cleary and Ian H. Witten.
http://pharos.cpsc.ucalgary.ca/Dienst/UI/2.0/Describe/ncstrl.ucalgary_cs/1982-103-22
[2] Prediction by Partial Matching (PPM) FAQ.
Maksim Smirnov and Dmitri Shkarin.
http://www.arctest.narod.ru/descript/ppm-faq.htm
[3] A Tutorial on Hidden Markov Models.
Rakesh Dugad and U. B. Desai.
http://uirvli.ai.uiuc.edu/dugad/hmm_tut.html