PPM - Prediction by Partial Matching.
Prediction by Partial Matching is a method of predicting the next symbol
from the n previous ones.
This method is also called prediction by a Markov model [3] of order n.
For text in a natural language such as English, it is intuitively clear,
and has been shown by researchers [1], that the probability of each
symbol depends strongly on the preceding symbols.
The most obvious way to exploit this property is to combine it with one
of the known entropy coders that gather statistics.
The natural candidates are Huffman and arithmetic coding, in either
their adaptive or static variants.
The key issue is to change the alphabet: make it an alphabet of
sequences instead of an alphabet of single symbols:
- Assume a Markov model of order n.
- Let the input stream be the sequence x.
- For each symbol x_{i} at position i of the sequence x:
  - Update the probability of the sequence y,
    y = [x_{i-n} ... x_{i}].
  - Let y be a symbol of the target alphabet.
  - Update all relevant structures of the basis algorithm.
  - Perform the rest of the steps required by the basis algorithm.
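The steps above can be sketched in code. This is a minimal illustration,
assuming the source stream is a Python string and the basis coder only
needs frequency counts; the function name and the choice to simply skip
the first n positions are my own, not from the text.

```python
from collections import Counter

def sequence_alphabet_counts(stream, n):
    """Count occurrences of each sequence y = [x_{i-n} ... x_{i}].

    Every window of n + 1 source symbols becomes one symbol of the
    target alphabet; the counts are the statistics that the basis
    coder (Huffman or arithmetic) would consume.  The first n
    positions, where x_{i-n} does not exist, are simply skipped here.
    """
    counts = Counter()
    for i in range(n, len(stream)):
        y = stream[i - n : i + 1]  # the sequence y = [x_{i-n} ... x_{i}]
        counts[y] += 1
    return counts

# An order-1 model over a short string:
print(sequence_alphabet_counts("abab", 1))  # Counter({'ab': 2, 'ba': 1})
```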
Notice the definition of the sequence y:
y = [x_{i-n} ... x_{i}].
If x_{i-n} or other parts of the sequence are not available, there are
several options to handle the situation:
- Output x_{i} uncoded and skip to the next symbol.
- Add a shortened symbol to the alphabet.
The first option is obvious and needs no further clarification.
The second option can be enhanced and deserves a more detailed
explanation.
I will not treat it directly here, because it can be implemented as a
variant of Multilevel PPM.
Multilevel PPM.
Instead of constructing the compression alphabet from a constant-order
Markov model, it is possible to decide:
- Let n be the predefined highest order of the Markov model.
- Each sequence y = [x_{0} ... x_{k}] is a symbol of the target
  alphabet, where
  - k satisfies 0 ≤ k ≤ n,
  - each x_{i} in y is a symbol of the source alphabet.
- A sequence that has not (yet?) been found in the source sequence is
  assigned probability zero.
  This is equivalent to excluding the symbol from the alphabet or not
  including it in the first place.
- Recognition of an input symbol of the target alphabet is in general
  done as follows:
  - Assign a weight to each symbol of the target alphabet.
    The maximal weight goes to the longer symbols with a higher
    probability of occurrence; the minimal weight goes to the shorter
    symbols with a lower probability of occurrence.
  - The basis algorithm is encouraged to use this weight instead of the
    probability.
  - For each offset i in the source stream, find all symbols s_{i} of
    the target alphabet that can represent a sequence of source symbols
    ending at offset i.
  - Now loop over the input and find the sequence of target-alphabet
    symbols that fits the source input with maximal weight.
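The per-offset matching step can be sketched as follows. The toy weight
dictionary, the function name, and the bound on symbol length are
illustrative assumptions, not part of the text.

```python
def symbols_ending_at(stream, dictionary, max_len):
    """For each offset i of the source stream, list all target-alphabet
    symbols (sequences present in `dictionary`) that end at offset i.

    `dictionary` maps each known sequence to its weight; only membership
    is used here.  `max_len` bounds the length of a target symbol.
    """
    ending = []
    for i in range(len(stream)):
        found = []
        for k in range(1, min(max_len, i + 1) + 1):
            seq = stream[i - k + 1 : i + 1]  # length-k sequence ending at i
            if seq in dictionary:
                found.append(seq)
        ending.append(found)
    return ending

weights = {"a": 1, "b": 1, "ab": 3, "ba": 2}
print(symbols_ending_at("aba", weights, 2))
# [['a'], ['b', 'ab'], ['a', 'ba']]
```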
The last step is a loop over the input, and here there are two likely
dangers:
- Finding a sequence whose total weight is far from maximal and
  therefore far from optimal.
- Building a correct but wasteful algorithm that needs too much memory
  and CPU time.
The best solution I could think of is a modification of the Viterbi
algorithm [3]:
- Require the following property of the weights:
  given two symbols A and B of the target alphabet, the weights must be
  calculated so that for the symbol S, S = AB:
  weight[S] = weight[A] + weight[B].
- Define a state machine with as many states as there are possible
  symbols in the target alphabet.
- Denote the symbol corresponding to state k by s_{k}.
- Initialize the accumulated weight of each state to zero.
- Denote the weight of the target-alphabet symbol corresponding to
  state k by weight[k].
- Loop over the input stream and count steps.
- On each step i, loop over the states.
  Denote state j on step i by s_{i,j}.
  - Denote the accumulated weight of state j on step i by
    weight[s_{i,j}].
  - Find the state s_{i-1,k} for which the sum of the accumulated
    weight of s_{i-1,k} and the weight of the corresponding symbol
    number k of the target alphabet is maximal.
  - Set the accumulated weight of state s_{i,j}:
    weight[s_{i,j}] = weight[s_{i-1,k}] + weight[k].
  - Assign to the state s_{i,j} a marker into the input sequence.
  - Record the backward direction within the s_{i,j} structure.
    The backward direction backward[s_{i,j}] means the previously found
    s_{i-1,k}:
    backward[s_{i,j}] = s_{i-1,k}.
- Continue the step loop until the last of the states has a way to
  continue in the source sequence.
- Find the state s on the last step with the maximal weight.
- Scan backward starting from the found state s, using the previously
  saved backward[s].
  For each state found, prepend its corresponding symbol to the output
  sequence.
- The resulting sequence of target-alphabet symbols is exactly the
  output of Multilevel PPM.
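Under the additive-weight requirement, the whole search can be sketched
as a dynamic program over stream offsets rather than an explicit state
machine. This is a simplified reading of the Viterbi-style procedure
above, with illustrative names and a toy weight table of my own.

```python
def best_segmentation(stream, weights):
    """Maximal-weight segmentation of `stream` into dictionary sequences.

    best[i] is the maximal accumulated weight over segmentations of the
    first i symbols; back[i] remembers the last symbol of that best
    segmentation and plays the role of backward[s] in the text.  Relies
    on the additive property weight[AB] = weight[A] + weight[B].
    """
    n = len(stream)
    best = [float("-inf")] * (n + 1)
    best[0] = 0
    back = [None] * (n + 1)
    for i in range(1, n + 1):
        for seq, w in weights.items():
            j = i - len(seq)
            if j >= 0 and stream[j:i] == seq and best[j] + w > best[i]:
                best[i] = best[j] + w
                back[i] = seq
    # Scan backward using the saved markers, as in the final step.
    out, i = [], n
    while i > 0:
        out.append(back[i])
        i -= len(back[i])
    return list(reversed(out))

weights = {"a": 1, "b": 1, "ab": 3, "ba": 2}
print(best_segmentation("abab", weights))  # ['ab', 'ab']
```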
The extreme case of Multilevel PPM is to define the maximal order of
the Markov model to be as large as the length of the input stream.
An implementation of this PPM modification could be the
closest-to-optimal variant of the Lempel-Ziv compression scheme.
Unfortunately, there is no existing implementation that could work with
a reasonable amount of memory and processing time.
The opposite extreme is defining the maximal Markov-model order to be
zero.
In this case we get the ordinary entropy coder that was taken as the
basis for PPM.
Some skipped details.
The weight.
Note the entity called weight, which was left undefined.
The only restrictions I put on it are those needed to use the
modification of the Viterbi algorithm.
If another algorithm is used, these restrictions could be different.
The simplest ways to choose the weight are as follows:
- If s = s_{1}a, then weight[s] = weight[s_{1}] + weight[a].
  The rest of the rules for this variant are obvious.
- The weight of a symbol is equal to its length.
  Here we get a variant of LZ77.
Endless variants are possible, and I am not yet ready to classify them
all.
The basis algorithm.
One entity that was not discussed is the basis algorithm that PPM is
built on.
Huffman or arithmetic coders can be used to further improve the
compression rate.
However, simpler methods can be chosen for simplicity or in an attempt
to achieve higher modularity.
One of these simpler methods is to store the target-symbol indexes as
plain integers of constant bit-length.
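The constant-bit-length option might look like the sketch below. The
packing format (big-endian bit order, zero padding to a whole byte) is
my assumption, and there is no header or alphabet transmission here.

```python
def pack_indexes(indexes, bit_length):
    """Store target-symbol indexes as plain integers of a fixed
    bit-length, concatenated big-endian and padded to whole bytes."""
    bits = "".join(format(ix, f"0{bit_length}b") for ix in indexes)
    bits += "0" * (-len(bits) % 8)  # pad to a byte boundary
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

def unpack_indexes(data, bit_length, count):
    """Recover `count` indexes packed by pack_indexes."""
    bits = "".join(format(b, "08b") for b in data)
    return [int(bits[i * bit_length:(i + 1) * bit_length], 2)
            for i in range(count)]

packed = pack_indexes([5, 0, 7, 2], 3)   # 4 indexes, 3 bits each
print(unpack_indexes(packed, 3, 4))      # [5, 0, 7, 2]
```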
The gathered statistics.
As with any compression based on gathered statistics, there are three
main ways to pass the statistics from the compressor to the
decompressor:
- Static coding.
  All statistics must somehow be passed from the compressor to the
  decompressor.
  One variant is to encode the statistics data as is done for a
  canonical Huffman tree:
  - Build the Huffman tree.
  - Transmit the lengths of the Huffman codes instead of the
    probabilities.
- Adaptive coding.
  The statistics are not passed at all.
  Instead, they are gathered symmetrically during encoding and
  decoding.
  This requires a way to encode symbols encountered for the first time.
  For Multilevel PPM, unlike ordinary Huffman or arithmetic coding,
  adaptive coding can reduce the compression rate dramatically.
- Fixed statistics.
  This variant is useful, if at all, only in very specific cases where
  the statistics of the source are constant and well investigated.
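Transmitting code lengths in the static variant works because canonical
codes can be rebuilt from the lengths alone. A sketch of that
reconstruction, assuming the lengths come from a valid Huffman tree;
symbol names are illustrative.

```python
def canonical_codes(lengths):
    """Rebuild canonical Huffman codes from code lengths alone.

    Symbols are ordered by (length, symbol) and given consecutive
    codes, shifting left whenever the length grows; this is the
    standard canonical Huffman rule.
    """
    order = sorted(lengths, key=lambda s: (lengths[s], s))
    codes, code, prev_len = {}, 0, 0
    for sym in order:
        code <<= lengths[sym] - prev_len  # append zeros as length grows
        codes[sym] = format(code, f"0{lengths[sym]}b")
        code += 1
        prev_len = lengths[sym]
    return codes

print(canonical_codes({"a": 1, "b": 2, "c": 2}))
# {'a': '0', 'b': '10', 'c': '11'}
```

The resulting code is prefix-free, so the decompressor needs only the
lengths, which are cheaper to transmit than the probabilities.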
References:
[1] Data Compression Using Adaptive Coding and Partial
String Matching.
John G. Cleary and Ian H. Witten.
http://pharos.cpsc.ucalgary.ca/Dienst/UI/2.0/Describe/ncstrl.ucalgary_cs/1982-103-22
[2] Prediction by Partial Matching (PPM) FAQ.
Maksim Smirnov and Dmitri Shkarin.
http://www.arctest.narod.ru/descript/ppm-faq.htm
[3] A Tutorial on Hidden Markov Models.
Rakesh Dugad and U. B. Desai.
http://uirvli.ai.uiuc.edu/dugad/hmm_tut.html