The content of this page is not a set of actual compression algorithms.
Rather, it is a set of known transforms that make the compression step
more efficient.
BWT - Burrows-Wheeler Transform.
Burrows and Wheeler [1] proved that after the transform now called BWT,
encoding by one of the entropy coding algorithms achieves a better
compression ratio.
The original authors' compressor chained the algorithms this way: BWT --> MTF --> RLE-0 --> Entropy Coder
But there are a lot of variants.
I will cover only the transform itself.
Forward BWT.
The input is a sequence x of length n.
Define n sequences.
The i-th sequence is the same sequence x, rotated by i-1 symbols.
These sequences can be drawn as the matrix M(x):
M(x) =
  x1    x2    ...  xn-1  xn
  x2    x3    ...  xn    x1
  ...
  xn-1  xn    ...  xn-3  xn-2
  xn    x1    ...  xn-2  xn-1
Sort the rows of the matrix M(x) in lexicographical order.
The result of this operation is denoted M'(x).
The output of BWT is:
The rightmost column of the matrix M'(x), denoted xbwt.
The index of the row in M'(x) that contains the original sequence x.
Let us denote this row number R(x).
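The forward transform above can be sketched in Python. This is a naive version that builds all n rotations explicitly (fine for illustration, far too slow for real data); the function and variable names are my own:

```python
def bwt_forward(x):
    """Forward BWT: return the last column of the sorted rotation
    matrix M'(x) together with the row index R(x) of x itself."""
    n = len(x)
    # Row i of M(x) is x rotated by i symbols; sorting the rows gives M'(x).
    rotations = sorted(x[i:] + x[:i] for i in range(n))
    x_bwt = "".join(row[-1] for row in rotations)  # rightmost column
    r = rotations.index(x)                         # row containing x
    return x_bwt, r
```

For example, bwt_forward("banana") yields ("nnbaaa", 3): the transform tends to group equal symbols together, which is what helps the later MTF and entropy-coding stages.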
Inverse BWT.
The inverse transform is not very intuitive, but Burrows and Wheeler [1]
proved it correct.
The input to the inverse BWT is the sequence xbwt and the index
R(x).
Let us restore the known parts of the matrix M'(x):
The last column is known: it is xbwt.
Sort the sequence xbwt.
M'(x) is known to be lexicographically sorted.
Therefore the sorted xbwt is the first column of M'(x).
We defined R(x) as the index of the row in M'(x) that contains the
sequence x.
The last letter ln of row R(x) in the matrix M'(x) is therefore the
last letter of the sequence x.
Output that last letter ln.
Assign R = R(x).
Assign m = n.
The loop starts here.
Count the occurrences of the letter lm in the last column of
M'(x) above row R.
Denote this count k.
Find the (k+1)-th occurrence of the letter lm in the first column.
Assign this row index to R.
The next letter lm-1 is found in the last column of row R.
Output the letter.
Decrease m by 1.
Repeat the loop until all n letters of x are known.
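The loop above can be sketched as follows. Instead of re-counting occurrences at every step, this version precomputes them once (the standard LF-mapping trick); names are mine, and the pair ("nnbaaa", 3) used in the test is the forward BWT of "banana":

```python
def bwt_inverse(x_bwt, r):
    """Inverse BWT: recover x from the last column x_bwt and row index r."""
    n = len(x_bwt)
    first = sorted(x_bwt)          # the first column of M'(x)
    # rank[i]: which occurrence (0-based) of its symbol x_bwt[i] is,
    # counted from the top of the last column.
    counts, rank = {}, []
    for c in x_bwt:
        counts[c] = counts.get(c, 0) + 1
        rank.append(counts[c] - 1)
    # first_pos[c]: row of the first occurrence of c in the first column.
    first_pos = {}
    for i, c in enumerate(first):
        if c not in first_pos:
            first_pos[c] = i
    out = []
    for _ in range(n):
        out.append(x_bwt[r])                   # emit letters last-to-first
        r = first_pos[x_bwt[r]] + rank[r]      # jump to the matching row
    return "".join(reversed(out))
```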
Here is an explanation of why the inverse BWT works.
Assume the i-th symbol in the source sequence is s.
Assume there are k symbols s in the source stream.
Consider the row in M' that contains the continuation of the stream,
starting from symbol i in the first column.
Assume it is row number j among those starting with the symbol s.
Remove all rows that do not start with s.
Denote the resulting structure M''.
The structure M'' is a lexicographically sorted list of rows that
all start with the symbol s.
There are k such rows.
Row j of M'' has as its continuation the remainder of the
input stream.
Let us get back to the original BWT matrix M'.
Put the last column of the matrix M' in the first place, before column
0.
Cut all rows that do not start with s.
The resulting structure is exactly M''.
In the process of the inverse BWT we always know the contents of the
first
and last columns of the matrix M'.
Now a part of the values from these columns are the contents of
columns 0 and 1 of the structure M''.
We know that starting from symbol i, the continuation of the source
stream is in row j of M''.
Therefore, if symbol i is found in column 0 of row j in M'',
then symbol i+1 must be found in column 1 of row j.
MTF - Move To Front.
The idea of the algorithm is to compress, instead of the arbitrary
symbols themselves, the indexes of the symbols in a specially ordered list.
This algorithm can be used as a pre-processing step for LZ, Huffman,
Arithmetic Coding,
and others.
Arrange the source alphabet into a list of indexes.
Start the loop here.
Get the next input symbol s.
Find index[s], the index of the symbol s in the list of indexes.
Output index[s].
Move index[s] to the front.
If the next encoded symbol is s again, its index will be the smallest
possible:
in our case index'[s] = 0.
Continue the loop until the last symbol is encoded.
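The encoding loop above can be sketched in Python; the list `table` plays the role of the list of indexes, and the names are mine:

```python
def mtf_encode(data, alphabet):
    """Move-To-Front: output each symbol's current index, then move it
    to the front of the list."""
    table = list(alphabet)             # the decoder must start from this order
    out = []
    for s in data:
        i = table.index(s)             # index[s] in the current list
        out.append(i)
        table.insert(0, table.pop(i))  # move the symbol to the front
    return out
```

For example, a run of repeated symbols turns into a run of zeros: mtf_encode("aaab", "ab") gives [0, 0, 0, 1], which is exactly what RLE-0 then exploits.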
Decoding of MTF is straightforward, but here it is anyway:
Arrange the source alphabet into a list of indexes.
Make sure the initial order is the same as the one used for encoding.
Start the loop here.
Get the next encoded index - index[s].
Using the given index[s], find the symbol s in the list of
indexes.
Output symbol s.
Move index[s] to front.
Continue loop until the last symbol is decoded.
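The decoder mirrors the encoder step for step, only the lookup direction changes (index to symbol instead of symbol to index); again the names are my own:

```python
def mtf_decode(indexes, alphabet):
    """Inverse MTF: look up each index, output the symbol, then apply
    the same move-to-front step as the encoder."""
    table = list(alphabet)             # must match the encoder's initial order
    out = []
    for i in indexes:
        s = table[i]
        out.append(s)
        table.insert(0, table.pop(i))  # keep both tables in sync
    return "".join(out)
```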
RLE - Run Length Encoding.
In contrast to the other algorithms in this document, RLE is indeed
compression.
However, the compression ratio of RLE is so small that it is very rarely
used as a real compression method by itself.
One formerly popular Animator Pro file format, ".cli", does use RLE
as its only
compression method.
The other uses of RLE that I know of employ it as a pre-processing or
post-processing step.
Select a symbol to serve as the Escape symbol.
This symbol should appear in the potential source as rarely as possible,
ideally not at all.
Start the loop here.
Count how many times the next symbol s repeats in the input stream.
If count > Limit or s is the Escape symbol, output the Escape
symbol, the count, and the symbol s.
Otherwise, output the symbol s uncoded, count times.
Skip count symbols.
Continue the loop until the whole input stream is encoded.
Decoding RLE is symmetric to encoding.
Agree on which symbol is the Escape symbol.
It must be the same one that was used for encoding.
Start the loop here.
Read the next input symbol s.
If s is not the Escape symbol,
output s.
Otherwise:
read the count,
read the actual symbol t that was encoded,
and output the symbol t count times.
In this description I used the constant Limit.
This constant depends on how symbols and counts are encoded.
For example, if symbols (including Escape) and counts are encoded one
byte each,
Limit could be 3 symbols.
The point of the Limit constant is to avoid encoding symbol
sequences
that are shorter when left unencoded.
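The whole scheme can be sketched on bytes, following the one-byte example above (so Limit = 3, since an escaped run costs exactly 3 bytes); the choice of 0xFF as the Escape value is an arbitrary assumption of mine:

```python
ESC = 0xFF    # assumed escape byte; pick one that is rare in real input
LIMIT = 3     # an escaped run costs 3 bytes: ESC + count + symbol

def rle_encode(data):
    """RLE with an escape byte: (ESC, count, symbol) for long runs
    and for the escape byte itself; literal bytes otherwise."""
    out, i = bytearray(), 0
    while i < len(data):
        s, count = data[i], 1
        while i + count < len(data) and data[i + count] == s and count < 255:
            count += 1
        if count > LIMIT or s == ESC:
            out += bytes([ESC, count, s])
        else:
            out += bytes([s] * count)   # short runs stay literal
        i += count                      # skip the counted symbols
    return bytes(out)

def rle_decode(data):
    """Symmetric decoder: expand each (ESC, count, symbol) triple."""
    out, i = bytearray(), 0
    while i < len(data):
        if data[i] == ESC:
            count, s = data[i + 1], data[i + 2]
            out += bytes([s] * count)
            i += 3
        else:
            out.append(data[i])
            i += 1
    return bytes(out)
```

Note that a literal escape byte in the source is always escaped, even as a run of one, so the decoder never misreads a literal as a marker.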
RLE-0 - Zero Run Length Encoding.
If we know precisely which symbol is the most frequent, we can improve
RLE coding to
encode only this symbol and leave all others uncoded.
There are endless variants of this basic idea; however, the RLE-0
described here is used in the popular [1] implementation of BWT.
Add two new symbols to the output alphabet: 0a and
0b.
Start the loop here.
Check the next symbol s from the input stream.
If s is not 0, output s.
If s is 0, count how many consecutive zeros appear at this place in
the input stream.
Denote this count m.
Find the binary representation of the number m+1 -
2^↓[log2(m+1)].
Output this binary sequence, coded with
↓[log2(m+1)] bits
(this is just m+1 in binary with its leading 1 bit removed).
Replace each bit 0 with the symbol 0a and
each bit 1 with the symbol 0b.
Continue the encoding loop.
y = ↓[x] here denotes rounding down: the maximal integer y such that
y ≤ x, for real x.
Decoding RLE-0 is symmetric to encoding; however, when we encounter a
sequence of x symbols
from the group {0a, 0b}, we must compute which number
m was encoded by this sequence.
If the decoded binary number is denoted N, then m = 2^x + N - 1:
x = ↓[log2(m+1)]
N = (m+1) - 2^↓[log2(m+1)]
m+1 - N = 2^↓[log2(m+1)]
log2(m+1 - N) = ↓[log2(m+1)]
x = log2(m+1 - N)
2^x = m+1 - N
m = 2^x + N - 1
Thus we have found that the source contained m sequential 0's, which
can be
output in the decoding process.
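Both directions can be sketched in Python on strings, with '0' as the frequent symbol; the letters 'A' and 'B' standing in for 0a and 0b are my assumption:

```python
ZERO_A, ZERO_B = "A", "B"   # stand-ins for the two new symbols 0a and 0b

def rle0_encode(data):
    """Replace each run of m zeros with the binary of m+1, leading 1 bit
    dropped, written in the two-symbol alphabet {0a, 0b}."""
    out, i = [], 0
    while i < len(data):
        if data[i] != "0":
            out.append(data[i])
            i += 1
            continue
        m = 0
        while i + m < len(data) and data[i + m] == "0":
            m += 1
        bits = bin(m + 1)[3:]   # m+1 in binary without its leading '1'
        out.append(bits.replace("0", ZERO_A).replace("1", ZERO_B))
        i += m
    return "".join(out)

def rle0_decode(data):
    """Turn each maximal group of x symbols from {0a, 0b} back into
    m = 2^x + N - 1 zeros, where N is the group read as binary."""
    out, i = [], 0
    while i < len(data):
        if data[i] not in (ZERO_A, ZERO_B):
            out.append(data[i])
            i += 1
            continue
        j = i
        while j < len(data) and data[j] in (ZERO_A, ZERO_B):
            j += 1
        bits = data[i:j].replace(ZERO_A, "0").replace(ZERO_B, "1")
        out.append("0" * (int("1" + bits, 2) - 1))  # m = 2^x + N - 1
        i = j
    return "".join(out)
```

For example, a run of three zeros has m+1 = 4 = 100 in binary, so it is encoded as the two symbols 0a 0a.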
Elias Universal Coding of integers.
Elias Universal Coding is intended for the universal coding of individual
integers that
can be distinguished in an unmarked stream of bits.
This method provides a relatively compact representation of
unbounded integers.
Notice that Elias coding is a prefix code: no codeword is the beginning
of another, which is what makes the numbers distinguishable in an
unmarked bit stream.
The natural assumption when choosing Elias Coding is that smaller
numbers are
more frequent than larger ones.
Elias Universal codes come in two types.
Elias Gamma Code.
Γ(x) is the binary value of the integer x, prefaced by ↓[log2(x)]
zeros.
The most significant bit of the encoded value x must be 1.
Elias Delta Code.
Δ(x) is encoded as Γ(↓[log2(x)] + 1), followed by the binary
value of x without its leading 1 bit.
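Both codes can be sketched as bit strings; note that ↓[log2(x)] is simply one less than the bit length of x:

```python
def elias_gamma(x):
    """Gamma code: floor(log2 x) zeros, then x in binary (MSB is 1)."""
    assert x >= 1
    b = bin(x)[2:]                     # binary value of x
    return "0" * (len(b) - 1) + b      # the zeros announce the bit length

def elias_delta(x):
    """Delta code: gamma code of the bit length of x,
    then x in binary without its leading 1 bit."""
    assert x >= 1
    b = bin(x)[2:]
    return elias_gamma(len(b)) + b[1:]
```

For example, 5 = 101 in binary, so Γ(5) = 00101, and Δ(5) = Γ(3) followed by 01, i.e. 01101. Delta pays a small extra cost on tiny numbers but grows much more slowly for large ones.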
Fibonacci Universal Coding of integers.
Fibonacci Code is intended for the universal coding of individual
integers that
can be distinguished in an unmarked stream of bits.
The intention of the Fibonacci Code is similar to that of the Elias
Code; however, the method of encoding
is quite different.
Encoding the Fibonacci Universal Code for a given number x.
Compute the maximal Fibonacci number Fmax with
Fmax ≤ x.
Start with F = 1 and P = 1.
Start the loop.
If F + P ≤ x:
F' = F + P
P = F
F = F'
Continue the loop.
Otherwise, Fmax = F is the searched number.
Assign F = Fmax and P the Fibonacci
number preceding F.
Start the loop.
Output the next bit of the Fibonacci Code; since we go from Fmax
downward, the bits are produced starting from the last bit of the codeword.
If x ≥ F:
Output bit 1.
x = x - F
Otherwise output bit 0.
P' = F - P
F = P
P = P'
Continue the loop until the position F = 1 has been processed.
Finally, append one more bit 1 after the bit of Fmax; together with
that bit it forms the "11" that terminates the codeword.
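The encoder can be sketched as follows. Instead of walking the Fibonacci pair back down with P' = F - P, this version simply keeps the list of Fibonacci numbers it generated on the way up (a simplification of the two-variable scheme above); the codeword is returned lowest Fibonacci bit first, with the terminating extra 1 appended, so every code ends in "11":

```python
def fib_encode(x):
    """Fibonacci (Zeckendorf) code for x >= 1: one bit per Fibonacci
    number 1, 2, 3, 5, ..., lowest first, plus a terminating 1."""
    assert x >= 1
    fibs = [1, 2]
    while fibs[-1] <= x:
        fibs.append(fibs[-1] + fibs[-2])
    fibs.pop()               # now fibs[-1] is the largest Fibonacci <= x
    bits = []
    for f in reversed(fibs): # greedy subtraction from Fmax downward
        if f <= x:
            bits.append("1")
            x -= f
        else:
            bits.append("0")
    bits.reverse()           # store the lowest Fibonacci bit first
    return "".join(bits) + "1"
```

For example, 4 = 3 + 1, so the bits for (1, 2, 3) are 1, 0, 1 and the codeword is 1011. The greedy choice of Fmax guarantees no two adjacent 1s inside the code, so "11" can only occur at the end.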
Decoding a sequence of bits coded with the Fibonacci code.
Notice the opposite order of bits in encoding and decoding.
Find the two consecutive 1's that mark the end of the Fibonacci
Code.
Initialize F = 1, P = 1, and sum = 0.
Start decoding from the first bit of the Fibonacci Code.
Start the loop.
If the current bit of the Fibonacci Code is 1,
sum = sum + F.
Compute the next Fibonacci number:
F' = F + P
P = F
F = F'
Loop until the whole number is decoded.
The variable sum now holds the decoded value.
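A sketch of this decoder, reading the codeword from its first (lowest Fibonacci) bit and stopping at the "11" terminator; names are mine:

```python
def fib_decode(bits):
    """Decode one Fibonacci codeword; the trailing '11' terminates it."""
    total = 0
    f, p = 1, 1               # F starts at 1, as in the description above
    prev = "0"
    for b in bits:
        if b == "1" and prev == "1":
            return total      # two consecutive 1s: end of the codeword
        if b == "1":
            total += f
        prev = b
        f, p = f + p, f       # advance to the next Fibonacci number
    raise ValueError("unterminated Fibonacci code")
```

Reading "1011" this way gives 1 + 3 = 4, matching the encoding example.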