The content of this page is not a set of actual compression algorithms.
It is rather a set of known transforms that make the compression step
more efficient.
BWT - Burrows-Wheeler Transform.
Burrows and Wheeler [1] proved that after transform
called BWT the actual
encoding by one of Entropy Coding algorithms can gain its compression
ratio.
The original authors compression was to chain algorithms this way: BWT --> MTF --> RLE-0 --> Entropy Coder
But there are a lot of variants.
I will cover only the transform itself.
Forward BWT.
Input is sequence x of length n.
Define n sequences.
The i`th sequence is the same sequence x, rotated by i-1 symbols.
This sequences can be drawn as matrix M(x):
M(x) =
x1
x2
...
xn-1
xn
x2
x3
...
xn
x1
...
xn-1
xn
...
xn-3
xn-2
xn
x1
...
xn-2
xn-1
Sort rows of the matrix M(x) in lexicographical order.
The result of this operation can be signed as matrix M`(x).
The output of BWT is:
The rightmost column of the matrix M`(x).
It will be signed as xbwt
The sequential number of row in the matrix M`(x), containing
the sequence x.
Lets sign this row number R(x).
Inverse BWT.
The inverse transform is not too intuitive, but Burrows and Wheeler [1]
proved it to be right.
The input to the Inverse BWT is sequence xbwt and index
R(x).
Lets restore the known parts of matrix M`(x):
The last column is known and it is xbwt.
Sort the sequence xbwt.
M`(x) is known to be lexicographically sorted.
Therefore the sorted xbwt is the first column of
M`(x).
We signed R(x) to be index of row in M`(x) containing the sequence
x.
The last letter ln of row R(x) in the matrix M`(x) is the
last letter of the sequence x.
Output the last letter that is ln.
Assign R to be R(X).
Assign m to be equal to n.
The loop starts here.
Count the occurrences of letters lm in the last column of
M`(x) before row R.
Sign k the counted number.
Count occurrence k+1 in the first column for letter lm.
Assign this row index to R.
The searched letter lm-1 is found in the last column of
row R.
Output the letter.
Decrease m by 1.
Repeat the loop until all n letters of x are known.
Here is explanation why Inverse BWT work.
Assume i`th symbol in the source sequence is s.
Assume we have k symbols s in the source stream.
Sign row in M` that contain the continuation of the stream,
starting from the symbol i in the first column.
Assume it is row number j between those starting from symbol s.
Remove all rows that do not start from s.
Sign the accepted structure M``.
The structure M`` is Lexicographical Sorted list of rows
that
all are started from symbol s.
There are k of this rows.
The row j of M`` have its continuation the remainder of the
input stream.
Lets get back to the original BWT matrix M`.
Put the last column of matrix M` on the first place, before column
0.
Cut all rows that do not start from s.
The accepted structure is exactly M``.
In the process of Inverse BWT we always know the content of the
first
and last columns of matrix M`.
Now a part of values from this columns are the content of
columns 0 and 1 of structure M``.
We know that starting from symbol i, continuation of the source
stream is in row j of M``.
Therefore if symbol i is found in the column 0 of row j in M``,
then symbol i+1 must be found in the column 1 of row j.
Add two new symbols to to the output alphabet - 0a and
0b.
Start the loop here.
Check next symbol s from the input stream.
If symbol is 0, count how many consequent zeros are in this place in
the input stream.
Sign this count with m.
If s is not 0, output s.
Find binary representation of number m+1 -
2↓[log2(m+1)].
Output binary sequence, coded with
↓[log2(m+1)] bits.
Each bit 0 replace with symbol 0a and
each bit 1 replace with symbol 0b.
Continue the encoding loop.
y = ↓[x] here is a sign of round-down - find the maximal integer y,
y ≤ x and x is real.
Decoding RLE-0 is symmetric to encoding, however, if we determined
sequence of x symbols
from the group {0a, 0b}, we must count what number
m was encoded by this sequence.
If decoded binary number can be signed N and m = 2x + N - 1,
then
x = ↓[log(m+1)]
N = (m+1) - 2↓[log(m+1)]
m+1 - N = 2↓[log(m+1)]
log(m+1 - N) = ↓[log(m+1)]
x = log(m+1 - N)
2x = m+1 - N
m = 2x + N - 1
Here we found that source was containing m sequential 0`s that can be
putted
out in the decoding process.