Auxilary Compression Transforms.

Author:	Arkadi Kagan
	arkadi_kagan@hotmail.com
Document:	Entropy Compression Methods.

Auxilary Compression Transforms.

The content of this page is not a set of actual compression algorithms. It is rather a set of known transforms that make the compression step more efficient.

BWT - Burrows-Wheeler Transform.

Burrows and Wheeler [1] proved that after transform called BWT the actual encoding by one of Entropy Coding algorithms can gain its compression ratio. The original authors compression was to chain algorithms this way:
BWT --> MTF --> RLE-0 --> Entropy Coder
But there are a lot of variants. I will cover only the transform itself.

Forward BWT.

Input is sequence x of length n.
Define n sequences. The i`th sequence is the same sequence x, rotated by i-1 symbols.
This sequences can be drawn as matrix M(x):
M(x) =

x₁ x₂ ... x_n-1 x_n

x₂ x₃ ... x_n x₁

...

x_n-1 x_n ... x_n-3 x_n-2

x_n x₁ ... x_n-2 x_n-1
Sort rows of the matrix M(x) in lexicographical order. The result of this operation can be signed as matrix M`(x).
The output of BWT is:
1. The rightmost column of the matrix M`(x). It will be signed as x^bwt
2. The sequential number of row in the matrix M`(x), containing the sequence x. Lets sign this row number R(x).

Inverse BWT.

The inverse transform is not too intuitive, but Burrows and Wheeler [1] proved it to be right.

The input to the Inverse BWT is sequence x^bwt and index R(x).
Lets restore the known parts of matrix M`(x):
- The last column is known and it is x^bwt.
- Sort the sequence x^bwt.
  M`(x) is known to be lexicographically sorted. Therefore the sorted x^bwt is the first column of M`(x).
We signed R(x) to be index of row in M`(x) containing the sequence x. The last letter l_n of row R(x) in the matrix M`(x) is the last letter of the sequence x.
Output the last letter that is l_n.
Assign R to be R(X).
Assign m to be equal to n.

The loop starts here.

Count the occurrences of letters l_m in the last column of M`(x) before row R. Sign k the counted number.
Count occurrence k+1 in the first column for letter l_m. Assign this row index to R.
The searched letter l_m-1 is found in the last column of row R. Output the letter.
Decrease m by 1.
Repeat the loop until all n letters of x are known.

Here is explanation why Inverse BWT work.

Assume i`th symbol in the source sequence is s.
Assume we have k symbols s in the source stream.
Sign row in M` that contain the continuation of the stream, starting from the symbol i in the first column. Assume it is row number j between those starting from symbol s.
Remove all rows that do not start from s. Sign the accepted structure M``.
- The structure M`` is Lexicographical Sorted list of rows that all are started from symbol s.
- There are k of this rows.
- The row j of M`` have its continuation the remainder of the input stream.
Lets get back to the original BWT matrix M`.
Put the last column of matrix M` on the first place, before column 0.
Cut all rows that do not start from s. The accepted structure is exactly M``.
In the process of Inverse BWT we always know the content of the first and last columns of matrix M`. Now a part of values from this columns are the content of columns 0 and 1 of structure M``.
We know that starting from symbol i, continuation of the source stream is in row j of M``. Therefore if symbol i is found in the column 0 of row j in M``, then symbol i+1 must be found in the column 1 of row j.

MTF - Move To Front.

The idea of algorithm is to compress instead of arbitrary symbols, indexes of symbols specially ordered. This algorithm can be used as pre-processing step for LZ, Huffman, Arithmetic Codings and others.

Arrange the source alphabet to a list of indexes.

Start the loop here.

Get the next input symbol s.
Find index[s] - index of symbol s in the list of indexes.
Output index[s].
Move index[s] to front. If the next encoded symbol is s again, its index will be the smallest possible, in our case index`[s] = 0.
Continue the loop until the last symbol is encoded.

Decoding of MTF is stay-forward, however here it is:

Arrange the source alphabet to a list of indexes. Assure to have initial order the same as for encoding.

Start the loop here.

Get the next encoded index - index[s].
By given index[s], find symbol s according to the list of indexes.
Output symbol s.
Move index[s] to front.
Continue loop until the last symbol is decoded.

RLE - Run Length Encoding.

In opposite to other algorithms in this document, RLE is indeed compression. However, compression ratio of RLE is so small that it is very rarely used as real compression method. One of once popular Animator Pro file formats ".cli" is indeed using RLE as its only compression method. Other usages of RLE that I know are using RLE as preprocessing or postprocessing method.

Select symbol to be marked Escape symbol. This symbol must appear in potential source as rarely as possible. The best is if not at all.

Start the loop here.

Count how many times the next symbol s appear in the input stream.
- If count > Limit or s is Escape symbol, output Escape symbol, count and symbol s.
- Otherwise, output uncoded symbol s.
Skip count symbols.
Continue the loop until all input stream is encoded.

Decoding RLE is symmetric to encoding.

Clarify, what symbol is Escape symbol. This symbol must be the same that was used for encoding.

Start the loop here.

Read the next input symbol s.
- If the last entity is not Escape symbol
  - Output s.
- Otherwise
  - Read count of symbols.
  - Read actual symbol t that was encoded.
  - Output symbol t count times.

In this description I was using Limit constant. This constant depend on the way, how symbols and "count" are encoded. For example, if symbols (including Escape) and count are encoded one byte each, Limit could be 3 symbols. The idea of choosing Limit constant is to avoid unnecessary encoding of symbol sequences that are shorter without encoding.

RLE-0 - Zero Run Length Encoding.

If we precisely know the most frequent symbol, we can improve RLE coding to encode only this symbol and leave all others uncoded. There are endless variants to implement such basic idea, however RLE-0 is used in popular [1] implementation of BWT.

Add two new symbols to to the output alphabet - 0_a and 0_b.

Start the loop here.

Check next symbol s from the input stream.
If symbol is 0, count how many consequent zeros are in this place in the input stream. Sign this count with m.
If s is not 0, output s.
Find binary representation of number m+1 - 2^{↓[log₂(m+1)]}.
Output binary sequence, coded with ↓[log₂(m+1)] bits. Each bit 0 replace with symbol 0_a and each bit 1 replace with symbol 0_b.
Continue the encoding loop.

y = ↓[x] here is a sign of round-down - find the maximal integer y, y ≤ x and x is real.
Decoding RLE-0 is symmetric to encoding, however, if we determined sequence of x symbols from the group {0_a, 0_b}, we must count what number m was encoded by this sequence.
If decoded binary number can be signed N and m = 2^x + N - 1, then

x = ↓[log(m+1)]
N = (m+1) - 2^{↓[log(m+1)]}
m+1 - N = 2^{↓[log(m+1)]}
log(m+1 - N) = ↓[log(m+1)]
x = log(m+1 - N)
2^x = m+1 - N
m = 2^x + N - 1

Here we found that source was containing m sequential 0`s that can be putted out in the decoding process.

Elias Universal Coding of integers.

Elias Universal Coding is intended for universal coding of individual integers that can be distinguished in unmarked stream of bits. This method can provide us with relatively compact representation of unlimited integers.
Notice that Elias coding is not "Prefix" coding - several numbers can start from the same sequence.
The natural assumption when choosing Elias Coding is that smaller numbers are more frequent then bigger numbers.
Elias Universal code can be of two types.

Elias Gamma Code.
Γ(x) is binary value of integer x, prefaced by ↓[lg(x)] zeros. The most significant bit of value x that is encoded must be 1.

Elias Delta Code.
Δ(x) is encoded as Γ(↓[lg(x)] + 1), followed by numeric value of x.

Fibonacci Universal Coding of integers.

Fibonacci Code is intended for universal coding of individual integers that can be distinguished in unmarked stream of bits. The intention of Fibonacci Code is similar to Elias Code, however the method of encoding is quit different.
Encoding Fibonacci Universal Code for given number x.

Compute the maximal Fibonacci number F_max, F_max < x.
- Start with F = 1 and P = 1.
- If F + P < x
  - F` = F + P
  - P = F
  - F = F`
  - Continue the loop.
- Else F_max = F is the searched number
Assign F = F_max and P the previous to F Fibonacci number.

Start the loop.

Output the next bit of Fibonacci Code, starting from the last.
- If x > F
  - Output bit 1.
  - x = x - F
- Otherwise output bit 0.
P` = F - P
F = P
P = P`
Continue the loop until x is not zero.

Decode a sequence of bits, coded with Fibonacci code. Notice the opposite order of bits in encoding and decoding.

Find the two last 1`s that are the sign for end of Fibonacci Code.
Initialize F = 1, P = 1 and sum = 0.
Start decoding from the first bit of Fibonacci Code.

Start the loop.

If current bit in Fibonacci Code is 1,
- sum = sum + F.
Compute the next Fibonacci number.
F` = F + P
P = F
F = F`
Loop until all the number is decoded.
The variable sum is holding now the decoded value.

References:

[1] A Block-sorting Lossless Data Compression Algorithm.
M. Burrows and D.J. Wheeler.
http://gatekeeper.dec.com/pub/DEC/SRC/research-reports/abstracts/src-rr-124.html

[2] Improvements to Burrows-Wheeler Compression Algorithm.
Sebastian Deorowicz.
http://www-zo.iinf.polsl.gliwice.pl/~sdeor/pub/deo00.ps

[3] Compression Basics.
Pasi Albert Ojala.
http://www.cs.tut.fi/~albert/Dev/pucrunch/packing.html

[4] Conversion between Fibonacci Code and Integers.
Lawrence J. Krakauer.
http://ljkrakauer.com/weight.htm

[5] Data Compression with the Burrows-Wheeler Transform.
Mark Nelson.
http://dogma.net/markn/articles/bwt/bwt.htm