The main idea of Arithmetic coding is to represent a sequence of symbols by a single
number of very high precision.
Let's try to define the basic algorithm for Binary Arithmetic Coding.
Encoding with Binary Arithmetic Coding.
Select a range [α, β); let the initial range be [0, 1).
Any range [α, β) is a subrange of the maximal range [0, 1).
Represent α and β as binary floating point numbers.
The position of the floating point is known, therefore α and β can be represented
as long sequences over {0, 1}.
Select a precision L as the number of bits used to represent α or β.
A value equal to 0 is written as (0)^L, the sequence of L zeros.
A value that is less than 1 but closest to 1 is written as (1)^L, the sequence
of L ones.
Let N be the size of the source alphabet.
Start the encoding loop.
Slice the range [α, β) into N subranges.
Denote by a the index of a symbol in the source alphabet, and by f_a the
probability of occurrence of symbol a.
The subrange corresponding to symbol a is [α_a, β_a), where
α_a = α + (β − α)·∑_{i=0}^{a−1} f_i
β_a = α_a + f_a·(β − α)
Read the next input symbol a.
Compute its subrange [α_a, β_a).
Make this subrange the new range [α, β).
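A sketch of this slicing step in C, reusing the constants above. It assumes the probabilities are kept as a cumulative table of integer occurrence counts, cum[i] being the total count of symbols 0..i−1, so that cum[i]/cum[N] plays the role of ∑_{j<i} f_j (the statistics section below arrives at the same representation):

    /* Slice [alpha, beta) into N subranges and select the one of symbol a:
       alpha_a = alpha + (beta - alpha) * sum_{i<a} f_i
       beta_a  = alpha_a + (beta - alpha) * f_a
       beta is stored filled with 1s, i.e. as an inclusive upper bound. */
    static void select_subrange(uint32_t *alpha, uint32_t *beta,
                                const uint32_t *cum, uint32_t N, uint32_t a)
    {
        uint64_t width = (uint64_t)(*beta - *alpha) + 1;
        uint32_t lo = *alpha + (uint32_t)(width * cum[a]     / cum[N]);
        uint32_t hi = *alpha + (uint32_t)(width * cum[a + 1] / cum[N]) - 1;
        *alpha = lo;
        *beta  = hi;
    }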
Check whether α and β are in the state of underflow:
Denote the most significant bit of α by msb[α].
In the case of underflow, msb[α] must be zero and msb[β] must be one.
The opposite case must not happen; it would be a mark of a defective
algorithm, since α must stay below β.
Count the bits of α and β starting from the bit next to the msb, and
continue counting while bit_i[α] is one and bit_i[β] is zero.
Shift the counted bits out without producing any output:
α = (α << count) ^ 2^{L−1}
β = ((β << count) | (2^count − 1)) | 2^{L−1}
Here β is shifted by count bits and the freed places are filled with 1s,
in contrast to α, which is filled with 0s.
'|' denotes the binary 'or' operation, '^' denotes binary 'exclusive or',
and '<<' denotes the binary 'shift left' operation. The freed least
significant bits are filled with zeros by '<<', and the bits more
significant than bit L are dropped.
Increase the underflow counter by count.
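The underflow step might look like this in C, under the same illustrative conventions; underflow is the pending-bits counter:

    /* Underflow: msb[alpha] = 0, msb[beta] = 1 and the range keeps
       straddling 1/2.  Remove the bits next to the msb while
       bit_i[alpha] = 1 and bit_i[beta] = 0, producing no output yet. */
    static void handle_underflow(uint32_t *alpha, uint32_t *beta,
                                 uint32_t *underflow)
    {
        uint32_t count = 0;
        while (count < L - 1 &&
               ((*alpha >> (L - 2 - count)) & 1) == 1 &&
               ((*beta  >> (L - 2 - count)) & 1) == 0)
            count++;
        if (count == 0)
            return;
        /* alpha = (alpha << count) ^ 2^(L-1): the bit shifted into the msb
           position is one of the removed 1s, and the xor clears it to 0. */
        *alpha = ((*alpha << count) & MASK) ^ TOP;
        /* beta is filled with 1s on the right, and its msb is forced to 1. */
        *beta = (((*beta << count) & MASK) | (((uint32_t)1 << count) - 1))
                | TOP;
        *underflow += count;
    }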
Check whether α and β are in the state of overflow:
Count the bits of α and β, starting from the most significant, while bit i
of α is equal to bit i of β. Denote by count the number of counted bits.
If count is not zero, do the following:
Output msb[α].
Output !msb[α] underflow times (if underflow is zero, output nothing).
Here '!' means the bit value opposite to the given one.
This decision comes from the following thought: if msb[α] (and msb[β]) is
zero, the chosen range lies closer to α than to β. The bits removed from α
during underflow were a sequence of underflow 1s, therefore the output must
be a sequence of 1s. If msb[α] is one, the computation is symmetric.
Set the underflow counter to zero.
Output count−1 equal bits starting from the most significant, but not
including the msb, which was already put out.
Shift all these bits out of α and β. The new α and β are computed like this:
α = α << count, or equivalently α = α·2^count
β = (β << count) | (2^count − 1)
As before, the results are truncated to L bits, i.e. a binary 'and' ('&')
with 2^L − 1.
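In C the overflow step could be sketched like this; put_bit is an assumed bit-output routine, not something the text defines:

    extern void put_bit(int bit);    /* assumed output routine */

    /* Overflow: the leading bits of alpha and beta agree; they are final
       and can be output, together with any pending underflow bits. */
    static void handle_overflow(uint32_t *alpha, uint32_t *beta,
                                uint32_t *underflow)
    {
        uint32_t count = 0;
        while (count < L &&
               ((*alpha >> (L - 1 - count)) & 1) ==
               ((*beta  >> (L - 1 - count)) & 1))
            count++;
        if (count == 0)
            return;
        int msb = (int)((*alpha >> (L - 1)) & 1);
        put_bit(msb);                         /* msb[alpha]                */
        for (; *underflow > 0; --*underflow)  /* pending bits, inverted    */
            put_bit(!msb);
        for (uint32_t i = 1; i < count; i++)  /* remaining count-1 bits    */
            put_bit((int)((*alpha >> (L - 1 - i)) & 1));
        *alpha = (*alpha << count) & MASK;    /* freed bits filled with 0s */
        *beta  = ((*beta << count) & MASK) | (((uint32_t)1 << count) - 1);
    }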
Continue the main loop until the last input symbol.
If, after quitting the main loop, the underflow counter is not zero:
Output msb[β].
Output underflow bits of !msb[β].
Output the remainder of β until the last '1' is printed out, or until there
are no more non-zero bits in α.
The alternative is to put out all the remaining bits of β; this overhead is
not too big.
We want the output floating point number to lie inside the final range.
For example, if the range is [0.001, 0.1011) and we have precision for only
2 bits, then we get α = 0.00 and β = 0.10.
Here β is inside the given range, while 0.00 < 0.001, so α falls outside of
it; this is why the remainder is taken from β.
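A C sketch of this flush with the same assumed put_bit; for simplicity it takes the 'not too big overhead' alternative and outputs all the remaining bits of β:

    /* Flush at the end of encoding: msb[beta], then the pending underflow
       bits inverted, then the rest of beta. */
    static void flush(uint32_t beta, uint32_t underflow)
    {
        int msb = (int)((beta >> (L - 1)) & 1);
        put_bit(msb);
        while (underflow-- > 0)
            put_bit(!msb);
        for (uint32_t i = 1; i < L; i++)
            put_bit((int)((beta >> (L - 1 - i)) & 1));
    }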
Decoding of Binary Arithmetic Code.
Decoding of Arithmetic code is generally symmetric to encoding, with some
minor differences.
Given L, the precision used for encoding, let's choose the precision for
decoding:
If we choose a precision less than L, it will be possible to decode wrong
symbols.
If the decoding precision is higher than or equal to L, decoding will always
be correct.
In the rest of the decoding description I will not distinguish between L and
the decoding precision.
Read the first L coded bits into a variable a.
Define the range [α, β), initially [0, 1).
Start of the decoding loop.
Slice the range [α, β) exactly as it was done for encoding.
Find the subrange corresponding to the binary number a, and denote by x the
symbol corresponding to this subrange.
Compute the subrange exactly as it was done for encoding:
α_x = α + (β − α)·∑_{i=0}^{x−1} f_i
β_x = α_x + f_x·(β − α)
Shift a by underflow bits:
a = ((a << underflow) ^ 2^{L−1}) | (msb[a] << (L−1))
This computation is exactly the same as the one done for α, except that
msb[a] is preserved.
Read the next underflow coded bits into a.
Check for overflow: count how many bits of α are equal to the bits of β at
the same positions, starting from the msb and stopping at the first
mismatched bit.
Shift the overflow bits out of α, β and a. The shifts work as described
before:
α = α << overflow
β = (β << overflow) | (2^overflow − 1)
a = a << overflow
Read the next overflow bits into a.
Continue the decoding loop.
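The body of the decoding loop can be sketched in C as follows, reusing select_subrange and the constants from the encoding sketches. get_bit is an assumed input routine, the symbol search is a linear scan for brevity, and the renormalization shifts one bit at a time, which is equivalent to shifting by the whole count at once:

    extern int get_bit(void);    /* assumed input routine, returns 0 or 1 */

    /* Decode one symbol: find the subrange containing a, narrow the range,
       then renormalize alpha, beta and a together. */
    static uint32_t decode_symbol(uint32_t *alpha, uint32_t *beta,
                                  uint32_t *a, const uint32_t *cum,
                                  uint32_t N)
    {
        uint64_t width = (uint64_t)(*beta - *alpha) + 1;
        uint32_t x = 0;
        while (x + 1 < N &&
               *alpha + (uint32_t)(width * cum[x + 1] / cum[N]) <= *a)
            x++;
        select_subrange(alpha, beta, cum, N, x);
        /* overflow: shift out the leading bits where alpha and beta agree */
        while (((*alpha ^ *beta) & TOP) == 0) {
            *alpha = (*alpha << 1) & MASK;
            *beta  = ((*beta << 1) & MASK) | 1;
            *a     = ((*a << 1) & MASK) | (uint32_t)get_bit();
        }
        /* underflow: alpha = 01..., beta = 10...; drop the bit next to the
           msb from alpha, beta and a, keeping msb[a] as it is */
        while ((*alpha & (TOP >> 1)) != 0 && (*beta & (TOP >> 1)) == 0) {
            *alpha = ((*alpha << 1) & MASK) ^ TOP;
            *beta  = (((*beta << 1) & MASK) | 1) | TOP;
            *a     = ((((*a << 1) & MASK) ^ TOP) | (*a & TOP))
                     | (uint32_t)get_bit();
        }
        return x;
    }

Before the first call, a is primed with the first L coded bits, and the decoder must be given the same cum[] table that the encoder used.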
Notice that in the decoding process we do not have to keep the underflow
counter between iterations. During the encoding we did not know which bits
had to be put out until the next overflow; in the decoding these bits are
already present in a, so there is no further use for this information.
The n-ary Arithmetic Coding.
The described coding is binary because the output of the coding is a
high-precision floating point number in binary representation.
Binary Arithmetic Coding is not the only possible Arithmetic Coding;
let's try to construct an n-ary Arithmetic Coder.
The algorithm for constructing an n-ary Coder is quite similar to the Binary
Coder, including the care for underflow and overflow.
We decide that we are in the underflow position if:
α_{s_0} = β_{s_0} − 1,
β_{s_1} = 0,
α_{s_1} = n − 1.
Here γ_{s_i} denotes digit i of the number γ, counting from the most
significant digit γ_{s_0}.
All the shifts used in the Binary Coder become shifts of digits instead of
bits. When shifting β, it must be filled with digits n−1 instead of 1s, as
in the sketch below.
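For illustration, the digit shift of such a coder could look like this in C, storing α and β as ordinary integers made of D base-n digits; all names are assumptions of this sketch:

    #include <stdint.h>

    /* Shift one base-n digit out of alpha and beta.  pow_nD = n^D is the
       n-ary analogue of 2^L.  alpha's freed digit is filled with 0 and
       beta's with n-1, mirroring the 0s and 1s of the binary coder. */
    static void shift_digit(uint64_t *alpha, uint64_t *beta,
                            uint64_t n, uint64_t pow_nD)
    {
        *alpha = (*alpha * n) % pow_nD;
        *beta  = (*beta  * n) % pow_nD + (n - 1);
    }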
In the rest of this document I will refer to the Binary Arithmetic
Coder.
Precision of Arithmetic Coder.
In the definition of the Arithmetic Coding algorithm I used the symbol L to
denote the precision of the values α and β, that is, the number of bits used
to represent α and β.
If the precision is too low, we can end up in the middle of encoding with
α equal to β, and even then this interval may have to fit more than one
symbol. For this reason the precision L must be big enough that for any
symbol s, f_s can be coded by L−3 bits of a binary floating point number,
or simply f_s ≥ 1/2^{L−3}.
Why do I insist on the probability being encodable with L−3 bits?
The maximal value of α without overflow is .0(1)^{L−1}, i.e.
(2^{L−1} − 1)/2^L.
The minimal value of β without overflow is .1(0)^{L−1}, i.e. 2^{L−1}/2^L.
With this α, the minimal β' that does not cause underflow is .11(0)^{L−2},
i.e. (2^{L−1} + 2^{L−2})/2^L.
β' − α = (2^{L−2} + 1)/2^L; pairing the minimal β with the maximal α' that
avoids underflow gives the same width, so β' − α is the minimal width of a
range [α, β) that can appear during the computation of Binary Arithmetic
Coding while not in a state of overflow or underflow.
To distinguish symbol s from the next symbol s+1, the subrange [α_s, β_s)
must be big enough; particularly, β_s − α_s ≥ 1/2^L. Now
β_s − α_s = (β' − α)·f_s
(β' − α)·f_s ≥ 1/2^L
((2^{L−2} + 1)/2^L)·f_s ≥ 1/2^L
f_s ≥ 1/(2^{L−2} + 1)
Since 1/2^{L−3} > 1/(2^{L−2} + 1), the condition f_s ≥ 1/2^{L−3} is
sufficient.
Finally, 1/2^{L−3} = .(0)^{L−4}1, which means f_s must be big enough to be
representable within the first L−3 bits of a binary fraction. For example,
with L = 16 every symbol probability must be at least 1/2^13 = 1/8192.
Gathering Statistics.
The next open question is how to gather and pass statistics data.
Static coding.
Given an input sequence of symbols, compute the probability of occurrence of
each symbol. With these statistics, the Arithmetic Coder is able to encode
the input sequence, hopefully with fewer bits than the original input
sequence takes.
As mentioned previously, sometimes the precision L is too low for one or
more input symbols, say symbol s.
There are several ways to handle this situation; here are some of them:
Set f'_s ≥ 1/2^{L−3} and recalculate all the other probabilities to preserve
the basic property of a probability function:
∑_{i=0}^{N−1} f'_i = 1, where N is the size of the alphabet.
Assume there is a symbol s with probability f_s < 1/2^{L−3}, and let's
choose a new probability f'_s ≥ 1/2^{L−3}. After this substitution, however,
the probabilities no longer sum to 1, so f' is not yet a valid probability
function. Let's update every symbol i ≠ s in order to create a probability
function f':
f'_i = f_i·(1 − f'_s)/(1 − f_s)
After this calculation,
∑_{i=0}^{N−1} f'_i
= ∑_{i=0}^{N−1} f_i·(1 − f'_s)/(1 − f_s) − f_s·(1 − f'_s)/(1 − f_s) + f'_s
= (∑_{i=0}^{N−1} f_i − f_s)·(1 − f'_s)/(1 − f_s) + f'_s
= (1 − f_s)·(1 − f'_s)/(1 − f_s) + f'_s = 1.
This means the new f' is a valid probability function.
Repeat the described operation until every symbol probability can be
represented by L−3 non-zero bits, that is, until there exists no s with
f_s < 1/2^{L−3}.
This process can be made a bit faster, although not asymptotically:
Gather all symbols with f_s < 1/2^{L−3}.
Start the loop here.
Denote p = ∏ (1 − f'_i)/(1 − f_i), taken over every i with
f_i < 1/2^{L−3}, 0 ≤ i < N, where each such f'_i = 1/2^{L−3}.
For each s with f_s ≥ 1/2^{L−3}, set f'_s = p·f_s.
For each s with f_s < 1/2^{L−3}, set f'_s = 1/2^{L−3}.
Repeat the loop until no symbol remains with f_s < 1/2^{L−3}.
Notice that this solution degrades the accuracy of the statistics and
therefore decreases the compression ratio.
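A C sketch of the basic one-symbol-at-a-time renormalization described above, using double probabilities purely for clarity; fmin stands for 1/2^{L−3}:

    /* Raise every probability below fmin and rescale the others by
       (1 - f'_s)/(1 - f_s), repeating until all f[i] >= fmin.
       If N * fmin > 1 the constraint is unsatisfiable and this loops
       forever, as the text warns for the count-based variant below. */
    static void enforce_min_probability(double *f, uint32_t N, double fmin)
    {
        for (;;) {
            int s = -1;
            for (uint32_t i = 0; i < N; i++)
                if (f[i] < fmin) { s = (int)i; break; }
            if (s < 0)
                return;                    /* every f[i] >= fmin: done */
            double scale = (1.0 - fmin) / (1.0 - f[s]);
            f[s] = fmin;                   /* f'_s = 1/2^(L-3) */
            for (uint32_t i = 0; i < N; i++)
                if ((int)i != s)
                    f[i] *= scale;         /* f'_i = f_i (1-f'_s)/(1-f_s) */
        }
    }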
Instead of computing real symbol probabilities we can count symbol
occurrences in the input sequence:
f_s = count[s]/⟨source length⟩
The limitation on the probabilities remains the same, f_s ≥ 1/2^{L−3}, or
count[s] ≥ ⟨source length⟩/2^{L−3}.
To assure this property we can create pseudo-counts that approach this
limitation. The algorithm must loop over the symbols until every occurring
symbol's probability can be represented with L−3 bits.
Notice that if representing all symbols is impossible, this algorithm will
not do miracles: it will loop forever.
This representation also pushes the probability computations into the main
part of the Arithmetic Coding algorithm, where they are done in place.
In particular, splitting the range into subranges can be rewritten like
this:
α_a = α + (β − α)·∑_{i=0}^{a−1} f_i
    = α + (β − α)·∑_{i=0}^{a−1} count[i]/⟨source length⟩
β_a = α_a + (β − α)·f_a = α_a + (β − α)·count[a]/⟨source length⟩
Since α and β are kept as L-bit integers, i.e. the true values scaled by
2^L, multiplying both equations by 2^L shows that every subrange boundary
is a ratio of integer products:
2^L·α_a = 2^L·α + (2^L·(β − α))·∑_{i=0}^{a−1} count[i]/⟨source length⟩
2^L·β_a = 2^L·α_a + (2^L·(β − α))·count[a]/⟨source length⟩
This way we do not need real floating point operations at all; every
operation is an integer operation, which is usually faster.
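The cumulative table consumed by the select_subrange sketch above is then built directly from the counts, so no floating point appears anywhere:

    /* Build the cumulative count table: cum[i] = count of symbols 0..i-1,
       cum[N] = <source length>.  cum[i]/cum[N] plays the role of the
       cumulative probability, and all range arithmetic stays integer. */
    static void build_cum(const uint32_t *count, uint32_t *cum, uint32_t N)
    {
        cum[0] = 0;
        for (uint32_t i = 0; i < N; i++)
            cum[i + 1] = cum[i] + count[i];
    }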
Adaptive coding.
The adaptive scheme is intended to avoid preprocessing the input stream
before encoding. It gives us two main benefits:
Decompression can be started before compression is finished.
There is no need to pass statistics data together with the compressed data
stream.
The main disadvantages are:
Complexity: statistics data must be calculated and interpreted on the fly.
The compression ratio is generally lower than in static coding, although for
short input sequences the benefit of not sending statistics data can be more
significant.
An adaptive scheme for Binary Arithmetic Coding can be achieved by two main
methods (a sketch follows at the end of this section):
First, at the start of the encoding process assume all possible symbols have
an equal probability of occurrence, and after encoding each new symbol
update this symbol's probability. Except for this small difference, the rest
of the algorithm is exactly the same as the static Arithmetic Coding
algorithm, including the check that each possible symbol's probability can
be represented by L−3 bits.
Second, set the initial probability of occurrence of each symbol to zero and
define a special symbol with non-zero probability: the Escape symbol.
This algorithm can be implemented like this:
Initialize all internal structures as defined by Arithmetic Coding, and set
the probabilities table as defined above.
Start the encoding loop here.
Read the next input symbol s from the input stream.
Check the probability of symbol s: if f_s can be represented by L−3 bits,
f_s ≥ 1/2^{L−3}, assign t = s. Otherwise assign t = Escape symbol; the
symbol s itself must then be passed to the decoder in some agreed raw form.
Slice the range [α, β) exactly as it was done for static Arithmetic Coding.
Encode symbol t with Arithmetic Coding by computing the new α and β.
If needed, update the probability of the Escape symbol, ensuring
f_Esc ≥ 1/2^{L−3}.
Repeat the loop until the last symbol is encoded.
Finish the Arithmetic Coding by outputting the underflow bits and the
remainder.
Notice that Adaptive Coding in this form does not require long
recomputations for every symbol that cannot be represented by L−3 bits.
Both of these schemes benefit from representing symbol probabilities by
symbol counts rather than by real floating point probabilities. In that case
updating a symbol's probability is a single operation, whereas recomputing
real probabilities would require work for every symbol.
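To tie the pieces together, here is one possible shape of the adaptive loop body in C, over counts as suggested above. encode_with_model stands for the slice-and-renormalize steps sketched earlier, put_raw_symbol is an assumed helper (an escaped symbol must reach the decoder in some agreed uncoded form), and dropping the Escape branch gives the first, equal-initial-probability method:

    extern void encode_with_model(uint32_t t, const uint32_t *cum,
                                  uint32_t n_symbols);   /* assumed */
    extern void put_raw_symbol(uint32_t s);              /* assumed */

    /* Escape scheme: the Escape symbol occupies index N, so the model has
       N+1 symbols and cum[] has N+2 entries.  Initially count[i] = 0 for
       every real symbol and count[N] = 1 for Escape. */
    static void encode_adaptive(uint32_t s, uint32_t *count, uint32_t *cum,
                                uint32_t N)
    {
        /* f_s >= 1/2^(L-3)  <=>  count[s] * 2^(L-3) >= total count */
        int ok = ((uint64_t)count[s] << (L - 3)) >= cum[N + 1];
        uint32_t t = ok ? s : N;        /* N = index of the Escape symbol */
        encode_with_model(t, cum, N + 1);
        if (!ok)
            put_raw_symbol(s);          /* assumed raw emission of s */
        count[s]++;                     /* the adaptive update */
        build_cum(count, cum, N + 1);   /* a fuller version would also keep
                                           f_Esc >= 1/2^(L-3) here */
    }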
Advanced possibilities for Arithmetic Coding.
MTF - the Move To Front scheme is applicable to Arithmetic Coding.
PPM - Prediction by Partial Match is applicable too.
BWT - the Burrows-Wheeler Transform can improve the compression ratio.