
ELECTRONICS AND ELECTRICAL ENGINEERING

ISSN 1392-1215. 2010. No. 1(97)


ELEKTRONIKA IR ELEKTROTECHNIKA

SYSTEM ENGINEERING, COMPUTER TECHNOLOGY


T 120
SISTEMŲ INŽINERIJA, KOMPIUTERINĖS TECHNOLOGIJOS

Fibonacci Coding Within the Burrows-Wheeler Compression Scheme


R. Bastys
Faculty of Mathematics and Informatics, Vilnius University,
Naugarduko str. 24, LT-03225 Vilnius, Lithuania, phone: +370-674-45577, e-mail: rbastys@yahoo.com

Introduction

The Burrows-Wheeler algorithm (BWA for short) is a lossless data compression scheme named after its authors, Michael Burrows and David Wheeler; the classical work here is [1]. Also known as block sorting, it is currently among the best textual data archivers in terms of compression speed and ratio. In this work we describe the working principles of our BWA-based data compressor implementation and compare it with some other popular file archivers. As for prerequisites, the reader is expected to be familiar with basic lossless data compression techniques.

Burrows-Wheeler compression scheme

In this section we provide a detailed exposition of BWA and review some of the standard facts on lossless data compression.

The original Burrows-Wheeler scheme archives an input string s (we use "caracaras" throughout our examples) in three major steps.

STEP 1: Calculate the Burrows-Wheeler transformation (BWT) of s; we denote it briefly by BWT(s). The transformation permutes the input string symbols as follows:
1. Build the matrix of the input string's cyclic permutations.
2. Sort the matrix rows ascending.
3. Output the last column of the sorted matrix (Fig. 1).

Fig. 1. BWT of string "caracaras"

To calculate the inverse transformation (i.e. restore s provided BWT(s)) one must also know the index of the original string in the sorted matrix [1]. Hence the complete BWT output is the pair (BWT(s), index).

STEP 2: Run the Move-To-Front (MTF) transformation on the BWT output. The MTF algorithm renders the BWT output into a sequence of integers:
1. Fix some alphabet permutation, e.g. sort it ascending.
2. Encode the next message symbol by its position in the current alphabet permutation.
3. Move the encoded symbol to the beginning of the alphabet.
4. Repeat steps 2 and 3 until the whole message is encoded (Fig. 2).

Fig. 2. MTF transformation of string "rccrsaaaa"

STEP 3: Encode the MTF output with any entropy encoder (EE), e.g. Huffman [2] or arithmetic [3] coding.

Thus, the entire Burrows-Wheeler compression scheme may be written as

EE(MTF(BWT(s))). (1)

From now on, s stands for a string over the alphabet Σ = {a₁, …, aₘ}. Assume each aᵢ appears kᵢ times in s, which fixes the length of s to be n = k₁ + … + kₘ. The entropy of s, denoted by H(s), is defined to be the sum

H(s) = Σᵢ (kᵢ/n)·log₂(n/kᵢ). (2)
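The first two steps (and the inverse BWT) can be illustrated with a short Python sketch. This is a naive quadratic illustration of the definitions, not the paper's implementation; for brevity the MTF alphabet is taken from the string itself:

```python
def bwt(s):
    """Naive BWT: sort all cyclic rotations of s and output the last
    column together with the row index of s itself (needed to invert)."""
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations), rotations.index(s)

def ibwt(last, idx):
    """Invert BWT: repeatedly prepend the last column and re-sort;
    after len(last) rounds the table rows are the sorted rotations,
    and row idx is the original string."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    return table[idx]

def mtf(s):
    """MTF: emit each symbol's index in the current alphabet
    permutation, then move that symbol to the front."""
    alphabet = sorted(set(s))                # step 1: ascending permutation
    out = []
    for ch in s:
        i = alphabet.index(ch)               # step 2: position of the symbol
        out.append(i)
        alphabet.insert(0, alphabet.pop(i))  # step 3: move to front
    return out

last, idx = bwt("caracaras")
print(last, idx)        # rccrsaaaa 4
print(mtf(last))        # [2, 2, 0, 1, 3, 3, 0, 0, 0]
print(ibwt(last, idx))  # caracaras
```

The runs in "rccrsaaaa" become the runs of zeroes in the MTF output, which is exactly the effect the scheme exploits.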

Also referred to as Shannon's entropy, H(s) represents the minimum average number of bits required to encode one symbol of s. An equivalent formulation of this fact is

|EE(s)| ≥ T(s), (3)

where |EE(s)| denotes the archiver output file size in bytes and the theoretical lower bound T(s) is given by

T(s) = n·H(s)/8. (4)

On the other hand, any good EE algorithm achieves the reverse inequality

|EE(s)| ≤ T(s) + C, (5)

the constant C being relatively small and independent of s (see [2] and [3] for more details).

EXAMPLE 1: Let s = "caracaras". Then

H(s) = (4/9)·log₂(9/4) + 2·(2/9)·log₂(9/2) + (1/9)·log₂9 ≈ 1.84.

Now let us run MTF on BWT(s) = "rccrsaaaa" and check how it affects the entropy: MTF(BWT(s)) = (2, 2, 0, 1, 3, 3, 0, 0, 0).

This straightforward example demonstrates a couple of important facts. First, the structure of MTF(BWT(s)) makes it obvious that MTF renders recurring consecutive symbols into series of zeroes. Loosely speaking, skewed alphabet symbol probabilities decrease string entropy; it is therefore to be expected that on typical inputs the entropy of MTF(BWT(s)) falls below that of s (Fig. 10), and hence, by (4) and (5), that the final output is shorter as well. BWA elegantly brings these properties together. This is not obvious on the short "caracaras" example, so let us take another case. The BWT of a long text input, such as the Wikipedia article on the caracara (http://en.wikipedia.org/wiki/Caracara), should look similar to that in Fig. 3.

Fig. 3. Schematic BWT view of a long text input

It is evident that the transformation groups together symbols preceding similar contexts, thus producing long homogeneous chains in the last sorted matrix column. MTF converts them into series of zeroes, which reduces the string entropy and, in consequence, makes it a fitter input for any entropy encoder.

It is left to show that BWT can be calculated within a reasonable amount of time and does not require excessive PC memory usage. Compressing, say, an ordinary 1 MB file involves sorting a cyclic permutations matrix of size 2²⁰ × 2²⁰, which may become quite an expensive operation if an improper sorting algorithm is applied. The two leading BWT sort approaches are Bentley-Sedgewick sort (a modification of quicksort) and suffix sort. Both have their advantages and disadvantages, but we will not develop this point here. For a deeper discussion of these algorithms we refer the reader to [5]–[8].

One may mistakenly conjecture that BWA is suitable for compressing any input. Of course, technically it is possible to calculate both BWT and MTF of any file, but the overall BWA efficiency highly depends on the source characteristics, especially the amount of consistent patterns (to put it simply, words) in it. As a rule, BWT works well on plain text files (Fig. 4), yet it is rather useless when applied to high-entropy binary files (Fig. 5).

Fig. 4. BWT fragment of this article in .tex format: numerous runs of consecutive identical symbols

Fig. 5. BWT fragment of this article in .pdf format: chaotic structure, occasional runs of consecutive identical symbols

The next section is devoted to the study of the Distance Coding algorithm, which is in all likelihood the most successful MTF alternative at the second BWA step.
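The quantities in (2) and (4) are straightforward to compute; the following sketch computes the entropy of the example string "caracaras" directly from the definition:

```python
from collections import Counter
from math import log2

def entropy(s):
    """Zeroth-order entropy H(s) = sum over symbols of (k/n)*log2(n/k),
    in bits per symbol, as in (2). Works on strings and on integer lists."""
    n = len(s)
    return sum((k / n) * log2(n / k) for k in Counter(s).values())

def lower_bound_bytes(s):
    """Theoretical entropy-coding lower bound T(s) = n*H(s)/8 of (4), in bytes."""
    return len(s) * entropy(s) / 8

print(round(entropy("caracaras"), 2))  # 1.84
```

Since `Counter` accepts any iterable, the same function can be applied to MTF or DC output sequences to compare their entropies.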
Distance Coding

Distance Coding (DC) was originally proposed by Edgar Binder in the comp.compression newsgroup [9] in 2000. There is no official paper on DC by Binder, so we provide the algorithm (Fig. 6) here:
1. Mark all symbols as yet to be encoded (encircled in our example).
2. Encode the next message symbol a by the number of unencoded (encircled) characters till the next occurrence of a (or an end-of-file pointer if there are no a-s left).
3. Unmark the second a.
4. Repeat steps 2 and 3 until the whole message is encoded.

Fig. 6. DC algorithm for string "rccrsaaaa"

Assume we are encoding a symbol a whose right neighbor is not yet encoded. This clearly restricts the possible distances, since otherwise some other symbol would have already pointed to that neighbor. Such deductions allow us not to encode any extra information in these cases (steps 6, 10, 11 and 12 in Fig. 6), this way shortening the DC output.

Let us compile some basic facts on MTF(BWT(s)) and DC(BWT(s)), provided s satisfies the conditions of the previous section. We illustrate each property by our calculations on the Canterbury Corpus files [10].

1. Unlike MTF, DC truncates the input on account of the homogeneous chains in BWT(s) (Fig. 7).

Fig. 7. Canterbury Corpus .txt files: DC output is shorter than MTF output

2. MTF(BWT(s)) is a sequence of non-negative integers; so is DC(BWT(s)). Small numbers prevail in both (Fig. 8).

Fig. 8. Canterbury Corpus plrabn12.txt file: distributions of the MTF and DC output values

3. Since MTF encodes each symbol by its index in the alphabet, its output contains numbers no larger than m − 1. DC measures distances between identical symbols, and hence its output can contain numbers up to n (Fig. 9).

Fig. 9. Canterbury Corpus ptt5 file: the tail of the DC value distribution is longer and heavier than that of MTF

4. The previous property together with (2) also yields that the entropy of DC(BWT(s)) typically exceeds that of MTF(BWT(s)) (Fig. 10).
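The marking steps above can be sketched in Python. This is a loose, hypothetical reading for illustration only: it transmits nothing about first symbol occurrences and omits the shortening deductions of steps 6 and 10–12 in Fig. 6, so it is a distance-counting demonstration rather than a complete codec:

```python
def distance_code(s):
    """Simplified DC sketch: for each still-marked symbol, emit the number
    of marked characters before its next occurrence (or 'EOF' if none),
    then unmark that next occurrence. Already-unmarked positions emit
    nothing, so runs of identical symbols shrink the output."""
    marked = [True] * len(s)              # step 1: everything yet to be encoded
    out = []
    for i, a in enumerate(s):
        if not marked[i]:
            continue                      # occurrence already implied; no output
        marked[i] = False
        j = s.find(a, i + 1)              # next occurrence of a, if any
        if j == -1:
            out.append("EOF")             # end-of-file pointer
        else:
            out.append(sum(marked[i + 1:j]))  # step 2: count encircled symbols
            marked[j] = False             # step 3: unmark the second a
    return out

print(distance_code("rccrsaaaa"))  # [2, 0, 'EOF', 0, 0]
```

Note that the nine-symbol input produces only five tokens, illustrating property 1 above.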
Fig. 10. Canterbury Corpus .txt files: entropies of the MTF and DC outputs

Comparison of the theoretical lower bounds for the MTF- and DC-based schemes (which is basically a combination of properties 1 and 4) shows that DC has a higher potential (Fig. 11). Unfortunately, classical entropy encoders are not well adapted to compressing DC output due to its very large alphabet (property 3). Storing the alphabet creates significant overhead, thus decreasing the efficiency of the BWT + DC + EE scheme. This difficulty disappears entirely if we replace the entropy encoder by some universal code [11], such as Fibonacci.

Fibonacci Coding

Any positive integer N can be represented as the sum

N = Σᵢ dᵢ·Fᵢ, dᵢ ∈ {0, 1}, (6)

where Fᵢ is the i-th Fibonacci number (F₁ = 1, F₂ = 2, Fᵢ = Fᵢ₋₁ + Fᵢ₋₂). The Fibonacci Code (FC) [11] of N is defined by

FC(N) = d₁d₂…d_k1, (7)

i.e. the digits dᵢ written least significant first, followed by an appended 1. The important point to note is that no two adjacent coefficients dᵢ can equal 1, therefore the token 11 (formed by the leading digit d_k = 1 and the appended 1) immediately indicates the end of a code word. Thus, FC transforms any sequence of integers into a uniquely decodable binary string. The structure of FC also implies that smaller numbers are mapped into shorter code words (Table 1):

Table 1. Fibonacci Codes, N = 1, …, 10
N   FC(N)
1   11
2   011
3   0011
4   1011
5   00011
6   10011
7   01011
8   000011
9   100011
10  010011

Although in theory FC is not as effective as entropy encoders, we were interested in investigating the scheme in practice. The practical advantage of using the Fibonacci Code within the BWT + DC scheme lies in the fact that FC is a universal code, hence there is no need to store a bulky alphabet. Let us compare our implementation of the BWA scheme with the performance of some other data compressors on relatively large text files (Fig. 11):

Fig. 11. Canterbury Corpus .txt files: original lengths in bytes; theoretical lower bounds in bytes (4) for the MTF- and DC-based schemes; zip – popular commercial file archiver [12]; bzip2 – one of the best open source BWA implementations [13]

1. BWA-based compressors are roughly twice as effective as plain entropy encoders.
2. bzip2 output is shorter than the theoretical lower bound of the classical BWA scheme. The reason behind that is that bzip2 includes several additional compression layers, the most important being Run-Length Encoding (RLE) (see [14] and the references given there).
3. Both our compressor and bzip2 excel zip – the most popular commercial file archiver. We were mostly surprised to find that our compressor nearly achieves the BWT + DC + EE theoretical lower bound.

To sum up, BWT + DC + FC is a highly effective scheme for compressing plain text input (this also includes .html and .xml files, source code in various programming languages, etc.). The compression and decompression algorithms are simple and relatively fast. The scheme BWT + MTF + FC is also possible, but it is unlikely to achieve noticeable results, because even the theoretical lower bound of the MTF-based scheme is markedly greater than the results obtained with DC. Besides, it is inexpensive to use a regular entropy encoder together with MTF, since the MTF alphabet is usually small.
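Equations (6) and (7) translate into a short greedy encoder (a sketch: the greedy choice of the largest Fibonacci number produces the Zeckendorf digits, hence no two adjacent coefficients equal 1):

```python
def fibonacci_code(n):
    """Fibonacci code of a positive integer: Zeckendorf digits written
    least significant first, with a final '1' appended, so every code
    word ends in the unique stop token '11'."""
    fibs = [1, 2]                         # F1 = 1, F2 = 2, Fi = Fi-1 + Fi-2
    while fibs[-1] + fibs[-2] <= n:
        fibs.append(fibs[-1] + fibs[-2])
    code = []
    for f in reversed(fibs):              # greedy: take each fitting term
        if f <= n:
            code.append("1")
            n -= f
        elif code:                        # skip zeroes above the leading term
            code.append("0")
    return "".join(reversed(code)) + "1"  # appended 1 creates the '11' token

print([fibonacci_code(n) for n in range(1, 6)])
# ['11', '011', '0011', '1011', '00011']
```

Since FC(N) is defined only for positive N, a DC or MTF output value v would need to be encoded as FC(v + 1); this shift is our assumption, as the paper does not spell it out.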
Conclusions and Future Work

1. The main advantage of Distance Coding over MTF is the reduced input length; its disadvantage is the large output alphabet.
2. The Fibonacci Code is very well adapted to compressing DC output – the results obtained on text files are close to the theoretical lower bounds. A BWT + DC + FC encoded file requires very little metadata, since FC is a universal code.
3. It is natural to try out other universal codes within the Burrows-Wheeler scheme, possibly combining them with entropy codes and/or RLE.

References

1. Burrows M., Wheeler D. J. A block sorting lossless data compression algorithm // Technical Report 124, Digital Equipment Corporation, Palo Alto, California. – 1994.
2. Huffman D. A. A Method for the Construction of Minimum-Redundancy Codes // Proceedings of the Institute of Radio Engineers. – 1952. – P. 1098–1102.
3. Witten I., Neal R., Cleary J. Arithmetic Coding for Data Compression // Communications of the ACM. – 1987. – Vol. 30, No. 6. – P. 520–540.
4. Shannon C. E. A Mathematical Theory of Communication // Bell System Technical Journal. – 1948. – Vol. 27. – P. 379–423, 623–656.
5. Bentley J. L., Sedgewick R. Fast algorithms for sorting and searching strings // Proceedings of the 8th Annual ACM-SIAM Symposium on Discrete Algorithms. – 1997. – P. 360–369.
6. Larsson J., Sadakane K. Faster Suffix Sorting // Technical Report LU-CS-TR:99-214, Department of Computer Science, Lund University, Sweden. – 1999.
7. Sadakane K. A Fast Algorithm for Making Suffix Arrays and for Burrows-Wheeler Transformation // Proceedings of the IEEE Data Compression Conference, Snowbird, Utah. – 1998. – P. 129–138.
8. Sadakane K. A Comparison among Suffix Array Construction Algorithms. http://citeseer.ist.psu.edu/187464.html. – 1997.
9. Binder E. Distance Coding algorithm. http://groups.google.com/group/comp.compression/msg/27d46abca0799d12. – 2000.
10. University of Canterbury. The Canterbury Corpus. http://en.wikipedia.org/wiki/Canterbury_Corpus. – 1997.
11. Fraenkel A. S., Klein S. T. Robust universal complete codes for transmission and compression // Discrete Applied Mathematics. – 1996. – Vol. 64, No. 1. – P. 31–55.
12. Katz P. ZIP data compression algorithm. http://en.wikipedia.org/wiki/PKZIP.
13. Seward J. bzip2 data compression algorithm. http://bzip.org/.
14. bzip2 data compression algorithm description. http://en.wikipedia.org/wiki/Bzip2.
Received 2009 09 02

R. Bastys. Fibonacci Coding Within the Burrows-Wheeler Compression Scheme // Electronics and Electrical Engineering. – Kaunas: Technologija, 2010. – No. 1(97). – P. 28–32.
The Burrows-Wheeler data compression algorithm (BWA) is one of the most effective textual data compressors. BWA includes three main iterations: the Burrows-Wheeler transform (BWT), the Move-To-Front transformation (MTF) and some zeroth-order entropy encoder (e.g. Huffman). The paper discusses a little-investigated scheme in which MTF is replaced by the less popular Distance Coding (DC). Some relevant advantages and downsides of the modified scheme are indicated, the most critical being the heavy DC output alphabet. It is shown that applying the Fibonacci Code instead of an entropy encoder elegantly deals with this technical problem. The results we obtain on the Canterbury Corpus text files are very close to the theoretical lower bounds. Our compressor outperforms the most widely used commercial zip archiver and approaches the compression of the sophisticated BWA implementation bzip2. Ill. 11, bibl. 14, tabl. 1 (in English; abstracts in English, Russian and Lithuanian).

R. Bastys. Use of the Fibonacci Code in the Burrows-Wheeler Data Compression Scheme // Electronics and Electrical Engineering. – Kaunas: Technologija, 2010. – No. 1(97). – P. 28–32.
The Burrows-Wheeler algorithm (BWA) is one of the most effective methods of compressing textual data. BWA includes three main iterations: the Burrows-Wheeler transform (BWT), the Move-To-Front transformation (MTF) and an entropy code (e.g. Huffman). The paper analyses a little-investigated scheme in which MTF is replaced by Distance Coding. The main advantages and drawbacks of the modified BWA are presented, among which the excessively large alphabet is identified as the principal one. To solve this technical problem, the universal Fibonacci code is proposed. Its use at the third BWA step on textual data yields results very close to the theoretical ones. The described data compression algorithm also surpasses the popular commercial zip archiver and is close in efficiency to the much more sophisticated BWA implementation bzip2. Ill. 11, bibl. 14, tabl. 1 (in English; abstracts in English, Russian and Lithuanian).

R. Bastys. Use of the Fibonacci Code in the Burrows and Wheeler Data Compression Scheme // Elektronika ir elektrotechnika. – Kaunas: Technologija, 2010. – No. 1(97). – P. 28–32.
The Burrows and Wheeler data compression algorithm (BWA) is one of the most effective methods of archiving textual data. BWA combines three iterations: the Burrows-Wheeler transformation (BWT), the Move-To-Front transformation (MTF) and a chosen entropy code (for example, Huffman's). The article analyses a little-investigated scheme in which MTF is replaced by distance coding. The advantages and drawbacks of the modified algorithm compared with the classical BWA are indicated, and the main technical problem – the excessive alphabet – is identified. To solve it, the universal Fibonacci code is employed, which avoids the cost of storing the alphabet. The results achieved by encoding textual data with the described scheme are very close to the theoretical ones. The efficiency of the presented algorithm surpasses the widely used commercial archiver zip and is similar to the compression ratio of one of the most sophisticated BWA implementations (bzip2). Ill. 11, bibl. 14, lent. 1 (in English; abstracts in English, Russian and Lithuanian).
