Lecture I: Data Compression, Data Encoding: Efficient Information Encoding
Jarosław Duda
Nokia, Kraków
25. IV. 2016
Symbol frequencies from a version of the British
National Corpus containing 67 million words
( http://www.yorku.ca/mack/uist2011.html#f1 ):
Brute force: lg(27) ≈ 4.75 bits/symbol (x → 27x + s)
Huffman uses: H' = ∑_i p_i r_i ≈ 4.12 bits/symbol (r_i = length of the i-th codeword)
Shannon: H = ∑_i p_i lg(1/p_i) ≈ 4.08 bits/symbol
We can theoretically improve by ≈ 1% here:
ΔH ≈ 0.04 bits/symbol
Order-1 Markov model: ~3.3 bits/symbol (~gzip)
Order 2: ~3.1 bits/symbol; word-based: ~2.1 bits/symbol (~ZSTD)
Currently best compression: cmix-v9 (PAQ)
(http://mattmahoney.net/dc/text.html )
10^9 bytes of text from Wikipedia (enwik9)
into 123,874,398 bytes: ≈ 1 bit/symbol
10^8 bytes → 15,627,636; 10^6 bytes → 181,476
Hilberg conjecture: H(n) ~ n^β, β < 0.9
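For concreteness, a minimal C sketch of the entropy computation above (checked here against the uniform "brute force" bound lg(27); feeding it the BNC letter frequencies would give the ≈ 4.08 bits/symbol figure):

    #include <math.h>
    #include <stdio.h>

    /* Shannon entropy H = sum_i p_i lg(1/p_i) in bits/symbol;
       zero-probability symbols contribute nothing. */
    double entropy_bits(const double *p, int n) {
        double h = 0;
        for (int i = 0; i < n; i++)
            if (p[i] > 0) h -= p[i] * log2(p[i]);
        return h;
    }

    int main(void) {
        double p[27];                     /* uniform 27-letter alphabet */
        for (int i = 0; i < 27; i++) p[i] = 1.0 / 27;
        printf("H = %.2f bits/symbol\n", entropy_bits(p, 27));  /* 4.75 */
        return 0;
    }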
General purpose (lossless, e.g. gzip, rar, lzma, bzip, zpaq, Brotli, ZSTD) vs
specialized compressors (e.g. audiovisual, text, DNA, numerical, mesh,…)
JPEG LS – simple prediction
https://en.wikipedia.org/wiki/Lossless_JPEG
YCbCr:
Y – luminance/brightness
Cb, Cr – blue-difference, red-difference chroma
Chroma subsampling in lossy compression (usually 4:2:0)
Color depth:
3×8 bits – true color
3×10, 12, 16 bits – deep color
Karhunen–Loève transform decorrelates pixels into (nearly) independent variables.
It is often close to the Fourier transform (DCT):
energy preserving, frequency domain
(coefficients are stored quantized in JPEG)
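As a sketch of the transform step (an O(n²) orthonormal 1D DCT-II for clarity; JPEG uses a fast 2D version on 8×8 blocks):

    #include <math.h>

    /* Orthonormal (energy-preserving) 1D DCT-II of x[] into X[]. */
    void dct2(const double *x, double *X, int n) {
        const double pi = acos(-1.0);
        for (int k = 0; k < n; k++) {
            double s = 0;
            for (int i = 0; i < n; i++)
                s += x[i] * cos(pi / n * (i + 0.5) * k);
            X[k] = sqrt((k == 0 ? 1.0 : 2.0) / n) * s;  /* normalization */
        }
    }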
DCT – fixed support, Wavelets – varying support, e.g. JPEG2000
Haar wavelets:
Progressive decoding: send low frequencies first, then further details to improve quality
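A minimal sketch of one Haar level, to make the progressive idea concrete (JPEG2000 itself uses smoother wavelets):

    /* One level of the Haar wavelet transform (n even):
       lo[] = pairwise averages - the coarse, low-frequency signal
              that progressive decoding transmits first;
       hi[] = pairwise differences - details that refine quality.
       Recursing on lo[] gives the multi-resolution pyramid. */
    void haar_step(const double *x, double *lo, double *hi, int n) {
        for (int i = 0; i < n / 2; i++) {
            lo[i] = (x[2*i] + x[2*i + 1]) / 2;
            hi[i] = (x[2*i] - x[2*i + 1]) / 2;
        }
    }
    /* inverse: x[2i] = lo[i] + hi[i];  x[2i+1] = lo[i] - hi[i]; */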
Lempel-Ziv (large family)
Replace repeats with (position, length) pairs
(decoding is just parsing – see the sketch below)
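A sketch of such a decoder, with a hypothetical token layout (real formats like lz4, deflate or zstd pack tokens differently, but decoding is exactly this parsing loop):

    #include <stddef.h>

    /* Hypothetical LZ77-style token: either a literal byte, or a
       (distance, length) reference into already decoded output. */
    typedef struct { int is_literal; unsigned char lit; size_t dist, len; } Token;

    size_t lz_decode(const Token *t, size_t ntok, unsigned char *out) {
        size_t pos = 0;
        for (size_t i = 0; i < ntok; i++) {
            if (t[i].is_literal) {
                out[pos++] = t[i].lit;
            } else {
                /* byte-by-byte copy handles overlapping matches
                   (dist < len), which encode runs cheaply */
                for (size_t j = 0; j < t[i].len; j++, pos++)
                    out[pos] = out[pos - t[i].dist];
            }
        }
        return pos;
    }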
https://github.com/inikep/lzbench
i5-4300U, 100 MB of files total (from Windows 8.1):

method                  type         Compr.[MB/s]  Decomp.[MB/s]  Ratio (compressed/original)
lz4fast r131 level 17   LZ           994           3172           74%
lz4 r131                LZ           497           2492           62%
lz5hc v1.4.1 level 15   LZ           2.29          724            44%
zlib (gzip) level 1     LZ + Huf     45            197            49%
zlib (gzip) level 9     LZ + Huf     7.5           216            45%
zstd v0.6.0 level 1     LZ + tANS    231           595            49%
zstd v0.6.0 level 22    LZ + tANS    2.25          446            37%
Google Brotli level 0   LZ + o1 Huf  210           195            50%
Google Brotli level 11  LZ + o1 Huf  0.25          170            36%
Burrows–Wheeler Transform: sort all cyclic shifts lexicographically, take the last column (naive sketch below)
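A naive sketch of the forward transform (O(n² log n); practical implementations use suffix arrays):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    static const char *g_s;   /* input for the comparator */
    static size_t g_n;

    static int cmp_rot(const void *a, const void *b) {  /* compare two cyclic shifts */
        size_t i = *(const size_t *)a, j = *(const size_t *)b;
        for (size_t k = 0; k < g_n; k++) {
            unsigned char ci = g_s[(i + k) % g_n], cj = g_s[(j + k) % g_n];
            if (ci != cj) return ci - cj;
        }
        return 0;
    }

    void bwt(const char *s, char *last_col) {
        size_t n = strlen(s), *rot = malloc(n * sizeof *rot);
        for (size_t i = 0; i < n; i++) rot[i] = i;
        g_s = s; g_n = n;
        qsort(rot, n, sizeof *rot, cmp_rot);       /* sort all cyclic shifts */
        for (size_t i = 0; i < n; i++)             /* last column = character */
            last_col[i] = s[(rot[i] + n - 1) % n]; /* preceding each shift    */
        free(rot);
    }

    int main(void) {
        char out[8] = {0};
        bwt("banana", out);
        printf("%s\n", out);   /* prints nnbaaa */
        return 0;
    }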
Part 2: (Entropy) coding – the heart of data compression
Encoding a (p_i) distribution with an entropy coder optimal for a (q_i) distribution costs

ΔH = ∑_i p_i lg(1/q_i) − ∑_i p_i lg(1/p_i) = ∑_i p_i lg(p_i/q_i) ≈ (1/ln(4)) ∑_i (p_i − q_i)²/p_i

more bits/symbol – the so-called Kullback-Leibler "distance" (not symmetric)
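Both the exact penalty and its quadratic approximation transcribe directly (a minimal sketch):

    #include <math.h>

    /* Kullback-Leibler "distance" in bits/symbol: the cost of coding a
       p-distributed source with a coder optimal for q. */
    double kl_bits(const double *p, const double *q, int n) {
        double d = 0;
        for (int i = 0; i < n; i++)
            if (p[i] > 0) d += p[i] * log2(p[i] / q[i]);
        return d;
    }

    /* quadratic approximation: (1/ln 4) * sum (p_i - q_i)^2 / p_i */
    double kl_approx_bits(const double *p, const double *q, int n) {
        double d = 0;
        for (int i = 0; i < n; i++)
            if (p[i] > 0) d += (p[i] - q[i]) * (p[i] - q[i]) / p[i];
        return d / log(4.0);
    }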
Imagine we want to store n values (e.g. hashes) of m bits each: directly m·n bits
Their order contains lg(n!) ≈ lg((n/e)^n) = n lg(n) − n lg(e) bits of information
If order is not important, we could use only ≈ n(m − lg(n) + 1.443) bits
This means halving the storage for 178k 32-bit hashes, or 11G 64-bit hashes
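A quick check of those numbers (a sketch; 1.443 ≈ lg(e) = 1/ln(2)):

    #include <math.h>
    #include <stdio.h>

    double direct_bits(double n, double m)    { return n * m; }
    double unordered_bits(double n, double m) { return n * (m - log2(n) + 1.0 / log(2.0)); }

    int main(void) {   /* storage roughly halves around n = e * 2^(m/2) */
        printf("32-bit, n = 178k: %.2fx\n",
               direct_bits(178e3, 32) / unordered_bits(178e3, 32));
        printf("64-bit, n = 11G:  %.2fx\n",
               direct_bits(11e9, 64) / unordered_bits(11e9, 64));
        return 0;   /* both print ratios of about 2 */
    }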
How to cheaply approximate it?

n!/((pn)! · ((1−p)n)!) ≈ (n/e)^n / ((pn/e)^{pn} · ((1−p)n/e)^{(1−p)n})
= 2^{n lg(n) − pn lg(pn) − (1−p)n lg((1−p)n)} = 2^{n(−p lg(p) − (1−p) lg(1−p))}
example: {0014044010001001404441000010010000000000000000000000
100000000000000000000000000000000000000000000000000}
{0,1,2,3,4} alphabet, length = 104; direct encoding: 104 · lg(5) ≈ 241 bits
Static probability (stored?): {89, 8, 0, 0, 7}/104: 104 · ∑_i p_i lg(1/p_i) ≈ 77 bits
Symbol-wise adaptation starting from a flat (uniform) or averaged initial distribution (sketch below):
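A sketch of such an adaptive model for the example above (counts initialized to 1 per symbol is one simple "flat" choice; the coder pays lg(1/p̂) bits per symbol under the current estimate, then updates it):

    #include <math.h>
    #include <stdio.h>

    double adaptive_cost_bits(const char *msg, int m) {
        double bits = 0;
        unsigned count[16], total = m;
        for (int s = 0; s < m; s++) count[s] = 1;   /* flat initial distribution */
        for (; *msg; msg++) {
            int s = *msg - '0';
            bits += log2((double)total / count[s]); /* lg(1/p_hat) for this symbol */
            count[s]++; total++;                    /* then adapt the estimate */
        }
        return bits;
    }

    int main(void) {
        const char *msg =
            "0014044010001001404441000010010000000000000000000000"
            "100000000000000000000000000000000000000000000000000";
        printf("adaptive cost: %.1f bits\n", adaptive_cost_bits(msg, 5));
        return 0;
    }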
ANS adoption: general purpose (Apple LZFSE, Fb ZSTD), DNA (CRAM), games (LZNA, BitKnit), Google VP10, WebP
Huffman coding (HC), prefix codes: most of everyday compression,
e.g. zip, gzip, cab, jpeg, gif, png, tiff, pdf, mp3… (can't use less than 1 bit/symbol)
Zlibh (the fastest generally available implementation):
Encoding ~ 320 MB/s (/core, 3 GHz), Decoding ~ 300-350 MB/s

Range coding (RC): large-alphabet arithmetic coding, needs multiplication,
e.g. 7-Zip, Google VP video codecs (e.g. YouTube, Hangouts).
Encoding ~ 100-150 MB/s, Decoding ~ 80 MB/s

(binary) Arithmetic coding (AC): H.264, H.265 video, ultracompressors e.g. PAQ.
Encoding/decoding ~ 20-30 MB/s

Asymmetric Numeral Systems (ANS), tabled variant (tANS) – without multiplication.
FSE implementation of tANS: Encoding ~ 350 MB/s, Decoding ~ 500 MB/s

Example of the Huffman penalty for a truncated ρ(1−ρ)^n distribution –
compression ratio: ideal (8/H) vs zlibh vs FSE
(Huffman: 1 byte → at least 1 bit, so ratio ≤ 8 here):

ρ       8/H     zlibh   FSE
0.5     4.001   3.99    4.00
0.6     4.935   4.78    4.93
0.7     6.344   5.58    6.33
0.8     8.851   6.38    8.84
0.9     15.31   7.18    15.29
0.95    26.41   7.58    26.38
0.99    96.95   7.90    96.43

RC → ANS: ~7x decoding speedup, no multiplication (switched to e.g. in the LZA compressor)
HC → ANS: better compression and ~1.5x decoding speedup (e.g. zhuff, lzturbo)
Operating on a fractional number of bits
rANS – range variant for a large alphabet 𝒜 = {0, …, m − 1}
assume Pr(s) = f_s/2^n, c_s := f_0 + f_1 + ⋯ + f_{s−1}
start with a base-2^n numeral system and merge ranges of length f_s
for x ∈ {0, 1, …, 2^n − 1}: s(x) = max{s: c_s ≤ x}, mask = 2^n − 1
encoding: C(s, x) = (⌊x/f_s⌋ ≪ n) + mod(x, f_s) + c_s
decoding: s = s(x & mask) (e.g. tabled, alias method)
D(x) = (s, f_s · (x ≫ n) + (x & mask) − c_s)
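These formulas transcribe directly into a single-step sketch (no renormalization yet, a hypothetical precision n = 12, and a linear scan standing in for the tabled/alias symbol search):

    #include <stdint.h>

    enum { N_BITS = 12 };                        /* precision n: Pr(s) = f[s]/2^n */
    #define MASK ((1u << N_BITS) - 1)

    /* C(s,x) = (floor(x/f_s) << n) + mod(x, f_s) + c_s */
    uint64_t rans_encode(int s, uint64_t x, const uint32_t *f, const uint32_t *c) {
        return ((x / f[s]) << N_BITS) + (x % f[s]) + c[s];
    }

    /* D(x) = (s, f_s*(x >> n) + (x & mask) - c_s) with s = s(x & mask) */
    int rans_decode(uint64_t x, uint64_t *x_out,
                    const uint32_t *f, const uint32_t *c, int m) {
        uint32_t low = x & MASK;
        int s = 0;
        while (s + 1 < m && c[s + 1] <= low) s++;  /* s(x) = max{s: c_s <= x&mask} */
        *x_out = f[s] * (uint64_t)(x >> N_BITS) + low - c[s];
        return s;
    }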
uABS – uniform binary variant (𝒜 = {0,1}) – extremely accurate
Assume a binary alphabet, p := Pr(1); denote x_s = |{y < x: s(y) = s}| ≈ x·p_s
For a uniform symbol distribution we can choose:
x_1 = ⌈x·p⌉, x_0 = x − x_1 = x − ⌈x·p⌉
s(x) = 1 iff there is a jump at the next position: s = s(x) = ⌈(x + 1)·p⌉ − ⌈x·p⌉
decoding function: D(x) = (s, x_s)
its inverse – the coding function:
C(0, x) = ⌈(x + 1)/(1 − p)⌉ − 1
C(1, x) = ⌊x/p⌋
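A direct transcription with a round-trip self-check (floating-point p for brevity; real implementations use fixed-point arithmetic):

    #include <assert.h>
    #include <math.h>
    #include <stdint.h>

    static const double P = 0.3;   /* hypothetical p = Pr(1) */

    uint64_t uabs_C(int s, uint64_t x) {                /* coding function */
        if (s) return (uint64_t)floor(x / P);           /* C(1,x) = floor(x/p) */
        return (uint64_t)ceil((x + 1) / (1 - P)) - 1;   /* C(0,x) = ceil((x+1)/(1-p)) - 1 */
    }

    int uabs_D(uint64_t x, uint64_t *xs) {              /* decoding function */
        uint64_t x1 = (uint64_t)ceil(x * P);            /* x_1 = ceil(x p) */
        int s = (int)((uint64_t)ceil((x + 1) * P) - x1);/* jump on next position? */
        *xs = s ? x1 : x - x1;                          /* x_0 = x - x_1 */
        return s;
    }

    int main(void) {   /* verify D(C(s,x)) == (s,x) */
        for (uint64_t x = 0; x < 100000; x++)
            for (int s = 0; s < 2; s++) {
                uint64_t y;
                assert(uabs_D(uabs_C(s, x), &y) == s && y == x);
            }
        return 0;
    }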
Stream version – renormalization
Up to now: we encode using succeeding C functions into a huge number x,
then decode (in the opposite direction!) using succeeding D.
Like in arithmetic coding, we need renormalization to limit the working precision –
enforce x ∈ I = {L, …, bL − 1} by transferring the base-b youngest digits:

ANS decoding step from state x:
    (s, x) = D(x);
    useSymbol(s);
    while x < L: x = b·x + readDigit();

encoding step for symbol s from state x:
    while x ≥ maxX[s]:              // maxX[s] = b·L_s
        { writeDigit(mod(x, b)); x = ⌊x/b⌋ };
    x = C(s, x);

For unique decoding, we need to ensure that there is a single way to perform the above loops:
I = {L, …, bL − 1}, I_s = {L_s, …, bL_s − 1} where I_s = {x: C(s, x) ∈ I}
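Gluing the uABS functions from the previous section into these two loops gives a complete toy stream coder; a sketch with b = 2 and hypothetical parameters L = 4096, p = 0.3 (testing C(s,x) directly instead of precomputing maxX[s]):

    #include <assert.h>
    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>

    #define L 4096                  /* state interval I = {L, ..., 2L-1} */
    #define N 64                    /* message length */
    static const double P = 0.3;    /* Pr(1) */

    static uint32_t C_fn(int s, uint32_t x) {       /* uABS coding function */
        return s ? (uint32_t)floor(x / P)
                 : (uint32_t)ceil((x + 1) / (1 - P)) - 1;
    }
    static int D_fn(uint32_t x, uint32_t *xs) {     /* uABS decoding function */
        uint32_t x1 = (uint32_t)ceil(x * P);
        int s = (int)((uint32_t)ceil((x + 1) * P) - x1);
        *xs = s ? x1 : x - x1;
        return s;
    }

    int main(void) {
        int msg[N], dec[N], nbits = 0, used;
        unsigned char bits[8 * N];                  /* bit stack: ANS is LIFO */
        for (int i = 0; i < N; i++) msg[i] = (i % 7 == 0);

        uint32_t x = L;                             /* any initial state in I */
        for (int i = 0; i < N; i++) {               /* ---- encode ---- */
            while (C_fn(msg[i], x) >= 2 * L) {      /* renormalize: push low bits */
                bits[nbits++] = x & 1;
                x >>= 1;
            }
            x = C_fn(msg[i], x);                    /* new state stays in I */
        }
        used = nbits;
        for (int i = N - 1; i >= 0; i--) {          /* ---- decode, in reverse ---- */
            dec[i] = D_fn(x, &x);
            while (x < L) x = 2 * x + bits[--nbits];/* pull bits back */
        }
        for (int i = 0; i < N; i++) assert(dec[i] == msg[i]);
        printf("round trip OK: %d symbols -> %d bits\n", N, used);
        return 0;
    }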
RENORMALIZATION to prevent x → ∞
Enforce x ∈ I = {L, …, 2L − 1} by transmitting the lowest bits to the bitstream.
The "buffer" x contains lg(x) bits of information – it produces bits as they accumulate.
A symbol of probability p contains lg(1/p) bits:
lg(x) → lg(x) + lg(1/p) modulo 1
Single step of the stream version:
to get from x ∈ I to I_s = {L_s, …, bL_s − 1}, we need to transfer k digits:
x → (C(s, ⌊x/b^k⌋), mod(x, b^k)) where k = ⌊log_b(x/L_s)⌋
k = k_s or k = k_s − 1 for k_s = −⌊log_b(p_s)⌋ = −⌊log_b(L_s/L)⌋
e.g. p_s = 13/66, b = 2: k_s = 3, L_s = 13, L = 66; for x = 115: k = 3,
b^k · 14 + 3 = 115, so x → (C(s, 14), transferred digits mod(115, 2³) = 3)
Spreading symbols with a priority queue (key = first coordinate):

for s = 0 to n − 1 do put((0.5/p_s, s));
for X = 0 to L − 1 do
    { (v, s) = getmin;
      put((v + 1/p_s, s));
      symbol[X] = s; }

approximately: ΔH ∝ (alphabet size / L)²
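A runnable version of this spread, with a linear scan in place of the priority queue (fine for small alphabets; the p_s values are hypothetical):

    #include <stdio.h>

    #define L 16
    int main(void) {
        const double p[] = {0.7, 0.2, 0.1};      /* hypothetical p_s */
        const int n = 3;
        double v[3];
        int symbol[L];
        for (int s = 0; s < n; s++) v[s] = 0.5 / p[s];   /* put((0.5/p_s, s)) */
        for (int X = 0; X < L; X++) {
            int best = 0;
            for (int s = 1; s < n; s++)          /* getmin */
                if (v[s] < v[best]) best = s;
            symbol[X] = best;
            v[best] += 1.0 / p[best];            /* put((v + 1/p_s, s)) */
        }
        for (int X = 0; X < L; X++) printf("%d", symbol[X]);
        printf("\n");   /* prints 0010020100001002: symbol s gets about p_s*L slots */
        return 0;
    }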
Tuning: p_s ≈ q_s/L. Can we "tune" the spread of the q_s appearances of a symbol according to p_s?
Shift symbols right when p_s < q_s/L, left otherwise.

Assume Pr(x) ≈ 1/(x ln(2)), so ∑_{x=a..b} Pr(x) ≈ lg(b/a)
s appears q_s times: i ∈ I_s = {q_s, …, 2q_s − 1}
Pr(i-th interval) ≈ lg((i + 1)/i)
To fulfill the Pr(x) assumption, x for this interval should fulfill:
lg((i + 1)/i) · p_s ≈ 1/(x ln(2))

x ≈ 1/(p_s ln(1 + 1/i))

is the preferred position for the i-th appearance (i ∈ I_s = {q_s, …, 2q_s − 1})
of a symbol with p_s ≈ q_s/L
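In code form (a one-line sketch of the rule):

    #include <math.h>

    /* preferred position x of the i-th appearance (i in {q_s, ..., 2q_s - 1})
       of a symbol with probability p_s */
    double preferred_pos(double p_s, int i) {
        return 1.0 / (p_s * log1p(1.0 / i));   /* 1/(p_s ln(1 + 1/i)) */
    }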
https://github.com/JarekDuda/AsymmetricNumeralSystemsToolkit
tABS – tabled binary variant:
test all possible symbol distributions for the binary alphabet
no branches, no bit-by-bit renormalization
the state is a single number (e.g. for SIMD)
Additional tANS advantage – simultaneous encryption
We can use the huge freedom while initializing – choosing the symbol distribution:
slightly disturb s(x) using a PRNG initialized with a cryptographic key.

ADVANTAGES compared to standard (symmetric) cryptography:

                    standard, e.g. DES, AES       ANS-based cryptography (initialized)
operations          based on XOR, permutations    highly nonlinear
bit blocks          fixed length                  pseudorandomly varying lengths
"brute force"       just start decoding           perform initialization first for each new
or QC attacks       to test a cryptokey           cryptokey, fixed to need e.g. 0.1s
speed               online calculation            most calculations during initialization
entropy             operates on bits              operates on any input distribution
tANS (2007) – fully tabled: Apple LZFSE, Facebook ZSTD, lzturbo
fast: no multiplication (FPGA!), less memory efficient (~8 kB for 2048 states),
static in ~32 kB blocks, costly to update (rather needs rebuilding),
allows for simultaneous encryption (PRNG to perturb the symbol spread)

tANS decoding step:
    t = decodingTable[x];
    writeSymbol(t.symbol);
    x = t.newX + readBits(t.nbBits);

tANS encoding step (for symbol s):
    nbBits = (x + nb[s]) >> r;
    writeBits(x, nbBits);
    x = encodingTable[start[s] + (x >> nbBits)];

rANS (2013) – needs multiplication: CRAM (DNA), VP10 (video), LZNA, BitKnit
more memory effective – especially for a large alphabet and precision (CDF only),
better for adaptation (find s with CDF[s] ≤ y < CDF[s + 1]: tabled, alias, binary search/SIMD)

rANS decoding step (mask = 2^n − 1):
    s = symbol(x & mask); writeSymbol(s);
    x = f[s] · (x >> n) + (x & mask) − CDF[s];
    if (x < 2^16) x = (x << 16) + read16bits();

rANS encoding step (for symbol s, msk = 2^16 − 1):
    if (x > bound[s]) { write16bits(x & msk); x >>= 16; }
    x = ((x / f[s]) << n) + (x % f[s]) + CDF[s];