[IR] BWT+MTF+AC

 

BWT (Burrows–Wheeler_transform)数据转换算法html

MTF(Move-to-front transform)数据转换web

基于统计的压缩算法:游程编码算法

良心PPT: bwt_based_compression_verbin.pptapp


 

BWT Ideaide

压缩技术主要的工做方式就是找到重复的模式,进行紧密的编码。post

BWT(Burrows–Wheeler_transform)将原来的文本转换为一个类似的文本,转换后使得相同的字符位置连续或者相邻;编码

以后能够使用其余技术如:Move-to-front transform 和 游程编码(RLE) 进行文本压缩。url

 

通常压缩能够将文本先使用Burrows–Wheeler transform生成局部相关性很好的序列再使用MTF减小信息熵最后再进行压缩。idea

 

Burrows–Wheeler transform + Run-length codingspa

Ori:rabcabcababaabacabcabcabcababaa$
BWT:aabbbbccacccrcbaaaaaaaaaabbbbba$
RLE:aab4ccac3rcba10b5a$

 

Encoding:

1. All rotations

#BANANAS
S#BANANA
AS#BANAN
NAS#BANA
ANAS#BAN
NANAS#BA
ANANAS#B
BANANAS#

2. Sort the rows and pick up the last col.

#BANANAS
ANANAS#B
ANAS#BAN
AS#BANAN
BANANAS#
NANAS#BA
NAS#BANA
S#BANANA

正好是个行列一致的Matrix。

ori=[#BANANAS]T

bwt=[SBNN#AAA]T

 

Decoding:

while循环len(bwt)次 ) {

  "前插一列,字典排序"

}

Result:

#BANANAS
ANANAS#B
ANAS#BAN
AS#BANAN
BANANAS#
NANAS#BA
NAS#BANA
S#BANANA

 


 

MTF Idea: 

可见,BWT转换将文本转换为局部相关性很好的序列。

而恰好MTF转换就是利用了空间局部性原理减小信息熵。

即:最近访问的字符老是出如今“recently used symbols”的前面位置,若是字符的空间局部性较好,编码以后就会出现不少小的数字,如”0“或”1“。

因此,在以下List列中,使用的字符便可提早到head pos处,如此一来,index标号比较小,省空间。

这样,就方便了下一步压缩策略。

The main idea is that each symbol in the data is replaced by its index in the stack of “recently used symbols”.

For example, long sequences of identical symbols are replaced by as many zeroes, whereas when a symbol that has not been used in a long time appears, it is replaced with a large number.

Thus at the end the data is transformed into a sequence of integers; if the data exhibits a lot of local correlations, then these integers tend to be small.

 

Encoding:

先创建字符集大小的List,“recently used symbols”,这里只考虑26个小写字母a~z。

From: https://en.wikipedia.org/wiki/Move-to-front_transform

Input Stream Sequence List Operation
bananaaa 1 (abcdefghijklmnopqrstuvwxyz) b放到head位置
bananaaa 1,1 (bacdefghijklmnopqrstuvwxyz) a放到head位置
bananaaa 1,1,13 (abcdefghijklmnopqrstuvwxyz) n放到head位置
bananaaa 1,1,13,1 (nabcdefghijklmopqrstuvwxyz) a放到head位置
bananaaa 1,1,13,1,1 (anbcdefghijklmopqrstuvwxyz) n放到head位置
bananaaa 1,1,13,1,1,1 (nabcdefghijklmopqrstuvwxyz) a放到head位置
bananaaa 1,1,13,1,1,1,0 (anbcdefghijklmopqrstuvwxyz) a放到head位置
bananaaa 1,1,13,1,1,1,0,0 (anbcdefghijklmopqrstuvwxyz) a放到head位置

Final

1,1,13,1,1,1,0,0

(anbcdefghijklmopqrstuvwxyz)

 

ori=[b a n  a n a a a]T

mtf=[1 1 13 1 1 1 0 0]T

 

Decoding:

Input Stream Sequence     List
1,1,13,1,1,1,0,0 b
(abcdefghijklmnopqrstuvwxyz)
1,1,13,1,1,1,0,0 ba (bacdefghijklmnopqrstuvwxyz)
1,1,13,1,1,1,0,0 ban (abcdefghijklmnopqrstuvwxyz)
1,1,13,1,1,1,0,0 bana (nabcdefghijklmopqrstuvwxyz)
1,1,13,1,1,1,0,0 banan (anbcdefghijklmopqrstuvwxyz)
1,1,13,1,1,1,0,0 banana (nabcdefghijklmopqrstuvwxyz)
1,1,13,1,1,1,0,0 bananaa (anbcdefghijklmopqrstuvwxyz)
1,1,13,1,1,1,0,0 bananaaa (anbcdefghijklmopqrstuvwxyz)

 

可见,这里的思想就在于,利用了变化过程当中的顺序性!

这样一来,string变为了数字。若是以前的string是对应一堆大的数字的话,那么,此次变为了一堆小数字,且0,1较多,Entropy小了!

 

Burrows–Wheeler transform + Move-To-Front

Ori:rabcabcababaabacabcabcabcababaa$
BWT:aabbbbccacccrcbaaaaaaaaaabbbbba$
MTF:0,0,1,0,0,0,2,0,2,1,0,0,0,17,1,2,3,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,$
ARI:<(6+1)bytes>

 

问:MTF是否确实提升了压缩率呢?一堆小数字对压缩来讲总会是好事。减少了信息熵,便天然想到了huffman或者arithmetic coding。

 

过程以下:

Input Stream Sequence List
aabbbbccacccrcbaaaaaaaaaabbbbba$ 0 (abcdefghijklmnopqrstuvwxyz)
aabbbbccacccrcbaaaaaaaaaabbbbba$ 0,0 (abcdefghijklmnopqrstuvwxyz)
aabbbbccacccrcbaaaaaaaaaabbbbba$ 0,0,1 (abcdefghijklmnopqrstuvwxyz)
aabbbbccacccrcbaaaaaaaaaabbbbba$ 0,0,1,0,0,0 (bacdefghijklmnopqrstuvwxyz)
aabbbbccacccrcbaaaaaaaaaabbbbba$ 0,0,1,0,0,0,2 (bacdefghijklmnopqrstuvwxyz)
aabbbbccacccrcbaaaaaaaaaabbbbba$ 0,0,1,0,0,0,2,0 (cbadefghijklmnopqrstuvwxyz)
aabbbbccacccrcbaaaaaaaaaabbbbba$ 0,0,1,0,0,0,2,0,2 (cbadefghijklmnopqrstuvwxyz)
aabbbbccacccrcbaaaaaaaaaabbbbba$ 0,0,1,0,0,0,2,0,2,1 (acbdefghijklmnopqrstuvwxyz)
aabbbbccacccrcbaaaaaaaaaabbbbba$ 0,0,1,0,0,0,2,0,2,1,0,0,0 (cabdefghijklmnopqrstuvwxyz)
aabbbbccacccrcbaaaaaaaaaabbbbba$ 0,0,1,0,0,0,2,0,2,1,0,0,0,17 (cbadefghijklmnopqrstuvwxyz)
aabbbbccacccrcbaaaaaaaaaabbbbba$ 0,0,1,0,0,0,2,0,2,1,0,0,0,17,1 (rcbadefghijklmnopqstuvwxyz)
aabbbbccacccrcbaaaaaaaaaabbbbba$ 0,0,1,0,0,0,2,0,2,1,0,0,0,17,1,2 (rcbadefghijklmnopqstuvwxyz)
aabbbbccacccrcbaaaaaaaaaabbbbba$ 0,0,1,0,0,0,2,0,2,1,0,0,0,17,1,2,3 (brcadefghijklmnopqstuvwxyz)
aabbbbccacccrcbaaaaaaaaaabbbbba$ 0,0,1,0,0,0,2,0,2,1,0,0,0,17,1,2,3,0,0,0,0,0,0,0,0,0 (abrcdefghijklmnopqstuvwxyz)
aabbbbccacccrcbaaaaaaaaaabbbbba$ 0,0,1,0,0,0,2,0,2,1,0,0,0,17,1,2,3,0,0,0,0,0,0,0,0,0,1 (abrcdefghijklmnopqstuvwxyz)
aabbbbccacccrcbaaaaaaaaaabbbbba$ 0,0,1,0,0,0,2,0,2,1,0,0,0,17,1,2,3,0,0,0,0,0,0,0,0,0,1,0,0,0,0 (barcdefghijklmnopqstuvwxyz)
aabbbbccacccrcbaaaaaaaaaabbbbba$ 0,0,1,0,0,0,2,0,2,1,0,0,0,17,1,2,3,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1 (barcdefghijklmnopqstuvwxyz)

Final

0,0,1,0,0,0,2,0,2,1,0,0,0,17,1,2,3,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1

 

 

继续采用Arithmetic coding处理:

32 in total(暂未考虑$)

0:  22/32

1:  5/32

2:  3/32

3:  1/32

17: 1/32

使用了8 Bytes

 

计算Entropy:

计算式:(-22/32)*log2(22/32)+(-5/32)*log2(5/32)+(-3/32)*log2(3/32)+(-1/32)*log2(1/32)+(-1/32)*log2(1/32)

计算器:https://web2.0calc.com/


 

 

对比: 

大约每一个字符平均采用1.43bits来进行编码,也就是:

1.43*32=45.76 约为46bits,即6Bytes

加上结尾符:6+1($) = 7Bytes

Finally, 15 Bytes 搞定!

 

RLE:aab4ccac3rcba10b5a$

共18 Bytes!

 


 

总结:

有不少种组合方式,以上只是探讨了BWT+MTF+Arithmetic Coding的效果。

 


 

BWT+Run-Length --> Run-length FM-Index

 

参考: [IR] Advanced XML Compression - XBW

都利用了一样的压缩思想(Run-length encoding)。这里,经过B、S列替代了Last列。B列更有利于压缩。

 

注意这里是'1'为首,'0'为尾。

经过B'、S列替代了Front列。B'列更有利于压缩。但S须要在使用前排序,也就是S-->C列。

注意:S、C列中每一个elem表明的是block。且相同elem内部具备保序性。

index  Last   B   S*   B'  Front   C* 
1 e 1 e 1 $ $
2 e 0 d 1 _ _
3 d 1 _ 1 _ _
4 _ 1 n 1 a a
5 n 1 r 1 d d
6 r 1 h 1 e e
7 r 0 t 0 e e
8 h 1 $ 1 e h
9 h 0 a 0 e n
10 t 1 e 1 h r
11 $ 1 _ 0 h t
12 a 1   1 n  
13 e 1   1 r  
14 e 0   0 r  
15 _ 1   1 t  

 

如此这般,

B+B' = 15bits + 15bits = 2 Bytes.

S:11 Bytes

In total: 2+11+1(表len)=14 Bytes.

 

原来的方式:

Last列:15 Bytes

C Table (for Front列):row:9 * col:2 = 18 Bytes

 $     1   
_ 2
a 4
d 5
e 6
h 10
n 12
r 13
t 15

In total: 15+18=33 Bytes.

 

时间换空间,仍是有点效果。

本站公众号
   欢迎关注本站公众号,获取更多信息