[IR] Graph Compression

Ref: [IR] Compression

Ref: [IR] Link Analysis


 

Planar Graph Algorithms

From: http://www.csie.ntnu.edu.tw/~u91029/PlanarGraph.html#1oop

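Because dual graphs lack an elegant regularity, isomorphism is customarily ignored when discussing them.

The most distinctive dual-graph examples are the bridge and the self-loop.

For example, if the original graph is a tree, its dual graph is a single vertex with many self-loops; each tree corresponds to a different way of nesting those loops.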

 

Spanning Tree Encoding

From: http://www.csie.ntnu.edu.tw/~u91029/SpanningTree.html

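For ways to extract a tree from a graph, see the minimum spanning tree algorithms and related material in [Optimization] Greedy method.

The following looks at a strategy for compressing a graph.

Idea:

Links that lie on the spanning tree are recorded with - and + symbols.

Links that are not on the spanning tree are added as balanced brackets.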

1   2   3   2   4   2   1   5   6   5   7   8   7   5
  -   -   +   -   +   +   -   -   +   -   -   +   +  
        ((       )((   ((       )   ))(       )   )       )
        12       234   56       6   547       7   3       1

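Final encoding: --((+-)((+((+-)-))(+-)-)++)

In other words, the tree is represented by a DFS traversal, and the remaining links are expressed with balanced brackets. (A simple trick, really.)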
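As a minimal sketch of the -/+ part only (bracket placement for the non-tree links is not covered), the DFS walk from the example can be turned into symbols in Python. It assumes that "-" means stepping down to a new child and "+" means stepping back to the parent, which is what the example row suggests; the function names `encode_walk` and `decode_edges` are hypothetical.

```python
def encode_walk(walk):
    """Turn a DFS walk over a tree (list of visited vertices) into a -/+ string."""
    stack = [walk[0]]              # current path from the root to the vertex we stand on
    symbols = []
    for v in walk[1:]:
        if len(stack) >= 2 and stack[-2] == v:
            stack.pop()            # stepping back to the parent
            symbols.append("+")
        else:
            stack.append(v)        # stepping down to a new child
            symbols.append("-")
    return "".join(symbols)


def decode_edges(symbols):
    """Rebuild the tree edges, numbering vertices 1, 2, 3, ... in first-visit order."""
    edges, stack, next_id = [], [1], 2
    for s in symbols:
        if s == "-":
            edges.append((stack[-1], next_id))
            stack.append(next_id)
            next_id += 1
        else:
            stack.pop()
    return edges


walk = [1, 2, 3, 2, 4, 2, 1, 5, 6, 5, 7, 8, 7, 5]
code = encode_walk(walk)
print(code)                        # --+-++--+--++
print(decode_edges(code))
```

With the brackets stripped, the final encoding above reduces to the same 13 symbols, so the tree costs one symbol per step of the walk.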

 

Adjacency matrix, adjacency list

Each vertex is associated with a (sorted or unsorted) array of adjacent vertices.
This is more space-efficient for sparse graphs.

This is just the basic adjacency list, which addresses the sparsity problem.
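To make the space argument concrete, here is a small Python sketch on a made-up 4-vertex directed graph (the edge list and the helper name `build_adjacency_list` are illustrative assumptions). It compares the number of cells an adjacency matrix would need with the number of entries the adjacency list stores.

```python
from collections import defaultdict


def build_adjacency_list(edges):
    """Map each source vertex to a sorted list of its successors."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
    return {u: sorted(vs) for u, vs in adj.items()}


edges = [(0, 1), (0, 3), (1, 2), (3, 2)]   # hypothetical sparse graph
n = 4                                      # number of vertices

adj = build_adjacency_list(edges)
matrix_cells = n * n                                  # the matrix always stores n^2 cells
list_entries = sum(len(vs) for vs in adj.values())    # the list stores one entry per edge

print(adj)
print(matrix_cells, list_entries)
```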

 


 

Web Graph representation and compression

Link: http://www.touchgraph.com/TGGoogleBrowser.html

The main challenges are:

• Graph is highly dynamic
  – Nodes and edges are added/deleted often
  – Content of existing nodes is also subject to change
  – Pages and hyperlinks created on the fly
• Apart from the primary connected component, there are also smaller disconnected components

 

The main properties are:

Locality: most hyperlinks are local, i.e., they point to other URLs on the same host. The literature reports that on average 80% of the hyperlinks are local.

Consecutivity: links within the same page are likely to be consecutive with respect to lexicographic order.

Similarity: Pages on the same host tend to have many hyperlinks pointing to the same pages.

 

The following can be combined with [IR] Compression. (Both rely on similar compression ideas.)

 

Connectivity Server: URL compression

This is essentially a scheme similar to front coding, which exploits prefix redundancy.
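A rough Python sketch of front coding on a sorted URL list follows. Each URL is stored as the length of the prefix it shares with the previous URL plus the remaining suffix. The example.com URLs and the function names are made up for illustration; this is not the Connectivity Server's actual format.

```python
def front_encode(sorted_urls):
    """Encode each URL as (shared prefix length with previous URL, remaining suffix)."""
    encoded, prev = [], ""
    for url in sorted_urls:
        k = 0
        while k < min(len(prev), len(url)) and prev[k] == url[k]:
            k += 1
        encoded.append((k, url[k:]))
        prev = url
    return encoded


def front_decode(encoded):
    """Rebuild the URLs by reusing the first k characters of the previous URL."""
    urls, prev = [], ""
    for k, suffix in encoded:
        prev = prev[:k] + suffix
        urls.append(prev)
    return urls


urls = sorted([
    "http://example.com/a/index.html",
    "http://example.com/a/about.html",
    "http://example.com/b/index.html",
])
encoded = front_encode(urls)
for item in encoded:
    print(item)
```

Sorting first matters: lexicographic order places URLs from the same host and directory next to each other, which is what makes the shared prefixes long.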

 

 

Delta Encoding of the Adjacency Lists

Compression results:

Avg. inlink size: 34 bits   --> 8.9 bits
Avg. outlink size: 24 bits --> 11.03 bits

 

Principle:

Delta encoding is a way of storing or transmitting data in the form of differences (deltas) between sequential data rather than complete files; more generally this is known as data differencing. Delta encoding is sometimes called delta compression, particularly where archival histories of changes are required (e.g., in revision control software).

In other words, compression is achieved by recording only the differences.
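Applied to a sorted adjacency list, this can be sketched in Python as keeping the first id and then only the difference to the previous id. The list is the successor list that appears in the decoding walkthrough later in this post; the function names are hypothetical, and a real system would additionally write the small numbers with a variable-length code.

```python
from itertools import accumulate


def delta_encode(sorted_ids):
    """Keep the first id, then store each id as the difference to its predecessor."""
    if not sorted_ids:
        return []
    return [sorted_ids[0]] + [b - a for a, b in zip(sorted_ids, sorted_ids[1:])]


def delta_decode(deltas):
    """Running sums restore the original ids."""
    return list(accumulate(deltas))


successors = [15, 16, 17, 22, 23, 24, 315, 316, 317, 3041]
deltas = delta_encode(successors)
print(deltas)   # [15, 1, 1, 5, 1, 1, 291, 1, 1, 2724]
```

Most deltas are tiny compared with the raw ids, which is where the savings in bits per link come from.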

 

Interlist compression with representative list

ref: relative index of the representative adjacency list;
deletes: set of URL-ids to delete from the representative list (which entries to remove);
adds: set of URL-ids to add to the representative list (the entries that take their place).
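A minimal Python sketch of the deletes/adds idea, assuming both lists are plain lists of URL-ids and leaving out how the representative list is chosen (the `ref` field). The ids and the function names are invented for illustration.

```python
def encode_against(representative, target):
    """Describe `target` as edits to `representative`."""
    rep, tgt = set(representative), set(target)
    deletes = sorted(rep - tgt)   # ids in the representative list that target lacks
    adds = sorted(tgt - rep)      # ids in target that the representative list lacks
    return deletes, adds


def decode_against(representative, deletes, adds):
    """Apply the edits to recover the target list in sorted order."""
    return sorted((set(representative) - set(deletes)) | set(adds))


representative = [5, 7, 9, 12]    # hypothetical representative adjacency list
target = [5, 9, 12, 14]           # hypothetical list to compress

deletes, adds = encode_against(representative, target)
print(deletes, adds)
```

The scheme pays off when the two lists are similar, so that deletes and adds together are much shorter than the target list itself.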

Compression results:

Avg. inlink size: 5.66 bits
Avg. outlink size: 5.61 bits

 

 

 (WebGraph Framework)

                -- the process is described below

Compression results:

Avg. inlink size: 3.08 bits
Avg. outlink size: 2.89 bits

 

Compressing Gaps

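Note:

The first gap s1 - x may be positive or negative; it is passed through v(x) to get the first value in the Successors column.

The parity of v(x) tells the sign of x:

    • odd: x < 0
    • even: x >= 0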
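The sketch below shows one mapping with exactly this parity property, v(x) = 2x for x >= 0 and 2|x| - 1 for x < 0. This formula is recalled from the WebGraph paper rather than stated in this post, so treat it as an assumption. The remaining gaps are assumed to be stored as s_i - s_(i-1) - 1; the function names are hypothetical, and the successor list for node 16 is the one from the decoding walkthrough later in this post.

```python
def v(x):
    """Map a signed integer to a non-negative one: even for x >= 0, odd for x < 0."""
    return 2 * x if x >= 0 else 2 * abs(x) - 1


def v_inverse(n):
    """Undo v: even values came from x >= 0, odd values from x < 0."""
    return n // 2 if n % 2 == 0 else -((n + 1) // 2)


def encode_gaps(x, successors):
    """First entry is v(s1 - x); later entries are s_i - s_(i-1) - 1."""
    first = v(successors[0] - x)
    rest = [b - a - 1 for a, b in zip(successors, successors[1:])]
    return [first] + rest


def decode_gaps(x, gaps):
    """Rebuild the successor list of node x from its gap sequence."""
    successors = [x + v_inverse(gaps[0])]
    for g in gaps[1:]:
        successors.append(successors[-1] + g + 1)
    return successors


node = 16
successors = [15, 16, 17, 22, 23, 24, 315, 316, 317, 3041]
gaps = encode_gaps(node, successors)
print(gaps)   # [1, 0, 0, 4, 0, 0, 290, 0, 0, 2723]
```

Only the first gap needs v(x), because a node may link to a lower-numbered node; every later gap is non-negative since the list is strictly increasing.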

 

 

Using copy lists

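We can copy from a reference row. For example, here the row for Node 15 (Outdegree 11) is the reference, and a 0/1 sequence is built against it (1 = copy).

Other rows use this row as the reference and only need to store the entries that were not copied.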
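A small Python sketch of the copy-list idea. The successor list for node 15 is not spelled out in this post; the values below are recalled from the WebGraph paper's running example and should be treated as an assumption, as should the function names. The list for node 16 is the one from the decoding walkthrough later in this post.

```python
def copy_list_encode(reference, target):
    """Return (bits, extras): one bit per reference entry, plus entries not in the reference."""
    ref_set, tgt_set = set(reference), set(target)
    bits = "".join("1" if r in tgt_set else "0" for r in reference)
    extras = [t for t in target if t not in ref_set]
    return bits, extras


def copy_list_decode(reference, bits, extras):
    """Copy the flagged reference entries, then merge in the extra nodes."""
    copied = [r for r, b in zip(reference, bits) if b == "1"]
    return sorted(copied + extras)


# Assumed successors of node 15 (outdegree 11), used as the reference row.
node15 = [13, 15, 16, 17, 18, 19, 23, 24, 203, 315, 1034]
node16 = [15, 16, 17, 22, 23, 24, 315, 316, 317, 3041]

bits, extras = copy_list_encode(node15, node16)
print(bits)     # 01110011010
print(extras)   # [22, 316, 317, 3041]
```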

 

But there seem to be too many 0s in the 0/1 sequence; can we do something targeted about that?

 

Using copy blocks

Feature: copy blocks and skip blocks alternate.

A few points here are confusing, so here is a step-by-step walkthrough:

Encoding:

1. The last block is omitted.
2. The first copy block is 0 if the copy list starts with 0; that is, if the 0/1 sequence starts with 0, the copy blocks also start with 0.
3. The length is decremented by one for all blocks except the first one.

 

16, 10, 1, 01110011010

The first 0 acts as a flag; the second 0 is the 0 of the 1st block listed below.

1st block: 0

2nd block: 3-1=2

3rd block: 2-1=1

4th block: 2-1=1

5th block: 1-1=0

6th block: 1-1=0

7th block: 1-1=0  // The last block is omitted;
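The three encoding rules can be sketched in Python as run-length encoding of the 0/1 sequence. This is an illustrative reading of the rules above, not the WebGraph implementation, and `copy_blocks_encode` is a hypothetical name. The leading flag 0 is modelled as an empty first copy block.

```python
from itertools import groupby


def copy_blocks_encode(bits):
    """Turn a copy list such as '01110011010' into a list of block values."""
    lengths = [len(list(group)) for _, group in groupby(bits)]  # lengths of runs of equal bits
    if bits.startswith("0"):
        lengths.insert(0, 0)       # rule 2: an empty first copy block
    lengths = lengths[:-1]         # rule 1: the last block is omitted
    if not lengths:
        return []
    # rule 3: every block except the first stores its length minus one
    return [lengths[0]] + [n - 1 for n in lengths[1:]]


blocks = copy_blocks_encode("01110011010")
print(blocks)   # [0, 0, 2, 1, 1, 0, 0]
```

The output is the flag 0 followed by the values 0, 2, 1, 1, 0, 0 worked out by hand above.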

 

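In any case, the copy blocks can at least be expanded back into the copy list.

Note: whether a node's own extra numbers come between a copy and a skip can be determined from the increasing order of the list.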

 

Decoding:

copy next 2+1=3 -> 15 16 17

skip next 1+1=2 -> 15 16 17

copy next 1+1=2 -> 15 16 17 22 23 24   // the list is increasing, so the extra node 22 goes before 23, 24

skip next 0+1=1 -> 15 16 17 22 23 24 

copy next 0+1=1 -> 15 16 17 22 23 24 315

last block (omitted): skip the rest of the reference list, then append the remaining extra nodes -> 15 16 17 22 23 24 315 316 317 3041
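The decoding steps can be sketched in Python as follows: expand the block values back into a 0/1 sequence, copy the flagged entries from the reference row, and merge in the extra nodes by sorted order. The reference list for node 15 is recalled from the WebGraph paper's example and is an assumption here, as are the function names.

```python
def copy_blocks_decode(blocks, reference_length):
    """Expand block values into the 0/1 copy list over the reference row."""
    parts, bit = [], "1"                       # blocks alternate, starting with a copy block
    for i, stored in enumerate(blocks):
        length = stored if i == 0 else stored + 1   # undo the minus-one of rule 3
        parts.append(bit * length)
        bit = "0" if bit == "1" else "1"
    produced = sum(len(p) for p in parts)
    parts.append(bit * (reference_length - produced))  # the omitted last block covers the rest
    return "".join(parts)


def decode_successors(reference, blocks, extras):
    """Copy flagged reference entries, then merge the extra nodes in increasing order."""
    bits = copy_blocks_decode(blocks, len(reference))
    copied = [r for r, b in zip(reference, bits) if b == "1"]
    return sorted(copied + extras)


# Assumed successors of node 15 (the reference row).
node15 = [13, 15, 16, 17, 18, 19, 23, 24, 203, 315, 1034]
blocks = [0, 0, 2, 1, 1, 0, 0]
extras = [22, 316, 317, 3041]

print(copy_blocks_decode(blocks, len(node15)))    # 01110011010
print(decode_successors(node15, blocks, extras))
```

Because the reference length is known, the omitted last block needs no stored value: it is whatever remains of the reference row.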

 

Addendum

If the 0/1 sequence does not start with 0, then the first copy block is not 0.

 

Conclusions

The compression techniques are specialized for Web Graphs.
The average link size decreases as the graph grows.
The average link access time increases as the graph grows.
The  seems to have the best trade-off between avg. bit size and access time.
