Greenplum 行存、列存，堆表、AO表的原理和选择

时间 2019-12-09

标签 greenplum 原理选择繁體版

原文原文链接

转载自：html

https://github.com/digoal/blog/blob/master/201708/20170818_02.md?spm=a2c4e.11153940.blogcont179210.17.6f682764HWr8pC&file=20170818_02.mdgit

背景

Greenplum支持行存和列存，支持堆表和AO表，那么他们有什么不一样，如何选择呢？github

行存和列存的原理

一、行存，以行为形式组织存储，一行是一个tuple，存在一块儿。当须要读取某列时，须要将这列前面的全部列都进行deform，因此访问第一列和访问最后一列的成本其实是不同的。sql

在这篇文档中，有deform的详细介绍。《PostgreSQL 向量化执行插件(瓦片式实现) 10x提速OLAP》数据库

行存小结：架构

全表扫描要扫描更多的数据块。app

压缩比较低。dom

读取任意列的成本不同，越靠后的列，成本越高。函数

不适合向量计算、JIT架构。（简单来讲，就是不适合批处理形式的计算）oop

须要REWRITE表时，须要对全表进行REWRITE，例如加字段有默认值。

二、列存，以列为形式组织存储，每列对应一个或一批文件。读取任一列的成本是同样的，可是若是要读取多列，须要访问多个文件，访问的列越多，开销越大。

列存小结：

压缩比高。

仅仅支持AO存储（后面会将）。

读取任意列的成本是同样的。

很是适合向量计算、JIT架构。对大批量数据的访问和统计，效率更高。

读取不少列时，因为须要访问更多的文件，成本更高。例如查询明细。

须要REWRITE表时，不须要对全表操做，例如加字段有默认值，只是添加字段对应的那个文件。

何时选择行存

若是OLTP的需求偏多，例如常常须要查询表的明细（输出不少列），须要更多的更新和删除操做时。能够考虑行存。

何时选择列存

若是OLAP的需求偏多，常常须要对数据进行统计时，选择列存。

须要比较高的压缩比时，选择列存。

若是用户有混合需求，能够采用分区表，例如按时间维度的需求分区，近期的数据明细查询多，那就使用行存，对历史的数据统计需求多那就使用列存。

堆表和AO表的原理

一、堆表，实际上就是PG的堆存储，堆表的全部变动都会产生REDO，能够实现时间点恢复。可是堆表不能实现逻辑增量备份（由于表的任意一个数据块都有可能变动，不方便经过堆存储来记录位点。）。

一个事务结束时，经过clog以及REDO来实现它的可靠性。同时支持经过REDO来构建MIRROR节点实现数据冗余。

二、AO表，看名字就知道，只追加的存储，删除更新数据时，经过另外一个BITMAP文件来标记被删除的行，经过bit以及偏移对齐来断定AO表上的某一行是否被删除。

事务结束时，须要调用FSYNC，记录最后一次写入对应的数据块的偏移。（而且这个数据块即便只有一条记录，下次再发起事务又会从新追加一个数据块）同时发送对应的数据块给MIRROR实现数据冗余。

所以AO表不适合小事务，由于每次事务结束都会FSYNC，同时事务结束后这个数据块即便有空余也不会被复用。（你能够测试一下，AO表单条提交的IO放大很严重）。

虽然如此，AO表很是适合OLAP场景，批量的数据写入，高压缩比，逻辑备份支持增量备份，所以每次记录备份到的偏移量便可。加上每次备份全量的BITMAP删除标记（很小）。

何时选择堆表

当数据写入时，小事务偏多时选择堆表。

当须要时间点恢复时，选择堆表。

何时选择AO表

当须要列存时，选择AO表。

当数据批量写入时，选择AO表。

测试对比行存deform和列存的性能差异

一、建立一个函数，用于建立400列的表（行存堆表、AO行存表、AO列存表）。

create or replace function f(name, int, text) returns void as $$  
declare  
  res text := '';  
begin  
  for i in 1..$2 loop  
    res := res||'c'||i||' int8,';  
  end loop;  
  res := rtrim(res, ',');  
  if $3 = 'ao_col' then  
    res := 'create table '||$1||'('||res||') with  (appendonly=true, blocksize=8192, compresstype=none, orientation=column)';  
  elsif $3 = 'ao_row' then  
    res := 'create table '||$1||'('||res||') with  (appendonly=true, blocksize=8192, orientation=row)';  
  elsif $3 = 'heap_row' then  
    res := 'create table '||$1||'('||res||') with  (appendonly=false)';  
  else  
    raise notice 'use ao_col, ao_row, heap_row as $3';  
    return;  
  end if;  
  execute res;  
end;  
$$ language plpgsql;

二、建立表以下

postgres=# select f('tbl_ao_col', 400, 'ao_col');  
postgres=# select f('tbl_ao_row', 400, 'ao_row');  
postgres=# select f('tbl_heap_row', 400, 'heap_row');

三、建立1个函数，用于填充数据，其中第一个和最后3个字段为测试数据的字段，其余都填充1。

create or replace function f_ins1(name, int, int8) returns void as $$  
declare  
  res text := '';  
begin  
  for i in 1..($2-4) loop  
    res := res||'1,';  
  end loop;  
  res := 'id,'||res;  
  res := rtrim(res, ',');  
  res := 'insert into '||$1||' select '||res||'id,random()*10000,random()*100000 from generate_series(1,'||$3||') t(id)';  
  execute res;  
end;  
$$ language plpgsql;

四、填充数据

postgres=# select f_ins1('tbl_ao_col',400,1000000);

五、建立1个函数，用于填充数据，其中前4个字段为测试数据的字段，其余都填充1。

create or replace function f_ins2(name, int, int8) returns void as $$  
declare  
  res text := '';  
begin  
  for i in 1..($2-4) loop  
    res := res||'1,';  
  end loop;  
  res := 'id,id,random()*10000,random()*100000,'||res;  
  res := rtrim(res, ',');  
  res := 'insert into '||$1||' select '||res||' from generate_series(1,'||$3||') t(id)';  
  execute res;  
end;  
$$ language plpgsql;

六、填充数据

postgres=# select f_ins1('tbl_ao_col',400,1000000);  
 f_ins1   
--------  
   
(1 row)  
  
postgres=# insert into tbl_ao_row select * from tbl_ao_col;  
INSERT 0 1000000  
postgres=# insert into tbl_heap_row select * from tbl_ao_col;  
INSERT 0 1000000

七、表分析

postgres=# analyze tbl_ao_col ;  
ANALYZE  
postgres=# analyze tbl_ao_row;  
ANALYZE  
postgres=# analyze tbl_heap_row;  
ANALYZE

八、表大小

postgres=# select pg_size_pretty(pg_relation_size('tbl_ao_col'));  
 pg_size_pretty   
----------------  
 3060 MB  
(1 row)  
  
postgres=# select pg_size_pretty(pg_relation_size('tbl_ao_row'));  
 pg_size_pretty   
----------------  
 3117 MB  
(1 row)  
  
postgres=# select pg_size_pretty(pg_relation_size('tbl_heap_row'));  
 pg_size_pretty   
----------------  
 3473 MB  
(1 row)

九、行存堆表，前面几个字段的统计

postgres=# explain analyze select c2,count(*),sum(c3),avg(c3),min(c3),max(c3) from tbl_heap_row group by c2;  
                                                                        QUERY PLAN                                                                           
-----------------------------------------------------------------------------------------------------------------------------------------------------------  
 Gather Motion 48:1  (slice2; segments: 48)  (cost=136132.40..136132.42 rows=1 width=96)  
   Rows out:  1 rows at destination with 135 ms to end, start offset by 1.922 ms.  
   ->  HashAggregate  (cost=136132.40..136132.42 rows=1 width=96)  
         Group By: tbl_heap_row.c2  
         Rows out:  1 rows (seg42) with 0.002 ms to first row, 36 ms to end, start offset by 48 ms.  
         ->  Redistribute Motion 48:48  (slice1; segments: 48)  (cost=136132.35..136132.37 rows=1 width=96)  
               Hash Key: tbl_heap_row.c2  
               Rows out:  48 rows at destination (seg42) with 53 ms to end, start offset by 48 ms.  
               ->  HashAggregate  (cost=136132.35..136132.35 rows=1 width=96)  
                     Group By: tbl_heap_row.c2  
                     Rows out:  Avg 1.0 rows x 48 workers.  Max 1 rows (seg0) with 0.008 ms to first row, 1.993 ms to end, start offset by 48 ms.  
                     ->  Seq Scan on tbl_heap_row  (cost=0.00..121134.54 rows=20831 width=16)  
                           Rows out:  Avg 20833.3 rows x 48 workers.  Max 20854 rows (seg42) with 40 ms to first row, 73 ms to end, start offset by 50 ms.  
 Slice statistics:  
   (slice0)    Executor memory: 345K bytes.  
   (slice1)    Executor memory: 751K bytes avg x 48 workers, 751K bytes max (seg0).  
   (slice2)    Executor memory: 359K bytes avg x 48 workers, 374K bytes max (seg42).  
 Statement statistics:  
   Memory used: 128000K bytes  
 Settings:  optimizer=off  
 Optimizer status: legacy query optimizer  
 Total runtime: 138.524 ms  
(22 rows)

十、行存堆表，末尾几个字段的统计

postgres=# explain analyze select c398,count(*),sum(c399),avg(c399),min(c399),max(c399) from tbl_heap_row group by c398;  
                                                                         QUERY PLAN                                                                           
------------------------------------------------------------------------------------------------------------------------------------------------------------  
 Gather Motion 48:1  (slice2; segments: 48)  (cost=136576.82..136799.05 rows=9877 width=96)  
   Rows out:  10001 rows at destination with 212 ms to end, start offset by 1.917 ms.  
   ->  HashAggregate  (cost=136576.82..136799.05 rows=206 width=96)  
         Group By: tbl_heap_row.c398  
         Rows out:  Avg 208.4 rows x 48 workers.  Max 223 rows (seg17) with 0.001 ms to first row, 70 ms to end, start offset by 14 ms.  
         ->  Redistribute Motion 48:48  (slice1; segments: 48)  (cost=136132.35..136329.89 rows=206 width=96)  
               Hash Key: tbl_heap_row.c398  
               Rows out:  Avg 8762.2 rows x 48 workers at destination.  Max 9422 rows (seg46) with 93 ms to end, start offset by 48 ms.  
               ->  HashAggregate  (cost=136132.35..136132.35 rows=206 width=96)  
                     Group By: tbl_heap_row.c398  
                     Rows out:  Avg 8762.2 rows x 48 workers.  Max 8835 rows (seg2) with 0.003 ms to first row, 12 ms to end, start offset by 49 ms.  
                     ->  Seq Scan on tbl_heap_row  (cost=0.00..121134.54 rows=20831 width=16)  
                           Rows out:  Avg 20833.3 rows x 48 workers.  Max 20854 rows (seg42) with 40 ms to first row, 133 ms to end, start offset by 51 ms.  
 Slice statistics:  
   (slice0)    Executor memory: 377K bytes.  
   (slice1)    Executor memory: 1156K bytes avg x 48 workers, 1156K bytes max (seg0).  
   (slice2)    Executor memory: 414K bytes avg x 48 workers, 414K bytes max (seg1).  
 Statement statistics:  
   Memory used: 128000K bytes  
 Settings:  optimizer=off  
 Optimizer status: legacy query optimizer  
 Total runtime: 214.024 ms  
(22 rows)

十一、行存AO表，前面几个字段的统计

postgres=# explain analyze select c2,count(*),sum(c3),avg(c3),min(c3),max(c3) from tbl_ao_row group by c2;  
                                                                  QUERY PLAN                                                                     
-----------------------------------------------------------------------------------------------------------------------------------------------  
 Gather Motion 48:1  (slice2; segments: 48)  (cost=124755.04..124755.07 rows=1 width=96)  
   Rows out:  1 rows at destination with 149 ms to end, start offset by 1.890 ms.  
   ->  HashAggregate  (cost=124755.04..124755.07 rows=1 width=96)  
         Group By: tbl_ao_row.c2  
         Rows out:  1 rows (seg42) with 0.004 ms to first row, 55 ms to end, start offset by 64 ms.  
         ->  Redistribute Motion 48:48  (slice1; segments: 48)  (cost=124755.00..124755.02 rows=1 width=96)  
               Hash Key: tbl_ao_row.c2  
               Rows out:  48 rows at destination (seg42) with 32 ms to end, start offset by 64 ms.  
               ->  HashAggregate  (cost=124755.00..124755.00 rows=1 width=96)  
                     Group By: tbl_ao_row.c2  
                     Rows out:  Avg 1.0 rows x 48 workers.  Max 1 rows (seg0) with 0.001 ms to first row, 46 ms to end, start offset by 59 ms.  
                     ->  Append-only Scan on tbl_ao_row  (cost=0.00..109755.00 rows=20834 width=16)  
                           Rows out:  Avg 20833.3 rows x 48 workers.  Max 20854 rows (seg42) with 24 ms to end, start offset by 59 ms.  
 Slice statistics:  
   (slice0)    Executor memory: 345K bytes.  
   (slice1)    Executor memory: 770K bytes avg x 48 workers, 770K bytes max (seg0).  
   (slice2)    Executor memory: 359K bytes avg x 48 workers, 374K bytes max (seg42).  
 Statement statistics:  
   Memory used: 128000K bytes  
 Settings:  optimizer=off  
 Optimizer status: legacy query optimizer  
 Total runtime: 152.386 ms  
(22 rows)

十二、行存AO表，末尾几个字段的统计

postgres=# explain analyze select c398,count(*),sum(c399),avg(c399),min(c399),max(c399) from tbl_ao_row group by c398;  
                                                                     QUERY PLAN                                                                        
-----------------------------------------------------------------------------------------------------------------------------------------------------  
 Gather Motion 48:1  (slice2; segments: 48)  (cost=125186.01..125401.52 rows=9578 width=96)  
   Rows out:  10001 rows at destination with 183 ms to end, start offset by 1.846 ms.  
   ->  HashAggregate  (cost=125186.01..125401.52 rows=200 width=96)  
         Group By: tbl_ao_row.c398  
         Rows out:  Avg 208.4 rows x 48 workers.  Max 223 rows (seg17) with 0.003 ms to first row, 97 ms to end, start offset by 22 ms.  
         ->  Redistribute Motion 48:48  (slice1; segments: 48)  (cost=124755.00..124946.56 rows=200 width=96)  
               Hash Key: tbl_ao_row.c398  
               Rows out:  Avg 8762.2 rows x 48 workers at destination.  Max 9422 rows (seg46) with 32 ms to end, start offset by 68 ms.  
               ->  HashAggregate  (cost=124755.00..124755.00 rows=200 width=96)  
                     Group By: tbl_ao_row.c398  
                     Rows out:  Avg 8762.2 rows x 48 workers.  Max 8835 rows (seg2) with 0.013 ms to first row, 48 ms to end, start offset by 22 ms.  
                     ->  Append-only Scan on tbl_ao_row  (cost=0.00..109755.00 rows=20834 width=16)  
                           Rows out:  Avg 20833.3 rows x 48 workers.  Max 20854 rows (seg42) with 22 ms to end, start offset by 71 ms.  
 Slice statistics:  
   (slice0)    Executor memory: 377K bytes.  
   (slice1)    Executor memory: 1144K bytes avg x 48 workers, 1144K bytes max (seg0).  
   (slice2)    Executor memory: 414K bytes avg x 48 workers, 414K bytes max (seg0).  
 Statement statistics:  
   Memory used: 128000K bytes  
 Settings:  optimizer=off  
 Optimizer status: legacy query optimizer  
 Total runtime: 184.723 ms  
(22 rows)

1三、列存AO表，前面几个字段的统计

postgres=# explain analyze select c2,count(*),sum(c3),avg(c3),min(c3),max(c3) from tbl_ao_col group by c2;  
                                                                    QUERY PLAN                                                                      
--------------------------------------------------------------------------------------------------------------------------------------------------  
 Gather Motion 48:1  (slice2; segments: 48)  (cost=122928.04..122928.07 rows=1 width=96)  
   Rows out:  1 rows at destination with 104 ms to end, start offset by 1.878 ms.  
   ->  HashAggregate  (cost=122928.04..122928.07 rows=1 width=96)  
         Group By: tbl_ao_col.c2  
         Rows out:  1 rows (seg42) with 0.003 ms to first row, 18 ms to end, start offset by 55 ms.  
         ->  Redistribute Motion 48:48  (slice1; segments: 48)  (cost=122928.00..122928.02 rows=1 width=96)  
               Hash Key: tbl_ao_col.c2  
               Rows out:  48 rows at destination (seg42) with 30 ms to end, start offset by 55 ms.  
               ->  HashAggregate  (cost=122928.00..122928.00 rows=1 width=96)  
                     Group By: tbl_ao_col.c2  
                     Rows out:  Avg 1.0 rows x 48 workers.  Max 1 rows (seg0) with 0.007 ms to first row, 3.991 ms to end, start offset by 54 ms.  
                     ->  Append-only Columnar Scan on tbl_ao_col  (cost=0.00..107928.00 rows=20834 width=16)  
                           Rows out:  0 rows (seg0) with 40 ms to end, start offset by 56 ms.  
 Slice statistics:  
   (slice0)    Executor memory: 345K bytes.  
   (slice1)    Executor memory: 903K bytes avg x 48 workers, 903K bytes max (seg0).  
   (slice2)    Executor memory: 359K bytes avg x 48 workers, 374K bytes max (seg42).  
 Statement statistics:  
   Memory used: 128000K bytes  
 Settings:  optimizer=off  
 Optimizer status: legacy query optimizer  
 Total runtime: 106.859 ms  
(22 rows)

1四、列存AO表，末尾几个字段的统计

postgres=# explain analyze select c398,count(*),sum(c399),avg(c399),min(c399),max(c399) from tbl_ao_col group by c398;  
                                                                       QUERY PLAN                                                                         
--------------------------------------------------------------------------------------------------------------------------------------------------------  
 Gather Motion 48:1  (slice2; segments: 48)  (cost=123364.18..123582.28 rows=9693 width=96)  
   Rows out:  10001 rows at destination with 120 ms to end, start offset by 1.921 ms.  
   ->  HashAggregate  (cost=123364.18..123582.28 rows=202 width=96)  
         Group By: tbl_ao_col.c398  
         Rows out:  Avg 208.4 rows x 48 workers.  Max 223 rows (seg17) with 0.001 ms to first row, 54 ms to end, start offset by 35 ms.  
         ->  Redistribute Motion 48:48  (slice1; segments: 48)  (cost=122928.00..123121.86 rows=202 width=96)  
               Hash Key: tbl_ao_col.c398  
               Rows out:  Avg 8762.2 rows x 48 workers at destination.  Max 9422 rows (seg46) with 31 ms to end, start offset by 63 ms.  
               ->  HashAggregate  (cost=122928.00..122928.00 rows=202 width=96)  
                     Group By: tbl_ao_col.c398  
                     Rows out:  Avg 8762.2 rows x 48 workers.  Max 8835 rows (seg2) with 0.004 ms to first row, 8.004 ms to end, start offset by 82 ms.  
                     ->  Append-only Columnar Scan on tbl_ao_col  (cost=0.00..107928.00 rows=20834 width=16)  
                           Rows out:  0 rows (seg0) with 28 ms to end, start offset by 64 ms.  
 Slice statistics:  
   (slice0)    Executor memory: 377K bytes.  
   (slice1)    Executor memory: 1272K bytes avg x 48 workers, 1272K bytes max (seg0).  
   (slice2)    Executor memory: 414K bytes avg x 48 workers, 414K bytes max (seg0).  
 Statement statistics:  
   Memory used: 128000K bytes  
 Settings:  optimizer=off  
 Optimizer status: legacy query optimizer  
 Total runtime: 122.173 ms  
(22 rows)

对于非分布键的分组聚合请求，Greenplum采用了多阶段聚合以下：

第一阶段，在SEGMENT本地聚合。（须要扫描全部数据，这里不一样存储，前面的列和后面的列的差异就体现出来了，行存储的deform开销，在对后面的列进行统计时性能影响很明显。） Greenplum会根据字段的distinct值的比例，考虑是直接重分布数据，仍是先在本地聚合后再重分布数据（减小重分布的数据量）。

第二阶段，根据分组字段，将结果数据重分布。（重分布须要用到的字段，此时结果很小。）

第三阶段，再次在SEGMENT本地聚合。（须要对重分布后的数据进行聚合。）

第四阶段，返回结果给master，有必要的话master节点调用聚合函数的final func（已是不多的记录数和运算量）。

如何查看表的存储结构

pg_class.relstorage表示这个对象是什么存储：

postgres=# select distinct relstorage from pg_class ;    
 relstorage     
------------    
 a  -- 行存储AO表    
 h  -- heap堆表、索引    
 x  -- 外部表(external table)    
 v  -- 视图    
 c  -- 列存储AO表    
(5 rows)

查询当前数据库有哪些AO表：

postgres=# select t2.nspname, t1.relname from pg_class t1, pg_namespace t2 where t1.relnamespace=t2.oid and relstorage in ('c', 'a');    
 nspname  |      relname          
----------+-------------------    
 postgres | tbl_tag    
 postgres | tbl_pos_1_prt_p1    
 postgres | tbl_pos_1_prt_p2    
 postgres | tbl_pos_1_prt_p3    
 postgres | tbl_pos_1_prt_p4    
 postgres | tbl_pos_1_prt_p5    
 postgres | tbl_pos_1_prt_p6    
 postgres | tbl_pos_1_prt_p7    
 postgres | tbl_pos_1_prt_p8    
 postgres | tbl_pos_1_prt_p9    
 postgres | tbl_pos_1_prt_p10    
 postgres | tbl_pos    
 postgres | xx_czrk_qm_col    
 postgres | ao1    
(14 rows)

查询当前数据库有哪些堆表：

select t2.nspname, t1.relname from pg_class t1, pg_namespace t2 where t1.relnamespace=t2.oid and relstorage in ('h') and relkind='r';

参考

《PostgreSQL 向量化执行插件(瓦片式实现) 10x提速OLAP》

http://greenplum.org/docs/best_practices/schema.html