B-tree indexes in PostgreSQL

Date: 2020-03-28

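Structure

A B-tree index is suitable for data that can be sorted. In other words, the "greater", "greater or equal", "less", "less or equal", and "equal" operators must be defined for the data type.

As usual, index rows of a B-tree are packed into pages. In leaf pages, these rows contain the data to be indexed (keys) and references to table rows (TIDs). In internal pages, each row references a child page of the index and contains the minimal value in that page.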

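B-trees have a few important traits:

1. B-trees are balanced: every leaf page is separated from the root by the same number of internal pages. Therefore, a search for any value takes the same time.

2. B-trees are multi-branched: each page (usually 8 KB) holds many (often hundreds of) TIDs. As a result, the depth of a B-tree is small, typically no more than 4 or 5 levels even for very large tables.

3. Data in the index is sorted in nondecreasing order (both between pages and inside each page), and pages of the same level are connected to one another by a bidirectional list. Therefore, an ordered data set can be obtained just by walking the list in one direction or the other, without returning to the root each time.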

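Below is a simplified example of an index on one field with integer keys.

The very first page of the index is a metapage, which references the index root. Internal nodes are located below the root, and leaf pages are in the bottommost row. Down arrows represent references from leaf nodes to table rows (TIDs).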

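Search by equality

Consider a search for a value in the tree by a condition of the form "indexed-field = expression". Say we are interested in the key 49.

The root node has three entries: (4, 32, 64). The search starts at the root; since 32 ≤ 49 < 64, we descend to the child node referenced by 32. The same procedure is repeated at each level until we reach a leaf node, where the key 49 is found.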

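In reality, the search algorithm is not as simple as it looks. For example, a non-unique index may contain many equal values, and they may not fit on one page. How do we search then? Return to the example above and consider the second-level node (32, 43, 49). If we chose 49 and descended to its child node, we would skip the 49s stored in the preceding leaf page. Therefore, when an exactly equal key is found in an internal page, we step one entry to the left (to 43), descend to that child node, and then scan the underlying level from left to right.

(Another complication is that the tree structure can change while the search is in progress, for example when a page is split.)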

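Search by inequality

When searching by the condition "indexed-field ≤ expression" (or "indexed-field ≥ expression"), we first locate the value in the index by the equality condition "indexed-field = expression" (or the position where it would be, if no such value exists) and then walk through the leaf pages in the appropriate direction to the end.

The figure illustrates this process for n ≤ 35:

The "greater" and "less" operators are supported in the same way, except that the value found by the equality search must be excluded.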

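Search by range

When searching by the range "expression1 ≤ indexed-field ≤ expression2", we first find a value by the condition "indexed-field = expression1" and then walk through the leaf pages from left to right while the condition "indexed-field ≤ expression2" holds. Or the other way round: start from the second expression and walk from right to left until the first condition is no longer met.

The figure shows this process for the condition 23 ≤ n ≤ 64: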

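Example

Let's look at what query plans look like, using the aircrafts table of the demo database. The table contains only 9 rows, and since it fits into a single page, the planner would normally choose not to use an index. We use it anyway because it is convenient for illustration.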

demo=# select * from aircrafts;
 aircraft_code |        model        | range
---------------+---------------------+-------
 773           | Boeing 777-300      | 11100
 763           | Boeing 767-300      |  7900
 SU9           | Sukhoi SuperJet-100 |  3000
 320           | Airbus A320-200     |  5700
 321           | Airbus A321-200     |  5600
 319           | Airbus A319-100     |  6700
 733           | Boeing 737-300      |  4200
 CN1           | Cessna 208 Caravan  |  1200
 CR2           | Bombardier CRJ-200  |  2700
(9 rows)
demo=# create index on aircrafts(range);
demo=# set enable_seqscan = off;

(More explicitly: create index on aircrafts using btree(range). A B-tree is what gets built by default, so "using btree" can be omitted.)
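To double-check which access method an index uses, one can look at its definition in the pg_indexes system view. This is a minimal sketch and assumes only that the statements above have been run:

```sql
-- List the indexes of the aircrafts table together with their definitions;
-- the definition text contains "USING btree" for B-tree indexes.
select indexname, indexdef
from pg_indexes
where tablename = 'aircrafts';
```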

Search by equality:

demo=# explain(costs off) select * from aircrafts where range = 3000;
                    QUERY PLAN                     
---------------------------------------------------
 Index Scan using aircrafts_range_idx on aircrafts
   Index Cond: (range = 3000)
(2 rows)

Search by inequality:

demo=# explain(costs off) select * from aircrafts where range < 3000;
                    QUERY PLAN                    
---------------------------------------------------
 Index Scan using aircrafts_range_idx on aircrafts
   Index Cond: (range < 3000)
(2 rows)

Search by range:

demo=# explain(costs off) select * from aircrafts
where range between 3000 and 5000;
                     QUERY PLAN                      
-----------------------------------------------------
 Index Scan using aircrafts_range_idx on aircrafts
   Index Cond: ((range >= 3000) AND (range <= 5000))
(2 rows)

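Ordering

Let's emphasize once more that the "btree" access method walks index entries in sorted order. An index scan or index-only scan therefore returns rows already ordered (a bitmap heap scan, by contrast, fetches rows in physical order). So if a table has an index on the sort condition, the optimizer considers both options: an index scan of the table, which readily returns sorted data, or a sequential scan of the table followed by sorting the result.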
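As an illustration of the first option, the following sketch asks the planner for the plans of two sorted queries. It assumes the index on aircrafts(range) created above and enable_seqscan still set to off. Under those assumptions an index scan would be expected for the ascending order and a backward index scan for the descending one, although the final choice is up to the planner.

```sql
-- Ascending order: can be served by walking the index forward.
explain (costs off)
select * from aircrafts order by range;

-- Descending order: the same index can be walked backward.
explain (costs off)
select * from aircrafts order by range desc;
```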

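Sort order

When creating an index, we can explicitly specify the sort order. For example, an index on the range column with descending order can be created like this: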

demo=# create index on aircrafts(range desc);

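In this case, larger values appear on the left of the tree and smaller values on the right. Why would this be needed if an index can be walked in either direction? The purpose is multicolumn indexes. Let's create a view that shows aircraft models divided into three classes by range: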

demo=# create view aircrafts_v as
select model,
       case
           when range < 4000 then 1
           when range < 10000 then 2
           else 3
       end as class
from aircrafts;
 
 
demo=# select * from aircrafts_v;
        model        | class
---------------------+-------
 Boeing 777-300      |     3
 Boeing 767-300      |     2
 Sukhoi SuperJet-100 |     1
 Airbus A320-200     |     2
 Airbus A321-200     |     2
 Airbus A319-100     |     2
 Boeing 737-300      |     2
 Cessna 208 Caravan  |     1
 Bombardier CRJ-200  |     1
(9 rows)

And let's create an index (using an expression):

demo=# create index on aircrafts(  (case when range < 4000 then 1 when range < 10000 then 2 else 3 end),  model);

Now we can use this index to get the data sorted by both columns in ascending order:

demo=# select class, model from aircrafts_v order by class, model;
 class |        model        
-------+---------------------
     1 | Bombardier CRJ-200
     1 | Cessna 208 Caravan
     1 | Sukhoi SuperJet-100
     2 | Airbus A319-100
     2 | Airbus A320-200
     2 | Airbus A321-200
     2 | Boeing 737-300
     2 | Boeing 767-300
     3 | Boeing 777-300
(9 rows)
 
 
demo=# explain(costs off)
select class, model from aircrafts_v order by class, model;
                       QUERY PLAN                       
--------------------------------------------------------
 Index Scan using aircrafts_case_model_idx on aircrafts
(1 row)

Similarly, we can run the query with the data sorted in descending order:

demo=# select class, model from aircrafts_v order by class desc, model desc;
 class |        model        
-------+---------------------
     3 | Boeing 777-300
     2 | Boeing 767-300
     2 | Boeing 737-300
     2 | Airbus A321-200
     2 | Airbus A320-200
     2 | Airbus A319-100
     1 | Sukhoi SuperJet-100
     1 | Cessna 208 Caravan
     1 | Bombardier CRJ-200
(9 rows)
demo=# explain(costs off)
select class, model from aircrafts_v order by class desc, model desc;
                           QUERY PLAN                            
-----------------------------------------------------------------
 Index Scan BACKWARD using aircrafts_case_model_idx on aircrafts
(1 row)

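However, the index cannot be used to get data sorted by one column in ascending order and by the other in descending order. A separate sort is required: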

demo=# explain(costs off)
select class, model from aircrafts_v order by class ASC, model DESC;
                   QUERY PLAN                    
-------------------------------------------------
 Sort
   Sort Key: (CASE ... END), aircrafts.model DESC
   ->  Seq Scan on aircrafts
(3 rows)

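(Note that, as a last resort, the planner chose a sequential scan despite the earlier "enable_seqscan = off" setting. This setting does not forbid sequential scans, it only makes them look very expensive to the planner; look at the plan with "costs on".)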
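To see this in the plan, the same query can be explained with cost estimates shown. This is a sketch that assumes the view and settings from above; how the penalty is displayed depends on the PostgreSQL version (older versions add a very large constant to the cost of the Seq Scan node).

```sql
-- Confirm the current setting.
show enable_seqscan;

-- Same query as above, now with cost estimates (the default for EXPLAIN).
explain
select class, model from aircrafts_v order by class asc, model desc;
```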

To make this query use an index, the index must be built with the needed sort directions:

demo=# create index aircrafts_case_asc_model_desc_idx on aircrafts(
 (case
    when range < 4000 then 1
    when range < 10000 then 2
    else 3
  end) ASC,
  model DESC);
 
 
demo=# explain(costs off)
select class, model from aircrafts_v order by class ASC, model DESC;
                           QUERY PLAN                            
-----------------------------------------------------------------
 Index Scan using aircrafts_case_asc_model_desc_idx on aircrafts
(1 row)

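Order of columns

Another issue that arises with multicolumn indexes is the order in which the columns are listed. For a B-tree this order is of huge importance: the data inside pages is sorted by the first field, then by the second one, and so on.

The index that we built on the class (range interval) and model columns can be pictured like this:

Of course, an index this small fits into a single root page. It is deliberately spread over several pages in the figure for clarity.

It is clear from the figure that a search by a predicate such as "class = 3" (search by the first field only) or "class = 3 and model = 'Boeing 777-300'" (search by both fields) will be efficient.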

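However, a search by the predicate "model = 'Boeing 777-300'" will be far less efficient: starting at the root, we cannot determine which child node to descend to, so we have to descend into all of them. This does not mean that such an index can never be used; it is its efficiency that is in question. For example, if we had three classes of aircraft and a great many models in each class, we would have to look through about one third of the index, and this might be more efficient than a full table scan. Or it might not.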

On the other hand, if we create an index like this:

demo=# create index on aircrafts(  model,  (case when range < 4000 then 1 when range < 10000 then 2 else 3 end));

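the order of the index fields changes:

With this index, a search by the predicate "model = 'Boeing 777-300'" will work efficiently, while a search by "class = 3" will not.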
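The effect of column order can be checked by comparing the plans for the two predicates. This is only a sketch: it assumes the aircrafts_v view and the two multicolumn indexes created above, and on a nine-row table the planner's choice says little about real-world efficiency.

```sql
-- Predicate on class only: matches the leading column of the (class, model) index.
explain (costs off)
select class, model from aircrafts_v where class = 3;

-- Predicate on model only: matches the leading column of the (model, class) index.
explain (costs off)
select class, model from aircrafts_v where model = 'Boeing 777-300';
```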

NULLs

The "btree" access method indexes NULLs and supports searching by the conditions IS NULL and IS NOT NULL.

Consider an index on flight arrival times, where NULLs do occur:

demo=# create index on flights(actual_arrival);
demo=# explain(costs off) select * from flights where actual_arrival is null;
                      QUERY PLAN                       
-------------------------------------------------------
 Bitmap Heap Scan on flights
   Recheck Cond: (actual_arrival IS NULL)
   ->  Bitmap Index Scan on flights_actual_arrival_idx
         Index Cond: (actual_arrival IS NULL)
(4 rows)

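NULLs are located at one end of the leaf level or the other, depending on how the index was created (NULLS FIRST or NULLS LAST). This matters if a query includes sorting: the index can be used if the SELECT command specifies the same placement of NULLs in its ORDER BY clause as the one the index was built with (NULLS FIRST or NULLS LAST).

In the following example these orders are the same, so the index can be used: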

demo=# explain(costs off)
select * from flights order by actual_arrival NULLS LAST;
                       QUERY PLAN                      
--------------------------------------------------------
 Index Scan using flights_actual_arrival_idx on flights
(1 row)

And here the orders differ, so the optimizer chooses a sequential scan followed by a sort:

demo=# explain(costs off)
select * from flights order by actual_arrival NULLS FIRST;
               QUERY PLAN              
----------------------------------------
 Sort
   Sort Key: actual_arrival NULLS FIRST
   ->  Seq Scan on flights
(3 rows)

For the index to be used, it must be created with NULLs located at the beginning:

demo=# create index flights_nulls_first_idx on flights(actual_arrival NULLS FIRST);
demo=# explain(costs off)
select * from flights order by actual_arrival NULLS FIRST;
                     QUERY PLAN                      
-----------------------------------------------------
 Index Scan using flights_nulls_first_idx on flights
(1 row)
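Because a B-tree can also be walked backward, the default index (ascending, NULLs last) read in reverse yields descending order with NULLs first. The sketch below checks this, assuming the flights_actual_arrival_idx index created earlier still exists; a backward index scan would be expected, though the planner makes the final choice.

```sql
-- DESC NULLS FIRST is exactly the reverse of the default ASC NULLS LAST order.
explain (costs off)
select * from flights order by actual_arrival desc nulls first;
```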

Issues like this arise because NULLs are not sortable: the result of comparing NULL with any other value is undefined:

demo=# \pset null NULL
demo=# select null < 42;
 ?column?
----------
 NULL
(1 row)

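This runs counter to the concept of a B-tree and does not fit the general pattern. NULLs, however, play such an important role in databases that special arrangements always have to be made for them.

Since NULLs can be indexed, an index can be used even when no conditions are imposed on the table at all (because the index is certain to contain information about every table row). This may make sense if the query requires ordered data and the index provides the needed order. In that case the planner may prefer index access in order to save on a separate sort.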

Properties

Let's look at the properties of the "btree" access method.

 amname |     name      | pg_indexam_has_property
--------+---------------+-------------------------
 btree  | can_order     | t
 btree  | can_unique    | t
 btree  | can_multi_col | t
 btree  | can_exclude   | t

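As we can see, a B-tree can order data and supports uniqueness; among the built-in access methods it is the only one with these two properties. Multicolumn indexes are also supported, but some other access methods support them as well. We will discuss support of the EXCLUDE constraint next time.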
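A table like the one above can be produced with the pg_indexam_has_property function. The query below is one possible way to write it, not necessarily the one used for the output shown:

```sql
-- Access-method-level properties of btree.
select a.amname, p.name, pg_indexam_has_property(a.oid, p.name)
from pg_am a,
     unnest(array['can_order', 'can_unique', 'can_multi_col', 'can_exclude']) p(name)
where a.amname = 'btree'
order by a.amname;
```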

     name      | pg_index_has_property
---------------+-----------------------
 clusterable   | t
 index_scan    | t
 bitmap_scan   | t
 backward_scan | t

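The "btree" access method supports both ways of getting values: index scan and bitmap scan. And as we have seen, the access method can walk the tree both forward and backward.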

      name          | pg_index_column_has_property
--------------------+------------------------------
 asc                | t
 desc               | f
 nulls_first        | f
 nulls_last         | t
 orderable          | t
 distance_orderable | f
 returnable         | t
 search_array       | t
 search_nulls       | t

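The first four properties specify exactly how the values of a particular column are ordered. In this example, values are sorted in ascending order ("asc") with NULLs placed last ("nulls_last"). But as we have already seen, other combinations are possible.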
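The index-level and column-level tables can be obtained in a similar way with pg_index_has_property and pg_index_column_has_property. The sketch assumes the aircrafts_range_idx index created earlier; any B-tree index built with default ordering should report the same values.

```sql
-- Index-level properties.
select p.name, pg_index_has_property('aircrafts_range_idx'::regclass, p.name)
from unnest(array['clusterable', 'index_scan', 'bitmap_scan', 'backward_scan']) p(name);

-- Properties of the first (and only) column of the index.
select p.name,
       pg_index_column_has_property('aircrafts_range_idx'::regclass, 1, p.name)
from unnest(array['asc', 'desc', 'nulls_first', 'nulls_last', 'orderable',
                  'distance_orderable', 'returnable', 'search_array',
                  'search_nulls']) p(name);
```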

The "search_array" property indicates that the index supports expressions like this:

demo=# explain(costs off)
select * from aircrafts where aircraft_code in ('733','763','773');
                           QUERY PLAN                            
-----------------------------------------------------------------
 Index Scan using aircrafts_pkey on aircrafts
   Index Cond: (aircraft_code = ANY ('{733,763,773}'::bpchar[]))
(2 rows)

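The "returnable" property indicates support for index-only scans, which is reasonable since index rows store the indexed values themselves. It makes sense here to say a few words about covering indexes based on B-trees.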
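For instance, a query that needs only the indexed column is a candidate for an index-only scan. This is a sketch that assumes the aircrafts_range_idx index from above; whether the table is actually skipped also depends on the visibility map, so running VACUUM on the table first may be needed.

```sql
-- Only the indexed column is selected, so the index alone can answer the query.
explain (costs off)
select range from aircrafts where range < 3000;
```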

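Unique indexes with additional columns

As discussed earlier, a covering index is one that stores all the values a query needs, so that access to the table itself is (almost) not required. A unique index, in particular, can be covering.

But suppose we want to add to a unique index the extra columns that a query needs. Uniqueness of such composite values does not guarantee uniqueness of the original key, so two indexes on the same columns would be needed: a unique one to support the integrity constraint and a non-unique one to be used as covering. This is certainly inefficient.

In our company, Anastasiya Lubennikova (@lubennikovaav) improved the "btree" method so that additional, non-unique columns can be included in a unique index. We hoped this patch would be adopted by the community, and it has in fact been merged into PostgreSQL 11.

Consider the bookings table. Its primary key is replaced with one backed by a new index, bookings_pkey2 (inside a transaction, so that all the changes take effect together):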

demo=# begin;
demo=# alter table bookings drop constraint bookings_pkey cascade;
demo=# alter table bookings add primary key using index bookings_pkey2;
demo=# alter table tickets add foreign key (book_ref) references bookings (book_ref);
demo=# commit;
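The transaction above relies on the bookings_pkey2 index already being in place. Judging by the table description that follows, it is a unique index on book_ref that also includes book_date, so a statement along these lines would have to be run beforehand (a sketch; the INCLUDE clause requires PostgreSQL 11 or later):

```sql
-- Unique on book_ref only; book_date is stored in the index but is not part of the key.
create unique index bookings_pkey2 on bookings (book_ref) include (book_date);
```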

This is what we get:

demo=# \d bookings
              Table "bookings.bookings"
    Column    |           Type           | Modifiers
--------------+--------------------------+-----------
 book_ref     | character(6)             | not null
 book_date    | timestamp with time zone | not null
 total_amount | numeric(10,2)            | not null
Indexes:
    "bookings_pkey2" PRIMARY KEY, btree (book_ref) INCLUDE (book_date)
Referenced by:
TABLE "tickets" CONSTRAINT "tickets_book_ref_fkey" FOREIGN KEY (book_ref) REFERENCES bookings(book_ref)

Now one and the same index works both as a unique index and as a covering index for this query:

demo=# explain(costs off)
select book_ref, book_date from bookings where book_ref = '059FC4';
                    QUERY PLAN                    
--------------------------------------------------
 Index Only Scan using bookings_pkey2 on bookings
   Index Cond: (book_ref = '059FC4'::bpchar)
(2 rows)

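Index creation

It is well known, but no less important for that, that for a large table it is better to load the data without indexes and to create the needed indexes afterwards. This is not only faster, but also likely to produce a smaller index.

The reason is that creating a "btree" index uses a more efficient process than inserting values into the tree row by row. Roughly speaking, all the data in the table is sorted and the leaf pages are built from it. Then internal pages are built on top of this base until the whole pyramid converges to the root.

The speed of this process depends on the amount of available RAM, which is limited by the maintenance_work_mem parameter, so increasing its value can speed the process up. For unique indexes, memory of size work_mem is allocated in addition to maintenance_work_mem.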
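In practice this could look as follows. The table and column names here are hypothetical, and the memory value is only an example that should be chosen to fit the server:

```sql
-- Raise the limit for this session only.
set maintenance_work_mem = '1GB';

-- big_table and some_column are placeholder names.
create index on big_table (some_column);

-- Return to the configured default.
reset maintenance_work_mem;
```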

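Comparison semantics

Last time we mentioned that PostgreSQL needs to know which hash functions to call for values of different types, and that this association is stored by the "hash" access method. Likewise, the system must figure out how to order values. This is needed for sorts, groupings (sometimes), merge joins, and so on. PostgreSQL does not bind itself to operator names (such as >, <, =), since users can define their own data types and give the corresponding operators different names. Instead, the operator names are defined by the operator family used by the "btree" access method.

For example, these comparison operators are used in the bool_ops operator family: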

postgres=# select   amop.amopopr::regoperator as opfamily_operator,
         amop.amopstrategy
from     pg_am am,
         pg_opfamily opf,
         pg_amop amop
where    opf.opfmethod = am.oid
and      amop.amopfamily = opf.oid
and      am.amname = 'btree'
and      opf.opfname = 'bool_ops'
order by amopstrategy;
  opfamily_operator  | amopstrategy
---------------------+--------------
 <(boolean,boolean)  |            1
 <=(boolean,boolean) |            2
 =(boolean,boolean)  |            3
 >=(boolean,boolean) |            4
 >(boolean,boolean)  |            5
(5 rows)

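Here we can see five comparison operators, but as mentioned, we should not rely on their names. To tell which comparison each operator performs, the concept of a strategy is introduced. Five strategies are defined to describe operator semantics: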

1 — less

2 — less or equal

3 — equal

4 — greater or equal

5 — greater

Some operator families can contain several operators implementing one strategy. For example, the integer_ops operator family contains the following operators for strategy 1:

postgres=# select   amop.amopopr::regoperator as opfamily_operator
from     pg_am am,
         pg_opfamily opf,
         pg_amop amop
where    opf.opfmethod = am.oid
and      amop.amopfamily = opf.oid
and      am.amname = 'btree'
and      opf.opfname = 'integer_ops'
and      amop.amopstrategy = 1
order by opfamily_operator;
  opfamily_operator   
----------------------
 <(integer,bigint)
 <(smallint,smallint)
 <(integer,integer)
 <(bigint,bigint)
 <(bigint,integer)
 <(smallint,integer)
 <(integer,smallint)
 <(smallint,bigint)
 <(bigint,smallint)
(9 rows)

Thanks to this, the optimizer can avoid type casts when comparing values of different types that belong to one operator family.
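A small sketch of what this means for index usage: the range column of aircrafts is an integer, and the constant below is a bigint. Since integer_ops includes cross-type operators, the index condition is expected to compare the two values directly, without casting the column. The example assumes the index on aircrafts(range) from the earlier sections.

```sql
-- integer column compared with a bigint constant.
explain (costs off)
select * from aircrafts where range = 3000::bigint;
```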

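Index support for a new data type

The documentation provides an example of creating a new data type for complex numbers, along with an operator class to sort values of this type. That example uses the C language. But nothing prevents us from running the same experiment in pure SQL.

Let's create a new composite type with two fields, the real and the imaginary parts: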

postgres=# create type complex as (re float, im float);

And a table with a field of the new type, with some values added:

postgres=# create table numbers(x complex);
postgres=# insert into numbers values ((0.0, 10.0)), ((1.0, 3.0)), ((1.0, 1.0));

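Now a question arises: how do we order complex numbers if no order relation is defined for them in the mathematical sense?

As it turns out, comparison operators are already defined for us: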

postgres=# select * from numbers order by x;
   x    
--------
 (0,10)
 (1,1)
 (1,3)
(3 rows)

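By default, a composite type is sorted componentwise: the first fields are compared, then the second ones, and so on, much the same way as text strings are compared character by character. But we can define a different order. For example, complex numbers can be treated as vectors and ordered by modulus (length), that is, by sqrt(re² + im²). To define such an order, let's first create an auxiliary function that computes the modulus: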

postgres=# create function modulus(a complex) returns float as $$
    select sqrt(a.re*a.re + a.im*a.im);
$$ immutable language sql;
 
 
Now we systematically define functions for all five comparison operators using this auxiliary function:
postgres=# create function complex_lt(a complex, b complex) returns boolean as $$
    select modulus(a) < modulus(b);
$$ immutable language sql;
 
postgres=# create function complex_le(a complex, b complex) returns boolean as $$
    select modulus(a) <= modulus(b);
$$ immutable language sql;
 
postgres=# create function complex_eq(a complex, b complex) returns boolean as $$
    select modulus(a) = modulus(b);
$$ immutable language sql;
 
postgres=# create function complex_ge(a complex, b complex) returns boolean as $$
    select modulus(a) >= modulus(b);
$$ immutable language sql;
 
postgres=# create function complex_gt(a complex, b complex) returns boolean as $$
    select modulus(a) > modulus(b);
$$ immutable language sql;

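And we create the corresponding operators. Their names are deliberately unusual, to show that they need not be called ">", "<", and so on: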

postgres=# create operator #<#(leftarg=complex, rightarg=complex, procedure=complex_lt);
postgres=# create operator #<=#(leftarg=complex, rightarg=complex, procedure=complex_le);
postgres=# create operator #=#(leftarg=complex, rightarg=complex, procedure=complex_eq);
postgres=# create operator #>=#(leftarg=complex, rightarg=complex, procedure=complex_ge);
postgres=# create operator #>#(leftarg=complex, rightarg=complex, procedure=complex_gt);

At this point, we can compare the numbers:

postgres=# select (1.0,1.0)::complex #<# (1.0,3.0)::complex;
 ?column?
----------
 t
(1 row)

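In addition to the five operators, the "btree" access method requires one more function to be defined: it must return -1, 0, or 1 if the first value is less than, equal to, or greater than the second one, respectively. This auxiliary function is called a support function. Other access methods may require other support functions.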

postgres=# create function complex_cmp(a complex, b complex) returns integer as $$
    select case when modulus(a) < modulus(b) then -1
                when modulus(a) > modulus(b) then 1
                else 0
           end;
$$ immutable language sql;

Now we are ready to create an operator class (an operator family of the same name is created automatically):

postgres=# create operator class complex_ops
default for type complex
using btree as
    operator 1 #<#,
    operator 2 #<=#,
    operator 3 #=#,
    operator 4 #>=#,
    operator 5 #>#,
function 1 complex_cmp(complex,complex);
 
Now sorting works as intended:
postgres=# select * from numbers order by x;
   x    
--------
 (1,1)
 (1,3)
 (0,10)
(3 rows)
 
The support functions can be listed with this query:
 
postgres=# select amp.amprocnum,
       amp.amproc,
       amp.amproclefttype::regtype,
       amp.amprocrighttype::regtype
from   pg_opfamily opf,
       pg_am am,
       pg_amproc amp
where  opf.opfname = 'complex_ops'
and    opf.opfmethod = am.oid
and    am.amname = 'btree'
and    amp.amprocfamily = opf.oid;
 amprocnum |   amproc    | amproclefttype | amprocrighttype
-----------+-------------+----------------+-----------------
         1 | complex_cmp | complex        | complex
(1 row)
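With the operator class in place, a B-tree index on the column becomes possible as well. This is a sketch that assumes the numbers table and the operators defined above; on such a tiny table sequential scans have to be discouraged for the index to show up in the plan.

```sql
-- The index is expected to use complex_ops, the default operator class for the complex type.
create index on numbers (x);

set enable_seqscan = off;

-- #<# is strategy 1 of complex_ops, so it can serve as an index condition.
explain (costs off)
select * from numbers where x #<# (1.0, 3.0)::complex;
```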

Internals

The internal structure of a B-tree can be explored with the "pageinspect" extension:

demo=# create extension pageinspect;

The index metapage:

demo=# select * from bt_metap('ticket_flights_pkey');
 magic  | version | root | level | fastroot | fastlevel
--------+---------+------+-------+----------+-----------
 340322 |       2 |  164 |     2 |      164 |         2
(1 row)

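The most interesting thing here is the index level: an index on a table with one million rows needs as few as 2 levels (not counting the root).

Statistics for block 164 (the root):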

demo=# select type, live_items, dead_items, avg_item_size, page_size, free_size
from bt_page_stats('ticket_flights_pkey',164);
 type | live_items | dead_items | avg_item_size | page_size | free_size
------+------------+------------+---------------+-----------+-----------
 r    |         33 |          0 |            31 |      8192 |      6984
(1 row)

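And the data in this block (the "data" field, truncated here to fit the screen width, holds the indexing key in binary form):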

demo=# select itemoffset, ctid, itemlen, left(data,56) as data
from bt_page_items('ticket_flights_pkey',164) limit 5;
 itemoffset |  ctid   | itemlen |                           data                           
------------+---------+---------+----------------------------------------------------------
          1 | (3,1)   |       8 |
          2 | (163,1) |      32 | 1d 30 30 30 35 34 33 32 33 30 35 37 37 31 00 00 ff 5f 00
          3 | (323,1) |      32 | 1d 30 30 30 35 34 33 32 34 32 33 36 36 32 00 00 4f 78 00
          4 | (482,1) |      32 | 1d 30 30 30 35 34 33 32 35 33 30 38 39 33 00 00 4d 1e 00
          5 | (641,1) |      32 | 1d 30 30 30 35 34 33 32 36 35 35 37 38 35 00 00 2b 09 00
(5 rows)

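The first tuple of a page is normally reserved for a service entry, the upper bound of all values on the page, with the real data starting from the second tuple. The root, however, is the rightmost page on its level and needs no upper bound, so here the first tuple, which carries no key data, simply references the leftmost child, block 3. It is followed by blocks 163, 323, and so on. These, in turn, can be explored with the same functions.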
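For example, one of the child pages could be inspected like this (a sketch; block 163 is taken from the root page output above):

```sql
-- Page-level statistics; the type column distinguishes internal and leaf pages.
select type, live_items, dead_items, avg_item_size, page_size, free_size
from bt_page_stats('ticket_flights_pkey', 163);

-- The first few tuples of the same page.
select itemoffset, ctid, itemlen, left(data, 56) as data
from bt_page_items('ticket_flights_pkey', 163) limit 5;
```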

PostgreSQL 10 introduced the "amcheck" extension, which checks the logical consistency of data in B-trees and lets us detect faults early.
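A minimal usage sketch, assuming the extension is available on the server. Both functions raise an error if an inconsistency is found; bt_index_parent_check is more thorough but takes a stronger lock.

```sql
create extension if not exists amcheck;

-- Check that entries are in the correct logical order within the index.
select bt_index_check('ticket_flights_pkey');

-- Additionally verify parent/child relationships between pages.
select bt_index_parent_check('ticket_flights_pkey');
```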

Original article

https://habr.com/en/company/postgrespro/blog/443284/