编写Postgres扩展之三:调试


在上一篇关于编写Postgres Extensions的文章中,咱们从头开始建立了一个新的数据类型base36。 然而,咱们留下了一个致使咱们的服务器崩溃的严重bug。html

如今让咱们用调试器来查找这个错误并完成测试工具。node

咱们根据这些系列中关于编写PostgreSQL扩展的内容建立了一个专用的github repo。上一篇文章中的代码能够在分支部分ii上找到,今天的更改在分支部分iii上。git

Bug

首先让咱们复现这个bug。github

test=# CREATE EXTENSION base36;
test=# CREATE TABLE base36_test(val base36);
test=#  EXPLAIN SELECT * FROM base36_test where '3c'::base36 > val;
server closed the connection unexpectedly
  This probably means the server terminated abnormally
  before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
Time: 680,225 ms
!>

咱们固然不但愿这种状况发生在咱们的生产数据库上,因此让咱们找出问题所在。咱们只编写了两个相对简单的C函数base36_outbase36_in。若是咱们假设咱们并不比PostgreSQL核心团队的人更聪明 - 至少对我我的来讲这是一个合理的假设 - 那么bug必定在其中一个中。sql

#include "postgres.h"
#include "fmgr.h"
#include "utils/builtins.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(base36_in);
Datum
base36_in(PG_FUNCTION_ARGS)
{
    long result;
    char *str = PG_GETARG_CSTRING(0);
    result = strtol(str, NULL, 36);
    PG_RETURN_INT32((int32)result);
}

PG_FUNCTION_INFO_V1(base36_out);
Datum
base36_out(PG_FUNCTION_ARGS)
{
    int32 arg = PG_GETARG_INT32(0);
    if (arg < 0)
        ereport(ERROR,
            (
             errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
             errmsg("negative values are not allowed"),
             errdetail("value %d is negative", arg),
             errhint("make it positive")
            )
        );
    char base36[36] = "0123456789abcdefghijklmnopqrstuvwxyz";

    /* max 6 char + '\0' */
    char *buffer        = palloc(7 * sizeof(char));
    unsigned int offset = 7 * sizeof(char);
    buffer[--offset]    = '\0';

    do {
        buffer[--offset] = base36[arg % 36];
    } while (arg /= 36);

    PG_RETURN_CSTRING(&buffer[offset]);
}

设置调试环境

为了使用想LLDB这样的调试器,你须要开启调试符号编译Postgresql。下面的简短指南适应于安装了Postgresql和Homebrew的MacOS。经过Xcode使用LLDB。shell

首先,咱们先关闭全部正在运行的Postgresql实例-你不想搞乱你现有的数据库或其余工做:)数据库

$ cd /usr/local/opt/postgresql
$ launchctl unload homebrew.mxcl.postgresql.plist
# Double check it’s not running:
$ psql some_db
psql: could not connect to server: No such file or directory
  Is the server running locally and accepting
  connections on Unix domain socket "/tmp/.s.PGSQL.5432"?

而后,咱们经过运行下面的脚本下载postrgesql源码。安全

$ cd ~
$ curl https://ftp.postgresql.org/pub/source/v9.4.4/postgresql-9.4.4.tar.bz2 | bzip2 -d | tar x
$ cd postgresql-9.4.4

并启用调试选项进行构建。服务器

$  ./configure --enable-cassert --enable-debug CFLAGS="-ggdb"
$ make
$ sudo make install

咱们将跳过Postgres文档推荐的adduser命令。相反,我将使用本身的用户账户运行Postgres,以便更轻松地进行调试。app

$ sudo chown $(whoami) /usr/local/pgsql

而后初始化数据目录

$ /usr/local/pgsql/bin/initdb -D /usr/local/pgsql/data

接着启动服务器

$ /usr/local/pgsql/bin/pg_ctl -D /usr/local/pgsql/data -l /usr/local/pgsql/data/postmaster.log start

将新安装的pgsql/bin 路径添加到path环境变量中

$ export PATH=/usr/local/pgsql/bin:$PATH

安装扩展(因为上面的export,此次使用新安装的pgxn

$ make && make install

如今咱们能够建立test数据库了

$ /usr/local/pgsql/bin/createdb test

而后链接到这个数据库

$ /usr/local/pgsql/bin/psql test

检查是否正常

test=# CREATE EXTENSION base36 ;
CREATE EXTENSION
test=# CREATE TABLE base36_test(val base36);
CREATE TABLE
test=# INSERT INTO base36_test VALUES ('123'), ('3c'), ('5A'), ('zZz');
INSERT 0 4
test=# EXPLAIN SELECT * FROM base36_test where val='3c';
server closed the connection unexpectedly
  This probably means the server terminated abnormally
  before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>

调试

如今咱们已经配置好了调试环境,让咱们开始实际追踪问题。首先,咱们看一下日志文件。也就是咱们用-l指定的文件。根据上面的步骤,文件路径是/usr/local/pgsql/data/postmaster.log

TRAP: FailedAssertion("!(pointer == (void *) (((uintptr_t) ((pointer)) + ((8) - 1)) & ~((uintptr_t) ((8) - 1))))", File: "mcxt.c", Line: 699)
LOG:  server process (PID 6515) was terminated by signal 6: Abort trap
DETAIL:  Failed process was running: EXPLAIN SELECT * FROM base36_test where val='3c';
LOG:  terminating any other active server processes
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
LOG:  all server processes terminated; reinitializing
LOG:  database system was interrupted; last known up at 2015-10-09 15:11:18 CEST
LOG:  database system was not properly shut down; automatic recovery in progress
LOG:  redo starts at 0/22D0868
LOG:  record with zero length at 0/2359140
LOG:  redo done at 0/2359110
LOG:  last completed transaction was at log time 2015-10-09 15:12:01.344859+02
LOG:  MultiXact member wraparound protections are now enabled
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started

从新链接到数据库并找出当前数据库会话的pid

test=# SELECT pg_backend_pid();
 pg_backend_pid
----------------
           6644
(1 row)

将LLDB与pid链接(在另外一个终端中)

lldb -p 6644

在psql会话中运行以前失败命令

EXPLAIN SELECT * FROM base36_test where val='3c';

继续LLDB

(lldb) c
Process 6644 resuming
Process 6644 stopped
* thread #1: tid = 0x84aa4, 0x00007fff906d2286 libsystem_kernel.dylib`__pthread_kill + 10, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
    frame #0: 0x00007fff906d2286 libsystem_kernel.dylib`__pthread_kill + 10
libsystem_kernel.dylib`__pthread_kill:
->  0x7fff906d2286 <+10>: jae    0x7fff906d2290            ; <+20>
    0x7fff906d2288 <+12>: movq   %rax, %rdi
    0x7fff906d228b <+15>: jmp    0x7fff906cdc53            ; cerror_nocancel
    0x7fff906d2290 <+20>: retq

从LLDB中获取到一个回溯

(lldb) bt
* thread #1: tid = 0x84aa4, 0x00007fff906d2286 libsystem_kernel.dylib`__pthread_kill + 10, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
  * frame #0: 0x00007fff906d2286 libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x00007fff910f39f9 libsystem_pthread.dylib`pthread_kill + 90
    frame #2: 0x00007fff848f19b3 libsystem_c.dylib`abort + 129
    frame #3: 0x0000000108328549 postgres`ExceptionalCondition(conditionName="!(pointer == (void *) (((uintptr_t) ((pointer)) + ((8) - 1)) & ~((uintptr_t) ((8) - 1))))", errorType="FailedAssertion", fileName="mcxt.c", lineNumber=699) + 137 at assert.c:54
    frame #4: 0x000000010836129d postgres`pfree(pointer=0x00007ff9e1813674) + 173 at mcxt.c:699
    frame #5: 0x00000001082ab9e3 postgres`get_const_expr(constval=0x00007ff9e1806708, context=0x00007fff57e824c8, showtype=0) + 707 at ruleutils.c:8002
    frame #6: 0x00000001082a5f79 postgres`get_rule_expr(node=0x00007ff9e1806708, context=0x00007fff57e824c8, showimplicit='\x01') + 281 at ruleutils.c:6647
    frame #7: 0x00000001082acf22 postgres`get_rule_expr_paren(node=0x00007ff9e1806708, context=0x00007fff57e824c8, showimplicit='\x01', parentNode=0x00007ff9e1806788) + 146 at ruleutils.c:6600

...more

咱们获得了什么?异常在mcxt.c:699定义的pfree抛出。pfree在ruleutils.c:8002中被get_const_expr调用。若是咱们去调用堆栈四次。 咱们最终到了这里:

(lldb) up
frame #4: 0x000000010836129d postgres`pfree(pointer=0x00007ff9e1813674) + 173 at mcxt.c:699
   696     * allocated chunk.
   697     */
   698    Assert(pointer != NULL);
-> 699    Assert(pointer == (void *) MAXALIGN(pointer));
   700
   701    /*
   702     * OK, it's probably safe to look at the chunk header.

让咱们看看mcxt.c中的源码

文件名mcxt.c:699

/*
 * pfree
 *    Release an allocated chunk.
 */
void
pfree(void *pointer)
{
  MemoryContext context;

  /*
   * Try to detect bogus pointers handed to us, poorly though we can.
   * Presumably, a pointer that isn't MAXALIGNED isn't pointing at an
   * allocated chunk.
   */
  Assert(pointer != NULL);
  Assert(pointer == (void *) MAXALIGN(pointer));

  /*
   * OK, it's probably safe to look at the chunk header.
   */
  context = ((StandardChunkHeader *)
         ((char *) pointer - STANDARDCHUNKHEADERSIZE))->context;

  AssertArg(MemoryContextIsValid(context));

  (*context->methods->free_p) (context, pointer);
  VALGRIND_MEMPOOL_FREE(context, pointer);
}

Postgres使用pfree从当前内存上下文释放内存。不知何故,咱们把内存搞砸了。

咱们看一下指针的内容

(lldb) print (char *)pointer
(char *) $0 = 0x00007ff9e1813674 "3c"

这确实是咱们的搜索条件3c。因此咱们这里作错了什么?正如第一篇文章提到的,pfreepalloc是Postgres中freemalloc的对等体,用于在当前内存上下文中安全地分配和释放内存。不知何故咱们搞错了。在base36_out中咱们使用

char *buffer = palloc0(7 * sizeof(char));

来分配了7个字节的内存。最终咱们偏移4返回一个指针

PG_RETURN_CSTRING(&buffer[offset]);

mcxt.c:699中的断言以下:

Assert(pointer == (void *) MAXALIGN(pointer));

确保要释放的数据正确对齐。这里的条件是

(pointer == (void *) (((uintptr_t) ((pointer)) + ((8) - 1)) & ~((uintptr_t) ((8) - 1))))

当指针以8字节的倍数开始时读取?因为返回的地址与分配的地址不一样,所以pfree会因指针没有对齐而报错。

让咱们来修复他!

文件名:base36.c

PG_FUNCTION_INFO_V1(base36_out);
Datum
base36_out(PG_FUNCTION_ARGS)
{
    int32 arg = PG_GETARG_INT32(0);
    if (arg < 0)
        ereport(ERROR,
            (
             errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
             errmsg("negative values are not allowed"),
             errdetail("value %d is negative", arg),
             errhint("make it positive")
            )
        );
    char base36[36] = "0123456789abcdefghijklmnopqrstuvwxyz";

    /* max 6 char + '\0' */
    char buffer[7];
    unsigned int offset = sizeof(buffer);
    buffer[--offset] = '\0';

    do {
        buffer[--offset] = base36[arg % 36];
    } while (arg /= 36);

    PG_RETURN_CSTRING(pstrdup(&buffer[offset]));
}

如今咱们从堆栈中分配缓冲区(第18行),最后咱们使用pstrdup复制字符串到新分配的内存(第26行)。这种实现更接近 - 几乎至关于维基百科

您可能已经猜到pstrdup是Postgres对应的strdup。它经过palloc安全地从当前内存上下文中获取内存,并在事务结束时自动释放。

类型转换

如今咱们能够为咱们的类型输入和输出数据了。从其余类型转换或者转换为其余类型都行。

文件名:base36–0.0.1.sql

-- type and operators omitted

CREATE CAST (integer as base36) WITHOUT FUNCTION AS IMPLICIT;
CREATE CAST (base36 as integer) WITHOUT FUNCTION AS IMPLICIT;

这是相对容易的。 因为整数和base36是二进制可强制的(即二进制内部表示相同),转换能够免费完成(WITHOUT FUNCTION)。咱们还将此转换标记为隐式转换,从而告诉postgres它能够在适当的时候自动执行转换。例如,考虑这个查询:

test=# SELECT 10::integer + '5a'::base36;
 ?column?
----------
      200
(1 row)

没有定义整数+ base36运算符,可是经过将base36隐式转换为integer Postgres可使用integer + integer运算符并将结果转换为integer。然而,隐式转换应该当心定义,由于某些操做的结果多是可疑的。对于上面的操做,用户不知道结果是整数仍是base36,所以可能会误解它。若是咱们后来决定添加一个返回base36的运算符integer + base36,查询将彻底中断。

更使人困惑的多是这个查询结果

test=# SELECT -50::base36;
 ?column?
----------
      -50
(1 row)

虽然咱们不容许负值,但咱们在这里获得了一个,这怎么可能呢?内部Postgres执行如下操做:

test=# SELECT -(50::base36)::integer;
 ?column?
----------
      -50

咱们能够也应该避免这种使人困惑的行为。一种选择是向base36输出添加前缀(就像十六进制或八进制数同样),或者将责任交给用户,只容许显式强制转换。

澄清事情的另外一个选择是标记zhuanhuan AS ASSIGNMENT.
只有当您为base36类型指定一个整数时,才会自动执行该转换,反之亦然。这一般适用于INSERT或UPDATE语句。让咱们试试这个

文件名:base36–0.0.1.sql

-- type and operators omitted

CREATE CAST (integer as base36) WITHOUT FUNCTION AS ASSIGNMENT;
CREATE CAST (base36 as integer) WITHOUT FUNCTION AS ASSIGNMENT;

填充咱们的表格

test=# INSERT INTO base36_test SELECT i FROM generate_series(1,100) i;
INSERT 0 100
SELECT * FROM base36_test ORDER BY val LIMIT 12;
 val
-----
 1
 2
 3
 4
 5
 6
 7
 8
 9
 a
 b
 c
(12 rows)

更多内容

You have seen how important it is to test everything, not only to find bugs that in the worst case might crash the server, but also to specify the expected output from certain operations such as casts. In the next post we’ll elaborate on that creating a full-coverage test suite.

相关文章
相关标签/搜索