在上一篇关于编写Postgres Extensions的文章中,咱们从头开始建立了一个新的数据类型base36。 然而,咱们留下了一个致使咱们的服务器崩溃的严重bug。html
如今让咱们用调试器来查找这个错误并完成测试工具。node
咱们根据这些系列中关于编写PostgreSQL扩展的内容建立了一个专用的github repo。上一篇文章中的代码能够在分支部分ii上找到,今天的更改在分支部分iii上。git
首先让咱们复现这个bug。github
test=# CREATE EXTENSION base36; test=# CREATE TABLE base36_test(val base36); test=# EXPLAIN SELECT * FROM base36_test where '3c'::base36 > val; server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. The connection to the server was lost. Attempting reset: Failed. Time: 680,225 ms !>
咱们固然不但愿这种状况发生在咱们的生产数据库上,因此让咱们找出问题所在。咱们只编写了两个相对简单的C函数base36_out
和base36_in
。若是咱们假设咱们并不比PostgreSQL核心团队的人更聪明 - 至少对我我的来讲这是一个合理的假设 - 那么bug必定在其中一个中。sql
#include "postgres.h" #include "fmgr.h" #include "utils/builtins.h" PG_MODULE_MAGIC; PG_FUNCTION_INFO_V1(base36_in); Datum base36_in(PG_FUNCTION_ARGS) { long result; char *str = PG_GETARG_CSTRING(0); result = strtol(str, NULL, 36); PG_RETURN_INT32((int32)result); } PG_FUNCTION_INFO_V1(base36_out); Datum base36_out(PG_FUNCTION_ARGS) { int32 arg = PG_GETARG_INT32(0); if (arg < 0) ereport(ERROR, ( errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE), errmsg("negative values are not allowed"), errdetail("value %d is negative", arg), errhint("make it positive") ) ); char base36[36] = "0123456789abcdefghijklmnopqrstuvwxyz"; /* max 6 char + '\0' */ char *buffer = palloc(7 * sizeof(char)); unsigned int offset = 7 * sizeof(char); buffer[--offset] = '\0'; do { buffer[--offset] = base36[arg % 36]; } while (arg /= 36); PG_RETURN_CSTRING(&buffer[offset]); }
为了使用想LLDB这样的调试器,你须要开启调试符号编译Postgresql。下面的简短指南适应于安装了Postgresql和Homebrew的MacOS。经过Xcode使用LLDB。shell
首先,咱们先关闭全部正在运行的Postgresql实例-你不想搞乱你现有的数据库或其余工做:)数据库
$ cd /usr/local/opt/postgresql $ launchctl unload homebrew.mxcl.postgresql.plist # Double check it’s not running: $ psql some_db psql: could not connect to server: No such file or directory Is the server running locally and accepting connections on Unix domain socket "/tmp/.s.PGSQL.5432"?
而后,咱们经过运行下面的脚本下载postrgesql源码。安全
$ cd ~ $ curl https://ftp.postgresql.org/pub/source/v9.4.4/postgresql-9.4.4.tar.bz2 | bzip2 -d | tar x $ cd postgresql-9.4.4
并启用调试选项进行构建。服务器
$ ./configure --enable-cassert --enable-debug CFLAGS="-ggdb" $ make $ sudo make install
咱们将跳过Postgres文档推荐的adduser命令。相反,我将使用本身的用户账户运行Postgres,以便更轻松地进行调试。app
$ sudo chown $(whoami) /usr/local/pgsql
而后初始化数据目录
$ /usr/local/pgsql/bin/initdb -D /usr/local/pgsql/data
接着启动服务器
$ /usr/local/pgsql/bin/pg_ctl -D /usr/local/pgsql/data -l /usr/local/pgsql/data/postmaster.log start
将新安装的pgsql/bin
路径添加到path环境变量中
$ export PATH=/usr/local/pgsql/bin:$PATH
安装扩展(因为上面的export,此次使用新安装的pgxn
)
$ make && make install
如今咱们能够建立test数据库了
$ /usr/local/pgsql/bin/createdb test
而后链接到这个数据库
$ /usr/local/pgsql/bin/psql test
检查是否正常
test=# CREATE EXTENSION base36 ; CREATE EXTENSION test=# CREATE TABLE base36_test(val base36); CREATE TABLE test=# INSERT INTO base36_test VALUES ('123'), ('3c'), ('5A'), ('zZz'); INSERT 0 4 test=# EXPLAIN SELECT * FROM base36_test where val='3c'; server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. The connection to the server was lost. Attempting reset: Failed. !>
如今咱们已经配置好了调试环境,让咱们开始实际追踪问题。首先,咱们看一下日志文件。也就是咱们用-l
指定的文件。根据上面的步骤,文件路径是/usr/local/pgsql/data/postmaster.log
。
TRAP: FailedAssertion("!(pointer == (void *) (((uintptr_t) ((pointer)) + ((8) - 1)) & ~((uintptr_t) ((8) - 1))))", File: "mcxt.c", Line: 699) LOG: server process (PID 6515) was terminated by signal 6: Abort trap DETAIL: Failed process was running: EXPLAIN SELECT * FROM base36_test where val='3c'; LOG: terminating any other active server processes WARNING: terminating connection because of crash of another server process DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory. HINT: In a moment you should be able to reconnect to the database and repeat your command. LOG: all server processes terminated; reinitializing LOG: database system was interrupted; last known up at 2015-10-09 15:11:18 CEST LOG: database system was not properly shut down; automatic recovery in progress LOG: redo starts at 0/22D0868 LOG: record with zero length at 0/2359140 LOG: redo done at 0/2359110 LOG: last completed transaction was at log time 2015-10-09 15:12:01.344859+02 LOG: MultiXact member wraparound protections are now enabled LOG: database system is ready to accept connections LOG: autovacuum launcher started
从新链接到数据库并找出当前数据库会话的pid
test=# SELECT pg_backend_pid(); pg_backend_pid ---------------- 6644 (1 row)
将LLDB与pid链接(在另外一个终端中)
lldb -p 6644
在psql会话中运行以前失败命令
EXPLAIN SELECT * FROM base36_test where val='3c';
继续LLDB
(lldb) c Process 6644 resuming Process 6644 stopped * thread #1: tid = 0x84aa4, 0x00007fff906d2286 libsystem_kernel.dylib`__pthread_kill + 10, queue = 'com.apple.main-thread', stop reason = signal SIGABRT frame #0: 0x00007fff906d2286 libsystem_kernel.dylib`__pthread_kill + 10 libsystem_kernel.dylib`__pthread_kill: -> 0x7fff906d2286 <+10>: jae 0x7fff906d2290 ; <+20> 0x7fff906d2288 <+12>: movq %rax, %rdi 0x7fff906d228b <+15>: jmp 0x7fff906cdc53 ; cerror_nocancel 0x7fff906d2290 <+20>: retq
从LLDB中获取到一个回溯
(lldb) bt * thread #1: tid = 0x84aa4, 0x00007fff906d2286 libsystem_kernel.dylib`__pthread_kill + 10, queue = 'com.apple.main-thread', stop reason = signal SIGABRT * frame #0: 0x00007fff906d2286 libsystem_kernel.dylib`__pthread_kill + 10 frame #1: 0x00007fff910f39f9 libsystem_pthread.dylib`pthread_kill + 90 frame #2: 0x00007fff848f19b3 libsystem_c.dylib`abort + 129 frame #3: 0x0000000108328549 postgres`ExceptionalCondition(conditionName="!(pointer == (void *) (((uintptr_t) ((pointer)) + ((8) - 1)) & ~((uintptr_t) ((8) - 1))))", errorType="FailedAssertion", fileName="mcxt.c", lineNumber=699) + 137 at assert.c:54 frame #4: 0x000000010836129d postgres`pfree(pointer=0x00007ff9e1813674) + 173 at mcxt.c:699 frame #5: 0x00000001082ab9e3 postgres`get_const_expr(constval=0x00007ff9e1806708, context=0x00007fff57e824c8, showtype=0) + 707 at ruleutils.c:8002 frame #6: 0x00000001082a5f79 postgres`get_rule_expr(node=0x00007ff9e1806708, context=0x00007fff57e824c8, showimplicit='\x01') + 281 at ruleutils.c:6647 frame #7: 0x00000001082acf22 postgres`get_rule_expr_paren(node=0x00007ff9e1806708, context=0x00007fff57e824c8, showimplicit='\x01', parentNode=0x00007ff9e1806788) + 146 at ruleutils.c:6600 ...more
咱们获得了什么?异常在mcxt.c:699
定义的pfree抛出。pfree
在ruleutils.c:8002中被get_const_expr
调用。若是咱们去调用堆栈四次。 咱们最终到了这里:
(lldb) up frame #4: 0x000000010836129d postgres`pfree(pointer=0x00007ff9e1813674) + 173 at mcxt.c:699 696 * allocated chunk. 697 */ 698 Assert(pointer != NULL); -> 699 Assert(pointer == (void *) MAXALIGN(pointer)); 700 701 /* 702 * OK, it's probably safe to look at the chunk header.
让咱们看看mcxt.c中的源码
文件名mcxt.c:699
/* * pfree * Release an allocated chunk. */ void pfree(void *pointer) { MemoryContext context; /* * Try to detect bogus pointers handed to us, poorly though we can. * Presumably, a pointer that isn't MAXALIGNED isn't pointing at an * allocated chunk. */ Assert(pointer != NULL); Assert(pointer == (void *) MAXALIGN(pointer)); /* * OK, it's probably safe to look at the chunk header. */ context = ((StandardChunkHeader *) ((char *) pointer - STANDARDCHUNKHEADERSIZE))->context; AssertArg(MemoryContextIsValid(context)); (*context->methods->free_p) (context, pointer); VALGRIND_MEMPOOL_FREE(context, pointer); }
Postgres使用pfree
从当前内存上下文释放内存。不知何故,咱们把内存搞砸了。
咱们看一下指针的内容
(lldb) print (char *)pointer (char *) $0 = 0x00007ff9e1813674 "3c"
这确实是咱们的搜索条件3c
。因此咱们这里作错了什么?正如第一篇文章提到的,pfree
和palloc
是Postgres中free
和malloc
的对等体,用于在当前内存上下文中安全地分配和释放内存。不知何故咱们搞错了。在base36_out
中咱们使用
char *buffer = palloc0(7 * sizeof(char));
来分配了7个字节的内存。最终咱们偏移4返回一个指针
PG_RETURN_CSTRING(&buffer[offset]);
mcxt.c:699
中的断言以下:
Assert(pointer == (void *) MAXALIGN(pointer));
确保要释放的数据正确对齐。这里的条件是
(pointer == (void *) (((uintptr_t) ((pointer)) + ((8) - 1)) & ~((uintptr_t) ((8) - 1))))
当指针以8字节的倍数开始时读取?因为返回的地址与分配的地址不一样,所以pfree
会因指针没有对齐而报错。
让咱们来修复他!
文件名:base36.c
PG_FUNCTION_INFO_V1(base36_out); Datum base36_out(PG_FUNCTION_ARGS) { int32 arg = PG_GETARG_INT32(0); if (arg < 0) ereport(ERROR, ( errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE), errmsg("negative values are not allowed"), errdetail("value %d is negative", arg), errhint("make it positive") ) ); char base36[36] = "0123456789abcdefghijklmnopqrstuvwxyz"; /* max 6 char + '\0' */ char buffer[7]; unsigned int offset = sizeof(buffer); buffer[--offset] = '\0'; do { buffer[--offset] = base36[arg % 36]; } while (arg /= 36); PG_RETURN_CSTRING(pstrdup(&buffer[offset])); }
如今咱们从堆栈中分配缓冲区(第18行),最后咱们使用pstrdup复制字符串到新分配的内存(第26行)。这种实现更接近 - 几乎至关于维基百科。
您可能已经猜到pstrdup
是Postgres对应的strdup
。它经过palloc
安全地从当前内存上下文中获取内存,并在事务结束时自动释放。
如今咱们能够为咱们的类型输入和输出数据了。从其余类型转换或者转换为其余类型都行。
文件名:base36–0.0.1.sql
-- type and operators omitted CREATE CAST (integer as base36) WITHOUT FUNCTION AS IMPLICIT; CREATE CAST (base36 as integer) WITHOUT FUNCTION AS IMPLICIT;
这是相对容易的。 因为整数和base36是二进制可强制的(即二进制内部表示相同),转换能够免费完成(WITHOUT FUNCTION)。咱们还将此转换标记为隐式转换,从而告诉postgres它能够在适当的时候自动执行转换。例如,考虑这个查询:
test=# SELECT 10::integer + '5a'::base36; ?column? ---------- 200 (1 row)
没有定义整数+ base36运算符,可是经过将base36隐式转换为integer Postgres可使用integer + integer运算符并将结果转换为integer。然而,隐式转换应该当心定义,由于某些操做的结果多是可疑的。对于上面的操做,用户不知道结果是整数仍是base36,所以可能会误解它。若是咱们后来决定添加一个返回base36的运算符integer + base36,查询将彻底中断。
更使人困惑的多是这个查询结果
test=# SELECT -50::base36; ?column? ---------- -50 (1 row)
虽然咱们不容许负值,但咱们在这里获得了一个,这怎么可能呢?内部Postgres执行如下操做:
test=# SELECT -(50::base36)::integer; ?column? ---------- -50
咱们能够也应该避免这种使人困惑的行为。一种选择是向base36输出添加前缀(就像十六进制或八进制数同样),或者将责任交给用户,只容许显式强制转换。
澄清事情的另外一个选择是标记zhuanhuan AS ASSIGNMENT.
只有当您为base36类型指定一个整数时,才会自动执行该转换,反之亦然。这一般适用于INSERT或UPDATE语句。让咱们试试这个
文件名:base36–0.0.1.sql
-- type and operators omitted CREATE CAST (integer as base36) WITHOUT FUNCTION AS ASSIGNMENT; CREATE CAST (base36 as integer) WITHOUT FUNCTION AS ASSIGNMENT;
填充咱们的表格
test=# INSERT INTO base36_test SELECT i FROM generate_series(1,100) i; INSERT 0 100 SELECT * FROM base36_test ORDER BY val LIMIT 12; val ----- 1 2 3 4 5 6 7 8 9 a b c (12 rows)
You have seen how important it is to test everything, not only to find bugs that in the worst case might crash the server, but also to specify the expected output from certain operations such as casts. In the next post we’ll elaborate on that creating a full-coverage test suite.