Deploying Hyperledger Fabric on Alibaba Cloud: Analyzing and Resolving a SIGSEGV Failure [Repost]

Date: 2019-12-07

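Recently, several members of the Hyperledger community reported hitting a fatal error related to SIGSEGV while deploying the open-source blockchain project Hyperledger Fabric on Alibaba Cloud. I had run into and solved a similar problem earlier, so this post shares how the problem was analyzed and resolved, in the hope that it offers some ideas and help.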

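Problem description
While deploying Hyperledger Fabric, the peer and orderer services failed to start, and running the cli-test.sh test in the cli container also failed. In every case the error was signal SIGSEGV: segmentation violation. A sample error log follows: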

2017-11-01 02:44:04.247 UTC [peer] updateTrustedRoots -> DEBU 2a0 Updating trusted root authorities for channel mychannel
fatal error: unexpected signal during runtime execution
[signal SIGSEGV: segmentation violation code=0x1 addr=0x63 pc=0x7f9d15ded259]
runtime stack:
runtime.throw(0xdc37a7, 0x2a)
/opt/go/src/runtime/panic.go:566 +0x95
runtime.sigpanic()
/opt/go/src/runtime/sigpanic_unix.go:12 +0x2cc
goroutine 64 [syscall, locked to thread]:
runtime.cgocall(0xb08d50, 0xc4203bcdf8, 0xc400000000)
/opt/go/src/runtime/cgocall.go:131 +0x110 fp=0xc4203bcdb0 sp=0xc4203bcd70
net._C2func_getaddrinfo(0x7f9d000008c0, 0x0, 0xc420323110, 0xc4201a01e8, 0x0, 0x0, 0x0)
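Analysis
After in-depth analysis and experimentation, and prompted by the Hyperledger Fabric bug report https://jira.hyperledger.org/browse/FAB-5822, we found that the following workaround resolves the problem:

In the docker compose YAML, add GODEBUG=netdns=go to the environment variables of the peer, orderer, and cli services.
This setting makes Go use the pure Go resolver instead of the cgo resolver (the error log shows that the fault is raised inside the cgo resolver, in the getaddrinfo call).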
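
For readers who want to see the same choice made in code, here is a minimal sketch, assuming a standalone Go program rather than Fabric itself. It uses the standard library's net.Resolver with PreferGo set, which the net package documents as equivalent to GODEBUG=netdns=go but scoped to that one resolver. The helper name newPureGoResolver and the host name "localhost" are illustrative choices, not part of the original fix.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// newPureGoResolver returns a resolver that prefers Go's built-in DNS
// client over the cgo (getaddrinfo) path. The preference applies to this
// resolver value only, not to the whole process.
func newPureGoResolver() *net.Resolver {
	return &net.Resolver{PreferGo: true}
}

func main() {
	r := newPureGoResolver()

	// Bound the lookup so a missing or slow DNS server cannot hang the program.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	addrs, err := r.LookupHost(ctx, "localhost")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println("localhost resolves to:", addrs)
}
```

The environment variable remains the practical fix for prebuilt Fabric images, since it needs no code change; the sketch only illustrates what the setting selects.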

We then looked further into when Go switches between the cgo resolver and the pure Go resolver.

The official Go documentation explains this: https://golang.org/pkg/net/

Name Resolution
The method for resolving domain names, whether indirectly with functions like Dial or directly with functions like LookupHost and LookupAddr, varies by operating system.
On Unix systems, the resolver has two options for resolving names. It can use a pure Go resolver that sends DNS requests directly to the servers listed in /etc/resolv.conf, or it can use a cgo-based resolver that calls C library routines such as getaddrinfo and getnameinfo.
By default the pure Go resolver is used, because a blocked DNS request consumes only a goroutine, while a blocked C call consumes an operating system thread. When cgo is available, the cgo-based resolver is used instead under a variety of conditions: on systems that do not let programs make direct DNS requests (OS X), when the LOCALDOMAIN environment variable is present (even if empty), when the RES_OPTIONS or HOSTALIASES environment variable is non-empty, when the ASR_CONFIG environment variable is non-empty (OpenBSD only), when /etc/resolv.conf or /etc/nsswitch.conf specify the use of features that the Go resolver does not implement, and when the name being looked up ends in .local or is an mDNS name.
The resolver decision can be overridden by setting the netdns value of the GODEBUG environment variable (see package runtime) to go or cgo, as in:
export GODEBUG=netdns=go # force pure Go resolver
export GODEBUG=netdns=cgo # force cgo resolver
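
When checking whether a container actually received the override, it can help to read the variable back from inside the process. The following is a small sketch under the assumption that GODEBUG is a comma-separated list of name=value pairs; the helper name netdnsSetting is hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// netdnsSetting extracts the value of the netdns key from a GODEBUG string.
// If the key appears more than once, the last value is returned.
// It returns "" when netdns is not present.
func netdnsSetting(godebug string) string {
	value := ""
	for _, kv := range strings.Split(godebug, ",") {
		parts := strings.SplitN(strings.TrimSpace(kv), "=", 2)
		if len(parts) == 2 && parts[0] == "netdns" {
			value = parts[1]
		}
	}
	return value
}

func main() {
	setting := netdnsSetting(os.Getenv("GODEBUG"))
	if setting == "" {
		fmt.Println("netdns is not set; Go chooses the resolver on its own")
		return
	}
	fmt.Println("netdns override:", setting)
}
```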

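Following this lead, we compared the underlying configuration files of the earlier, successfully deployed environment with those of the recent, failing one, and eventually found the difference.

In a container on the old environment (blockchain deployment succeeded), inspect: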

cat /etc/resolv.conf

nameserver 127.0.0.11
options ndots:0
In a container on the new environment (blockchain deployment failed), inspect:

cat /etc/resolv.conf

nameserver 127.0.0.11
options timeout:2 attempts:3 rotate single-request-reopen ndots:0
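Because of this difference, the old, working environment uses the pure Go resolver, while the new, failing environment is switched to the cgo resolver: its resolv.conf contains the option single-request-reopen, which the pure Go resolver does not support.

Note: at the commit linked below, the pure Go resolver supports only ndots, timeout, attempts, and rotate.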
https://github.com/golang/go/blob/964639cc338db650ccadeafb7424bc8ebb2c0f6c/src/net/dnsconfig_unix.go

case "options": // magic options
        for _, s := range f[1:] {
            switch {
            case hasPrefix(s, "ndots:"):
                n, _, _ := dtoi(s[6:])
                if n < 0 {
                    n = 0
                } else if n > 15 {
                    n = 15
                }
                conf.ndots = n
            case hasPrefix(s, "timeout:"):
                n, _, _ := dtoi(s[8:])
                if n < 1 {
                    n = 1
                }
                conf.timeout = time.Duration(n) * time.Second
            case hasPrefix(s, "attempts:"):
                n, _, _ := dtoi(s[9:])
                if n < 1 {
                    n = 1
                }
                conf.attempts = n
            case s == "rotate":
                conf.rotate = true
            default:
                conf.unknownOpt = true
            }
        }
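
To apply the same rule to a given resolv.conf, the following sketch lists the options that would fall into the default branch of the excerpt above. It mirrors only the four cases shown there and is not the Go standard library's own parser; the function name unsupportedOptions is hypothetical, and the two inputs are the container files quoted earlier.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// unsupportedOptions returns the resolv.conf options that fall outside the
// set handled by the parser excerpt (ndots, timeout, attempts, rotate).
func unsupportedOptions(resolvConf string) []string {
	var unknown []string
	sc := bufio.NewScanner(strings.NewReader(resolvConf))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		// Only "options" lines matter here; comments and nameserver lines are skipped.
		if len(fields) == 0 || fields[0] != "options" {
			continue
		}
		for _, opt := range fields[1:] {
			switch {
			case strings.HasPrefix(opt, "ndots:"),
				strings.HasPrefix(opt, "timeout:"),
				strings.HasPrefix(opt, "attempts:"),
				opt == "rotate":
				// Recognized in the cited version of dnsconfig_unix.go.
			default:
				unknown = append(unknown, opt)
			}
		}
	}
	return unknown
}

func main() {
	oldConf := "nameserver 127.0.0.11\noptions ndots:0\n"
	newConf := "nameserver 127.0.0.11\n" +
		"options timeout:2 attempts:3 rotate single-request-reopen ndots:0\n"

	fmt.Println("old:", unsupportedOptions(oldConf)) // old: []
	fmt.Println("new:", unsupportedOptions(newConf)) // new: [single-request-reopen]
}
```

A non-empty result corresponds to conf.unknownOpt being set in the excerpt, which is the condition the article identifies as triggering the switch to the cgo resolver.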

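Going one step further, we tried to find out what caused the resolv.conf inside the old and new containers to differ, and discovered that the configuration file on the host ECS instance had recently changed.

Failing environment (newly created ECS instance):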

cat /etc/resolv.conf

# Dynamic resolv.conf(5) file for glibc resolver(3) generated by resolvconf(8)
#     DO NOT EDIT THIS FILE BY HAND -- YOUR CHANGES WILL BE OVERWRITTEN

nameserver 100.100.2.138
nameserver 100.100.2.136
options timeout:2 attempts:3 rotate single-request-reopen
Working environment (the original ECS instance):

cat /etc/resolv.conf

# Dynamic resolv.conf(5) file for glibc resolver(3) generated by resolvconf(8)
#     DO NOT EDIT THIS FILE BY HAND -- YOUR CHANGES WILL BE OVERWRITTEN

nameserver 100.100.2.136
nameserver 100.100.2.138
We also tried to work out why switching to the cgo resolver produces a SIGSEGV. The following article explains how statically linking cgo code can lead to a SIGSEGV:
https://tschottdorf.github.io/golang-static-linking-bug

This Hyperledger Fabric bug report, in turn, points out that the Hyperledger Fabric build (in particular the getaddrinfo-related functions) is indeed statically linked:
https://jira.hyperledger.org/browse/FAB-6403

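At this point we had found the root cause and could reconstruct the whole chain of events:

The resolv.conf on recently created ECS hosts changed -> name resolution inside the Hyperledger Fabric containers switched from the pure Go resolver to the cgo resolver -> this triggered a known SIGSEGV caused by statically linked cgo -> the Hyperledger Fabric deployment failed.
Recommended solution
Update the Hyperledger Fabric docker compose YAML template and add the environment variable GODEBUG=netdns=go to every Hyperledger Fabric node (orderer, peer, ca, cli, and so on) to force the pure Go resolver.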
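
As a rough illustration of that change, a compose fragment might look like the following. The service names are assumptions borrowed from common Fabric sample layouts and will differ per deployment; in a real file the variable would be appended to each service's existing environment list rather than replacing it.

```yaml
# Illustrative docker compose fragment; service names are hypothetical.
services:
  orderer.example.com:
    environment:
      - GODEBUG=netdns=go   # force the pure Go resolver
  peer0.org1.example.com:
    environment:
      - GODEBUG=netdns=go
  cli:
    environment:
      - GODEBUG=netdns=go
```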

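Alibaba Cloud Container Service blockchain solution
On Alibaba Cloud Container Service we provide developers with a basic solution for automated configuration and deployment of Hyperledger Fabric. It hides the complex low-level operations so that developers can focus on innovation in their blockchain business applications. For more information, see:

Introduction to the Alibaba Cloud Container Service blockchain solution
Product documentation for the Alibaba Cloud Container Service blockchain solution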

Reposted from https://yq.aliyun.com/articles/238940