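I spent the past week tracking down a DNS resolution failure in filebeat. filebeat is deployed as a DaemonSet in our k8s cluster so that it can collect logs from every pod on each host. On clusters whose host OS is CentOS 7.4 there were no problems at all. On clusters running CentOS 7.6, however, DNS resolution failed, so logs could not be sent to the Kafka cluster.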
The filebeat error log looked like this:

    Failed to connect to broker sg.main2.kafka.metis.service:9092: dial tcp: lookup sg.main2.kafka.metis.service: Try again
So the debugging began. My first suspect was CoreDNS, so I exec'd into the pod and ran dig.
    dig @10.247.3.10 sg.main2.kafka.metis.service

    ; <<>> DiG 9.12.4-P2 <<>> @10.247.3.10 sg.main2.kafka.metis.service
    ; (1 server found)
    ;; global options: +cmd
    ;; Got answer:
    ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 44350
    ;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

    ;; OPT PSEUDOSECTION:
    ; EDNS: version: 0, flags:; udp: 4096
    ;; QUESTION SECTION:
    ;sg.main2.kafka.metis.service. IN A

    ;; ANSWER SECTION:
    sg.main2.kafka.metis.service. 30 IN A 10.21.42.97

    ;; Query time: 1 msec
    ;; SERVER: 10.247.3.10#53(10.247.3.10)
    ;; WHEN: Sun Jan 05 14:13:26 UTC 2020
    ;; MSG SIZE rcvd: 101
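Resolution works fine from inside the pod, so CoreDNS is not at fault and the problem can be narrowed down to the code.
Time to bring in strace.
strace showed filebeat sending its DNS queries to 127.0.0.1 port 53. Unsurprisingly, resolution failed.
The next step was to compare this against the Go source, specifically the resolv.conf parsing code in package net.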
    // Copyright 2009 The Go Authors. All rights reserved.
    // Use of this source code is governed by a BSD-style
    // license that can be found in the LICENSE file.

    // +build aix darwin dragonfly freebsd linux netbsd openbsd solaris

    // Read system DNS config from /etc/resolv.conf

    package net

    import (
        "internal/bytealg"
        "os"
        "sync/atomic"
        "time"
    )

    var (
        defaultNS   = []string{"127.0.0.1:53", "[::1]:53"}
        getHostname = os.Hostname // variable for testing
    )

    type dnsConfig struct {
        servers       []string      // server addresses (in host:port form) to use
        search        []string      // rooted suffixes to append to local name
        ndots         int           // number of dots in name to trigger absolute lookup
        timeout       time.Duration // wait before giving up on a query, including retries
        attempts      int           // lost packets before giving up on server
        rotate        bool          // round robin among servers
        unknownOpt    bool          // anything unknown was encountered
        lookup        []string      // OpenBSD top-level database "lookup" order
        err           error         // any error that occurs during open of resolv.conf
        mtime         time.Time     // time of resolv.conf modification
        soffset       uint32        // used by serverOffset
        singleRequest bool          // use sequential A and AAAA queries instead of parallel queries
        useTCP        bool          // force usage of TCP for DNS resolutions
    }

    // See resolv.conf(5) on a Linux machine.
    func dnsReadConfig(filename string) *dnsConfig {
        conf := &dnsConfig{
            ndots:    1,
            timeout:  5 * time.Second,
            attempts: 2,
        }
        file, err := open(filename)
        if err != nil {
            conf.servers = defaultNS
            conf.search = dnsDefaultSearch()
            conf.err = err
            return conf
        }
        defer file.close()
        if fi, err := file.file.Stat(); err == nil {
            conf.mtime = fi.ModTime()
        } else {
            conf.servers = defaultNS
            conf.search = dnsDefaultSearch()
            conf.err = err
            return conf
        }
        for line, ok := file.readLine(); ok; line, ok = file.readLine() {
            if len(line) > 0 && (line[0] == ';' || line[0] == '#') {
                // comment.
                continue
            }
            f := getFields(line)
            if len(f) < 1 {
                continue
            }
            switch f[0] {
            case "nameserver": // add one name server
                if len(f) > 1 && len(conf.servers) < 3 { // small, but the standard limit
                    // One more check: make sure server name is
                    // just an IP address. Otherwise we need DNS
                    // to look it up.
                    if parseIPv4(f[1]) != nil {
                        conf.servers = append(conf.servers, JoinHostPort(f[1], "53"))
                    } else if ip, _ := parseIPv6Zone(f[1]); ip != nil {
                        conf.servers = append(conf.servers, JoinHostPort(f[1], "53"))
                    }
                }

            case "domain": // set search path to just this domain
                if len(f) > 1 {
                    conf.search = []string{ensureRooted(f[1])}
                }

            case "search": // set search path to given servers
                conf.search = make([]string, len(f)-1)
                for i := 0; i < len(conf.search); i++ {
                    conf.search[i] = ensureRooted(f[i+1])
                }

            case "options": // magic options
                for _, s := range f[1:] {
                    switch {
                    case hasPrefix(s, "ndots:"):
                        n, _, _ := dtoi(s[6:])
                        if n < 0 {
                            n = 0
                        } else if n > 15 {
                            n = 15
                        }
                        conf.ndots = n
                    case hasPrefix(s, "timeout:"):
                        n, _, _ := dtoi(s[8:])
                        if n < 1 {
                            n = 1
                        }
                        conf.timeout = time.Duration(n) * time.Second
                    case hasPrefix(s, "attempts:"):
                        n, _, _ := dtoi(s[9:])
                        if n < 1 {
                            n = 1
                        }
                        conf.attempts = n
                    case s == "rotate":
                        conf.rotate = true
                    case s == "single-request" || s == "single-request-reopen":
                        // Linux option:
                        // http://man7.org/linux/man-pages/man5/resolv.conf.5.html
                        // "By default, glibc performs IPv4 and IPv6 lookups in parallel [...]
                        //  This option disables the behavior and makes glibc
                        //  perform the IPv6 and IPv4 requests sequentially."
                        conf.singleRequest = true
                    case s == "use-vc" || s == "usevc" || s == "tcp":
                        // Linux (use-vc), FreeBSD (usevc) and OpenBSD (tcp) option:
                        // http://man7.org/linux/man-pages/man5/resolv.conf.5.html
                        // "Sets RES_USEVC in _res.options.
                        //  This option forces the use of TCP for DNS resolutions."
                        // https://www.freebsd.org/cgi/man.cgi?query=resolv.conf&sektion=5&manpath=freebsd-release-ports
                        // https://man.openbsd.org/resolv.conf.5
                        conf.useTCP = true
                    default:
                        conf.unknownOpt = true
                    }
                }

            case "lookup":
                // OpenBSD option:
                // https://www.openbsd.org/cgi-bin/man.cgi/OpenBSD-current/man5/resolv.conf.5
                // "the legal space-separated values are: bind, file, yp"
                conf.lookup = f[1:]

            default:
                conf.unknownOpt = true
            }
        }
        if len(conf.servers) == 0 {
            conf.servers = defaultNS
        }
        if len(conf.search) == 0 {
            conf.search = dnsDefaultSearch()
        }
        return conf
    }

    // serverOffset returns an offset that can be used to determine
    // indices of servers in c.servers when making queries.
    // When the rotate option is enabled, this offset increases.
    // Otherwise it is always 0.
    func (c *dnsConfig) serverOffset() uint32 {
        if c.rotate {
            return atomic.AddUint32(&c.soffset, 1) - 1 // return 0 to start
        }
        return 0
    }

    func dnsDefaultSearch() []string {
        hn, err := getHostname()
        if err != nil {
            // best effort
            return nil
        }
        if i := bytealg.IndexByteString(hn, '.'); i >= 0 && i < len(hn)-1 {
            return []string{ensureRooted(hn[i+1:])}
        }
        return nil
    }

    func hasPrefix(s, prefix string) bool {
        return len(s) >= len(prefix) && s[:len(prefix)] == prefix
    }

    func ensureRooted(s string) string {
        if len(s) > 0 && s[len(s)-1] == '.' {
            return s
        }
        return s + "."
    }
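The part of that listing that matters here is the fallback: when resolv.conf cannot be read, or yields no usable `nameserver` line, the pure Go resolver uses `defaultNS`, which begins with 127.0.0.1:53. The sketch below reproduces only that fallback so it can be tried in isolation. `nameservers` is a hypothetical helper written for this post, not a function in package net, and it uses `net.ParseIP`, so unlike the real code it does not accept IPv6 addresses with a zone.

```go
package main

import (
	"bufio"
	"fmt"
	"net"
	"strings"
)

// defaultNS mirrors the fallback list in the Go source quoted above.
var defaultNS = []string{"127.0.0.1:53", "[::1]:53"}

// nameservers is a hypothetical helper (not part of package net).
// It collects up to three "nameserver" entries from resolv.conf-style
// text and falls back to defaultNS when none of them is usable.
func nameservers(conf string) []string {
	var servers []string
	sc := bufio.NewScanner(strings.NewReader(conf))
	for sc.Scan() {
		line := sc.Text()
		// Skip comment lines, as the original parser does.
		if strings.HasPrefix(line, "#") || strings.HasPrefix(line, ";") {
			continue
		}
		f := strings.Fields(line)
		if len(f) < 2 || f[0] != "nameserver" || len(servers) >= 3 {
			continue
		}
		// Only literal IP addresses are accepted; names would need DNS.
		if net.ParseIP(f[1]) != nil {
			servers = append(servers, net.JoinHostPort(f[1], "53"))
		}
	}
	if len(servers) == 0 {
		return defaultNS
	}
	return servers
}

func main() {
	fmt.Println(nameservers("nameserver 10.247.3.10\n"))
	fmt.Println(nameservers("# no servers configured\n"))
}
```

This shows one way a Go process can end up querying 127.0.0.1:53. It does not by itself establish that filebeat took this path.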
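Because the same code runs without problems on the CentOS 7.4 clusters, I suspected some incompatibility between the alpine 3.8 base image and CentOS 7.6.
Go's DNS resolution supports two modes, cgo and pure Go. Possibly some setting causes Go to resolve through cgo, and alpine uses musl, a rather unusual libc. That library may not be compatible with CentOS 7.6.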
    var lookupOrderName = map[hostLookupOrder]string{
        hostLookupCgo:      "cgo",
        hostLookupFilesDNS: "files,dns",
        hostLookupDNSFiles: "dns,files",
        hostLookupFiles:    "files",
        hostLookupDNS:      "dns",
    }
Of these, hostLookupCgo is a category of its own: it means resolution is done by calling libc's getaddrinfo directly.
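The name resolution functions are called indirectly by Dial, while LookupHost and LookupAddr call them directly. The implementation differs between operating systems. On Unix systems there are two ways to resolve a name:
- A resolver written in pure Go: it reads the list of local DNS servers from /etc/resolv.conf, sends the DNS requests (UDP packets), and reads the results.
- The cgo resolver: it ends up calling getaddrinfo or getnameinfo from the C standard library (not recommended, because it is unfriendly to goroutines).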
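In a program you control, the choice between the two can also be made in code rather than through the environment. The sketch below assumes you can construct your own `net.Resolver`. Its `PreferGo` field asks for the built-in Go resolver for that resolver only. `newPureGoResolver` is a name invented for this example, and nothing here implies that filebeat does this.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// newPureGoResolver is a hypothetical helper for this example.
// PreferGo asks package net to use its built-in DNS client instead of
// the cgo path, scoped to this resolver only.
func newPureGoResolver() *net.Resolver {
	return &net.Resolver{PreferGo: true}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// The host name is the broker from the error log above; it only
	// resolves inside that cluster, so expect an error elsewhere.
	addrs, err := newPureGoResolver().LookupHost(ctx, "sg.main2.kafka.metis.service")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println(addrs)
}
```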
The GODEBUG environment variable selects which resolver Go uses by default, pure Go or cgo:

    export GODEBUG=netdns=go    # force pure Go resolver
    export GODEBUG=netdns=cgo   # force cgo resolver
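To confirm the guess and trace Go's resolution flow, I forced each resolver in turn. With export GODEBUG=netdns=go+9 the problem did not occur; with export GODEBUG=netdns=cgo+9 it did. The number after the plus sign turns on the resolver's debug output. With go1.11, the default path in this setup is cgo.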
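The `go+9` and `cgo+9` values combine a resolver mode with a debug level, joined by a plus sign. The sketch below is a rough approximation of how such a value breaks apart. `parseNetDNS` is a hypothetical helper written for illustration and is not the standard library's own parsing code.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseNetDNS is a hypothetical helper, not the standard library's parser.
// It finds the netdns entry in a GODEBUG-style string and splits it into
// a resolver mode ("go", "cgo", or "" if unset) and a numeric debug level.
func parseNetDNS(godebug string) (mode string, level int) {
	for _, kv := range strings.Split(godebug, ",") {
		if !strings.HasPrefix(kv, "netdns=") {
			continue
		}
		value := strings.TrimPrefix(kv, "netdns=")
		for _, part := range strings.SplitN(value, "+", 2) {
			if n, err := strconv.Atoi(part); err == nil {
				level = n // a purely numeric part is the debug level
			} else if part != "" {
				mode = part // anything else is the resolver mode
			}
		}
	}
	return mode, level
}

func main() {
	mode, level := parseNetDNS("netdns=cgo+9")
	fmt.Printf("mode=%q level=%d\n", mode, level)
}
```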
The fix was then to disable cgo when building filebeat, like this:

    CGO_ENABLED=0 go build --ldflags -w -o filebeat

That solves the problem once and for all.
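I also added logging at the point where Go calls into the C function (getaddrinfo). The arguments turned out to be identical in the working and failing cases, but the behavior inside the library differs from what it does on the older OS version, so this is a libc compatibility problem.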