谨防猴子补丁以及Python中排查技巧

时间 2019-11-06

原文原文链接

背景

前两天晚上线上系统突发故障，在立马打开线上错误日志以后，却只能获得一堆毫无心义的程序调用栈(traceback)的输出，因而团队成员陷入漫长而又抓瞎的问题排查过程当中。问题很幸运地获得了解决，可是我一直想不明白为何日志里打印的调用栈毫无心义，按照经验，它应该打印的是异常产生过程当中的调用栈才是。在通过后续的源码分析和排查以后，我才发现实际上是由于项目中一个老旧的代码使用了猴子补丁致使，这也是这篇文章想要讨论的内容。html

什么是猴子补丁

猴子补丁是一种用来在运行时修改（增长、变动、删除等）系统软件行为的编程方式。在动态语言里有普遍的猴子补丁应用的影子，好比 Ruby 的打开类的特性支持运行时扩展类的定义甚至替换方法的实现，Python 的方法或者函数因为能够在运行时进行替换而使得猴子补丁的应用很是方便，其余像 JavaScript 语言一样能够应用猴子补丁。python

猴子补丁是把双刃剑

猴子补丁以其灵活性，能够实现补丁代码和应用代码的彻底分离，同时使得应用代码在调用方式上保持调用方式始终不变。
从应用代码的角度来看，它调用的就是某个模块的原始定义的方法或者函数；而从被调用的方法或者函数的角度来看，猴子补丁的存在对它是透明的存在，如下展现一个 Python 语言的 Demo：redis

咱们从一个极简例子开始，向这个美好的世界问好：shell

def greet():
    print("Hello World!")

if __name__ == "__main__":
    greet()

假如执行以上脚本，获得的结果是：django

$ python demo.py
Hello World!

这个很简单，接下来假如打一个猴子补丁：咱们扩充原来的 greet 的行为，如今除了打印信息，还要打印下当前的时间：编程

from datetime import datetime
def greet():
    print("Hello World!")

# monkey patch
original_greet = greet
def greet_with_time():
    original_greet()
    print(datetime.now())
greet = greet_with_time  # replace the implementation
# monkey patch

if __name__ == "__main__":
    greet() # 这里的调用和原来没有变化

运行它，获得的结果是：ruby

$ python demo.py
Hello World!
2019-09-21 23:40:42.575782

咱们获得了预期的结果！
从代码分析，咱们添加了一个新的函数 greet_with_time，其会调用原来的 greet 函数，而后打印当前时间，最后将 greet 函数经过将函数赋值给变量的方式完成对 greet 函数的替换。而对于最后的 greet 函数的调用，却无需任何改动，以此达到了一样仍是调用 greet 函数，行为却截然不同的目的。
上面的 demo 只是限于篇幅简化了代码，真实项目里的猴子补丁代码老是在另外的模块或者文件里。想象在一个复杂的大型工程里，若是你的代码里猴子补丁泛滥，可想对于系统的行为分析以及问题排查，将是一种灾难性的挑战。app

如今对猴子补丁有了必定的了解以后，咱们再来看看我在实际项目中遇到的例子。python2.7

一堆毫无心义的堆栈信息

我在本地重现了我开头提到的咱们所遇到的异常，如下是和线上环境一致的堆栈信息：异步

2019-09-19 17:30:11.103|CRITICAL|138:140147476383488|log.py:282|log.log|Task command.celery.crontab_task.some_task[538ddb72-89b0-45fe-811e-107202dc665b] INTERNAL ERROR: AttributeError("'long' object has no attribute 'insert'",)
Traceback (most recent call last):
  File "/usr/local/bin/celery", line 10, in <module>
    sys.exit(main())
  File "/usr/local/lib/python2.7/dist-packages/celery/__main__.py", line 30, in main
    main()
  ...... 限于篇幅，这里省略不少无心义的内容
  File "/usr/local/lib/python2.7/dist-packages/celery/worker/job.py", line 384, in on_success
    return self.on_failure(ret_value)
  File "/usr/local/lib/python2.7/dist-packages/celery/worker/job.py", line 443, in on_failure
    self._log_error(exc_info, send_failed_event=send_failed_event)
  File "/usr/local/lib/python2.7/dist-packages/celery/worker/job.py", line 511, in _log_error
    'internal': internal}})
  File "/usr/local/lib/python2.7/dist-packages/celery/utils/log.py", line 282, in log
    return Logger.log(self, *args, **kwargs)
None

从这个堆栈信息看，它打印的其实是调用了 Logger.log 函数的堆栈，其中根本没有任何代码看到有 .insert 相关字眼，其与 AttributeError("'long' object has no attribute 'insert'",) 根本毫无关系，这样的堆栈信息，有和没有基本一个样。因而乎，我接着经过编辑器经过源码进行了更多的探索。

首先仍是借助上面的堆栈去分析到底哪里出了问题，因此我先看了 celery/worker/job.py:504-511 处的代码：

context = {
            'hostname': self.hostname,
            'id': self.id,
            'name': self.name,
            'exc': exception,
            'traceback': traceback,
            'args': sargs,
            'kwargs': skwargs,
            'description': description,
        }

        logger.log(severity, format.strip(), context,
                   exc_info=exc_info,
                   extra={'data': {'id': self.id,
                                   'name': self.name,
                                   'args': sargs,
                                   'kwargs': skwargs,
                                   'hostname': self.hostname,
                                   'internal': internal}})

这里调用了 logger.log 方法（logger 的来源在 Celery 的代码里可分析，可是不是这篇文章的重点，故此不展开）而且经过 context 对象传入了两个重要的信息：exception 和 traceback。在对 logger.log 源码的进一步阅读中，我确认了这块日志打印的核心依赖于对 traceback.print_exception 函数的调用：

def formatException(self, ei):
        """
        Format and return the specified exception information as a string.

        This default implementation just uses
        traceback.print_exception()
        """
        sio = io.StringIO()
        tb = ei[2]
        traceback.print_exception(ei[0], ei[1], tb, None, sio)

因而乎，我回到了 celery/worker/job.py:504-511 处的代码，在 logger.log 前面插入了两种打印错误堆栈信息的代码：

# context  = ...

        ################################################################
        import traceback as _traceback
        # Method 1: like what logger.log does
        _traceback.print_exception(*exc_info)

        # Method 2: use `format_exception` instead
        print(''.join(_traceback.format_exception(*exc_info)))
        ################################################################

        logger.log(....

从新启动 celery 后，执行异步任务后，获得的第一种错误堆栈和前面我贴出来的堆栈信息是彻底一致的，这个倒也好理解，毕竟这里的 print_exception 函数的调用就是 logger.log 里的核心实现。而 format_exception 的调用给了我真正有意义的错误堆栈信息：

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/celery/app/trace.py", line 283, in trace_task
    uuid, retval, SUCCESS, request=task_request,
  File "/usr/local/lib/python2.7/dist-packages/celery/backends/base.py", line 271, in store_result
    request=request, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/celery/backends/base.py", line 505, in _store_result
    self.set(self.get_key_for_task(task_id), self.encode(meta))
  File "/usr/local/lib/python2.7/dist-packages/celery/backends/redis.py", line 161, in set
    return self.ensure(self._set, (key, value), **retry_policy)
  File "/usr/local/lib/python2.7/dist-packages/celery/backends/redis.py", line 150, in ensure
    **retry_policy
  File "/usr/local/lib/python2.7/dist-packages/kombu/utils/__init__.py", line 246, in retry_over_time
    return fun(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/celery/backends/redis.py", line 170, in _set
    pipe.execute()
  File "/usr/local/lib/python2.7/dist-packages/redis/client.py", line 2879, in execute
    return execute(conn, stack, raise_on_error)
  File "/usr/local/lib/python2.7/dist-packages/redis/client.py", line 2785, in _execute_transaction
    response.insert(i, e)
AttributeError: 'long' object has no attribute 'insert'

好家伙，这下就清晰了，原来这个代码的异常真正出处是这里！
可是问题就来了，为何print_exception和 format_exception给出的堆栈信息不同呢？我充满疑问地去查找了官方文档，可是困惑更重了：

traceback.format_exception(etype, value, tb[, limit])
Format a stack trace and the exception information. The arguments have the same meaning as the corresponding arguments to print_exception(). The return value is a list of strings, each ending in a newline and some containing internal newlines. When these lines are concatenated and printed, exactly the same text is printed as does print_exception().

重点在最后一句，Python 官方文档说了，两个函数输出的错误堆栈是同样(exactly the same text)的！

揪出猴子补丁

其实，问题的真正排查过程耗费了我好多时间，我一直没有往猴子补丁上想，最后却是在出门赴朋友的饭约的地铁上灵机一动，用手机翻看了公司 GitLab 上的项目代码，一下找到了元凶。

def _patch_print_exception():
    import traceback

    def custom_print_exception(etype, value, tb, limit=None, file=None):
        exc_info = sys.exc_info()
        stack = traceback.extract_stack()
        # ... omit other source codes 

    traceback.print_exception = custom_print_exception

从补丁代码看，补丁直接覆盖了原版的代码，而且实现上也直接粗暴地无视了传入的几个异常信息参数！因此才会出现这么大的乌龙，出现毫无关系的异常堆栈信息！(╯‵□′)╯︵┻━┻

排查猴子补丁的技巧

猴子补丁这类编程技巧当然会利弊共存，使用上必然须要额外慎重，但也并不是须要敬而远之，重点是掌握必要的排查技巧，如下我针对此次的教训又去找下一些可能有帮助的方法：

1. 经过函数或方法自身属性检查方法或者函数的信息

众所周知，Python 的全部对象都有一堆内置的属性，函数也不例外，以我项目中的例子：

# django shell
In [1]: traceback.print_exception.func_code
Out[1]: <code object custom_print_exception at 0x109e9f030, file "/Users/boy/work_area/project/project-source/lib/common/logger.py", line 295>

一看就知道，这个函数的真实代码其实就是项目中的补丁代码！

2. 借助 inspect 包来检查

Python 自身提供的工具包很是多，inspect 天然也是利器之一，其能够用来对几乎全部类型作运行时的检查，仍是以个人实际例子：

# django shell
In [1]: import inspect
In [2]: inspect.getfile(traceback.print_exception)
Out[2]: '/Users/boy/work_area/project/project-source/lib/common/logger.py'

In [3]: inspect.getsource(traceback.print_exception)
Out[3]: '\tdef custom_print_exception(etype, value, tb, limit=None, file=None): ......\n'

In [4]: print inspect.getsource(traceback.print_exception)
Out[4]: def custom_print_exception(etype, value, tb, limit=None, file=None):disable=redefined-builtin
            if file is None:
                file = sys.stderr
            exc_info = sys.exc_info()
            stack = traceback.extract_stack()
            ...

总之，若是赶上代码行为与预期不符却又没法和官方文档或者官方源码对应，那么可能就是依赖的方法或者函数被打了猴子补丁，而最快速确认猴子补丁的方式，就是第一时间检查所调用的函数或者方法的实际定义，即应用上述方法便可！

题外话

作 Ruby 开发时，我也遇到过猴子补丁的陷阱，Ruby 里也有相似的方法：

file, line = A.new.method(:foo).source_location
puts "Method foo is defined in #{file}, line #{line}"
# => "Method foo is defined in temp.rb, line 2"