Analyzing a crash in a std::sort custom comparison function

Date: 2019-12-15

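I have not written a summary post in two years, so as a warm-up here is a write-up of a crash I ran into recently.
Note: the analysis below is based on GCC 4.4.6.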

1. Fixing the crash

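We have a complex sort that involves many factors, implemented with std::sort and a custom comparison function. The compare function looks like the following pseudocode: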

bool compare(const FakeObj& left, const FakeObj& right) {
    if (left.a != right.a) {
        return left.a > right.a;
    }
    if (left.b != right.b) {
        return left.b > right.b;
    }
    ....
}

Later, we added more complex logic to the comparison function:

bool compare(const FakeObj& left, const FakeObj& right) {
    if (left.a != right.a) {
        return left.a > right.a;
    }
    if (left.b != right.b) {
        return left.b > right.b;
    }
    if (left.c != 0 && right.c != 0 && left.c != right.c) {
        // When both objects have the c attribute, compare by c
        return left.c > right.c;
    }
    if (left.d != right.d) {
        return left.d > right.d;
    }
    ....
}

After the service was released, the process began to crash intermittently. Inspecting it with gdb showed the following call stack:

/usr/lib/gcc/x86_64-redhat-linux/4.4.6/../../../../include/c++/4.4.6/bits/stl_algo.h:5260
/usr/lib/gcc/x86_64-redhat-linux/4.4.6/../../../../include/c++/4.4.6/bits/stl_algo.h:2194
/usr/lib/gcc/x86_64-redhat-linux/4.4.6/../../../../include/c++/4.4.6/bits/stl_algo.h:2161
/usr/lib/gcc/x86_64-redhat-linux/4.4.6/../../../../include/c++/4.4.6/bits/stl_algo.h:2084

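The crash occurred at the point where the standard library calls the compare function to perform a comparison: the access there was out of bounds.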

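At this point we began to suspect that the compare function did not meet the standard library's requirements, and consulted the following references: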

https://stackoverflow.com/questions/41488093/why-do-i-get-runtime-error-when-comparison-function-in-stdsort-always-return-t

https://en.cppreference.com/w/cpp/named_req/Compare

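A careful reading of the documentation shows the following.

In its handling of the c attribute, our compare function does not satisfy transitivity: if comp(a,b)==true and comp(b,c)==true then comp(a,c)==true. Suppose there are three objects A, B and C whose a and b attributes are equal:

(1) A and B both have the c attribute and A.c > B.c, so by our comparison function A > B;

(2) C has no c attribute and C.d > A.d, so C > A;

(3) C has no c attribute and B.d > C.d, so B > C.

In summary, A > B and B > C, yet C > A, which violates the transitivity requirement of strict weak ordering.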
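
The cycle is easy to reproduce. The sketch below is only an illustration under stated assumptions: the FakeObj layout, the use of c == 0 for "no c attribute", the `return false` standing in for the elided tail of the comparator, and the concrete field values are all made up for this example. compareWithPresence is one possible transitive variant (objects that have c sort ahead of those that do not); it is not necessarily the fix that was actually shipped, and it changes the ordering semantics.

```cpp
// Hypothetical stand-in for FakeObj: only the four fields discussed above.
struct FakeObj {
    int a, b, c, d;  // assumption: c == 0 means "no c attribute"
};

// The comparator from the article; the elided tail is replaced by `return false`.
bool compare(const FakeObj& left, const FakeObj& right) {
    if (left.a != right.a) return left.a > right.a;
    if (left.b != right.b) return left.b > right.b;
    if (left.c != 0 && right.c != 0 && left.c != right.c) return left.c > right.c;
    if (left.d != right.d) return left.d > right.d;
    return false;
}

// One possible transitive variant (an assumption): a plain lexicographic
// comparison on (a, b, has-c, c, d), so c is never skipped for some pairs only.
bool compareWithPresence(const FakeObj& left, const FakeObj& right) {
    if (left.a != right.a) return left.a > right.a;
    if (left.b != right.b) return left.b > right.b;
    const bool leftHasC = left.c != 0;
    const bool rightHasC = right.c != 0;
    if (leftHasC != rightHasC) return leftHasC;  // objects with c come first
    if (left.c != right.c) return left.c > right.c;
    if (left.d != right.d) return left.d > right.d;
    return false;
}

// Equal a and b; A.c > B.c, C has no c, and A.d < C.d < B.d.
const FakeObj A = {0, 0, 2, 1};
const FakeObj B = {0, 0, 1, 3};
const FakeObj C = {0, 0, 0, 2};

// True when cmp orders x before y, y before z, and z before x.
bool hasCycle(bool (*cmp)(const FakeObj&, const FakeObj&),
              const FakeObj& x, const FakeObj& y, const FakeObj& z) {
    return cmp(x, y) && cmp(y, z) && cmp(z, x);
}
```

With these values, hasCycle(compare, A, B, C) is true, while hasCycle(compareWithPresence, A, B, C) is false.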

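That resolved our case. In practice, though, it took a long time, for the following reasons:

(1) The compare function was not built up incrementally but written in one go, so we did not immediately suspect a bug in the comparison of the c attribute;

(2) We did not read the documentation closely enough: we only noticed asymmetry (if comp(a,b)==true then comp(b,a)==false) and overlooked transitivity.

It took many detours before we noticed the transitivity requirement. When troubleshooting in the future, we should be more meticulous and not let any detail slip.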
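
One way to catch this class of bug earlier is to brute-force the Compare requirements over a small sample before trusting a comparator in std::sort. The helper below is a hypothetical debugging aid, not part of the standard library. It is O(n^3), so it only suits small samples, and passing on a sample does not prove the comparator is correct for all inputs. The two toy comparators on ints are made up just to exercise it.

```cpp
#include <cstddef>
#include <vector>

// Checks irreflexivity, asymmetry, transitivity of comp, and transitivity of
// "neither orders before the other" over every triple drawn from `sample`.
template <typename T, typename Compare>
bool isStrictWeakOrderOnSample(const std::vector<T>& sample, Compare comp) {
    const std::size_t n = sample.size();
    for (std::size_t i = 0; i < n; ++i) {
        if (comp(sample[i], sample[i])) return false;  // irreflexivity
        for (std::size_t j = 0; j < n; ++j) {
            const bool ij = comp(sample[i], sample[j]);
            const bool ji = comp(sample[j], sample[i]);
            if (ij && ji) return false;  // asymmetry
            for (std::size_t k = 0; k < n; ++k) {
                const bool jk = comp(sample[j], sample[k]);
                const bool kj = comp(sample[k], sample[j]);
                const bool ik = comp(sample[i], sample[k]);
                const bool ki = comp(sample[k], sample[i]);
                if (ij && jk && !ik) return false;  // transitivity
                const bool equivIJ = !ij && !ji;
                const bool equivJK = !jk && !kj;
                if (equivIJ && equivJK && (ik || ki)) return false;  // equivalence is transitive
            }
        }
    }
    return true;
}

// Toy comparators on the values 0, 1, 2.
bool descending(int x, int y) { return x > y; }
bool cyclic(int x, int y) { return y == (x + 1) % 3; }  // 0 before 1, 1 before 2, 2 before 0
```

On the sample {0, 1, 2}, descending passes the check and cyclic fails it at the transitivity test, which is the same failure mode as the A, B, C example.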

2. The deeper cause of the crash

The business-level crash was fixed, but its direct cause was still unknown and needed further investigation.

The source code of std::sort:

https://github.com/gcc-mirror/gcc/blob/gcc-4_4-branch/libstdc%2B%2B-v3/include/bits/stl_algo.h

Combined with other people's analyses of the std::sort source:

http://www.javashuo.com/article/p-keruwjzp-er.html

https://liam.page/2018/09/18/std-sort-in-STL/

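A brief summary: for efficiency, std::sort combines quicksort, heapsort and insertion sort, and runs in two phases:

(1) Quicksort plus heapsort (__introsort_loop). Sequences with more than _S_threshold elements are quicksorted; once the recursion reaches a certain depth (__depth_limit), it stops recursing and heapsorts the remaining range. Sequences with fewer than _S_threshold elements are left untouched for the insertion sort that follows.

(2) Insertion sort (__final_insertion_sort). When there are fewer than _S_threshold elements, an ordinary insertion sort (__insertion_sort) is run. When there are more, insertion sort runs in two batches: first an ordinary insertion sort over [0, _S_threshold), then an unguarded insertion sort (__unguarded_insertion_sort) that starts at position _S_threshold and runs to end. Note that the second batch may also move elements located before _S_threshold, because this function stops only on the result of a comparison and is not forced to stop at any fixed position.

Our crash happened in the __unguarded_insertion_sort phase, that is, the unguarded insertion sort. Here is the relevant code: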

/// This is a helper function for the sort routine.
template<typename _RandomAccessIterator, typename _Compare>
inline void __unguarded_insertion_sort(_RandomAccessIterator __first,
               _RandomAccessIterator __last, _Compare __comp)
{
    typedef typename iterator_traits<_RandomAccessIterator>::value_type _ValueType;
    for (_RandomAccessIterator __i = __first; __i != __last; ++__i)
        std::__unguarded_linear_insert(__i, _ValueType(*__i), __comp);
}


/// This is a helper function for the sort routine.
template<typename _RandomAccessIterator, typename _Tp, typename _Compare>
void __unguarded_linear_insert(_RandomAccessIterator __last, _Tp __val,
              _Compare __comp) {
    _RandomAccessIterator __next = __last;
    --__next;
    while (__comp(__val, *__next)) {
        *__last = *__next;
        __last = __next;
        --__next;
    }
    *__last = __val;
}

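As the code shows, the only termination condition in __unguarded_linear_insert is the compare function returning false; until then it keeps moving backwards. This is safe because the earlier quicksort and heapsort code guarantees that every element of [0,X) is greater than (assuming a descending sort) every element of [X, end), where 0<X<=_S_threshold. Once that guarantee fails, --__next moves out of bounds and the process eventually crashes.

Back to our crash case: because the compare function is not transitive, even though every element of [0,X) is greater than X and every element of (X,end] is less than X, there is no guarantee that the elements of (X,end] are less than those of [0,X). When __unguarded_linear_insert performs insertion sort on an element of (X,end] that compares greater than every element of [0,X), the scan runs out of bounds and crashes.

The benefit of using __unguarded_insertion_sort here, rather than only __insertion_sort, is that it saves the bounds check. Related discussion: https://bytes.com/topic/c/answers/819473-questions-about-stl-sort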
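
The mechanism can be replayed safely with a bounds-checked version of the backward scan. The sketch below is a simulation under assumptions, not the library code: unguardedScanUnderflows and the toy comparators are made-up names, and the two-element vector is a contrived arrangement rather than the exact state introsort produced in the production crash. The values 0, 1, 2 play the roles of A, B, C: with 1 as the pivot, 0 orders before it and 2 after it, yet 2 orders before 0.

```cpp
#include <cstddef>
#include <vector>

// Replays the backward scan of an unguarded linear insert for the element at
// `pos`, but with an explicit bounds check. Returns true when no earlier
// element stops the scan, i.e. when the real code would step before begin().
template <typename T, typename Compare>
bool unguardedScanUnderflows(const std::vector<T>& v, std::size_t pos, Compare comp) {
    const T val = v[pos];
    for (std::size_t next = pos; next > 0; --next) {
        if (!comp(val, v[next - 1])) return false;  // a sentinel stopped the scan
    }
    return true;  // every earlier element compared "after" val
}

// Toy cyclic relation: 0 before 1, 1 before 2, 2 before 0.
bool cyclicBefore(int x, int y) { return y == (x + 1) % 3; }

// A well-behaved comparator for contrast.
bool ascending(int x, int y) { return x < y; }
```

For the vector {0, 2}, unguardedScanUnderflows(v, 1, cyclicBefore) returns true because 2 orders before the only element ahead of it, whereas the same call with ascending returns false because 0 acts as the sentinel.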