Resolving deadlocks in inter-process shared memory caused by a process exiting abnormally

Date: 2019-11-12



Discovering the problem

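After publishing the earlier blog post "Solving data sharing between the internal worker processes of Nginx, Fpm-Php and similar servers", I ran into a new problem with inter-process shared memory.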

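Yesterday evening my QP colleagues deployed a release. This morning the timeout report showed that one front-end machine had QP access timeouts several orders of magnitude higher than the other front-end machines, even though all the front-end machines are homogeneous.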

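Was something wrong with that machine's system? The system status showed nothing abnormal. I tallied the timeout logs and found that every timeout occurred while the QP service was restarting that morning. Under normal circumstances, ClusterMap guarantees that traffic is distributed correctly during a service restart.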

Was ClusterMap at fault? I checked the ClusterMap server side and everything was normal.

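Was the subscriber client at fault? I picked a healthy machine at random and compared it with the problem machine. The logs showed nothing, but when I used the query tool to inspect the shared memory written by the subscriber agent on the two machines, the tool returned different results. That was even stranger: the same subscriber, yet one machine had the problem and the other did not.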

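Was the server sending them different messages? I dumped the server's list of subscriber machines and found that the problem machine was not in it at all, which meant this machine had no subscription. That looked like a lead. To verify, I took one of the QP machines it subscribes to offline and found that the data in shared memory was not updated. Running pstack on the process showed that its internal update thread was waiting on the lock the whole time, so the shared memory could never be updated. Stepping in with gdb, _lock.__data.__nr_readers stayed at 1, meaning a reader was holding the lock and keeping the writer out. I went through all the fpm-php reader processes and none of them held the lock, so a reader process must have died after acquiring the lock and before it could release it.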

Testing

The cause is now confirmed: a process exited abnormally after acquiring the read lock. I wrote a test program to reproduce the problem.

(! 2293)-> cat test/read_shared.cpp

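#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <new>
#include <string>
// The project headers declaring SharedUpdateData and cm_sub::CMMapFile are also required.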

SharedUpdateData*   _sharedUpdateData = NULL;
cm_sub::CMMapFile*  _mmapFile = NULL;

int32_t initSharedMemRead(const std::string& mmap_file_path)
{
    _mmapFile = new (std::nothrow) cm_sub::CMMapFile();
    if (_mmapFile == NULL || !_mmapFile->open(mmap_file_path.c_str(), FILE_OPEN_WRITE) )
    {
        return -1;
    }
    _sharedUpdateData = (SharedUpdateData*)_mmapFile->offset2Addr(0);
    return 0;
}

int main(int argc, char** argv)
{
    if (argc != 2) return -1;
    if (initSharedMemRead(argv[1]) != 0) return -1;

    int cnt = 200;
    while (cnt > 0)
    {
        pthread_rwlock_rdlock( &(_sharedUpdateData->_lock));
        fprintf(stdout, "version = %ld, readers = %u\n",
            _sharedUpdateData->_version, _sharedUpdateData->_lock.__data.__nr_readers);
        if (cnt == 190)
        {
            // Exit while still holding the read lock to simulate a crashed reader.
            exit(0);
        }
        sleep(1);
        pthread_rwlock_unlock( &(_sharedUpdateData->_lock));
        -- cnt;
        usleep(100*1000);
    }
    delete _mmapFile;
}

(! 2293)-> cat test/write_shared.cpp

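#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>
#include <new>
// The project headers declaring SharedUpdateData, cm_sub::CMMapFile and autil::TimeUtility are also required.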

SharedUpdateData*   _sharedUpdateData = NULL;
cm_sub::CMMapFile*  _mmapFile = NULL;

int32_t initSharedMemWrite(const char* mmap_file_path)
{
    _mmapFile = new (std::nothrow) cm_sub::CMMapFile();
    if ( _mmapFile == NULL || !_mmapFile->open(mmap_file_path, FILE_OPEN_WRITE, 1024) )
    {
        return -1;
    }
    _sharedUpdateData = (SharedUpdateData *)_mmapFile->offset2Addr(0);
    madvise(_sharedUpdateData, 1024, MADV_SEQUENTIAL);

    pthread_rwlockattr_t attr;
    memset(&attr, 0x0, sizeof(pthread_rwlockattr_t));
    if (pthread_rwlockattr_init(&attr) != 0 || pthread_rwlockattr_setpshared(&attr, PTHREAD_PROCESS_SHARED) != 0)
    {
        return -1;
    }
    pthread_rwlock_init( &(_sharedUpdateData->_lock), &attr);
    _sharedUpdateData->_updateTime = autil::TimeUtility::currentTime();
    _sharedUpdateData->_version = 0;
    return 0;
}

int main()
{
    if (initSharedMemWrite("data.mmap") != 0) return -1;

    int cnt = 200;
    while (cnt > 0)
    {
        pthread_rwlock_wrlock( &(_sharedUpdateData->_lock));
        ++ _sharedUpdateData->_version;
        fprintf(stdout, "version = %ld, readers = %u\n",
                _sharedUpdateData->_version, _sharedUpdateData->_lock.__data.__nr_readers);
        sleep(1);
        pthread_rwlock_unlock( &(_sharedUpdateData->_lock));
        -- cnt;
        usleep(100*1000);
    }
    delete _mmapFile;
}

Whether it is a reader or a writer process, dying after acquiring the lock and before releasing it causes the same problem.

How to solve it

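With the problem reproduced, I looked for a good way to fix it. I searched online and found no good solution for read-write locks; it has to be handled in the application logic. What I could come up with is a timeout mechanism: the writer process gets a timeout, and if it still cannot acquire the lock when the timeout expires, it assumes a deadlock and decrements the reader count by 1. This is a brute-force fix and I won't elaborate on it. If anyone has a better approach, please let me know.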
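The detection half of that timeout idea can be sketched with the POSIX call pthread_rwlock_timedwrlock. This is only a minimal sketch: the helper name lockForWriteWithTimeout and the millisecond parameter are made up for illustration, and what to do after a timeout (for example, forcibly fixing up the reader count) is left to the caller.

```cpp
#include <errno.h>
#include <pthread.h>
#include <time.h>

// Hypothetical helper (not from the original code): try to take the write
// lock, giving up after timeout_ms milliseconds.
// Returns 0 on success, ETIMEDOUT if the lock was not acquired in time,
// or another errno value on failure.
int lockForWriteWithTimeout(pthread_rwlock_t* lock, long timeout_ms)
{
    struct timespec deadline;
    // pthread_rwlock_timedwrlock expects an absolute CLOCK_REALTIME deadline.
    if (clock_gettime(CLOCK_REALTIME, &deadline) != 0) return errno;

    deadline.tv_sec  += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L)
    {
        // Normalise so that tv_nsec stays below one second.
        deadline.tv_sec  += 1;
        deadline.tv_nsec -= 1000000000L;
    }
    return pthread_rwlock_timedwrlock(lock, &deadline);
}
```

A writer would treat ETIMEDOUT as "a reader probably died while holding the lock" and then apply whatever recovery policy it trusts. A timeout alone cannot tell a dead reader from a slow one.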

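Looking at the read-write lock code: compared with a mutex, a read-write lock is a better fit for read-heavy, write-light workloads, and the longer readers need to hold the lock, the more it pays off. My use case is read-heavy and write-light, with very short reads and writes, so for now I assume the performance difference between a mutex and a read-write lock is small. In fact the read-write lock also uses a mutex internally, only for a very short time: it locks the critical region, checks whether anyone is writing, and then releases it.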

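One thing to note is the reader/writer preference. With writer preference, readers cannot take the lock while a writer is writing or is queued to write; they have to wait. On glibc, however, the default kind is PTHREAD_RWLOCK_PREFER_READER_NP, so writer preference has to be requested explicitly with pthread_rwlockattr_setkind_np.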
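On glibc the preference can be set explicitly instead of relying on the default. The following is a minimal sketch using the non-portable pthread_rwlockattr_setkind_np; the helper name initWriterPreferringRwlock is an assumption made for illustration.

```cpp
#include <pthread.h>

// Hypothetical helper (not from the original code): initialise a
// process-shared rwlock that prefers writers on glibc.
// Returns 0 on success or an errno value.
int initWriterPreferringRwlock(pthread_rwlock_t* lock)
{
    pthread_rwlockattr_t attr;
    int ret = pthread_rwlockattr_init(&attr);
    if (ret != 0) return ret;

    // Allow the lock to live in memory shared between processes.
    ret = pthread_rwlockattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
    if (ret == 0)
    {
        // glibc extension: queued writers block new readers.
        ret = pthread_rwlockattr_setkind_np(
            &attr, PTHREAD_RWLOCK_PREFER_WRITER_NONRECURSIVE_NP);
    }
    if (ret == 0)
    {
        ret = pthread_rwlock_init(lock, &attr);
    }
    pthread_rwlockattr_destroy(&attr);
    return ret;
}
```

With this kind, a thread must not take the read lock recursively, since a queued writer would block the second acquisition.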

Now let's see whether a mutex can solve our problem. A mutex has a robustness attribute that turns it into a robust mutex.

Making a mutex robust: pthread_mutexattr_setrobust_np

    The robustness attribute defines the behavior when the owner of a mutex dies. The value of robustness could be either PTHREAD_MUTEX_ROBUST_NP or PTHREAD_MUTEX_STALLED_NP, which are defined by the header <pthread.h>. The default value of the robustness attribute is PTHREAD_MUTEX_STALLED_NP.

    When the owner of a mutex with the PTHREAD_MUTEX_STALLED_NP robustness attribute dies, all future calls to pthread_mutex_lock(3C) for this mutex will be blocked from progress in an unspecified manner.

Repairing an inconsistent robust mutex: pthread_mutex_consistent_np

    A consistent mutex becomes inconsistent and is unlocked if its owner dies while holding it, or if the process containing the owner of the mutex unmaps the memory containing the mutex or performs one of the exec(2) functions. A subsequent owner of the mutex will acquire the mutex with pthread_mutex_lock(3C), which will return EOWNERDEAD to indicate that the acquired mutex is inconsistent.

    The pthread_mutex_consistent_np() function should be called while holding the mutex acquired by a previous call to pthread_mutex_lock() that returned EOWNERDEAD.

    Since the critical section protected by the mutex could have been left in an inconsistent state by the dead owner, the caller should make the mutex consistent only if it is able to make the critical section protected by the mutex consistent.

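Put simply, when pthread_mutex_lock returns EOWNERDEAD, the caller invokes pthread_mutex_consistent_np. Internally the function checks whether the mutex is robust and whether its owner has died; if so, it sets the mutex's owner to the caller's own ID, and the mutex becomes usable again. Simple enough.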
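The recovery sequence can be tried inside a single process, with a thread standing in for the process that dies. This is a minimal sketch that uses the standardised POSIX.1-2008 names (pthread_mutexattr_setrobust, PTHREAD_MUTEX_ROBUST, pthread_mutex_consistent) rather than the _np variants quoted above. The names g_mutex, lockAndExit and demoRobustRecovery are made up for illustration, and a cross-process version would also need pthread_mutexattr_setpshared.

```cpp
#include <errno.h>
#include <pthread.h>

static pthread_mutex_t g_mutex;   // hypothetical name used only in this sketch

// Thread body that "dies" while still holding the mutex.
static void* lockAndExit(void*)
{
    pthread_mutex_lock(&g_mutex);
    return NULL;                  // no unlock: the owner is now dead
}

// Returns the result of the first pthread_mutex_lock after the owner died.
int demoRobustRecovery()
{
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setrobust(&attr, PTHREAD_MUTEX_ROBUST);
    pthread_mutex_init(&g_mutex, &attr);
    pthread_mutexattr_destroy(&attr);

    pthread_t t;
    pthread_create(&t, NULL, lockAndExit, NULL);
    pthread_join(t, NULL);        // the owner has terminated holding the lock

    int ret = pthread_mutex_lock(&g_mutex);
    if (ret == EOWNERDEAD)
    {
        // We hold the lock now. Repair the protected data here, then
        // mark the mutex consistent so later lock calls behave normally.
        pthread_mutex_consistent(&g_mutex);
    }
    if (ret == 0 || ret == EOWNERDEAD)
    {
        pthread_mutex_unlock(&g_mutex);
    }
    return ret;
}
```

If the new owner unlocks without marking the mutex consistent, the mutex becomes permanently unusable and later lock attempts fail with ENOTRECOVERABLE.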

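That takes care of releasing the lock. When sharing data between processes through shared memory, however, there is one more thing to watch: the correctness, that is the completeness, of the data. Sharing between processes differs from sharing between threads. With multiple threads in one process, if the process exits abnormally, all the other threads exit with it. Processes are independent of one another: if a writer process exits abnormally in the middle of writing the shared data, the data is left incomplete and reader processes will read incomplete data. Data completeness is actually easy to handle. Add a completion flag to the shared memory: lock the shared region, write the data, and set the flag to "complete" once the write is done. Readers check the completion flag when they read.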
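The completion flag described above could look roughly like this. It is a minimal sketch under assumptions: the SharedRecord layout, its field names and the two helpers are invented for illustration, and both helpers are meant to be called with the shared lock already held.

```cpp
#include <atomic>
#include <stdint.h>
#include <string.h>

// Hypothetical layout of a shared region; field names are assumptions.
struct SharedRecord
{
    std::atomic<int32_t> complete;   // 1 only while payload is fully written
    char                 payload[64];
};

// Writer: call with the lock held.
void writeRecord(SharedRecord* rec, const char* text)
{
    // Mark "in progress" first; the fence keeps this store ahead of the
    // payload writes so a crash mid-copy leaves complete == 0.
    rec->complete.store(0, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_release);

    strncpy(rec->payload, text, sizeof(rec->payload) - 1);
    rec->payload[sizeof(rec->payload) - 1] = '\0';

    // Mark "done" last, after the payload is fully written.
    rec->complete.store(1, std::memory_order_release);
}

// Reader: call with the lock held. Returns false if the record is
// incomplete, for example because a writer died mid-update.
bool readRecord(const SharedRecord* rec, char* out, size_t out_size)
{
    if (out_size == 0) return false;
    if (rec->complete.load(std::memory_order_acquire) != 1) return false;
    strncpy(out, rec->payload, out_size - 1);
    out[out_size - 1] = '\0';
    return true;
}
```

A process that gets EOWNERDEAD from a robust mutex can use the same flag to decide whether the data needs to be rebuilt before it marks the mutex consistent.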

The test code:

(! 2295)-> cat test/read_shared_mutex.cpp

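#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <new>
#include <string>
// The project headers declaring SharedUpdateData and cm_sub::CMMapFile are also required.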

 SharedUpdateData*   _sharedUpdateData = NULL;
 cm_sub::CMMapFile*  _mmapFile = NULL;

 int32_t initSharedMemRead(const std::string& mmap_file_path)
 {
    _mmapFile = new (std::nothrow) cm_sub::CMMapFile();
    if (_mmapFile == NULL || !_mmapFile->open(mmap_file_path.c_str(), FILE_OPEN_WRITE) )
    {
        return -1;
    }
    _sharedUpdateData = (SharedUpdateData*)_mmapFile->offset2Addr(0);
    return 0;
 }

 int main(int argc, char** argv)
 {
     if (argc != 2) return -1;
     if (initSharedMemRead(argv[1]) != 0) return -1;   

     int cnt = 10000;
     int ret = 0;
     while (cnt > 0)
     {
         ret = pthread_mutex_lock( &(_sharedUpdateData->_lock));
         if (ret == EOWNERDEAD)
         {
             fprintf(stdout, "%s: version = %ld, lock = %d, %u, %d\n",
                strerror(ret),
                _sharedUpdateData->_version,
                _sharedUpdateData->_lock.__data.__lock,
                _sharedUpdateData->_lock.__data.__count,
                _sharedUpdateData->_lock.__data.__owner);
             ret = pthread_mutex_consistent_np( &(_sharedUpdateData->_lock));
             if (ret != 0)
             {
                 fprintf(stderr, "%s\n", strerror(ret));
                 pthread_mutex_unlock( &(_sharedUpdateData->_lock));
                 continue;
             }
         }
         fprintf(stdout, "version = %ld, lock = %d, %u, %d\n",
            _sharedUpdateData->_version,
            _sharedUpdateData->_lock.__data.__lock,
            _sharedUpdateData->_lock.__data.__count,
            _sharedUpdateData->_lock.__data.__owner);
         sleep(5);
         pthread_mutex_unlock( &(_sharedUpdateData->_lock));
         usleep(500*1000);
         -- cnt;
    }
    fprintf(stdout, "go on\n");
    delete _mmapFile;
 }

(! 2295)-> cat test/write_shared_mutex.cpp

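#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>
#include <new>
// The project headers declaring SharedUpdateData and cm_sub::CMMapFile are also required.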

SharedUpdateData*   _sharedUpdateData = NULL;
cm_sub::CMMapFile*  _mmapFile = NULL;

int32_t initSharedMemWrite(const char* mmap_file_path)
{
    _mmapFile = new (std::nothrow) cm_sub::CMMapFile();
    if ( _mmapFile == NULL || !_mmapFile->open(mmap_file_path, FILE_OPEN_WRITE, 1024) )
    {
        return -1;
    }
    _sharedUpdateData = (SharedUpdateData *)_mmapFile->offset2Addr(0);
    madvise(_sharedUpdateData, 1024, MADV_SEQUENTIAL);

    pthread_mutexattr_t attr;
    memset(&attr, 0x0, sizeof(pthread_mutexattr_t));
    if (pthread_mutexattr_init(&attr) != 0 || pthread_mutexattr_setpshared(&attr, PTHREAD_PROCESS_SHARED) != 0)
    {
        return -1;
    }
    if (pthread_mutexattr_setrobust_np(&attr, PTHREAD_MUTEX_ROBUST_NP) != 0)
    {
        return -1;
    }
    pthread_mutex_init( &(_sharedUpdateData->_lock), &attr);
    _sharedUpdateData->_version = 0;
    return 0;
}

int main()
{
    if (initSharedMemWrite("data.mmap") != 0) return -1;

    int cnt = 200;
    int ret = 0;
    while (cnt > 0)
    {
        ret = pthread_mutex_lock( &(_sharedUpdateData->_lock));
        if (ret == EOWNERDEAD)
        {
            fprintf(stdout, "%s: version = %ld, lock = %d, %u, %d\n",
                    strerror(ret),
                    _sharedUpdateData->_version,
                    _sharedUpdateData->_lock.__data.__lock,
                    _sharedUpdateData->_lock.__data.__count,
                    _sharedUpdateData->_lock.__data.__owner);
            ret = pthread_mutex_consistent_np( &(_sharedUpdateData->_lock));
            if (ret != 0)
            {
                fprintf(stderr, "%s\n", strerror(ret));
                pthread_mutex_unlock( &(_sharedUpdateData->_lock));
                continue;
            }
        }
        ++ _sharedUpdateData->_version;
        fprintf(stdout, "version = %ld, lock = %d, %u, %d\n", _sharedUpdateData->_version,
                _sharedUpdateData->_lock.__data.__lock,
                _sharedUpdateData->_lock.__data.__count,
                _sharedUpdateData->_lock.__data.__owner);
        usleep(1000*1000);
        pthread_mutex_unlock( &(_sharedUpdateData->_lock));
        -- cnt;
        usleep(500*1000);
    }

    delete _mmapFile;
}

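BTW: we all know that locking has a cost. Besides the waiting caused by mutual exclusion, acquiring a lock can involve a system call into the kernel, which is expensive as well. There is a mechanism called Futex (Fast Userspace Mutex), supported by Linux since version 2.5.7. It is a fast user-level mutex with better performance, a synchronization mechanism that mixes user mode and kernel mode: when there is no contention, the lock can be decided and returned entirely in user mode, with no system call.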

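Of course every lock has a cost, so avoid locks where you can. Double buffering, deferred-free lists, and reference counting can all replace locks to some extent.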
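As an illustration of the double-buffering idea, here is a minimal sketch for one writer and many readers. Everything in it (DoubleBuffer, initBuffer, publish, readLatest) is hypothetical. It also assumes the writer publishes rarely enough that no reader is still reading a slot when that slot is reused two updates later; without that guarantee, something like reference counting is still needed.

```cpp
#include <atomic>
#include <stdint.h>

// Hypothetical double buffer: readers use slots[active], the single
// writer fills the other slot and then flips the index.
struct DoubleBuffer
{
    int64_t          slots[2];
    std::atomic<int> active;   // index of the slot readers should use
};

void initBuffer(DoubleBuffer* db, int64_t value)
{
    db->slots[0] = value;
    db->slots[1] = value;
    db->active.store(0, std::memory_order_release);
}

// Single writer only: write into the inactive slot, then publish it.
void publish(DoubleBuffer* db, int64_t value)
{
    int next = 1 - db->active.load(std::memory_order_relaxed);
    db->slots[next] = value;
    // Release: the slot contents become visible before the new index.
    db->active.store(next, std::memory_order_release);
}

// Readers: no lock taken, just read whichever slot is currently active.
int64_t readLatest(const DoubleBuffer* db)
{
    int cur = db->active.load(std::memory_order_acquire);
    return db->slots[cur];
}
```

Because a reader never holds a lock, a reader that dies cannot block the writer, which is exactly the failure mode described at the start of this post.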