Pthreads Parallel Programming: A Performance Comparison of Spin Locks and Mutexes (Repost)

Date: 2019-12-02

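POSIX threads (Pthreads for short) is a widely used API for parallel programming on multicore platforms. Thread synchronization is an essential means of communication in parallel programs, and its most typical use is protecting a critical section shared by several threads with one of the locks that Pthreads provides (another commonly used synchronization mechanism is the barrier).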

Pthreads provides several locking and synchronization primitives:
(1) Mutex: pthread_mutex_***
(2) Spin lock: pthread_spin_***
(3) Condition variable: pthread_cond_***
(4) Read/write lock: pthread_rwlock_***

The main Pthreads APIs for mutex operations are:
int pthread_mutex_lock(pthread_mutex_t *mutex);
int pthread_mutex_trylock(pthread_mutex_t *mutex);
int pthread_mutex_unlock(pthread_mutex_t *mutex);
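
As a minimal sketch of how these three calls fit together, the following C fragment protects a shared counter. The names counter_lock, counter_add, and counter_try_add are hypothetical and chosen only for illustration; the sketch assumes a default (non-recursive) mutex.

```c
#include <errno.h>
#include <pthread.h>

/* Hypothetical shared state, protected by counter_lock. */
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static long counter = 0;

/* Blocking update: the caller sleeps if another thread holds the lock. */
long counter_add(long delta)
{
    pthread_mutex_lock(&counter_lock);
    counter += delta;
    long snapshot = counter;
    pthread_mutex_unlock(&counter_lock);
    return snapshot;
}

/* Non-blocking update: returns 0 on success, or EBUSY if the lock is held. */
int counter_try_add(long delta)
{
    int rc = pthread_mutex_trylock(&counter_lock);
    if (rc != 0)
        return rc;          /* lock not acquired, nothing was changed */
    counter += delta;
    pthread_mutex_unlock(&counter_lock);
    return 0;
}
```

pthread_mutex_trylock never waits, so it is useful when the caller has other work to do instead of blocking.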

The main Pthreads APIs for spin lock operations are:
int pthread_spin_lock(pthread_spinlock_t *lock);
int pthread_spin_trylock(pthread_spinlock_t *lock);
int pthread_spin_unlock(pthread_spinlock_t *lock);

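In terms of implementation, a mutex is a sleep-waiting lock. Suppose a dual-core machine runs two threads, thread A on Core0 and thread B on Core1. If thread A calls pthread_mutex_lock to acquire the lock on a critical section while thread B holds it, thread A blocks: Core0 performs a context switch and places thread A on a wait queue, so Core0 is free to run other work (for example, another thread C) instead of busy-waiting. A spin lock behaves differently. It is a busy-waiting lock: if thread A requests the lock with pthread_spin_lock, it keeps spinning on Core0, retrying the lock request over and over until it finally gets the lock.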

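If you read the source of NPTL (Native POSIX Thread Library), the glibc implementation of the Pthreads API on Linux (the command "getconf GNU_LIBPTHREAD_VERSION" prints the NPTL version on your system), you will find that when pthread_mutex_lock() fails to take the lock it makes a system call to wait in the kernel, and the current thread is added to the wait queue for that mutex. (Current NPTL builds its mutexes on futexes: the uncontended path runs entirely in user space, and the futex system call is made only under contention, which has improved performance considerably.) A spin lock, by contrast, can be thought of as a lock operation implemented with inline assembly inside a while(1) loop. (I recall a paper stating that in the Linux kernel a spin lock acquisition needs only two CPU instructions and the unlock only one.) Interested readers can also look at the Pthreads API implementation in a microkernel named sanos, in its mutex.c and spinlock.c files. That code differs from NPTL, but it is simple and easy to follow, which makes it helpful for understanding the characteristics of spin locks and mutexes.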
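
To make the "lock operation inside a loop" idea concrete, here is a toy test-and-set spin lock written with C11 atomics instead of inline assembly. This is an illustrative sketch only: it is not the NPTL, sanos, or Linux kernel implementation, and the toy_spinlock names are made up for this example.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy spin lock: a single flag that is set while the lock is held. */
typedef struct {
    atomic_flag held;
} toy_spinlock;

#define TOY_SPINLOCK_INIT { ATOMIC_FLAG_INIT }

/* Spin until the flag was previously clear, i.e. until we are the one who set it. */
static inline void toy_spin_lock(toy_spinlock *l)
{
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ;   /* busy-wait: the core stays occupied while waiting */
}

/* One attempt only: returns true if the lock was acquired. */
static inline bool toy_spin_trylock(toy_spinlock *l)
{
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

/* Unlocking is a single atomic store that clears the flag. */
static inline void toy_spin_unlock(toy_spinlock *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

The lock path is one atomic instruction plus a branch, and the unlock path is one store, which matches the intuition that spin lock operations are very cheap when there is no contention.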

So which performs better in real code, a mutex or a spin lock? Spin locks are used very widely inside the Linux kernel; does that mean they are faster? Let's test with real code (make sure a recent g++ is installed on your system).

// Name: spinlockvsmutex1.cc
// Source: http://www.alexonlinux.com/pthread-mutex-vs-pthread-spinlock
// Compiler(spin lock version): g++ -o spin_version -DUSE_SPINLOCK spinlockvsmutex1.cc -lpthread
// Compiler(mutex version): g++ -o mutex_version spinlockvsmutex1.cc -lpthread
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <errno.h>
#include <sys/time.h>
#include <list>
#include <pthread.h>

#define LOOPS 50000000

// The shared list that both consumer threads pop from.
std::list<int> the_list;

#ifdef USE_SPINLOCK
pthread_spinlock_t spinlock;
#else
pthread_mutex_t mutex;
#endif

// Get the thread id. Named get_tid() rather than gettid() because
// glibc 2.30 and later declare their own gettid() in <unistd.h>.
pid_t get_tid() { return (pid_t)syscall(__NR_gettid); }

void *consumer(void *ptr)
{
    int i;

    printf("Consumer TID %lu\n", (unsigned long)get_tid());

    while (1)
    {
#ifdef USE_SPINLOCK
        pthread_spin_lock(&spinlock);
#else
        pthread_mutex_lock(&mutex);
#endif

        // Stop once the list is drained; release the lock before leaving.
        if (the_list.empty())
        {
#ifdef USE_SPINLOCK
            pthread_spin_unlock(&spinlock);
#else
            pthread_mutex_unlock(&mutex);
#endif
            break;
        }

        // The critical section is tiny: read and pop one element.
        i = the_list.front();
        the_list.pop_front();
        (void)i;

#ifdef USE_SPINLOCK
        pthread_spin_unlock(&spinlock);
#else
        pthread_mutex_unlock(&mutex);
#endif
    }

    return NULL;
}

int main()
{
    int i;
    pthread_t thr1, thr2;
    struct timeval tv1, tv2;

#ifdef USE_SPINLOCK
    pthread_spin_init(&spinlock, PTHREAD_PROCESS_PRIVATE);
#else
    pthread_mutex_init(&mutex, NULL);
#endif

    // Creating the list content...
    for (i = 0; i < LOOPS; i++)
        the_list.push_back(i);

    // Measuring time before starting the threads...
    gettimeofday(&tv1, NULL);

    pthread_create(&thr1, NULL, consumer, NULL);
    pthread_create(&thr2, NULL, consumer, NULL);

    pthread_join(thr1, NULL);
    pthread_join(thr2, NULL);

    // Measuring time after threads finished...
    gettimeofday(&tv2, NULL);

    // Borrow one second if the microsecond field would go negative.
    if (tv1.tv_usec > tv2.tv_usec)
    {
        tv2.tv_sec--;
        tv2.tv_usec += 1000000;
    }

    // %06ld keeps leading zeros in the microsecond part.
    printf("Result - %ld.%06ld\n", (long)(tv2.tv_sec - tv1.tv_sec),
        (long)(tv2.tv_usec - tv1.tv_usec));

#ifdef USE_SPINLOCK
    pthread_spin_destroy(&spinlock);
#else
    pthread_mutex_destroy(&mutex);
#endif

    return 0;
}

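The program works as follows. The main thread first initializes a list and inserts LOOPS entries into it, then creates two new threads that both run consumer(). The two threads pop entries from the list concurrently. The main thread measures the time from creating the two threads until both have finished and prints it as "Result" in the output below.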

Test machine:
Ubuntu 9.04 x86_64
Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz
4.0 GB memory

The test results are as follows:

gchen@gchen-desktop:~/Workspace/mutex$ g++ -o spin_version -DUSE_SPINLOCK spinvsmutex1.cc -lpthread
gchen@gchen-desktop:~/Workspace/mutex$ g++ -o mutex_version spinvsmutex1.cc -lpthread
gchen@gchen-desktop:~/Workspace/mutex$ time ./spin_version
Consumer TID 5520
Consumer TID 5521
Result - 5.888750
 
real    0m10.918s
user    0m15.601s
sys    0m0.804s
 
gchen@gchen-desktop:~/Workspace/mutex$ time ./mutex_version
Consumer TID 5691
Consumer TID 5692
Result - 9.116376
 
real    0m14.031s
user    0m12.245s
sys    0m4.368s

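The spin lock version clearly performs better in this program. Also note the sys time: the mutex version spends more time in system calls, because a contended mutex makes a system call to wait in the kernel (a futex wait in NPTL).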

But does that mean a spin lock is always better? Let's look at another example program, one with very heavy lock contention:

//Name: svm2.c
//Source: http://www.solarisinternals.com/wiki/index.php/DTrace_Topics_Locks
//Compile(spin lock version): gcc -o spin -DUSE_SPINLOCK svm2.c -lpthread
//Compile(mutex version): gcc -o mutex svm2.c -lpthread
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <unistd.h>
#include <pthread.h>
#include <sys/syscall.h>

#define THREAD_NUM 2

pthread_t g_thread[THREAD_NUM];
#ifdef USE_SPINLOCK
pthread_spinlock_t g_spin;
#else
pthread_mutex_t g_mutex;
#endif
uint64_t g_count;

/* Thread id; named get_tid() to avoid clashing with glibc's own gettid(). */
pid_t get_tid(void)
{
    return (pid_t)syscall(SYS_gettid);
}

void *run_amuck(void *arg)
{
    int i, j;

    printf("Thread %lu started.\n", (unsigned long)get_tid());

    for (i = 0; i < 10000; i++) {
#ifdef USE_SPINLOCK
        pthread_spin_lock(&g_spin);
#else
        pthread_mutex_lock(&g_mutex);
#endif
        /* Deliberately long critical section: 100000 increments per lock hold. */
        for (j = 0; j < 100000; j++) {
            if (g_count++ == 123456789)
                printf("Thread %lu wins!\n", (unsigned long)get_tid());
        }
#ifdef USE_SPINLOCK
        pthread_spin_unlock(&g_spin);
#else
        pthread_mutex_unlock(&g_mutex);
#endif
    }

    printf("Thread %lu finished!\n", (unsigned long)get_tid());

    return (NULL);
}

int main(int argc, char *argv[])
{
    int i, threads = THREAD_NUM;

    printf("Creating %d threads...\n", threads);
#ifdef USE_SPINLOCK
    pthread_spin_init(&g_spin, PTHREAD_PROCESS_PRIVATE);
#else
    pthread_mutex_init(&g_mutex, NULL);
#endif
    /* The thread index is passed by value; intptr_t avoids an int-to-pointer size warning. */
    for (i = 0; i < threads; i++)
        pthread_create(&g_thread[i], NULL, run_amuck, (void *)(intptr_t)i);

    for (i = 0; i < threads; i++)
        pthread_join(g_thread[i], NULL);

    printf("Done.\n");

    return (0);
}

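This program has a very large critical section, so lock contention between the two threads is intense. It is of course an extreme case: real applications rarely have critical sections this large or contention this heavy. The results show that the mutex version performs better: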

gchen@gchen-desktop:~/Workspace/mutex$ time ./spin
Creating 2 threads...
Thread 31796 started.
Thread 31797 started.
Thread 31797 wins!
Thread 31797 finished!
Thread 31796 finished!
Done.
 
real    0m5.748s
user    0m10.257s
sys    0m0.004s
 
gchen@gchen-desktop:~/Workspace/mutex$ time ./mutex
Creating 2 threads...
Thread 31801 started.
Thread 31802 started.
Thread 31802 wins!
Thread 31802 finished!
Thread 31801 finished!
Done.
 
real    0m4.823s
user    0m4.772s
sys    0m0.032s

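Another detail worth noting is that the spin lock version consumes far more user time (10.257s versus 4.772s). The two threads run on two separate cores, and most of the time only one of them can hold the lock, so the other busy-waits on its own core and keeps that core at 100% CPU usage. With a mutex, a failed lock request triggers a context switch, which frees a core for other work. (This context switching also costs the thread that holds the lock something: when it releases the lock it has to ask the operating system to wake the blocked threads, which is extra overhead.)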
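
The user and sys figures reported by the time command can also be read from inside a program, which is handy when comparing the two lock types in a longer-running application. The following is a small sketch assuming getrusage with RUSAGE_SELF; the cpu_usage type and read_cpu_usage function are hypothetical names.

```c
#include <sys/resource.h>
#include <sys/time.h>

/* User and system CPU time consumed by the whole process, in seconds. */
typedef struct {
    double user_s;
    double sys_s;
} cpu_usage;

/* Fills *out and returns 0 on success, or -1 if getrusage fails. */
int read_cpu_usage(cpu_usage *out)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;

    /* ru_utime and ru_stime are struct timeval values (seconds + microseconds). */
    out->user_s = (double)ru.ru_utime.tv_sec + (double)ru.ru_utime.tv_usec / 1e6;
    out->sys_s  = (double)ru.ru_stime.tv_sec + (double)ru.ru_stime.tv_usec / 1e6;
    return 0;
}
```

Sampling before and after the contended section and subtracting gives the same split as the time output: busy-waiting shows up as user time, while blocking and waking threads shows up as sys time.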

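Summary
(1) A mutex suits programs that perform lock operations very frequently, and it adapts better to different workloads. It has more overhead than a spin lock (mainly context switches), but it copes with the complex scenarios found in real development and offers more flexibility while still delivering reasonable performance.

(2) A spin lock has faster lock/unlock operations (fewer CPU instructions), but it is only suitable when the critical section runs for a very short time. In real software development, using a spin lock is not a good idea unless the programmer understands the program's locking behavior very well. A multithreaded program typically performs tens of thousands of lock operations, and if too many of them are contended lock requests, a lot of time is wasted busy-waiting.

(3) A safer approach may be to start (conservatively) with a mutex and, if more performance is needed, try tuning with a spin lock. After all, our programs do not have performance requirements as strict as the Linux kernel's (the locks most commonly used in the Linux kernel are spin locks and rw locks).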
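
One middle ground between the two lock types, not covered in the original article, is to spin briefly and then fall back to blocking. The sketch below builds this on a plain mutex with pthread_mutex_trylock; the function name lock_spin_then_block and the max_spins parameter are hypothetical, and a good spin count would have to be found by measurement.

```c
#include <pthread.h>

/*
 * Try to take the mutex up to max_spins times without sleeping.
 * If every attempt fails, fall back to a normal blocking lock.
 * Returns the number of failed attempts; a return value equal to
 * max_spins means the caller went through the blocking path.
 * The mutex is always held when this function returns.
 */
int lock_spin_then_block(pthread_mutex_t *m, int max_spins)
{
    for (int i = 0; i < max_spins; i++) {
        if (pthread_mutex_trylock(m) == 0)
            return i;           /* acquired during the spin phase */
    }
    pthread_mutex_lock(m);      /* stop spinning and sleep until the lock is free */
    return max_spins;
}
```

The lock is released with the usual pthread_mutex_unlock. With short critical sections most acquisitions succeed during the spin phase, and with long ones the waiter ends up sleeping instead of burning a core. glibc also offers a non-portable mutex type, PTHREAD_MUTEX_ADAPTIVE_NP, that follows a similar idea.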

Addendum, March 3, 2010: this view is supported by Oracle's documentation:

During configuration, Berkeley DB selects a mutex implementation for the architecture. Berkeley DB normally prefers blocking-mutex implementations over non-blocking ones. For example, Berkeley DB will select POSIX pthread mutex interfaces rather than assembly-code test-and-set spin mutexes because pthread mutexes are usually more efficient and less likely to waste CPU cycles spinning without getting any work accomplished.

P.S. Both syscall(SYS_gettid) and syscall(__NR_gettid) return the ID of the current thread :)

Reposted from: www.parallellabs.com