I was looking for the fastest way to popcount large arrays of data. I encountered a very weird effect: changing the loop variable from unsigned to uint64_t made the performance drop by 50% on my PC.
#include <iostream>
#include <chrono>
#include <cstdlib>     // atol, rand
#include <cstdint>     // uint64_t
#include <x86intrin.h>

int main(int argc, char* argv[]) {
    using namespace std;
    if (argc != 2) {
        cerr << "usage: array_size in MB" << endl;
        return -1;
    }

    uint64_t size = atol(argv[1]) << 20;
    uint64_t* buffer = new uint64_t[size/8];
    char* charbuffer = reinterpret_cast<char*>(buffer);
    for (unsigned i = 0; i < size; ++i)
        charbuffer[i] = rand() % 256;

    uint64_t count, duration;
    chrono::time_point<chrono::system_clock> startP, endP;
    {
        startP = chrono::system_clock::now();
        count = 0;
        for (unsigned k = 0; k < 10000; k++) {
            // Tight unrolled loop with unsigned
            for (unsigned i = 0; i < size/8; i += 4) {
                count += _mm_popcnt_u64(buffer[i]);
                count += _mm_popcnt_u64(buffer[i+1]);
                count += _mm_popcnt_u64(buffer[i+2]);
                count += _mm_popcnt_u64(buffer[i+3]);
            }
        }
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<std::chrono::nanoseconds>(endP - startP).count();
        cout << "unsigned\t" << count << '\t' << (duration/1.0E9) << " sec \t"
             << (10000.0*size)/(duration) << " GB/s" << endl;
    }
    {
        startP = chrono::system_clock::now();
        count = 0;
        for (unsigned k = 0; k < 10000; k++) {
            // Tight unrolled loop with uint64_t
            for (uint64_t i = 0; i < size/8; i += 4) {
                count += _mm_popcnt_u64(buffer[i]);
                count += _mm_popcnt_u64(buffer[i+1]);
                count += _mm_popcnt_u64(buffer[i+2]);
                count += _mm_popcnt_u64(buffer[i+3]);
            }
        }
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<std::chrono::nanoseconds>(endP - startP).count();
        cout << "uint64_t\t" << count << '\t' << (duration/1.0E9) << " sec \t"
             << (10000.0*size)/(duration) << " GB/s" << endl;
    }

    delete[] buffer;   // was free(charbuffer): memory from new[] must be released with delete[]
}
As you see, we create a buffer of random data, with the size being x megabytes, where x is read from the command line. Afterwards, we iterate over the buffer and use an unrolled version of the x86 popcount intrinsic to perform the popcount. To get a more precise result, we do the popcount 10,000 times and measure the total time. In the first version, the inner loop variable is unsigned; in the second, the inner loop variable is uint64_t. I thought that this should make no difference, but the opposite is the case.
I compile it like this (g++ version: Ubuntu 4.8.2-19ubuntu1):
g++ -O3 -march=native -std=c++11 test.cpp -o test
Here are the results on my Haswell Core i7-4770K CPU @ 3.50 GHz, running test 1 (so 1 MB random data):

unsigned    26 GB/s
uint64_t    13 GB/s
As you see, the throughput of the uint64_t version is only half that of the unsigned version! The problem seems to be that different assembly gets generated, but why? First, I thought of a compiler bug, so I tried clang++ (Ubuntu Clang version 3.4-1ubuntu3):
clang++ -O3 -march=native -std=c++11 test.cpp -o test
Result of test 1:

So, it is almost the same result and is still strange. But now it gets super strange: I replace the buffer size that was read from input with the constant 1, so I change
uint64_t size = atol(argv[1]) << 20;
to
uint64_t size = 1 << 20;
Thus, the compiler now knows the buffer size at compile time. Maybe it can add some optimizations! Here are the numbers for g++:

unsigned    20 GB/s
uint64_t    20 GB/s
Now, both versions are equally fast. However, the unsigned version got even slower! It dropped from 26 to 20 GB/s; thus, replacing a non-constant by a constant value led to a deoptimization. Seriously, I have no clue what is going on here! But now to clang++ with the new version:
unsigned    15 GB/s
uint64_t    15 GB/s

Wait, what? Now, both versions dropped to the slow speed of 15 GB/s. Thus, replacing a non-constant by a constant value even leads to slow code in both cases for clang++!
I asked a colleague with an Ivy Bridge CPU to compile my benchmark. He got similar results, so it does not seem to be Haswell-specific. Because two compilers produce strange results here, it also does not seem to be a compiler bug. We do not have an AMD CPU here, so we could only test with Intel.
Take the first example (the one with atol(argv[1])) and put a static before the variable, i.e.:
static uint64_t size = atol(argv[1]) << 20;
Here are my results in g++:

unsigned    26 GB/s
uint64_t    20 GB/s
Yay, yet another alternative! We still have the fast 26 GB/s with u32, but we managed to get u64 at least from the 13 GB/s to the 20 GB/s version! On my colleague's PC, the u64 version became even faster than the u32 version, yielding the fastest result of all. Sadly, this only works for g++; clang++ does not seem to care about static.
Can you explain these results? Especially:

- What is the difference between u32 and u64?
- How can the insertion of the static keyword make the u64 loop faster, even faster than the original code on my colleague's computer?

I know that optimization is tricky territory; however, I never thought that such small changes could lead to a 100% difference in execution time, or that small factors like a constant buffer size could again mix up the results totally. Of course, I always want to have the version that is able to popcount at 26 GB/s. The only reliable way I can think of is to copy-paste the assembly for this case and use inline assembly. That is the only way I can get rid of compilers that seem to go mad on small changes. What do you think? Is there another way to reliably get the code with the most performance?
Here is the disassembly for the various results:

26 GB/s version from g++ / u32 / non-const bufsize:
0x400af8:  lea     0x1(%rdx),%eax
           popcnt  (%rbx,%rax,8),%r9
           lea     0x2(%rdx),%edi
           popcnt  (%rbx,%rcx,8),%rax
           lea     0x3(%rdx),%esi
           add     %r9,%rax
           popcnt  (%rbx,%rdi,8),%rcx
           add     $0x4,%edx
           add     %rcx,%rax
           popcnt  (%rbx,%rsi,8),%rcx
           add     %rcx,%rax
           mov     %edx,%ecx
           add     %rax,%r14
           cmp     %rbp,%rcx
           jb      0x400af8
13 GB/s version from g++ / u64 / non-const bufsize:
0x400c00:  popcnt  0x8(%rbx,%rdx,8),%rcx
           popcnt  (%rbx,%rdx,8),%rax
           add     %rcx,%rax
           popcnt  0x10(%rbx,%rdx,8),%rcx
           add     %rcx,%rax
           popcnt  0x18(%rbx,%rdx,8),%rcx
           add     $0x4,%rdx
           add     %rcx,%rax
           add     %rax,%r12
           cmp     %rbp,%rdx
           jb      0x400c00
15 GB/s version from clang++ / u64 / non-const bufsize:
0x400e50:  popcnt  (%r15,%rcx,8),%rdx
           add     %rbx,%rdx
           popcnt  0x8(%r15,%rcx,8),%rsi
           add     %rdx,%rsi
           popcnt  0x10(%r15,%rcx,8),%rdx
           add     %rsi,%rdx
           popcnt  0x18(%r15,%rcx,8),%rbx
           add     %rdx,%rbx
           add     $0x4,%rcx
           cmp     %rbp,%rcx
           jb      0x400e50
20 GB/s version from g++ / u32&u64 / const bufsize:
0x400a68:  popcnt  (%rbx,%rdx,1),%rax
           popcnt  0x8(%rbx,%rdx,1),%rcx
           add     %rax,%rcx
           popcnt  0x10(%rbx,%rdx,1),%rax
           add     %rax,%rcx
           popcnt  0x18(%rbx,%rdx,1),%rsi
           add     $0x20,%rdx
           add     %rsi,%rcx
           add     %rcx,%rbp
           cmp     $0x100000,%rdx
           jne     0x400a68
15 GB/s version from clang++ / u32&u64 / const bufsize:
0x400dd0:  popcnt  (%r14,%rcx,8),%rdx
           add     %rbx,%rdx
           popcnt  0x8(%r14,%rcx,8),%rsi
           add     %rdx,%rsi
           popcnt  0x10(%r14,%rcx,8),%rdx
           add     %rsi,%rdx
           popcnt  0x18(%r14,%rcx,8),%rbx
           add     %rdx,%rbx
           add     $0x4,%rcx
           cmp     $0x20000,%rcx
           jb      0x400dd0
Interestingly, the fastest (26 GB/s) version is also the longest! It seems to be the only one that uses lea. Some versions use jb to jump, others use jne. But apart from that, all versions seem to be comparable. I don't see where a 100% performance gap could originate from, but I am not too adept at deciphering assembly. The slowest (13 GB/s) version even looks very short and good. Can anyone explain this?
No matter what the answer to this question will be, I have learned that in really hot loops every detail can matter, even details that do not seem to have any association with the hot code. I have never thought about what type to use for a loop variable, but as you see, such a minor change can make a 100% difference! Even the storage class of a buffer size can make a huge difference, as we saw with the insertion of the static keyword in front of the size variable! In the future, I will always test various alternatives on various compilers when writing really tight and hot loops that are crucial for system performance.
Also interesting is that the performance difference is still so high although I have already unrolled the loop four times. So even if you unroll, you can still get hit by major performance deviations. Quite interesting.