Replacing a 32-bit loop counter with 64-bit introduces crazy performance deviations with _mm_popcnt_u64 on Intel CPUs

Question:

I was looking for the fastest way to popcount large arrays of data. I encountered a very weird effect: changing the loop variable from unsigned to uint64_t made the performance drop by 50% on my PC.

The Benchmark

#include <iostream>
#include <chrono>
#include <cstdint>   // uint64_t
#include <cstdlib>   // atol, rand
#include <x86intrin.h>

int main(int argc, char* argv[]) {

    using namespace std;
    if (argc != 2) {
       cerr << "usage: array_size in MB" << endl;
       return -1;
    }

    uint64_t size = atol(argv[1])<<20;
    uint64_t* buffer = new uint64_t[size/8];
    char* charbuffer = reinterpret_cast<char*>(buffer);
    for (unsigned i=0; i<size; ++i)
        charbuffer[i] = rand()%256;

    uint64_t count,duration;
    chrono::time_point<chrono::system_clock> startP,endP;
    {
        startP = chrono::system_clock::now();
        count = 0;
        for( unsigned k = 0; k < 10000; k++){
            // Tight unrolled loop with unsigned
            for (unsigned i=0; i<size/8; i+=4) {
                count += _mm_popcnt_u64(buffer[i]);
                count += _mm_popcnt_u64(buffer[i+1]);
                count += _mm_popcnt_u64(buffer[i+2]);
                count += _mm_popcnt_u64(buffer[i+3]);
            }
        }
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
        cout << "unsigned\t" << count << '\t' << (duration/1.0E9) << " sec \t"
             << (10000.0*size)/(duration) << " GB/s" << endl;
    }
    {
        startP = chrono::system_clock::now();
        count=0;
        for( unsigned k = 0; k < 10000; k++){
            // Tight unrolled loop with uint64_t
            for (uint64_t i=0;i<size/8;i+=4) {
                count += _mm_popcnt_u64(buffer[i]);
                count += _mm_popcnt_u64(buffer[i+1]);
                count += _mm_popcnt_u64(buffer[i+2]);
                count += _mm_popcnt_u64(buffer[i+3]);
            }
        }
        endP = chrono::system_clock::now();
        duration = chrono::duration_cast<std::chrono::nanoseconds>(endP-startP).count();
        cout << "uint64_t\t"  << count << '\t' << (duration/1.0E9) << " sec \t"
             << (10000.0*size)/(duration) << " GB/s" << endl;
    }

    delete[] buffer;   // buffer came from new[], so delete[] (not free)
}

As you can see, we create a buffer of random data, with the size being x megabytes, where x is read from the command line. Afterwards, we iterate over the buffer and use an unrolled version of the x86 popcount intrinsic to perform the popcount. To get a more precise result, we do the popcount 10,000 times and measure the time it takes. In the first version, the inner loop variable is unsigned; in the second, it is uint64_t. I thought that this should make no difference, but the opposite is the case.

The (absolutely crazy) results

I compiled it like this (g++ version: Ubuntu 4.8.2-19ubuntu1):

g++ -O3 -march=native -std=c++11 test.cpp -o test

Here are the results on my Haswell Core i7-4770K CPU @ 3.50 GHz, running test 1 (so 1 MB of random data):

  • unsigned 41959360000 0.401554 sec 26.113 GB/s
  • uint64_t 41959360000 0.759822 sec 13.8003 GB/s

As you can see, the throughput of the uint64_t version is only half that of the unsigned version! The problem seems to be that different assembly gets generated, but why? First, I suspected a compiler bug, so I tried clang++ (Ubuntu Clang version 3.4-1ubuntu3):

clang++ -O3 -march=native -std=c++11 test.cpp -o test

Result: test 1

  • unsigned 41959360000 0.398293 sec 26.3267 GB/s
  • uint64_t 41959360000 0.680954 sec 15.3986 GB/s

So, it is almost the same result and is still strange. But now it gets super strange. I replaced the buffer size that was read from input with a compile-time constant, so I changed:

uint64_t size = atol(argv[1]) << 20;

to

uint64_t size = 1 << 20;

Thus, the compiler now knows the buffer size at compile time. Maybe it can add some optimizations! Here are the numbers for g++:

  • unsigned 41959360000 0.509156 sec 20.5944 GB/s
  • uint64_t 41959360000 0.508673 sec 20.6139 GB/s

Now, both versions are equally fast. However, the unsigned one got even slower! It dropped from 26 to 20 GB/s, so replacing a non-constant by a constant value led to a deoptimization. Seriously, I have no clue what is going on here! But now to clang++ with the new version:

  • unsigned 41959360000 0.677009 sec 15.4884 GB/s
  • uint64_t 41959360000 0.676909 sec 15.4906 GB/s

Wait, what? Now, both versions dropped to the slow speed of 15 GB/s. Thus, replacing a non-constant by a constant value even leads to slow code in both cases for Clang!

I asked a colleague with an Ivy Bridge CPU to compile my benchmark. He got similar results, so it does not seem to be specific to Haswell. Because two compilers produce strange results here, it also does not seem to be a compiler bug. We do not have an AMD CPU here, so we could only test on Intel.

More madness, please!

Take the first example (the one with atol(argv[1])) and put a static before the variable, i.e.:

static uint64_t size=atol(argv[1])<<20;

Here are my results with g++:

  • unsigned 41959360000 0.396728 sec 26.4306 GB/s
  • uint64_t 41959360000 0.509484 sec 20.5811 GB/s

Yay, yet another alternative. We still have the fast 26 GB/s with u32, but we managed to get u64 at least from the 13 GB/s to the 20 GB/s version! On my colleague's PC, the u64 version became even faster than the u32 version, yielding the fastest result of all. Sadly, this only works for g++; clang++ does not seem to care about static.

My question

Can you explain these results? Especially:

  • How can there be such a difference between u32 and u64?
  • How can replacing a non-constant by a constant buffer size trigger less optimal code?
  • How can the insertion of the static keyword make the u64 loop faster? Even faster than the original code on my colleague's computer!

I know that optimization is tricky territory; however, I never thought that such small changes could lead to a 100% difference in execution time, and that small factors like a constant buffer size could again mix up the results completely. Of course, I always want to have the version that is able to popcount at 26 GB/s. The only reliable way I can think of is to copy-paste the assembly for this case and use inline assembly. This is the only way I can get rid of compilers that seem to go mad on small changes. What do you think? Is there another way to reliably get the code with the most performance?

The Disassembly

Here is the disassembly for the various results:

26 GB/s version from g++ / u32 / non-const bufsize:

0x400af8:
lea 0x1(%rdx),%eax
popcnt (%rbx,%rax,8),%r9
lea 0x2(%rdx),%edi
popcnt (%rbx,%rcx,8),%rax
lea 0x3(%rdx),%esi
add %r9,%rax
popcnt (%rbx,%rdi,8),%rcx
add $0x4,%edx
add %rcx,%rax
popcnt (%rbx,%rsi,8),%rcx
add %rcx,%rax
mov %edx,%ecx
add %rax,%r14
cmp %rbp,%rcx
jb 0x400af8

13 GB/s version from g++ / u64 / non-const bufsize:

0x400c00:
popcnt 0x8(%rbx,%rdx,8),%rcx
popcnt (%rbx,%rdx,8),%rax
add %rcx,%rax
popcnt 0x10(%rbx,%rdx,8),%rcx
add %rcx,%rax
popcnt 0x18(%rbx,%rdx,8),%rcx
add $0x4,%rdx
add %rcx,%rax
add %rax,%r12
cmp %rbp,%rdx
jb 0x400c00

15 GB/s version from clang++ / u64 / non-const bufsize:

0x400e50:
popcnt (%r15,%rcx,8),%rdx
add %rbx,%rdx
popcnt 0x8(%r15,%rcx,8),%rsi
add %rdx,%rsi
popcnt 0x10(%r15,%rcx,8),%rdx
add %rsi,%rdx
popcnt 0x18(%r15,%rcx,8),%rbx
add %rdx,%rbx
add $0x4,%rcx
cmp %rbp,%rcx
jb 0x400e50

20 GB/s version from g++ / u32&u64 / const bufsize:

0x400a68:
popcnt (%rbx,%rdx,1),%rax
popcnt 0x8(%rbx,%rdx,1),%rcx
add %rax,%rcx
popcnt 0x10(%rbx,%rdx,1),%rax
add %rax,%rcx
popcnt 0x18(%rbx,%rdx,1),%rsi
add $0x20,%rdx
add %rsi,%rcx
add %rcx,%rbp
cmp $0x100000,%rdx
jne 0x400a68

15 GB/s version from clang++ / u32&u64 / const bufsize:

0x400dd0:
popcnt (%r14,%rcx,8),%rdx
add %rbx,%rdx
popcnt 0x8(%r14,%rcx,8),%rsi
add %rdx,%rsi
popcnt 0x10(%r14,%rcx,8),%rdx
add %rsi,%rdx
popcnt 0x18(%r14,%rcx,8),%rbx
add %rdx,%rbx
add $0x4,%rcx
cmp $0x20000,%rcx
jb 0x400dd0

Interestingly, the fastest (26 GB/s) version is also the longest! It seems to be the only solution that uses lea. Some versions use jb to jump, others use jne. But apart from that, all versions seem to be comparable. I don't see where a 100% performance gap could originate from, but I am not too adept at deciphering assembly. The slowest (13 GB/s) version even looks very short and clean. Can anyone explain this?

Lessons learned

No matter what the answer to this question turns out to be, I have learned that in really hot loops every detail can matter, even details that do not seem to have any association with the hot code. I have never thought about what type to use for a loop variable, but as you see, such a minor change can make a 100% difference! Even the storage class of the size variable can make a huge difference, as we saw with the insertion of the static keyword in front of it! In the future, I will always test various alternatives on various compilers when writing really tight and hot loops that are crucial for system performance.

The interesting thing is also that the performance difference is still so high although I have already unrolled the loop four times. So even if you unroll, you can still get hit by major performance deviations. Quite interesting.


Solutions:

Reference 1: https://stackoom.com/question/1hE0T/将-位循环计数器替换为-位会在Intel-CPU上使用-mm-popcnt-u-引发疯狂的性能误差
Reference 2: https://oldbug.net/q/1hE0T/Replacing-a-32-bit-loop-counter-with-64-bit-introduces-crazy-performance-deviations-with-mm-popcnt-u64-on-Intel-CPUs