
08 April 2014

Presentation - Native code performance on modern CPUs


Just a quick summary of a great presentation from Build 2014 called Native Code Performance on Modern CPUs: A Changing Landscape.
The presenter, Eric Brumer (from the Visual C++ Compiler Team), talked, in a quite unique way, about the deep-down details of code optimization: why it is better to let the compiler do the hard work, why the new and powerful FMA instructions can sometimes slow down your code, and how to think about code performance in general.


  • Visual Studio has support for code generation using SIMD instructions: /arch:SSE and /arch:SSE2, and then /arch:AVX and /arch:AVX2. The last one arrives with VS 2013 Update 2 and works on Intel Haswell chips only.
  • Profile, profile, profile! I hear this all the time when watching/reading any presentation talking about performance. Maybe they are all right! :)
  • FMA can slow down the code!
    • It will be faster for a = y*x + z, but not for a = y*x + z*w
    • On Intel, mul is 5 cycles, add is 3 cycles, FMA is 5 cycles.
    • So for the latter equation, the two muls can execute in parallel and then be added - 8 cycles in total.
    • The FMA version will first use a mul for z*w and then an FMA - 10 cycles in total.
    • Conclusion: be careful
  • 256 bit code does not run 2X faster than 128 bit!
    • Computation and instruction execution are 2x faster, but we still have to wait for memory
    • Highly efficient code is actually memory efficient code.
  • In the last part of the presentation there was an analysis of a performance bug in the Eigen3 math library.
    • Compiling with /arch:AVX2 (and /arch:AVX) caused a 60% slowdown on Haswell chips!
    • BTW: there was no difference between /arch:SSE2 and /arch:AVX on Sandy Bridge.
    • The problem was caused by a bottleneck in the CPU store buffer - I hadn't heard about it before, but handling it carefully can give you a huge boost (or cause problems :))
    • Here is a nice link with some more info about store buffers on Sandy Bridge and Haswell
    • CPUs are so powerful that they can 'analyze' the code, and sometimes this can introduce secondary bugs like this one. You need to know your profiling tools to analyze such situations properly.
Wrap up:
Highly efficient code is actually memory efficient code.
Overall, the presentation was great!
Its pace seemed quite slow, but this is actually good - that way you retain more information. I definitely need to look for more presentations from Eric. They are, for instance, here on Channel 9.


© 2017, Bartlomiej Filipek, Blogger platform