- Visual Studio supports code generation using SIMD instructions via /arch:SSE, /arch:SSE2, and later /arch:AVX and /arch:AVX2. The last one is available starting with VS 2013 Update 2 and targets Intel Haswell chips only.
- Profile, profile, profile! I hear this all the time in presentations about performance. Maybe they are all right! :)
- FMA can slow down the code!
- It will be faster for a = y*x + z, but not for a = y*x + z*w
- On Intel, mul takes 5 cycles, add takes 3 cycles, and FMA takes 5.
- So for the latter expression the two muls will execute in parallel and then be added - 8 cycles in total
- The FMA version will first use a mul for z*w and then an FMA - 10 cycles in total.
- Conclusion: be careful
- 256-bit code does not run 2x faster than 128-bit code!
- Computation and instruction execution are 2x faster, but we still have to wait for memory
- Highly efficient code is actually memory efficient code.
- The last part of the presentation analyzed a performance bug in the Eigen3 math library
- Compiling with /arch:AVX2 (and /arch:AVX) caused 60% slowdown on Haswell chips!
- BTW: there was no difference between /arch:SSE2 and /arch:AVX on Sandy Bridge
- The problem was caused by a bottleneck in the CPU store buffer - I hadn't heard about it before, but handling it carefully can give you a huge boost (or problems :))
- Here is a nice-looking link with some more info about store buffers on Sandy Bridge and Haswell
- CPUs are so powerful that they can 'analyze' the code, and sometimes this introduces subtle secondary bugs like this one. You need to know your profiling tools to analyze such situations properly.