15 September 2014

Flexible particle system - Tools optimization


In this post I will test several compiler options and switches that could make the particle system run faster.

Read more to see how I achieved a performance improvement of around 20%!


Start


We are starting with these numbers (Core i5 Sandy Bridge):

count tunnel attractors fountain
151000 229.5 576.25 451.625
161000 465.813 727.906 541.453
171000 527.227 790.113 582.057
181000 563.028 835.014 617.507
191000 596.754 886.877 653.938

Core i5 Ivy Bridge:

count tunnel attractors fountain
151000 283.5 646.75 527.375
161000 555.688 812.344 629.172
171000 628.586 879.293 671.146
181000 670.073 932.537 710.768
191000 709.384 982.192 752.596

(time in milliseconds)

The above results come from running 200 'frames' of the particle system's update method. There is no rendering, only CPU work. The count column is the number of particles in a given system. You can read more about this benchmark in the previous post.

And the Visual Studio configuration:

  • Optimization: /O2
  • Inline Function Expansion: Default
  • Favor Size or Speed: Neither
  • Whole program optimization: Yes
  • Enable enhanced instruction set: not set
  • Floating point model: /fp:precise (default)

Of course, we are interested in making the above results faster. I am also wondering which Visual Studio compiler options offer potential performance improvements.

Floating-point semantics mode

By default, Visual Studio uses the /fp:precise floating-point semantics mode. It produces reasonably fast code while keeping the results safe and accurate. All calculations are done in the highest available precision. The compiler can reorder instructions, but only when that does not change the final value.

In particle system simulation we do not need so much precision. This is not a complex and accurate physics simulation, so we could trade precision for performance. We use floats only and small errors usually won't be visible.

With /fp:fast the compiler relaxes its rules so that more optimizations can be applied automatically. Computations will usually be performed at lower precision, so we do not lose time converting to and from 80-bit precision. Additionally, the compiler can reorder instructions, even if that changes the final result a bit.

By switching from /fp:precise to /fp:fast I got the following results:

Core i5 Sandy Bridge

count tunnel attractors fountain
171000 497.953 700.477 535.738
181000 533.369 744.185 569.092
191000 565.046 787.023 601.512

Core i5 Ivy Bridge

count tunnel attractors fountain
171000 597.242 823.121 635.061
181000 635.53 872.765 675.883
191000 674.441 924.721 713.86

So the improvement ranges from around 5% (tunnel) to around 11% (attractors), depending on the effect.

Enable enhanced instruction set

Since SIMD instructions have been available for quite a long time, it would be wise to use those options as well. According to Wikipedia:

  • SSE2 appeared in Pentium 4 - 2001 or in AMD's Athlon 64 - 2003
  • SSE4 appeared in Intel Core microarchitecture - 2006 or in AMD's K10 - 2007
  • AVX has been available since Sandy Bridge (2011) or AMD's Bulldozer (2011)

Unfortunately, in my case adding /arch:SSE2 made no difference. It appe

But when I used /arch:AVX the timings were a bit better:

Core i5 Sandy Bridge

count tunnel attractors fountain
171000 429.195 608.598 460.299
181000 460.649 647.825 490.412
191000 489.206 688.603 520.302

Core i5 Ivy Bridge

count tunnel attractors fountain
171000 529.188 746.594 570.297
181000 565.648 792.824 605.912
191000 593.956 832.478 640.739

This time the improvement over the starting numbers is around 20% on Sandy Bridge and around 15% on Ivy Bridge. Of course, /fp:fast is also enabled.

BTW: When I used /arch:AVX2 the application crashed :)

Additional switches

I've tried using other compiler switches: Inline Function Expansion, Favor Size or Speed, Whole program optimization. Unfortunately, I got almost no difference in terms of performance.

Something missing?


Hmm… but what about auto vectorization and auto parallelization? Maybe they could help? Why not use those powerful features as well? In fact, it would be better to rely on the compiler to do most of the job, instead of manually rewriting the code.

In Visual Studio (since VS 2012) there are two important options: /Qvec and /Qpar. Those options should, as their names suggest, automatically use vector instructions and distribute tasks among other cores.

I do not have much experience using those switches, but in my case they simply did not help: I got no performance improvement.

To know what is going on with the 'auto' switches you have to use the additional /Qvec-report and /Qpar-report compiler options. Then the compiler will show which loops were vectorized or parallelized, or in which places it had problems. On MSDN there is a whole page that describes all the possible issues that can block the 'auto' features.

I definitely need to look closer at those powerful 'auto' features and figure out how to use them properly.

BTW: What is the difference between auto vectorization and the 'Enable enhanced instruction set' option?


Bonus: GCC (mingw) results

Although compiling the full particle demo (with graphics) with a different compiler would be quite problematic, there is no such problem with 'cpuTest'. This benchmark is only a simple console application, so I've managed to rebuild it using GCC (MinGW version). Here are the results:

32bit, Ivy Bridge

GCC 4.8.1, -march=native -mavx -Ofast -m32 -std=c++11 -ffast-math
count tunnel attractors fountain
151000 230.000 508.000 415.000
161000 439.500 646.750 494.375
171000 493.688 694.344 531.672
181000 534.336 748.168 568.584
191000 565.792 798.396 613.198

64bit, Ivy Bridge

-march=native -mavx -Ofast -m64 -std=c++11 -ffast-math
count tunnel attractors fountain
151000 251.000 499.500 406.750
161000 459.875 622.438 473.719
171000 505.359 672.180 510.590
181000 539.795 714.397 546.199
191000 576.099 764.050 579.525

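It seems that GCC's optimizer does a better job than Visual Studio's! For attractors with 191000 particles on Ivy Bridge, the 64-bit GCC build takes 764.050 ms versus 832.478 ms for Visual Studio with /arch:AVX and /fp:fast, which is roughly 8% faster.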

Wrap up & What's Next

This was quite fast: I've tested several Visual Studio compiler switches and it appeared that only floating point mode and enhanced instruction set options improved the performance in a visible way.

Final results:

CPU count tunnel attractors fountain
Sandy 191000 489.206 (-18.02%) 688.603 (-22.36%) 520.302 (-20.44%)
Ivy 191000 593.956 (-16.27%) 832.478 (-15.24%) 640.739 (-14.86%)

In the end, the speed-up is around 20% on Sandy Bridge and around 15% on Ivy Bridge. This is definitely not a huge factor, but still quite nice. It took only several clicks of the mouse! ;)

Question: Do you know other useful Visual Studio/GCC compiler options that could help in this case?

Next time, I will try to show how to further improve the performance by using SIMD instructions. By rewriting some of the critical code parts we can utilize even more of the CPU power.

Read next: Code Optimizations

Want to Help and Test?

Just as an experiment, it would be nice to compile the code with GCC or Clang and compare the results, or to run it on a different CPU. If you want to help, the repository is on GitHub, and if you have the timings please let me know.

The easiest way is to download exe files (should be virus-free, but please double check!) and save the results to a txt file.


Copyright Bartlomiej Filipek, 2016, Blogger platform