Flexible particle system - Optimization through tools

In this post I will test several compiler options and switches that could make the particle system run faster.

Read on to see how I reached a performance improvement of around 20%!

Start

We are starting with those numbers (Core i5 Sandy Bridge):

count	tunnel	attractors	fountain
151000	229.5	576.25	451.625
161000	465.813	727.906	541.453
171000	527.227	790.113	582.057
181000	563.028	835.014	617.507
191000	596.754	886.877	653.938

Core i5 Ivy Bridge:

count	tunnel	attractors	fountain
151000	283.5	646.75	527.375
161000	555.688	812.344	629.172
171000	628.586	879.293	671.146
181000	670.073	932.537	710.768
191000	709.384	982.192	752.596

(time in milliseconds)

The above results come from running 200 ‘frames’ of the particle system’s update method. No rendering, only CPU work. count is the number of particles in a given system. You can read more about this benchmark in the previous post.
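The shape of such a measurement can be sketched as below. This is only an illustration, not the actual cpuTest code: ToyParticles and timeUpdates are hypothetical names, and the toy update does far less work than the real effects.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the real particle system; only the shape of the
// benchmark loop matters here.
struct ToyParticles {
    std::vector<float> pos;
    std::vector<float> vel;

    explicit ToyParticles(std::size_t count) : pos(count, 0.0f), vel(count, 1.0f) {}

    // One 'frame': move every particle by its velocity.
    void update(float dt) {
        for (std::size_t i = 0; i < pos.size(); ++i)
            pos[i] += vel[i] * dt;
    }
};

// Runs `frames` updates (no rendering) and returns the elapsed wall-clock
// time in milliseconds.
double timeUpdates(ToyParticles& system, int frames, float dt) {
    assert(frames > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int f = 0; f < frames; ++f)
        system.update(dt);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Calling timeUpdates(system, 200, dt) for each particle count would produce one cell of a table like the ones above.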

And the Visual Studio configuration:

Optimization: /O2
Inline Function Expansion: Default
Favor Size or Speed: Neither
Whole program optimization: Yes
Enable enhanced instruction set: not set
Floating point model: /fp:precise (default)

Of course, we are interested in making the above results faster. I also wonder which Visual Studio compiler options offer potential performance improvements.

Floating-point semantics mode

By default Visual Studio uses the /fp:precise floating-point semantics mode. It produces reasonably fast code with safe and accurate results. All calculations are done in the highest available precision. The compiler can reorder instructions, but only when that does not change the final value.

In particle system simulation we do not need so much precision. This is not a complex and accurate physics simulation, so we could trade precision for performance. We use floats only and small errors usually won’t be visible.

With /fp:fast the compiler relaxes its rules so that more optimizations can be applied automatically. Computations are usually performed at lower precision, so we do not lose time converting to and from 80-bit precision. Additionally, the compiler can reorder instructions, even if that changes the final result a bit.
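Why can reordering change the result at all? Floating-point addition is not associative. The sketch below (function names are mine, chosen for illustration) adds the same three floats in two different orders; the volatile temporaries only force each intermediate sum to be rounded to float, so the two orders stay distinct regardless of compiler settings.

```cpp
#include <cassert>

// Evaluates (a + b) + c, rounding the intermediate sum to float.
float sumLeftToRight(float a, float b, float c) {
    volatile float partial = a + b;
    return partial + c;
}

// Evaluates a + (b + c), rounding the intermediate sum to float.
float sumRightToLeft(float a, float b, float c) {
    volatile float partial = b + c;
    return a + partial;
}
```

With a = 1e8f, b = -1e8f and c = 1.0f the first order yields 1 and the second yields 0, because -1e8f + 1.0f rounds back to -1e8f. Under /fp:precise the compiler has to keep the order written in the source; under /fp:fast it is allowed to pick whichever order is cheaper.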

By switching from /fp:precise to /fp:fast I got the following results:

Core i5 Sandy Bridge

count	tunnel	attractors	fountain
171000	497.953	700.477	535.738
181000	533.369	744.185	569.092
191000	565.046	787.023	601.512

Core i5 Ivy Bridge

count	tunnel	attractors	fountain
171000	597.242	823.121	635.061
181000	635.53	872.765	675.883
191000	674.441	924.721	713.86

So the improvement is around 5%, or even 11% in the best case.

Enable enhanced instruction set

Since SIMD instructions have been available for quite a long time, it would be wise to use those options as well. According to Wikipedia:

SSE2 appeared in Pentium 4 - 2001 or in AMD’s Athlon 64 - 2003
SSE4 appeared in Intel Core microarchitecture - 2006 or in AMD’s K10 - 2007
AVX are available since Sandy Bridge (2011) or AMD’s Bulldozer (2011)

Unfortunately, in my case adding /arch:SSE2 makes no difference. It appe…

But when I used /arch:AVX the timings were a bit better:

Core i5 Sandy Bridge

count	tunnel	attractors	fountain
171000	429.195	608.598	460.299
181000	460.649	647.825	490.412
191000	489.206	688.603	520.302

Core i5 Ivy Bridge

count	tunnel	attractors	fountain
151000	230.000	508.000	415.000
161000	439.500	646.750	494.375
171000	493.688	694.344	531.672
181000	534.336	748.168	568.584
191000	565.792	798.396	613.198

This time the improvement is around 20% on Sandy Bridge and around 15% on Ivy Bridge. Of course, /fp:fast is also enabled.

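BTW: When I used /arch:AVX2 the application crashed :) That is expected: AVX2 is only available from Haswell onward, so neither Sandy Bridge nor Ivy Bridge can execute those instructions.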
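To double-check which instruction set a given build actually targets, one option is to look at the compiler's predefined macros. The helper below is a small sketch (the function name is hypothetical); it relies on __AVX__ and __AVX2__, which MSVC, GCC and Clang define when the matching option is active, plus _M_IX86_FP and _M_X64 for MSVC's SSE2 case.

```cpp
#include <string>

// Reports the widest instruction set the compiler was allowed to target.
// Note: this is a compile-time property of the binary, not a check of the
// CPU the program is running on.
std::string targetedInstructionSet() {
#if defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
    return "SSE2";
#else
    return "baseline";
#endif
}
```

Printing this at startup of the benchmark makes it harder to mix up timings from builds with different /arch settings.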

Additional switches

I’ve tried using other compiler switches: Inline Function Expansion, Favor Size or Speed, Whole program optimization. Unfortunately, I got almost no difference in terms of performance.

Something missing?

Hmm… but what about auto vectorization and auto parallelization? Maybe they could help? Why not use those powerful features as well? In fact, it would be better to rely on the compiler to do most of the job, instead of manually rewriting the code.

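In Visual Studio (since VS 2012) there are two relevant features: auto-vectorization and auto-parallelization. As far as I know, the auto-vectorizer is already enabled by default at /O2, while the auto-parallelizer has to be switched on with /Qpar. As the names suggest, they should automatically use vector instructions and distribute work among the available cores.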

I do not have much experience using those switches, but in my case they simply do not work and I got no performance improvement.

To know what is going on with the ‘auto’ switches you have to use the additional /Qvec-report and /Qpar-report compiler options. The compiler will then show which loops were vectorized or parallelized, and where it had problems. On MSDN there is a whole page that describes all the possible issues that can block the ‘auto’ features.
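As a rough illustration of what such a report distinguishes, here are two loops (hypothetical examples, not taken from the particle system). In the first one every iteration is independent, which is the kind of loop an auto-vectorizer can usually handle. The second has a loop-carried dependency, a typical reason for a loop to be rejected.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Independent iterations: pos[i] depends only on element i.
void integrate(std::vector<float>& pos, const std::vector<float>& vel, float dt) {
    assert(pos.size() == vel.size());
    const std::size_t n = pos.size();
    for (std::size_t i = 0; i < n; ++i)
        pos[i] += vel[i] * dt;
}

// Loop-carried dependency: element i needs the freshly written element i - 1,
// so the iterations cannot simply be executed side by side.
void prefixSum(std::vector<float>& values) {
    for (std::size_t i = 1; i < values.size(); ++i)
        values[i] += values[i - 1];
}
```

Whether a particular compiler version actually vectorizes the first loop still depends on details such as aliasing and the enabled instruction set, so the report remains the final authority.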

I definitely need to look closer at those powerful ‘auto’ features and figure out how to use them properly.

BTW: What is the difference between auto vectorization and enable enhanced instruction set options?

Bonus: GCC (mingw) results

Although compiling the full particle demo (graphics) with a different compiler would be quite problematic, there is no such problem with ‘cpuTest’. This benchmark is only a simple console application, so I’ve managed to rebuild it using GCC (MinGW version). Here are the results:

32bit, Ivy Bridge

GCC 4.8.1, -march=native -mavx -Ofast -m32 -std=c++11 -ffast-math

count	tunnel	attractors	fountain
151000	230.000	508.000	415.000
161000	439.500	646.750	494.375
171000	493.688	694.344	531.672
181000	534.336	748.168	568.584
191000	565.792	798.396	613.198

64bit, Ivy Bridge

-march=native -mavx -Ofast -m64 -std=c++11 -ffast-math

count	tunnel	attractors	fountain
151000	251.000	499.500	406.750
161000	459.875	622.438	473.719
171000	505.359	672.180	510.590
181000	539.795	714.397	546.199
191000	576.099	764.050	579.525

It seems that the GCC optimizer does a much better job than Visual Studio (764.050 ms vs 832.478 ms for attractors with 191000 particles)!

Wrap up & What’s Next

This was quite fast: I tested several Visual Studio compiler switches, and it turned out that only the floating-point mode and the enhanced instruction set options improved performance in a visible way.

Final results:

CPU	count	tunnel	attractors	fountain
Sandy	191000	489.206 (-18.02%)	688.603 (-22.36%)	520.302 (-20.44%)
Ivy	191000	593.956 (-15.66%)	832.478 (-14.77%)	640.739 (-15.15%)

In the end there is a speed-up of around 20% on Sandy Bridge and around 15% on Ivy Bridge. This is definitely not a huge factor, but still quite nice. It took only several clicks of the mouse! ;)
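The percentages in the table are relative changes against the starting numbers. A minimal sketch of that arithmetic (percentChange is a name I made up for this example):

```cpp
#include <cassert>

// Relative change of `current` against `baseline`, in percent.
// Negative values mean the run got faster.
double percentChange(double baseline, double current) {
    assert(baseline > 0.0);
    return (current - baseline) / baseline * 100.0;
}
```

For the Sandy Bridge tunnel effect at 191000 particles, percentChange(596.754, 489.206) gives roughly -18.02, which matches the value in the table.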

Question: Do you know other useful Visual Studio/GCC compiler options that could help in this case?

Next time, I will try to show how to further improve the performance by using SIMD instructions. By rewriting some of the critical code parts we can utilize even more of the CPU power.

Want to Help and Test?

Just as an experiment, it would be nice to compile the code with GCC or Clang and compare the results, or to run it on a different CPU. If you want to help, the repository is on GitHub, and if you have timings please let me know.

The easiest way is to download exe files (should be virus-free, but please double check!) and save the results to a txt file.