Flexible Particle System - Code Optimization

Table of Contents

After playing with the tools we have some more options to improve the performance of the particle system. This time, we need to rewrite some parts of the code.

In total, the particle system runs almost twice as fast as initially! Read more to see what pieces of code were changed.

The Series

Start

We are starting with those numbers, see the previous post (last results)

Core i5 Sandy Bridge

count	tunnel	attractors	fountain
171000	429.195	608.598	460.299
181000	460.649	647.825	490.412
191000	489.206	688.603	520.302

Core i5 Ivy Bridge

count	tunnel	attractors	fountain
171000	529.188	746.594	570.297
181000	565.648	792.824	605.912
191000	593.956	832.478	640.739

(time in milliseconds)

SIMD preparation

Previously I tried to force the compiler to use SSE2 or AVX instructions. As we saw, there was a nice performance boost (around 10% for AVX). But hey… SIMD should calculate things 4x or 8x times faster… so why we got only a little improvement?

In real life it’s not that simple:

SIMD can do 4 or 8 instructions at a time, but we still need to wait for the memory. See my summary of a talk “Native code performance on modern CPUs” for more information. In general, we can get max 2.5x speedup using SSE2/4, assuming we have ideally ‘vectorizable’ code. Not all code is in such a perfect state.
Current CPU’s are superscalar that means CPU can run several different instructions in parallel. Sometimes SIMD code can be even slower than the original code created by a compiler.
Additional small problem: SIMD registers needs memory chunks to be aligned to 128bits (16-byte alignment). We need to take care of this when we allocate new memory. So not every variable or array is good for SSE code.

Whan can we do?

Since particles operate mostly on glm::vec4 there is a high chance to use the full power of SSE. We use 4 floats per vector, 16 bytes.
glm adds a very nice feature glm::simdVec4 which basically adds SSE code to common vector functions. So I simply changed glm::vec4 to glm::simdVec4.
Memory must be aligned, so I used _aligned_malloc and _aligned_free.

Some code examples:

// particles.h, in ParticleData class declaration
glm::simdVec4 *m_pos;
glm::simdVec4 *m_col;

// in particles.cpp, generate() method:
m_pos = (glm::simdVec4 *)_aligned_malloc(sizeof(glm::vec4)*maxSize, 16);
m_col = (glm::simdVec4 *)_aligned_malloc(sizeof(glm::vec4)*maxSize, 16);

// particles.cpp, destructor
_aligned_free(m_pos);
_aligned_free(m_col);

The results after changes (Visual Studio):

Sandy Bridge:

count	tunnel	attractors	fountain
171000	387.563	495.281	394.641
181000	417.320	529.660	426.330
191000	447.665	563.833	450.416

Ivy Bridge:

count	tunnel	attractors	fountain
171000	476.625	596.313	483.656
181000	514.328	639.664	523.332
191000	552.666	682.333	558.667

Wow: almost 20% of improvement! All by proper data structures (for vectors) and memory alignment.

SSE and AVX instructions

So far we got a nice speed up… Now, let’s write some SSE code for most critical loops. Will it run faster?

Euler update, SSE:

__m128 ga = globalA.Data;
__m128 *pa, *pb, pc;
__m128 ldt = _mm_set_ps1(localDT);

size_t i;
for (i = 0; i < endId; i++)
{
    pa = (__m128*)(&p->m_acc[i].x);
    *pa = _mm_add_ps(*pa, ga);
}

for (i = 0; i < endId; i ++)
{
    pa = (__m128*)(&p->m_vel[i].x);
    pb = (__m128*)(&p->m_acc[i].x);
    pc = _mm_mul_ps(*pb, ldt);
    *pa = _mm_add_ps(*pa, pc);
}

for (size_t i = 0; i < endId; i++)
{
    pa = (__m128*)(&p->m_pos[i].x);
    pb = (__m128*)(&p->m_vel[i].x);
    pc = _mm_mul_ps(*pb, ldt);
    *pa = _mm_add_ps(*pa, pc);
}

Readability is much worse in this case.

The results:

Sandy Bridge

count	tunnel	attractors	fountain
171000	386.453	492.727	393.363
181000	416.182	529.591	423.795
191000	444.398	564.199	450.099

Ivy Bridge:

count	tunnel	attractors	fountain
171000	481.172	584.086	486.543
181000	516.271	623.136	514.068
191000	547.034	656.517	541.258

Not much, unfortunately. This is because of glm::simdVec4 which uses SSE code. So there is no point in rewriting it. We lose readability and the performance gain is questionable.

Pointer aliasing: __restrict keyword

In my previous post I got a very interesting comment from Matías N. Goldberg:

… In my experience you get massive performance improvements by just telling the compiler the pointers do not alias to each other…

Matias suggest using __restrict keyword to tell the compiler that pointers are not aliasing. For instance:

glm::vec4 * __restrict acc = p->m_acc;
glm::vec4 * __restrict vel = p->m_vel;
glm::vec4 * __restrict pos = p->m_pos;

And then, instead of p->m_pos just use pos pointer.

When I did such change in all the updaters (and generators) code I got the following results:

Sandy Bridge

count	tunnel	attractors	fountain
171000	372.641	476.820	376.410
181000	401.705	508.353	404.176
191000	427.588	542.794	432.397

Ivy Bridge

count	tunnel	attractors	fountain
171000	475.609	591.805	480.402
181000	502.201	620.601	512.300
191000	534.150	667.575	541.788

This is not a massive improvement, but it’s still worth testing.

Random Number Generator

I focused mostly on the updaters part so far. But, generators could also be improved a bit. In this module a random number generator is heavily used. What if we changed it?

Right now there is standard C rand() function called. For a particle system we, probably, do not need to use something more advanced (like normal distribution random generator) - uniform distribution is just fine… maybe there are some faster generators than the default one?

I’ve searched and found something: here, here and here

I’ve tried to use this generator:

// http://www.rgba.org/articles/sfrand/sfrand.htm
static unsigned int mirand = 1;
float sfrand(void) {
    unsigned int a;
    mirand *= 16807;
    a = (mirand & 0x007fffff) | 0x40000000;
    return(*((float*)&a) - 3.0f);
}

It has a uniform distribution and 23 bits of precision (C rand() has only 16 bits).

The results:

Sandy Bridge:

count	tunnel	attractors	fountain
171000	334.633	443.816	348.908
181000	363.954	474.477	372.739
191000	384.869	501.435	394.217

Ivy Bridge:

count	tunnel	attractors	fountain
171000	412.172	531.586	429.293
181000	450.146	573.073	463.037
191000	473.518	606.759	484.880

Wow! Now it is around 28% of total improvement for Sandy Bridge and almost the same for Ivy Bridge.

Wrap up

Final results

CPU	count	tunnel	attractors	fountain
Sandy	191000	384.869 (-21.3%)	501.435 (-27.2%)	394.217 (-24.2%)
Ivy	191000	473.518 (-20.3%)	606.759 (-27.1%)	484.880 (-24.3%)

Total (taking times before tools optimization):

CPU	tunnel	attractors	fountain
Sandy	35.5%	43,5%	39,7%
Ivy	33.2%	38,2%	35,6%

We can ‘reverse’ those numbers and say that now the attractor effect runs almost two times faster! Not that bad!

Conclusion:

Memory alignment and proper data structures are the key factors.
Write SIMD code only if needed, usually it is better to rely on a compiler and third party libraries.
Describe your code better: for instance using __restrict keyword. That way a compiler can generate better code.
Random number generator can make a difference

What’s Next

The renderer is very simple so far. Maybe, there are some options to improve its code. For sure, we have to look at CPU to GPU memory transfers and better buffers usage.

Flexible Particle System - Code Optimization

The Series

Plan for this post

Start

SIMD preparation

SSE and AVX instructions

Pointer aliasing: __restrict keyword

Random Number Generator

Wrap up

What’s Next

References

Similar Articles: