Finally, I managed to finish the adventure with my particle system! This time I’d like to share some thoughts about improvements in the OpenGL renderer.
The code got simpler, and I gained a small performance improvement.
- Initial Particle Demo
- Particle Container 1 - problems
- Particle Container 2 - implementation
- Generators & Emitters
- Introduction to Software Optimization
- Tools Optimizations
- Code Optimizations
- Renderer Optimizations
Plan For This Post
The most recent repo: particles/renderer_opt @github
Where are we?
As I described in the post about my current renderer, I use quite a simple approach: copy position and color data into the VBO buffer and then render particles.
Here is the core code of the update proc:
```cpp
glBindBuffer(GL_ARRAY_BUFFER, m_bufPos);
ptr = m_system->getPos(...);
glBufferSubData(GL_ARRAY_BUFFER, 0, size, ptr);

glBindBuffer(GL_ARRAY_BUFFER, m_bufCol);
ptr = m_system->getCol(...);
glBufferSubData(GL_ARRAY_BUFFER, 0, size, ptr);
```
The main problem with this approach is that we need to transfer data from system memory to the GPU. The GPU has to read that data, whether it is explicitly copied into GPU memory or read directly through GART, and only then can it use it in a draw call.
It would be much better to stay entirely on the GPU side, but that is too complicated at this point. Maybe in the next version of my particle system I'll implement it completely on the GPU.
Still, we have some options to increase performance when doing CPU to GPU data transfer.
- Disable VSync! - OK
- Quite easy to forget, but without this we could not measure real performance!
- Small addition: do not overuse blocking calls like timer-query reads. When done badly, they can really spoil the performance: the GPU will simply stall until you read the timer query!
- Single draw call for all particles - OK
- doing one draw call per particle would obviously kill the performance!
- Using Point Sprites - OK
- An interesting test done at Geeks3D showed that point sprites are faster than the geometry-shader approach: up to 30% faster on AMD cards, and 5% to 33% faster on NVidia GPUs. There is an additional note on geometry shaders at joshbarczak.com.
- Of course point sprites are less flexible (do not support rotations), but usually we can live without that.
- Reduce size of the data - Partially
- I send only pos and col, but I am using full FLOAT precision and 4 components per vector.
- Risk: we could reduce vertex size, but that would require doing conversions. Is it worth it?
- In total I use 8 floats per vertex/particle. If a particle system contains 100k particles (not that many!), we transfer 100k * 8 * 4 bytes = 3200 kB ≈ 3 MB of data each frame.
- If we want to use more particles, like 500k, it’ll be around 15MB each frame.
In my last CPU performance tests I got the following numbers: one frame of simulation for each effect, in milliseconds.
Now we need to add the GPU time + memory transfer cost.
As I described in detail in the posts about Persistent Mapped Buffers (PMB), I think it's obvious we should use this approach.
Other options, like buffer orphaning or mapping, might also work, but I think the code would be more complicated.
We can simply use a PMB with 3x the buffer size (triple buffering), which should give the best performance gain.
Here is the updated code:
```cpp
const GLbitfield creationFlags = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT |
                                 GL_MAP_COHERENT_BIT | GL_DYNAMIC_STORAGE_BIT;
const GLbitfield mapFlags = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT |
                            GL_MAP_COHERENT_BIT;
const unsigned int BUFFERING_COUNT = 3;
const GLsizeiptr neededSize = sizeof(float) * 4 * count * BUFFERING_COUNT;

glBufferStorage(GL_ARRAY_BUFFER, neededSize, nullptr, creationFlags);

mappedBufferPtr = glMapBufferRange(GL_ARRAY_BUFFER, 0, neededSize, mapFlags);
```
```cpp
float *posPtr = m_system->getPos(...);
float *colPtr = m_system->getCol(...);
const size_t maxCount = m_system->numAllParticles();

// m_id - id of the current buffer region (0, 1, 2)
// just a memcpy:
mem = m_mappedPosBuf + m_id * maxCount * 4;
memcpy(mem, posPtr, count * sizeof(float) * 4);

mem = m_mappedColBuf + m_id * maxCount * 4;
memcpy(mem, colPtr, count * sizeof(float) * 4);
```
My approach is quite simple and could be improved: since I have a pointer to the mapped memory, I could pass it to the particle system. That way the system would write directly into it, and I would not have to memcpy the data every time.
Another thing: I do not use explicit synchronization. This might cause some issues, but I haven't observed any. Triple buffering should protect us from race conditions. Still, in real production code I would not be so optimistic :)
Initially (AMD HD 5500), FPS (relative change) for each effect:

| count | effect 1 | effect 2 | effect 3 |
|-------|----------|----------|----------|
| 500k  | 81.5 (~5.2%) | 82.2 (~2.9%) | 107.2 (~7.6%) |
| 1mln  | 39.9 (~9.9%) | 31.8 (~8.2%) | 47.2 (~9.3%) |
Reducing vertex size optimization
I tried to reduce the vertex size. I've even asked a question on StackOverflow:
We could use GL_HALF_FLOAT or a vec3 instead of a vec4 for position, and we could also use RGBA8 for color.
Still, after some basic tests, I did not get much performance improvement, perhaps because I lost a lot of time doing the conversions.
The system with its renderer is not that slow. On my system I can get a decent 70..80 FPS for 0.5 mln particles! For a 1-million-particle system it drops down to 30..45 FPS, which is also not that bad!
I would like to present some more 'extraordinary' numbers and say that I got a 200% performance improvement. Unfortunately, it was not that easy... The plan for the next version is definitely to move to the GPU side; hopefully there will be more room for improvement there.
Read next: Summary
- Persistent Mapped Buffers - my two recent posts:
- From “The Hacks Of Life” blog, VBO series:
- Double-Buffering VBOs - part one
- Double-Buffering Part 2 - Why AGP Might Be Your Friend - part two
- One More On VBOs - glBufferSubData - part three
- When Is Your VBO Double Buffered? - part four