
29 January 2015

Persistent Mapped Buffers, Benchmark Results

In this second part of the article about persistent mapped buffers, I share results from the demo app.

I've compared single-, double- and triple-buffering approaches for persistent mapped buffers. Additionally, there is a comparison with the standard methods: glBuffer*Data and glMapBuffer.

Note:
This post is the second part of the article about Persistent Mapped Buffers;
see the first part here - introduction

Demo

Github repo: fenbf/GLSamples

How it works:

  • the app shows a number of rotating 2D triangles (wow!)
  • triangles are updated on the CPU and then sent (streamed) to the GPU
  • drawing is based on the glDrawArrays command
  • in benchmark mode I run the app for N seconds (usually 5s) and then count how many frames I got
  • additionally, I measure a counter that is incremented each time we need to wait for the buffer
  • vsync is disabled

Features:

  • configurable number of triangles
  • configurable number of buffers: single/double/triple
  • optional syncing
  • optional debug flag
  • benchmark mode (quit app after N seconds)

Code bits

Init buffer:

size_t bufferSize{ gParamTriangleCount * 3 * sizeof(SVertex2D)};
if (gParamBufferCount > 1)
{
  bufferSize *= gParamBufferCount;
  gSyncRanges[0].begin = 0;
  gSyncRanges[1].begin = gParamTriangleCount * 3;
  gSyncRanges[2].begin = gParamTriangleCount * 3 * 2;
}

flags = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT;
glBufferStorage(GL_ARRAY_BUFFER, bufferSize, 0, flags);
gVertexBufferData = (SVertex2D*)glMapBufferRange(GL_ARRAY_BUFFER, 
                                           0, bufferSize, flags);

Display:

void Display() {
  glClear(GL_COLOR_BUFFER_BIT);
  gAngle += 0.001f;

  if (gParamSyncBuffers)
  {
    if (gParamBufferCount > 1)
      WaitBuffer(gSyncRanges[gRangeIndex].sync);
    else
      WaitBuffer(gSyncObject);
  }

  size_t startID = 0;

  if (gParamBufferCount > 1)
    startID = gSyncRanges[gRangeIndex].begin;

  for (size_t i(0); i != gParamTriangleCount * 3; ++i)
  {
    gVertexBufferData[i + startID].x = genX(gReferenceTrianglePosition[i].x);
    gVertexBufferData[i + startID].y = genY(gReferenceTrianglePosition[i].y);
  }

  glDrawArrays(GL_TRIANGLES, startID, gParamTriangleCount * 3);

  if (gParamSyncBuffers)
  {
    if (gParamBufferCount > 1)
      LockBuffer(gSyncRanges[gRangeIndex].sync);
    else
      LockBuffer(gSyncObject);
  }

  gRangeIndex = (gRangeIndex + 1) % gParamBufferCount;

  glutSwapBuffers();
  gFrameCount++;

  if (gParamMaxAllowedTime > 0 &&
      glutGet(GLUT_ELAPSED_TIME) > gParamMaxAllowedTime)
    Quit();
}

WaitBuffer:

void WaitBuffer(GLsync& syncObj)
{
  if (syncObj)
  {
    while (1)
    {
      GLenum waitReturn = glClientWaitSync(syncObj, 
                                       GL_SYNC_FLUSH_COMMANDS_BIT, 1);
      if (waitReturn == GL_ALREADY_SIGNALED ||
          waitReturn == GL_CONDITION_SATISFIED)
        return;

      gWaitCount++;    // the counter
    }
  }
}
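For reference, LockBuffer is called from Display() but not listed here. A minimal sketch of what such a lock usually looks like (it mirrors WaitBuffer's use of GLsync; treat it as an illustration, not necessarily the exact demo code):

```cpp
void LockBuffer(GLsync& syncObj)
{
  // Discard the previous fence (if any) and insert a new one right after
  // the draw call that read from this buffer range. WaitBuffer will later
  // block until the GPU has passed this point in the command stream.
  if (syncObj)
    glDeleteSync(syncObj);

  syncObj = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
}
```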

Test Cases

I've created a simple batch script that:

  • runs tests for 10, 100, 1000, 2000 and 5000 triangles
  • each test takes 5 seconds and covers:
    • persistent_mapped_buffer single_buffer sync
    • persistent_mapped_buffer single_buffer no_sync
    • persistent_mapped_buffer double_buffer sync
    • persistent_mapped_buffer double_buffer no_sync
    • persistent_mapped_buffer triple_buffer sync
    • persistent_mapped_buffer triple_buffer no_sync
    • standard_mapped_buffer glBuffer*Data orphan
    • standard_mapped_buffer glBuffer*Data no_orphan
    • standard_mapped_buffer glMapBuffer orphan
    • standard_mapped_buffer glMapBuffer no_orphan
  • in total: 5 × 10 × 5 sec = 250 sec
  • no_sync means that there is no locking or waiting for the buffer range. That can potentially cause a race condition or even an application crash - use it at your own risk! (at least in my case nothing happened - maybe a little bit of dancing vertices :) )
  • 2k triangles use: 2000*3*2*4 bytes = 48 kbytes per frame. This is quite a small number. In the follow-up to this experiment I'll try to increase that and stress CPU-to-GPU bandwidth a bit more.

Orphaning:

  • for glMapBufferRange I add the GL_MAP_INVALIDATE_BUFFER_BIT flag
  • for glBuffer*Data I call glBufferData(NULL) followed by a normal call to glBufferSubData.
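The two variants above can be sketched like this (function names are mine, not from the demo; both assume the vertex buffer is already bound to GL_ARRAY_BUFFER):

```cpp
// Variant 1: glBufferData with a null pointer "orphans" the old storage,
// so the driver can hand back fresh memory instead of stalling on GPU use
// of the old allocation; then upload normally.
void StreamWithBufferDataOrphaning(GLsizeiptr bufferSize, const void* vertices)
{
  glBufferData(GL_ARRAY_BUFFER, bufferSize, nullptr, GL_STREAM_DRAW); // orphan
  glBufferSubData(GL_ARRAY_BUFFER, 0, bufferSize, vertices);          // upload
}

// Variant 2: GL_MAP_INVALIDATE_BUFFER_BIT requests the same effect on the
// mapping path - the previous contents may be discarded.
void StreamWithMapOrphaning(GLsizeiptr bufferSize, const void* vertices)
{
  void* ptr = glMapBufferRange(GL_ARRAY_BUFFER, 0, bufferSize,
                               GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT);
  if (ptr)
  {
    memcpy(ptr, vertices, bufferSize);
    glUnmapBuffer(GL_ARRAY_BUFFER);
  }
}
```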

Results

All results can be found on github: GLSamples/project/results

100 Triangles

GeForce 460 GTX (Fermi), Sandy Bridge Core i5 2400, 3.1 GHz

GeForce 460 GTX, buffer streaming

Wait counter:

  • Single buffering: 37887
  • Double buffering: 79658
  • Triple buffering: 0

AMD HD5500, Sandy Bridge Core i5 2400, 3.1 GHz

AMD HD5500, buffer streaming

Wait counter:

  • Single buffering: 1594647
  • Double buffering: 35670
  • Triple buffering: 0

Nvidia GTX 770 (Kepler), Sandy Bridge i5 2500K @ 4 GHz

Nvidia GTX 770, buffer streaming

Wait counter:

  • Single buffering: 21863
  • Double buffering: 28241
  • Triple buffering: 0

Nvidia GTX 850M (Maxwell), Ivy Bridge i7-4710HQ

Nvidia GTX 850M, buffer streaming

Wait counter:

  • Single buffering: 0
  • Double buffering: 0
  • Triple buffering: 0

All GPUs

With Intel HD4400 and NV 720M

All GPUs, 100 tris, buffer streaming

2000 Triangles

GeForce 460 GTX (Fermi), Sandy Bridge Core i5 2400, 3.1 GHz

GeForce 460 GTX, buffer streaming

Wait counter:

  • Single buffering: 2411
  • Double buffering: 4
  • Triple buffering: 0

AMD HD5500, Sandy Bridge Core i5 2400, 3.1 GHz

AMD HD5500, buffer streaming

Wait counter:

  • Single buffering: 79462
  • Double buffering: 0
  • Triple buffering: 0

Nvidia GTX 770 (Kepler), Sandy Bridge i5 2500K @ 4 GHz

Nvidia GTX 770, buffer streaming

Wait counter:

  • Single buffering: 10405
  • Double buffering: 404
  • Triple buffering: 0

Nvidia GTX 850M (Maxwell), Ivy Bridge i7-4710HQ

Nvidia GTX 850M, buffer streaming

Wait counter:

  • Single buffering: 8256
  • Double buffering: 91
  • Triple buffering: 0

All GPUs

With Intel HD4400 and NV 720M

All GPUs, 2000 tris, buffer streaming

Summary

  • Persistent Mapped Buffers (PMB) with triple buffering and no synchronization seem to be the fastest approach in most tested scenarios.
    • Only the Maxwell (850M) GPU has issues with that: slow for 100 triangles, and for 2k triangles it's better to use double buffering.
  • PMB with double buffering seems to be only a bit slower than triple buffering, but sometimes the 'wait counter' was non-zero, meaning we had to wait for the buffer. Triple buffering has no such problem, so no waiting is needed.
    • Using double buffering without syncing might work, but we can expect artifacts. (This needs more verification.)
  • Single buffering (PMB) with syncing is quite slow on NVidia GPUs.
  • Using glMapBuffer without orphaning is the slowest approach.
  • Interestingly, glBuffer*Data with orphaning seems to be comparable to PMB, so old code that uses this approach might still be quite fast!

TODO: use Google Charts for better visualization of the results

Please Help

If you'd like to help, you can run the benchmark on your own machine and send me (bartlomiej DOT filipek AT gmail) the results.

Windows only. Sorry :)

Benchmark_pack 7zip @github

Go to benchmark_pack and execute the batch file run_from_10_to_5000.bat.

run_from_10_to_5000.bat > my_gpu_name.txt

The script runs all the tests and takes around 250 seconds.

If you are not sure whether your GPU supports the ARB_buffer_storage extension, you can simply run persistent_mapped_buffers.exe alone and it will report potential problems.


© 2017, Bartlomiej Filipek