This section offers advice on making your OpenGL programs go faster and run smoother.
Perhaps the most common error in measuring the performance of an OpenGL program is to say something like:
start_the_clock () ; draw_a_bunch_of_polygons () ; stop_the_clock () ; swapbuffers () ;
This fails because OpenGL implementations are almost always pipelined - that is to say, things are not necessarily drawn when you tell OpenGL to draw them - and the fact that an OpenGL call returned doesn't mean it finished rendering.
Typically, there is a large FIFO buffer on the front end of the graphics card. When your application code sends polygons to OpenGL, the driver places them into the FIFO and returns to your application. Sometime later, the graphics processor picks up the data from the FIFO and draws it.
Hence, measuring the time it took to pass the polygons to OpenGL tells you very little about performance. It is wise to use a 'glFinish' command between drawing the polygons and stopping the clock - this theoretically forces all of the rendering to be completed before it returns - so you get an accurate idea of how long the rendering took. That works to a degree - but experience suggests that not all implementations have literally completed processing by then.
Worse still, there is the possibility that when you started drawing polygons, the graphics card was already busy because of something you did earlier. Even a 'swapbuffer' call can leave work for the graphics system to do that can hang around and slow down subsequent operations in a mysterious fashion. The cure for this is to put a 'glFinish' call before you start the clock as well as just before you stop the clock.
However, eliminating these overlaps in time can also be misleading. If you measure the time it takes to draw 1000 triangles (with a glFinish in front and behind), you'll probably be happy to discover that it doesn't take twice as long to draw 2000 of them.
Graphics cards are quite complex parallel systems and it's exceedingly hard to measure precisely what's going on.
The best practical solution is to measure the time between consecutive returns from the swapbuffer command with your entire application running. You can then adjust one piece of the code at a time, remeasure and get a reasonable idea of the practical improvements you are getting as a result of your optimisation efforts.
Even when you do this, you may have to take a little care - some systems force the swapbuffers command to wait for the next video vertical retrace before performing the swap - if that is the case then you'll only ever see times that are an exact multiple of the video frame time and it will be impossible to see exactly how much time you are consuming. However, most PC graphics adaptors do not do this by default - so you would probably have to have taken an active step to turn this feature on. However, if you have (say) a 60Hz monitor and your times all come out at either 16.66ms, 33.33ms or 50ms then suspect that you have a problem.
Understanding where the bottlenecks are
There are generally four things to look at initially.
- Your CPU performance. If your code is so slow that it's not feeding the graphics pipe at the maximum rate it could go - then improving the nature of the things you actually draw won't help much.
- Bus bandwidth. There is a finite limit to the rate at which data can be sent from the CPU to the graphics card. If you require too much data to describe what you have to draw - then you may be unable to keep the graphics card busy simply because you can't get it the data it needs. Consider using techniques like compiled vertex arrays and display lists to place the bulky data onto the graphics card just once so you can cause it to be rendered with a compact command such as 'glCallList'.
- Vertex performance. The first thing of any significance that happens on the graphics card is vertex processing. If you are using the standard pipeline then lighting and vertex transformation are done here - if you are using a vertex shader then running the vertex shader can be a bottleneck. This is usually easy to diagnose by replacing the shader with something (for example) without lighting calculations and see if things speed up. If they do - then the odds are good that you are vertex limited. If not...not.
- Fragment performance. After vertex processing, the polygons are chopped up into fragments (typically the size of a pixel) and the fragment processing takes over. Fragment processing takes longer for large polygons than for small ones - and generally gets slower the more textures you use - and the more complex your fragment shader is (if you are using one).
If you are fragment processing bound, a really simple test is to reduce the size of the window you are rendering into down to the size of a postage stamp and see if your frame rate improves. If it does then you are at least partially fill rate limited - if it doesn't then that's not the problem. This is such a simple test that it should always be the first thing you try.
There are more subtleties here - but this is a start.
Toolkits for understanding Performance
- OpenGL Benchmarks comparing:
* C vs Perl * Perl vs Python * POGL vs SDL::OpenGL * Windows vs Linux