Saturday, May 21, 2011

39 - Saturday Performance School

So let us look at performance one last time, on what seems to be the final version of the isometric rendering engine before "Stoneage". I have some better ideas than the current implementation, but they are a radical departure and I don't want to rewrite it.

I will be using a nonrandom, uniform and symmetric map. This may seem like cheating to some, but it is not: if I get a, let's say, 10% improvement, that improvement will carry over to random maps too. A static and uniform map simply allows me to measure performance gains more reliably.

I will be using mostly pure software rendering, at 1280x720 resolution, with the game compiled in debug mode. Using hardware acceleration delegates a lot of the work to the GPU, so it would be harder to see whether my changes have an impact on performance, since I can only influence the CPU. But I will measure performance for other setups as well once in a while. Results will be specific to my machine, and I will not be using the fastest machine available to me.
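(For reference, the measuring setup is nothing fancy; something along these lines, with Irrlicht itself doing the FPS counting. The harness below is illustrative, not the engine's actual code; only the Irrlicht calls are real.)

#include <irrlicht.h>
using namespace irr;

int main()
{
    // 1280x720 window with the plain software driver. Swapping EDT_SOFTWARE
    // for EDT_BURNINGSVIDEO (presumably the "more advanced" software
    // renderer mentioned below), EDT_DIRECT3D8, EDT_DIRECT3D9 or EDT_OPENGL
    // covers the other setups measured in this post.
    IrrlichtDevice* device = createDevice(video::EDT_SOFTWARE,
                                          core::dimension2d<u32>(1280, 720));
    if (!device)
        return 1;

    video::IVideoDriver* driver = device->getVideoDriver();

    while (device->run())
    {
        driver->beginScene(true, true, video::SColor(255, 0, 0, 0));
        // ... the isometric map rendering would happen here ...
        driver->endScene();

        // Irrlicht keeps a running FPS figure per driver.
        core::stringw caption = L"FPS: ";
        caption += driver->getFPS();
        device->setWindowCaption(caption.c_str());
    }

    device->drop();
    return 0;
}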

While my real problems are related to scroll speed, I will first measure the rendering performance. So let us render just a single layer from our map. Since the map is uniform and symmetrical, it makes little difference which region I am rendering, as long as I keep away from the borders. Half-height wall rendering is off, since rendering that way is faster. I get a steady 65 FPS. Now two levels: 66. Hmmm... I guess that CPU caching is to blame for the increase, since the game has been running for a while and kept traversing the exact same data structures. Three levels: the same. Ten levels: same. Twenty: still 65, but it keeps spiking higher, sometimes reaching 70. All Z levels: the same. Good. Everything is working as expected! As I detailed in a post a long time ago, the engine does not care about the horizontal size of the map, and it only cares a little about the number of Z levels.

Repeating the experiment with Irrlicht's more advanced software renderer, we get a steady 50 FPS. On most machines the advanced renderer will be faster than, or at least as fast as, the normal one, but my machine is weird.

DirectX 8: 65. Again, my machine is weird.

DirectX 9: 183! OK, now we are talking!

OpenGL: 36. I am one of those people for whom OpenGL has never worked as well as DirectX. A curse? A blessing? I don't know. This is why I tend to dismiss OpenGL development with a smug smirk on my face, even though I do have some plans for OpenGL in the future.

Before we go to the real problem, scrolling speed, let me check if the rendering loop is really as optimized as it should be.
*checks out code* *facepalms* *current system is old and outdated* *fixes it* *does not work* *finds ancient bug in image loading* *fixes it*

Done! I have unified the floor and tile drawing systems. There is no noticeable performance gain, but the new system is simpler, so it was a worthwhile change. Rendering will never get any faster (as long as the number of tiles remains constant, of course). The renderer is basically a single one-dimensional for loop containing a single branch. There is basically no way to simplify this. Maybe the branch could be removed, but I doubt that would have an impact on performance, since the number of tiles is not that great. And while the number of tiles increases greatly once you increase the zoom, even in that mode you get between 30 and 40 FPS with software rendering.
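To give an idea of the structure (only the structure: CachedCell, renderCache and wallYOffset are made up for illustration, not the engine's real names), the loop boils down to something like this:

#include <irrlicht.h>

// Illustrative only. The cache is assumed to be a flat array of visible
// cells, each already carrying its screen position and tile image.
struct CachedCell
{
    irr::core::position2d<irr::s32> screenPos; // precomputed when the cache is built
    irr::video::ITexture* texture;             // floor or wall image
    bool isWall;                               // feeds the single branch below
};

// One flat, one-dimensional loop over the cached cells; the only branch
// picks a vertical offset for walls versus floors.
void renderCache(irr::video::IVideoDriver* driver,
                 const CachedCell* cells, irr::u32 count,
                 irr::s32 wallYOffset)
{
    for (irr::u32 i = 0; i < count; ++i)
    {
        irr::core::position2d<irr::s32> pos = cells[i].screenPos;
        if (cells[i].isWall)
            pos.Y -= wallYOffset;
        driver->draw2DImage(cells[i].texture, pos);
    }
}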

So now it is time to tackle the real culprit: scroll speed. Why is scroll speed low? Because we cache the map so that the renderer can be as fast as it is right now. Measuring the scroll speed under default circumstances, we get values of 0, 15 and 16 ms, with zero being the most common. As I said, this is a limitation of the timer that I am using: it has a low resolution. But anyway, 16 ms is a good value. Now let us increase the number of floor levels to the maximum: we get 16 ms almost every time. So in the first case we actually had something around 2-4 ms, and the timer only reported a non-zero value once these small numbers accumulated. Now we are at around 14-15 ms, judging by how often the 16 ms readings show up. Still not bad: scrolling is slightly less smooth, but it is barely noticeable. While keeping a scroll key pressed, FPS drops by about 10, again fully acceptable. But now let us increase the zoom to maximum: 187-210 ms. Ouch! With such high values you cannot keep the button pressed, and FPS drops to around 3 when continuously scrolling.
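(The measurement itself is just a sketch like the one below, assuming something like Irrlicht's ITimer; timeCacheRebuild and rebuildScrollCache are hypothetical stand-ins, and the real timer may differ.)

#include <irrlicht.h>
using namespace irr;

// Hypothetical helper. getRealTime() reports milliseconds, but the readings
// quoted above arrive in roughly 16 ms steps, which is why fast rebuilds show
// up as 0 and only occasionally as 15 or 16.
u32 timeCacheRebuild(IrrlichtDevice* device)
{
    ITimer* timer = device->getTimer();
    const u32 start = timer->getRealTime();

    // rebuildScrollCache();   // hypothetical: the actual cache rebuild goes here

    return timer->getRealTime() - start;
}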

As a first step I try to optimize the bounds detection algorithm used when building the cache. This is the part that tells me how much of the map fits on the screen. In a top-down map, a rectangular area of the map is rendered as a rectangular area on screen. But in isometric mode, a rectangular map area results in a diamond shape on screen, so you need to take a diamond-shaped area of the map, which then renders as a roughly rectangular area on screen. Optimizing this, I have gotten down to 156-171 ms, so about 30 ms less. A good start.
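For those unfamiliar with the technique, here is a generic sketch of the standard 2:1 screen-to-map conversion behind this kind of bounds detection. The names (MapCoord, screenToMap, screenRectToMapBounds, tileW, tileH) and the exact projection are generic assumptions; my engine's constants and offsets differ.

#include <algorithm>

struct MapCoord { int x, y; };

// Inverse of the usual 2:1 projection:
//   screenX = (mapX - mapY) * tileW / 2
//   screenY = (mapX + mapY) * tileH / 2
MapCoord screenToMap(int screenX, int screenY, int tileW, int tileH)
{
    const float a = static_cast<float>(screenX) / (tileW / 2);
    const float b = static_cast<float>(screenY) / (tileH / 2);
    MapCoord c;
    c.x = static_cast<int>((a + b) / 2.0f);
    c.y = static_cast<int>((b - a) / 2.0f);
    return c;
}

// The rectangular screen area corresponds to a diamond of map cells;
// converting the four screen corners and taking min/max gives the map-space
// bounding box that the cache builder has to scan.
void screenRectToMapBounds(int x0, int y0, int x1, int y1,
                           int tileW, int tileH,
                           MapCoord& outMin, MapCoord& outMax)
{
    const MapCoord corners[4] = {
        screenToMap(x0, y0, tileW, tileH), screenToMap(x1, y0, tileW, tileH),
        screenToMap(x0, y1, tileW, tileH), screenToMap(x1, y1, tileW, tileH)
    };
    outMin = outMax = corners[0];
    for (int i = 1; i < 4; ++i)
    {
        outMin.x = std::min(outMin.x, corners[i].x);
        outMin.y = std::min(outMin.y, corners[i].y);
        outMax.x = std::max(outMax.x, corners[i].x);
        outMax.y = std::max(outMax.y, corners[i].y);
    }
}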

Then I greatly violated the DRY principle, but only in the deepest depths of my deep code, thus eliminating an extra if per cell.

Now that the code that determines the area to be scanned is more efficient, the code that actually does the visibility scanning should be optimized. Unfortunately, it is as good as it will get and there is no way to optimize it further. So instead I cached the results of the scan, stuffing the data into some free space in the existing cache data structure, avoiding an increase in RAM consumption. This does have the disadvantage that the cache must be updated on wall dig/build operations, but this can be done locally and atomically, so it is not a big deal. New results: 109-125 ms.
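Conceptually it looks something like this (the struct layout, field names and onWallChanged are invented for illustration; the real cache cell is different):

#include <cstdint>

// Illustrative layout. The point from above: the per-cell cache already has
// unused space (alignment padding), so the visibility scan result can be
// parked there instead of allocating anything new.
struct CacheCell
{
    std::uint16_t tileId;     // which tile image to draw
    std::uint8_t  flags;      // existing per-cell flags
    std::uint8_t  visibility; // previously padding; now holds the cached scan result
};

// When a wall is dug or built, only the affected cell needs its cached
// visibility invalidated; the rest of the cache stays valid, so the update
// is local and cheap.
void onWallChanged(CacheCell* cells, int mapWidth, int x, int y)
{
    CacheCell& c = cells[y * mapWidth + x];
    c.visibility = 0; // recompute on the next scroll/cache build
}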

There is one last worthwhile experiment: adding a new field to the cache, so the new data does not get squeezed into leftover space and the CPU can access it faster, at the cost of increased RAM consumption: 94-110 ms. I am not sure about this change. The RAM increase is negligible, but the speed increase is not great either. One more test, with a huge RAM increase: the same, so not worth it. I'll go with solution "a": small RAM increase, small gain.

Well, this is about all I can do. I do not see any way to greatly increase scrolling performance further, but some minor gains could still be achieved. Going from 187-210 ms to 94-110 ms was definitely worth over half a day of optimizing. The results are great, but not stellar. The best part is that with the game compiled in optimal mode (not in debug) and using the default world generation and display values, scrolling at the highest zoom level is pretty snappy. You do get a small snag once in a while, but you can keep the button pressed. Even tripling the number of Z levels leads to less snappy but still good results.

I know I am repeating myself, but these changes only matter at the maximum zoom level. At normal zoom there was never a problem, but today's optimizations have certainly sped up even normal-zoom scrolling and map updating, thus making the game more playable even on older hardware (theoretically speaking, of course; I need to test it on some netbook before I can claim this with certainty). And of course, if there are still performance issues left, you can always reduce the resolution. I have just tried 800x600 at the highest zoom level with all Z levels visible, and it is great, with almost no lag at all and 52 FPS in optimal mode. I switched over to debug mode so I could measure the cache building speed, and it is between 46 and 62 ms. So cache building performance is directly proportional to the resolution.

Thus, version 3.0 of my isometric rendering engine is done. I will not optimize it again until after "Stoneage" (I will fix bugs, though), and if you see me tinkering with it yet again, please give me a slap to snap me out of it.
