Monday, February 13, 2012

Second generation 3D engine done!

It's about time! For future generations I have planned some light multithreading, at least some gray blobs behind objects to simulate shadows and indoor lighting. 

But let's talk about how the second generation engine came to be and what it can do.

Like I said in the previous post, I am raising the quality of graphics just a little. So I set up a heavy stress test and started tweaking the objects. The problem was that I was getting heavy snagging: while scrolling the map, the scroll animation would freeze for just a fraction of a second, but enough to be jerky and disturbing.

There are many reasons for this: it was a stress test, so item density was high. I tested on a map without elevation, and elevation uses fewer resources than items, so the load was higher. But there are two main reasons. The first is that I fixed scroll animation to be smooth. In the past it was never smooth, not even with an empty map. So with jerky scrolling you did not notice the extra snags. The second reason is that the new LOD system was designed with the first person camera in mind. I said back then that for top down the new one is a little bit slower and I did not get around to doing something about this. Ironically, the primary camera for world interaction, which also has the best performance, has the worst LOD implementation. For first person camera, the snags are not an issue.

So I ignored this issue for starters and went through the entire item list optimizing LOD levels to obtain the best possible visuals. This meant going to a LOD switch border, taking a step forward, one back, from different angles and observing the change. Then adjusting items: adding a face here, a loop cut there, removing some faces, changing shading, scaling, moving, etc. It took me about six hours to do so for the 14 items I have right now. So I don't want to touch Blender again for at least a week.

So now I have a good looking set of items with all LOD levels 95% optimized. I went for visuals, not performance, so GPU requirement is a little bit higher, as planned.

This is my reference point. This is how graphics will look and how many resources they will eat. So I must make sure that for these levels the engine works fine and is snag free.

In order to remove the load on the CPU I implemented a streaming LOD switcher. Streaming means that the system determines the work load and has a speed at which it sets off to execute that work load in the background. This way even huge tasks can be executed; they only take longer. The speed must be set in such a way that it is a good compromise between CPU time and the resulting pop-in, because streaming also means extra pop-in. With streaming you can walk for a while, then stop, and a few seconds later the engine is still working, optimizing what you see.
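To make the idea of a streaming speed concrete, here is a minimal sketch of a budgeted work queue. It is not the engine's actual code: `LodJob`, `StreamingLodSwitcher` and the "jobs per frame" meaning of `speed` are assumptions made for illustration only.

```cpp
#include <cstddef>
#include <deque>
#include <functional>

// Hypothetical unit of work: rebuild one item at a new LOD level.
struct LodJob {
    int itemId;    // which item to rebuild (illustrative)
    int newLevel;  // LOD level to switch to
};

// Minimal streaming queue: at most `speed` jobs run per frame,
// everything else waits for a later frame.
class StreamingLodSwitcher {
public:
    explicit StreamingLodSwitcher(std::size_t speed) : speed_(speed) {}

    void enqueue(const LodJob& job) { pending_.push_back(job); }

    // Call once per frame. Returns how many jobs were executed.
    std::size_t update(const std::function<void(const LodJob&)>& apply) {
        std::size_t done = 0;
        while (done < speed_ && !pending_.empty()) {
            apply(pending_.front());
            pending_.pop_front();
            ++done;
        }
        return done;
    }

    std::size_t pending() const { return pending_.size(); }

private:
    std::size_t speed_;           // jobs allowed per frame
    std::deque<LodJob> pending_;  // work not yet executed
};
```

A low `speed` keeps each frame cheap but leaves stale LOD levels on screen for longer, which is the CPU time versus pop-in compromise described above.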

I did two fun experiments that were visually interesting for me. The first was creating a map but not pre-populating it, so in the first second it was completely empty. With the streaming speed set to low it was fun to see the world be built little by little for a couple of minutes. The second experiment was the opposite. I created the world and pre-populated it with maximum LOD levels. It ate up almost 1 GiB of video memory. Then I watched as the LOD switcher reduced detail for distant objects little by little, eventually reducing the memory consumption to 150 MiB.

I fine tuned the values for what I expect my target hardware to be and got this result:


To make things more interesting I set the game to ultra and set antialiasing to 32x CSAA. This is why performance is so low. With 4x antialiasing and without FRAPS performance is at normal levels.

What can't be seen in the video is what I did next. I replaced the simple streaming LOD switcher with a biased streaming LOD switcher. This means that it uses heuristics to try and prioritize updates that would lead to a better quality render.
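One way such a bias could be sketched is with a priority queue, so that the jobs expected to improve the picture the most run first. The `priority` score here is a made-up placeholder; the post does not say which heuristics the engine uses.

```cpp
#include <cstddef>
#include <queue>
#include <vector>

struct BiasedLodJob {
    int itemId;
    int newLevel;
    float priority;  // hypothetical score: higher means a more visible improvement
};

// Orders the heap so the job with the highest priority is on top.
struct LowerPriority {
    bool operator()(const BiasedLodJob& a, const BiasedLodJob& b) const {
        return a.priority < b.priority;
    }
};

class BiasedLodQueue {
public:
    void enqueue(const BiasedLodJob& job) { heap_.push(job); }

    // Pops up to `budget` jobs, most important first.
    std::vector<BiasedLodJob> takeBatch(std::size_t budget) {
        std::vector<BiasedLodJob> batch;
        while (batch.size() < budget && !heap_.empty()) {
            batch.push_back(heap_.top());
            heap_.pop();
        }
        return batch;
    }

    bool empty() const { return heap_.empty(); }

private:
    std::priority_queue<BiasedLodJob, std::vector<BiasedLodJob>, LowerPriority> heap_;
};
```

The per-frame budget stays the same as in plain streaming; only the order in which pending jobs are picked changes.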

Combining all these techniques I get a really spectacular engine for first person navigation. There is nothing I can think of that would improve it right now, except for adding free form height transitions.

In top down mode things are not that great. Streaming greatly reduced snags, but they are still present. Making them shorter and less frequent actually made things worse, because you are not expecting a snag.

So the next step was to do some heavy profiling to determine what causes the snag. For these tests I created very small fixed size maps with a uniform distribution of barrels to eliminate all randomness. I scaled things so the streaming LOD switcher would interpret this as the worst case scenario with the highest work load.

To my surprise the snag seemed present and of similar length even on such small maps. This meant that map size was not an issue, and the snag was constant per work unit. So you could determine the approximate snag duration by multiplying the work load with the cost of the work unit.

There were two important measurements to determine. I won't go into detail about what these represent because I can't explain it without fully explaining the streaming LOD switcher with formulas and all. Suffice it to say that for each LOD switch at the maximum work load I will give two numbers in milliseconds. Both numbers need to be as low as possible, but if the second is not that low it is not a big deal.

For the game compiled in release mode I got an average of 29 / 4 ms, and for the debug mode 88 / 12 ms. Having a 30 ms delay when scrolling while you are rendering a lot of frames per second seems like a good candidate for the snags. Even if you have just 100 FPS, a frame takes 10 ms to render. So while pressing a directional key we have 10 ms, scroll, 10 ms, scroll, 10 ms, 30 ms, scroll, 10 ms, scroll, 10 ms, scroll. You see what the problem is? Once in a while a frame takes 4 times as long to render and this makes scrolling jumpy. And the higher your framerate, the worse it gets. I did a test and confirmed that making this delay shorter directly affects the smoothness of the scroll.
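The arithmetic can be written down as a tiny helper. The function names are made up for this sketch, and the 200 FPS case mentioned afterwards is an extra illustration, not a measurement.

```cpp
// Time one frame takes at a given framerate, in milliseconds.
double frameMsFromFps(double fps) {
    return 1000.0 / fps;
}

// How many times longer a frame gets when an LOD update of
// `updateMs` milliseconds lands on top of a normal frame.
double snagRatio(double frameMs, double updateMs) {
    return (frameMs + updateMs) / frameMs;
}
```

With the rounded 30 ms delay, `snagRatio(frameMsFromFps(100.0), 30.0)` gives 4, matching the numbers above, while at 200 FPS the same delay makes the slow frame 7 times longer than its neighbours, which is why a higher framerate makes the snag more noticeable.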

The first thing I needed to do is disable hardware buffers and see if the results remain consistent. I heavily use hardware buffers and if they are the cause for the length of the update operation, there is no way to fix the snag. Not without going into the Irrlicht code at least. Keeping my fingers crossed...

Disabling hardware buffers causes a 6-7 times drop in FPS, but the operations seem to take as long as before. Maybe times are sometimes lower by 1-2 ms. So this is great news. It seems that hardware buffers don't cause this, so it can potentially be fixed!

Then I change over to shallow building. Shallow building is very fast but can only be used under very limited circumstances. Shallow building changes the 29 ms to around 10 ms. This is not that good. It is not possible to get better results than shallow building for any build operation. To further improve upon that we need assembly and/or multithreading. But from 29 to 10 there is a big difference, and an even bigger one from 88 to 10 (it is still 10 under debug mode), so let's see if we can't improve upon this a little. These measurements were taken under the worst case scenario, so a sizable but insufficient improvement here could translate into what is needed for normal scenarios.

The first step is to get rid of some vestigial shadow support calculations. Shadows probably won't make it into the engine for quite a while so no use having them around. This brings zero gain, but it is cleaner.

The second step is to try and optimize bounding box calculations. First we fully disable all such calculations to see what results we can get: 22 / 3 ms. Not bad! That's 7-8 ms spent just calculating bounding boxes. Of course, without bounding boxes frustum culling is dead and performance goes down while rendering is no longer correct. So let's optimize! After a 30 line optimization that pre-computes the bounding boxes far more efficiently using vector math we get 22 / 3 ms. To make sure that everything is correct I keep both methods on for a while and add an assert to see if they produce different results while playing around on a full map. The new calculation gives the same results 100% of the time. I could replace them with a calculation that is faster but less precise, which would make the GPU work harder to render, but I don't think this will help the performance of the LOD switcher. For low to medium item densities this change is already enough to reduce snagging noticeably.
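The post does not show the 30 lines, so the following is only a sketch of one common way to do this kind of pre-computation, together with the "keep both methods and assert" check. It assumes items are placed with a uniform positive scale and a translation only (no rotation); all names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Vec3 { float x, y, z; };
struct Box { Vec3 lo, hi; };  // axis-aligned box: smallest and largest corner

// Places a local-space point in the world (uniform scale plus offset).
inline Vec3 place(Vec3 p, float scale, Vec3 offset) {
    return {p.x * scale + offset.x, p.y * scale + offset.y, p.z * scale + offset.z};
}

// Walks every vertex once. Can be done a single time per mesh and cached.
Box boxOf(const std::vector<Vec3>& verts) {
    assert(!verts.empty());
    Box b{verts[0], verts[0]};
    for (const Vec3& p : verts) {
        b.lo = {std::min(b.lo.x, p.x), std::min(b.lo.y, p.y), std::min(b.lo.z, p.z)};
        b.hi = {std::max(b.hi.x, p.x), std::max(b.hi.y, p.y), std::max(b.hi.z, p.z)};
    }
    return b;
}

// Slow reference path: place every vertex, then take the box.
Box slowWorldBox(const std::vector<Vec3>& verts, float scale, Vec3 offset) {
    std::vector<Vec3> placed;
    placed.reserve(verts.size());
    for (const Vec3& p : verts) placed.push_back(place(p, scale, offset));
    return boxOf(placed);
}

// Fast path: place only the two corners of the cached local box.
Box fastWorldBox(const Box& local, float scale, Vec3 offset) {
    assert(scale > 0.0f);  // a negative scale would swap lo and hi
    return {place(local.lo, scale, offset), place(local.hi, scale, offset)};
}

inline bool sameBox(const Box& a, const Box& b) {
    return a.lo.x == b.lo.x && a.lo.y == b.lo.y && a.lo.z == b.lo.z &&
           a.hi.x == b.hi.x && a.hi.y == b.hi.y && a.hi.z == b.hi.z;
}
```

During a transition period both paths can run side by side with `assert(sameBox(slowWorldBox(verts, s, o), fastWorldBox(local, s, o)))`, and the slow one can be dropped once the assert has stayed quiet on a full map.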

Next we go for the colorizing code! After doing this we get a minor improvement. Sometimes we get a -2 / -0.5, but sometimes we don't. Such small durations are hard to measure with accuracy. There is still one if per vertex that I would like to see gone. Luckily, I find a loophole: I make a change that slows down level change a little but the if is gone. I am now getting a consistent 18 / 2.75 ms. This optimization can be applied to shallow building too, so that has been sped up a little bit.

The obvious and simple stuff is out of the way and we are sitting on 18 / 2.75, down from 29 / 4. Now comes the hard part.

Using extremely low level pointer arithmetic I squeeze every possible CPU cycle out of index updating and get down to 16 / 2.5. Other than this I have only one more idea, an idea that uses about 12 MiB of static cache memory, but I will resort to that only if I am desperate.
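To give a flavour of what pointer based index updating can look like, here is a small sketch. It assumes 16-bit indices and that several meshes are appended into one shared buffer, so each copied index must be shifted by the position of its first vertex in that buffer. The function name is made up and this is not the engine's code.

```cpp
#include <cstddef>
#include <cstdint>

// Copies `count` indices from `src` to `dst`, adding `baseVertex` to each.
// Returns the position right after the last index written, so the next
// mesh can be appended without recomputing any offsets.
std::uint16_t* appendIndices(std::uint16_t* dst, const std::uint16_t* src,
                             std::size_t count, std::uint16_t baseVertex) {
    const std::uint16_t* const end = src + count;
    while (src != end) {
        *dst++ = static_cast<std::uint16_t>(*src++ + baseVertex);
    }
    return dst;
}
```

The caller has to guarantee that `dst` has room for `count` more indices and that the shifted values still fit in 16 bits; the loop itself does no checking, which is where the speed comes from.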

For vertices I try something with memcpy and pointers, but this does not give any benefits, so I try the same tactic with pointers like I did for indices and... I think it is 15.

So we'll stay at 15 / 2.5 for now. This is as much as I can optimize for now, but already this gives great results. When testing on a normal map in top down mode, snagging is completely gone without the massive performance hogs: the trees. With trees I'm 95% sure it is gone. Sometimes I get the impression that there might have been a minor snag. First person never had any snagging issues and now it is probably even better.

This change also affects anything that creates or destroys items, so it is a global performance gain. Item creation is now almost two times as fast!

The only bug I can see is that streaming and vertical synchronization do not mix well. So for now it is recommended that you turn that off. Vertical synchronization adds frame delay which compounds with my delay and things are jerky again.

So the streaming LOD switcher and this massive item creation optimization are the stars of the second generation 3D engine. If you have a weak CPU you should instantly notice the change. If you have a strong CPU you won't notice anything except for a smoother scroll that is caused by a bug I fixed some time ago. I think I wrote about it. Not sure.

This second generation 3D engine will serve me well, but because of the massive changes a bug or two might have slipped in. I need a few days to play test and make sure that everything works as expected. 

So I can't really release Snapshot 6 yet. If I deem the engine stable I might make it on time, but if not I'll post the changelog for it and get started on Snapshot 7.

Snapshot 6 also marks the finalization of creation mode stage 1, so I might do a release only with this followed by a few bugfix releases, but my mind already wanders towards the reintroduction of dwarves. And a very special HD fork.

I won't work on the engine any more right now except for bugfixing, but one day when I feel ready for it, I'll try and create the third generation engine that should do all the tasks with a little help from multithreading.

3 comments:

  1. Impressive, I barely can imagine how this will look like with ramps and more natural objects. Go for it!

  2. like plants and rocks ...
    Objects you can find in nature
