Friday, December 1, 2017

Occlusion - What it is, and how and where Unreal and CryEngine implement it

The most effective way to speed up any task is to do less work. The trick is always figuring out what you can skip without affecting the game.  Occlusion testing is one of the biggest ways of doing this for rendering.  The goal behind adding an occlusion system is to determine what is definitely not visible so we can avoid rendering it, ideally saving both CPU time spent generating command lists and GPU time spent transforming vertices that don't affect the final image.

While going over each of these techniques, I'll point out where they happen in UE4 and CryEngine, since the source code for both is available to the public and seeing how these systems are implemented can be useful.  These engines may not have the best implementations possible, but they can be read and studied.

Frustum testing

The most basic occlusion technique is frustum testing.  This involves constructing a frustum from the view projection matrix, and then, for every object you're considering rendering, testing its bounding box against that frustum.  If the box is entirely outside of the frustum, you can safely skip the object; it definitely won't be visible.  Every engine I've worked on had this in one form or another.

The least amount of work needed to add this is to test each object against the current frustum right before rendering it, and skip it if it's outside.

A much better approach is adding a group of jobs that runs early during rendering for each view that will be rendered.  These jobs take chunks of objects from the list of objects that might be visible and do the frustum test on each.  This way you have several workers assisting visibility testing, decreasing the time spent evaluating visibility.  It's even possible to calculate some view-dependent info in parallel during this stage.  My own toy engine computes distance from the camera here, with each job sorting the objects it found to be visible by that distance, and a final job merging the per-job presorted lists into a single sorted list.
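
As a rough illustration, here's a minimal sketch of that chunked job setup, using a simple plane-based box test and std::async standing in for a real job system.  The types, names, and chunk size are placeholders, not any engine's API:

    #include <algorithm>
    #include <cmath>
    #include <future>
    #include <iterator>
    #include <vector>

    struct Plane   { float nx, ny, nz, d; };          // normal points into the frustum
    struct Frustum { Plane planes[6]; };
    struct Bounds  { float cx, cy, cz, ex, ey, ez; }; // center and half-extents

    // Standard center/extent AABB-vs-plane test: the box is outside if it
    // lies fully behind any single frustum plane.
    static bool OutsideFrustum(const Frustum& f, const Bounds& b) {
        for (const Plane& p : f.planes) {
            float r = b.ex * std::fabs(p.nx) + b.ey * std::fabs(p.ny) + b.ez * std::fabs(p.nz);
            float dist = p.nx * b.cx + p.ny * b.cy + p.nz * b.cz + p.d;
            if (dist + r < 0.0f)
                return true;
        }
        return false;
    }

    struct VisibleObject { size_t index; float distSq; };
    static bool ByDistance(const VisibleObject& a, const VisibleObject& b) {
        return a.distSq < b.distSq;
    }

    std::vector<VisibleObject> CullVisible(const Frustum& frustum,
                                           float camX, float camY, float camZ,
                                           const std::vector<Bounds>& bounds) {
        const size_t kChunk = 4096; // one job per chunk of candidate objects
        std::vector<std::future<std::vector<VisibleObject>>> jobs;
        for (size_t start = 0; start < bounds.size(); start += kChunk) {
            size_t end = std::min(start + kChunk, bounds.size());
            jobs.push_back(std::async(std::launch::async, [&, start, end] {
                std::vector<VisibleObject> visible;
                for (size_t i = start; i < end; ++i) {
                    if (OutsideFrustum(frustum, bounds[i]))
                        continue;
                    float dx = bounds[i].cx - camX;
                    float dy = bounds[i].cy - camY;
                    float dz = bounds[i].cz - camZ;
                    visible.push_back({ i, dx * dx + dy * dy + dz * dz });
                }
                // Each job presorts its own survivors by distance.
                std::sort(visible.begin(), visible.end(), ByDistance);
                return visible;
            }));
        }
        // A final pass merges the presorted per-job lists into one sorted list.
        std::vector<VisibleObject> merged;
        for (auto& job : jobs) {
            std::vector<VisibleObject> part = job.get(), out;
            out.reserve(merged.size() + part.size());
            std::merge(merged.begin(), merged.end(), part.begin(), part.end(),
                       std::back_inserter(out), ByDistance);
            merged.swap(out);
        }
        return merged;
    }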

UE4's implementation is located in SceneVisibility.cpp; look for FrustumCull.  This function gets called on the render thread, and uses a ParallelFor to iterate over its list of PrimitiveBounds, creating a task per 4096 primitives, with each task working on 32 primitives at a time.  Each object that's determined to be visible gets added to the PrimitiveVisibilityMap, and, if it's fading between LODs, to the PotentiallyFadingPrimitiveMap.

In CryEngine, this isn't as straightforward.  Unlike Unreal, which ends up tracking data on both the Game Thread and the Render Thread, CryEngine maintains all of the data on the Game Thread, in an octree.  It iterates over the octree, only following nodes that pass the frustum tests, and adds any individual objects that are in the frustum to a culling queue.  This is all done in COctreeNode::Render_Object_Nodes.  The culling queue is processed in a job, and I'll talk about the other work this job does in the following sections.

UE4 also maintains rendering data in an octree, however that's primarily used for lights and for deciding which primitives will be rendered into dynamic shadows.  Basic visibility is done over an array of bounding boxes split across several jobs.  Processing the data as an octree tends to have better single-threaded performance, but processing it as a flat array of bounds across several jobs can significantly improve performance with enough workers.

Reverse frustum testing

A fairly common extension on top of standard frustum testing is reverse frustum testing.  This is where special occlusion shapes are manually placed in the level, and the engine constructs frustums from those shapes as seen from the camera.  These are treated as occlusion frustums: any object that is fully inside one of them is definitely not visible.
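
Reusing the Plane/Frustum/Bounds types from the sketch above, the test simply flips: instead of rejecting a box that's fully behind any one plane, we occlude a box that's fully in front of all of them.  Again, this is a sketch of the idea, not any engine's actual code:

    // Occluded only if the box is entirely inside the occlusion frustum
    // (plane normals pointing into the volume).
    static bool FullyInsideOccluder(const Frustum& f, const Bounds& b) {
        for (const Plane& p : f.planes) {
            float r = b.ex * std::fabs(p.nx) + b.ey * std::fabs(p.ny) + b.ez * std::fabs(p.nz);
            float dist = p.nx * b.cx + p.ny * b.cy + p.nz * b.cz + p.d;
            if (dist - r < 0.0f)
                return false; // some part of the box pokes outside the occluder
        }
        return true;
    }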

Most engines have this implemented somewhere, as it provides relatively cheap occlusion if the levels are set up to use it.  In fact, one of the companies I interviewed with revealed that this was the only occlusion their engine used for their open world game series.

In CryEngine, this is handled in CVisAreaManager::IsOccludedByOcclVolumes.  This is the second test an octree node undergoes after it's determined to be within the standard camera frustum.  I have yet to find anything similar in Unreal.

Hardware occlusion queries

A common technique for occlusion is to use hardware occlusion queries.  This is where, after rendering the scene, we have the GPU render the bounding boxes of the objects whose visibility we want to know, without actually writing to the render targets, and report how many pixels passed depth testing.  From that we can tell whether each object was visible, and either stop rendering it, or start rendering it again if we previously thought it wasn't visible.
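
For concreteness, here's roughly what that looks like at the D3D11 level.  DrawBoundingBox is a hypothetical stand-in for drawing the object's bounds with depth testing on and color/depth writes off:

    #include <d3d11.h>

    void DrawBoundingBox(ID3D11DeviceContext* ctx); // hypothetical helper

    ID3D11Query* CreateOcclusionQuery(ID3D11Device* device) {
        D3D11_QUERY_DESC desc = {};
        desc.Query = D3D11_QUERY_OCCLUSION; // counts pixels that pass depth testing
        ID3D11Query* query = nullptr;
        device->CreateQuery(&desc, &query);
        return query;
    }

    void IssueQuery(ID3D11DeviceContext* ctx, ID3D11Query* query) {
        ctx->Begin(query);
        DrawBoundingBox(ctx);
        ctx->End(query);
    }

    // Polled on a later frame so the CPU never stalls waiting on the GPU.
    bool WasVisible(ID3D11DeviceContext* ctx, ID3D11Query* query, bool lastResult) {
        UINT64 pixelsPassed = 0;
        if (ctx->GetData(query, &pixelsPassed, sizeof(pixelsPassed),
                         D3D11_ASYNC_GETDATA_DONOTFLUSH) == S_OK)
            return pixelsPassed > 0;
        return lastResult; // result not ready yet; keep the previous state
    }

That polling, rather than waiting, is exactly where the frame-or-more delay discussed below comes from.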

This is the main way that Unreal handles occlusion.  It happens in FetchVisibilityForPrimitives, which batches the objects it had previously determined to be not visible into groups to verify they are still not visible, and then creates individual queries for new objects to be tested for visibility.

There are two notable downsides to this setup.

One, it takes a while to reach a stable state, and its performance will spike when moving through the level as things become visible and then leave visibility.  When you first render a scene, you have no knowledge of what is visible, so you must render everything; over the following frames, you find out what's not visible and gradually win performance back.  Also, because non-visible objects are clustered into groups, when one object becomes visible, a whole group becomes visible at once.  The group gets re-tested the next frame, but you'll still see a performance spike.

Two, objects becoming visible are delayed by at least one frame, maybe more on PCs with multiple GPUs.  This means it suffers from false negatives - it thinks objects aren't visible this frame when they are.  If you've ever seen a door open in an Unreal game and briefly been able to see through the world for a frame, this is what went wrong.  It wasn't that the scene wasn't loaded; the system just thought those objects weren't visible when they should have been.

Static occlusion testing

Another technique Unreal has is static occlusion.  During a lighting build, the system can also split the map into precomputed occlusion cells, then test every object in the scene against each cell to determine what's visible from anywhere inside it.  This is nice because very little work is needed at run time to find out what's visible - you can just use the cell's data as the visible set.  However, if the camera leaves the areas where cells were placed, you're out of luck and need to fall back to other occlusion tests, or just render everything.

This was added for UE3's mobile support, as OpenGL ES didn't support occlusion queries when the engine was first being ported.

Reprojected depth visibility testing

The idea here is to take a lower-res depth buffer from last frame, reproject it from last frame's view to this frame's view, and test each object's bounding box against the depth found in the reprojected depth buffer.
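
A minimal sketch of the reprojection step, using placeholder math types rather than any engine's actual code:

    #include <algorithm>
    #include <vector>

    struct Vec4 { float x, y, z, w; };
    struct Mat4 { // row-major 4x4
        float m[16];
        Vec4 Transform(const Vec4& v) const {
            return { m[0]*v.x  + m[1]*v.y  + m[2]*v.z  + m[3]*v.w,
                     m[4]*v.x  + m[5]*v.y  + m[6]*v.z  + m[7]*v.w,
                     m[8]*v.x  + m[9]*v.y  + m[10]*v.z + m[11]*v.w,
                     m[12]*v.x + m[13]*v.y + m[14]*v.z + m[15]*v.w };
        }
    };

    // Scatter last frame's depth into the current view. Pixels nothing
    // lands on keep the far-plane value (1.0) - the holes described below.
    void ReprojectDepth(const std::vector<float>& oldDepth, int w, int h,
                        const Mat4& invOldViewProj, const Mat4& newViewProj,
                        std::vector<float>& newDepth) {
        newDepth.assign(w * h, 1.0f);
        for (int y = 0; y < h; ++y) {
            for (int x = 0; x < w; ++x) {
                float d = oldDepth[y * w + x];
                // Old pixel -> old NDC -> world position.
                Vec4 ndc = { 2.0f * (x + 0.5f) / w - 1.0f,
                             1.0f - 2.0f * (y + 0.5f) / h, d, 1.0f };
                Vec4 world = invOldViewProj.Transform(ndc);
                world.x /= world.w; world.y /= world.w; world.z /= world.w; world.w = 1.0f;
                // World position -> new clip space -> new pixel.
                Vec4 clip = newViewProj.Transform(world);
                if (clip.w <= 0.0f) continue; // now behind the camera
                int px = (int)((clip.x / clip.w * 0.5f + 0.5f) * w);
                int py = (int)((0.5f - clip.y / clip.w * 0.5f) * h);
                if (px < 0 || px >= w || py < 0 || py >= h) continue;
                // Keep the nearest depth when multiple old pixels collide.
                float& dst = newDepth[py * w + px];
                dst = std::min(dst, clip.z / clip.w);
            }
        }
    }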

This is the main occlusion system CryEngine uses.  Before the game thread renders the scene, it first calls PrepareOcclusion.  This triggers a series of jobs that download the depth buffer from the GPU and reproject it from the old projection to the current frame's projection, leaving a far-plane value anywhere that doesn't get reprojected onto due to the camera moving or rotating.

Then, while the system is rendering the scene, a job is created that processes the octree nodes that were pushed into the cull queue.  This happens in CCullThread::CheckOcclusion, which calls TestAABB to test each object's bounding box against the reprojected depth buffer.  When I worked with CryEngine, this system could be extended to spread the work across multiple worker threads, but you quickly run into diminishing returns.
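
The box test itself can be sketched as follows, reusing the Bounds and math types from the earlier sketches.  It's deliberately conservative: the box only counts as occluded if every depth pixel it covers is nearer than the box's nearest point.  This is an illustration of the idea, not CryEngine's actual TestAABB:

    bool IsOccluded(const Bounds& b, const Mat4& viewProj,
                    const std::vector<float>& depth, int w, int h) {
        float minX = 1e30f, minY = 1e30f, maxX = -1e30f, maxY = -1e30f, minZ = 1e30f;
        for (int i = 0; i < 8; ++i) { // project all 8 box corners
            Vec4 corner = { b.cx + ((i & 1) ? b.ex : -b.ex),
                            b.cy + ((i & 2) ? b.ey : -b.ey),
                            b.cz + ((i & 4) ? b.ez : -b.ez), 1.0f };
            Vec4 clip = viewProj.Transform(corner);
            if (clip.w <= 0.0f) return false; // crosses the near plane: treat as visible
            minX = std::min(minX, clip.x / clip.w); maxX = std::max(maxX, clip.x / clip.w);
            minY = std::min(minY, clip.y / clip.w); maxY = std::max(maxY, clip.y / clip.w);
            minZ = std::min(minZ, clip.z / clip.w); // nearest point of the box
        }
        // Screen-space rect covered by the box, clamped to the buffer.
        int x0 = std::max(0, (int)((minX * 0.5f + 0.5f) * w));
        int x1 = std::min(w - 1, (int)((maxX * 0.5f + 0.5f) * w));
        int y0 = std::max(0, (int)((0.5f - maxY * 0.5f) * h));
        int y1 = std::min(h - 1, (int)((0.5f - minY * 0.5f) * h));
        for (int y = y0; y <= y1; ++y)
            for (int x = x0; x <= x1; ++x)
                if (depth[y * w + x] > minZ)
                    return false; // something behind the box shows here: maybe visible
        return true; // every covered pixel is nearer than the box
    }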

The advantage of this over hardware occlusion queries is that there are no false negatives.  The system still suffers on the first frame, and any major camera change will result in rendering many objects that may not actually be visible.  Plus, small camera changes can produce holes in the reprojected depth buffer, which can result in rendering objects that won't be visible.

I believe Unreal is working on an update to their occlusion systems to support this, but the last time I looked it was experimental and didn't reproject the depth data either.  They'll probably get there eventually.

CPU Rasterized depth

The final step for CPU-based visibility is to generate the depth buffer on the CPU.  This typically requires much lower-res geometry, which the CPU renders in jobs to a depth buffer that is then used for occlusion testing.

CryEngine supports this as an alternative to reprojected depth buffer testing.  Instead of reprojecting, it renders a simplified version of the scene, up to a certain vertex budget, entirely on the CPU and with the current frame's projection.  I'm not sure, but it might use last frame's depth buffer as well to provide some additional occluders.

The advantage of this is you end up with a view of the scene with no holes in the depth buffer and no first-frame penalties.  The occlusion result is always 'perfect'; there shouldn't be any false negatives or false positives.  It does, however, require the simplified geometry and, I assume, a fair amount of worker thread time to actually do the software render.  We were never in a situation to verify how much time this would take, as our minspec was too low to justify it and the reprojected results were good enough.

Intel released a software-based occlusion system as well, and Fabian Giesen wrote a fascinating blog series about further optimizing their software occlusion culling, work that has since been integrated into the official release.

I wouldn't be surprised to learn that most of the third-party occlusion libraries available for Unreal and other engines implement either this or reprojected depth buffer testing as their core enhancement to the engine's occlusion system.

I've also seen some interesting uses for such a system, including having explosions be blocked by geometry and only applying impulses to the parts of dynamic objects that have a clear line of sight to the explosions themselves.  This is a level of detail most games don't really need, but it's an interesting feature to add.

GPU based rendering

Based on various SIGGRAPH papers, GPU-based rendering is a major focus for occlusion in the near future, especially for engines that can specialize for specific games.  However, I feel like the engines I've seen are going to have a hard time adding this.  The crux of the issue is: how do you record draws on the CPU for an uncertain number of objects, when compute shaders on the GPU are what determine what actually needs to be rendered?

While it's relatively straightforward to use indirect drawing to have the GPU control how many instances to render, being able to specify things like which buffers to use is DX12-specific, and controlling buffers, pipelines, and descriptor sets is limited to Nvidia drivers on Vulkan.  To support as many targets as possible, you're limited to just controlling the number of instances to render, which is supported on DX11 and Vulkan.
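
On DX11, that lowest common denominator looks roughly like this: the CPU pre-records one indirect draw per instanced batch, and the culling compute shader only fills in the instance count.  The struct layout matches what DrawInstancedIndirect expects; the rest is a sketch:

    #include <d3d11.h>

    // Argument layout D3D11 reads for DrawInstancedIndirect.
    struct DrawInstancedArgs {
        UINT VertexCountPerInstance; // filled in by the CPU
        UINT InstanceCount;          // written by the culling compute shader
        UINT StartVertexLocation;    // filled in by the CPU
        UINT StartInstanceLocation;  // filled in by the CPU
    };

    void SubmitBatch(ID3D11DeviceContext* ctx, ID3D11Buffer* argsBuffer, UINT batchIndex) {
        // The GPU reads InstanceCount at draw time, so a batch the shader
        // culled entirely still costs a draw call, just with zero instances.
        ctx->DrawInstancedIndirect(argsBuffer,
                                   (UINT)(batchIndex * sizeof(DrawInstancedArgs)));
    }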

This means the scenes you'd render with this must use heavy instancing, either issuing a draw for every set of buffers/textures used by the instanced, occlusion-tested objects, or building a system that allows a great deal of indirection when accessing buffers and textures.  For a system like Unreal, with its very flexible material system, the first option seems the most likely path; for CryEngine, which has a more controlled material system, either approach might work.

The basic plan, from the GPU's perspective, is that the system renders some likely occluders for this frame, fills in the gaps using a reprojection of last frame's data, and does occlusion testing against that.

Compute based occlusion

The folks over at Guerrilla Games recently released a paper about the occlusion system they made for Horizon Zero Dawn.  What they're doing is basically the reprojected depth visibility testing, but using async compute on the GPU instead of CPU jobs.  This makes a lot of sense, as the input data is rendering data to begin with.  It strikes me as an interesting approach, but I wonder how well the async compute overlaps with the rest of the frame's rendering, and whether this would be a viable approach for engines that target PCs.