Friday, December 1, 2017

Occlusion - What it is, and how and where Unreal and CryEngine implement it

The most effective way to speed up any task is to do less.  The trick is always figuring out what you can skip without affecting the game.  Occlusion testing is one of the biggest ways of doing this for rendering.  The goal of an occlusion system is to determine what is definitely not visible so we can avoid rendering it, ideally saving both CPU time spent generating command lists and GPU time spent transforming vertices that don't affect the final image.

While going over each of these techniques, I'll point out where they happen in UE4 and CryEngine, since the source code for both is available to the public and seeing how these systems are implemented can be useful.  These engines may not have the best implementations possible, but they can be studied.

Frustum testing

The most basic occlusion technique is frustum testing.  It involves constructing a frustum from the view projection matrix, and then, for every object you're considering rendering, testing its bounding box against that frustum.  If the box is entirely outside of the frustum, you can safely skip the object - it definitely won't be visible.  Every engine I've worked on had this in one form or another.
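To make the test concrete, here's a minimal sketch of the bounding-box-vs-frustum check, under the assumption that the frustum is stored as six inward-facing planes (dot(normal, p) + distance >= 0 means inside); the Vec3/Plane/AABB/Frustum types here are illustrative, not from any particular engine.

struct Vec3 { float x, y, z; };
struct Plane { Vec3 normal; float distance; };
struct AABB { Vec3 min, max; };
struct Frustum { Plane planes[6]; };

bool IsAABBInFrustum(const Frustum& frustum, const AABB& box)
{
  for (const Plane& plane : frustum.planes)
  {
    // Test the box corner that sits furthest along the plane normal; if even
    // that corner is behind the plane, the whole box is outside the frustum.
    Vec3 corner;
    corner.x = plane.normal.x > 0.0f ? box.max.x : box.min.x;
    corner.y = plane.normal.y > 0.0f ? box.max.y : box.min.y;
    corner.z = plane.normal.z > 0.0f ? box.max.z : box.min.z;
    float dist = plane.normal.x * corner.x + plane.normal.y * corner.y +
                 plane.normal.z * corner.z + plane.distance;
    if (dist < 0.0f)
    {
      return false;  // definitely outside this plane, so outside the frustum
    }
  }
  return true;  // possibly visible
}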

The simplest way to add this is to test each object against the current frustum right before rendering it, and skip it if it's outside.

A much better approach is adding a group of jobs that runs early during rendering for each view that will be rendered.  These jobs take chunks of objects from the list of objects that might be visible and run the frustum test on them.  This way several workers assist with visibility testing, decreasing the time spent evaluating visibility.  It's even possible to calculate some view dependent info during this stage in parallel.  My own toy engine determines distance from the camera, with each job sorting the objects it found to be visible by that distance, and a final job combining the per-job presorted lists into a single sorted list.
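A rough sketch of what that chunked, presorted culling could look like; it reuses the IsAABBInFrustum test from the sketch above, while bounds, frustum, objectCount, and DistanceToCamera are assumptions, and the outer loop is shown serially for brevity where each chunk would really be its own job.

#include <algorithm>
#include <iterator>
#include <vector>

struct VisibleObject { float distance; uint32_t index; };

uint32_t chunkCount = (objectCount + 4095) / 4096;
std::vector<std::vector<VisibleObject>> perChunk(chunkCount);

// each iteration of this loop would be submitted as its own job
for (uint32_t chunk = 0; chunk < chunkCount; ++chunk)
{
  uint32_t begin = chunk * 4096u;
  uint32_t end = std::min(begin + 4096u, objectCount);
  for (uint32_t i = begin; i < end; ++i)
  {
    if (IsAABBInFrustum(frustum, bounds[i]))
    {
      perChunk[chunk].push_back({ DistanceToCamera(bounds[i]), i });
    }
  }
  // pre-sort this chunk by distance so the final merge is cheap
  std::sort(perChunk[chunk].begin(), perChunk[chunk].end(),
            [](const VisibleObject& a, const VisibleObject& b) { return a.distance < b.distance; });
}

// final job: merge the pre-sorted chunk lists into a single sorted visible list
std::vector<VisibleObject> visible;
for (const auto& chunkList : perChunk)
{
  std::vector<VisibleObject> merged;
  std::merge(visible.begin(), visible.end(), chunkList.begin(), chunkList.end(),
             std::back_inserter(merged),
             [](const VisibleObject& a, const VisibleObject& b) { return a.distance < b.distance; });
  visible.swap(merged);
}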

UE4's implementation is located in SceneVisibility.cpp.  Look for FrustumCull.  This function gets called on the render thread, and uses a ParallelFor to iterate over its list of PrimitiveBounds, creating a task per 4096 primitives, with each task working on 32 primitives at a time.  Each object that's determined to be visible gets added to the PrimitiveVisibilityMap, and if it's fading between LODs, added to the PotentiallyFadingPrimitiveMap.

In CryEngine, this isn't as straightforward.  Unlike Unreal, which ends up tracking data on both the Game Thread and the Render Thread, CryEngine maintains all of the data on the Game Thread, in an octree.  It then iterates over the octree, only following nodes that pass the frustum tests, and adds any individual objects that are in the frustum to a culling queue.  This is all done in COctreeNode::Render_Object_Nodes.  The culling queue is processed in a job, and I'll talk about the other work this job does in the following sections.

UE4 also maintains rendering data in an octree, however that's primarily used for lights and for deciding which primitives will be rendered into dynamic shadows.  Basic visibility is done over a flat array of bounding boxes spread across several jobs.  Processing the data as an octree tends to have better single threaded performance, but processing it as an array of bounds across several jobs can significantly improve performance with enough workers.

Reverse frustum testing

A fairly common extension on top of the standard frustum testing is reverse frustum testing.  Here, special occlusion shapes are manually placed in the level, and the engine constructs frustums for those shapes in the camera's view space.  These special frustums are treated as occlusion frustums: any object that is fully inside one of them is definitely not visible.
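As a sketch, the containment test is just the earlier frustum test flipped around, reusing the same Vec3/Plane/AABB/Frustum types; an object is occluded only if its bounds sit entirely on the inner side of every plane of the occlusion frustum.

bool IsAABBFullyInsideOcclusionFrustum(const Frustum& occluder, const AABB& box)
{
  for (const Plane& plane : occluder.planes)
  {
    // Test the box corner least far along the plane normal; if that corner is
    // outside (or crossing) the plane, the box is not fully contained.
    Vec3 corner;
    corner.x = plane.normal.x > 0.0f ? box.min.x : box.max.x;
    corner.y = plane.normal.y > 0.0f ? box.min.y : box.max.y;
    corner.z = plane.normal.z > 0.0f ? box.min.z : box.max.z;
    float dist = plane.normal.x * corner.x + plane.normal.y * corner.y +
                 plane.normal.z * corner.z + plane.distance;
    if (dist < 0.0f)
    {
      return false;  // part of the box pokes out of the occluder volume
    }
  }
  return true;  // entirely behind the occluder: definitely not visible
}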

Most engines have this implemented somewhere, as it provides a relatively cheap occlusion feature if the levels are set up to use it.  In fact, one of the companies I've interviewed with revealed that this was the only occlusion their engine used for their open world game series.

In CryEngine, this is handled in CVisAreaManager::IsOccludedByOcclVolumes.  This is the second test that an octree node undergoes after determining that it's within the standard camera frustum.  I have yet to find anything similar in Unreal.

Hardware occlusion queries

A common technique for occlusion is to use hardware occlusion queries.  After rendering the scene, we ask the GPU to report how many pixels pass depth testing while it renders the bounding boxes of the objects whose visibility we want to know, without actually writing to the render targets.  From that information we can tell whether the objects were visible, and either stop rendering ones that weren't, or start rendering ones we previously thought weren't visible.
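A rough sketch of what issuing one of these queries looks like in D3D11; device and context are assumed to already exist, and a real engine draws the bounding box with a depth-test-only state and reads the result back a frame or more later rather than right after submitting.

D3D11_QUERY_DESC queryDesc = {};
queryDesc.Query = D3D11_QUERY_OCCLUSION;

ID3D11Query* query = nullptr;
device->CreateQuery(&queryDesc, &query);

context->Begin(query);
// ... draw the object's bounding box with color and depth writes disabled ...
context->End(query);

// On a following frame: how many pixels of the bounding box passed depth testing?
UINT64 samplesPassed = 0;
if (context->GetData(query, &samplesPassed, sizeof(samplesPassed),
                     D3D11_ASYNC_GETDATA_DONOTFLUSH) == S_OK)
{
  bool visible = samplesPassed > 0;
  // update the primitive's visibility state with 'visible'
}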

This is the main way that Unreal handles occlusion.  This happens in FetchVisibilityForPrimitives, where it batches the objects it had previously determined to not be visible into groups to verify they are still not visible, and then creates queries for new objects to be tested for visibility.

There are two notable downsides to this setup.

One, it takes a while to get to a stable state, and its performance will spike when moving through the level as things become visible and then leave visibility.  When you first render a scene, you have no knowledge of what is visible, so you must render everything; over the following frames you find out what's not visible and gradually get performance back.  Also, because non-visible objects are clustered into groups, when one object becomes visible the whole group is treated as visible at once.  The individual objects get re-tested the next frame, but you'll still see a performance spike.

Two, objects becoming visible are delayed by at least one frame, maybe more on PCs with multiple GPUs.  This means it does suffer from false negatives - it thinks objects aren't visible this frame when they are.  If you've ever seen a door open in an Unreal game and been able to briefly see through the world for a frame, this is what went wrong.  It wasn't that the scene wasn't loaded, just that the system thought it wasn't visible when it should have been.

Static occlusion testing

Another technique Unreal has is static occlusion.  During a lighting build, the system can also split the map into precomputed occlusion cells, and test every object in the scene against each cell to determine what's visible from anywhere inside it.  This is nice because very little work is needed at run time: you just use the cell's data as your visibility set.  However, if you leave the area where cells have been placed, you're out of luck and need to fall back to other occlusion tests, or just render everything.

This was added for UE3's mobile support, as OpenGL ES didn't support occlusion queries when it was first being ported.

Reprojected depth visibility testing

The idea here is to take a lower-res depth buffer from last frame, reproject it from last frame's view to this frame's view, and then test each object's bounding box against the depth found in the reprojected depth buffer.
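A minimal sketch of the per-object test, assuming the bounding box has already been projected to a screen-space rectangle plus its nearest depth value, and that larger depth means farther away; a real implementation works on the low-res buffer conservatively (expanded rectangle, farthest depth per texel).

bool IsOccludedByDepthBuffer(int minX, int minY, int maxX, int maxY, float boxMinZ,
                             const float* depth, int width)
{
  for (int y = minY; y <= maxY; ++y)
  {
    for (int x = minX; x <= maxX; ++x)
    {
      // If any texel the box covers is farther than the box's nearest point,
      // the box might be visible through that texel.
      if (depth[y * width + x] >= boxMinZ)
      {
        return false;
      }
    }
  }
  return true;  // every covered texel is closer: the box is fully occluded
}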

This is the main occlusion system CryEngine uses.  Before the game thread renders the scene, it first calls PrepareOcclusion.  This triggers a series of jobs that read the depth buffer back from the GPU and reproject it from the old projection to the current frame's projection, leaving the far-plane value in any texel that nothing reprojects onto because the camera moved or rotated.

Then, while the system is rendering the scene, a job is created that processes the octree nodes that were pushed into the cull queue.  This happens in CCullThread::CheckOcclusion, which calls TestAABB to test each node's bounding box against the reprojected depth buffer.  When I worked with CryEngine, we extended this to multiple worker threads to spread the work out, but you quickly run into diminishing returns.

The advantage of this over hardware occlusion queries is that there are no false negatives.  The system still pays on the first frame, and any major camera change will result in many objects being treated as visible that don't need to be.  Plus, small camera changes can produce holes in the reprojected depth buffer, which result in rendering objects that won't actually be visible.

I believe Unreal is working on an update to their occlusion systems to support this, but the last time I looked it was experimental and didn't reproject the depth data either.  They'll probably get there eventually.

CPU Rasterized depth

The final step for CPU-based visibility is to generate the depth buffer on the CPU itself.  This typically requires much lower-res geometry, which the CPU renders in jobs to a small depth buffer that is then used for the occlusion tests.
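For illustration, a single-threaded sketch of rasterizing one pre-transformed occluder triangle into a small float depth buffer; a real implementation runs this in jobs with SIMD, the winding and screen-space conventions here are assumptions, and the buffer is assumed to be cleared to 1.0f (far) beforehand.

#include <algorithm>
#include <cmath>

struct ScreenVert { float x, y, z; };  // screen-space position, z in [0,1]

void RasterizeOccluder(const ScreenVert& a, const ScreenVert& b, const ScreenVert& c,
                       float* depth, int width, int height)
{
  auto edge = [](const ScreenVert& p0, const ScreenVert& p1, float x, float y)
  {
    return (p1.x - p0.x) * (y - p0.y) - (p1.y - p0.y) * (x - p0.x);
  };

  float area = edge(a, b, c.x, c.y);
  if (area <= 0.0f)  // back-facing or degenerate under the assumed winding
    return;

  int minX = std::max(0, (int)std::floor(std::min({a.x, b.x, c.x})));
  int maxX = std::min(width - 1, (int)std::ceil(std::max({a.x, b.x, c.x})));
  int minY = std::max(0, (int)std::floor(std::min({a.y, b.y, c.y})));
  int maxY = std::min(height - 1, (int)std::ceil(std::max({a.y, b.y, c.y})));

  for (int y = minY; y <= maxY; ++y)
  {
    for (int x = minX; x <= maxX; ++x)
    {
      float px = x + 0.5f, py = y + 0.5f;
      float w0 = edge(b, c, px, py);
      float w1 = edge(c, a, px, py);
      float w2 = edge(a, b, px, py);
      if (w0 < 0.0f || w1 < 0.0f || w2 < 0.0f)
        continue;  // pixel center is outside the triangle
      // interpolate depth with barycentric weights and keep the closest value
      float z = (w0 * a.z + w1 * b.z + w2 * c.z) / area;
      float& dst = depth[y * width + x];
      dst = std::min(dst, z);
    }
  }
}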

CryEngine supports this too, as an alternative to reprojected depth buffer testing.  Instead it renders a simplified version of the scene, up to a certain number of vertices, entirely on the CPU and with the current frame's projection.  I'm not sure, but it might use last frame's depth buffer as well to provide some additional occluders.

The advantage of this is you end up with a view of the scene with no holes in the depth buffer and no first frame penalties.  The occlusion result is always 'perfect'; there shouldn't be any false negatives or false positives.  It does, however, require the simplified geometry and, I assume, a fair amount of worker thread time to actually do the software render.  We were never in a situation to verify how much time this would have taken, as our minspec was too low to justify it and the reprojected results were good enough.

Intel released a software based occlusion system as well, and there was a fascinating blog series written by Fabian Giesen about further optimizing their software occlusion that has been integrated into the official release.

I wouldn't be surprised to learn that most of the third party occlusion libraries available for Unreal and other engines implement either this or the reprojected depth buffer approach as their enhancement to the engine's occlusion system.

I've also seen some interesting uses for such a system, including having explosions be blocked by geometry and only applying impulses to the parts of dynamic objects that have a clear line of sight to the explosions themselves.  This is a level of detail most games don't really need, but it's an interesting feature to add.

GPU based rendering

Based on various SIGGRAPH papers, GPU based rendering is a major focus for occlusion in the near future, especially for engines that can be more specialized for specific games.  However, I feel like the engines I've seen are going to have a hard time adding it.  The crux of the issue is how you set up, on the CPU, the rendering of an uncertain number of objects when it's compute shaders on the GPU that will decide what actually needs to be drawn.

While it's relatively straightforward to use indirect drawing to have the GPU control how many instances to render, being able to also specify things like which buffers to use is DX12 specific, and controlling buffers, pipelines, and descriptor sets from the GPU is limited to Nvidia drivers on Vulkan.  To support as many targets as possible, you're limited to just controlling the number of instances to render, which is supported on DX11 and Vulkan.
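As a sketch of the instance-count-only path on D3D11: the argument buffer is just the four UINTs that DrawInstancedIndirect expects, created so that a culling compute shader (not shown, and whose UAV setup is omitted) can overwrite InstanceCount with however many instances survived.  device, context, and vertexCount are assumed to already exist.

struct DrawArgs
{
  UINT VertexCountPerInstance;
  UINT InstanceCount;       // written by the culling compute shader
  UINT StartVertexLocation;
  UINT StartInstanceLocation;
};

DrawArgs initial = { vertexCount, 0, 0, 0 };

D3D11_BUFFER_DESC desc = {};
desc.ByteWidth = sizeof(DrawArgs);
desc.Usage = D3D11_USAGE_DEFAULT;
desc.BindFlags = D3D11_BIND_UNORDERED_ACCESS;
desc.MiscFlags = D3D11_RESOURCE_MISC_DRAWINDIRECT_ARGS | D3D11_RESOURCE_MISC_BUFFER_ALLOW_RAW_VIEWS;

D3D11_SUBRESOURCE_DATA init = {};
init.pSysMem = &initial;

ID3D11Buffer* argsBuffer = nullptr;
device->CreateBuffer(&desc, &init, &argsBuffer);

// ... dispatch the culling compute shader that fills InstanceCount ...
// ... bind the instanced mesh's buffers, shaders, and render state ...

context->DrawInstancedIndirect(argsBuffer, 0);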

This means the scenes you'd render this way must either rely on heavy instancing, issuing a draw with every set of buffers/textures for each batch of instanced, potentially occluded objects, or on a system that allows a great deal of indirection for accessing buffers and textures.  For an engine like Unreal, which has such a flexible material system, the first option seems the most likely path, while CryEngine, with its more controlled material system, could go either way.

The basic plan, from the GPU's perspective, is to render some likely occluders for this frame, fill in the gaps using a reprojection of last frame's data, and do the occlusion testing against that.

Compute based occlusion

The folks over at Guerrilla Games recently released a paper talking about the occlusion system they made for Horizon Zero Dawn.  What they're doing is basically the reprojected depth visibility testing, but using async compute on the GPU instead of jobs.  This makes a lot of sense, as the input data is rendering data to begin with.  It strikes me as an interesting approach, but I wonder how well the async compute fits into the frame's render time, and whether it would be a viable approach for engines that target PCs.

Sunday, July 12, 2015

Job systems

Video game and threading
For the longest time, video games pretty much only had a single core to work with.  As far as I'm aware, the Sega Saturn was the first console to come with multiple programmable cores for games to use.  PCs caught up with dual cores in the early 2000s, but it wasn't until the last console generation that multiple cores played a much bigger role.

Most engines have two primary threads, a render thread, and an engine thread.  In addition, there will be several other threads, such as audio threads, streaming threads, and possibly a bunch of other specialized threads, but they rarely use very much of the cores they run on.

The goal of a job system is to take portions of the update and move them from the engine or render thread, and get them to execute on those other, less utilized cores.

Jobs
Right, so a job represents a bit of code.  Since I'm not trying to support the PS3, I can stick to very generic code to create and manage jobs.  For me, a job is just a basic virtual class with an execute function, so it looks like this:


#include <atomic>
#include <cstdint>

class Job
{
public:
  virtual void Execute() = 0;
  virtual ~Job() = default;
  // By default jobs are deleted after execution; call this to opt out.
  void UnsetAutoFree() { m_flags |= 0x80000000; }
private:
  std::atomic<uint32_t>* m_modifiedOnCompletion = nullptr;
  uint32_t m_flags = 0;
  friend class JobManager;
};

The job manager then allows you to add a job, be notified when the job is completed, and help assist with job execution.  A very simple job manager might look like this:


class JobManager
{
public:
  // _queueSize is the power-of-two exponent passed through to the queue.
  JobManager(uint32_t _queueSize) : m_jobList(_queueSize) {}
  bool TryAddJob(Job* _job, std::atomic<uint32_t>* _toMod = nullptr);
  void AddJob(Job* _job, std::atomic<uint32_t>* _toMod = nullptr)
  {
    // Busy-wait until the queue has room.
    while (!TryAddJob(_job, _toMod));
  }
  bool ProcessAJob();
private:
  Lockless::MPMCQueue<Job> m_jobList;
};

Then in the .cpp, we've got the implementation that looks like this:

bool JobManager::TryAddJob(Job* _job, std::atomic<uint32_t>* _toMod)
{
  if (_toMod)
  {
    _toMod->fetch_add(1);
  }
  _job->m_modifiedOnCompletion = _toMod;
  bool pushed = m_jobList.Push(_job);
  if (!pushed && _toMod)
  {
    // The queue was full; roll the counter back so no one waits on this job.
    _toMod->fetch_sub(1);
  }
  return pushed;
}

bool JobManager::ProcessAJob()
{
  Job* job = m_jobList.Pop();
  if (!job)
  {
    return false;
  }
  job->Execute();
  if (job->m_modifiedOnCompletion)
  {
    job->m_modifiedOnCompletion->fetch_sub(1);
  }
  if ((job->m_flags & 0x80000000) == 0)
  {
    // Auto-free is still enabled for this job.
    delete job;
  }
  return true;
}

Now, any thread that wants to help with the work can just call ProcessAJob and assist with executing jobs.

This provides a nice, basic setup for submitting jobs.  We're using the lockless queue so that multiple threads can push and pop jobs at the same time.
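A quick usage sketch, with a hypothetical FrustumCullJob and a jobManager instance assumed: the caller adds a batch of jobs tied to an atomic counter, then helps drain the queue instead of blocking.

struct FrustumCullJob : public Job
{
  void Execute() override
  {
    // cull a chunk of bounds here
  }
};

std::atomic<uint32_t> outstanding{0};
for (int i = 0; i < 8; ++i)
{
  jobManager.AddJob(new FrustumCullJob(), &outstanding);
}
// Help out instead of blocking until every job has reported completion; if the
// queue is empty but jobs are still running elsewhere, ProcessAJob just returns false.
while (outstanding.load() != 0)
{
  jobManager.ProcessAJob();
}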

Improving job control
Currently, if you've got a series of tasks that you need to ensure complete before kicking off more jobs, you'd have to add the jobs and then wait for the atomic value to return to 0, helping out by calling ProcessAJob.  That works, but it's not the best.  I used that for a while, but adding the ability to automatically run custom code, or to add jobs to the queue on the thread that just completed the last operation in a group, can greatly simplify job sequences.

For this, I create a JobGroup, which will store a list of Jobs, and that can be added to a JobManager.


class Job
{
  <...>
private:
  class JobGroup* m_owningGroup;
};

class JobGroup
{
public:
  virtual ~JobGroup() = default;
  // Called on the thread that finishes the last job in the group.
  virtual void Finalize() = 0;
  void AddJob(Job* _job) { m_jobs.push_back(_job); }
  // By default groups are deleted after Finalize; call this to opt out.
  void UnsetAutoFree() { m_flags |= 0x80000000; }
private:
  std::vector<Job*> m_jobs;
  std::atomic<uint32_t> m_jobCount;
  std::atomic<uint32_t>* m_toFinalize;
  uint32_t m_flags = 0;
  friend class JobManager;
};

class JobManager
{
  <...>
  void AddJobGroup(JobGroup* _group, std::atomic<uint32_t>* _toMod = nullptr);
};

Then, in the JobManager code...

void JobManager::AddJob(Job* _job, std::atomic<uint32_t>* _toMod)
{
  _job->m_owningGroup = nullptr;
  <...>
}

void JobManager::AddJobGroup(JobGroup* _group, std::atomic<uint32_t>* _toMod)
{
  _group->m_jobCount = static_cast<uint32_t>(_group->m_jobs.size());
  _group->m_toFinalize = _toMod;
  if (_toMod)
  {
    _toMod->fetch_add(1);
  }
  for (auto job : _group->m_jobs)
  {
    job->m_owningGroup = _group;
    job->m_modifiedOnCompletion = nullptr;
    // Busy-wait until the queue accepts the job.
    while (!m_jobList.Push(job))
    {
    }
  }
}

bool JobManager::ProcessAJob()
{
  Job* job = m_jobList.Pop();
  if (!job)
  {
    return false;
  }
  job->Execute();
  if (job->m_modifiedOnCompletion)
  {
    job->m_modifiedOnCompletion->fetch_sub(1);
  }
  JobGroup* group = job->m_owningGroup;
  if (group && group->m_jobCount.fetch_sub(1) == 1)
  {
    group->Finalize();
    if (group->m_toFinalize)
    {
      group->m_toFinalize->fetch_sub(1);
    }
    if ((group->m_flags & 0x80000000) == 0)
    {
      delete group;
    }
  }
  if ((job->m_flags & 0x80000000) == 0)
  {
    delete job;
  }
  return true;
}

So what happens is: when a job that has an owning group is executed, it decrements the group's job count.  Once that count hits 0, the group's Finalize is called, and then the atomic that was associated with the group when it was added gets decremented.  The Finalize hook allows a sequence of operations to immediately start its next stage on the thread that finished the last job, without waiting for another thread to come around and notice the group is done.
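For example, a group that kicks off a sort once all the culling is done might look like this; CullJob and SortJob are hypothetical stand-ins, and jobManager is assumed to exist.

struct CullJob : public Job { void Execute() override { /* cull a chunk */ } };
struct SortJob : public Job { void Execute() override { /* sort the visible list */ } };

class CullThenSortGroup : public JobGroup
{
public:
  CullThenSortGroup(JobManager* _manager) : m_manager(_manager) {}
  void Finalize() override
  {
    // All culling jobs are done; queue the sort immediately on this thread's
    // completion path instead of waiting for a coordinating thread.
    m_manager->AddJob(new SortJob());
  }
private:
  JobManager* m_manager;
};

CullThenSortGroup* group = new CullThenSortGroup(&jobManager);
for (int i = 0; i < 8; ++i)
{
  group->AddJob(new CullJob());
}
std::atomic<uint32_t> outstanding{0};
jobManager.AddJobGroup(group, &outstanding);
// 'outstanding' drops back to 0 once the whole group has finalized.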

With this, we've got a good start, but this post was getting long and was long in the making.  Next time I'll post about thread spawning, job and thread specialization, and attaching a limited access resource to those specializations so that you can guarantee there will never be more than a given number of those jobs in flight.

Monday, June 8, 2015

Multi Producer, Multi Consumer queues

Last time we implemented a Single Producer, Single Consumer queue.  Now we'll look at expanding that to multiple producers and multiple consumers.

Why does our previous example fail with Multiple Producers and consumers?
So the old code fails with multiple consumers.  But why?  Let's look at what would happen with multiple consumers:
Assume we have two items to consume, with threads A and B both trying to pop at the same time.  It's possible that they both read the same location; if they do, they both return the same item - a double pop.  We also have problems with the increment: a thread could get paused after reading the location, another thread could pop off several values, and then the original thread resumes and sets the read location back.

Producers have similar issues of losing writes and possibly destroying ordering of the data.

Dealing with Multiple Consumers
So how do we deal with this?  Well, first we need to make sure only one consumer gets each item.  This requires converting the T* array into an array of std::atomic<T*>.  Then when you go to pop an entry, you read the value and compare-and-swap it for a nullptr.  If that succeeded, you know you own the result.  Then increment the reader location and return what you found.  It looks like this:

 T* Pop()
 {
  do
  {
   uint32_t location = m_readerLoc;
   if (location == m_writerLoc)
   {
    return nullptr;
   }

   T* found = m_data[location & m_count];
   if (m_data[location & m_count].compare_exchange_strong(found, nullptr))
   {
    m_readerLoc++;
    return found;
   }
  } while (true);
 }

So, that works, right?

Accidental locks
The code as presented introduces an accidental lock.  The situation happens if a thread succeeds at the compare_exchange but then gets paused before executing the increment.  Until that thread resumes execution, all other consumers are blocked.  In general, any time you update shared state based on the result of an atomic compare, you have to be careful about running into this situation.

So how do we fix this?  What I do is I make it so that we increment the location independent of if we succeed or fail.  Like this:

 T* Pop()
 {
  do
  {
   uint32_t location = m_readerLoc;
   if (location == m_writerLoc)
   {
    return nullptr;
   }

   T* found = m_data[location & m_count];
   bool success = false;
   if (found)
   {
    success = m_data[location & m_count].compare_exchange_strong(found, nullptr);
   }
   m_readerLoc.compare_exchange_strong(location, location + 1);
   if (success)
   {
    return found;
   }
  } while (true);
 }

With this, if we have already read the value at the location, or if someone else read it, we try to move on to the next slot.  Also note this isn't a blind increment: many readers may fail to read the value at the same time, and we only want one of them to advance the read location.

For the Push side of things, it looks like this:
 
 bool Push(T* _value)
 {
  do
  {
   uint32_t location = m_writerLoc;
   if (m_readerLoc + m_count < location)
   {
    return false;
   }

   T* found = nullptr;
   bool success = m_data[location & m_count].compare_exchange_strong(found, _value);
   m_writerLoc.compare_exchange_strong(location, location + 1);

   if (success)
   {
    return true;
   }
  } while (true);
 }

So now we have defeated the rare lock, the code appears to work, so everything's good, yes?

ABA problem part 2
We've got an ABA problem, and it's kind of a weird one.  I originally used this setup for a while as the job queue in my personal engine, and it worked until I started generating many jobs at the same time from several threads as I was consuming.

The problem shows up when you have three or more threads all accessing the queue at the same time.  Specifically, threads A and B are both pushing to the queue while thread C is popping.

Thread A reads from m_writerLoc and then gets paused.  Thread B then performs a push operation to completion.  Thread C then pops that value and returns.  Thread A then resumes, sees the nullptr, and thinks it's safe to push.  However, the queue has already moved past that slot as a valid write location, so the push is lost.

So what we need to do, rather than writing back nullptr, is write back an ever changing value.  So what value should we use?  What I normally opt for is the current location, shifted left by one with the lowest bit set, so we know it's a marker rather than something that was pushed.  We'll also want to replace the std::atomic<T*> array with an std::atomic<uintptr_t> array, since we'll be accessing the data as a numeric value.

So, when popping, we read the value and make sure the lowest bit isn't set.  If it isn't (which indicates this is a pointer we pushed), we compare and swap it with our special marker, and if that succeeds, increment the offset and return the value we read.

The code looks like this:

 
 T* Pop()
 {
  do
  {
   uint32_t location = m_readerLoc;
   if (location == m_writerLoc)
   {
    return nullptr;
   }

   uintptr_t found = m_data[location & m_count];
   uintptr_t target = (uintptr_t(location) << 1) | 1;
   bool success = false;
   if ((found & 1) == 0)
   {
    success = m_data[location & m_count].compare_exchange_strong(found, target);
   }

   m_readerLoc.compare_exchange_strong(location, location + 1);
   if (success)
   {
    return (T*)found;
   }
  } while (true);
 }

When it comes to pushing, the important thing is figuring out what value we expect to read.  Rather than expecting nullptr, we need to construct the marker we would have written when this slot was last popped.  That looks like this:

 bool Push(T* _value)
 {
  do
  {
   uint32_t location = m_writerLoc;
   if (m_readerLoc + m_count < location)
   {
    return false;
   }

   uintptr_t found = ((uintptr_t(location) - (m_count + 1)) << 1) | 1;
   bool success = m_data[location & m_count].compare_exchange_strong(found, (uintptr_t)_value);
   m_writerLoc.compare_exchange_strong(location, location + 1);

   if (success)
   {
    return true;
   }
  } while (true);
 }

The only remaining step is the constructor, where we need to make sure we initialize the slots to the markers that would have been generated one loop earlier.

 MPMCQueue(uint32_t _count)
 {
  m_count = 1 << _count;
  m_data = new std::atomic<uintptr_t>[m_count];
  for (uint32_t i = 0; i < m_count; ++i)
  {
   // the marker each slot would have held one full wrap before slot i is written
   m_data[i] = ((uintptr_t(i) - m_count) << 1) | 1;
  }
  m_count -= 1;
 }
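For reference, this is roughly how those pieces hang together as a class; the member names follow the snippets above, while the destructor and the zero-initialized read/write offsets are my own assumptions.

 template <class T>
 class MPMCQueue
 {
 public:
  MPMCQueue(uint32_t _count);   // as above
  ~MPMCQueue() { delete[] m_data; }
  bool Push(T* _value);         // as above
  T* Pop();                     // as above

 private:
  std::atomic<uintptr_t>* m_data;
  uint32_t m_count;             // holds the mask (size - 1) after construction
  std::atomic<uint32_t> m_readerLoc{0};
  std::atomic<uint32_t> m_writerLoc{0};
 };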

And that is basically my current lockless queue implementation.  To my knowledge, there's no weakness in it.  I've been using it as the basis for a job system, which has several producers and consumers pushing and popping at the same time, and I've not detected any double pops or failed pushes.  Next I think I'll look at my own job system implementation.  It's by no means the best, but it's got some nice features, and I'd like to think it's fairly straightforward.

Thursday, May 28, 2015

Lockless Queues

Last time, we implemented a lock-less stack.  Let's move on to a slightly more useful structure, a FIFO queue.  But first, we need to talk about how lock-less these data structures are.

Producer consumer counts
When talking about lock-less data structures for transferring data between threads, what matters is the number of producers and consumers expected to use the data structure at the same time.  So far, the lock-less stacks have been a multi-producer, multi-consumer setup, meaning they support multiple threads pushing and popping at the same time.

That was managed relatively easily, but for FIFO queues we're going to start with a weaker requirement to make things easier initially: single producer, single consumer.  This means we expect the queue to be pushed to by only one thread, and popped from by only one other thread.

SPSCQueue - single producer, single consumer queue
Starting this queue, we're going to focus on a fixed sized queue.
The basic design in use here is: when pushing, we write our data to an internal buffer, then increment our write offset.  When popping, we make sure our read offset is different from our write offset, and if it is, we read the value and then increment our read offset.

The code looks like this:

template <class T>
class SPSCQueue
{
public:
 SPSCQueue(uint32_t _powerOfTwo)
  : m_readerLoc(0)
  , m_writerLoc(0)
 {
  m_count = 1 << _powerOfTwo;
  m_data = new T*[m_count];
  m_count -= 1;  // keep the mask (size - 1) for cheap wrapping
 }

 ~SPSCQueue()
 {
  delete[] m_data;
 }

 bool Push(T* _value)
 {
  uint32_t location = m_writerLoc;
  if (m_readerLoc + m_count < location)
  {
   return false;
  }

  m_data[location & m_count] = _value;
  m_writerLoc++;
  return true;
 }

 T* Pop()
 {
  uint32_t location = m_readerLoc;
  if (location == m_writerLoc)
  {
   return nullptr;
  }

  T* result = m_data[location & m_count];
  m_readerLoc = location + 1;
  return result;
 }

private:
 T** m_data;
 uint32_t m_count;
 std::atomic<uint32_t> m_readerLoc;
 std::atomic<uint32_t> m_writerLoc;
}; 

A note on the count:  I require a power of 2 size to avoid expensive divides, so I have the user pass in the desired power of two.  Also, Push can fail!  Be sure to handle that situation.  If I expect it to never fail, I'll use a verify macro, which asserts on failure in debug builds; a sketch of such a macro follows below.
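A sketch of the kind of verify macro I mean (the exact shape and the VERIFY name are my own assumptions): the expression is evaluated in every build, and debug builds assert when it's false.

#include <cassert>

#if defined(NDEBUG)
#define VERIFY(expr) ((void)(expr))
#else
#define VERIFY(expr) assert(expr)
#endif

// usage: the push still happens in release builds, we just lose the check
// VERIFY(queue.Push(job));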

Do we need to use atomics here?
Notice that, with how our code is written, we shouldn't have any multithreaded contention.  We write the value to m_data, and then increment the writer location.  When reading, if the reader and writer locations are different, there must be a value in m_data at the location we're reading from, right?

Right?

 

Memory access ordering
So there are two ways this would fail without using the C++ atomics as we're using them.

Both of the ways are due to memory access ordering.  For C++11 (and ARM and PPC), all memory accesses will appear from a single thread's perspective as consistent.  Specifically, if you write to an address, and then read from that address on the same thread, you'll always get the value you wrote.

But when dealing with multiple threads, we can run into issues.  For example, under the C++11 memory model the optimizer is allowed to reorder those writes.  In our case, if we weren't protecting m_writerLoc with atomics, the optimizer could change the code so that m_writerLoc gets written before m_data does, which would introduce a rare double pop, but only in optimized builds.

On CPUs with weak memory models (ARM, PPC and others) we end up with the same double pop issue even in debug builds.  This is because the CPU will queue up reads and writes to complete once their cache line is available.  So even in an unoptimized build, the CPU can execute the write of the value to the m_data array and then the m_writerLoc modify; but if the address of m_writerLoc is already in cache and m_data isn't, the m_writerLoc write will become visible before the m_data write does, and the same double-pop issue can show up.


Platform specific fixes

When dealing with the optimizer issue, we need to insert a compiler barrier.  This makes it so that the optimizer cannot move memory operations across where that barrier was.  This fixes the issue where the optimizer could rearrange the writes.

When dealing with the CPU issue, these platforms include specialized instructions to ensure specific memory operations complete in the right order.  There are typically two notable ordering instructions made available: one to drain the read buffer, and one to drain both the read and write buffers.  The compiler should recognize the intrinsic version of the instruction as a compiler barrier as well.  With this, you'd write the pointer out, execute the write sync instruction, and then update the offset.
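A sketch of what these barriers can look like with GCC/Clang-style inline assembly; the exact intrinsics vary per compiler and platform, so treat these as illustrative only.

// stops the optimizer from moving memory operations across this point
#define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")

#if defined(__aarch64__) || defined(__arm__)
// data memory barrier: earlier memory accesses complete before later ones
#define WRITE_SYNC() __asm__ __volatile__("dmb ish" ::: "memory")
#else
// x86's strong store ordering only needs the compiler barrier for this pattern
#define WRITE_SYNC() COMPILER_BARRIER()
#endif

// Push written with explicit barriers instead of relying on std::atomic:
//   m_data[location & m_count] = _value;
//   WRITE_SYNC();   // make sure the payload write is visible first
//   m_writerLoc++;  // then publish the new write offset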


C++11 atomic memory models

The C++11 atomic system allows you to specify a memory access order for the atomics.  The important orders are as follows:
  • Relaxed - No memory ordering is expected, we only want the atomicness of this action.
  • Acquire - No memory operations may appear to be reordered before this load.
  • Release - No memory operations may appear to be reordered to after this store.
  • Acquire/release - Used for read-modify-write operations, no memory operations may appear to be reordered around this atomic operation assuming it succeeded.
  • Sequentially Consistent - The default; behaves like acquire/release, and additionally all sequentially consistent operations appear in a single total order across threads.
So we could use these orders to optimize our SPSC queue.  In our Push, we could replace
m_writerLoc++;
with
m_writerLoc.fetch_add(1, std::memory_order_release);

and then in Pop, the acquire belongs on the load of m_writerLoc in the empty check, so replace
if (location == m_writerLoc)
with
if (location == m_writerLoc.load(std::memory_order_acquire))

The loads of a thread's own counter (m_readerLoc in Pop, m_writerLoc in Push) can be relaxed, since only that thread ever writes them, and the same release/acquire pairing applies to the reader publishing m_readerLoc and the writer checking it, as in the sketch below.  This probably won't affect how the code gets generated for x86, due to the strong memory model that x86 implements, however both ARM and PPC should generate better code.
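Putting it together, a sketch of the whole Push/Pop pair with explicit orderings; the release operations publish the writes that came before them, and the acquire loads of the other thread's counter keep the payload access from being reordered ahead of the check.

 bool Push(T* _value)
 {
  uint32_t location = m_writerLoc.load(std::memory_order_relaxed);  // only we write this
  if (m_readerLoc.load(std::memory_order_acquire) + m_count < location)
  {
   return false;
  }

  m_data[location & m_count] = _value;
  m_writerLoc.fetch_add(1, std::memory_order_release);  // publish the slot
  return true;
 }

 T* Pop()
 {
  uint32_t location = m_readerLoc.load(std::memory_order_relaxed);  // only we write this
  if (location == m_writerLoc.load(std::memory_order_acquire))
  {
   return nullptr;
  }

  T* result = m_data[location & m_count];
  m_readerLoc.store(location + 1, std::memory_order_release);  // release the slot
  return result;
 }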

Next time we'll modify the code to support multiple consumers, and multiple producers.

Saturday, May 2, 2015

Non-intrusive lockless stack


Issues with Intrusive Lock-less structures
So last time, we implemented an intrusive lock-less stack.  But there are some pretty strong limitations when it comes to intrusive lock-less structures.  Primarily, objects cannot be deleted so long as the lock-less structure is in use.  There's a large set of situations where this can be safe - specifically, systems that maintain a pool of objects that can be popped from and pushed to by multiple threads.

But let's implement a more general case lock-less stack that doesn't have this weakness.

Fixed size Lock-less stack
 
template <class T>
class FixedSizeLocklessStack
{
 public:
 FixedSizeLocklessStack(uint32_t count)
 {
  m_buffer = new StackData[count];
  for (uint32_t i = 0; i < count; ++i)
  {
   m_freeSlots.Push(&m_buffer[i]);
  } 
 }

 ~FixedSizeLocklessStack()
 {
  delete[] m_buffer; 
 }

 bool Push(T* _value)
 {
   StackData* slot = m_freeSlots.Pop();
   if (!slot)
   {
    return false;
   }
   slot->pointer = _value;
   m_valueStack.Push(slot);
   return true;
 }

 T* Pop()
 {
   StackData* value = m_valueStack.Pop();
   if (!value)
   {
    return nullptr;
   }
   T* retValue = value->pointer;
   m_freeSlots.Push(value);
   return retValue;
 }

private:
 struct StackData
 {
  std::atomic<StackData*>& GetIntrusiveMember() { return intrusiveValue; }
  std::atomic<StackData*> intrusiveValue;
  T* pointer;
 };
 LocklessStack<StackData> m_freeSlots;
 LocklessStack<StackData> m_valueStack;
 StackData* m_buffer;
};
So this is a relatively simple extension on top of the base LocklessStack.  The stack itself owns and maintains the nodes that point to the data being pushed and popped.  This means the data being pushed can be managed without worrying about whether other threads are still accessing the stack's internal nodes.

Performance of this stack may be improved by forcing the StackData to be cache aligned.  This way each thread can avoid stomping on the others' cache lines.
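One way to get that alignment with C++11 is alignas on the node type from the class above, assuming a 64 byte cache line; note that operator new[] isn't guaranteed to honor over-alignment until C++17, so a real implementation may need an aligned allocation as well.

 struct alignas(64) StackData
 {
  std::atomic<StackData*>& GetIntrusiveMember() { return intrusiveValue; }
  std::atomic<StackData*> intrusiveValue;
  T* pointer;
 };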

But why fixed sized?
This is implemented as a fixed sized stack, but the internal stack data being managed acts as a pointer, so surely we could just allocate a new slot instead of returning false when pushing, right?

The problem with that, though, is that we'd lose the lock-free guarantee.  This happens for any memory allocation: nearly all allocators that support multiple threads require locks, and even allocators designed to avoid them (such as tcmalloc or a mostly lock-free allocator) fall back to locks under some conditions.  Because of this, when working with lock-less structures, I intentionally avoid automatic memory management.

But expanding this into an expandable lock-less stack is straightforward.

Expandable lock-less stack

The main change is that rather than allocating a single large buffer, we allocate a bunch of small entries, and when pushing, if we fail to pop a free slot, we allocate a new one on the spot.  Then in the destructor we delete all of the allocated entries.

template<class T>
class ExpandableLocklessStack
{
public:
 ExpandableLocklessStack(uint32_t _count)
 {
  for (uint32_t i = 0; i < _count; ++i)
  {
    m_freeSlots.Push(new StackData());
  }
 }

 ~ExpandableLocklessStack()
 {
  StackData* data;
  while ((data = m_freeSlots.Pop()) != nullptr)
  {
   delete data;
  }
  while ((data = m_valueStack.Pop()) != nullptr)
  {
   delete data;
  }
 }

 void Push(T* _data)
 {
  StackData* value = m_freeSlots.Pop();
  if (!value)
  {
   value = new StackData();
  }
  value->pointer = _data;
  m_valueStack.Push(value);
 }

 T* Pop()
 {
  StackData* value = m_valueStack.Pop();
  if (!value)
  {
   return nullptr;
  }
  T* retValue = value->pointer;
  m_freeSlots.Push(value);
  return retValue;
 }

private:
 struct StackData
 {
  std::atomic<StackData*>& GetIntrusiveMember() { return intrusiveValue; }
  std::atomic<StackData*> intrusiveValue;
  T* pointer; 
 };
 LocklessStack<StackData> m_freeSlots;
 LocklessStack<StackData> m_valueStack;
};

Load-Link/Store-Conditional Lock-less Stack
Last time, I indicated that a Load-Link/Store-Conditional doesn't need tagging.  Let's take a quick look at what this might look like.

I'll be using a made up namespace called platform.  In it are 3 important functions:

//Loads a value from a source location using the load link instruction
template <class T>
T* LoadLink(T** src);

//Stores the value to destination, but only if no other core has written to the cache line the destination is in.
template <class T>
bool StoreConditional(T** dest, T* value);

//Writes a value, and then ensures that it will be visible to other cores before continuing execution.  Also acts as a compiler barrier.
//This is needed due to the platforms that support these instructions also tend to have weaker memory models, which would require synchronization here.
template <class T>
void WriteAndSync(T** dest, T* value);

There is also an important macro, called CACHE_ALIGN.  This adds compiler specific attributes to ensure the member is aligned to a cache line.  This does increase the size of the structure from the size of a pointer to a full cache line, but it's important: if the structure we're pushing ends up in the same cache line as the stack's head, we would never be able to push that value to the stack.

struct LocklessTester
{
 LocklessTester*& GetIntrusiveMember() { return intrusiveValue; }
 LocklessTester* intrusiveValue;
 uint32_t a;
 uint32_t b;
};

template <class T>
class LocklessStack
{
public:
 LocklessStack() : m_head(nullptr) {} 
 void Push(T* _value)
 {
  while (true)
  {
   T* head = platform::LoadLink(&m_head);
   platform::WriteAndSync(&_value->GetIntrusiveMember(), head);
   if (platform::StoreConditional(&m_head, _value))
   {
    return;
   }
  }
 }
 T* Pop()
 {
  while (true)
  {
   T* head = platform::LoadLink(&m_head);
   if (!head)
   {
    return nullptr;
   } 
   T* nextHead = head->GetIntrusiveMember();
   if (platform::StoreConditional(&m_head, nextHead))
   {
    return head;
   }
  }
 }
private:
 CACHE_ALIGN(T*, m_head);
}; 
 
With this, we've mainly gone back to our original setup, however now we can be confident that our stack will be completely safe from ABA, not just partly, because we know that no one wrote to the cache line that contains our m_head value after we loaded from it.

Please note, this is currently untested due to my not having setup the Android NDK for testing the ARM version of this.

Friday, April 24, 2015

Playing with fire: C++11 Atomics and lockless data structures.


After watching several C++ conference videos, and having worked with (and had issues with) some questionably implemented atomic algorithms, I've decided I need to learn these better.

So I shall write them myself. After all, how better to learn than to do.

Non-blocking algorithms
According to Wikipedia, there are 3 major non-blocking algorithm types:
1. Wait-free.  This is where there is a known maximum cost for accessing the shared resource.  Hardest to implement.  Most of the implementations I've seen depend on thread local storage, and fall back on a lock-free or obstruction-free path under specific conditions.  A good example of this would be Google's TCMalloc small allocator, which uses thread local storage to avoid locking unless memory availability requires accessing a central store.
2. Lock-Free.  This is where one thread will always succeed.  It must be impossible for thread scheduling to produce a situation in which no thread can make progress.  This is what I'll be focusing on.
3. Obstruction-Free.  A single thread takes control of the shared resource at a time, and the other threads must wait.  Normally this is implemented as a critical section or as a spin lock (a minimal example follows below).  Without special protection, it's possible for threads to be deadlocked on a single critical section, however most operating systems have code to protect against this.
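For reference, the spin lock mentioned in the last point can be as small as this; a real one would add backoff and eventually hand the thread back to the OS.

#include <atomic>

class SpinLock
{
public:
 void Lock()
 {
  // keep trying to flip the flag from false to true
  while (m_locked.exchange(true, std::memory_order_acquire))
  {
   // spin; a real implementation would pause or yield here
  }
 }
 void Unlock() { m_locked.store(false, std::memory_order_release); }
private:
 std::atomic<bool> m_locked{false};
};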

Basic Lock-less Stack design
A stack is one of the first basic data structures people use.  It also makes for a good starting point for lock-free design, so we'll start with that.

I shall provide a basic lock-less stack and then discuss the implementation.  Just to clarify, this initial implementation is meant to be a simple example, and is neither optimal, nor correct, and we'll be talking about what's wrong and how to fix it.

struct LocklessTester
{
 std::atomic<LocklessTester*>& GetIntrusiveMember() { return intrusiveValue; }
 std::atomic<LocklessTester*> intrusiveValue;
 uint32_t a;
 uint32_t b;
};

template <class T>
class LocklessStack
{
public:
 LocklessStack() : m_head(nullptr) {} 
 void Push(T* _value)
 {
  T* head = m_head;
  while (true)
  {
   _value->GetIntrusiveMember() = head;
   if (m_head.compare_exchange_strong(head, _value))
   {
    return;
   }
  }
 }
 T* Pop()
 {
  T* head = m_head;
  while (true)
  {
   if (!head)
   {
    return nullptr;
   }
   T* nextHead = head->GetIntrusiveMember();
   if (m_head.compare_exchange_strong(head, nextHead))
   {
    return head;
   }
  }
 }
private:
 std::atomic<T*> m_head;
};

So, how does this work.

This stack follows a very basic lock-less concept of constructing an updated state, and then applying the state in a single atomic update.  This design is one of the best designs, as it is pretty easy to prove correct.

So what we do is write the current head value to the intrusive member provided by the object we're pushing (intrusiveValue in this case).  After doing so, we compare m_head against the original head, and if it's the same, update it with our new element.  If they don't match (which happens if someone else has pushed or popped this stack), we re-run the logic with the new head value.

Popping follows similar logic.  It first reads from the intrusive value what the next head will be, then performs a compare and swap to install that next head, and returns the head we first read if the compare and swap succeeded.

So what's wrong
Lock-less programming that uses the compare_exchange operations needs to be careful with regard to the ABA problem.

So, does this stack have this issue?  And if so, how do we detect this?

What we have to evaluate is whether, between reading the state and performing the compare_exchange_<strong/weak>, the underlying state could have changed in a way the compare doesn't detect.

So first, let's consider the push logic.  We write the head to the intrusive member, and then compare and exchange the head member against the value we just wrote.  Because the value being compared is exactly the value we depend on, we don't have an ABA problem.

But what about Pop.  Does that need it?  If we've modified the stack after reading nextHead from head->GetIntrusiveMember(), the value we've already read may no longer be correct!  So Pop requires ABA protection.

The situation where Pop fails looks like this:
Thread 0:
Push A
Push B
Pop B, but get interrupted before the compare_exchange.
Thread 1 starts executing:
Pop B
Pop A
Push C
Push B.
Thread 1 stops, Thread 0 continues.
At this point, nextHead points to A, but m_head still has the value B, so the compare and swap succeeds and sets m_head to A.  C is now lost from the stack, and a later Pop will incorrectly return A - which thread 1 already popped - instead of C!

So we clearly have the possibility of having the ABA problem, so how do we fix that.

Avoiding the ABA problem

Some platforms can completely avoid this by using Load-link/store-conditional.  What this means is that the CPU is able to perform a load, and then store only if no one else has stored to that address in the meantime.  Platforms that support this are PowerPC (last-gen consoles and the Wii U) and recent ARM CPUs (most smart phones and modern hand-helds).  With these pairs of instructions, it's possible to completely avoid the ABA problem!  If you're working with those platforms, you should implement a load-link/store-conditional lock-less stack rather than relying on an atomic compare and swap.  This is especially important because some of these platforms actually lack the ability to perform a double wide atomic operation - atomically modifying 8 bytes on 32 bit systems, or 16 bytes on 64 bit systems.

Sticking to what is exposed in C++11, however, the most common method to strengthen our defense against ABA is to tag our data with some unique marker.  Note, this does not fix it, it just makes a rare issue more rare, hopefully so rare we never see it.

Data tagging

So for basic tagging, what is normally done is to include, in the data we compare and swap, a value that changes every time.  With this, we can have increased confidence that the value has not changed since we read it.

For x86-32, since the Pentium era, there's been an 8 byte compare and exchange instruction that executes atomically - lock cmpxchg8b.  This means 32 bit x86 supports something called double wide atomic operations.  This is how the C++ atomic library should implement 8 byte atomic operations in 32 bit builds.  Using it, we can put a full 4 bytes of tag data alongside the pointer.

64 bit isn't as simple.  There is a 16 byte compare exchange (cmpxchg16b) available on all but the earliest 64 bit AMD processors.  Unless you absolutely must support every processor from the last 13 years, you can safely assume it's available.

But if you create a 16 byte value, wrap it in std::atomic, and look at the assembly, you'll see that MSVC at least will not generate that instruction.  Instead it adds an 8 byte lock variable and implements a spin lock on it.  The libstdc++ 4.6 atomic header similarly lacks a 16 byte lock-free type.

So what can we do?  Well, we can either use compiler specific intrinsics to perform the 16 byte operation, or we can take advantage of the fact that on x86-64 we don't actually have a full 64 bits of virtual address space to hold pointers.  We only have 48 bits of virtual address space; the top 17 bits of a canonical address all match each other (the address is sign-extended from bit 47).  Most OS implementations dedicate the high range (where those bits are all 1s) to the OS, so for a user-space lock-less stack we could use the top 17 bits if we so choose.

Which I have.

Data Tagged Lock-less Stack
The big change is that rather than the stack storing a raw head pointer, it now stores a uint64_t that packs the pointer and the tag together.  We also implement a helper struct named ABAProt to hide the difference between 32 and 64 bit.  The new class looks like this:

template <class T>
class LocklessStack
{
    template <unsigned char>
    struct ABAProt {
    };

    template <>
    struct ABAProt<4U> {
        static const uint64_t markerBits = 0xFFFFFFFF00000000ULL;
        static const uint64_t markerAdd = 0x0000000100000000ULL;
    };
    template <>
    struct ABAProt<8U> {
        static const uint64_t markerBits = 0xFFFF800000000000ULL;
        static const uint64_t markerAdd = 0x0000800000000000ULL;
    };

    typedef ABAProt<sizeof(void*)> Config;

public:
    LocklessStack() : m_head(0) {}

    void Push(T* _value)
    {
        assert((uint64_t(_value) & Config::markerBits) == 0);
        uint64_t head = m_head;
        while (true)
        {
            uint64_t marker = (head + Config::markerAdd) & Config::markerBits;
            uint64_t value = uint64_t(_value) | marker;
            T* headItem = (T*)(head & ~Config::markerBits);
            _value->GetIntrusiveMember() = headItem ;
            if (m_head.compare_exchange_strong(head, value))
            {
                return;
            }
        }
    }
    T* Pop()
    {
        uint64_t head = m_head;
        while (true)
        {
            T* Ptr = (T*)(head & ~Config::markerBits);
            if (!Ptr)
            {
                return nullptr;
            }

            uint64_t nextHead = (uint64_t)(Ptr->GetIntrusiveMember().load());
            uint64_t marker = (head + Config::markerAdd) & Config::markerBits;
            nextHead = nextHead | marker;
            if (m_head.compare_exchange_strong(head, nextHead))
            {
                return Ptr;
            }
        }
    }
private:
    std::atomic<uint64_t> m_head;
};

So how does this work?  When you go to push a value, we first make sure the data we're pushing has no bits set in the markerBits region.  Then, in the loop, we generate the next marker by taking the head, adding the modifier, and then anding with the mask.  This gives us a constantly changing marker relative to the last head.  We OR that onto the pointer we were given, and that's the value we'll swap in.

When popping, we also generate a new marker, to ensure that pops and pushes both modify the marker state.

So we're good now, right?
Mostly.  There's a specific limitation you have to keep in mind when using intrusive pointers and lock-less data structures.

Do not free right after popping.

The reason for this is: what if, after loading the head value but before the call to GetIntrusiveMember, the thread gets paused?  Then another thread takes over, pops the value, and deletes it.  Then the original thread resumes.  It's now reading from deleted memory.  At best, the memory is still mapped and you read garbage that gets thrown away when the compare and swap fails.  At worst, the OS has deallocated that page, and you'll trigger a page fault killing your program.

In a follow up post I'll give an example load-link/store-condition implementation, and implement a non-intrusive stack.  Then I'll move onto a lock-less queue, after which I'll touch on the worker system I implemented using it for my hobby game engine.