Sunday, July 12, 2015

Job systems

Video games and threading
For the longest time, video games pretty much only had a single core to work with.  As far as I'm aware, the Sega Saturn was the first console to ship with multiple programmable cores for games to use.  PCs caught up with dual cores in the early 2000s, but it wasn't until the last console generation that multiple cores started playing a much bigger role.

Most engines have two primary threads: a render thread and an engine thread.  In addition, there will be several other threads, such as audio threads, streaming threads, and possibly other specialized threads, but they rarely use much of the cores they run on.

The goal of a job system is to take portions of the update and move them from the engine or render thread, and get them to execute on those other, less utilized cores.

Jobs
Right, so a job represents a bit of code.  Since I'm not focusing on supporting the PS3, I can use very generic code to create and manage jobs.  For me, a job is just a basic virtual class with an Execute function, so it looks like this:


class Job
{
public:
  virtual void Execute() = 0;
  virtual ~Job() = default;
  // By default the manager deletes a job once it has run; call this to opt out.
  void UnsetAutoFree() { m_flags |= 0x80000000; }
private:
  std::atomic<uint32_t>* m_modifiedOnCompletion = nullptr;
  uint32_t m_flags = 0;
  friend class JobManager;
};

The job manager then allows you to add a job, be notified when the job is completed, and assist with job execution.  A very simple job manager might look like this:


class JobManager
{
public:
  JobManager(uint32_t _queueSize) : m_jobList(_queueSize) {}
  bool TryAddJob(Job* _job, std::atomic<uint32_t>* _toMod = nullptr);
  void AddJob(Job* _job, std::atomic<uint32_t>* _toMod = nullptr)
  {
    while (!TryAddJob(_job, _toMod));
  }
  bool ProcessAJob();
private:
  Lockless::MPMCQueue<Job> m_jobList;
};

Then in the .cpp, we've got the implementation that looks like this:

bool JobManager::TryAddJob(Job* _job, std::atomic<uint32_t>* _toMod)
{
  if (_toMod)
  {
    _toMod->fetch_add(1);
  }
  _job->m_modifiedOnCompletion = _toMod;
  bool pushed = m_jobList.Push(_job);
  if (!pushed && _toMod)
  {
    _toMod->fetch_sub(1);
  }
  return pushed;
}

bool JobManager::ProcessAJob()
{
  Job* job = m_jobList.Pop();
  if (!job)
  {
    return false;
  }
  job->Execute();
  if (job->m_modifiedOnCompletion)
  {
    job->m_modifiedOnCompletion->fetch_sub(1);
  }
  if ((job->m_flags & 0x80000000) == 0)
  {
    delete job;
  }
  return true;
}

Now, any thread that wants to help with the work can just call ProcessAJob.

This provides a nice, basic setup for submitting jobs.  We're using the lockless queue to support multiple threads all pushing and popping jobs all at the same time.
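For example, a pool of worker threads can spend their lives draining the queue.  This is just a sketch of what I mean - the thread startup and shutdown details aren't part of this post, and WorkerLoop is a made-up name:

#include <atomic>
#include <thread>

// Hypothetical worker: keep processing jobs until told to stop.
void WorkerLoop(JobManager& _manager, std::atomic<bool>& _running)
{
  while (_running)
  {
    if (!_manager.ProcessAJob())
    {
      // Queue was empty; give up the core briefly rather than burn it.
      std::this_thread::yield();
    }
  }
}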

Improving job control
Currently, if you've got a series of tasks that must complete before you kick off more jobs, you'd have to add the jobs and then wait for the atomic value to return to 0, helping out by calling ProcessAJob.  That works, but it's not ideal.  I used that approach for a while, but adding the ability to automatically run custom code, or push more jobs, on the thread that just completed the last job in a group can greatly simplify job sequences.
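To make that concrete, here's a rough sketch of that batch-and-wait pattern using the classes above (manager and jobsToRun are assumed to exist):

std::atomic<uint32_t> pending(0);
for (Job* job : jobsToRun)
{
  manager.AddJob(job, &pending);
}

// Don't just block - help drain the queue until the whole batch has finished.
while (pending.load() != 0)
{
  manager.ProcessAJob();
}
// Now it's safe to kick off the next set of jobs.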

For this, I create a JobGroup, which will store a list of Jobs, and that can be added to a JobManager.


class Job
{
  <...>
private:
  class JobGroup* m_owningGroup;
};

class JobGroup
{
public:
  virtual ~JobGroup() = default;
  virtual void Finalize() = 0;
  void AddJob(Job* _job) { m_jobs.push_back(_job); }
  void UnsetAutoFree() { m_flags |= 0x80000000; }
private:
  std::vector<Job*> m_jobs;
  std::atomic<uint32_t> m_jobCount;
  std::atomic<uint32_t>* m_toFinalize;
  uint32_t m_flags;
  friend class JobManager;
};

class JobManager
{
  <...>
  void AddJobGroup(JobGroup* _group, std::atomic<uint32_t>* _toMod = nullptr);
};

Then, in the JobManager code...

void JobManager::AddJob(Job* _job, std::atomic<uint32_t>* _toMod)
{
  _job->m_owningGroup = nullptr;
  <...>
}

void JobManager::AddJobGroup(JobGroup* _group, std::atomic<uint32_t>* _toMod)
{
  _group->m_jobCount = (uint32_t)_group->m_jobs.size();
  _group->m_toFinalize = _toMod;
  if (_toMod)
  {
    _toMod->fetch_add(1);
  }
  for (auto job : _group->m_jobs)
  {
    job->m_owningGroup = _group;
    job->m_modifiedOnCompletion = nullptr;
    while (!m_jobList.Push(job))
    {
    }
  }
}

bool JobManager::ProcessAJob()
{
  Job* job = m_jobList.Pop();
  if (!job)
  {
    return false;
  }
  job->Execute();
  if (job->m_modifiedOnCompletion)
  {
    job->m_modifiedOnCompletion->fetch_sub(1);
  }
  JobGroup* group = job->m_owningGroup;
  if (group && group->m_jobCount.fetch_sub(1) == 1)
  {
    group->Finalize();
    if (group->m_toFinalize)
    {
      group->m_toFinalize->fetch_sub(1);
    }
    if ((group->m_flags & 0x80000000) == 0)
    {
      delete group;
    }
  }
  if ((job->m_flags & 0x80000000) == 0)
  {
    delete job;
  }
  return true;
}

So what happens is, when a job that belongs to a group is executed, it decrements the group's job count.  Once the count hits 0, the thread that finished the last job calls Finalize, and then decrements the atomic that was associated with the group when it was added.  This Finalize function lets a sequence of operations immediately start the next stage, without waiting for some other thread to notice that the previous stage finished and spin the sequence around.
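As an example of what that enables, a group's Finalize can queue the next stage directly.  This is only a sketch - UpdateSceneGroup and BuildDrawListsJob are made-up names for illustration:

class UpdateSceneGroup : public JobGroup
{
public:
  explicit UpdateSceneGroup(JobManager* _manager) : m_manager(_manager) {}

  // Runs on whichever thread finishes the last update job in the group.
  void Finalize() override
  {
    m_manager->AddJob(new BuildDrawListsJob());
  }

private:
  JobManager* m_manager;
};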

With this, we've got a good start, but this post was getting long and was long in the making.  Next time I'll post about thread spawning, job and thread specialization, and attaching a limited access resource to those specializations so that you can guarantee there will never be more than a given number of those jobs in flight.

Monday, June 8, 2015

Multi Producer, Multi Consumer queues

Last time we implemented a Single Producer, Single Consumer queue.  Now we'll look at expanding that to multiple producers and multiple consumers.

Why does our previous example fail with Multiple Producers and consumers?
So the old code fails with multiple consumers.  But why?  Let's look at what would happen with multiple consumers:
Assume we have two items to consume, and threads A and B both try to pop at the same time.  It's possible that they both read the same location, and if they do, we get a double pop.  The increment is also a problem - a thread could read the location, get paused while another thread pops off several values, and then resume and set the read location backwards.

Producers have similar issues of losing writes and possibly destroying ordering of the data.

Dealing with Multiple Consumers
So how do we deal with this?  Well, first we need to make sure only one consumer gets each item.  This requires us to convert the T* array into an array of std::atomic<T*>.  When we go to pop an entry, we read the value and then compare and swap it with nullptr.  If that succeeds, we know we own the result, so we increment the reader location and return what we found.  Looks like this:

 T* Pop()
 {
  do
  {
   uint32_t location = m_readerLoc;
   if (location == m_writerLoc)
   {
    return nullptr;
   }

   T* found = m_data[location & m_count];
   if (m_data[location & m_count].compare_exchange_strong(found, nullptr))
   {
    m_readerLoc++;
    return found;
   }
  } while (true);
 }

So, that works, right?

Accidental locks
The code as presented introduces an accidental lock.  It happens if a thread succeeds with the compare_exchange but then gets paused before executing the increment.  Until that thread resumes execution, all other threads are blocked.  In general, any time you update shared state based on the result of an atomic compare, you have to be careful about running into this situation.

So how do we fix this?  What I do is increment the location whether the compare succeeds or fails.  Like this:

 T* Pop()
 {
  do
  {
   uint32_t location = m_readerLoc;
   if (location == m_writerLoc)
   {
    return nullptr;
   }

   T* found = m_data[location & m_count];
   bool success = false;
   if (found)
   {
    success = m_data[location & m_count].compare_exchange_strong(found, nullptr);
   }
   m_readerLoc.compare_exchange_strong(location, location + 1);
   if (success)
   {
    return found;
   }
  } while (true);
 }

With this, if we have already read the value at the location, or if someone else read it, we try to move on to the next slot.  Also note that this isn't a blind increment - with many readers potentially failing on the same slot, we only want one of them to advance the read location.

For the Push side of things, it looks like this:
 
 bool Push(T* _value)
 {
  do
  {
   uint32_t location = m_writerLoc;
   if (m_readerLoc + m_count < location)
   {
    return false;
   }

   T* found = nullptr;
   bool success = m_data[location & m_count].compare_exchange_strong(found, _value);
   m_writerLoc.compare_exchange_strong(location, location + 1);

   if (success)
   {
    return true;
   }
  } while (true);
 }

So now we have defeated the rare lock, the code appears to work, so everything's good, yes?

ABA problem part 2
We've got an ABA problem, and it's kind of a weird one.  I originally used this setup for a while as the job queue in my personal engine, and it worked until I started generating many jobs at the same time from several threads as I was consuming.

The problem shows up when you have three or more threads all accessing the queue at the same time.  Specifically, threads A and B are both pushing to the queue while thread C is popping.

If thread A reads from m_writerLoc and then gets paused, thread B can perform a push operation to completion.  Thread C then pops that value and returns.  When thread A resumes, it sees the nullptr and thinks it's safe to push.  However, the queue has already moved past that slot as a valid write location, so we've lost the push.

So rather than swapping in nullptr, we need to swap in an ever-changing value.  What value should we push?  What I normally opt for is the current location, shifted left by one with the lowest bit set, so we know it's a marker rather than a pointer that was pushed.  We'll also want to replace the std::atomic<T*> array with an std::atomic<uintptr_t> array, since we'll be treating the slots as numeric values.

So, when popping, we read the value and make sure the lowest bit isn't set (which indicates it's a pointer we pushed).  If it isn't set, we compare and swap the slot with our special key, and if that succeeds, we increment the offset and return the value we read.

The code looks like this:

 
 T* Pop()
 {
  do
  {
   uint32_t location = m_readerLoc;
   if (location == m_writerLoc)
   {
    return nullptr;
   }

   uintptr_t found = m_data[location & m_count];
   uintptr_t target = (uintptr_t(location) << 1) | 1;
   bool success = false;
   if ((found & 1) == 0)
   {
    success = m_data[location & m_count].compare_exchange_strong(found, target);
   }

   m_readerLoc.compare_exchange_strong(location, location + 1);
   if (success)
   {
    return (T*)found;
   }
  } while (true);
 }

When it comes to pushing, the important thing is figuring out what value we expect to find in the slot.  Rather than expecting nullptr, we need to construct the marker that the pop one lap earlier would have written.  That looks like this:

 bool Push(T* _value)
 {
  do
  {
   uint32_t location = m_writerLoc;
   if (m_readerLoc + m_count < location)
   {
    return false;
   }

   uintptr_t found = ((uintptr_t(location) - (m_count + 1)) << 1) | 1;
   bool success = m_data[location & m_count].compare_exchange_strong(found, (uintptr_t)_value);
   m_writerLoc.compare_exchange_strong(location, location + 1);

   if (success)
   {
    return true;
   }
  } while (true);
 }

The only remaining step is the constructor: we need to initialize the slots to the markers that would have been written one lap before.

 MPMCQueue(uint32_t _powerOfTwo)
 {
  m_count = 1 << _powerOfTwo;
  m_data = new std::atomic<uintptr_t>[m_count];
  for (uint32_t i = 0; i < m_count; ++i)
  {
   // Pretend a pop already happened in this slot one lap ago.
   m_data[i] = ((uintptr_t(i) - m_count) << 1) | 1;
  }
  m_count -= 1;
  m_readerLoc = 0;
  m_writerLoc = 0;
 }

And that is basically my current lockless queue implementation.  To my knowledge, there's no weakness in this implementation.  I've been using it as the basis for a job system, with several producers and consumers pushing and popping at the same time, and I've not detected any double pops or lost pushes.  Next I think I'll look at my own job system implementation.  It's by no means the best, but it's got some nice features, and I'd like to think it's fairly straightforward.

Thursday, May 28, 2015

Lockless Queues

Last time, we implemented a lock-less stack.  Let's move on to a slightly more useful structure, a FIFO queue.  But first, we need to talk about how lock-less these data structures are.

Producer consumer counts
When talking about lock-less data structures for transferring data between threads, one important property is the number of producers and consumers expected to use the data structure at the same time.  So far, the lock-less stacks have been a multi-producer, multi-consumer setup, meaning they support multiple threads pushing and popping at the same time.

That was managed relatively easily, but for FIFO queues we're going to start with a weaker requirement to make things easier initially - single producer, single consumer.  This means we expect the queue to be pushed to by only one thread, and popped from by only one other thread.

SPSCQueue - single producer, single consumer queue
Starting this queue, we're going to focus on a fixed sized queue.
The basic design in use here is: when pushing, we write our data to an internal buffer, then increment our write offset.  When popping, we make sure our read offset is different from our write offset, and if it is, we read the value and then increment our read offset.

The code looks like this:

template <class T>
class SPSCQueue
{
public:
 SPSCQueue(uint32_t _powerOfTwo)
 {
  m_count = 1 << _powerOfTwo;
  m_data = new T*[m_count];
  m_count -= 1;
  m_readerLoc = 0;
  m_writerLoc = 0;
 }

 ~SPSCQueue()
 {
  delete[] m_data;
 }

 bool Push(T* _value)
 {
  uint32_t location = m_writerLoc;
  if (m_readerLoc + m_count < location)
  {
   return false;
  }

  m_data[location & m_count] = _value;
  m_writerLoc++;
  return true;
 }

 T* Pop()
 {
  uint32_t location = m_readerLoc;
  if (location == m_writerLoc)
  {
   return nullptr;
  }

  T* result = m_data[location & m_count];
  m_readerLoc = location + 1;
  return result;
 }

private:
 T** m_data;
 uint32_t m_count;
 std::atomic<uint32_t> m_readerLoc;
 std::atomic<uint32_t> m_writerLoc;
}; 

A note on the count:  I require a power of 2 size to avoid expensive divides, so I have the user pass in the desired power of two.  Also, Push can fail!  Be sure to handle that situation.  If I expect it not to fail, I'll use a verify macro, which asserts on the result in debug builds.
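In case it's unclear what I mean by a verify macro, here's a minimal sketch (the name VERIFY is just an illustration): it asserts in debug builds, but still evaluates the expression in release builds so the push isn't compiled away.

#include <cassert>

#ifdef NDEBUG
  #define VERIFY(expr) ((void)(expr))
#else
  #define VERIFY(expr) assert(expr)
#endif

// Usage: VERIFY(queue.Push(job));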

Do we need to use atomics here?
If you notice, we won't have multi-threaded contention with how our code is written.  We write the value to m_data, and then increment the writer location.  When reading, if the reader and writer locations are different, there must be a value in m_data at the location we're reading from, right?

Right?

 

Memory access ordering
So there are two ways this would fail without using the C++ atomics as we're using them.

Both of the ways are due to memory access ordering.  For C++11 (and ARM and PPC), all memory accesses will appear from a single thread's perspective as consistent.  Specifically, if you write to an address, and then read from that address on the same thread, you'll always get the value you wrote.

But when dealing with multiple threads, we can run into issues.  For example, with C++11 memory model, the optimizer is allowed to re-order those writes.  For our case, if we weren't protecting the m_writerLoc with atomics, the optimizer could change the code so that m_writerLoc gets written to before m_data gets written to, which will introduce a rare double pop, but only in optimized builds.

On CPUs with weak memory models (ARM, PPC and others) we end up with the same double pop issue, even in debug builds.  This is because the CPU will queue up reads and writes to complete once their cache line is available.  So even with a deoptimized build, the CPU can execute the write of the value to the m_data array, and then execute the m_writerLoc update; however, if the address of m_writerLoc is already in cache but m_data isn't, the write to m_writerLoc will become visible before the write to m_data, and the same double-pop issue can show up.


Platform specific fixes

When dealing with the optimizer issue, we need to insert a compiler barrier.  This makes it so that the optimizer cannot move memory operations across where that barrier was.  This fixes the issue where the optimizer could rearrange the writes.

When dealing with the CPU issue, these platforms include specialized instructions to ensure specific memory operations complete in the right order.  There are typically two notable ordering instructions made available: one to drain the read buffer, and one to drain the read and write buffers.  The compiler should recognize the intrinsic version of the instruction as a compiler barrier as well.  With this, you'd write the pointer out, execute the write sync instruction, and then update the offset.
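As a rough sketch of what those two kinds of barriers look like with GCC or Clang style builtins (the macro names are mine, and MSVC has its own equivalents, so treat this as illustrative rather than exact):

// Compiler-only barrier: stops the optimizer from reordering memory accesses across it.
#define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")

// Full hardware barrier: drains pending reads and writes (dmb on ARM, sync on PPC, mfence on x86).
#define MEMORY_BARRIER() __sync_synchronize()

// With these, a push would write the pointer, fence, then publish the new offset:
//   m_data[location & m_count] = _value;
//   MEMORY_BARRIER();
//   m_writerLoc++;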


C++11 atomic memory models

The C++11 atomic system allows you to specify a memory access order for the atomics.  The important orders are as follows:
  • Relaxed - No memory ordering is expected; we only want the atomicity of this action.
  • Acquire - Memory operations after this load may not appear to be reordered before it.
  • Release - Memory operations before this store may not appear to be reordered after it.
  • Acquire/release - Used for read-modify-write operations; no memory operations may appear to be reordered across this operation, assuming it succeeded.
  • Sequentially Consistent - The default ordering; acquire/release semantics plus a single total order across all sequentially consistent operations.
So we could use these orders to optimize our SPSC queue.  In our Push, we could replace
m_writerLoc++;
with
m_writerLoc.fetch_add(1, std::memory_order_release);

and then in pop, replace
uint32_t location = m_readerLoc;
with
uint32_t location = m_readerLoc.load(std::memory_order_consume);

This probably won't affect the generated code on x86, due to the strong memory model x86 implements; however, both ARM and PPC should generate better code.

Next time we'll modify the code to support multiple consumers, and multiple producers.

Saturday, May 2, 2015

Non-intrusive lockless stack


Issues with Intrusive Lock-less structures
 So last time, we implemented an intrusive lock-less stack.  But there are some pretty strong limitations when it comes to intrusive lock-less structures.  Primarily, objects cannot be deleted so long as the lock-less structure is in use.  There's still a large set of situations where this is safe - specifically, systems that maintain a pool of objects that can be popped from and pushed to by multiple threads.

But let's implement a more general case lock-less stack that doesn't have this weakness.

Fixed size Lock-less stack
 
template <class T>
class FixedSizeLocklessStack
{
 public:
 FixedSizeLocklessStack(uint32_t count)
 {
  m_buffer = new StackData[count];
  for (uint32_t i = 0; i < count; ++i)
  {
   m_freeSlots.Push(&m_buffer[i]);
  } 
 }

 ~FixedSizeLocklessStack()
 {
  delete[] m_buffer; 
 }

 bool Push(T* _value)
 {
   StackData* slot = m_freeSlots.Pop();
   if (!slot)
   {
    return false;
   }
   slot->pointer = _value;
   m_valueStack.Push(slot);
   return true;
 }

 T* Pop()
 {
   StackData* value = m_valueStack.Pop();
   if (!value)
   {
    return nullptr;
   }
   T* retValue = value->pointer;
   m_freeSlots.Push(value);
   return retValue;
 }

private:
 struct StackData
 {
  std::atomic<StackData*>& GetIntrusiveMember() { return intrusiveValue; }
  std::atomic<StackData*> intrusiveValue;
  T* pointer;
 };
 LocklessStack<StackData> m_freeSlots;
 LocklessStack<StackData> m_valueStack;
 StackData* m_buffer;
};
So this is a relatively simple extension on top of the base LocklessStack.  The stack owns and maintains the nodes that point at the data being pushed and popped.  This means the data being pushed can be freed without worrying about whether other threads are still accessing the stack's internals.

Performance of this stack may be improved by forcing the StackData to be cache aligned.  This way each thread can avoid stomping on the others' cache lines.
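For example, assuming a 64-byte cache line, the slot struct could be declared like this (just a sketch of the alignment change, not a measured improvement):

 // Each slot gets its own cache line, so two threads touching neighboring
 // slots don't invalidate each other's caches (false sharing).
 struct alignas(64) StackData
 {
  std::atomic<StackData*>& GetIntrusiveMember() { return intrusiveValue; }
  std::atomic<StackData*> intrusiveValue;
  T* pointer;
 };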

But why fixed sized?
This is implemented as a fixed sized stack, but the internal stack data being managed acts as a pointer, so surely we could just allocate a new slot instead of returning false when pushing, right?

The problem with that, though, is that we lose the lock-free guarantee.  This happens for any memory allocation: nearly all allocators that support multiple threads require locks, and even the allocators that mostly avoid them (such as TCMalloc or a so-called lock-free allocator) still fall back to locking on some paths.  Due to this, when working with lock-less structures, I intentionally avoid automatic memory management.

But expanding this into a growable lock-less stack is straightforward.

Expandable lock-less stack

The main change is that rather than allocating a single large buffer, we allocate a bunch of small entries, and when pushing, if we fail to pop a free slot, we allocate a new one on the spot.  Then in the destructor, we delete all of the allocated entries.

template<class T>
class ExpandableLocklessStack
{
public:
 ExpandableLocklessStack(uint32_t _count)
 {
  for (uint32_t i = 0; i < _count; ++i)
  {
    m_freeSlots.Push(new StackData());
  }
 }

 ~ExpandableLocklessStack()
 {
  StackData* data; 
  while (data = m_freeSlots.Pop())
  {
   delete data;
  }
  while (data = m_valueStack.Pop())
  {
   delete data; 
  }
 }

 void Push(T* _data)
 {
  StackData* value = m_freeSlots.Pop();
  if (!value)
  {
   value = new StackData();
  }
  value->pointer = _data;
  m_valueStack.Push(value);
 }

 T* Pop()
 {
  StackData* value = m_valueStack.Pop();
  if (!value)
  {
   return nullptr;
  }
  T* retValue = value->pointer;
  m_freeSlots.Push(value);
  return retValue;
 }

private:
 struct StackData
 {
  std::atomic<StackData*>& GetIntrusiveMember() { return intrusiveValue; }
  std::atomic<StackData*> intrusiveValue;
  T* pointer; 
 };
 LocklessStack<StackData> m_freeSlots;
 LocklessStack<StackData> m_valueStack;
};

Load-Link/Store-Conditional Lock-less Stack
Last time, I indicated that a Load-Link/Store-Conditional doesn't need tagging.  Let's take a quick look at what this might look like.

I'll be using a made up namespace called platform.  In it are 3 important functions:

//Loads a value from a source location using the load link instruction
template <class T>
T* LoadLink(T** src);

//Stores the value to destination, but only if no other core has written to the cache line the destination is in.
template <class T>
bool StoreConditional(T** dest, T* value);

//Writes a value, and then ensures that it will be visible to other cores before continuing execution.  Also acts as a compiler barrier.
//This is needed due to the platforms that support these instructions also tend to have weaker memory models, which would require synchronization here.
template <class T>
void WriteAndSync(T** dest, T* value);

There is also an important macro, called CACHE_ALIGN.  This adds compiler specific attributes to ensure the member is aligned to a cache line.  It does increase the size of this structure from the size of a pointer to a cache line, but that's important: if the structure we're pushing ends up in the same cache line as the stack's head, the store-conditional would keep failing and we would never be able to push that value to the stack.
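The post doesn't show the macro itself, but a possible definition looks something like this (assuming a 64-byte cache line; treat this as a sketch, not the exact macro I use):

// Declares a member of the given type, aligned to its own 64-byte cache line.
// Usage: CACHE_ALIGN(T*, m_head);
#define CACHE_ALIGN(type, name) alignas(64) type name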

struct LocklessTester
{
 LocklessTester*& GetIntrusiveMember() { return intrusiveValue; }
 LocklessTester* intrusiveValue;
 uint32_t a;
 uint32_t b;
};

template <class T>
class LocklessStack
{
public:
 LocklessStack() : m_head(nullptr) {} 
 void Push(T* _value)
 {
  while (true)
  {
   T* head = platform::LoadLink(&m_head);
   platform::WriteAndSync(&_value->GetIntrusiveMember(), head);
   if (platform::StoreConditional(&m_head, _value))
   {
    return;
   }
  }
 }
 T* Pop()
 {
  while (true)
  {
   T* head = platform::LoadLink(&m_head);
   if (!head)
   {
    return nullptr;
   } 
   T* nextHead = head->GetIntrusiveMember();
   if (platform::StoreConditional(&m_head, nextHead))
   {
    return head;
   }
  }
 }
private:
 CACHE_ALIGN(T*, m_head);
}; 
 
With this, we've mainly gone back to our original setup, however now we can be confident that our stack will be completely safe from ABA, not just partly, because we know that no one wrote to the cache line that contains our m_head value after we loaded from it.

Please note, this is currently untested due to my not having setup the Android NDK for testing the ARM version of this.

Friday, April 24, 2015

Playing with fire: c++11 Atomics and lockless data structures.


After watching several of the C++ conference videos, and having worked with (and had issues with) some questionably implemented atomic algorithms, I've decided I need to learn these better.

So I shall write them myself. After all, how better to learn than to do.

Non-blocking algorithms
According to Wikipedia, there are 3 major non-blocking algorithm types:
1. Wait-free.  This is where there is a known maximum cost for accessing the shared resource.  Hardest to implement.  Most of the implementations I've seen depend on thread local storage, and fall back on a lock-free or obstruction-free path under specific conditions.  A good example of this would be Google's TCMalloc small allocator, which uses thread local storage to avoid locking unless memory availability requires accessing a central store.
2. Lock-Free.  This is where one thread will always succeed.  It must be impossible for thread swapping to result in a situation where no thread can succeed.  This is what I'll be focusing on.
3. Obstruction-Free.  A single thread takes control of the shared resource at a time, and the other threads must wait.  Normally this is implemented as a critical section or a spin lock (see the sketch below).  Without special protection, it's possible for threads to be deadlocked on a single critical section, however most operating systems have code to protect against this.
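As a quick illustration of the obstruction-free category, here's a minimal spin lock built on std::atomic_flag (a sketch for illustration only; it isn't used later in this post):

#include <atomic>

class SpinLock
{
public:
 void Lock()
 {
  // Spin until we observe the flag clear and manage to set it ourselves.
  while (m_flag.test_and_set(std::memory_order_acquire))
  {
  }
 }
 void Unlock()
 {
  m_flag.clear(std::memory_order_release);
 }
private:
 std::atomic_flag m_flag = ATOMIC_FLAG_INIT;
};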

Basic Lock-less Stack design
A stack is one of the first basic data structures people use.  It also makes for a good starting point for lock-free design, so we'll start with that.

I shall provide a basic lock-less stack and then discuss the implementation.  Just to clarify, this initial implementation is meant to be a simple example, and is neither optimal, nor correct, and we'll be talking about what's wrong and how to fix it.

struct LocklessTester
{
 std::atomic<LocklessTester*>& GetIntrusiveMember() { return intrusiveValue; }
 std::atomic<LocklessTester*> intrusiveValue;
 uint32_t a;
 uint32_t b;
};

template <class T>
class LocklessStack
{
public:
 LocklessStack() : m_head(nullptr) {} 
 void Push(T* _value)
 {
  T* head = m_head;
  while (true)
  {
   _value->GetIntrusiveMember() = head;
   if (m_head.compare_exchange_strong(head, _value))
   {
    return;
   }
  }
 }
 T* Pop()
 {
  T* head = m_head;
  while (true)
  {
   if (!head)
   {
    return nullptr;
   }
   T* nextHead = head->GetIntrusiveMember();
   if (m_head.compare_exchange_strong(head, nextHead))
   {
    return head;
   }
  }
 }
private:
 std::atomic<T*> m_head;
};

So, how does this work?

This stack follows a very basic lock-less concept of constructing an updated state, and then applying the state in a single atomic update.  This design is one of the best designs, as it is pretty easy to prove correct.

So what we do is write the current head value into the intrusive member of the item we're pushing (intrusiveValue in this case).  After doing so, we compare m_head against the original head, and if it's the same, we update it with our new element.  If they don't match (which happens if someone else pushed or popped on this stack), we re-run the logic with the new head value.

Popping follows similar logic.  We first read from the intrusive value what the next head will be, then perform a compare and swap to restore the previously pushed value, and return the head we first read if the compare and swap succeeded.

So what's wrong
Lock-less programming that uses the compare_exchange operations needs to be careful with regard to the ABA problem.

So, does this stack have this issue?  And if so, how do we detect this?

What we have to evaluate is whether, when we perform the compare_exchange_<strong/weak>, the underlying state could have changed in a way the compare doesn't detect.

So first, let's consider the push logic.  We write the head to the intrusive member, and then compare and exchange the head member with the value we just wrote.  Because the value being compared is exactly the value we depend on, push doesn't have an ABA problem.

But what about Pop?  If the stack is modified after we read nextHead from head->GetIntrusiveMember(), the value we've already read may no longer be correct!  So Pop requires ABA protection.

The situation where Pop fails looks like this:
Thread 0:
Push A
Push B
Pop B, but get interrupted before the compare_exchange.
Thread 1 starts executing:
Pop B
Pop A
Push C
Push B.
Thread 1 stops, Thread 0 continues.
At this point, nextHead points to A, but m_head still has the value B, so the compare and swap succeeds and sets the head to A instead of C, losing C entirely!

So we clearly have the possibility of hitting the ABA problem, so how do we fix it?

Avoiding the ABA problem

Some platforms can completely avoid this by using load-link/store-conditional.  What this means is that the CPU is able to perform a load, and then store only if no one else has stored to that address in the meantime.  Platforms that support this are PowerPC (last-gen consoles and the Wii U) and recent ARM CPUs (most smart phones and modern hand-helds).  With these pairs of instructions, it's possible to completely avoid the ABA problem!  If you're working with those platforms, you should implement a load-link/store-conditional lock-less stack rather than relying on an atomic compare and swap.  This is especially important because some of these platforms actually lack the ability to perform a double wide atomic operation - atomically modifying 8 bytes on 32 bit systems, or 16 bytes on 64 bit systems.

Sticking to what is exposed in C++11, however, the most common method to strengthen our defense against ABA is to tag our data with some unique marker.  Note, this does not fix it, it just makes a rare issue more rare, hopefully so rare we never see it.

Data tagging

So for basic tagging, what is normally done is to include something in the data we compare and swap that changes every time.  With this, we can have increased confidence that the value has not changed since we read it.

For x86-32, since the Pentium era, there's been an 8 byte compare and exchange instruction that can be executed atomically - lock cmpxchg8b.  This means 32 bit x86 supports something called double wide atomic operations, and it's how the C++ atomic library should implement 8 byte atomic operations on 32 bit builds.  Using this, we can put a full 4 bytes of tag data alongside the pointer.

64 bit isn't as simple.  There is a 16 byte compare exchange available on all but early 64 bit AMD processors.  Unless you must absolutely support all processors from the last 13 years, you can safely assume this to be available.

But if you create a 16 byte value, compile an std::atomic version of it, and look at the assembly, you'll see that MSVC at least will not generate that instruction.  It instead stores an extra 8 byte value that it uses as a lock variable and implements a spin lock on it.  The libstdc++ 4.6 atomic header also lacks a 16 byte type.
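If you want to check what your own toolchain does, a quick sketch like this reports at runtime whether a 16 byte atomic is actually lock free (TaggedPointer is just an illustrative stand-in for a pointer-plus-tag pair):

#include <atomic>
#include <cstdint>
#include <cstdio>

struct TaggedPointer
{
 uint64_t pointer;
 uint64_t tag;
};

int main()
{
 std::atomic<TaggedPointer> head;
 std::printf("16 byte atomic lock free: %s\n", head.is_lock_free() ? "yes" : "no");
 return 0;
}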

So what can we do?  Well, we can either use compiler specific implementations to perform the 16 byte operation, or we can take advantage of the fact that on x86-64 we don't actually have the full 64 bits of address space to hold pointers.  Current processors only use 48 bits of virtual address space; the upper bits of a canonical pointer must all match bit 47, so the top 17 bits of any valid pointer are identical.  Most OS implementations dedicate the high range (where those bits are all 1s) to the kernel, so for a user space lock-less stack we can use the top 17 bits if we so choose.

Which I have.

Data Tagged Lock-less Stack
The big change is that rather than storing a pointer for the head, the stack now stores a uint64_t that packs the pointer and the tag together.  We also implement a helper struct named ABAProt to hide the difference between 32 and 64 bit.  The new class looks like this:

template <class T>
class LocklessStack
{
    template <unsigned char>
    struct ABAProt {
    };

    template <>
    struct ABAProt<4U> {
        static const uint64_t markerBits = 0xFFFFFFFF00000000ULL;
        static const uint64_t markerAdd = 0x0000000100000000ULL;
    };
    template <>
    struct ABAProt<8U> {
        static const uint64_t markerBits = 0xFFFF800000000000ULL;
        static const uint64_t markerAdd = 0x0000800000000000ULL;
    };

    typedef ABAProt<sizeof(void*)> Config;

public:
    LocklessStack() : m_head(0) {}

    void Push(T* _value)
    {
        assert((uint64_t(_value) & Config::markerBits) == 0);
        uint64_t head = m_head;
        while (true)
        {
            uint64_t marker = (head + Config::markerAdd) & Config::markerBits;
            uint64_t value = uint64_t(_value) | marker;
            T* headItem = (T*)(head & ~Config::markerBits);
            _value->GetIntrusiveMember() = headItem ;
            if (m_head.compare_exchange_strong(head, value))
            {
                return;
            }
        }
    }
    T* Pop()
    {
        uint64_t head = m_head;
        while (true)
        {
            T* Ptr = (T*)(head & ~Config::markerBits);
            if (!Ptr)
            {
                return nullptr;
            }

            uint64_t nextHead = (uint64_t)(Ptr->GetIntrusiveMember().load());
            uint64_t marker = (head + Config::markerAdd) & Config::markerBits;
            nextHead = nextHead | marker;
            if (m_head.compare_exchange_strong(head, nextHead))
            {
                return Ptr;
            }
        }
    }
private:
    std::atomic<uint64_t> m_head;
};

So how does this work?  When you go to push a value, we first make sure the data we're pushing has no bits set in the markerBits region.  Then, in the loop, we generate the next marker by taking the head, adding the modifier, and then ANDing with the mask.  This way we have a constantly incrementing marker relative to the last head.  We OR that onto the pointer we were given, and that's the value we'll swap in.

When popping, we also generate a new marker, to ensure that pops and pushes both modify the marker state.

So we're good now, right?
Mostly.  There's a specific limitation you have to keep in mind when using intrusive pointers and lock-less data structures.

Do not free right after popping.

The reason for this is: what if, after loading the head value but before the call to GetIntrusiveMember, the thread gets paused?  Then another thread takes over, pops the value, and deletes it.  When the original thread resumes, it's now reading from deleted memory.  At best, that memory is still mapped and you read garbage that the failed compare and swap throws away.  At worst, the OS has deallocated that page, and you'll trigger a page fault that kills your program.

In a follow up post I'll give an example load-link/store-conditional implementation, and implement a non-intrusive stack.  Then I'll move on to a lock-less queue, after which I'll touch on the worker system I implemented using it for my hobby game engine.