Friday, April 24, 2015

Playing with fire: c++11 Atomics and lockless data structures.


After watching several of the C++ convention videos, and having worked with and had issues with some questionably implemented atomic algorithms, I've decided I needed to learn these better.

So I shall write them myself. After all, how better to learn than to do.

Non-blocking algorithms
According to Wikipedia, there are 3 major non-blocking algorithm types:
1. Wait-free.  This is where there is a known maximum cost for accessing the shared resource. Hardest to implement.  Most of the implementations I've seen depend on thread local storage, and fallback on a Lock-free or Obstruction-Free under specific conditions.  A good example of this would be Google's TCMalloc small allocator which uses thread local storage to avoid locking, unless memory availability requires accessing a central store.
2. Lock-Free.  This is where one thread will always succeed.  It must be impossible for thread swapping to result in a situation that results in no one to succeed.  This is what I'll be focusing on.
3. Obstruction-Free.  A single thread takes control of the shared resource at a time, the other threads must wait.  Normally this is implemented as a critical section or as a Spin lock.  Without special protection, it's possible for threads to be deadlocked on a single critical section, however most Operating Systems have code to protect against this.

Basic Lock-less Stack design
A stack is one of the first basic data structures people use.  It also makes for a good starting point for lock-free design, so we'll start with that.

I shall provide a basic lock-less stack and then discuss the implementation.  Just to clarify, this initial implementation is meant to be a simple example, and is neither optimal, nor correct, and we'll be talking about what's wrong and how to fix it.

struct LocklessTester
{
 std::atomic<LocklessTester*>& GetIntrusiveMember() { return intrusiveValue; }
 std::atomic<LocklessTester*> intrusiveValue;
 uint32_t a;
 uint32_t b;
};

template <class T>
class LocklessStack
{
public:
 LocklessStack() : m_head(nullptr) {} 
 void Push(T* _value)
 {
  T* head = m_head;
  while (true)
  {
   _value->GetIntrusiveMember() = head;
   if (m_head.compare_exchange_strong(head, _value))
   {
    return;
   }
  }
 }
 T* Pop()
 {
  T* head = m_head;
  while (true)
  {
   if (!head)
   {
    return nullptr;
   }
   T* nextHead = head->GetIntrusiveMember();
   if (m_head.compare_exchange_strong(head, nextHead))
   {
    return head;
   }
  }
 }
private:
 std::atomic<T*> m_head;
};

So, how does this work.

This stack follows a very basic lock-less concept of constructing an updated state, and then applying the state in a single atomic update.  This design is one of the best designs, as it is pretty easy to prove correct.

So what we do is we write the current head value to the provided intrusive value provided by what we are pushing(intrusiveValue in this case).  After doing so, it will compare the m_head against the original head, and if it's the same, update it with our new element.  If they don't match(which will happen if someone else has pushed or popped to this stack), we need to re-run the logic with the new head value.

Popping follows a similar logic.  It first reads from the intrusive value what the next head will be, and then performing a compare and swap to restore the previously pushed value, and then return the head we first read if the compare and swap succeeded.

So what's wrong
Lock-less programming that uses the compare_exchange operations need to be careful with regards to ABA problem.

So, does this stack have this issue?  And if so, how do we detect this?

What we have to evaluate is that when we perform the compare_exchange_<strong/weak>, does the underlying state change if changes have happened undetectably.

So first, let's consider the push logic.  First, we write the head to the intrusive member, and then we compare and exchange the head member with the value we just wrote.  Because the value being conditioned matches what we're depending on, we don't have an ABA problem.

But what about Pop.  Does that need it?  If we've modified the stack after reading nextHead from head->GetIntrusiveMember(), the value we've already read may no longer be correct!  So Pop requires ABA protection.

The situation where Pop fails looks like this:
Thread 0:
Push A
Push B
Pop B, but get interrupted before the compare_exchange.
Thread 1 starts executing:
Pop B
Pop A
Push C
Push B.
Thread 1 stops, Thread 0 continues.
At this point, nextHead points to A, but m_head still has the value B, so it will incorrectly return A, instead of C!

So we clearly have the possibility of having the ABA problem, so how do we fix that.

Avoiding the ABA problem

Some platforms can completely avoid this by using Load-link/store-conditional.  What this means is that the CPU is able to trigger a load, and then store only if no one else has stored to that address.  Platforms that support that are PowerPC(last-gen consoles and the Wii U) and recent ARM cpus(most smart phones, and modern hand-helds).  With these pairs of instructions, it's possible to completely avoid the ABA problem!  If you're working with those platforms, you should implement a load-link/store-conditional lock-less stack rather than relying on an atomic compare and swap.  This is especially important because some of these platforms actually lack the ability to perform a double wide atomic operation - atomically modifying 8 bytes on 32 bit systems, or 16 bytes on 64 bit systems.

Sticking to what is exposed in C++11, however, the most common method to strengthen our defense against ABA is to tag our data with some unique marker.  Note, this does not fix it, it just makes a rare issue more rare, hopefully so rare we never see it.

Data tagging

So for basic tagging what is normally done is have to include in the data in the data we compare and swap that changes every time.  With this, we can have increased confidence that the value has not changed since we read it.

For x86-32, since 486 era, there's been a 8 byte compare and exchange instruction that is able to be processed atomically - lock cmpxchg.  This means 32 bit x86 supports something called double wide atomic operations.  This should be how the C++ atomic library should implement 8 byte atomic operations on 32 bit builds.  Using this, we can put a full 4 bytes of tag data along side the pointer.

64 bit isn't as simple.  There is a 16 byte compare exchange available on all but early 64 bit AMD processors.  Unless you must absolutely support all processors from the last 13 years, you can safely assume this to be available.

But, if you create a 16 byte value, compile an std::atomic version of it, and look at the assembly, you'll see that MSVC at least will not generate that instruction.  It instead has an 8 byte value that it uses to store a lock variable that it implements a spin lock on. Looking at the libstdc++ 4.6 atomic header also lacks a 16 byte type.

So what can we do?  Well, we can either use compiler specific implementations to perform the 16 byte operation.  Or, we can take advantage of the fact that on x86-64, we don't actually have the full 64 bit namespace to hold pointers.  We only have 48 bits of virtual namespace.  Specifically bits 1 through 17 will always match bit 0.  Most OS implementations will dedicate the high range(where bits 0 through 17 are all 1s) for OS usage.  So for a user space lock-less stack, we could use the top 17 bits if we so choose to.

Which I have.

Data Tagged Lock-less Stack
The big change on the user side is that rather than store a pointer in the class being stored on the stack, we now always store a uint64_t.  We also implement a helper struct named ABAProt to help us hide the difference between 32 and 64 bit.  The new class looks like this:

template <class T>
class LocklessStack
{
    template <unsigned char>
    struct ABAProt {
    };

    template <>
    struct ABAProt<4U> {
        static const uint64_t markerBits = 0xFFFFFFFF00000000ULL;
        static const uint64_t markerAdd = 0x0000000100000000ULL;
    };
    template <>
    struct ABAProt<8U> {
        static const uint64_t markerBits = 0xFFFF800000000000ULL;
        static const uint64_t markerAdd = 0x0000800000000000ULL;
    };

    typedef ABAProt<sizeof(void*)> Config;

public:
    LocklessStack() : m_head(0) {}

    void Push(T* _value)
    {
        assert((uint64_t(_value) & Config::markerBits) == 0);
        uint64_t head = m_head;
        while (true)
        {
            uint64_t marker = (head + Config::markerAdd) & Config::markerBits;
            uint64_t value = uint64_t(_value) | marker;
            T* headItem = (T*)(head & ~Config::markerBits);
            _value->GetIntrusiveMember() = headItem ;
            if (m_head.compare_exchange_strong(head, value))
            {
                return;
            }
        }
    }
    T* Pop()
    {
        uint64_t head = m_head;
        while (true)
        {
            T* Ptr = (T*)(head & ~Config::markerBits);
            if (!Ptr)
            {
                return nullptr;
            }

            uint64_t nextHead = (uint64_t)(Ptr->GetIntrusiveMember().load())
            uint64_t marker = (head + Config::markerAdd) & Config::markerBits;
            nextHead = nextHead | marker;
            if (m_head.compare_exchange_strong(head, nextHead))
            {
                return Ptr;
            }
        }
    }
private:
    std::atomic<uint64_t> m_head;
};

So how does this work.  When you go to push the value, we first make sure the data we're pushing has no values in the markerBits region.  Then, in the loop, we generate the next marker we'll use by taking the head, adding the modifier, and then anding the mask.  This way we have a constantly incrementing marker relative to the last head.  Or that onto the pointer we were given, and that's the value we'll swap in.

When popping, we also generate a new mask, to ensure that pops and pushes both modify the marker state.

So we're good now, right?
Mostly.  There's a specific limitation you have to keep in mind when using intrusive pointers and lock-less data structures.

Do not free right after popping.

The reason for this, is what if, after loading the head value, but before the call to GetIntrusiveMember, the thread gets paused.  Then another thread takes over, pops the value, and deletes it.  Then the original thread takes over.  It's not reading from deleted memory.  At best, doing this is fine, and you get garbage, but you'll throw away the result.  At worst, the OS has deallocated that page, and you'll trigger a page fault error killing your program.

In a follow up post I'll give an example load-link/store-condition implementation, and implement a non-intrusive stack.  Then I'll move onto a lock-less queue, after which I'll touch on the worker system I implemented using it for my hobby game engine.