Saturday, May 2, 2015

Non-intrusive lockless stack


Issues with Intrusive Lock-less structures
So last time, we implemented an intrusive lock-less stack.  But there are some pretty strong limitations when it comes to intrusive lock-less structures.  Primarily, objects cannot be deleted so long as the lock-less structure is in use.  There is still a large set of situations where this can be safe, specifically systems that maintain a pool of objects that can be popped from and pushed to by multiple threads.

But let's implement a more general case lock-less stack that doesn't have this weakness.
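
As a quick reminder, everything below builds on the intrusive LocklessStack from last time.  As a sketch (with the contract inferred from how it's used below), I'm assuming it exposes roughly this interface:

template <class T>
class LocklessStack
{
public:
 //T must provide GetIntrusiveMember(), returning a reference to the
 //member the stack uses to link nodes together.
 void Push(T* _value); //links _value in via its intrusive member
 T* Pop();             //returns nullptr when the stack is empty
};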

Fixed size Lock-less stack
 
#include <atomic>
#include <cstdint>

template <class T>
class FixedSizeLocklessStack
{
 public:
 FixedSizeLocklessStack(uint32_t _count)
 {
  m_buffer = new StackData[_count];
  for (uint32_t i = 0; i < _count; ++i)
  {
   m_freeSlots.Push(&m_buffer[i]);
  }
 }

 ~FixedSizeLocklessStack()
 {
  delete[] m_buffer; 
 }

 bool Push(T* _value)
 {
  StackData* slot = m_freeSlots.Pop();
  if (!slot)
  {
   //All slots are in use; the stack is full.
   return false;
  }
  slot->pointer = _value;
  m_valueStack.Push(slot);
  return true;
 }

 T* Pop()
 {
  StackData* slot = m_valueStack.Pop();
  if (!slot)
  {
   return nullptr;
  }
  T* retValue = slot->pointer;
  //Recycle the slot so a later Push can reuse it.
  m_freeSlots.Push(slot);
  return retValue;
 }

private:
 struct StackData
 {
  std::atomic<StackData*>& GetIntrusiveMember() { return intrusiveValue; }
  std::atomic<StackData*> intrusiveValue;
  T* pointer;
 };
 LocklessStack<StackData> m_freeSlots;
 LocklessStack<StackData> m_valueStack;
 StackData* m_buffer;
};
So this is a relatively simple extension on top of the base LocklessStack.  The stack owns and maintains the StackData nodes that point to the data being pushed and popped.  This means the data being pushed in can be managed without worrying about whether other threads are still accessing it.

Performance of this stack may be improved by forcing each StackData to be cache-line aligned.  That way the threads can avoid stomping on each other's cache lines.
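
For example, here's a minimal usage sketch, where WorkItem is a hypothetical payload type:

//A bounded pool of work items shared across threads.
struct WorkItem { uint32_t payload; };

FixedSizeLocklessStack<WorkItem> stack(16);

WorkItem item;
item.payload = 42;
if (stack.Push(&item)) //returns false if all 16 slots are in use
{
 WorkItem* popped = stack.Pop(); //&item, or nullptr if another thread popped first
}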

But why fixed size?
This is implemented as a fixed size stack, but each internal StackData node just wraps a pointer, so surely we could just allocate a new slot instead of returning false when pushing, right?

The problem with that, though, is that we lose the lock-free guarantee.  This happens with any memory allocation: nearly all allocators that support multiple threads require locks, and even allocators designed to minimize locking (such as tcmalloc or a so-called lock-free allocator) frequently take locks on some paths.  Due to this, when working with lock-less structures, I intentionally avoid automatic memory management.

But expanding this into an expandable lock-less stack is straightforward.

Expandable lock-less stack

The main change is that rather than allocating a single large buffer, we allocate a bunch of small entries, and when pushing, if we fail to pop a free slot, we allocate a new one on the spot.  Then, on destruction, we delete all of the allocated entries.

template<class T>
class ExpandableLocklessStack
{
public:
 ExpandableLocklessStack(uint32_t _count)
 {
  for (uint32_t i = 0; i < _count; ++i)
  {
    m_freeSlots.Push(new StackData());
  }
 }

 ~ExpandableLocklessStack()
 {
  StackData* data;
  while ((data = m_freeSlots.Pop()))
  {
   delete data;
  }
  while ((data = m_valueStack.Pop()))
  {
   delete data;
  }
 }

 void Push(T* _data)
 {
  StackData* slot = m_freeSlots.Pop();
  if (!slot)
  {
   //Free list exhausted; grow on demand.  Note this path allocates,
   //so it is not lock-free.
   slot = new StackData();
  }
  slot->pointer = _data;
  m_valueStack.Push(slot);
 }

 T* Pop()
 {
  StackData* slot = m_valueStack.Pop();
  if (!slot)
  {
   return nullptr;
  }
  T* retValue = slot->pointer;
  m_freeSlots.Push(slot);
  return retValue;
 }

private:
 struct StackData
 {
  std::atomic<StackData*>& GetIntrusiveMember() { return intrusiveValue; }
  std::atomic<StackData*> intrusiveValue;
  T* pointer; 
 };
 LocklessStack<StackData> m_freeSlots;
 LocklessStack<StackData> m_valueStack;
};
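
A quick usage sketch to show the difference, Push never fails, it just allocates when the free list runs dry:

//Hypothetical usage: starts with 4 slots, grows on demand.
ExpandableLocklessStack<uint32_t> stack(4);

uint32_t values[8] = {};
for (uint32_t i = 0; i < 8; ++i)
{
 //The last four pushes take the allocating (non-lock-free) path.
 stack.Push(&values[i]);
}

Those allocating pushes are exactly where the lock-free guarantee is lost, as discussed above.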

Load-Link/Store-Conditional Lock-less Stack
Last time, I indicated that a Load-Link/Store-Conditional version doesn't need tagging.  Let's take a quick look at what that might look like.

I'll be using a made-up namespace called platform.  In it are 3 important functions:

//Loads a value from a source location using the load link instruction
template <class T>
T* LoadLink(T** src);

//Stores the value to destination, but only if no other core has written to the cache line the destination is in since the matching LoadLink.
template <class T>
bool StoreConditional(T** dest, T* value);

//Writes a value, and then ensures that it will be visible to other cores before continuing execution.  Also acts as a compiler barrier.
//This is needed because the platforms that support these instructions also tend to have weaker memory models, which require explicit synchronization here.
template <class T>
void WriteAndSync(T** dest, T* value);

There is also an important macro called CACHE_ALIGN.  This adds compiler-specific directives to ensure the member is aligned to a cache line boundary.  This does increase the size of the structure from the size of a pointer to a full cache line.  This is important though, because if the structure we're pushing ended up in the same cache line as the stack's head, the write to its intrusive member would clear our reservation on m_head, the store-conditional would always fail, and we would never be able to push that value to the stack.
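
As a sketch, assuming a 64-byte cache line (the real size is platform specific), with C++11 this could be as simple as:

//A minimal CACHE_ALIGN sketch.  The line size should really come from
//the target platform rather than being hard coded.
#define CACHE_LINE_SIZE 64
#define CACHE_ALIGN(type, name) alignas(CACHE_LINE_SIZE) type name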

struct LocklessTester
{
 LocklessTester*& GetIntrusiveMember() { return intrusiveValue; }
 LocklessTester* intrusiveValue;
 uint32_t a;
 uint32_t b;
};

template <class T>
class LocklessStack
{
public:
 LocklessStack() : m_head(nullptr) {} 
 void Push(T* _value)
 {
  while (true)
  {
   T* head = platform::LoadLink(&m_head);
   //Link the new node to the current head.  This write must be visible
   //to other cores before the store-conditional publishes the node.
   platform::WriteAndSync(&_value->GetIntrusiveMember(), head);
   if (platform::StoreConditional(&m_head, _value))
   {
    return;
   }
   //Another core touched m_head's cache line; retry.
  }
 }
 T* Pop()
 {
  while (true)
  {
   T* head = platform::LoadLink(&m_head);
   if (!head)
   {
    return nullptr;
   }
   T* nextHead = head->GetIntrusiveMember();
   if (platform::StoreConditional(&m_head, nextHead))
   {
    return head;
   }
  }
 }
private:
 CACHE_ALIGN(T*, m_head);
}; 
 
With this, we've mostly gone back to our original setup.  However, now we can be confident that our stack is completely safe from ABA, not just partially, because we know that no one wrote to the cache line containing our m_head value between the load-link and the store-conditional.
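
For completeness, here's roughly how this version would be exercised with the LocklessTester struct above, assuming implementations of the platform functions for your target:

LocklessStack<LocklessTester> stack;

LocklessTester node;
node.a = 1;
node.b = 2;
stack.Push(&node); //WriteAndSync links node in, StoreConditional publishes it

LocklessTester* popped = stack.Pop(); //returns &node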

Please note, this is currently untested, as I haven't yet set up the Android NDK to test the ARM version of this.
