Thread System 0.3.1
High-performance C++20 thread pool with work stealing and DAG scheduling
Tutorial: Lock-Free Queue Patterns

Thread System ships several lock-free queues for cases where mutex-based synchronization is too expensive. This tutorial explains how to choose between the available types, how hazard pointers reclaim memory safely, and what performance characteristics to expect.

Introduction

A "lock-free" queue makes progress without acquiring a mutex. Producers and consumers coordinate through atomic operations on shared state. Lock-free queues are not always faster than mutex queues — they shine under high contention with short critical sections, while a well-tuned mutex queue may beat them when contention is low.

Thread System provides:

  • adaptive_job_queue — Auto-switches between mutex and lock-free modes based on observed contention. Use this by default (a hedged usage sketch follows this list).
  • lockfree_job_queue — Multi-producer / multi-consumer (MPMC) Michael-Scott queue with hazard pointer reclamation.
  • work_stealing_deque — Single-producer / multi-consumer Chase-Lev deque used by the work-stealing scheduler.
  • A single-producer / single-consumer (SPSC) ring buffer for the highest throughput on dedicated producer/consumer pairs.
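
To ground the catalog above, here is a hedged usage sketch of the default choice. Everything below that is not named in this tutorial (the enqueue/dequeue methods, the callback_job type, and execute()) is an assumption made for illustration; consult the adaptive_job_queue header for the real interface.

#include <memory>

// Hedged sketch only: method and type names are assumptions, not confirmed API.
void producer(kcenon::thread::adaptive_job_queue& q) {
    q.enqueue(std::make_unique<kcenon::thread::callback_job>([] {
        // ... the actual work item ...
    }));
}

void consumer(kcenon::thread::adaptive_job_queue& q) {
    if (auto job = q.dequeue())  // assumed to return a null-able job handle
        job->execute();          // hypothetical job entry point
}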

MPMC vs SPSC: Choosing the Right Queue

Property        MPMC (lockfree_job_queue)    SPSC (ring buffer)
Producers       Many                         Exactly one
Consumers       Many                         Exactly one
Throughput      High under contention        Highest possible
Latency         Bounded                      Lowest
Memory model    Hazard pointers              No reclamation needed
Bounded?        Optional                     Always

Rules of thumb:

  • Pick SPSC when you can guarantee one producer and one consumer (e.g., a per-thread inbox, an audio render thread). It is the fastest option; a minimal implementation sketch follows this list.
  • Pick MPMC when multiple producers feed multiple consumers, such as a shared job queue.
  • Pick adaptive_job_queue when you do not know the contention pattern ahead of time, or when contention varies across workloads.
  • Pick work_stealing_deque only inside scheduler internals; it is not intended as a general-purpose queue.
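
To make the SPSC recommendation concrete, here is a minimal sketch of a bounded single-producer/single-consumer ring buffer. It illustrates the general technique, not Thread System's actual type: the producer owns the write index, the consumer owns the read index, and one acquire/release pair per operation is the only synchronization.

#include <atomic>
#include <cstddef>
#include <optional>
#include <vector>

template <typename T>
class spsc_ring {
public:
    explicit spsc_ring(std::size_t capacity)
        : buffer_(capacity + 1) {}  // one slot stays empty to tell full from empty

    bool try_push(T value) {        // producer thread only
        const auto head = head_.load(std::memory_order_relaxed);
        const auto next = (head + 1) % buffer_.size();
        if (next == tail_.load(std::memory_order_acquire))
            return false;           // full
        buffer_[head] = std::move(value);
        head_.store(next, std::memory_order_release);  // publish the slot
        return true;
    }

    std::optional<T> try_pop() {    // consumer thread only
        const auto tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire))
            return std::nullopt;    // empty
        T value = std::move(buffer_[tail]);
        tail_.store((tail + 1) % buffer_.size(), std::memory_order_release);
        return value;
    }

private:
    std::vector<T> buffer_;
    std::atomic<std::size_t> head_{0};  // next write index (producer-owned)
    std::atomic<std::size_t> tail_{0};  // next read index (consumer-owned)
};

Because each index is written by exactly one thread, no CAS loop is needed; this is why SPSC is the fastest option whenever the one-producer/one-consumer guarantee holds.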

Hazard Pointer Usage

Lock-free queues cannot free a node immediately after popping it because another thread might still be reading the same node. Hazard pointers solve this by letting each thread publish the addresses it is currently reading. A node is only freed once no thread holds a hazard pointer to it.

Thread System's hazard pointer subsystem is mostly transparent: the lock-free queue manages its slots automatically. If you implement your own lock-free structure, the API looks like this:

auto& hp = kcenon::thread::hazard_pointer_for(this_thread);
auto guard = hp.protect(slot_ptr); // marks slot_ptr as in-use
// ... read through slot_ptr ...
// guard goes out of scope -> slot can later be reclaimed.

Each thread holds a small fixed array of hazard slots (4 by default). Retired nodes accumulate on a per-thread retire list and are scanned periodically; nodes whose addresses do not appear in any thread's hazard array are freed in batch.
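
The scheme is easier to see in code. The sketch below illustrates the retire-and-scan idea under the stated defaults (4 slots per thread); it is not Thread System's internal implementation, and kMaxThreads and the retire threshold are assumptions chosen for the example.

#include <algorithm>
#include <atomic>
#include <functional>
#include <vector>

constexpr int kMaxThreads = 64;     // assumption for the sketch
constexpr int kSlotsPerThread = 4;  // mirrors the "4 slots by default" above

std::atomic<void*> g_hazard[kMaxThreads * kSlotsPerThread];  // zero-initialized

struct retired_node {
    void* ptr;
    std::function<void(void*)> deleter;
};
thread_local std::vector<retired_node> t_retired;

// Each thread publishes the pointer it is about to read in one of its own
// slots (slot = tid * kSlotsPerThread + i for thread id tid).
void protect(int slot, void* p) {
    g_hazard[slot].store(p, std::memory_order_seq_cst);
}

// Free every retired node whose address appears in no hazard slot.
void scan_and_reclaim() {
    std::vector<void*> live;
    for (auto& h : g_hazard)
        if (void* p = h.load(std::memory_order_seq_cst))
            live.push_back(p);
    std::sort(live.begin(), live.end());

    std::vector<retired_node> keep;
    for (auto& n : t_retired) {
        if (std::binary_search(live.begin(), live.end(), n.ptr))
            keep.push_back(std::move(n));  // still protected: try again later
        else
            n.deleter(n.ptr);              // no reader holds it: safe to free
    }
    t_retired.swap(keep);
}

// Retire instead of deleting; scan in batches to amortize the cost.
void retire(void* p, std::function<void(void*)> deleter) {
    t_retired.push_back({p, std::move(deleter)});
    if (t_retired.size() >= 2u * kMaxThreads * kSlotsPerThread)
        scan_and_reclaim();
}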

Common pitfalls:

  • Forgetting to republish the hazard pointer when a CAS retries; always republish and re-validate before re-reading shared state (see the loop sketch after this list).
  • Holding a guard for too long blocks reclamation; keep critical sections short.
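
The first pitfall is worth a sketch. The loop below shows the generic protect-then-validate pattern that an API like hp.protect wraps: publish the candidate pointer, then re-read the shared location to confirm it did not change in between; if it did, republish and retry. This is the standard pattern, not the library's exact code.

#include <atomic>

struct node { node* next; };
std::atomic<node*> head;       // shared structure head
std::atomic<node*> my_hazard;  // this thread's slot in the hazard table

node* acquire_head() {
    node* h;
    do {
        h = head.load(std::memory_order_acquire);
        my_hazard.store(h, std::memory_order_seq_cst);    // publish before validating
    } while (h != head.load(std::memory_order_acquire));  // head moved: republish, retry
    return h;  // safe to dereference until my_hazard is cleared
}

Skipping the validation load, or the republish on retry, re-opens exactly the use-after-free the hazard pointer is meant to prevent.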

Performance Characteristics

Approximate behavior under typical x86_64 conditions (your numbers will vary; a minimal timing sketch follows this list):

  • SPSC ring buffer: Tens of nanoseconds per enqueue/dequeue when both endpoints stay hot in cache.
  • Lockfree MPMC queue: Roughly 100 to 300 ns per operation under moderate contention; degrades gracefully as producers/consumers scale.
  • Mutex-based job_queue: Comparable to the lock-free queue at low contention, slower beyond about 4 contending threads.
  • Adaptive job queue: Adds a small overhead (single atomic load) over the underlying mode, but avoids the worst case in either direction.
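
To calibrate these figures on your own machine, a self-contained timer like the one below (standard library only, no Thread System types) measures the uncontended mutex floor as a point of comparison; real contention will push the mutex numbers up from there.

#include <chrono>
#include <cstdio>
#include <mutex>
#include <queue>

int main() {
    std::queue<int> q;
    std::mutex m;
    constexpr int kOps = 1'000'000;
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < kOps; ++i) {
        { std::lock_guard<std::mutex> lock(m); q.push(i); }  // locked enqueue
        { std::lock_guard<std::mutex> lock(m); q.pop(); }    // locked dequeue
    }
    const auto elapsed = std::chrono::steady_clock::now() - start;
    const auto ns =
        std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed).count();
    std::printf("%.1f ns per push/pop pair\n", double(ns) / kOps);
}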

Memory:

  • Lock-free queues hold extra retired nodes until reclamation runs. Expect a small per-thread overhead proportional to the retire list threshold.
  • SPSC ring buffers preallocate the full capacity; choose the size carefully.

When in doubt, run the benchmarks/queue_benchmark target on the hardware you intend to deploy on before committing to a specific queue type.

Next Steps