Thread System 0.3.1
High-performance C++20 thread pool with work stealing and DAG scheduling
When something goes wrong with concurrent code, the symptoms are often vague — a hang, a crash without a stack trace, or a slowdown that only appears under load. This page collects the most frequent issues we see with Thread System and the diagnostic steps that resolve them.
**Symptoms**: The pool stops making progress, queue depth grows, futures never complete.
**Common causes**:
- A job calls `future.get()` on another job submitted to the same pool, but every worker is blocked waiting on a future, so nothing can run the awaited job.
- A callback blocks on a mutex held elsewhere in user code.
**Diagnostic steps**:
- Use `kcenon::thread::diagnostics::dump_pool_state()` (or your debugger) to list worker stack traces. If every worker is parked in `std::future::wait`, you have an in-pool dependency cycle.
- Attach `gdb` / `lldb` and inspect each worker's backtrace. Look for matching mutex addresses across two threads.
- Enable ThreadSanitizer (`cmake --preset tsan`) and re-run the failing test. TSan detects most lock-order inversions automatically.
**Fix**: Use the DAG scheduler for in-pool dependencies. Adopt `std::scoped_lock` to lock multiple mutexes atomically. Never hold a mutex across `submit_task`.
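The `std::scoped_lock` part of the fix is plain standard C++. A minimal sketch, assuming two mutexes that different jobs need in different orders (the mutexes and `transfer_and_audit` are illustrative names, not part of Thread System):

```cpp
#include <mutex>
#include <string>
#include <vector>

std::mutex accounts_mutex;
std::mutex audit_mutex;
std::vector<std::string> audit_log;

void transfer_and_audit(int amount) {
    // scoped_lock acquires both mutexes atomically (deadlock-avoidance
    // algorithm), so two jobs that need the same pair of locks can never
    // deadlock by taking them in opposite orders.
    std::scoped_lock guard(accounts_mutex, audit_mutex);
    audit_log.push_back("transfer " + std::to_string(amount));
}   // both locks released here -- submit follow-up work only after this
    // point, never while a lock is still held
```

Submitting follow-up work from inside the locked region is exactly the "mutex held across `submit_task`" pattern the fix warns about.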
**Symptoms**: RSS grows steadily under load even though jobs complete.
**Common causes**:
- A future is returned from `submit_task` and the caller never reads or drops it. The shared state stays alive until the future is destroyed.
- The hazard pointer retire list never reaches the reclamation threshold because one thread rarely runs.
**Diagnostic steps**:
- Run `valgrind --tool=memcheck` or AddressSanitizer (`cmake --preset asan`).
- Use `heaptrack` or jemalloc statistics to find the call site that allocates the leaked memory.
- Check job lifetimes — long-running jobs delay cleanup of everything they capture.
**Fix**: Drop futures you do not need (or call `.wait()` and let them go). Capture large state by `std::shared_ptr` or move it. For hazard pointer buildup, occasionally call `hazard_domain::scan_now()` from a maintenance thread.
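A sketch of the future and capture hygiene described above. It assumes only that `submit_task` returns a `std::future`, as this page describes; the pool type is left as a template parameter because its exact class name is not shown here:

```cpp
#include <future>
#include <memory>
#include <vector>

template <typename Pool>
void process_batch(Pool& pool) {
    // Capture large state through a shared_ptr: the closure holds one
    // reference instead of a copy, and releases it as soon as the job is done.
    auto big = std::make_shared<std::vector<double>>(1'000'000);

    auto done = pool.submit_task([big] {
        // ... work on *big ...
    });

    // Consume the result (or simply let `done` fall out of scope). Stashing
    // futures you never read in long-lived containers keeps their shared
    // state -- and everything the job captured -- alive until destruction.
    done.wait();
}
```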
**Symptoms**: The code runs on Linux but hangs on Windows, or works on x86_64 but crashes on AArch64.
**Common causes**:
- Differences in `std::thread::hardware_concurrency()` reporting (cgroups on Linux containers, hybrid cores on Windows).
- Relaxed atomic orderings that happen to hold on x86_64's strong memory model but race on AArch64.
- Platform differences in `thread_local` destruction order during shutdown.
**Diagnostic steps**:
- Audit `std::atomic` usage; default to `std::memory_order_seq_cst` until proven slow.
- On Linux containers, check `/sys/fs/cgroup` limits — they affect what `hardware_concurrency()` returns.
**Fix**: Use `std::memory_order_seq_cst` by default, then relax with care after profiling. Pin thread-local cleanup order using explicit `thread_local` destructors. Configure the pool's worker count from a runtime setting instead of trusting platform defaults.
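For the worker-count point, a sketch of resolving the count from a runtime setting, with the platform value only as a fallback (`THREAD_SYSTEM_WORKERS` is an illustrative environment variable, not a documented setting of the library):

```cpp
#include <cstddef>
#include <cstdlib>
#include <thread>

std::size_t resolve_worker_count() {
    // An explicit runtime setting always wins; hardware_concurrency() can
    // report the host's core count inside a cgroup-limited container, count
    // hybrid efficiency cores, or legally return 0.
    if (const char* env = std::getenv("THREAD_SYSTEM_WORKERS")) {
        if (unsigned long n = std::strtoul(env, nullptr, 10); n > 0)
            return static_cast<std::size_t>(n);
    }
    unsigned hw = std::thread::hardware_concurrency();
    return hw != 0 ? hw : 4;   // conservative fallback when the value is 0
}
```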
**Symptoms**: Throughput is lower than expected, tail latency spikes, CPU utilization is high without forward progress.
**Common causes**:
- False sharing between counters or job control blocks placed on the same cache line.
- Contention on a single shared job queue once many workers compete for it.
- Blocking I/O jobs mixed into the same pool as CPU-bound work.
**Diagnostic steps**:
- Run `benchmarks/thread_pool_benchmark` and compare against the shipped baseline numbers.
- Profile with `perf record -F 999 -g` to find the hottest functions.
- Watch `thread_pool_diagnostics::queue_depth` and worker idle counters — if workers are idle while the queue is non-empty, there is a wakeup or stealing problem.
- Run the autoscaler in observation-only mode and compare its recommendation to your static configuration.
**Fix**: Switch to `adaptive_job_queue`, enable work stealing for skewed loads, split CPU and I/O work into separate pools, and align hot counters to cache lines (`alignas(64)`).
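The cache-line alignment part of the fix in standard C++ (the `worker_stats` struct is illustrative, not a Thread System type):

```cpp
#include <atomic>
#include <cstdint>

// Each worker gets its own 64-byte-aligned counter block, so one worker's
// increments no longer invalidate the cache line a neighbouring worker is
// writing to (false sharing).
struct alignas(64) worker_stats {
    std::atomic<std::uint64_t> jobs_executed{0};
    std::atomic<std::uint64_t> jobs_stolen{0};
};

worker_stats per_worker_stats[16];   // adjacent slots land on distinct cache lines
```

Without the alignment, adjacent counters typically share one line and every increment ping-pongs that line between cores.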
**Symptoms**: The program reaches `pool->stop()` but never returns.
**Common causes**:
- Nested pools — pool A's worker waits on a future from pool B, which is already shutting down.
- Long-running jobs that never observe their cancellation token.
**Diagnostic steps**:
- Threads blocked in `condition_variable::wait` inside `stop()` indicate jobs that never completed.
- Check which jobs were still queued or running at the moment you called `stop()`.
- Look for futures captured by other long-lived objects.
**Fix**: Cancel any cancellation tokens before stopping the pool. Use `stop(std::chrono::seconds(N))` to bound the wait. Shut down dependent pools in reverse dependency order — outermost first.
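A shutdown-order sketch under two assumptions: jobs on `outer_pool` wait on futures produced by `inner_pool`, and the bounded `stop(std::chrono::seconds(N))` overload from the fix is available. The pool and token types are left as template parameters, and the token's `request_stop()` call is illustrative:

```cpp
#include <chrono>

template <typename Pool, typename Token>
void shutdown_in_order(Pool& outer_pool, Pool& inner_pool, Token& cancel) {
    // 1. Signal cancellation first so long-running jobs can bail out instead
    //    of blocking the stop() calls below.
    cancel.request_stop();

    // 2. Outermost pool first: its workers may still be waiting on futures
    //    produced by inner_pool, so inner_pool must stay alive until they drain.
    outer_pool.stop(std::chrono::seconds(5));

    // 3. Inner pool last, once nothing waits on its results.
    inner_pool.stop(std::chrono::seconds(5));
}
```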
If these steps do not isolate the issue, open a GitHub issue with:
- A `thread_pool_diagnostics` dump captured at the failure point.

See also Frequently Asked Questions for quick answers and Tutorial: Thread Pool for usage patterns that prevent these issues in the first place.