Beyond OpenMP in C++ & Rust: Taskflow, Rayon, Fork Union

Author: Ash Vardanian
Date: May 20, 2025
Reading time: 12 minutes
Source: Ash's Blog

---

Overview

OpenMP has long been the standard for coarse-grain parallelism in C and C++. However, it has limitations in fine-grain parallelism, portability, and meta-programming complexity. Many C++ and Rust thread-pool libraries rely on asynchronous task queues, which add overhead and lead to significant performance losses compared to OpenMP, especially in fork-join workloads.

Ash Vardanian introduces Fork Union, a minimal (~300 lines) thread pool library for C++ and Rust that closely approaches OpenMP performance (~20% slower) without relying on complex heuristics or NUMA tricks, using only standard libraries and no dependencies.

---

Key Points

Motivation Against OpenMP and Existing Thread Pools

- OpenMP limits:
  - Difficult for independent subsystems requiring fine-grain parallelism.
  - Poor portability compared to STL and Rust standard libraries.
  - Unwieldy meta-programming experience with #pragma omp inside templates.
- Conventional thread pools:
  - Usually built as asynchronous task queues, heavier than fork-join.
  - Suffer from performance killers called the "four horsemen":
    - Locks and mutexes: expensive system calls and context switches.
    - Heap allocations: dynamic memory for tasks and queues impacts scalability on large core counts.
    - Atomics and Compare-And-Swap (CAS) operations: expensive and cause stalls.
    - False sharing: unaligned counters causing cache contention.

Fork Union Design

Fork Union is focused on fork-join parallelism and offers four core interfaces:

- for_each_thread: run a callback per thread (like #pragma omp parallel).
- for_each_static: divide evenly sized tasks (like #pragma omp for schedule(static)).
- for_each_slice: for slices of evenly sized tasks.
- for_each_dynamic: for unevenly sized tasks (like #pragma omp for schedule(dynamic, 1)).

Lambdas/closures must be noexcept and return void, simplifying implementation.
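
To make the static-scheduling idea concrete, here is a minimal Rust sketch of how evenly sized tasks can be split into contiguous ranges, one per thread. It is not Fork Union's implementation: `static_range` and `for_each_static_sketch` are hypothetical names chosen for illustration, and the sketch spawns fresh scoped threads on every call, whereas a real pool would keep its threads alive and reuse them. The closure returns `()`, mirroring the "return void" requirement above.

```rust
use std::thread;

/// Hypothetical helper (not part of Fork Union): the contiguous range of
/// task indices that thread `thread_index` handles when `0..n` is split
/// into `threads` near-equal parts.
fn static_range(n: usize, threads: usize, thread_index: usize) -> std::ops::Range<usize> {
    let base = n / threads;
    let extra = n % threads; // the first `extra` threads take one more task
    let start = thread_index * base + thread_index.min(extra);
    let len = base + usize::from(thread_index < extra);
    start..start + len
}

/// Illustrative stand-in for a static fork-join loop: calls `task(i)` once
/// for every `i` in `0..n`, with the indices partitioned across `threads`
/// scoped threads. Returns only after every thread has finished.
fn for_each_static_sketch<F>(n: usize, threads: usize, task: F)
where
    F: Fn(usize) + Sync,
{
    assert!(threads > 0, "need at least one thread");
    thread::scope(|scope| {
        for t in 0..threads {
            let task = &task; // share the closure by reference
            scope.spawn(move || {
                for i in static_range(n, threads, t) {
                    task(i);
                }
            });
        }
    }); // leaving the scope joins all threads: the "join" in fork-join
}
```

With 10 tasks and 4 threads, `static_range` assigns `0..3`, `3..6`, `6..8`, and `8..10`, so no thread receives more than one task beyond any other. The partition is computed up front with plain arithmetic, so the loop itself needs no queue, lock, or heap allocation per task.
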
Thread spawning is done explicitly via try_spawn to avoid throwing exceptions or panics during construction. The library is written in standard C++ and Rust without external dependencies.

Performance Benchmarks

Tested on AWS Graviton 4 (96-core Arm v9 CPU), using a parallel float summation benchmark that stresses synchronization rather than CPU arithmetic.

C++ benchmarks (throughput, higher is better):

- OpenMP: 585.5 MB/s (fastest baseline)
- Fork Union: 467.7 MB/s (~20% slower than OpenMP)
- Taskflow: 76.3 MB/s (significantly slower)
- Direct std::thread recreation: 3.0 MB/s (very slow)

Rust benchmarks (time per iteration, lower is better):

- Fork Union: ~5,150 ns (fastest)
- Rayon: ~47,251 ns (9x slower than Fork Union)
- Smol async executor: ~54,931 ns
- Tokio: ~240,707 ns (slowest)

Comparisons: Alternative APIs

- Rayon (Rust): Excellent for data parallelism with parallel iterators, but its underlying thread-pool design inherits the performance issues of asynchronous task queues.
- Taskflow (C++): Popular for task graphs and asynchronous execution, but heavyweight and slower for simple fork-join workloads.
- Fork Union vs. OpenMP: Competes closely on synchronization-heavy workloads without additional complexity.

The Four Horsemen of Low Performance

- Locks & Mutexes: Slow paths involve syscalls (futex on Linux) and context switches. Prefer atomic operations over mutexes when possible.
- Memory Allocations: Using heap allocations for futures, tasks, and queues adds unpredictability and overhead. Avoid global allocators to maintain scalability on large multi-core systems.
- Atomics and Compare-And-Swap (CAS): CAS instructions are expensive and cause stalls, especially under contention. ARM architecture has weaker memory