Rust std::thread on GPU: A Performance Trap or a Real Breakthrough?
VectorWare recently announced that Rust's std::thread now runs on the GPU. The marketing framing is predictable: "bridges the gap," "unlocks the ecosystem," "normal Rust." Easier GPU code is always tempting, but the execution model is where such promises break down, and with std::thread on a GPU, performance is the first casualty.
GPU programming operates on a fundamentally different paradigm. CPUs start with one thread and explicitly spawn more, managing concurrency as they go. GPUs launch a kernel that runs thousands of times in parallel, inherently concurrent from the first instruction. Compare a CUDA C kernel like __global__ void scale with its Rust counterpart, pub unsafe extern "ptx-kernel" fn scale. This traditional model forces unsafe Rust, raw pointers, and a constant fight with Rust's ownership model, which was built for CPU execution. Porting existing Rust libraries becomes a mess.
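To make the ergonomic problem concrete, here is a CPU-side sketch of the shape a traditional kernel forces on Rust code; the scale function and its driver are illustrative, not VectorWare's code, and one plain function call stands in for one GPU lane:

```rust
// CPU-side sketch: raw pointers, no slices, no help from the borrow
// checker. On a real GPU target this would be an `extern "ptx-kernel"`
// entry point indexed by thread/block IDs.

/// Scale `n` floats in place. Pointer arithmetic stands in for
/// `data[threadIdx.x]`-style indexing.
unsafe fn scale(data: *mut f32, n: usize, factor: f32) {
    for i in 0..n {
        *data.add(i) *= factor;
    }
}

fn main() {
    let mut v = vec![1.0_f32, 2.0, 3.0];
    // The caller must uphold validity and aliasing rules by hand.
    unsafe { scale(v.as_mut_ptr(), v.len(), 2.0) };
    assert_eq!(v, vec![2.0, 4.0, 6.0]);
}
```

Nothing here is memory-safe by construction: one wrong length and the borrow checker cannot save you, which is exactly the porting pain the article describes.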
VectorWare's strategy is to make the GPU behave more like a CPU, specifically for std::thread. They previously demonstrated async/await on GPU. However, std::thread is the real target, as it underpins rayon, tokio, and much of the Rust parallelism ecosystem. The core idea: map each std::thread to a GPU warp.
Mapping each std::thread to a GPU warp is a deliberate choice. GPU "lanes"—the individual threads within a warp—are too primitive. They lack independent program counters and stacks, advancing in lockstep. Mapping std::thread to a lane would break Rust's thread semantics and cause brutal performance hits from divergence, serializing different code paths within a warp.
VectorWare's insight: warps do have independent program counters and register files. They can execute independently, much like CPU threads. The kernel doesn't need full concurrency from the start. It begins with a single active warp (Warp 0) running main, with other warps waking on demand. This extends Rust's safety guarantees, borrow checker, and lifetime rules directly to GPU code.
The underlying model works roughly like this: the kernel launches with all warps resident, but only Warp 0 is active, running main; each thread::spawn hands a closure to an idle warp, which wakes, executes it, and parks again.
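A minimal CPU-side simulation can make the warp-pool idea concrete. Everything below is a hypothetical illustration, not VectorWare's implementation: the WarpPool type and its spawn signature are invented, and an OS thread stands in for a warp with its own program counter and registers:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Illustrative model: a fixed pool of "warps" where Warp 0 runs main
/// and each spawn claims an idle warp.
struct WarpPool {
    total: usize,
    next_free: AtomicUsize, // warp 0 is reserved for main
}

impl WarpPool {
    fn new(total: usize) -> Self {
        Self { total, next_free: AtomicUsize::new(1) }
    }

    /// Claim an idle warp and run the closure on it, or fail if every
    /// warp is taken -- unlike an OS, the GPU cannot swap in more.
    fn spawn<F>(&self, f: F) -> Result<thread::JoinHandle<()>, &'static str>
    where
        F: FnOnce(usize) + Send + 'static,
    {
        let id = self.next_free.fetch_add(1, Ordering::SeqCst);
        if id >= self.total {
            return Err("all warps claimed: spawn fails instead of queueing");
        }
        // An OS thread plays the role of one independently scheduled warp.
        Ok(thread::spawn(move || f(id)))
    }
}

fn main() {
    let pool = WarpPool::new(4); // e.g. 4 resident warps
    let h = pool.spawn(|warp_id| println!("closure on warp {warp_id}")).unwrap();
    h.join().unwrap();
}
```

The key design point the sketch captures: the pool is fixed at launch, and a spawn beyond it has nowhere to go.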
This approach supports thread::current() (returning a warp ID), thread::sleep() (using nanosleep), thread::yield_now(), and all standard builder patterns. While currently for NVIDIA GPUs, the concept should adapt to other architectures with subgroups or wavefronts.
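On a CPU today, those same std APIs look like the snippet below; the spawn_named_worker helper is a made-up name for illustration. The claim is that identical code compiles for the GPU backend, with current() naming a warp and sleep() lowering to nanosleep:

```rust
use std::thread;
use std::time::Duration;

/// Exercise the std::thread surface the warp-backed model claims to
/// support: the builder pattern, current(), yield_now(), and sleep().
fn spawn_named_worker() -> Option<String> {
    thread::Builder::new()
        .name("worker".into()) // standard builder pattern
        .spawn(|| {
            thread::yield_now(); // scheduling hint
            thread::sleep(Duration::from_millis(1)); // nanosleep on GPU
            thread::current().name().map(String::from)
        })
        .expect("spawn failed")
        .join()
        .unwrap()
}

fn main() {
    println!("{:?}", spawn_named_worker());
}
```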
The proposed benefits are significant: abstraction, no explicit GPU concepts, divergence prevention (each thread/warp runs a single closure), Rust safety, and ecosystem compatibility. On paper, this sounds ideal; in practice, serious costs emerge.
The Cost of Abstraction
The skepticism surrounding this approach is justified. Rust's safety and developer experience on the GPU are appealing, but the performance implications of this "Warp-as-Thread" model are stark.
Finite Resources
Warps are a finite resource. You cannot thread::spawn() indefinitely: std::thread::available_parallelism() reports the warp count, and exceeding that limit will halt or crash your program. Unlike CPU threads, there is no OS to swap oversubscribed threads in and out.
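A defensive pattern follows directly: query available_parallelism() and clamp before spawning. The safe_worker_count helper below is a hypothetical sketch, runnable on a CPU, where the same query reports core count instead of warp count:

```rust
use std::thread;

/// Clamp a requested worker count to what the platform reports.
/// On the warp-backed model this limit would be the warp count, and
/// exceeding it halts or crashes rather than oversubscribing.
fn safe_worker_count(requested: usize) -> usize {
    let limit = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1); // conservative fallback if the query fails
    requested.min(limit)
}

fn main() {
    // Naively asking for 10,000 threads would exhaust the warp pool;
    // clamping keeps the spawn loop within the hardware limit.
    let workers = safe_worker_count(10_000);
    let handles: Vec<_> = (0..workers)
        .map(|i| thread::spawn(move || i))
        .collect();
    let sum: usize = handles.into_iter().map(|h| h.join().unwrap()).sum();
    println!("{workers} workers, sum {sum}");
}
```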
Synchronization Overhead
GPU synchronization primitives (mutexes, condition variables) are far more expensive than their CPU counterparts; they decimate occupancy and throughput. std::thread code that relies on shared state and locks will incur massive slowdowns. (The complexity of building an efficient GPU mutex is routinely underestimated.)
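The lock-heavy pattern in question, and the restructuring that avoids it, can both be shown in ordinary std::thread code; these two hypothetical functions sum the same total, but the first serializes every increment through a Mutex (the pattern that becomes catastrophic when each "thread" is a warp contending through GPU memory), while the second keeps work thread-local and merges at the end:

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Shared-state summing: every increment takes the lock. Cheap-ish on
/// a CPU, ruinous when each contender is a warp.
fn sum_with_mutex(n_threads: usize, per_thread: usize) -> usize {
    let total = Arc::new(Mutex::new(0usize));
    let handles: Vec<_> = (0..n_threads)
        .map(|_| {
            let total = Arc::clone(&total);
            thread::spawn(move || {
                for _ in 0..per_thread {
                    *total.lock().unwrap() += 1; // serializes all threads
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let v = *total.lock().unwrap();
    v
}

/// Sharded summing: each thread works locally; results merge via join.
fn sum_sharded(n_threads: usize, per_thread: usize) -> usize {
    let handles: Vec<_> = (0..n_threads)
        .map(|_| thread::spawn(move || (0..per_thread).count()))
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn main() {
    assert_eq!(sum_with_mutex(4, 1000), 4000);
    assert_eq!(sum_sharded(4, 1000), 4000);
}
```

Same answer, very different contention profile; code written in the first style is exactly what ports worst.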
Underutilization
The critical flaw is underutilization: each std::thread consumes an entire warp, typically 32 lanes on NVIDIA hardware. If your spawned closure exercises only one lane, the other 31 sit idle, wasting roughly 97% of that warp's compute. For simple data-parallel tasks, this is a terrible trade-off: traditional kernels pack lanes efficiently, while this model sacrifices raw throughput for programming convenience.
Idle Warp Consumption
Unassigned warps still consume hardware resources while idle, a cost that is not negligible.
Memory Constraints
Stack memory for these threads draws from constrained GPU memory. Deep call stacks or too many concurrent threads will quickly exhaust it. Default CUDA stack sizes often require manual adjustment even for conventional kernels; now imagine every std::thread demanding its own stack.
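std::thread already exposes the relevant knob via Builder::stack_size, which would matter far more on a device where every stack comes out of scarce memory. The spawn_small_stack helper and the 64 KiB figure below are illustrative choices, not CUDA defaults:

```rust
use std::thread;

/// Spawn a worker with an explicitly small stack. On a GPU backend,
/// sizing per-thread stacks would stop being optional tuning.
fn spawn_small_stack() -> thread::JoinHandle<usize> {
    thread::Builder::new()
        .stack_size(64 * 1024) // 64 KiB: keep per-thread stacks deliberately small
        .spawn(|| {
            // Shallow work only: deep recursion would overflow this stack.
            (0..100usize).sum()
        })
        .expect("spawn failed")
}

fn main() {
    let sum = spawn_small_stack().join().unwrap();
    println!("{sum}");
}
```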
This approach is not a universal solution. It's a specialized tool, useful only in narrow contexts. If your goal is porting a complex CPU-centric library that relies on std::thread for irregular parallelism, or if developer experience and Rust's safety are your absolute top priorities, this might be a win. You gain the borrow checker and ecosystem compatibility. However, for heavy data-parallel work—matrix multiplication, image processing, anything requiring thousands of identical operations—stick to the traditional kernel model. The performance penalty for naively porting CPU-style std::thread code to the GPU will be brutal. You'll leave a significant portion of your GPU's potential on the table.
While the mainstream narrative emphasizes "bridging the gap," this abstraction comes with a significant performance cost. You still need to understand the underlying hardware. This isn't about making GPUs "just another Rust platform" by ignoring their architecture. It's about adding an abstraction layer. That layer comes with a cost, which many workloads may find prohibitive.