# We Bought the Whole GPU, So We're Damn Well Going to Use the Whole GPU

Authors: Benjamin Spector, Jordan Juravsky, Stuart Sul, Dylan Lim, Owen Dugan, Simran Arora, Chris Ré
Work by Jordan done while at Stanford
Published: Sep 28, 2025 | 24 min read

---

## TL;DR

- We released a throughput-optimized megakernel for tensor-parallel Llama-70B inference on NVIDIA H100 GPUs.
- The megakernel aggressively overlaps compute, memory, and communication to fully utilize GPU resources.
- Integrated into the Tokasaurus inference engine, it outperforms SGLang by >22% on end-to-end throughput (measured on 65,536 prompts from ShareGPT).
- Code is available on GitHub. The code is research-grade: it is highly sensitive to its environment, and no official support is planned.

---

## Background and Motivation

Our previous work introduced a low-latency megakernel optimized for Llama-1B at batch size 1, focused on eliminating memory stalls and improving latency. This new work targets high-throughput inference on Llama-70B with large batch sizes across multiple GPUs.

Challenges:

- Mixed workload: some operations are compute-bound (e.g. matrix multiplies), while others are memory-bound (e.g. RMS norm, attention decode).
- Multi-GPU communication bottlenecks from NVLink traffic, which require fine-grained overlap of communication and compute.

Existing approaches overlap work via static SM assignments, bespoke kernels, or asynchronous data transfers. The megakernel approach instead uses a pipelined on-GPU instruction interpreter that can overlap many operations at once across SMs and GPUs.

---

## Key Contributions

### Recap on Megakernel Design

- Megakernels fuse an entire model forward pass into one large kernel, eliminating the launch overhead and "memory pipeline bubbles" that come from launching many small kernels.
- They use an instruction-interpreter abstraction:
  - The model forward pass is decomposed into fine-grained instructions (e.g. matrix-multiply tiles, attention steps).
  - An interpreter running on each SM executes these instructions, allowing aggressive pipelining across instructions (loading data for the next instruction while computing the current one).
- The previous low-latency megakernel showed ~50% higher throughput than frameworks like SGLang and vLLM on Llama-1B single-token inference.

### Tensor-Parallel Llama-70B Forward Pass Workflow

- Uses a sequence-parallel variant: batches of tokens are split data-parallel across GPUs, while activations for all tokens are split tensor-parallel across GPUs.
- Each transformer block performs:
  1. Pre-attention RMS norm (data-parallel)
  2. All-gather of activations across GPUs
  3. Tensor-parallel QKV projection, attention, and output projection
  4. Reduce-scatter
  5. Post-attention residual add & pre-MLP RMS norm
  6. Another all-gather
  7. Tensor-parallel MLP
  8. Reduce-scatter
  9. Post-MLP residual add
- To reduce communication overhead, the workflow is modified to replicate the O-projection matrix on all GPUs, replacing a reduce-scatter with a distributed transpose. This lowers inter-GPU traffic by a factor of 8, at the cost of a ~15% batch-size reduction from the extra memory.

### High-Throughput Megakernel Instruction Set

Fused instructions for Llama-70B inference (sketched in code after this list):

- RMS norm + all-gather
- QKV matrix multiply + RoPE embedding
- Attention + distributed transpose
- O-projection matrix multiply + residual add
- Gate matrix multiply + SiLU activation
- Up-projection matrix multiply + element-wise multiply with the gate output
- Down matrix multiply + reduce-scatter + residual add
- RMS norm (for the LM head, no all-gather)
- LM head matrix multiply
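To make this decomposition concrete, here is a minimal sketch of how one transformer block might be encoded as a stream of tile-level fused instructions. The `Opcode` and `Instruction` names, the tile counts, and the `build_block` helper are hypothetical and for illustration only; the actual megakernel's instruction encoding is not shown here.

```cuda
// Hypothetical sketch (not the actual megakernel API): encoding one
// Llama-70B transformer block as a stream of fused, tile-level instructions
// matching the list above. All names here are illustrative assumptions.
#include <cstdio>
#include <vector>

// One opcode per fused instruction in the high-throughput instruction set.
enum class Opcode {
    RmsNormAllGather,                 // RMS norm + all-gather
    QkvMatmulRope,                    // QKV matmul + RoPE embedding
    AttentionTranspose,               // attention + distributed transpose
    OProjResidual,                    // O-projection matmul + residual add
    GateMatmulSilu,                   // gate matmul + SiLU activation
    UpMatmulMultiply,                 // up-projection matmul + elementwise multiply
    DownMatmulReduceScatterResidual,  // down matmul + reduce-scatter + residual add
    RmsNormNoGather,                  // final RMS norm before the LM head
    LmHeadMatmul                      // LM head matmul
};

// A fine-grained unit of work: one output tile of one fused operation.
struct Instruction {
    Opcode op;
    int layer;     // which transformer layer this tile belongs to
    int tile_row;  // tile coordinates in the output matrix
    int tile_col;
};

// Decompose one transformer block into tile-level instructions.
// row_tiles/col_tiles would come from the batch and hidden sizes divided
// by the tile shape; small fixed values are used below for brevity.
std::vector<Instruction> build_block(int layer, int row_tiles, int col_tiles) {
    const Opcode block_ops[] = {
        Opcode::RmsNormAllGather, Opcode::QkvMatmulRope,
        Opcode::AttentionTranspose, Opcode::OProjResidual,
        Opcode::RmsNormAllGather, Opcode::GateMatmulSilu,
        Opcode::UpMatmulMultiply, Opcode::DownMatmulReduceScatterResidual,
    };
    std::vector<Instruction> stream;
    for (Opcode op : block_ops)
        for (int r = 0; r < row_tiles; ++r)
            for (int c = 0; c < col_tiles; ++c)
                stream.push_back({op, layer, r, c});
    return stream;
}

int main() {
    auto stream = build_block(/*layer=*/0, /*row_tiles=*/4, /*col_tiles=*/8);
    std::printf("instructions for one block: %zu\n", stream.size());
    return 0;
}
```

Because each instruction covers just one output tile of one fused operation, the interpreter is free to interleave tiles of compute-bound matmuls with memory-bound and communication-heavy work on the same SM.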
Differences from the low-latency megakernel:

- Instructions operate on matrix-matrix multiplications (tiles of an output matrix), not matrix-vector products.
- Recomputation is avoided (unlike the low-latency kernel, where the RMS norm was redundantly recomputed).
- Inter-instruction synchronization is expanded to handle dependencies across GPUs (a rough illustration follows below).
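The sketch below is a rough, single-GPU illustration of the interpreter-plus-synchronization pattern, not the actual megakernel code: each thread block acts as a persistent interpreter that fetches instructions, spin-waits on a completion flag for its producer instruction, "executes" it (stubbed out here), and then publishes its own completion flag. The `Instruction` layout, the `done_flags` array, and the single-dependency encoding are assumptions; the real system also pipelines loads with compute and extends this synchronization across GPUs over NVLink. The toy dependency chain and small grid are chosen so the spin-waits cannot deadlock.

```cuda
// Minimal single-GPU sketch of an instruction interpreter with flag-based
// inter-instruction synchronization. Illustrative only.
#include <cstdio>
#include <cuda_runtime.h>

struct Instruction {
    int opcode;  // which fused operation to run (stubbed out below)
    int dep_id;  // index of an instruction that must finish first (-1: none)
};

__global__ void interpreter(const Instruction* stream, int n_instr,
                            unsigned int* done_flags) {
    // Each block strides over the instruction stream (persistent-kernel style).
    for (int i = blockIdx.x; i < n_instr; i += gridDim.x) {
        Instruction ins = stream[i];

        // Inter-instruction synchronization: one thread spins until the
        // producer instruction has marked itself complete.
        if (threadIdx.x == 0 && ins.dep_id >= 0) {
            while (atomicAdd(&done_flags[ins.dep_id], 0u) == 0u) { /* spin */ }
        }
        __syncthreads();

        // Dispatch on the opcode. Real instructions would stage tiles into
        // shared memory and run tensor-core matmuls, attention, etc.
        switch (ins.opcode) {
            default: break;  // stub: no-op "execution"
        }
        __syncthreads();

        // Make any global-memory results visible, then publish completion
        // so dependent instructions on other SMs can proceed.
        __threadfence();
        if (threadIdx.x == 0) atomicExch(&done_flags[i], 1u);
    }
}

int main() {
    const int n = 8;
    Instruction host[n];
    for (int i = 0; i < n; ++i) host[i] = {0, i - 1};  // simple dependency chain

    Instruction* d_stream;
    unsigned int* d_flags;
    cudaMalloc(&d_stream, n * sizeof(Instruction));
    cudaMalloc(&d_flags, n * sizeof(unsigned int));
    cudaMemcpy(d_stream, host, n * sizeof(Instruction), cudaMemcpyHostToDevice);
    cudaMemset(d_flags, 0, n * sizeof(unsigned int));

    interpreter<<<4, 128>>>(d_stream, n, d_flags);
    cudaDeviceSynchronize();
    std::printf("done: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaFree(d_stream);
    cudaFree(d_flags);
    return 0;
}
```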