SGLang demonstrates open-source DeepSeek-V3-scale inference on a 96-GPU NVIDIA H100 cluster by combining prefill-decode (PD) disaggregation, large-scale expert parallelism (EP), and the Expert Parallelism Load Balancer (EPLB). The result is a highly optimized, memory-efficient pipeline that nearly matches DeepSeek's reported throughput at scale, while offering transparent, reusable components for community experimentation.

Key ideas and components:

- Prefill-Decode (PD) disaggregation: separate the prefill phase (full-prompt computation that builds the KV cache) from the decode phase (incremental, token-by-token generation against that cache) so each can be optimized independently, including different DP attention and DeepEP dispatch modes per phase (a conceptual toy of the hand-off appears after this section).
- DeepEP and dispatch modes: Normal dispatch for long prefill inputs and Low-Latency dispatch for decode. PD disaggregation allows both modes to run in the same deployment, improving throughput while retaining memory efficiency.
- Large-scale EP and EPLB: expert parallelism distributes MoE weights across GPUs; EPLB optimizes expert placement (including redundant expert replicas, e.g. 288 expert slots) to reduce workload imbalance and enable more flexible EP sizes. This improves GPU utilization and overall throughput, especially as the cluster grows (a greedy rebalancing sketch follows this section).
- DeepGEMM and MoE kernels: grouped GEMMs for MoE computation, with kernels tailored to prefill (contiguous layout) and decode (masked layout), integrated with the DP/TP strategies for efficiency.
- Two-batch overlap (TBO): split each batch into two micro-batches and overlap one micro-batch's communication with the other's computation, hiding latency and reducing peak memory. This yields substantial throughput gains in prefill (27–35%) and in decode, depending on batch size and attention time (a stream-overlap sketch is shown after this section).
- PD implementation details: a Prefill Server and a Decode Server coordinate via a handshake; KV caches move through non-blocking RDMA transfers with Mooncake/NIXL support, with asynchronous data movement and flexible API integration.
- Memory management: a DisposableTensor abstraction for PyTorch releases CUDA memory promptly, reducing peak usage and enabling larger effective batch sizes (sketched after this section).
- Workload analysis and simulation: tools to dump expert workload statistics and simulate expert utilization, enabling informed configuration before large-scale deployment (a simple token-counting helper is included in the rebalancing sketch below).

Experimental setup and results:

- Setup: a 12-node Atlas Cloud cluster with 8 H100 GPUs per node, connected via InfiniBand. Configurations include a TP16x6 baseline, PD disaggregation with full EP, PD with simulated MTP, and reference DeepSeek profile data.
- Throughput highlights:
  - Prefill: roughly 57,674 tokens/sec per node for 1K-token prompts down to 50,302 tokens/sec per node for 4K-token prompts in 4-node tests; two-batch overlap and the DeepGEMM kernels drive most of the gains, approaching DeepSeek's profile once EPLB is enabled.
  - Decode: about 22,282 tokens/sec per node for 2K-token inputs (9 nodes, EP72) and about 17,373 tokens/sec per node for 4K-token inputs under simulated MTP, demonstrating strong scalability with EP and larger batch sizes.
  - End-to-end: 52.3k input tokens/sec and 22.3k output tokens/sec per node for 2,000-token inputs (12 nodes, 8x H100). Estimated cost is about $0.20 per 1M output tokens, roughly one-fifth of the official DeepSeek Chat API (see the back-of-envelope calculation after this section).
- Relative to DeepSeek
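
To make the PD split concrete, here is a purely conceptual toy in plain Python (no GPUs): a prefill worker processes the full prompt once and hands a stand-in "KV cache" to a decode worker, which then generates tokens incrementally. The queue stands in for the non-blocking RDMA transfer (Mooncake/NIXL) used in the real system; all names here are illustrative, not SGLang APIs.

```python
import queue
import threading

# Toy stand-in for the RDMA transfer path between the Prefill and Decode servers.
kv_handoff: queue.Queue = queue.Queue()

def prefill_worker(requests):
    """Process each full prompt once and hand the resulting 'KV cache' off."""
    for req_id, prompt in requests:
        kv_cache = [f"kv({tok})" for tok in prompt.split()]  # placeholder for real KV tensors
        kv_handoff.put((req_id, kv_cache))                   # non-blocking hand-off in the real system

def decode_worker(num_requests, max_new_tokens=4):
    """Consume transferred KV caches and generate tokens one at a time."""
    for _ in range(num_requests):
        req_id, kv_cache = kv_handoff.get()
        generated = []
        for step in range(max_new_tokens):       # one token per step, reusing and growing the cache
            new_token = f"tok{step}"
            generated.append(new_token)
            kv_cache.append(f"kv({new_token})")
        print(req_id, "->", " ".join(generated))

requests = [("req-0", "hello world"), ("req-1", "large scale expert parallelism")]
p = threading.Thread(target=prefill_worker, args=(requests,))
d = threading.Thread(target=decode_worker, args=(len(requests),))
p.start(); d.start(); p.join(); d.join()
```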
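
The EPLB and workload-analysis ideas can be illustrated together: count how many tokens each expert receives, then place expert replicas so that no GPU is much hotter than the rest. The greedy scheme below is a simplified sketch of the load-balancing idea, not SGLang's actual EPLB algorithm; `greedy_placement` and its heuristics are illustrative.

```python
import torch

def expert_token_counts(topk_ids: torch.Tensor, num_experts: int) -> torch.Tensor:
    """Per-expert token counts from one batch's router output.
    topk_ids: [num_tokens, top_k] tensor of routed expert indices."""
    return torch.bincount(topk_ids.flatten(), minlength=num_experts)

def greedy_placement(workload: torch.Tensor, num_gpus: int, slots_per_gpu: int):
    """Simplified load balancing: give the hottest experts the spare (redundant)
    slots, split their load across replicas, then assign the heaviest replica
    to the currently lightest GPU that still has a free slot."""
    num_experts = workload.numel()
    extra_slots = num_gpus * slots_per_gpu - num_experts      # redundant expert slots
    by_load = sorted(((float(w), e) for e, w in enumerate(workload)), reverse=True)

    replicas = []
    for rank, (load, expert) in enumerate(by_load):
        copies = 2 if rank < extra_slots else 1               # duplicate only the hottest experts once
        replicas += [(load / copies, expert)] * copies

    gpu_load = [0.0] * num_gpus
    placement = [[] for _ in range(num_gpus)]
    for load, expert in sorted(replicas, reverse=True):
        for gpu in sorted(range(num_gpus), key=gpu_load.__getitem__):
            if len(placement[gpu]) < slots_per_gpu:
                placement[gpu].append(expert)
                gpu_load[gpu] += load
                break
    return placement, gpu_load

# Example: 16 experts, 4 GPUs, 5 slots each (4 redundant slots), random routing.
counts = expert_token_counts(torch.randint(0, 16, (4096, 8)), num_experts=16)
placement, load = greedy_placement(counts, num_gpus=4, slots_per_gpu=5)
print(placement, [round(l) for l in load])
```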
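
A minimal PyTorch sketch of the two-batch overlap pattern, assuming a CUDA device: one micro-batch computes on the default stream while the previous micro-batch's all-to-all runs on a side stream. `layer_compute` and `all_to_all` are placeholders for the real attention/MoE kernels and DeepEP dispatch/combine collectives, not SGLang's actual implementation.

```python
import torch

def two_batch_overlap(layer_compute, all_to_all, micro_a, micro_b):
    """Split a batch into two micro-batches and overlap communication for one
    with computation for the other, hiding all-to-all latency."""
    comm_stream = torch.cuda.Stream()
    compute_stream = torch.cuda.current_stream()

    # Compute micro-batch A on the default (compute) stream.
    hidden_a = layer_compute(micro_a)

    # Dispatch A on the side stream while B computes on the default stream.
    comm_stream.wait_stream(compute_stream)          # A's compute must finish first
    with torch.cuda.stream(comm_stream):
        dispatched_a = all_to_all(hidden_a)
    hidden_b = layer_compute(micro_b)                # overlaps with A's all-to-all

    # Dispatch B, then wait for both transfers before returning results.
    comm_stream.wait_stream(compute_stream)
    with torch.cuda.stream(comm_stream):
        dispatched_b = all_to_all(hidden_b)
    compute_stream.wait_stream(comm_stream)
    return dispatched_a, dispatched_b

if torch.cuda.is_available():
    x_a = torch.randn(1024, 1024, device="cuda")
    x_b = torch.randn(1024, 1024, device="cuda")
    w = torch.randn(1024, 1024, device="cuda")
    # Dummy "compute" and "communication" ops just to exercise the pattern.
    out = two_batch_overlap(lambda x: x @ w, lambda x: x * 2.0, x_a, x_b)
    torch.cuda.synchronize()
```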
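
The memory-management idea can be approximated in stock PyTorch by releasing a tensor's underlying storage as soon as the tensor is no longer needed, even while Python references to the wrapper remain alive. This is a minimal sketch of the concept, not the exact DisposableTensor code used by SGLang.

```python
import torch

class DisposableTensor:
    """Wrap a tensor so its CUDA memory can be released explicitly, instead of
    waiting for every Python reference to go out of scope."""

    def __init__(self, tensor: torch.Tensor):
        self.tensor = tensor
        self.disposed = False

    def dispose(self) -> None:
        # Resizing the untyped storage to zero returns the CUDA memory to the
        # caching allocator immediately; the Python object stays alive.
        self.tensor.untyped_storage().resize_(0)
        self.disposed = True

if torch.cuda.is_available():
    activation = DisposableTensor(torch.randn(8192, 8192, device="cuda"))
    checksum = activation.tensor.sum().item()  # last real use of the activation
    activation.dispose()                       # peak memory drops here, not at scope exit
    print(checksum, torch.cuda.memory_allocated() // 2**20, "MiB still allocated")
```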
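
The quoted cost figure can be reproduced with simple arithmetic from the decode throughput above, assuming (purely for illustration) an H100 rental price of about $2 per GPU-hour; actual prices vary by provider.

```python
gpu_hour_usd = 2.00                      # assumed H100 rental price (illustrative)
gpus_per_node = 8
output_tokens_per_sec_per_node = 22_300  # decode throughput reported above

node_hour_usd = gpu_hour_usd * gpus_per_node                  # $16 per node-hour
tokens_per_node_hour = output_tokens_per_sec_per_node * 3600  # ~80.3M output tokens
cost_per_million = node_hour_usd / (tokens_per_node_hour / 1e6)
print(f"~${cost_per_million:.2f} per 1M output tokens")       # ~$0.20
```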