TPDE-LLVM: 10-20x Faster LLVM -O0 Back-End

Overview

TPDE-LLVM is an open-source LLVM back-end that aims to significantly speed up LLVM's -O0 compilation mode. It compiles 10-20x faster than the standard LLVM -O0 back-end, with comparable runtime performance and about 10-30% larger code. It supports a typical subset of LLVM-IR and targets the x86-64 and AArch64 architectures.

Performance Data (SPEC CPU 2017 Benchmarks - x86-64)

- Compile-time speedups on O0 IR range from ~9.6x to 21.46x.
- Code size grows by about 24-32% on O0 IR; on O1 IR it is roughly equal or smaller.
- Geometric mean speedup is about 13.34x (O0 IR) and 17.58x (O1 IR), with code size ~1.27x (O0 IR) and 0.97x (O1 IR).
- AArch64 results are similar, with slightly higher speedups because LLVM uses GlobalISel there.

Technical Approach

The TPDE-LLVM workflow consists of three passes:

1. An IR cleanup/preparation pass.
2. An analysis pass focusing on loops and liveness.
3. A code generation pass that combines lowering, register allocation, and machine code encoding.

See the related paper for details.

Features & Current Limitations

- Supports typical Clang O0/O1 IR, but many features remain unsupported.
- Flang support is partial (some FP operations are missing).
- Rust code works but runs into unsupported vector types.
- Planned work includes adding DWARF support and improving register allocation.
- Other possible extensions: support for non-ELF platforms, code models other than small-PIC, and additional targets.

Usage

TPDE-LLVM can be used as:

- A library (e.g., for JIT compilation; compatible with ORC JIT).
- A standalone llc-like tool.
- A back-end integrated into Clang via patches (plugins cannot provide custom back-ends yet).

More details: TPDE-LLVM documentation.

Key Design Notes & Suggested LLVM-IR Changes for Faster Compilation

- Avoid ConstantExpr inside functions by rewriting them to instructions, since they are costly to handle.
- Disallow arbitrarily sized struct/array values inside functions to avoid quadratic runtime complexity.
- Improve the generation and use of thread-local global variables by rewriting accesses to intrinsics.
- Avoid arbitrary bit-width integers outside certain limits (e.g., no multi-word integers beyond i128).

Random Performance-Related Insights

- TPDE stores instruction numbers in 4 padding bytes of the LLVM Instruction class, because there are no auxiliary data slots.
- PHINode::getIncomingValForBlock can cause quadratic slowdown for blocks with many predecessors; this is mitigated by sorting the incoming entries for large blocks.
- llvm::successors is slow, so TPDE caches successors.
- 90% of tpde-llc time is still spent in bitcode parsing.

Community Discussion Highlights

Type Legalization

- TPDE does not fully implement traditional legalization. Instead, it lowers types ad hoc to "basic types" (similar to LLVM's MVT/LLT).
- Examples: i128 is lowered to two i64 parts; i54 becomes i64.
- Illegal vector types are currently unsupported; the plan is to scalarize them for simplicity.
- Vector element types intended to be supported as legal: i1, i8, i16, i32, i64, ptr, half, bfloat, float, double.
- Legalization is not considered a performance bottleneck, but it requires substantial implementation effort.
- It was acknowledged that legalization needs may vary by architecture (e.g., RISC-V, AArch64).

Comparison with LLVM Back-Ends

- GlobalISel is generally faster than SelectionDAGISel. The remark about higher speedups on AArch64 refers to GlobalISel being slower than FastISel in the -O0 context.
- TPDE does not aim to compete with optimized LLVM back-ends in runtime performance or code size; it focuses on compile-time speed.

Interest and Use Cases

- Users such as Wasmer are interested in TPDE-LLVM for dramatically faster development builds, potentially replacing Cranelift.
- Upstreaming plans were asked about but not detailed.

Related Topics

Discussions on LLVM's slowdowns, legalization
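
The ad-hoc integer lowering mentioned under Type Legalization (i128 to two i64 parts, i54 to i64) can be illustrated with a small Python sketch. This is not TPDE's code: the function name `lower_int_type`, the set of basic widths, the rounding rule for widths below 64 bits, and the choice to reject anything wider than 128 bits are assumptions made for illustration, loosely following the suggested i128 limit above.

```python
from typing import List

# Assumed set of "basic" integer widths; the real set used by TPDE may differ.
BASIC_INT_WIDTHS = (8, 16, 32, 64)


def lower_int_type(bits: int) -> List[int]:
    """Return the widths of the basic parts used to hold an iN value.

    Hypothetical rule for this sketch:
      - widths up to 64 bits round up to the next basic width (i54 -> i64)
      - widths from 65 to 128 bits use two 64-bit parts (i128 -> i64, i64)
      - anything wider is rejected as unsupported
    """
    if bits <= 0:
        raise ValueError("integer width must be positive")
    if bits > 128:
        raise NotImplementedError("multi-word integers beyond i128 are not handled")
    if bits > 64:
        return [64, 64]
    for width in BASIC_INT_WIDTHS:
        if bits <= width:
            return [width]
    raise AssertionError("unreachable: bits <= 64 always matches a basic width")


if __name__ == "__main__":
    for n in (1, 32, 54, 128):
        print(f"i{n} ->", lower_int_type(n))
```

Returning a list of part widths keeps the single-part and two-part cases uniform, so a caller can assign one register per part without special-casing i128.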
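
The quadratic behaviour noted for PHINode::getIncomingValForBlock comes from doing a linear scan over the incoming list once per predecessor. The following Python sketch models that idea only; it does not use LLVM's API. The data layout (a list of `(block_id, value)` pairs), the names `lookup_linear` and `SortedIncoming`, and the use of binary search after sorting are assumptions chosen to illustrate the mitigation described above.

```python
from bisect import bisect_left
from typing import List, Tuple

Incoming = List[Tuple[int, str]]  # (predecessor block id, incoming value name)


def lookup_linear(incoming: Incoming, block_id: int) -> str:
    """Linear scan: O(n) per lookup, so O(n^2) when done for all n predecessors."""
    for blk, val in incoming:
        if blk == block_id:
            return val
    raise KeyError(block_id)


class SortedIncoming:
    """Sort the incoming entries once, then answer each lookup by binary search."""

    def __init__(self, incoming: Incoming) -> None:
        # Sort by block id only, so values never need to be comparable.
        self._entries = sorted(incoming, key=lambda entry: entry[0])
        self._keys = [blk for blk, _ in self._entries]

    def lookup(self, block_id: int) -> str:
        i = bisect_left(self._keys, block_id)
        if i < len(self._keys) and self._keys[i] == block_id:
            return self._entries[i][1]
        raise KeyError(block_id)


if __name__ == "__main__":
    # A PHI with many predecessors, stored in arbitrary order.
    phi = [(blk, f"v{blk}") for blk in reversed(range(1000))]
    fast = SortedIncoming(phi)
    assert all(fast.lookup(b) == lookup_linear(phi, b) for b in range(1000))
    print("sorted lookup agrees with linear scan")
```

Sorting costs O(n log n) once, after which all n lookups together cost O(n log n) instead of O(n^2). For small incoming lists the linear scan is likely cheaper, which matches the note that sorting is applied only for large blocks.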