# Why a Decades-Old Architecture Decision Is Impeding the Power of AI Computing

Most computers today follow the von Neumann architecture, which separates the compute unit from memory and connects the two by a bus. While this design has served conventional computing well for over six decades, it creates major bottlenecks for AI workloads.

---

## The Von Neumann Bottleneck Explained

- **Origin:** Named after John von Neumann, who proposed the stored-program computer in 1945.
- **Architecture:** Separate memory and processing units; memory stores both data and instructions.
- **Advantages:** Components can be chosen and upgraded independently, the design adapts to varied workloads, and systems are easy to scale.
- **Issue for AI:** AI models involve huge numbers of parameters (weights) that must be repeatedly moved from memory to the processing units. The speed of that data transfer limits overall system performance (the "bottleneck"). Processors often sit idle waiting for data, and because the operations in an AI workload are highly interdependent, they cannot switch to unrelated work while they wait.

---

## How the Bottleneck Reduces AI Efficiency

- **Data movement costs:** Transferring model weights consumes most of the energy and causes most of the latency in AI computation; the arithmetic itself accounts for only a small fraction (~10%) of the energy. (A rough estimate follows this list.)
- **Memory distance:** Larger models need more distant memory (such as DRAM on other GPUs), increasing energy and time due to the physical limits of moving data over electrical wires.
- **Slow progress:** Training large language models (LLMs) can take months and consume tremendous energy.
- **AI-specific challenge:** Unlike traditional computing, where operations are varied, AI involves repetitive, highly predictable operations that mostly shuttle static weights back and forth.
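To see why data movement dominates, here is a minimal back-of-the-envelope sketch in Python. All numbers are illustrative assumptions (a 3-billion-parameter model, 16-bit weights, 2 TB/s of memory bandwidth, 300 TFLOP/s of peak compute); the point is only the order-of-magnitude ratio, not any specific hardware.

```python
# Rough estimate of the von Neumann bottleneck for single-token LLM inference.
# Every number below is an assumed, illustrative value.

PARAMS = 3e9            # assumed model size: 3 billion weights
BYTES_PER_WEIGHT = 2    # assumed 16-bit weights
FLOPS_PER_WEIGHT = 2    # one multiply + one add per weight per token

MEM_BANDWIDTH = 2e12    # assumed memory bandwidth: 2 TB/s
PEAK_COMPUTE = 3e14     # assumed peak throughput: 300 TFLOP/s

def decode_step_times():
    """Time spent moving weights vs. doing arithmetic for one generated token,
    assuming every weight has to cross the memory bus once."""
    t_memory = PARAMS * BYTES_PER_WEIGHT / MEM_BANDWIDTH
    t_compute = PARAMS * FLOPS_PER_WEIGHT / PEAK_COMPUTE
    return t_memory, t_compute

if __name__ == "__main__":
    t_mem, t_cmp = decode_step_times()
    print(f"weight movement: {t_mem * 1e3:.2f} ms")   # ~3 ms
    print(f"arithmetic:      {t_cmp * 1e3:.2f} ms")   # ~0.02 ms
    print(f"compute sits idle ~{(1 - t_cmp / t_mem) * 100:.0f}% of the step")
```

Under these assumptions the processor spends roughly two orders of magnitude longer waiting for weights than computing with them, which is exactly the idle time described above.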
---

## Solutions to Overcome the Bottleneck

### Near-Memory and In-Memory Computing

- **In-memory computing:** Combines computation and memory to eliminate data-transfer delays, often using analog memory technologies. Candidate memory technologies include SRAM, DRAM, Flash, RRAM, PCM (phase-change memory), and STT-MRAM; PCM stores weights as resistivity changes in chalcogenide glass. (A conceptual sketch appears after the summary table below.)
- **Near-memory computing:** Places local memory next to each compute core, reducing latency and energy by minimizing data movement.
- **IBM's AIU NorthPole processor:** A digital, SRAM-based design with a large array of cores, each with its own local memory. It demonstrated 47x faster inference and 73x better energy efficiency than energy-efficient GPUs on a 3-billion-parameter model.

### Other Hardware Innovations

- **Co-packaged optics:** Polymer optical waveguides bring high-speed fiber optics closer to the chip, improving data locality and lowering training times and energy consumption.
- **Hybrid approach:** Because analog memories have durability (write-endurance) constraints, in-memory chips often deploy models that were first trained on traditional von Neumann GPUs.

---

## Why Von Neumann Architecture Will Persist

- **Best for general-purpose computing:** It offers unmatched flexibility for varied operations and higher precision. In-memory computing has lower precision, which limits its suitability for applications requiring 32- or 64-bit floating-point accuracy.
- **Hybrid future:** Von Neumann and specialized non-von Neumann architectures will likely coexist, each optimized for specific tasks.

---

## Summary

| Aspect             | Von Neumann Architecture               | In-Memory / Near-Memory Computing         |
|--------------------|----------------------------------------|-------------------------------------------|
| Compute vs. memory | Separate units, connected via bus      | Combined or co-located units              |
| Flexibility        | High; easy to upgrade and scale        | Less flexible, often specialized          |
| Efficiency in AI   | Bottleneck due to data-transfer delays | Much better, due to reduced data movement |
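To make the in-memory idea concrete, here is a conceptual sketch in plain Python of how an analog crossbar (for example, PCM or RRAM) multiplies a vector by a matrix in place: weights sit in the array as conductances, inputs arrive as read voltages, and the currents that accumulate on each column are the dot products. The function and values are hypothetical illustrations; real devices add noise, drift, and precision limits that this toy model ignores.

```python
# Conceptual model of an analog in-memory matrix-vector multiply.
# Weights never move: each one is a conductance fixed in the crossbar,
# and the multiply-accumulate happens via Ohm's and Kirchhoff's laws.

def crossbar_matvec(conductances, voltages):
    """conductances[row][col] stands in for a stored weight;
    voltages[row] is an input activation encoded as a read voltage.
    Returns the accumulated current per column, i.e. the matrix-vector
    product computed where the weights live."""
    n_cols = len(conductances[0])
    currents = [0.0] * n_cols
    for row, v in zip(conductances, voltages):
        for col in range(n_cols):
            currents[col] += row[col] * v   # I = G * V, summed on the bitline
    return currents

# Tiny usage example: a 3x2 "weight matrix" and a 3-element input.
weights = [[0.2, 0.5],
           [0.1, 0.3],
           [0.4, 0.0]]
inputs = [1.0, 0.5, 2.0]
print(crossbar_matvec(weights, inputs))   # approximately [1.05, 0.65]
```

Because every multiply-accumulate happens where the weight is stored, no weight ever crosses a bus, which is the efficiency advantage the table above attributes to in-memory and near-memory designs.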