# The RAG Obituary: Killed by Agents, Buried by Context Windows

By Nicolas Bustamante, Oct 01, 2025

---

## Overview

Nicolas Bustamante, with a decade of experience in AI and search, reflects on the decline of Retrieval-Augmented Generation (RAG) architectures in favor of emerging agentic search approaches enabled by large context windows in language models (LLMs). He argues that while RAG was crucial in a context-poor era, it is now being surpassed as LLMs support much larger contexts and agentic behaviors.

---

## The Rise and Limitations of Retrieval-Augmented Generation (RAG)

- **Background:** RAG emerged as a solution to the token limits of early LLMs (GPT-3.5 with ~4K tokens, GPT-4 with ~8K tokens), which could not read entire large documents such as SEC filings (~51K tokens).
- **RAG Architecture:** Works like a search engine: it retrieves relevant chunks (document fragments) and feeds them to the LLM for synthesis. Chunking breaks documents into 400-1,000-token pieces, but naïve chunking disrupts document structure, producing fragmented tables, split phrases, and disjointed narratives (see the chunking sketch below).
- **Chunking Challenges:** Preserving hierarchical document structure, table integrity, footnotes, cross-references, and temporal coherence is complex and imperfect. Metadata can enrich chunks, but fragments inherently lose global context.
- **Embedding and Retrieval:** Chunks are converted into vectors for semantic search, typically combined with keyword search (hybrid search) using scoring functions like BM25 for precision (see the fusion sketch below). Challenges include vocabulary mismatch, poor handling of numbers and financial jargon, and no deeper semantic or causal understanding.
- **Reranking Bottleneck:** An additional step reorders retrieved chunks by relevance using ML models (see the reranking sketch below). It adds latency, cost, and complexity, and, together with context-window limits, makes pipelines heavy and fragile.
- **Infrastructure Burden:** Running full-text search engines (e.g., Elasticsearch) requires enormous resources and ongoing maintenance to index terabytes of data. Cascading failures are possible at every stage (chunking, embedding, fusion, reranking).
- **Fundamental Limitation:** Context fragmentation, numeric-data representation issues, and no causal reasoning or cross-document understanding. RAG treats documents independently: it retrieves relevant passages but does not understand or reason about the information holistically.

---

## The Emergence of Agentic Search – A New Paradigm

- **Inspired by Claude Code (May 2025, by Anthropic):** An AI coding assistant that uses direct filesystem tools (grep, glob) instead of embedding and retrieval.
- **Key Features:** Live regex search (ripgrep) provides immediate, precise pattern matching over whole files. No prior indexing or embedding is needed: there is no indexing latency and no infrastructure overhead. Search is combined with autonomous multi-step task agents that navigate complex queries intelligently (see the agent-primitive sketch below).
- **Advantages:** The agent can load and analyze complete documents or code files rather than fragments. Agents follow references, cross-links, and structure in real time, much like a human researcher. The approach embraces large LLM context windows (the "context-rich" era) to maintain coherence over thousands or millions of tokens.
- **Context Revolution:** Context windows increased dramatically, from 8K tokens (2022-2025) to hundreds of thousands or millions of tokens (2025 onward), enabling end-to-end reasoning and exploration of entire corpora rather than isolated chunks.
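To make the fragmentation complaint concrete, here is a minimal sketch of the kind of naïve fixed-size chunker described above. The window and overlap sizes are illustrative defaults, not numbers taken from any particular system:

```python
def naive_chunk(text: str, chunk_tokens: int = 800, overlap: int = 100) -> list[str]:
    """Split text into overlapping fixed-size windows, ignoring structure.

    Illustrative only: a crude whitespace "tokenizer" stands in for a real
    one, and no attempt is made to respect sentence, table, or section
    boundaries. This is exactly how tables end up fragmented and
    narratives disjointed.
    """
    tokens = text.split()
    step = chunk_tokens - overlap
    return [
        " ".join(tokens[start:start + chunk_tokens])
        for start in range(0, len(tokens), step)
    ]

# Hypothetical usage: a large SEC filing becomes dozens of context-free pieces.
# chunks = naive_chunk(open("filing_10k.txt").read())
```

Note that nothing in this function knows where a footnote ends or a table begins; every downstream stage inherits that blindness.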
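Hybrid retrieval then needs a way to merge the keyword and vector rankings. Below is a minimal sketch using reciprocal rank fusion (RRF), one common fusion scheme; the article does not prescribe a specific method, and the chunk IDs are hypothetical:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk IDs into one.

    Each list contributes 1 / (k + rank) per item; k=60 is the default
    from the original RRF paper, used here as an assumed constant.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["chunk_17", "chunk_03", "chunk_42"]    # hypothetical keyword results
vector_hits = ["chunk_03", "chunk_99", "chunk_17"]  # hypothetical semantic results
print(reciprocal_rank_fusion([bm25_hits, vector_hits]))
# Chunks ranked high in both lists (chunk_03, chunk_17) surface first.
```

Fusion is one of the stages where the article notes cascading failures can occur: a chunk missed by both indexes is simply invisible to the LLM.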
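The reranking stage typically means one full model forward pass per (query, chunk) pair, which is where the added latency and cost come from. A sketch assuming the sentence-transformers library and a public cross-encoder checkpoint, as illustrative choices rather than any particular vendor's stack:

```python
from sentence_transformers import CrossEncoder

# Assumed checkpoint: a commonly used public MS MARCO cross-encoder.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[str], top_n: int = 5) -> list[str]:
    """Score every (query, chunk) pair and keep the top_n chunks.

    Each pair requires its own forward pass through the model, so cost
    grows linearly with the number of retrieved candidates.
    """
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(scores, chunks), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked[:top_n]]
```

Even with a small reranker, this step sits on the critical path of every query, compounding the latency of retrieval and fusion before it.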
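By contrast, the agentic primitive is strikingly simple: call ripgrep over the raw files and read whole documents. A sketch assuming `rg` is installed on the system; the LLM tool-calling loop that chooses patterns and files to open next is elided:

```python
import subprocess

def rg_search(pattern: str, path: str = ".") -> list[str]:
    """Live regex search with ripgrep: no index, no embeddings.

    Output lines have the form "path:line_number:match" because of the
    --no-heading and -n flags.
    """
    result = subprocess.run(
        ["rg", "--no-heading", "-n", pattern, path],
        capture_output=True, text=True,
    )
    return result.stdout.splitlines()

def read_file(path: str) -> str:
    """Agents load whole documents, not 400-token fragments."""
    with open(path, encoding="utf-8") as f:
        return f.read()

# One hypothetical agent turn: search, then open the full matching document
# and hand its entire text to a large-context LLM.
hits = rg_search(r"net revenue retention", "filings/")
if hits:
    first_file = hits[0].split(":", 1)[0]
    document = read_file(first_file)
```

The design point is that search happens at query time against the live files, so there is nothing to keep in sync and no pipeline to break.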
---

## Why Agentic Search Represents the Future

- **Handling Increasing Document Complexity:** Documents are longer and more interconnected; cross-referencing and multi-document understanding are required.
- **Structured Data Integration:** Tables, narratives, and metadata must be combined in a unified analysis.
- **Real-Time Requirements and Dynamic Data:** Immediate search over live data.