# DeepCodeBench: Real-World Codebase Understanding by Q&A Benchmarking

Author: Ravid Cohen, AI Research Engineer
Date: September 10, 2025
Reading Time: 8 minutes

---

## Overview

Qodo presents DeepCodeBench, a new benchmark dataset of real-world questions derived from large, complex code repositories. The dataset, together with the methodology and prompts used to create it, is made available to support AI research and development in codebase understanding.

---

## Motivation

Large enterprises maintain vast codebases that no individual developer can fully understand. Developers routinely have questions during onboarding, day-to-day development, and AI-assisted workflows. Effective AI research agents therefore need to be benchmarked on realistic, complex codebase questions that demand advanced retrieval across files.

---

## Prior Work

Existing benchmarks such as CodeQA focus largely on artificially generated snippets and lack broader codebase retrieval. Some newer work addresses retrieval, but not specifically from code repositories or from real-world developer questions. DeepCodeBench fills this gap with a benchmark built from real pull requests (PRs) that require multi-file retrieval.

---

## Dataset Generation

Objective: generate questions that require deep retrieval across multiple interconnected files and reflect realistic developer workflows.

Data source: complex PRs that link functionally related code changes, often code that changes together without explicit calls or imports between the files.

Method:

- Extract the code snippets (methods, classes, files) involved in the PR's code changes from the main branch.
- Combine these snippets with the PR title and description to create meaningful context.
- Use large language models (LLMs) to generate authentic questions and answers from this context.

Example: PR #39363 from the Hugging Face Transformers repository touches 4 files, with code blocks such as:

- `BaseImageProcessorFast.__init__`
- `BaseVideoProcessor.__init__`

The generated question:

> "How do the fast image and video processor base classes prevent shared mutable state when instantiating multiple instances?"

The answer describes the deep-copying of mutable defaults in the constructors, which prevents instances from sharing mutable state (a simplified sketch of this pattern appears at the end of this section).

---

## Dataset Statistics

1,144 questions were generated from eight open-source repositories.

Context distribution: the number of code blocks and files involved per question varies. PRs often touch multiple methods within the same file, which increases the block count.

Categorical breakdown:

- Scope: deep (specific to a single code block) vs. broad (spanning multiple blocks or files).
- Core questions: whether the question targets fundamental functionality.
- Searchability: whether the question contains keywords that enable direct search.

---

## Evaluation Mechanism: LLM as a Judge

Evaluation relies on fact recall:

- Extract verifiable facts from the ground-truth answer.
- Check the predicted answer for the presence of each fact using automated LLM calls.

This method provides objective, scalable assessment. It is inspired by the 2003 TREC QA Track and is used in systems such as Google DeepMind's SAFE. (A minimal sketch of this fact-checking loop appears below, after the results highlights.)

---

## Baselines and Results

Baselines evaluated:

- Ground truth: verifies the accuracy of the evaluation method itself.
- LLM with full context: an upper bound on performance.
- LLM with no context: a lower bound that measures the model's prior knowledge.

Models tested include:

- Codex CLI
- Claude Code
- Qodo's Deep Research agent (Qodo Aware)

Highlights: Qodo's Deep Research agent outperforms competitors with around 76% fact recall, beating Codex (~74%) and Claude (~64%). Deep Research is also faster, and reaches 80% with high reasoning enabled (at some runtime cost).
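The fact-recall scores above come from the LLM-as-a-judge procedure described in the evaluation section. As a rough illustration only, here is a minimal sketch of that loop; the function names, prompt wording, and the `llm_call` helper are hypothetical and stand in for any chat-completion API, not Qodo's actual implementation.

```python
def judge_fact(fact: str, predicted_answer: str, llm_call) -> bool:
    """Ask an LLM judge whether one ground-truth fact is supported by the
    predicted answer. `llm_call` is a hypothetical wrapper that sends a
    prompt to an LLM and returns its text response."""
    prompt = (
        "Ground-truth fact:\n"
        f"{fact}\n\n"
        "Candidate answer:\n"
        f"{predicted_answer}\n\n"
        "Does the candidate answer state or clearly imply this fact? "
        "Reply with exactly YES or NO."
    )
    return llm_call(prompt).strip().upper().startswith("YES")


def fact_recall(facts: list[str], predicted_answer: str, llm_call) -> float:
    """Fraction of ground-truth facts that the predicted answer covers."""
    if not facts:
        return 0.0
    supported = sum(judge_fact(f, predicted_answer, llm_call) for f in facts)
    return supported / len(facts)
```

Presumably the per-question recall is then aggregated across the benchmark to produce the headline numbers reported above.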
Performance broken down by question scope and searchability: Deep Research performs equally well on broad and deep questions, reflecting its wide search capabilities. All agents perform better on searchable questions, where keywords in the question make the relevant code easier to locate directly.
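Returning to the dataset-generation example from PR #39363: the pattern the generated answer describes, deep-copying mutable defaults in the constructor so that instances never share state, looks roughly like the following simplified sketch. The class and attribute names here are hypothetical and are not the actual Transformers code.

```python
import copy


class BaseProcessorSketch:
    # Hypothetical class-level defaults; mutable objects like dicts and lists
    # would be shared across instances if assigned directly in __init__.
    default_size = {"height": 224, "width": 224}
    default_mean = [0.5, 0.5, 0.5]

    def __init__(self, size=None, image_mean=None):
        # Deep-copy the mutable defaults so mutating one instance's
        # attributes cannot leak into other instances of the class.
        self.size = copy.deepcopy(size if size is not None else self.default_size)
        self.image_mean = copy.deepcopy(
            image_mean if image_mean is not None else self.default_mean
        )


# Two instances no longer share the same dict object:
a = BaseProcessorSketch()
b = BaseProcessorSketch()
a.size["height"] = 512
assert b.size["height"] == 224
```

Without the `copy.deepcopy` calls, both instances would reference the same class-level dict, which is exactly the shared mutable state the benchmark question asks about.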