Extract-0: A Specialized Language Model for Document Information Extraction

Author: Henrique Godoy
Submitted: 26 Sep 2025
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
arXiv ID: arXiv:2509.22906
DOI: 10.48550/arXiv.2509.22906

---

Abstract

Extract-0 is a 7-billion-parameter language model specialized for document information extraction. Despite its relatively modest size, it surpasses much larger general-purpose models on a benchmark of 1,000 diverse document extraction tasks.

---

Key Features and Contributions

Model Size: 7 billion parameters.

Performance: Achieves a mean reward of 0.573, outperforming state-of-the-art models such as:
- GPT-4.1 (0.457)
- o3 (0.464)
- GPT-4.1-2025 (0.459)

Training Methodology:
- Synthetic Data Generation: A memory-preserving pipeline produces 280,128 training examples derived from diverse document sources.
- Supervised Fine-Tuning: Uses Low-Rank Adaptation (LoRA) for efficient fine-tuning, modifying only 0.53% of the model's weights (40.4 million of 7.66 billion parameters).
- Reinforcement Learning: Applies Group Relative Policy Optimization (GRPO) with a novel semantic similarity-based reward function designed to handle the inherent ambiguity of information extraction.

---

Significance

- Demonstrates the power of task-specific optimization: a specialized model can outperform larger, general-purpose language models while consuming fewer computational resources.
- Introduces data-generation and training techniques that improve performance on complex information extraction tasks over documents.
- Offers a scalable and efficient alternative for document information extraction applications.
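The core idea of GRPO is to score each sampled output not against a learned value baseline but against the other samples in its group. A minimal sketch of that group-relative normalization, assuming a semantic-similarity reward in [0, 1] per candidate extraction (the function name and the example rewards are illustrative, not from the paper):

```python
# Sketch of GRPO-style group-relative advantages: each candidate's reward is
# normalized by the mean and standard deviation of its sampling group, so
# "better than the group average" yields a positive advantage.
from statistics import mean, stdev


def group_relative_advantages(rewards, eps=1e-8):
    """Return one advantage per reward, normalized within the group."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: four candidate extractions for one document, scored by a
# (hypothetical) semantic-similarity reward against the reference answer.
advs = group_relative_advantages([0.9, 0.6, 0.3, 0.6])
```

The advantages always sum to approximately zero within a group, which is what lets GRPO dispense with a separate critic model.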
---

This paper highlights how focused data-generation and training innovations yield a superior, resource-efficient model for the specialized domain of document information extraction, setting a new performance standard in the field.