Claude Can (Sometimes) Prove It

By Mike Dodds, September 16, 2025

---

Overview

Claude Code, an AI coding agent by Anthropic, shows surprising ability at interactive theorem proving (ITP), a challenging area of formal methods used for verifying cryptographic libraries, compilers, and operating systems. While it still requires expert guidance, Claude Code can independently complete many complex proof steps, hinting at a future where theorem proving becomes more accessible.

---

What is Interactive Theorem Proving (ITP)?

ITP tools such as Lean support the formal verification of complex mathematical proofs. Historically, ITP has been difficult because of its cognitive demands, confusing interfaces, limited libraries, and sparse documentation.

Proof development involves:

- Conceptual mathematics (defining concepts and theorems).
- Mapping those concepts into Lean’s language.
- Decomposing big theorems into smaller lemmas.
- Proving individual theorems with the tactic language.
- Debugging failures and refactoring.

ITP is highly pedantic and mentally demanding, often requiring years of tedious work by experts.

---

Why Claude Code is Different

Claude Code is an AI coding agent, not just a chatbot: it breaks down tasks into subtasks and performs them interactively. It can carry out tasks such as reviewing code changes, writing definitions, proving or debugging theorems, and documenting fixes. Its ability to run commands like `lake build` and interpret error messages enables iterative proof refinement. Unlike specialized theorem-proving AI models, Claude Code’s broader software engineering and reasoning capabilities help it assist with proof engineering.

---

Experiment: Formalizing Deny-Guarantee Reasoning with Claude Code

Task: Formalize a 2009 paper on reasoning about concurrent programs using permissions and Hoare logic.

Process:

- Used Claude Code to convert the papers to text and generate a formalization plan.
- Acted as project manager, giving high-level instructions.
- Let Claude Code draft theorem definitions first and proofs later.
- The agent handled code refactoring and theorem proving, producing over 2,500 lines of Lean code.

Result: About 50% of the way through the formalization, with mostly AI-written code under human supervision.
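To make this workflow concrete, here is a minimal Lean 4 sketch of the kind of step described above: a small definition, a lemma about it, and a tactic proof. The `Perm` type, its `comp` operation, and the lemma are invented for illustration and are not taken from the actual deny-guarantee formalization; the point is the shape of the loop in which the agent writes a definition, states a lemma, attempts a tactic proof, runs `lake build`, and revises based on any errors.

```lean
-- Illustrative sketch only: a toy permission type, not code from the
-- deny-guarantee formalization described above.

/-- A hypothetical permission: either empty or a fractional share. -/
inductive Perm where
  | empty
  | frac (q : Nat)

/-- A hypothetical composition of permissions. -/
def Perm.comp : Perm → Perm → Perm
  | .empty,  p       => p
  | .frac q, .empty  => .frac q
  | .frac a, .frac b => .frac (a + b)

/-- Composing with the empty permission changes nothing.
    If the tactic proof fails, `lake build` reports the unsolved goal,
    which the agent can read and use to revise the proof. -/
theorem Perm.comp_empty (p : Perm) : Perm.comp p .empty = p := by
  cases p <;> rfl
```

At this scale the proof is trivial, but the same write-build-debug loop applies when a large theorem has been decomposed into many such lemmas.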
---

Limitations and Challenges

- Formalization with Claude Code was slower than doing it manually would have been for an expert.
- The experience ranged from miraculously correct proofs to persistent mistakes and thrashing.
- Some errors, especially deep conceptual misunderstandings by the AI, required significant human intervention.
- AI-generated definitions are plausible but require careful auditing before they can be fully trusted.
- Progress improved with auxiliary tools like lean-mcp-lsp, which give the AI access to proof states and code search.

---

Implications and Future Outlook

- Just a year ago, no AI-supported tools approached this level of theorem proving.
- Claude Code’s success is based on agency, task decomposition, and software engineering capabilities combined with reasoning.
- AI agents are currently limited in sustaining very long, complex tasks, but they are improving rapidly.
- Future enhancements may integrate better formal-methods AI and parallel-agent approaches to reduce error rates.
- Claude Code exemplifies the "bitter lesson": AI eventually dominates through scale and general architecture rather than specialized expertise.
- This points toward a future where theorem proving is cheap, automated, and accessible beyond expert circles.

---

Takeaways

- Try Claude Code or similar AI coding agents for experimentation: they can reveal surprising successes and failures with modest investment.
- Formal methods may evolve rapidly by incorporating AI, making formal verification widespread and less cognitively demanding.
- Human experts remain crucial for now, particularly for deep conceptual oversight and validation.
- Expect AI-driven improvements to reshape how complex mathematical and software artifacts are verified.