# Tau² Benchmark: How a Prompt Rewrite Boosted GPT-5-mini by 22%

## Introduction

The Tau² benchmark is a framework for evaluating large language models (LLMs) on agentic tasks. A recent experiment showed that rewriting the benchmark's prompts improved GPT-5-mini's success rate by more than 20%. This article explores how targeted changes to agent policies and task descriptions produced these gains.

## Benchmarking LLMs with Tau²

Tau² simulates real-world agent interactions across domains such as telecom, retail, and airlines. In recent updates, GPT-5 showed notable improvement only in the telecom domain. GPT-5-mini is a smaller, faster, and cheaper alternative to the flagship GPT-5, trading some accuracy for efficiency.

### GPT-5-mini Advantages

- Roughly twice as fast as GPT-5 (lower latency).
- Better throughput efficiency.
- Costs about 20% of the flagship GPT-5.
- Delivers 85-95% of GPT-5's performance.

## Baseline Performance of GPT-5-mini

- The telecom benchmark uses a subset called `telecomsmall` with 20 test scenarios.
- Running GPT-5-mini on it yielded a 55% success rate, i.e., a 45% failure rate.
- GPT-5-mini's reasoning capabilities are limited compared to the flagship GPT-5.
- Reliability is measured with pass^k: the probability that the agent solves the same task in k independent trials (see the estimator sketch after the key takeaways below).
- Some tasks failed consistently, signaling intrinsic complexity or ambiguity.

## The Hack: Prompt Rewriting Using Claude

- Objective: improve GPT-5-mini's success rate, unlock more solvable tasks, and enhance reliability.
- Claude (another AI) was used to analyze and rewrite the telecom domain agent policies into clearer, more structured prompts (a minimal sketch of this step also appears after the key takeaways).
- The deliverables were rephrased policy documents (`mainpolicy.md`, `techsupportmanual.md`) optimized for a smaller LLM.

## Key Improvements in Rewritten Prompts

### Structure & Flow

- Clear decision trees with branching logic.
- Numbered sequential steps.
- Explicit prerequisite checks before progressing.

### AI Agent Optimizations

- Clear tool call names and parameters.
- Binary yes/no decisions instead of ambiguous wording.
- Specific error handling and follow-up steps.
- Verification steps after fixes.

### Cognitive Load Reduction

- Reference tables for quick tool lookup.
- Pattern recognition for common issues.
- A section highlighting common AI mistakes to avoid.

### Actionable Language

- Removed verbose explanations mixed with instructions.
- Consolidated logic into single workflows.
- Used imperative commands (e.g., "Check X", "If Y, then Z").
- Added immediate verification steps.

The result: policy prompts that act as checklists instead of vague descriptions.

## Results: Significant Performance Improvement

- Success rate (pass^1) improved from 55% to 67.5%, a relative improvement of 22.73%.
- Retry reliability (pass^2) improved from 40% to 50%.
- The number of unsolvable tasks was halved, from 6 to 3.
- GPT-5-mini outperformed OpenAI's o3 model and approached flagship GPT-5's performance.

This is a clear demonstration that prompt engineering can unlock capabilities of smaller LLMs.

## Key Takeaways for Practitioners

- Thoughtful prompt restructuring boosts smaller models' accuracy and reliability.
- Smaller LLMs benefit from clear, stepwise, structured instructions with explicit conditions.
- Ambiguity and lengthy policies hinder smaller model performance.
- Using advanced LLMs to optimize prompts for smaller models is effective and cost-efficient.
- Optimized smaller models are a practical alternative where latency and cost are critical.
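To make the pass^k reliability metric concrete, here is a minimal estimator sketch. It assumes the standard pass^k formulation used by the Tau-style benchmarks, where a task run n times with c successes contributes C(c, k) / C(n, k) and scores are averaged over tasks; the per-task trial counts below are hypothetical, not the experiment's actual data.

```python
from math import comb


def pass_hat_k(n_trials: int, n_successes: int, k: int) -> float:
    """Estimate pass^k for one task: the probability that k independently
    sampled trials (out of n_trials observed, n_successes of which passed)
    would all succeed."""
    if k > n_trials:
        raise ValueError("k cannot exceed the number of trials per task")
    return comb(n_successes, k) / comb(n_trials, k)


# Hypothetical per-task results: (trials run, trials passed).
results = {
    "task_01": (4, 4),
    "task_02": (4, 2),
    "task_03": (4, 0),
}

for k in (1, 2):
    score = sum(pass_hat_k(n, c, k) for n, c in results.values()) / len(results)
    print(f"pass^{k} = {score:.3f}")
```

Note that pass^1 reduces to the plain success rate, while higher k penalizes tasks that pass only intermittently, which is why the pass^2 numbers above are lower than the pass^1 numbers.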
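And here is a minimal sketch of the Claude-assisted rewriting step, assuming the Anthropic Python SDK. The model ID, file paths, and rewriting instructions are placeholders for illustration, not the exact prompt or setup used in the experiment.

```python
import pathlib

import anthropic  # pip install anthropic; requires ANTHROPIC_API_KEY in the environment

client = anthropic.Anthropic()

# Placeholder path: the telecom agent policy to be rewritten.
policy = pathlib.Path("mainpolicy.md").read_text()

instructions = (
    "Rewrite this agent policy for a smaller LLM. Use numbered steps, "
    "explicit prerequisite checks, binary yes/no branches, exact tool names "
    "and parameters, and a verification step after every fix. "
    "Preserve every rule; remove nothing, only restructure."
)

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder model ID; use any current Claude model
    max_tokens=8000,
    messages=[{"role": "user", "content": f"{instructions}\n\n---\n\n{policy}"}],
)

# Save the restructured, checklist-style policy for the benchmark agent to use.
pathlib.Path("mainpolicy.rewritten.md").write_text(response.content[0].text)
```

In practice the rewritten document still needs a review pass (human or benchmark-driven) to confirm no policy rule was dropped, since the gains reported here come from restructuring the rules, not changing them.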
## Related Articles

- Deep dive into the Tau² benchmark for testing AI agents.
- Pragmatic approaches to building Grafana dashboards with AI and CLI.
- Lessons from pre-LLM AI in observability, focusing on anomaly detection and AIOps.

---

This post highlights how prompt engineering, assisted by generative AI, can significantly boost smaller-model performance, making efficient AI more accessible and practical.