I Benchmarked 4 LLMs With Real Token Costs — The Most Expensive One Scored the Lowest

Source: DEV Community
## The Problem

I was running AI agents on GPT-4.1, Claude, Gemini — switching models, tweaking prompts, changing architectures. But I couldn't answer basic questions:

- Did my last prompt change make things better or worse?
- Is Claude actually better than GPT for my use case, or just 5x more expensive?
- Will my agent leak PII if someone tries prompt injection?

My "evaluation" was manually typing questions into a chat window. That's embarrassing for an engineer. So I built LitmusAI — an open-source eval framework for AI agents. And then I actually measured things.

## The Benchmark Results

I ran the same test suite across 4 current models. Same tasks, same assertions, same conditions:

| Model | Pass Rate | Real Cost | Cost per Correct Answer |
|---|---|---|---|
| GPT-4.1 | 100% | $0.017 | $0.0034 🏆 |
| Claude Sonnet 4 | 100% | $0.011 | $0.0018 |
| Claude Opus 4 | 83% | $0.043 | $0.0085 |
| Gemini 2.5 Pro | 50% | $0.001 | $0.0003* |

\*Gemini is the cheapest per call but only passes half the tests.

The surprise: Claude Opus 4 costs 14x more per correct answer than G…
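The "Cost per Correct Answer" column above is just total spend divided by the number of tests the model actually passed. A minimal sketch of that calculation (the suite size of 6 and the function name are my assumptions, not from LitmusAI's API):

```python
# Sketch: derive cost-per-correct-answer from a benchmark run.
# Assumed setup: a suite of 6 tests; helper name is illustrative.

def cost_per_correct(total_cost: float, passed: int) -> float:
    """Total API spend divided by the number of tests the model passed."""
    if passed == 0:
        # A model that passes nothing has unbounded cost per correct answer.
        return float("inf")
    return total_cost / passed

# Example: a model that spends $0.043 and passes 5 of 6 tests (83%)
print(round(cost_per_correct(0.043, 5), 4))  # 0.0086
```

Note that this metric penalizes a model twice: once for being expensive per call, and again for failing tests, which is why a cheap-per-call model can still look bad if its pass rate is low.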