Benchmark wins trade hands every few months. What matters in production is more boring than the leaderboard: tool-use reliability, instruction-following at length, how forgiving the model is when your prompts are imperfect, and what it costs.
Here’s our practical, opinionated take from the projects we’ve shipped this year.
Where Claude Wins
Tool use accuracy. When your agent has 6 tools and the model picks the wrong one, the run is over. Claude (Sonnet and Opus) routinely picks correctly where GPT picks plausibly-but-wrongly. This is the single biggest reason Claude is our default in production agents.
Long-context recall. Both models advertise huge context windows. Only Claude actually uses the back half. We routinely paste 100K-token documents into Sonnet and get accurate citations from page 240. GPT degrades earlier on the same task.
Instruction following at length. “Return exactly this JSON shape, never include explanations” — Claude obeys at Day 1 and at Day 60. GPT drifts faster as conversation history grows.
Writing quality. Subjective, but Claude reads less like ChatGPT-flavoured corporate. For customer-facing content, this matters.
Prompt caching. Anthropic’s cache cuts cost up to 90% on repeated prefix patterns. If you have a long system prompt + tools section repeated across requests, caching is free money.
Where GPT Wins
Image generation. If you need DALL-E-style generation inline, GPT is the only option. Claude doesn’t generate images.
Native voice mode. The voice product is more polished and faster than Anthropic’s equivalent.
Code execution. The Code Interpreter integration is mature. Anthropic’s equivalent (Code Execution) is newer and less feature-rich.
Eclectic knowledge. On obscure trivia or very-recent news, GPT’s training data updates feel slightly fresher. Marginal but real.
Where They Tie
Cost. At the small-model tiers (Haiku vs GPT-5 nano), both are similar per million tokens. At the big-model tiers (Opus vs GPT-5 Pro), both are pricey. Caching matters more than sticker price.
JSON-mode reliability. Both are excellent in 2026. Five years of agent-shaped pressure has fixed the structured-output problem.
Multimodal vision input. Both handle charts, screenshots, document scans well. We rarely see a quality difference on real production images.
What We Actually Use
For the projects we’ve shipped this year, the split is roughly:
- Claude Sonnet — 70% of agent calls. Default for tool-using agents, document analysis, customer support.
- Claude Haiku — 20%. High-volume, lower-complexity calls (classification, extraction, summarisation).
- GPT-5 — 10%. Image-generating endpoints, voice features, and one client who’s contractually on OpenAI.
- Gemini — 0% in production. Strong on long context but tool-use reliability is still behind.
- Open-weights (Llama, DeepSeek) via Groq/Together — for the dev-loop where data residency or cost dominates.
How To Pick For A New Project
- If the agent uses tools, start with Claude Sonnet. The hit rate on the first weekend will tell you whether it’s a fit.
- If the agent generates lots of short outputs at scale, benchmark Haiku vs GPT-5 nano on your specific eval. Whichever wins, that’s the answer.
- If the agent needs image gen, you’re on GPT. Don’t fight it.
- For everything else: pick the one your team prefers writing prompts in. Familiarity compounds.
Don’t Lock In Too Early
Both providers are OpenAI-API-compatible in their popular client libraries. Build behind an abstraction (LiteLLM, the LangChain provider layer, Vercel AI SDK) so swapping providers is a config change, not a rewrite. The state of the art moves quarterly — give yourself the optionality to follow it.