Token cost benchmark for an autonomous Follow-up Orchestration agent, across 13 models. Prices as of 14 Jun 2026.
| Model | Input ($/1M tokens) | Output ($/1M tokens) | Cost / outcome | Cost / month* |
|---|---|---|---|---|
| GPT-4o mini | $0.15 | $0.60 | $0.0234 | $234 |
| Llama 4 Maverick | $0.27 | $0.85 | $0.0407 | $407 |
| Gemini 2.5 Flash | $0.30 | $2.50 | $0.0554 | $554 |
| GPT-4.1 mini | $0.40 | $1.60 | $0.0625 | $625 |
| DeepSeek V4 | $0.44 | $0.87 | $0.0629 | $629 |
| Claude Haiku 4.5 | $1.00 | $5.00 | $0.163 | $1,628 |
| Gemini 2.5 Pro | $1.25 | $10.00 | $0.228 | $2,282 |
| Mistral Large 3 | $2.00 | $6.00 | $0.299 | $2,992 |
| GPT-4.1 | $2.00 | $8.00 | $0.312 | $3,124 |
| GPT-4o | $2.50 | $10.00 | $0.391 | $3,905 |
| Claude Sonnet 4.6 | $3.00 | $15.00 | $0.488 | $4,884 |
| Claude Opus 4.8 | $5.00 | $25.00 | $0.814 | $8,140 |
| Claude Fable 5 | $10.00 | $50.00 | $1.63 | $16,280 |
*At 10,000 outcomes per month. Rows are sorted by cost per outcome; the cheapest model, GPT-4o mini, is listed first.
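The two cost columns follow from the listed prices by simple arithmetic, sketched below. The benchmark does not publish per-outcome token counts; the values here (about 129,800 input and 6,600 output tokens) are back-solved from the GPT-4o and Claude Sonnet 4.6 rows and should be read as an inference, not as figures from the source. The function names are illustrative.

```python
# Sketch: reproduce the table's per-outcome and monthly costs from the listed prices.
# ASSUMPTION: token counts are NOT stated in the benchmark. They are back-solved
# from the GPT-4o and Claude Sonnet 4.6 rows and are approximate.
INPUT_TOKENS_PER_OUTCOME = 129_800   # inferred, not from the source
OUTPUT_TOKENS_PER_OUTCOME = 6_600    # inferred, not from the source
OUTCOMES_PER_MONTH = 10_000          # from the table footnote


def cost_per_outcome(price_in_per_m: float, price_out_per_m: float) -> float:
    """Dollar cost of one outcome, given $/1M input and output token prices."""
    total = (INPUT_TOKENS_PER_OUTCOME * price_in_per_m
             + OUTPUT_TOKENS_PER_OUTCOME * price_out_per_m)
    return total / 1_000_000


def cost_per_month(price_in_per_m: float, price_out_per_m: float,
                   outcomes: int = OUTCOMES_PER_MONTH) -> float:
    """Monthly cost is the per-outcome cost scaled by monthly volume."""
    return cost_per_outcome(price_in_per_m, price_out_per_m) * outcomes


if __name__ == "__main__":
    # Prices copied from the table above.
    for name, p_in, p_out in [
        ("GPT-4o mini", 0.15, 0.60),
        ("Claude Sonnet 4.6", 3.00, 15.00),
        ("Claude Fable 5", 10.00, 50.00),
    ]:
        print(f"{name}: ${cost_per_outcome(p_in, p_out):.4f} per outcome, "
              f"${cost_per_month(p_in, p_out):,.0f} per month")
```

Under these inferred token counts, the input side dominates the bill, which is consistent with an agent that re-reads its context on every call.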
The clean-path steps this benchmark prices:
This path runs 9 steps: 2 tool calls, 4 reasoning steps, 3 decision points and 0 human checkpoints. Each tool step makes two model calls, and the agent re-reads its growing context on every call. That compounding is why one Follow-up Orchestration outcome costs about $0.488 on Claude Sonnet 4.6, roughly 21x the cost of a single chat message rather than the price of one message.
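The compounding described above can be sketched as a call count plus a running sum of re-read context. This is a minimal illustration, not the benchmark's actual model: it assumes each reasoning step and decision point makes one model call (the text only specifies two calls per tool step), and the base context and per-call growth sizes are hypothetical.

```python
# Sketch: why re-reading a growing context makes input tokens compound.
# ASSUMPTIONS: one model call per reasoning step and per decision point;
# the token sizes used in the example are hypothetical.

def model_calls(tool_steps: int, reasoning_steps: int, decision_points: int,
                calls_per_tool_step: int = 2) -> int:
    """Total model calls on the path; tool steps make two calls each."""
    return tool_steps * calls_per_tool_step + reasoning_steps + decision_points


def total_input_tokens(calls: int, base_context: int, growth_per_call: int) -> int:
    """Sum of input tokens when call k re-reads base_context + k * growth_per_call."""
    return sum(base_context + k * growth_per_call for k in range(calls))


if __name__ == "__main__":
    calls = model_calls(tool_steps=2, reasoning_steps=4, decision_points=3)
    # Hypothetical sizes: 4,000-token base context growing 1,500 tokens per call.
    total = total_input_tokens(calls, base_context=4_000, growth_per_call=1_500)
    print(f"{calls} model calls, {total:,} input tokens billed in total")
```

Because every call pays again for everything accumulated so far, total input grows roughly with the square of the number of calls, not linearly.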
On the clean path with default assumptions, an agent for Follow-up Orchestration costs about $0.0234 to $1.63 per outcome depending on the model, or roughly $234 to $16,280 per month at 10,000 outcomes. The cheapest model here is GPT-4o mini at $0.0234; the most expensive is Claude Fable 5 at $1.63.
An agent does not make one model call. It plans, calls tools, retrieves context and re-reads its growing working context on every step. For Follow-up Orchestration that adds up to about 21x the cost of a single chat message.
Across the 13 models benchmarked, GPT-4o mini's $0.0234 per outcome sets the cost floor for this workflow, and Claude Fable 5's $1.63 sets the ceiling. The cheapest model is not always the right choice, since quality requirements may rule it out.
The biggest levers are prompt caching on the base context, fewer planning loops, smaller tool results, less retrieval, and choosing a cheaper model where quality allows. You can test each lever in the live estimator.
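As a rough illustration of the first lever, the sketch below estimates per-outcome cost when part of the input is served from a prompt cache at a discounted rate. The cached fraction and discount multiplier are hypothetical parameters; actual cache pricing differs by provider and may include cache-write surcharges, which this sketch ignores.

```python
# Sketch: effect of prompt caching on per-outcome cost.
# ASSUMPTIONS: cached_fraction and cached_price_multiplier are hypothetical;
# real cache pricing varies by provider and may add cache-write fees.

def cost_with_caching(input_tokens: int, output_tokens: int,
                      price_in_per_m: float, price_out_per_m: float,
                      cached_fraction: float = 0.0,
                      cached_price_multiplier: float = 1.0) -> float:
    """Dollar cost of one outcome when a share of input tokens hits the cache."""
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    total = (fresh * price_in_per_m
             + cached * price_in_per_m * cached_price_multiplier
             + output_tokens * price_out_per_m)
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload priced at $3.00 in / $15.00 out per 1M tokens.
    baseline = cost_with_caching(100_000, 5_000, 3.00, 15.00)
    cached = cost_with_caching(100_000, 5_000, 3.00, 15.00,
                               cached_fraction=0.5, cached_price_multiplier=0.1)
    print(f"baseline ${baseline:.3f}, with caching ${cached:.3f}")
```

The same function can approximate the other input-side levers: smaller tool results and less retrieval reduce `input_tokens`, while a cheaper model lowers both prices.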