The problem with all-Claude architectures
I get it. Claude Sonnet is good. Really good. The instruction following, the nuance, the way it handles edge cases that make GPT-5.4 stumble. When you're building an AI product, the temptation is to route everything through Claude and call it done.
But then the bill arrives. And you realize that 70% of your API calls were summarization jobs, context compression, and embedding generation that a model one-tenth the cost could have handled just fine.
We've seen this pattern with dozens of teams. A startup launches with Claude for everything, gets traction, scales to 500K API calls a month, and suddenly they're spending $15K/month on tokens. Their investors start asking questions. Their engineering lead starts "exploring alternatives" at 2am.
What users actually notice
Here's the thing nobody talks about: users can't tell the difference between Claude and GPT-5 Nano on about 65-80% of interactions. We know this because we ran blind tests with real users across three different products.
| Interaction type | User notices model quality? | Recommended tier |
|---|---|---|
| First reply in conversation | Yes, strongly | A-tier (Claude Sonnet, GPT-5.4) |
| Follow-up clarification | Sometimes | A-tier |
| Error recovery / retry | Yes, frustration is high | A-tier |
| Background summarization | No | Value-tier |
| Context window compression | No | Value-tier |
| Classification / tagging | No | Value-tier |
| Embedding generation | No | Dedicated embedding model |
| Draft generation (before editing) | Rarely | Value-tier |
The pattern: users care about the first thing they see and the moments when something goes wrong. Everything else is invisible plumbing.
How hybrid routing actually works
The concept is simple. The implementation has some subtlety.
When a request hits Token Landing's API, we look at the routing policy you've configured. A routing policy is a set of rules like:
# Example routing policy
rules:
- match: { role: "user", position: "first" }
tier: premium # First user message -> Claude Sonnet
- match: { tools: true }
tier: premium # Tool calls -> Claude Sonnet
- match: { role: "system", length_gt: 4000 }
tier: economy # Long system prompts -> Haiku
- default:
tier: auto # Let the router decide
The "auto" tier is where it gets interesting. Our router considers the prompt complexity, expected output length, and whether the response will be directly shown to users. It's not perfect, but it gets the tier right about 90% of the time, and when it's wrong, it errs toward A-tier (better to overspend slightly than to show bad output).
Real numbers from a production deployment
One of our customers runs a legal AI assistant. Before Token Landing, they were spending $22K/month on Claude Sonnet for everything. Here's what happened after switching to hybrid routing:
| Metric | Before (all Claude) | After (hybrid) |
|---|---|---|
| Monthly API cost | $22,000 | $7,800 |
| User satisfaction (NPS) | 62 | 64 |
| A-tier request % | 100% | 28% |
| Median response time | 1.8s | 1.4s |
| Quality incidents/month | 3 | 2 |
NPS actually went up slightly. We think this is because value-tier models respond faster for simple tasks, so the overall experience felt snappier. The quality incidents went down because our retry logic automatically escalates to A-tier on failure, whereas before they were just hitting Claude again with the same failing prompt.
When not to use hybrid routing
Being honest here. Hybrid routing isn't right for every situation:
- Regulated industries requiring audit trails that need single-vendor traceability. If your compliance team needs to certify which model processed every request, hybrid routing adds complexity.
- Very low volume (under 10K requests/month). The cost savings won't justify the setup time. Just use Claude directly.
- Research / evaluation workloads where you need consistent model behavior across all requests for reproducibility.
For everything else, especially products at scale where costs are growing faster than revenue, hybrid routing is worth evaluating.
Getting started
Token Landing uses an OpenAI-compatible API, so migration is a base-URL swap. You keep your existing SDK, your prompts, your streaming logic. We work with you during onboarding to set up routing rules that match your product's interaction patterns.
Fill out the contact form and we'll reply with API credentials and a suggested routing policy within a day.