Most developers discover this the hard way: your LLM bill explodes not because you ask complex questions, but because you get chatty responses. Output tokens typically cost 2-10x as much as input tokens at the major closed-model providers, and a single verbose assistant can wreck your budget.
I learned this lesson building an AI support bot that started giving paragraph-long answers to simple "yes/no" questions. Our monthly OpenAI bill jumped from $400 to $1,200 before we caught it.
Input vs Output Token Pricing Reality
Input tokens are what you send to the API – your prompts, context, and conversation history. Output tokens are what comes back – the model's generated response. The pricing split isn't arbitrary; text generation requires significantly more computational resources than text processing.
Here's the current pricing breakdown for popular models:
| Model | Input (USD per 1M tokens) | Output (USD per 1M tokens) | Output:Input ratio |
|---|---|---|---|
| GPT-5.4 | $2.50 | $10.00 | 4:1 |
| Claude 3.5 Sonnet | $3.00 | $15.00 | 5:1 |
| Gemini 1.5 Pro | $1.25 | $5.00 | 4:1 |
| Llama 3.1 70B | $0.35 | $0.40 | 1.14:1 |
Notice how open-source models like Llama have much smaller ratios, while premium closed models can hit 5:1 or higher.
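To make the ratio concrete, here is a minimal sketch of the per-request arithmetic using the Claude 3.5 Sonnet row from the table. The helper name `requestCost` and the token counts are illustrative assumptions, not measurements.

```javascript
// Prices in USD per 1M tokens, copied from the Claude 3.5 Sonnet row above.
const SONNET_PRICE = { inputPerMillion: 3.0, outputPerMillion: 15.0 };

// requestCost is a hypothetical helper, not part of any provider SDK.
function requestCost(price, inputTokens, outputTokens) {
  const inputCost = (inputTokens / 1_000_000) * price.inputPerMillion;
  const outputCost = (outputTokens / 1_000_000) * price.outputPerMillion;
  return inputCost + outputCost;
}

// Same 200-token prompt, two different answer lengths (illustrative numbers).
const terse = requestCost(SONNET_PRICE, 200, 50);
const chatty = requestCost(SONNET_PRICE, 200, 500);

console.log(terse.toFixed(5), chatty.toFixed(5));
```

With these assumed numbers the terse reply costs about $0.00135 and the chatty one about $0.0081, a 6x difference driven entirely by output length.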
Why Completions Swing Costs
Your AI assistant's personality directly impacts your wallet. A verbose assistant can double your spend, or worse, even when the question is tiny. I've seen support bots cost $0.02 per interaction instead of $0.005 (a 4x difference) simply because they were prompted to be "helpful and thorough."
The fix requires multiple approaches. Setting a max completion length caps runaway responses – we use 150 tokens for customer service, 500 for technical documentation. Adopting structured outputs through JSON mode or function calling pushes the model toward concise responses. And trimming reasoning traces in production eliminates the internal "thinking" that users never see but that still gets billed.
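Here is a rough sketch of how the caps and JSON mode could be wired into one place. `buildRequest` is a hypothetical helper that only assembles the parameter object for `openai.chat.completions.create()`; the model name is an assumption, and some newer models expect `max_completion_tokens` instead of `max_tokens`, so check your provider's reference.

```javascript
// Token caps per use case, taken from the figures in the text above.
const MAX_OUTPUT_TOKENS = { support: 150, docs: 500 };

// buildRequest is a hypothetical helper, not an SDK function.
function buildRequest(useCase, messages, { json = false } = {}) {
  const cap = MAX_OUTPUT_TOKENS[useCase];
  if (cap === undefined) {
    throw new Error(`Unknown use case: ${useCase}`);
  }
  const request = {
    model: "gpt-4o", // assumed model name
    messages,
    max_tokens: cap, // hard ceiling on billed output tokens
  };
  if (json) {
    // JSON mode: the prompt itself must also ask for JSON output.
    request.response_format = { type: "json_object" };
  }
  return request;
}

// Example: a support query capped at 150 output tokens, in JSON mode.
const req = buildRequest(
  "support",
  [
    { role: "system", content: "Answer in one sentence. Reply in JSON." },
    { role: "user", content: "Is my order shipped?" },
  ],
  { json: true }
);
console.log(req.max_tokens, req.response_format.type);
```

Keeping the caps in a single table makes it harder for a new endpoint to ship without one.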
Teams combine these completion controls with routing and caching strategies. Route simple queries to cheaper models, cache common responses, and use streaming to cut connections early when you detect the response is complete.
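Caching is the easiest of these to sketch. The version below is a minimal in-memory example, assuming a hypothetical async `callModel(prompt)` function that wraps whichever SDK you use; a real deployment would more likely use a shared store with expiry.

```javascript
// In-memory cache keyed by a normalized prompt (illustration only).
const responseCache = new Map();

// Collapse case and whitespace so trivially different prompts share a key.
function normalize(prompt) {
  return prompt.trim().toLowerCase().replace(/\s+/g, " ");
}

// callModel is a hypothetical async function: (prompt) => answer string.
async function cachedAnswer(prompt, callModel) {
  const key = normalize(prompt);
  if (responseCache.has(key)) {
    // Cache hit: no tokens billed for this request.
    return { answer: responseCache.get(key), cached: true };
  }
  const answer = await callModel(prompt);
  responseCache.set(key, answer);
  return { answer, cached: false };
}
```

Exact-match caching like this only helps with repeated questions, which is why it pairs well with support bots and quick lookups.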
Tool Calls and Hidden Bytes
Function calling introduces billing complexity that catches teams off guard. Function schemas bill as input tokens on every request that includes them, and intermediate tool results usually bill as input tokens on the next turn – or get bundled inline if your SDK handles multi-step workflows automatically.
Consider this tool call sequence:
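```javascript
import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// weatherToolSchema and weatherData are assumed to be defined elsewhere.
const messages = [{ role: "user", content: "What's the weather in Tokyo?" }];

// Initial call with function schema (billed as input tokens)
const response = await openai.chat.completions.create({
  model: "gpt-4o",
  messages,
  tools: [weatherToolSchema] // ~200 tokens
});

// Assumes the model chose to call the tool
const assistantMessage = response.choices[0].message;
const toolCall = assistantMessage.tool_calls[0];

// Tool result gets added back as input on the next turn
const finalResponse = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [
    ...messages,
    assistantMessage, // previous conversation, including the tool call
    {
      role: "tool", // tool results use role "tool" plus the matching call ID
      tool_call_id: toolCall.id,
      content: JSON.stringify(weatherData) // ~100 more input tokens
    }
  ]
});
```

That "simple" weather query consumed ~300 input tokens for schemas and results, plus whatever output tokens the final response generated. Surface this complexity in customer-facing docs so no one gets surprised at month-end.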
The JSON response from your weather API might be 2KB of detailed forecast data, but users only wanted "sunny, 75°F." Consider preprocessing tool outputs to extract only the relevant information before sending them back to the LLM.
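A minimal sketch of that preprocessing step follows. The payload shape and the `summarizeWeather` helper are hypothetical; real weather APIs use different field names.

```javascript
// Hypothetical shape of a verbose weather API payload; real APIs differ.
const rawWeather = {
  location: { city: "Tokyo", lat: 35.68, lon: 139.69, timezone: "Asia/Tokyo" },
  current: { condition: "Sunny", temp_f: 75, humidity: 40, wind_mph: 6, uv: 7 },
  hourly: Array.from({ length: 48 }, (_, h) => ({ hour: h, temp_f: 70 + (h % 10) })),
};

// Keep only the fields the model needs to answer the user's question.
function summarizeWeather(raw) {
  return {
    city: raw.location.city,
    condition: raw.current.condition,
    temp_f: raw.current.temp_f,
  };
}

const compact = JSON.stringify(summarizeWeather(rawWeather));
const full = JSON.stringify(rawWeather);
console.log(compact);
console.log(`${full.length} chars -> ${compact.length} chars`);
```

Sending `compact` instead of `full` as the tool result shrinks the input tokens billed on the follow-up turn without changing what the user sees.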
Hybrid and Blended Meters
If your product blends premium-path and economy models under one price list, you need transparent routing rules. We route 80% of queries to Claude Haiku ($0.25/$1.25 per 1M tokens) and escalate complex requests to GPT-5.4 automatically.
Explain which lane receives which traffic in your billing documentation. Something like: "Simple queries under 50 tokens use our economy tier. Requests requiring reasoning, code generation, or multi-step workflows use premium models."
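A routing rule like the one quoted above could be sketched as follows. Both the four-characters-per-token estimate and the keyword list are simplifying assumptions standing in for a real tokenizer and a real complexity classifier.

```javascript
// Rough token estimate (~4 characters per token for English text).
// Heuristic only; use your provider's tokenizer for anything billing-related.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Hypothetical keyword check standing in for a real complexity classifier.
const PREMIUM_HINTS =
  /\b(code|debug|refactor|step[- ]by[- ]step|explain why|analy[sz]e)\b/i;

// Returns "economy" or "premium" for a given user query.
function chooseTier(query) {
  if (PREMIUM_HINTS.test(query)) return "premium";
  return estimateTokens(query) < 50 ? "economy" : "premium";
}

console.log(chooseTier("Is my order shipped?"));
console.log(chooseTier("Please debug this function"));
```

Whatever rule you pick, keeping it in one small function makes it easy to quote verbatim in your billing documentation.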
See hybrid tokens and the disclosure pattern for implementation examples. The key is setting clear expectations upfront rather than explaining surprise charges after the fact.
When Not to Optimize Output Tokens
Sometimes verbose outputs justify their cost. Creative writing tools, educational content, and code generation often require lengthy responses. Don't sacrifice quality to save $0.01 per interaction.
However, customer support, quick lookups, and yes/no questions should be ruthlessly optimized for brevity. The goal isn't minimum tokens – it's maximum value per token spent.
Monitoring and Alerts
Set up billing alerts that fire when daily token spend exceeds normal patterns. We trigger warnings at 150% of average daily spend and hard stops at 300%. Most providers offer usage APIs to track input/output ratios programmatically.
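The threshold check itself is small. This sketch assumes you already have an array of recent daily totals in USD from your provider's usage data; `spendStatus` is a hypothetical helper, and it treats spend at or above a threshold as a trigger.

```javascript
// Thresholds from the text: warn at 150% of average daily spend, stop at 300%.
const WARN_RATIO = 1.5;
const STOP_RATIO = 3.0;

// spendStatus is a hypothetical helper; dailySpendHistory holds recent
// daily totals in USD.
function spendStatus(todaySpend, dailySpendHistory) {
  if (dailySpendHistory.length === 0) return "ok"; // no baseline yet
  const average =
    dailySpendHistory.reduce((sum, d) => sum + d, 0) / dailySpendHistory.length;
  if (todaySpend >= average * STOP_RATIO) return "stop";
  if (todaySpend >= average * WARN_RATIO) return "warn";
  return "ok";
}

const history = [10, 10, 10, 10]; // illustrative: $10/day average
console.log(spendStatus(12, history), spendStatus(15, history), spendStatus(30, history));
```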
Track these metrics weekly: average tokens per request, input/output ratio, cost per user interaction, and completion length distribution. Sudden spikes usually indicate prompt engineering issues or model misbehavior.
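As a rough sketch, these metrics can be computed from per-request logs like so. The field names (`inputTokens`, `outputTokens`, `costUsd`) are hypothetical; map them from your own logging schema. The completion length distribution is summarized here with nearest-rank percentiles, which is one reasonable choice among several.

```javascript
// Each log entry is assumed to record token counts and cost for one request.
function weeklyMetrics(requests) {
  const n = requests.length;
  if (n === 0) return null;
  const totalIn = requests.reduce((s, r) => s + r.inputTokens, 0);
  const totalOut = requests.reduce((s, r) => s + r.outputTokens, 0);
  const totalCost = requests.reduce((s, r) => s + r.costUsd, 0);
  const sortedOut = requests.map((r) => r.outputTokens).sort((a, b) => a - b);
  // Nearest-rank percentile over completion lengths.
  const percentile = (p) =>
    sortedOut[Math.min(n - 1, Math.ceil((p / 100) * n) - 1)];
  return {
    avgTokensPerRequest: (totalIn + totalOut) / n,
    outputToInputRatio: totalOut / totalIn,
    costPerInteraction: totalCost / n,
    completionP50: percentile(50),
    completionP95: percentile(95),
  };
}
```

A jump in `completionP95` with a flat `completionP50` usually points at a handful of runaway responses rather than a general shift in verbosity.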