What is a token and how tokenization works
Before you can understand pricing, you need to understand what you are paying for. A token is the smallest unit of text an LLM processes. Most providers use sub-word tokenizers (BPE or SentencePiece) that split text into pieces averaging roughly 3-4 characters of English each. The word "tokenization" becomes three or four tokens; a short JSON payload can use more tokens than the same data in plain English, because braces, quotes, and repeated keys each consume tokens.
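The 3-4-characters-per-token rule of thumb is enough for rough budgeting. A minimal sketch, with the caveat that real tokenizers vary by provider, so use the provider's own tokenizer library when you need exact counts:

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate from the ~4-characters-per-token heuristic.

    Real BPE/SentencePiece tokenizers differ per provider and model;
    this is for back-of-envelope budgeting only.
    """
    return max(1, round(len(text) / chars_per_token))

print(estimate_tokens("tokenization"))  # → 3
print(estimate_tokens('{"user_id": 42, "ok": true}'))  # JSON: punctuation inflates the count
```

For production cost tracking, swap this heuristic for the actual token counts returned in the API response.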
Tokenizer choice varies by provider and model family. That means the same prompt can cost different amounts depending on which API you call. For a deeper dive, see Understanding LLM tokens.
How providers charge: per-token billing
Almost every major AI API bills in token units, typically quoted per million tokens. The critical nuance: input tokens and output tokens carry different prices. Output tokens are usually 2-5x more expensive, because each output token requires its own forward pass through the model, while the input prompt is encoded in a single parallel pass.
For example, a provider might charge $3 per million input tokens and $15 per million output tokens. A request with a 2,000-token prompt and a 500-token response costs roughly $0.006 for input and $0.0075 for output, about $0.0135 in total. Small numbers per call, but they compound fast at scale. Our input vs output tokens guide breaks this split down further.
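The arithmetic above is worth wiring into a helper. A minimal sketch, using the article's illustrative $3/$15 per-million rates as defaults (your provider's rates will differ):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 in_rate_per_m: float = 3.00,
                 out_rate_per_m: float = 15.00) -> float:
    """Cost of one request in dollars, given $/million-token rates.

    Default rates are the illustrative figures from the example, not
    any specific provider's price list.
    """
    return (input_tokens / 1e6) * in_rate_per_m + (output_tokens / 1e6) * out_rate_per_m

print(f"${request_cost(2_000, 500):.4f}")  # → $0.0135
```

Note how the 500-token response costs more than the 2,000-token prompt: the 5x rate gap outweighs the 4x size gap.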
Hidden costs: context window waste, retries, and system prompts
Your invoice rarely reflects just the tokens you intended to send. Three categories of overhead quietly inflate bills:
Context window waste. Stuffing a 128k context window when your task needs 4k means you pay for padding the model never uses productively. Larger context windows also increase latency, which can trigger client-side timeouts and retries. See Context window token limits for sizing strategies.
Retry tokens. When a request fails or returns an unsatisfactory result, the retry re-sends the full prompt. If your system retries three times, you pay for the prompt four times. Exponential back-off helps with rate limits but does nothing about the token bill.
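The retry multiplier is easy to underestimate. A minimal sketch, assuming each attempt re-sends the full prompt with no prompt caching:

```python
def retried_input_tokens(prompt_tokens: int, max_retries: int) -> int:
    # One original attempt plus up to max_retries re-sends,
    # each billing the full prompt again.
    return prompt_tokens * (1 + max_retries)

print(retried_input_tokens(2_000, 3))  # → 8000, i.e. 4x the intended input bill
```

In practice not every request exhausts its retries, so this is the worst case; multiply by your observed failure rate for an expected-cost figure.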
System prompt overhead. Many applications prepend a long system prompt to every request. A 2,000-token system prompt across 100,000 daily calls adds 200 million input tokens per day to your bill. Caching, prompt compression, or moving static instructions into fine-tuning can reduce this dramatically.
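Running the system-prompt numbers from above makes the overhead concrete. A minimal sketch, again using the illustrative $3 per-million input rate:

```python
SYSTEM_PROMPT_TOKENS = 2_000   # static instructions prepended to every call
DAILY_CALLS = 100_000
INPUT_RATE_PER_M = 3.00        # illustrative $/million input tokens

daily_overhead_tokens = SYSTEM_PROMPT_TOKENS * DAILY_CALLS  # 200M tokens/day
daily_cost = daily_overhead_tokens / 1e6 * INPUT_RATE_PER_M

print(f"{daily_overhead_tokens / 1e6:.0f}M tokens/day -> ${daily_cost:,.2f}/day")
# → 200M tokens/day -> $600.00/day
```

That is roughly $18,000 a month spent re-sending instructions that never change, which is why prompt caching pays for itself quickly at this volume.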
Pricing model comparison: flat-rate vs per-token vs hybrid
Flat-rate plans give cost predictability but penalize light users and often throttle heavy ones. They work best when usage is steady and predictable month to month.
Pure per-token billing is the industry default. You pay exactly for what you use, which sounds fair until you realize spiky workloads can blow budgets with no warning. It also makes cost forecasting harder for finance teams.
Hybrid models blend committed capacity with per-token overflow. Token Landing's approach goes further: it routes high-value turns through premium-path (A-tier) models and bulk work through value-tier models, so you get Claude-class quality where it matters without paying Claude-class prices everywhere. Read Hybrid AI tokens for the full breakdown.
How to estimate your monthly token spend
Start with three numbers: average prompt length (in tokens), average response length, and daily request volume. Multiply to get daily input and output tokens, then apply your provider's per-million rates.
Add a 20-30% overhead buffer for retries, system prompts, and context padding. If you use multi-turn conversations, remember that each turn re-sends the full history, so cumulative token consumption grows roughly quadratically with conversation length unless you summarize or truncate.
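The steps above can be sketched as a small estimator. Rates and volumes below are the illustrative figures from this article, and the 25% buffer sits in the middle of the suggested 20-30% range:

```python
def monthly_spend(prompt_toks: int, response_toks: int, daily_requests: int,
                  in_rate_per_m: float, out_rate_per_m: float,
                  overhead: float = 0.25, days: int = 30) -> float:
    """Monthly spend estimate; overhead covers retries, system prompts, padding."""
    daily_in = prompt_toks * daily_requests
    daily_out = response_toks * daily_requests
    base = (daily_in / 1e6 * in_rate_per_m + daily_out / 1e6 * out_rate_per_m) * days
    return base * (1 + overhead)

def conversation_input_tokens(tokens_per_turn: int, n_turns: int) -> int:
    """Total input tokens for a chat that re-sends full history each turn:
    turn 1 sends 1 turn of history, turn 2 sends 2, ... so the total is
    n*(n+1)/2 turns' worth — quadratic in conversation length."""
    return tokens_per_turn * n_turns * (n_turns + 1) // 2

print(round(monthly_spend(2_000, 500, 100_000, 3.00, 15.00), 2))  # → 50625.0
print(conversation_input_tokens(300, 10))  # → 16500, vs 3000 if history were dropped
```

The conversation helper is the forecasting trap: a 10-turn chat bills 5.5x the tokens a naive per-turn estimate predicts, which is exactly why summarization and truncation matter.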
For teams spending over $5,000/month, a cost optimization audit typically uncovers 30-50% in savings through prompt trimming, caching, and tier-aware routing.