Understanding Context Windows and Token Limits
Every LLM API has a maximum number of tokens it can process in a single request, called the context window. This isn't just a technical constraint—it's often your biggest cost driver and the source of mysterious production failures.
I've seen teams burn through $50K monthly budgets in days because they didn't realize their 100K+ token conversations were being billed at GPT-4 rates on every request. One startup I worked with was sending entire codebases (80K+ tokens) for simple debugging tasks that needed maybe 500 tokens of context.
Token Limits Across Popular Models
| Model | Context Window | Input Cost (per 1M tokens) | Output Cost (per 1M tokens) |
|---|---|---|---|
| GPT-4 Turbo | 128K tokens | $10 | $30 |
| GPT-3.5 Turbo | 16K tokens | $1 | $2 |
| Claude 3 Opus | 200K tokens | $15 | $75 |
| Gemini 1.5 Pro | 2M tokens | $7 | $21 |
| Llama 2 70B | 4K tokens | $0.70 | $0.80 |
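These numbers change frequently, but the pattern stays fairly consistent: the flagship models with large windows tend to cost more per token, and output tokens are typically 2-5x more expensive than input tokens (hosted open models such as Llama 2 are closer to parity).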
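To make the table concrete, here is a minimal sketch of the per-request arithmetic. The `PRICES` object copies three rows from the table above as a snapshot, not a live price list. The keys are labels I picked for illustration, and `requestCost` is a hypothetical helper name.

```javascript
// Prices in USD per 1M tokens, copied from the table above (snapshot only).
// The keys are illustrative labels, not guaranteed API model identifiers.
const PRICES = {
  "gpt-4-turbo": { input: 10, output: 30 },
  "gpt-3.5-turbo": { input: 1, output: 2 },
  "claude-3-opus": { input: 15, output: 75 },
};

// Hypothetical helper: cost of a single request in USD.
function requestCost(model, inputTokens, outputTokens) {
  const price = PRICES[model];
  if (!price) throw new Error(`No price entry for model: ${model}`);
  return (inputTokens * price.input + outputTokens * price.output) / 1_000_000;
}

// An 80K-token prompt with a 1K-token answer on GPT-4 Turbo:
console.log(requestCost("gpt-4-turbo", 80_000, 1_000).toFixed(2)); // "0.83"
```

Multiply that by a few thousand requests a day and the codebase-in-every-prompt habit explains the budget overruns described earlier.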
What Really Eats Your Token Budget
Raw text isn't usually the problem. The budget killers are structured payloads that teams don't account for properly.
JSON tool definitions can easily consume 2K+ tokens per function. I've audited APIs where OpenAI function calling definitions used 15K tokens before the actual conversation started. Here's a bloated example:
```json
{
  "name": "search_database",
  "description": "Searches the customer database for records matching specified criteria with advanced filtering options",
  "parameters": {
    "type": "object",
    "properties": {
      "query": {
        "type": "string",
        "description": "The search query string with support for wildcards, boolean operators, and field-specific searches"
      },
      "filters": {
        "type": "array",
        "items": {
          "type": "object",
          "properties": {
            // ... 50+ more lines
          }
        }
      }
    }
  }
}
```

Base64 encoded data is another trap. A 1MB image becomes roughly 1.4MB of base64 text, which translates to about 350K-400K tokens. Always resize images before encoding: a 512x512 image usually provides enough detail while using 10x fewer tokens.
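The base64 numbers follow from simple arithmetic, sketched below. The helper names are hypothetical, and the 4-characters-per-token figure is only a rough heuristic. Base64 tends to tokenize worse than English prose, so treat the result as a lower bound.

```javascript
// Base64 encodes every 3 bytes as 4 characters (including padding).
const base64Length = (byteCount) => 4 * Math.ceil(byteCount / 3);

// Lower-bound token estimate, assuming roughly 4 characters per token.
const estimateBase64Tokens = (byteCount) =>
  Math.ceil(base64Length(byteCount) / 4);

const oneMiB = 1024 * 1024;
console.log(base64Length(oneMiB));         // 1398104 characters
console.log(estimateBase64Tokens(oneMiB)); // 349526 tokens
```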
Log files and XML dumps from enterprise systems often contain massive amounts of redundant information. One client was sending complete Kubernetes logs (200K+ tokens) when they only needed the error messages (maybe 500 tokens).
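A pre-filter for logs can be very small. The sketch below keeps only lines that look like errors, plus one line of leading context. The `ERROR|FATAL` pattern and the sample log are made up for illustration, so adjust both to your actual log format.

```javascript
// Keep only lines that look like errors, plus a little leading context.
// The ERROR/FATAL pattern is an assumption about the log format.
function extractErrorLines(logText, contextLines = 1) {
  const lines = logText.split("\n");
  const keep = new Set();
  lines.forEach((line, i) => {
    if (/\b(ERROR|FATAL)\b/.test(line)) {
      for (let j = Math.max(0, i - contextLines); j <= i; j++) keep.add(j);
    }
  });
  return lines.filter((_, i) => keep.has(i)).join("\n");
}

// Made-up sample data for illustration.
const sampleLog = [
  "INFO  starting pod web-1",
  "INFO  pulling image",
  "ERROR failed to pull image: not found",
  "INFO  retrying in 10s",
].join("\n");

console.log(extractErrorLines(sampleLog));
```

On the sample above this keeps the error line and the line just before it, and drops the rest.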
Error Handling: Hard Limits vs Silent Truncation
Different providers handle token limit violations completely differently, and this can break your application in subtle ways.
OpenAI returns a clear 400 error with details:
```json
{
  "error": {
    "message": "This model's maximum context length is 16385 tokens. However, your messages resulted in 18420 tokens.",
    "type": "invalid_request_error",
    "param": "messages",
    "code": "context_length_exceeded"
  }
}
```

Anthropic behaves similarly with explicit errors, but some other providers will silently truncate your input. They'll drop the oldest conversation turns or trim the system prompt without telling you. This can lead to completely incorrect responses that appear normal.
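Both failure modes can be checked in code. The sketch below assumes the error body has the shape shown above, and that the provider reports how many prompt tokens it actually processed. Neither is guaranteed for every provider. Both function names and the 20% tolerance are hypothetical.

```javascript
// True when a parsed error body matches the OpenAI-style shape shown above.
function isContextLengthError(body) {
  return body?.error?.code === "context_length_exceeded";
}

// Heuristic guard against silent truncation: flag responses where the
// provider-reported prompt token count is far below your own estimate.
function looksTruncated(estimatedTokens, reportedPromptTokens, tolerance = 0.2) {
  return reportedPromptTokens < estimatedTokens * (1 - tolerance);
}

console.log(isContextLengthError({ error: { code: "context_length_exceeded" } })); // true
console.log(looksTruncated(18000, 9000)); // true
```

The truncation check is deliberately loose because a character-based estimate can be off by a fair margin. It is there to catch large discrepancies, not small ones.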
Always implement proper error handling and logging for token limit issues. I recommend tracking your token usage proactively:
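```javascript
// Rough token estimation (1 token ≈ 4 characters for English text).
// Use the provider's tokenizer when you need exact counts.
const estimateTokens = (text) => Math.ceil(text.length / 4);

// Example values; replace with your model's limit and your real messages.
const MODEL_LIMIT = 16385;
const messages = [{ role: "user", content: "Why is my deployment failing?" }];

const totalTokens = messages.reduce((sum, msg) => {
  return sum + estimateTokens(msg.content);
}, 0);

if (totalTokens > MODEL_LIMIT * 0.9) {
  // Implement compression or chunking strategy
}
```

Cost Optimization Strategies That Actually Work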
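Wider context windows don't just allow longer prompts. The models that offer them usually charge more per token, so every request gets more expensive even if you use only a fraction of the window. At the prices above, 10K input tokens cost about $0.01 with GPT-3.5 Turbo (16K) and about $0.10 with GPT-4 Turbo (128K), before counting output tokens.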
Smart summarization can reduce costs by 60-80%. Instead of keeping full conversation history, summarize older turns:
- Keep the last 3-5 exchanges in full detail
- Summarize everything older into key points
- Always preserve the system prompt and current context
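The three rules above can be sketched as a single function. `compactHistory` is a hypothetical name, and `summarize` is a callback you supply, for example a call to a cheaper model. It is synchronous here only to keep the sketch short. The sketch also assumes user and assistant turns alternate, so one exchange is two messages.

```javascript
// Keep system prompts and the last `keepExchanges` user/assistant pairs;
// replace everything older with a single summary message.
function compactHistory(messages, summarize, keepExchanges = 3) {
  const system = messages.filter((m) => m.role === "system");
  const turns = messages.filter((m) => m.role !== "system");
  const keepCount = keepExchanges * 2; // one exchange = user + assistant
  if (turns.length <= keepCount) return messages;

  const older = turns.slice(0, turns.length - keepCount);
  const recent = turns.slice(turns.length - keepCount);
  const summary = {
    role: "system",
    content: `Summary of earlier conversation: ${summarize(older)}`,
  };
  return [...system, summary, ...recent];
}

// Usage with a stub summarizer that only reports how much was condensed.
const stubSummarize = (older) => `${older.length} earlier messages condensed`;
```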
Two-model routing works well for many use cases. Use a cheap model (GPT-3.5 Turbo or Claude 3 Haiku) for initial processing and context compression, then send the condensed version to your main model. I've seen teams reduce costs by 70% this way.
Retrieval-based filtering beats stuffing everything into context. Instead of sending 50 documents, use vector search to find the 3 most relevant ones. Your accuracy often improves because the model isn't distracted by irrelevant information.
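The core of retrieval-based filtering is a similarity ranking. The sketch below assumes each document already carries a precomputed `embedding` array. How you compute embeddings is outside its scope, and the function names are hypothetical. A real vector database would replace the linear scan.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Fall back to 1 to avoid dividing by zero for all-zero vectors.
  return dot / (Math.sqrt(normA) * Math.sqrt(normB) || 1);
}

// Return the k documents whose embeddings are closest to the query embedding.
function topKDocuments(queryEmbedding, documents, k = 3) {
  return documents
    .map((doc) => ({ doc, score: cosineSimilarity(queryEmbedding, doc.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map(({ doc }) => doc);
}
```

Only the documents returned by `topKDocuments` go into the prompt. The other 47 never cost you a token.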
When Large Context Windows Backfire
Bigger isn't always better. Models can struggle with "needle in a haystack" problems where important information gets buried in massive contexts. GPT-4 Turbo with 100K+ tokens of context sometimes performs worse than GPT-3.5 Turbo with 8K carefully selected tokens.
Large contexts also increase latency significantly. Processing 128K tokens can take 10-30 seconds vs 2-5 seconds for 8K tokens. This matters for real-time applications.
Consider breaking large tasks into smaller, focused requests rather than cramming everything into one massive context.
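One simple way to break a large input into focused requests is to chunk on paragraph boundaries against a token budget. This is a sketch under the same rough 4-characters-per-token assumption used for estimation elsewhere, and `chunkByTokens` is a hypothetical helper name.

```javascript
const estimateTokens = (text) => Math.ceil(text.length / 4);

// Group paragraphs into chunks that stay within `maxTokens` estimated tokens.
// A single paragraph larger than the budget is kept whole as its own chunk,
// so callers should still check sizes before sending.
function chunkByTokens(text, maxTokens) {
  const paragraphs = text.split(/\n{2,}/);
  const chunks = [];
  let current = "";
  for (const p of paragraphs) {
    const candidate = current ? `${current}\n\n${p}` : p;
    if (!current || estimateTokens(candidate) <= maxTokens) {
      current = candidate;
    } else {
      chunks.push(current);
      current = p;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

Each chunk can then go out as its own request, with a final request that combines the per-chunk results.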
Production Implementation Tips
Monitor your token usage patterns over time. Most teams are surprised to discover their average request size—it's usually much higher than expected. Set up alerts when requests approach 80% of your model's limit.
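A monitor for this does not need to be elaborate. The class below is a minimal in-memory sketch. `TokenUsageMonitor` is a hypothetical name, and in production you would forward these numbers to whatever metrics system you already run.

```javascript
// Tracks request sizes and flags requests above a fraction of the model limit.
class TokenUsageMonitor {
  constructor(modelLimit, alertRatio = 0.8) {
    this.modelLimit = modelLimit;
    this.alertRatio = alertRatio;
    this.samples = [];
  }

  // Record one request; returns true when it crosses the alert threshold.
  record(tokenCount) {
    this.samples.push(tokenCount);
    return tokenCount >= this.modelLimit * this.alertRatio;
  }

  // Average request size across everything recorded so far.
  average() {
    if (this.samples.length === 0) return 0;
    return this.samples.reduce((a, b) => a + b, 0) / this.samples.length;
  }
}

const monitor = new TokenUsageMonitor(10_000);
console.log(monitor.record(5_000)); // false
console.log(monitor.record(8_500)); // true
console.log(monitor.average());     // 6750
```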
Implement graceful degradation. When hitting token limits, try automatic summarization, switching to a larger context model, or breaking the task into smaller chunks. Don't just error out.
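A fallback chain like this can be sketched as a loop over strategies. Everything here is hypothetical: `fitToLimit`, the strategy objects, and the `countTokens` callback are illustrative names. The final `throw` is a signal for the caller to route to a larger-context model or split the task, not a user-facing error.

```javascript
// Apply each strategy in order until the request fits within the limit.
// Each strategy is { name, apply }, where apply maps messages to messages.
function fitToLimit(messages, limit, countTokens, strategies) {
  let current = messages;
  const applied = [];
  if (countTokens(current) <= limit) return { messages: current, applied };
  for (const { name, apply } of strategies) {
    current = apply(current);
    applied.push(name);
    if (countTokens(current) <= limit) return { messages: current, applied };
  }
  // Caller decides what happens next: larger model, chunking, and so on.
  throw new Error(
    `Request still exceeds ${limit} tokens after: ${applied.join(", ")}`
  );
}

// Example strategy: drop the oldest message.
const dropOldest = { name: "drop-oldest", apply: (msgs) => msgs.slice(1) };
```

Returning the list of applied strategies makes it easy to log how often each fallback fires, which feeds straight back into the monitoring above.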
Test with realistic data volumes. Your development environment with clean, minimal test data won't reveal the token bloat that happens in production with real user conversations and system integrations.