Token Optimization
There’s a growing narrative that AI token consumption is too expensive and too wasteful. Engineers are “tokenmaxxing”. CFOs are nervous. Budgets are blown.
The concern isn’t wrong. There is waste. But it misses the structural picture.
The Mental Model
AI spend = no. of users × tasks/user × tokens/task × blended $/token
The first half (users and tasks per user) is almost certainly going to keep ripping. Claude Code’s adoption curve is steeper than Cursor’s was at the same stage. Cowork is ramping faster than Claude Code. The scary part is that we are barely scratching the surface.
The tension lives in the second half: tokens/task and $/token. That’s where optimization happens, and where the bull/bear debate gets heated.
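To make the identity concrete, here is a minimal sketch of the four-factor model. Every number is a made-up placeholder (1,000 users, 200 tasks each per month, 50K tokens per task, $10 per million tokens), and the `SpendModel` name is purely illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpendModel:
    users: int                      # number of active users
    tasks_per_user: float           # tasks per user per month
    tokens_per_task: float          # average tokens consumed per task
    usd_per_million_tokens: float   # blended price across all models used

    def cost_per_task(self) -> float:
        # Second half of the identity: tokens/task x blended $/token.
        return self.tokens_per_task * self.usd_per_million_tokens / 1_000_000

    def monthly_spend(self) -> float:
        # First half (users x tasks/user) multiplied by cost per task.
        return self.users * self.tasks_per_user * self.cost_per_task()


# Hypothetical baseline, not taken from any real company.
baseline = SpendModel(users=1_000, tasks_per_user=200,
                      tokens_per_task=50_000, usd_per_million_tokens=10.0)

print(f"cost per task: ${baseline.cost_per_task():.2f}")
print(f"monthly spend: ${baseline.monthly_spend():,.0f}")
```

Because the model is a plain product, a 2x change in any one factor moves total spend by 2x. That is why the debate reduces to which factors move faster.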
Three Levers for Optimization
Lever 1: Same Work, Cheaper Tokens (↓ $/token)
Model routing is the highest-impact play. A routing layer that sends trivial tasks to Haiku and reserves Opus for complex reasoning can cut 60–80% of spend on eligible tasks.
Volume discounts are table stakes for $1M+ accounts.
OSS models for commodity tasks. Self-hosting Llama or Qwen for boilerplate removes the per-token API bill, but swaps it for GPU capex and operating cost.
Or the simplest strategy: wait. Token prices fall roughly 10x every 18 months. Do nothing, and the same work costs less next quarter.
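A rough sketch of the arithmetic behind routing and waiting. The tier prices are placeholders rather than real list prices, the 75/25 traffic split is an assumption, and `blended_price` and `price_after` are hypothetical helper names.

```python
def blended_price(mix: dict[str, float], price: dict[str, float]) -> float:
    """Traffic-weighted $/million tokens, given each tier's share of traffic."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * price[tier] for tier, share in mix.items())


def price_after(months: float, price_now: float,
                factor: float = 10.0, period_months: float = 18.0) -> float:
    """Price after `months`, assuming it falls by `factor` every `period_months`."""
    return price_now * factor ** (-months / period_months)


# Placeholder prices in $/million tokens (assumptions, not real list prices).
PRICE = {"small": 1.0, "large": 15.0}

all_large = blended_price({"large": 1.0}, PRICE)
routed = blended_price({"small": 0.75, "large": 0.25}, PRICE)
routing_savings = 1 - routed / all_large

print(f"blended price with routing: ${routed:.2f}/M vs ${all_large:.2f}/M")
print(f"routing savings: {routing_savings:.0%}")
print(f"same price one quarter later: ${price_after(3, all_large):.2f}/M")
```

Under these assumptions, sending three quarters of traffic to the small tier lands inside the 60–80% range quoted above. The savings scale with both the price gap and the share of tasks that are actually eligible.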
Lever 2: Same Work, Fewer Tokens (↓ tokens/task)
Prompt caching is low-hanging fruit. Cache repeated system prompts: cache reads cost about 10% of the normal input price, a 90% saving on the cached portion.
Context window management. Summarize history instead of re-sending full conversations, and prune irrelevant files.
Thinking budget tuning. Cap thinking tokens for simple completions; uncap them for hard problems.
Prompt engineering. “Return only the diff” costs far fewer output tokens than “explain your changes.”
Response caching. If 50 engineers ask “how do I set up X,” answer once and serve the rest from cache at zero marginal tokens.
Agent loop pruning, possibly the biggest single source of waste. Most agents burn 50–70% of their tokens on redundant tool calls, retries, and pointless sub-agent spawns.
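The two caching ideas above can be sketched as follows. This is a simplified illustration: it assumes cache reads are billed at 10% of the input price, ignores any premium for writing to the cache, and uses a placeholder $3 per million input tokens. `input_cost` and `ResponseCache` are hypothetical names, not part of any vendor SDK.

```python
from typing import Callable


def input_cost(tokens: int, cached_fraction: float, usd_per_million: float,
               cache_read_multiplier: float = 0.10) -> float:
    """Input cost when part of the prompt is served from the prompt cache."""
    cached = tokens * cached_fraction
    fresh = tokens - cached
    billable = fresh + cached * cache_read_multiplier
    return billable * usd_per_million / 1_000_000


class ResponseCache:
    """Serve repeated questions from memory instead of calling the model again."""

    def __init__(self, generate: Callable[[str], str]) -> None:
        self._generate = generate            # the expensive model call
        self._store: dict[str, str] = {}
        self.model_calls = 0

    def ask(self, question: str) -> str:
        # Normalize case and whitespace so trivial variations share one entry.
        key = " ".join(question.lower().split())
        if key not in self._store:
            self.model_calls += 1
            self._store[key] = self._generate(question)
        return self._store[key]


uncached = input_cost(20_000, cached_fraction=0.0, usd_per_million=3.0)
cached = input_cost(20_000, cached_fraction=0.8, usd_per_million=3.0)
print(f"prompt caching saves {1 - cached / uncached:.0%} of input cost")

cache = ResponseCache(generate=lambda q: f"answer to: {q}")  # stand-in for a model
for _ in range(50):
    cache.ask("How do I set up X?")
print(f"model calls for 50 identical questions: {cache.model_calls}")
```

Exact-match keys only catch literal repeats. Matching paraphrased questions would need something like embedding similarity, which trades some correctness risk for a higher hit rate.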
Lever 3: Stop Doing Unnecessary Work (↓ task count)
Who Optimizes What
Optimization happens at every layer of the stack, but each layer targets different metrics.
Infrastructure (NVIDIA, AMD, Cerebras, Groq) optimizes tokens/watt, tokens/second, tokens/dollar, and TTFT latency.
Model providers (Anthropic, OpenAI, Google) optimize quality/token, gross margin, task success rate, and thinking efficiency.
Application layer (Cursor, Claude Code, Codex) optimizes cost/task, tokens/task, cache hit rate, and routing accuracy.
Enterprise buyers (Uber, every company deploying AI) optimize cost/engineer, output/dollar, and ROI vs. headcount.
Each layer’s gains create pressure on the layers around it. Faster hardware forces model providers to compete on price. Better models reduce the tokens applications need. Application-layer routing erodes premium pricing. Enterprise CFOs demand all of the above.
Bear vs. Bull
The core debate: does optimization compress AI revenue faster than new demand replaces it?
The bear case
Rationalization is the CFO’s first instinct. When Uber’s CTO revisits the AI budget, the reaction isn’t “great, let’s 10x usage” — it’s “finally back inside the envelope.” Savings flow to the bottom line, not back into tokens.
Model routing is a revenue attack. Routing a task from Opus to Haiku delivers a near-identical user experience. Revenue per task for the model provider drops 10–20x.
OSS is closing the gap. A year ago, open-source models were “not even close” on coding. Three months ago, “almost there.” Once good enough, workloads migrate to self-hosted — and that revenue doesn’t shrink, it vanishes.
Caching is pure token destruction. Cache hit = zero revenue. No new demand is generated. Same work, no payment.
Thinking efficiency is self-cannibalization. If Anthropic makes extended thinking 3x more token-efficient, thinking-token billing for the same reasoning task drops by two-thirds. A product improvement that directly erodes the provider’s own topline.
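Three of the bear-case mechanisms above reduce to simple arithmetic on provider revenue per task. The sketch below uses placeholder prices and token counts, and `provider_revenue` is a hypothetical helper, so treat the outputs as illustrations of the direction rather than estimates.

```python
def provider_revenue(tokens: float, usd_per_million: float,
                     served_from_cache: bool = False) -> float:
    """Revenue the model provider books for one task."""
    if served_from_cache:
        return 0.0  # an application-level cache hit never reaches the provider
    return tokens * usd_per_million / 1_000_000


TOKENS = 30_000                       # placeholder tokens for one task
LARGE_PRICE, SMALL_PRICE = 15.0, 1.0  # placeholder $/million tokens

baseline = provider_revenue(TOKENS, LARGE_PRICE)

# Routing: same task and same tokens, billed at the small model's price.
routed = provider_revenue(TOKENS, SMALL_PRICE)

# Response caching: the task is answered without any billable tokens.
cache_hit = provider_revenue(TOKENS, LARGE_PRICE, served_from_cache=True)

# Thinking efficiency: 3x fewer thinking tokens for the same reasoning.
efficient = provider_revenue(TOKENS / 3, LARGE_PRICE)

print(f"routing cuts revenue per task by {baseline / routed:.0f}x")
print(f"cache hit revenue: ${cache_hit:.2f}")
print(f"3x thinking efficiency cuts billing by {1 - efficient / baseline:.0%}")
```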
The bull case
Current usage is cost-constrained, not demand-constrained. Uber blew its budget and had to throttle. Drop costs 5x and every budget-killed use case comes back. Today only coding is at scale — testing, documentation, code review, security auditing are all waiting for the economics.
Penetration is still single digits. Cheaper tokens don’t just mean existing users do more — they mean new users and new companies adopt. Customer count growth offsets per-customer compression.
Agentic workflows are a token multiplier. A human-in-the-loop conversation: thousands of tokens. An autonomous agent on a complex task: hundreds of thousands. Optimization compresses each step; agentic architectures multiply the number of steps.
New modalities are net-new demand. Vision, audio, video — token volumes that dwarf text, not subject to existing text-optimization pressures.
And Jensen Huang’s framing: a $500K/year engineer should consume at least $250K/year in tokens. At $5K, you’re dramatically under-leveraging AI. Under this model, the right level of spend is much higher than what anyone is doing today — and part of it offsets labor costs directly.
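Two of the bull-case claims above are easy to check with arithmetic: the agentic multiplier and the token-to-compensation ratio. The step counts and per-step token figures are assumptions chosen only to match the orders of magnitude in the text, and both function names are illustrative.

```python
def task_tokens(steps: int, tokens_per_step: int) -> int:
    """Total tokens for a task that runs `steps` model calls."""
    return steps * tokens_per_step


def token_to_comp_ratio(token_spend: float, compensation: float) -> float:
    """Annual token spend as a fraction of an engineer's annual compensation."""
    return token_spend / compensation


# Assumed shapes: a chat turn is one call, an autonomous agent runs 100 steps.
chat = task_tokens(steps=1, tokens_per_step=5_000)
agent = task_tokens(steps=100, tokens_per_step=5_000)

# Optimization trims each step by 40%, but the step count is unchanged.
agent_optimized = task_tokens(steps=100, tokens_per_step=3_000)

print(f"chat: {chat:,} tokens, agent: {agent:,} tokens")
print(f"optimized agent still uses {agent_optimized // chat}x a chat turn")

print(f"$250K on $500K: {token_to_comp_ratio(250_000, 500_000):.0%}")
print(f"$5K on $500K: {token_to_comp_ratio(5_000, 500_000):.0%}")
```

The point of the comparison: per-step optimization is a percentage improvement, while moving from chat to agents changes the step count by orders of magnitude.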
What to Watch / Leading Indicators
Rather than picking sides, track the metrics that reveal which force is winning:
Revenue growth vs. token growth. Tokens still exploding but revenue decelerating? Price compression is winning.
Frontier model mix. Is the most expensive model’s revenue share shrinking? Traffic migrating to smaller models signals eroding pricing power.
Expansion patterns. Growth from individual users consuming more (fragile, optimizable) vs. more seats and workflows (durable, harder to compress).
Net dollar retention. Under optimization pressure, NDR dips — unless new use-case expansion outruns it.
Gross margin. Prices falling faster than inference costs = margin compression. Inference costs falling faster than prices = margins hold.
Inference infra startups. If Fireworks AI, Together AI, and Anyscale are raising aggressively and growing customers, enterprises are shifting from direct API calls to cost-efficient infra — pulling spend from model providers.
API pricing cadence. Frequent, large cuts (for example, a Sonnet-tier price dropping 3–4x in a year) signal competitive pressure forcing margin givebacks. Third-party trackers like Artificial Analysis aggregate this.
OSS benchmark convergence. Every new Llama or Qwen release, check the gap on SWE-bench and HumanEval. Narrowing gap = lower migration friction = less pricing power.
Cost-normalized performance. Not who scores highest — who delivers the most performance per dollar. If smaller models are closing this gap fast, routing incentives strengthen and premium revenue erodes.
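Several of these indicators can be computed from a handful of reported numbers. The sketch below covers three of them: implied price change, gross margin, and performance per dollar. All inputs are invented for illustration, and the function names are hypothetical.

```python
def implied_price_change(revenue_growth: float, token_growth: float) -> float:
    """Change in realized $/token implied by revenue and token volume growth."""
    return (1 + revenue_growth) / (1 + token_growth) - 1


def gross_margin(price: float, inference_cost: float) -> float:
    """Gross margin per token (or per million tokens, units cancel)."""
    return (price - inference_cost) / price


def performance_per_dollar(score: float, cost_per_task: float) -> float:
    """Benchmark score delivered per dollar of task cost."""
    return score / cost_per_task


# Tokens up 200% while revenue is up 50%: realized price per token halved.
print(f"implied price change: {implied_price_change(0.5, 2.0):.0%}")

# Price falls 2x while inference cost falls only 1.6x: margin compresses.
before = gross_margin(price=10.0, inference_cost=4.0)
after = gross_margin(price=5.0, inference_cost=2.5)
print(f"gross margin: {before:.0%} -> {after:.0%}")

# Made-up (score, cost per task) pairs for a frontier and a small model.
models = {"frontier": (0.70, 0.50), "small": (0.55, 0.05)}
for name, (score, cost) in models.items():
    print(f"{name}: {performance_per_dollar(score, cost):.1f} score per dollar")
```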
Where This Lands
The optimizers will win every individual battle. Every caching trick, every routing layer, every pruned agent loop will work. Cost per task will drop dramatically.
But the number of tasks, the number of users, and the complexity of what gets delegated to AI will grow faster than efficiency compresses spend.
Token costs are going down. Token spend is going up. Both things are true, and they aren’t in contradiction.
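The non-contradiction comes down to one inequality: spend rises whenever task volume grows by more than cost per task shrinks. A minimal sketch, with an assumed 5x drop in cost per task and an assumed 8x growth in volume:

```python
def spend_multiple(volume_multiple: float, cost_per_task_multiple: float) -> float:
    """New spend as a multiple of old spend."""
    return volume_multiple * cost_per_task_multiple


def breakeven_volume_multiple(cost_per_task_multiple: float) -> float:
    """Volume growth needed to keep spend flat after a cost-per-task change."""
    return 1 / cost_per_task_multiple


cost_multiple = 0.2    # assumption: cost per task falls 5x
volume_multiple = 8.0  # assumption: users x tasks grows 8x

print(f"break-even volume growth: {breakeven_volume_multiple(cost_multiple):.1f}x")
print(f"resulting spend: {spend_multiple(volume_multiple, cost_multiple):.1f}x")
```

With these inputs, anything above 5x volume growth means spend goes up even though every task got cheaper. Below 5x, the bears are right.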





