Intelligent cost-aware routing engine for LLM API calls
Route requests to the optimal language model based on complexity, cost, and latency requirements. Built for production systems that use multiple LLM providers.
- Cost Optimization — Automatically routes simple queries to cheaper models (Haiku) and complex ones to powerful models (Opus)
- Latency-Aware — Factors in response time requirements for real-time vs batch workloads
- Multi-Provider — Supports Anthropic, OpenAI, NVIDIA NIM, Ollama, and OpenRouter
- Fallback Chains — Automatic failover when a provider is down or rate-limited
- Token Estimation — Pre-estimates token usage to pick the right context window
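A minimal sketch of how cost-aware routing like this might work. The complexity heuristic, tier names, and thresholds below are illustrative assumptions, not the library's internals:

```python
from dataclasses import dataclass

@dataclass
class ModelTier:
    name: str
    cost_per_m_tokens: float  # USD per million input tokens

# Hypothetical tiers mirroring the feature list above
TIERS = [
    ModelTier("claude-haiku", 0.25),
    ModelTier("claude-sonnet", 3.00),
    ModelTier("claude-opus", 15.00),
]

def score_complexity(prompt: str) -> float:
    # Toy heuristic: longer prompts and analysis keywords imply more complexity
    score = min(len(prompt) / 2000, 1.0)
    if any(k in prompt.lower() for k in ("analyze", "summarize", "extract")):
        score = max(score, 0.7)
    return score

def pick_model(prompt: str) -> str:
    # Route cheap for low scores, escalate to pricier tiers as complexity rises
    s = score_complexity(prompt)
    if s < 0.3:
        return TIERS[0].name
    if s < 0.7:
        return TIERS[1].name
    return TIERS[2].name
```

A real router would also weigh latency budgets and provider health, but the shape is the same: score the request, then map the score to the cheapest tier that can handle it.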
```python
from model_router import ModelRouter

router = ModelRouter()

# Simple query → routes to Haiku (fast, cheap)
response = router.route("What is 2+2?")

# Complex query → routes to Opus (powerful)
response = router.route("Analyze this 50-page legal document and extract all liability clauses...")
```

| Query Type | Model | Input Cost | Latency |
|---|---|---|---|
| Simple Q&A | Claude Haiku | $0.25/M | ~200ms |
| Code generation | Claude Sonnet | $3/M | ~1s |
| Deep analysis | Claude Opus | $15/M | ~3s |
| Fallback | GPT-4o | $5/M | ~1s |
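The fallback row above corresponds to a provider chain: try the preferred model first, then fail over down the list. A hedged sketch of that pattern, where the provider callables and the `ProviderError` type are assumptions for illustration:

```python
class ProviderError(Exception):
    """Raised when a provider is down or rate-limited (illustrative)."""

def route_with_fallback(prompt, providers):
    # providers: ordered list of (name, callable) pairs; try each in turn
    errors = []
    for name, call in providers:
        try:
            return call(prompt)
        except ProviderError as exc:
            errors.append((name, str(exc)))
    raise RuntimeError(f"All providers failed: {errors}")
```

For example, `route_with_fallback(prompt, [("claude-opus", call_anthropic), ("gpt-4o", call_openai)])` would return the first successful response, touching GPT-4o only if the Anthropic call raises.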
- Python 3.11+
- Anthropic SDK, OpenAI SDK
- Token estimation via tiktoken
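Token pre-estimation with tiktoken might look like the following sketch; the `cl100k_base` choice and the 4-characters-per-token fallback are assumptions, not the library's documented behavior:

```python
def estimate_tokens(text: str) -> int:
    """Estimate token count before sending a request (illustrative helper)."""
    try:
        import tiktoken
        enc = tiktoken.get_encoding("cl100k_base")
        return len(enc.encode(text))
    except ImportError:
        # Rough fallback: ~4 characters per token for English text
        return max(1, len(text) // 4)
```

An estimate like this lets the router reject or re-route prompts that would overflow a cheaper model's context window before any API call is made.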
MIT