# LLM Gateway

Multi-provider LLM routing with fallback, rate limiting, and caching.

Route requests across OpenAI, Anthropic, and other providers with automatic failover, intelligent load balancing, and response caching.
## Installation

```bash
pip install -r requirements.txt
```

## Quick Start

```python
import asyncio

from llm_gateway import (
    Gateway, GatewayConfig, Request,
    OpenAIProvider, AnthropicProvider, ProviderConfig,
    RateLimitConfig, CacheConfig,
)

# Configure providers
providers = [
    OpenAIProvider(ProviderConfig(
        name="openai",
        api_key="sk-...",
        cost_per_1k_input=0.01,
        cost_per_1k_output=0.03,
        priority=1,
    )),
    AnthropicProvider(ProviderConfig(
        name="anthropic",
        api_key="sk-ant-...",
        cost_per_1k_input=0.008,
        cost_per_1k_output=0.024,
        priority=0,  # Fallback
    )),
]

# Create a gateway with rate limiting and caching
gateway = Gateway(
    providers=providers,
    config=GatewayConfig(
        rate_limit=RateLimitConfig(requests_per_minute=60),
        cache=CacheConfig(ttl_seconds=3600),
    ),
)

# Make a request
async def main():
    response = await gateway.complete(Request(
        messages=[{"role": "user", "content": "Hello!"}],
        model="gpt-4",
    ))
    print(response.content)

asyncio.run(main())
```

## Routing Strategies

### Round Robin

Distribute requests evenly across healthy providers:
```python
from llm_gateway import RoundRobinStrategy

gateway = Gateway(providers, strategy=RoundRobinStrategy())
```

### Lowest Latency

Route to the provider with the best average response time:
```python
from llm_gateway import LowestLatencyStrategy

gateway = Gateway(providers, strategy=LowestLatencyStrategy())
```

### Cost Optimized

Prefer cheaper providers:
```python
from llm_gateway import CostOptimizedStrategy

gateway = Gateway(providers, strategy=CostOptimizedStrategy())
```

### Priority

Route by provider priority with automatic fallback:
```python
from llm_gateway import PriorityStrategy

gateway = Gateway(providers, strategy=PriorityStrategy())
```

## Rate Limiting

Token bucket and sliding window limiters:
```python
from llm_gateway import RateLimitConfig

config = RateLimitConfig(
    requests_per_minute=60,
    requests_per_hour=1000,
    tokens_per_minute=100000,
    burst_multiplier=1.5,  # Allow 1.5x burst
)
```

The gateway raises `RateLimitError` when limits are exceeded:
```python
from llm_gateway import RateLimitError

try:
    response = await gateway.complete(request)
except RateLimitError as e:
    print(f"Rate limited. Wait {e.wait_time:.1f}s")
## Response Caching

Cache identical requests to reduce costs:

```python
from llm_gateway import CacheConfig

config = CacheConfig(
    enabled=True,
    ttl_seconds=3600,
    max_entries=1000,
    include_model_in_key=True,
    include_temperature_in_key=True,
)
```

Cache keys are generated from message content, model, and temperature.
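
As a sketch, such a key can be derived by hashing the canonical JSON of the fields that affect the response. This is illustrative only; `make_key` is a hypothetical helper, not part of the library:

```python
import hashlib
import json

def make_key(messages, model, temperature=None):
    # Canonicalize the request fields that affect the response, then hash.
    payload = json.dumps(
        {"messages": messages, "model": model, "temperature": temperature},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

msgs = [{"role": "user", "content": "Hello!"}]
assert make_key(msgs, "gpt-4", 0.0) == make_key(msgs, "gpt-4", 0.0)  # cache hit
assert make_key(msgs, "gpt-4", 0.0) != make_key(msgs, "gpt-4", 1.0)  # miss
```

Sorting keys and fixing separators makes the JSON canonical, so logically identical requests always hash to the same key.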
## Fallback

Automatic fallback when providers fail:

```python
config = GatewayConfig(
    fallback_enabled=True,
    max_fallback_attempts=3,
    timeout=30.0,
)
```

Providers are tried in priority order. Unhealthy providers (3+ consecutive failures) are temporarily skipped.
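
The selection logic can be pictured like this (hypothetical data shapes and helper; the gateway tracks health internally, and the higher-number-wins priority convention is inferred from the quick start, where `priority=0` is the fallback):

```python
UNHEALTHY_AFTER = 3  # consecutive failures before a provider is skipped

def fallback_order(providers):
    """Healthy providers only, highest priority first."""
    healthy = [p for p in providers if p["consecutive_failures"] < UNHEALTHY_AFTER]
    return sorted(healthy, key=lambda p: p["priority"], reverse=True)

providers = [
    {"name": "openai", "priority": 1, "consecutive_failures": 0},
    {"name": "anthropic", "priority": 0, "consecutive_failures": 0},
    {"name": "local", "priority": 2, "consecutive_failures": 3},  # unhealthy
]
order = [p["name"] for p in fallback_order(providers)]
# "openai" is tried first, "anthropic" is the fallback, "local" is skipped
```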
## Metrics

Track usage and performance:

```python
metrics = gateway.get_metrics()
# {
#     "total_requests": 1000,
#     "cached_requests": 250,
#     "failed_requests": 5,
#     "cache_hit_rate": 0.25,
#     "avg_latency_ms": 150,
#     "total_cost": 0.42,
#     "providers": {
#         "openai": {"success_rate": 0.995, ...},
#         "anthropic": {"success_rate": 0.99, ...},
#     }
# }

status = gateway.get_provider_status()
# {"openai": {"status": "healthy", ...}}
```

## Custom Providers

Extend `Provider` for custom backends:
```python
from llm_gateway import Provider, ProviderConfig

class LocalLLMProvider(Provider):
    async def complete(self, messages, model, **kwargs):
        # Your implementation
        return {
            "content": "response",
            "model": model,
            "provider": self.name,
            "usage": {"prompt_tokens": 10, "completion_tokens": 20},
        }
```

## Testing

```bash
pytest tests/ -v
```

104 tests covering:
- Provider management and metrics
- Routing strategies
- Rate limiting (token bucket, sliding window)
- Response caching
- Gateway integration

## License

MIT