Production-grade boilerplate for building reliable LLM agents.
Zero-dependency monitoring, quality evaluation, and circuit breaking — drop it into any Python project.
Inspired by patterns from darshjme/arsenal.
A battle-tested template for LLM infrastructure engineers who need:
- Observability — every call logged with latency, tokens, cost, retries
- Quality gates — response scoring before replies reach users
- Fault isolation — circuit breaker prevents cascade failures under provider outages
- Retry resilience — exponential backoff on transient errors, configurable per deployment
No vendor lock-in. Works with OpenAI, Anthropic, local Ollama, or any callable.
```bash
pip install -r requirements.txt
python3 demo.py    # full demo, no API key needed
pytest tests/ -v   # 20+ unit tests
```

```
llm-reliability-starter/
├── config.py          — Pydantic Settings (env-based, .env support)
├── monitor.py         — LLM call wrapper: latency, tokens, cost, retry, structured logs
├── evaluator.py       — Response quality scorer: relevance + coherence + hallucination flags
├── circuit_breaker.py — CLOSED → OPEN → HALF_OPEN state machine
├── demo.py            — End-to-end demo (5 scenes, mock LLM, no API key)
└── tests/
    └── test_monitor.py — 20+ unit tests across all three modules
```
Wraps any LLM callable with full observability:
```python
from monitor import LLMMonitor, LLMRequest

monitor = LLMMonitor(my_openai_fn, max_retries=3, backoff_seconds=1.0)
response = monitor.call(LLMRequest(prompt="Summarize this contract."))

print(response.latency_ms)            # 342.17
print(response.total_tokens)          # 847
print(response.estimated_cost_usd())  # 0.000127
print(monitor.summary())              # aggregate stats
```

Features:
- Per-call structured JSON logs (structlog)
- Configurable retry with exponential backoff
- Separate `retryable_exceptions` from fatal ones
- Running metrics: success rate, P99 latency, total spend
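The retry behavior described above can be sketched in a few lines. This is a minimal illustration, not the repo's actual `monitor.py` implementation; `TransientError` and the injectable `sleep` hook are stand-ins added here so the backoff is visible and testable:

```python
import random
import time


class TransientError(Exception):
    """Stands in for a retryable provider error (timeout, 429, 503)."""


def call_with_retry(fn, *, max_retries=3, backoff_seconds=1.0,
                    retryable=(TransientError,), sleep=time.sleep):
    """Retry fn() on retryable exceptions with exponential backoff;
    anything outside `retryable` is treated as fatal and re-raised at once."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except retryable:
            if attempt == max_retries:
                raise  # budget exhausted — surface the last error
            # 1.0s, 2.0s, 4.0s, ... plus a little jitter to avoid thundering herds
            sleep(backoff_seconds * (2 ** attempt) + random.uniform(0, 0.1))
```

The key design point is the split between retryable and fatal exceptions: a malformed request will never succeed on retry, so burning the backoff budget on it only adds latency.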
Scores responses 0–1 across three dimensions before they leave your system:
```python
from evaluator import ResponseEvaluator

evaluator = ResponseEvaluator(weights={"relevance": 0.5, "coherence": 0.3, "factuality": 0.2})
result = evaluator.evaluate(prompt="What is Python?", response=llm_output)

print(result.quality_score)          # 0.87
print(result.hallucination_flags)    # ['unsourced_authority_claim']
print(result.passed(threshold=0.7))  # True / False
```

Scoring:
| Dimension | Method | What it catches |
|---|---|---|
| Relevance | Keyword overlap + length bonus | Off-topic, empty, generic responses |
| Coherence | Structural pattern analysis | Truncated, over-hedged, incoherent replies |
| Factuality | Regex hallucination patterns | Overconfident claims, unsourced stats, future dates |
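The relevance row in the table can be approximated in a few lines. This is a hypothetical sketch of a keyword-overlap-plus-length-bonus heuristic — the stopword list and the 0.8/0.2 weighting are illustrative choices, not the repo's actual values:

```python
import re

# Tiny illustrative stopword list — a real one would be larger.
STOPWORDS = {"the", "a", "an", "is", "are", "what", "how", "of", "to", "in"}


def _tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens, punctuation stripped."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def relevance_score(prompt: str, response: str) -> float:
    """Keyword overlap between prompt and response plus a capped length
    bonus — a zero-cost proxy for topical relevance."""
    prompt_terms = _tokens(prompt) - STOPWORDS
    if not prompt_terms or not response.strip():
        return 0.0
    overlap = len(prompt_terms & _tokens(response)) / len(prompt_terms)
    length_bonus = min(len(response.split()) / 50, 1.0) * 0.2  # rewards substance, capped
    return min(overlap * 0.8 + length_bonus, 1.0)
```

Empty, generic, and off-topic responses all score near zero because they share no content words with the prompt and earn little length bonus.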
Hallucination patterns detected:
- `"Studies show"` / `"research proves"` without a source
- Overconfident language: `"definitely/certainly/absolutely true"`
- Large unverified statistics: `"5.7 billion users"`
- Knowledge-cutoff hedges when context doesn't warrant them
Note: These are fast heuristic signals. Layer in model-graded evals (DeepEval, G-Eval) for production depth.
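As one illustration, regex-based flags of this kind can be implemented as a named pattern table. The pattern names and expressions below are hypothetical, not the exact set in `evaluator.py`:

```python
import re

# Illustrative pattern set mirroring the checks described above.
HALLUCINATION_PATTERNS = {
    "unsourced_authority_claim": re.compile(
        r"\b(studies show|research proves)\b", re.IGNORECASE),
    "overconfident_language": re.compile(
        r"\b(definitely|certainly|absolutely) true\b", re.IGNORECASE),
    "unverified_statistic": re.compile(
        r"\b\d+(\.\d+)?\s*(billion|million)\b", re.IGNORECASE),
}


def hallucination_flags(text: str) -> list[str]:
    """Return the name of every heuristic pattern that matches the text."""
    return [name for name, pattern in HALLUCINATION_PATTERNS.items()
            if pattern.search(text)]
```

Because every check is a single regex scan, the whole pass costs microseconds — cheap enough to run on every response before anything reaches a user.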
Classic three-state machine preventing cascade failures:
```
CLOSED ──(N failures)──► OPEN ──(T seconds)──► HALF_OPEN ──(success)──► CLOSED
                                                    └──────(failure)──► OPEN
```
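The transitions above condense to a few dozen lines. This is a simplified sketch without the thread safety and stats tracking that `circuit_breaker.py` provides; the injectable `clock` exists only to make the recovery timeout testable:

```python
import time


class CircuitOpen(Exception):
    """Raised to fast-fail while the circuit is OPEN."""


class MiniBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=60.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "OPEN":
            if self.clock() - self.opened_at >= self.recovery_timeout:
                self.state = "HALF_OPEN"  # let one probe call through
            else:
                raise CircuitOpen("fast-fail: circuit is OPEN")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed probe, or hitting the threshold, (re)opens the circuit.
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state, self.opened_at = "OPEN", self.clock()
            raise
        self.failures = 0
        if self.state == "HALF_OPEN":
            self.state = "CLOSED"  # probe succeeded — resume normal traffic
        return result
```

The crucial property is that `CircuitOpen` is raised *before* the provider is touched: while OPEN, a struggling upstream receives zero traffic instead of a retry storm.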
```python
from circuit_breaker import CircuitBreaker, CircuitOpenError

cb = CircuitBreaker(name="openai-api", failure_threshold=5, recovery_timeout=60)

try:
    result = cb.call(my_llm_fn, prompt="Hello")
except CircuitOpenError as e:
    # Fast-fail: no call made, circuit is OPEN
    return fallback_response(retry_after=e.retry_after)
```

Stats available:
```python
cb.stats.total_calls        # 1247
cb.stats.rejected_calls     # 83 (fast-failed while OPEN)
cb.stats.state_transitions  # [{from: CLOSED, to: OPEN, ts: ...}]
```

All settings via env vars or a .env file:
```env
LLM_API_KEY=sk-...
LLM_MODEL=gpt-4o
LLM_MAX_RETRIES=3
LLM_RETRY_BACKOFF_SECONDS=1.0
LLM_CB_FAILURE_THRESHOLD=5
LLM_CB_RECOVERY_TIMEOUT=60
LLM_LOG_JSON=true
LLM_TIMEOUT_SECONDS=30
```

Putting it together:

```python
from config import settings
from monitor import LLMMonitor, LLMRequest
from evaluator import ResponseEvaluator
from circuit_breaker import CircuitBreaker, CircuitOpenError

cb = CircuitBreaker(
    name="llm-api",
    failure_threshold=settings.cb_failure_threshold,
    recovery_timeout=settings.cb_recovery_timeout,
)
monitor = LLMMonitor(my_llm_fn)
evaluator = ResponseEvaluator()

def reliable_llm_call(prompt: str) -> str:
    try:
        response = cb.call(monitor.call, LLMRequest(prompt=prompt))
    except CircuitOpenError as e:
        return f"[service unavailable, retry in {e.retry_after:.0f}s]"

    result = evaluator.evaluate(prompt, response.content)
    if not result.passed():
        # Log quality failure, trigger human review, return safe fallback
        logger.warning("quality_gate_failed", score=result.quality_score,
                       flags=result.hallucination_flags)
        return fallback_response()
    return response.content
```

Run the tests:

```bash
pytest tests/ -v
```

Expected: 20+ tests across all three modules, all passing.
| Need | Where to add |
|---|---|
| Real OpenAI calls | Replace mock in demo.py with openai.OpenAI().chat.completions.create |
| Async support | Wrap monitor.call in asyncio.to_thread or add async def acall |
| Model-graded evals | Add deepeval or ragas scorer in evaluator.py |
| Prometheus metrics | Export monitor.summary() to /metrics via prometheus_client |
| Persistent logging | Configure structlog to write to file or ship to Datadog/Grafana |
| Multi-provider routing | Swap make_mock_llm with a router that tries GPT-4 → Claude → Gemini |
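For the multi-provider row, a router can be as simple as trying a list of callables in order. This is a hypothetical sketch — the common `fn(prompt) -> str` interface is an assumption, and real GPT-4/Claude/Gemini clients would each need a thin adapter behind it:

```python
def route_with_fallback(providers, prompt):
    """Try each (name, fn) provider in order and return the first success.

    `providers` is a list of (name, callable) pairs sharing a
    fn(prompt) -> str interface (an assumption for this sketch)."""
    errors = []
    for name, fn in providers:
        try:
            return fn(prompt)
        except Exception as exc:  # any provider error means "try the next one"
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

In practice each provider would sit behind its own circuit breaker, so a hard-down provider is skipped in microseconds rather than waiting out a timeout.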
- Pydantic Settings over raw `os.environ` — validation, type coercion, `.env` support out of the box
- structlog over `logging` — structured JSON logs that parse cleanly in Datadog, CloudWatch, Loki
- Heuristic evaluation first — zero-cost checks catch 80% of failures; expensive model-graded evals run only on ambiguous cases
- Thread-safe circuit breaker — `threading.Lock` throughout; safe for concurrent request handlers
- Dataclass responses — not dicts; IDE autocomplete, type safety, serializable via `asdict()`
This template demonstrates skills directly relevant to:
- LLM reliability engineering — monitoring, evaluation, fault tolerance
- Agent infrastructure — composable observability layer for any agent framework
- Platform engineering — production patterns, structured logging, circuit breaking
MIT — use freely in client projects, products, and portfolios.