Production-grade boilerplate for building reliable LLM agents.
Zero-dependency monitoring, quality evaluation, and circuit breaking — drop it into any Python project.
Inspired by patterns from darshjme/arsenal.
A battle-tested template for LLM infrastructure engineers who need:
- Observability — every call logged with latency, tokens, cost, retries
- Quality gates — response scoring before replies reach users
- Fault isolation — circuit breaker prevents cascade failures under provider outages
- Retry resilience — exponential backoff on transient errors, configurable per deployment
No vendor lock-in. Works with OpenAI, Anthropic, local Ollama, or any callable.
```bash
pip install -r requirements.txt
python3 demo.py    # full demo, no API key needed
pytest tests/ -v   # 20+ unit tests
```

```
llm-reliability-starter/
├── config.py          — Pydantic Settings (env-based, .env support)
├── monitor.py         — LLM call wrapper: latency, tokens, cost, retry, structured logs
├── evaluator.py       — Response quality scorer: relevance + coherence + hallucination flags
├── circuit_breaker.py — CLOSED → OPEN → HALF_OPEN state machine
├── demo.py            — End-to-end demo (5 scenes, mock LLM, no API key)
└── tests/
    └── test_monitor.py — 20+ unit tests across all three modules
```
Wraps any LLM callable with full observability:
```python
from monitor import LLMMonitor, LLMRequest

monitor = LLMMonitor(my_openai_fn, max_retries=3, backoff_seconds=1.0)
response = monitor.call(LLMRequest(prompt="Summarize this contract."))

print(response.latency_ms)            # 342.17
print(response.total_tokens)          # 847
print(response.estimated_cost_usd())  # 0.000127
print(monitor.summary())              # aggregate stats
```

Features:
- Per-call structured JSON logs (structlog)
- Configurable retry with exponential backoff
- Separate `retryable_exceptions` from fatal ones
- Running metrics: success rate, P99 latency, total spend
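The retry behavior described above can be sketched in a few lines. This is a minimal illustration, not the repo's actual `monitor.py` implementation; `TransientError` and the injectable `sleep` hook are stand-ins added here so the backoff is visible and testable:

```python
import random
import time


class TransientError(Exception):
    """Stands in for a retryable provider error (timeout, 429, 503)."""


def call_with_retry(fn, *, max_retries=3, backoff_seconds=1.0,
                    retryable=(TransientError,), sleep=time.sleep):
    """Retry fn() on retryable exceptions with exponential backoff;
    anything outside `retryable` is treated as fatal and re-raised at once."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except retryable:
            if attempt == max_retries:
                raise  # budget exhausted — surface the last error
            # 1.0s, 2.0s, 4.0s, ... plus a little jitter to avoid thundering herds
            sleep(backoff_seconds * (2 ** attempt) + random.uniform(0, 0.1))
```

The key design point is the split between retryable and fatal exceptions: a malformed request will never succeed on retry, so burning the backoff budget on it only adds latency.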
Scores responses 0–1 across three dimensions before they leave your system:
```python
from evaluator import ResponseEvaluator

evaluator = ResponseEvaluator(weights={"relevance": 0.5, "coherence": 0.3, "factuality": 0.2})
result = evaluator.evaluate(prompt="What is Python?", response=llm_output)

print(result.quality_score)          # 0.87
print(result.hallucination_flags)    # ['unsourced_authority_claim']
print(result.passed(threshold=0.7))  # True / False
```

Scoring:
| Dimension | Method | What it catches |
|---|---|---|
| Relevance | Keyword overlap + length bonus | Off-topic, empty, generic responses |
| Coherence | Structural pattern analysis | Truncated, over-hedged, incoherent replies |
| Factuality | Regex hallucination patterns | Overconfident claims, unsourced stats, future dates |
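The relevance row in the table can be approximated in a few lines. This is a hypothetical sketch of a keyword-overlap-plus-length-bonus heuristic — the stopword list and the 0.8/0.2 weighting are illustrative choices, not the repo's actual values:

```python
import re

# Tiny illustrative stopword list — a real one would be larger.
STOPWORDS = {"the", "a", "an", "is", "are", "what", "how", "of", "to", "in"}


def _tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens, punctuation stripped."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def relevance_score(prompt: str, response: str) -> float:
    """Keyword overlap between prompt and response plus a capped length
    bonus — a zero-cost proxy for topical relevance."""
    prompt_terms = _tokens(prompt) - STOPWORDS
    if not prompt_terms or not response.strip():
        return 0.0
    overlap = len(prompt_terms & _tokens(response)) / len(prompt_terms)
    length_bonus = min(len(response.split()) / 50, 1.0) * 0.2  # rewards substance, capped
    return min(overlap * 0.8 + length_bonus, 1.0)
```

Empty, generic, and off-topic responses all score near zero because they share no content words with the prompt and earn little length bonus.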
Hallucination patterns detected:
- `"Studies show"` / `"research proves"` without a source
- Overconfident language: `"definitely/certainly/absolutely true"`
- Large unverified statistics: `"5.7 billion users"`
- Knowledge-cutoff hedges when context doesn't warrant them
Note: These are fast heuristic signals. Layer in model-graded evals (DeepEval, G-Eval) for production depth.
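As one illustration, regex-based flags of this kind can be implemented as a named pattern table. The pattern names and expressions below are hypothetical, not the exact set in `evaluator.py`:

```python
import re

# Illustrative pattern set mirroring the checks described above.
HALLUCINATION_PATTERNS = {
    "unsourced_authority_claim": re.compile(
        r"\b(studies show|research proves)\b", re.IGNORECASE),
    "overconfident_language": re.compile(
        r"\b(definitely|certainly|absolutely) true\b", re.IGNORECASE),
    "unverified_statistic": re.compile(
        r"\b\d+(\.\d+)?\s*(billion|million)\b", re.IGNORECASE),
}


def hallucination_flags(text: str) -> list[str]:
    """Return the name of every heuristic pattern that matches the text."""
    return [name for name, pattern in HALLUCINATION_PATTERNS.items()
            if pattern.search(text)]
```

Because every check is a single regex scan, the whole pass costs microseconds — cheap enough to run on every response before anything reaches a user.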
Classic three-state machine preventing cascade failures:
```
CLOSED ──(N failures)──► OPEN ──(T seconds)──► HALF_OPEN ──(success)──► CLOSED
                                                    └──────(failure)──► OPEN
```
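The transitions above condense to a few dozen lines. This is a simplified sketch without the thread safety and stats tracking that `circuit_breaker.py` provides; the injectable `clock` exists only to make the recovery timeout testable:

```python
import time


class CircuitOpen(Exception):
    """Raised to fast-fail while the circuit is OPEN."""


class MiniBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=60.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "OPEN":
            if self.clock() - self.opened_at >= self.recovery_timeout:
                self.state = "HALF_OPEN"  # let one probe call through
            else:
                raise CircuitOpen("fast-fail: circuit is OPEN")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed probe, or hitting the threshold, (re)opens the circuit.
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state, self.opened_at = "OPEN", self.clock()
            raise
        self.failures = 0
        if self.state == "HALF_OPEN":
            self.state = "CLOSED"  # probe succeeded — resume normal traffic
        return result
```

The crucial property is that `CircuitOpen` is raised *before* the provider is touched: while OPEN, a struggling upstream receives zero traffic instead of a retry storm.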
```python
from circuit_breaker import CircuitBreaker, CircuitOpenError

cb = CircuitBreaker(name="openai-api", failure_threshold=5, recovery_timeout=60)

try:
    result = cb.call(my_llm_fn, prompt="Hello")
except CircuitOpenError as e:
    # Fast-fail: no call made, circuit is OPEN
    return fallback_response(retry_after=e.retry_after)
```

Stats available:
```python
cb.stats.total_calls        # 1247
cb.stats.rejected_calls     # 83 (fast-failed while OPEN)
cb.stats.state_transitions  # [{from: CLOSED, to: OPEN, ts: ...}]
```

All settings via env vars or a .env file:
```env
LLM_API_KEY=sk-...
LLM_MODEL=gpt-4o
LLM_MAX_RETRIES=3
LLM_RETRY_BACKOFF_SECONDS=1.0
LLM_CB_FAILURE_THRESHOLD=5
LLM_CB_RECOVERY_TIMEOUT=60
LLM_LOG_JSON=true
LLM_TIMEOUT_SECONDS=30
```

Putting it together:

```python
from config import settings
from monitor import LLMMonitor, LLMRequest
from evaluator import ResponseEvaluator
from circuit_breaker import CircuitBreaker, CircuitOpenError

cb = CircuitBreaker(
    name="llm-api",
    failure_threshold=settings.cb_failure_threshold,
    recovery_timeout=settings.cb_recovery_timeout,
)
monitor = LLMMonitor(my_llm_fn)
evaluator = ResponseEvaluator()

def reliable_llm_call(prompt: str) -> str:
    try:
        response = cb.call(monitor.call, LLMRequest(prompt=prompt))
    except CircuitOpenError as e:
        return f"[service unavailable, retry in {e.retry_after:.0f}s]"

    result = evaluator.evaluate(prompt, response.content)
    if not result.passed():
        # Log quality failure, trigger human review, return safe fallback
        logger.warning("quality_gate_failed", score=result.quality_score,
                       flags=result.hallucination_flags)
        return fallback_response()
    return response.content
```

Run the tests:

```bash
pytest tests/ -v
```

Expected: 20+ tests across all three modules, all passing.
| Need | Where to add |
|---|---|
| Real OpenAI calls | Replace mock in demo.py with openai.OpenAI().chat.completions.create |
| Async support | Wrap monitor.call in asyncio.to_thread or add async def acall |
| Model-graded evals | Add deepeval or ragas scorer in evaluator.py |
| Prometheus metrics | Export monitor.summary() to /metrics via prometheus_client |
| Persistent logging | Configure structlog to write to file or ship to Datadog/Grafana |
| Multi-provider routing | Swap make_mock_llm with a router that tries GPT-4 → Claude → Gemini |
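For the multi-provider row, a router can be as simple as trying a list of callables in order. This is a hypothetical sketch — the common `fn(prompt) -> str` interface is an assumption, and real GPT-4/Claude/Gemini clients would each need a thin adapter behind it:

```python
def route_with_fallback(providers, prompt):
    """Try each (name, fn) provider in order and return the first success.

    `providers` is a list of (name, callable) pairs sharing a
    fn(prompt) -> str interface (an assumption for this sketch)."""
    errors = []
    for name, fn in providers:
        try:
            return fn(prompt)
        except Exception as exc:  # any provider error means "try the next one"
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

In practice each provider would sit behind its own circuit breaker, so a hard-down provider is skipped in microseconds rather than waiting out a timeout.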
- Pydantic Settings over raw `os.environ` — validation, type coercion, `.env` support out of the box
- structlog over `logging` — structured JSON logs that parse cleanly in Datadog, CloudWatch, Loki
- Heuristic evaluation first — zero-cost checks catch 80% of failures; expensive model-graded evals run only on ambiguous cases
- Thread-safe circuit breaker — `threading.Lock` throughout; safe for concurrent request handlers
- Dataclass responses — not dicts; IDE autocomplete, type safety, serializable via `asdict()`
This template demonstrates skills directly relevant to:
- LLM reliability engineering — monitoring, evaluation, fault tolerance
- Agent infrastructure — composable observability layer for any agent framework
- Platform engineering — production patterns, structured logging, circuit breaking
MIT — use freely in client projects, products, and portfolios.