
# LLM Reliability Starter

Production-grade boilerplate for building reliable LLM agents.
Lightweight monitoring, quality evaluation, and circuit breaking — drop it into any Python project.

Inspired by patterns from darshjme/arsenal.


## What This Is

A battle-tested template for LLM infrastructure engineers who need:

- **Observability** — every call logged with latency, tokens, cost, and retries
- **Quality gates** — response scoring before replies reach users
- **Fault isolation** — circuit breaker prevents cascade failures during provider outages
- **Retry resilience** — exponential backoff on transient errors, configurable per deployment

No vendor lock-in. Works with OpenAI, Anthropic, local Ollama, or any callable.


## Quick Start

```bash
pip install -r requirements.txt
python3 demo.py          # full demo, no API key needed
pytest tests/ -v         # 20+ unit tests
```

## Architecture

```text
llm-reliability-starter/
├── config.py          — Pydantic Settings (env-based, .env support)
├── monitor.py         — LLM call wrapper: latency, tokens, cost, retry, structured logs
├── evaluator.py       — Response quality scorer: relevance + coherence + hallucination flags
├── circuit_breaker.py — CLOSED → OPEN → HALF_OPEN state machine
├── demo.py            — End-to-end demo (5 scenes, mock LLM, no API key)
└── tests/
    └── test_monitor.py — 20+ unit tests across all three modules
```

## Module Breakdown

### monitor.py — LLMMonitor

Wraps any LLM callable with full observability:

```python
from monitor import LLMMonitor, LLMRequest

monitor = LLMMonitor(my_openai_fn, max_retries=3, backoff_seconds=1.0)
response = monitor.call(LLMRequest(prompt="Summarize this contract."))

print(response.latency_ms)           # 342.17
print(response.total_tokens)         # 847
print(response.estimated_cost_usd()) # 0.000127
print(monitor.summary())             # aggregate stats
```

Features:

- Per-call structured JSON logs (structlog)
- Configurable retry with exponential backoff
- Separate `retryable_exceptions` from fatal ones
- Running metrics: success rate, P99 latency, total spend
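The retry behavior can be sketched as a stand-alone function (a minimal approximation; the names, defaults, and jitter here are assumptions, not the actual `LLMMonitor` implementation):

```python
import random
import time

def call_with_retry(fn, *args, max_retries=3, backoff_seconds=1.0,
                    retryable=(TimeoutError, ConnectionError)):
    """Retry fn on transient errors with exponential backoff; re-raise fatal ones."""
    for attempt in range(max_retries + 1):
        try:
            return fn(*args)
        except retryable:
            if attempt == max_retries:
                raise  # out of retries: surface the transient error
            # exponential backoff (1x, 2x, 4x, ...) plus a little jitter
            time.sleep(backoff_seconds * (2 ** attempt) + random.uniform(0, 0.1))
```

The key design point is the exception split: only errors listed in `retryable` trigger backoff, so a malformed request fails immediately instead of burning retries.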

### evaluator.py — ResponseEvaluator

Scores responses 0–1 across three dimensions before they leave your system:

```python
from evaluator import ResponseEvaluator

evaluator = ResponseEvaluator(weights={"relevance": 0.5, "coherence": 0.3, "factuality": 0.2})
result = evaluator.evaluate(prompt="What is Python?", response=llm_output)

print(result.quality_score)         # 0.87
print(result.hallucination_flags)   # ['unsourced_authority_claim']
print(result.passed(threshold=0.7)) # True / False
```

Scoring:

| Dimension  | Method                         | What it catches                                      |
|------------|--------------------------------|------------------------------------------------------|
| Relevance  | Keyword overlap + length bonus | Off-topic, empty, generic responses                  |
| Coherence  | Structural pattern analysis    | Truncated, over-hedged, incoherent replies           |
| Factuality | Regex hallucination patterns   | Overconfident claims, unsourced stats, future dates  |

Hallucination patterns detected:

  • "Studies show" / "research proves" without source
  • Overconfident language: "definitely/certainly/absolutely true"
  • Large unverified statistics: "5.7 billion users"
  • Knowledge cutoff hedges when context doesn't warrant them
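Regex-based flagging along these lines can be sketched as follows (the pattern set and flag names are illustrative, not the ones actually shipped in `evaluator.py`):

```python
import re

# Hypothetical patterns mirroring the checks listed above.
HALLUCINATION_PATTERNS = {
    "unsourced_authority_claim": re.compile(r"\b(studies show|research proves)\b", re.I),
    "overconfident_language": re.compile(r"\b(definitely|certainly|absolutely) true\b", re.I),
    "unverified_statistic": re.compile(r"\b\d+(\.\d+)?\s+(billion|million)\s+users\b", re.I),
}

def hallucination_flags(text: str) -> list[str]:
    """Return the names of every heuristic pattern the text trips."""
    return [name for name, pat in HALLUCINATION_PATTERNS.items() if pat.search(text)]
```

Each flag is independent, so a response can trip several at once and the caller decides how many flags are disqualifying.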

Note: These are fast heuristic signals. Layer in model-graded evals (DeepEval, G-Eval) for production depth.


### circuit_breaker.py — CircuitBreaker

Classic three-state machine preventing cascade failures:

```text
CLOSED ──(N failures)──► OPEN ──(T seconds)──► HALF_OPEN ──(success)──► CLOSED
                                                          └──(failure)──► OPEN
```

```python
from circuit_breaker import CircuitBreaker, CircuitOpenError

cb = CircuitBreaker(name="openai-api", failure_threshold=5, recovery_timeout=60)

def guarded_call(prompt: str):
    try:
        return cb.call(my_llm_fn, prompt=prompt)
    except CircuitOpenError as e:
        # Fast-fail: no call was made, the circuit is OPEN
        return fallback_response(retry_after=e.retry_after)
```

Stats available:

```python
cb.stats.total_calls       # 1247
cb.stats.rejected_calls    # 83  (fast-failed while OPEN)
cb.stats.state_transitions # [{from: CLOSED, to: OPEN, ts: ...}]
```
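The three-state transition logic in the diagram above can be sketched in a few lines — a toy version without the locking, stats, or configurable exception types of the real `CircuitBreaker`:

```python
import time

class TinyBreaker:
    """Minimal CLOSED → OPEN → HALF_OPEN state machine (illustrative only)."""

    def __init__(self, failure_threshold=5, recovery_timeout=60.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn, *args):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.recovery_timeout:
                self.state = "HALF_OPEN"  # timeout elapsed: probe with one real call
            else:
                raise RuntimeError("circuit open: fast-fail, no call made")
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"       # trip (or re-trip) the breaker
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        self.state = "CLOSED"             # any success closes the circuit
        return result
```

Note the asymmetry: a single failure in HALF_OPEN re-opens the circuit immediately, while in CLOSED it takes `failure_threshold` consecutive failures to trip.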

## config.py — Environment Configuration

All settings via env vars or a `.env` file:

```ini
LLM_API_KEY=sk-...
LLM_MODEL=gpt-4o
LLM_MAX_RETRIES=3
LLM_RETRY_BACKOFF_SECONDS=1.0
LLM_CB_FAILURE_THRESHOLD=5
LLM_CB_RECOVERY_TIMEOUT=60
LLM_LOG_JSON=true
LLM_TIMEOUT_SECONDS=30
```
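`config.py` does this with Pydantic Settings; a dependency-free sketch of the same env-var mapping looks roughly like this (field names are assumed from the variables above, not copied from the real `Settings` class):

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    # Each field reads one LLM_* env var and coerces it to the right type,
    # falling back to a sensible default when the variable is unset.
    api_key: str = os.environ.get("LLM_API_KEY", "")
    model: str = os.environ.get("LLM_MODEL", "gpt-4o")
    max_retries: int = int(os.environ.get("LLM_MAX_RETRIES", "3"))
    retry_backoff_seconds: float = float(os.environ.get("LLM_RETRY_BACKOFF_SECONDS", "1.0"))
    cb_failure_threshold: int = int(os.environ.get("LLM_CB_FAILURE_THRESHOLD", "5"))
    cb_recovery_timeout: float = float(os.environ.get("LLM_CB_RECOVERY_TIMEOUT", "60"))
    log_json: bool = os.environ.get("LLM_LOG_JSON", "true").lower() == "true"
    timeout_seconds: float = float(os.environ.get("LLM_TIMEOUT_SECONDS", "30"))

settings = Settings()
```

Pydantic Settings adds what this sketch lacks: validation errors on bad values, automatic `.env` loading, and declarative env-prefix handling.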

## Composing the Stack

```python
import structlog

from config import settings
from monitor import LLMMonitor, LLMRequest
from evaluator import ResponseEvaluator
from circuit_breaker import CircuitBreaker, CircuitOpenError

logger = structlog.get_logger()

cb = CircuitBreaker(
    name="llm-api",
    failure_threshold=settings.cb_failure_threshold,
    recovery_timeout=settings.cb_recovery_timeout,
)
monitor = LLMMonitor(my_llm_fn)
evaluator = ResponseEvaluator()

def reliable_llm_call(prompt: str) -> str:
    try:
        response = cb.call(monitor.call, LLMRequest(prompt=prompt))
    except CircuitOpenError as e:
        return f"[service unavailable, retry in {e.retry_after:.0f}s]"

    result = evaluator.evaluate(prompt, response.content)
    if not result.passed():
        # Log the quality failure, trigger human review, return a safe fallback
        logger.warning("quality_gate_failed", score=result.quality_score, flags=result.hallucination_flags)
        return fallback_response()

    return response.content
```

## Running Tests

```bash
pytest tests/ -v
```

Expected: 20+ tests across all three modules, all passing.


## Extending This

| Need | Where to add |
|---|---|
| Real OpenAI calls | Replace the mock in `demo.py` with `openai.OpenAI().chat.completions.create` |
| Async support | Wrap `monitor.call` in `asyncio.to_thread` or add an `async def acall` |
| Model-graded evals | Add a `deepeval` or `ragas` scorer in `evaluator.py` |
| Prometheus metrics | Export `monitor.summary()` to `/metrics` via `prometheus_client` |
| Persistent logging | Configure structlog to write to a file or ship to Datadog/Grafana |
| Multi-provider routing | Swap `make_mock_llm` for a router that tries GPT-4 → Claude → Gemini |
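For the async row in the table, a hypothetical wrapper that keeps the synchronous `monitor.call` off the event loop could look like this (the `acall` name and monitor interface are assumptions based on the examples above):

```python
import asyncio

async def acall(monitor, request):
    """Run the blocking monitor.call in a worker thread so async handlers stay responsive."""
    return await asyncio.to_thread(monitor.call, request)
```

This keeps the monitor itself unchanged; only the call site becomes awaitable.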

## Design Decisions

- **Pydantic Settings over raw `os.environ`** — validation, type coercion, `.env` support out of the box
- **structlog over `logging`** — structured JSON logs that parse cleanly in Datadog, CloudWatch, Loki
- **Heuristic evaluation first** — zero-cost checks catch 80% of failures; expensive model-graded evals run only on ambiguous cases
- **Thread-safe circuit breaker** — `threading.Lock` throughout; safe for concurrent request handlers
- **Dataclass responses** — not dicts; IDE autocomplete, type safety, serializable via `asdict()`
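The dataclass-response point can be illustrated with a toy response type (field names borrowed from the monitor examples above, not the actual response class):

```python
import json
from dataclasses import asdict, dataclass

@dataclass
class ToyResponse:
    content: str
    latency_ms: float
    total_tokens: int

resp = ToyResponse(content="hi", latency_ms=342.17, total_tokens=847)
# asdict() turns the typed object into a plain dict, ready for JSON logs.
payload = json.dumps(asdict(resp))
```

A dict would serialize just as easily, but the dataclass gives attribute access, type hints, and a single place to document each field.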

## Relevant Job Targets

This template demonstrates skills directly relevant to:

- **LLM reliability engineering** — monitoring, evaluation, fault tolerance
- **Agent infrastructure** — composable observability layer for any agent framework
- **Platform engineering** — production patterns, structured logging, circuit breaking

## License

MIT — use freely in client projects, products, and portfolios.
