Commit aac2029

feat: engine intelligence -- quality signals, health monitoring, trajectory scoring, coordination metrics (#1099)
## Summary

Adds four engine intelligence features: step-level quality signals, two-layer health monitoring, best-of-K trajectory scoring, and distributed systems coordination metrics.

### #697: Step-Level Quality Signals

- `StepQuality` ternary enum (correct/neutral/incorrect) based on AgentProcessBench
- `StepQualityClassifier` protocol + `RuleBasedStepClassifier` (deterministic, no LLM cost)
- `AccuracyEffortRatio` model with computed `accuracy` and `ratio` fields
- `compute_accuracy_effort()` with weak-model-trap warning (early termination detection)
- Wired `accuracy_effort_ratio` into `TaskCompletionMetrics` and `CostRecord`

### #707: Two-Layer Health Monitoring Pipeline

- `EscalationTicket` model with severity, cause, evidence, quality signals
- `HealthJudge` (sensitive layer): emits tickets on stagnation, error+recovery, quality degradation
- `TriageFilter` (conservative layer): rule-based dismiss LOW, escalate HIGH/CRITICAL, threshold MEDIUM
- `HealthMonitoringPipeline`: composes judge + triage + NotificationSink delivery
- Added `HEALTH` notification category

### #705: TrajectoryScorer for HybridLoop

- `TrajectoryConfig` (off by default, K=2-5, complexity-gated, budget margin)
- `TrajectoryScorer` with self-consistency filter (majority-vote on fingerprints), verbalized confidence (log-space), trace length scoring
- `check_trajectory_budget()` budget guard for K-candidate sampling
- `CandidateResult` and `TrajectoryScore` models with computed `joint_score`
- Wired `TrajectoryConfig` into `HybridLoopConfig`

### #703: Coordination Metrics from Distributed Systems Theory

- `AmdahlCeiling`: S_max = 1/(1-p), recommended team size at 90% speedup
- `StragglerGap`: slowest - mean duration with cross-field validation
- `TokenSpeedupRatio`: token_multiplier / latency_speedup, alert at 2.0
- `MessageOverhead`: O(n^2) message growth detection
- Extended `CoordinationMetrics` container from 5 to 9 metrics

## Scope Note

This PR implements the **models, scoring logic, and metrics computation layer** for all four issues. The hybrid loop integration (wiring trajectory scoring into `hybrid_loop.py` with `asyncio.TaskGroup` for K-parallel candidates) and approval gate wiring (consuming quality signals at review boundaries) are follow-up work -- the infrastructure is complete and tested, the loop integration requires careful coordination with the execution path.

## Design Spec Updates Needed

After merge, update:

- `docs/design/engine.md` -- add Quality Signals, Health Monitoring, Trajectory Scoring sections
- `docs/design/operations.md` -- update Coordination Metrics table (5 to 9 metrics)

## Test Plan

- 163 new unit tests + 14 Hypothesis property tests (177 total new tests)
- Full suite: 14766 passed, 0 failed
- Pre-reviewed by 6 agents (code-reviewer, type-design-analyzer, silent-failure-hunter, docs-consistency, test-quality-reviewer, issue-resolution-verifier), 20 findings addressed

## Files

- **17 new source files** across 3 new engine subpackages + 4 event modules
- **14 new test files** with conftest
- **9 modified files** (CLAUDE.md, coordination_metrics, cost_record, hybrid_models, metrics, notifications/models, test_coordination_metrics, test_events)

Closes #697
Closes #707
Closes #705
Closes #703
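The formulas listed under #703 can be sketched as plain functions. This is a minimal illustration, not the PR's code: the function names and signatures are hypothetical and do not mirror the `AmdahlCeiling`, `StragglerGap`, or `TokenSpeedupRatio` models. Two readings are assumptions: "recommended team size at 90% speedup" is taken to mean the smallest team whose Amdahl speedup reaches 90% of the ceiling, and "alert at 2.0" is taken to mean `ratio >= 2.0`. `MessageOverhead` is left out because the summary does not state its detection rule.

```python
from statistics import mean


def amdahl_ceiling(parallel_fraction: float) -> float:
    """Upper bound on speedup: S_max = 1 / (1 - p)."""
    if not 0.0 <= parallel_fraction < 1.0:
        raise ValueError("parallel_fraction must be in [0, 1)")
    return 1.0 / (1.0 - parallel_fraction)


def amdahl_speedup(parallel_fraction: float, team_size: int) -> float:
    """Amdahl's law for n workers: S(n) = 1 / ((1 - p) + p / n)."""
    if team_size < 1:
        raise ValueError("team_size must be at least 1")
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / team_size)


def recommended_team_size(
    parallel_fraction: float,
    target_fraction: float = 0.9,  # assumed meaning of "at 90% speedup"
    max_size: int = 1024,  # hypothetical search cap
) -> int:
    """Smallest n whose speedup reaches target_fraction of the ceiling."""
    goal = target_fraction * amdahl_ceiling(parallel_fraction)
    for n in range(1, max_size + 1):
        if amdahl_speedup(parallel_fraction, n) >= goal:
            return n
    return max_size


def straggler_gap(durations: list[float]) -> float:
    """Slowest duration minus the mean duration."""
    if not durations:
        raise ValueError("durations must not be empty")
    return max(durations) - mean(durations)


def token_speedup_ratio(
    token_multiplier: float,
    latency_speedup: float,
    alert_threshold: float = 2.0,
) -> tuple[float, bool]:
    """Return (ratio, alert) where ratio = token_multiplier / latency_speedup."""
    if latency_speedup <= 0:
        raise ValueError("latency_speedup must be positive")
    ratio = token_multiplier / latency_speedup
    # Assumption: the alert fires when the ratio reaches the threshold.
    return ratio, ratio >= alert_threshold
```

Under these assumptions, a workload that is 60% parallelizable has a ceiling of 2.5x, and `recommended_team_size(0.6)` returns 14: adding agents beyond that point buys less than the remaining 10% of the achievable speedup.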
1 parent c845d22 commit aac2029

44 files changed

Lines changed: 4135 additions & 9 deletions

CLAUDE.md

Lines changed: 4 additions & 4 deletions
@@ -91,14 +91,14 @@ curl http://localhost:3000/api/v1/health # backend (via web proxy)
 src/synthorg/
 api/ # Litestar REST + WebSocket API, RFC 9457 errors, setup wizard, personality presets, auth/ (role-based access control, JWT sessions, session store, user presence, OrgRole enum for org config permissions), guards (HumanRole-based + OrgRole-based with department scoping via require_org_mutation), user management (CRUD + org-role grant/revoke), dto_org (request DTOs for company/department/agent mutations), services/org_mutations (read-modify-write config mutation service), auto-wiring, lifecycle (auto-promote first owner), bootstrap (agent registry init from config), template packs (list + live-apply), memory admin (fine-tuning pipeline with orchestrator, checkpoint management, preflight checks, run history, embedder queries), optimistic concurrency (ETag/If-Match), TLS config, tiered rate limiting (unauth by IP, auth by user ID), workflows (visual workflow definition CRUD, validation, YAML export, blueprint listing, blueprint instantiation, version history, diff, rollback), workflow executions (activate, list, get, cancel), ceremony policy (project + per-department query/override, resolved policy with field origins), quality overrides (per-agent quality score override CRUD), reports (on-demand report generation, period listing), notification_dispatcher (fan-out notification sink)
 backup/ # Backup/restore orchestrator, scheduler, retention, handlers/
-budget/ # Cost tracking, budget enforcement, quota degradation (including synchronous peek for routing-time selector hints), CFO optimization, trend analysis, budget forecasting, configurable currency formatting, risk budget (cumulative risk-unit tracking, risk scoring integration, risk check, risk records), automated reporting (periodic comprehensive reports, spending/performance/task-completion/risk-trends templates, report scheduling config)
+budget/ # Cost tracking, budget enforcement, quota degradation (including synchronous peek for routing-time selector hints), CFO optimization, trend analysis, budget forecasting, configurable currency formatting, risk budget (cumulative risk-unit tracking, risk scoring integration, risk check, risk records), automated reporting (periodic comprehensive reports, spending/performance/task-completion/risk-trends templates, report scheduling config), coordination metrics (9 empirical metrics: efficiency, overhead, error amplification, message density, redundancy, Amdahl ceiling, straggler gap, token/speedup ratio, message overhead)
 cli/ # Python CLI module (superseded by top-level cli/ Go binary)
 communication/ # Message bus, dispatcher, channels, delegation, conflict resolution, meeting/
 config/ # YAML company config loading and validation
 core/ # Shared domain models, base classes, resilience config
-engine/ # Orchestration, execution loops, task engine (observer registration, background observer dispatch), coordination, checkpoint recovery, structured failure diagnosis (FailureCategory, infer_failure_category, RecoveryResult failure_context/criteria_failed/stagnation_evidence), approval/review gates (no-self-review enforcement via SelfReviewError, immutable DecisionRecord drop-box), stagnation detection, context budget, compaction, hybrid loop, prompt profiles (tier-based prompt adaptation, personality trimming via max_personality_tokens), procedural memory integration (failure-driven), post_execution/ (extracted memory hooks -- distillation capture, procedural memory pipeline), workspace/ (git worktree isolation, merge orchestration, semantic conflict detection), workflow/ (Kanban board, Agile sprints, WIP limits, sprint lifecycle, velocity tracking, ceremony scheduling, strategy migration, strategies/ (pluggable scheduling strategies), velocity_calculators/ (pluggable velocity calculators), definition (visual workflow graph model, node/edge types, validation, YAML export), blueprint_loader (starter blueprint loading), blueprint_models (blueprint data models), blueprints/ (5 YAML starter templates), diff (version diff computation), version (version snapshot model), execution (workflow activation service, execution models, condition evaluator (compound AND/OR/NOT), graph utilities, execution_observer (TaskEngine bridge for lifecycle transitions), execution_activation_helpers (graph walking, conditional processing, task config parsing)))
+engine/ # Orchestration, execution loops, task engine (observer registration, background observer dispatch), coordination, checkpoint recovery, structured failure diagnosis (FailureCategory, infer_failure_category, RecoveryResult failure_context/criteria_failed/stagnation_evidence), approval/review gates (no-self-review enforcement via SelfReviewError, immutable DecisionRecord drop-box), stagnation detection, context budget, compaction, hybrid loop, prompt profiles (tier-based prompt adaptation, personality trimming via max_personality_tokens), procedural memory integration (failure-driven), post_execution/ (extracted memory hooks -- distillation capture, procedural memory pipeline), workspace/ (git worktree isolation, merge orchestration, semantic conflict detection), quality/ (step-level quality signal classifier, accuracy-effort ratio, StepQualityClassifier protocol), health/ (two-layer health monitoring pipeline, HealthJudge + TriageFilter, EscalationTicket, NotificationSink wiring), trajectory/ (best-of-K trajectory scoring, TrajectoryScorer, budget guard, TrajectoryConfig), workflow/ (Kanban board, Agile sprints, WIP limits, sprint lifecycle, velocity tracking, ceremony scheduling, strategy migration, strategies/ (pluggable scheduling strategies), velocity_calculators/ (pluggable velocity calculators), definition (visual workflow graph model, node/edge types, validation, YAML export), blueprint_loader (starter blueprint loading), blueprint_models (blueprint data models), blueprints/ (5 YAML starter templates), diff (version diff computation), version (version snapshot model), execution (workflow activation service, execution models, condition evaluator (compound AND/OR/NOT), graph utilities, execution_observer (TaskEngine bridge for lifecycle transitions), execution_activation_helpers (graph walking, conditional processing, task config parsing)))
 hr/ # Hiring, firing, onboarding, agent registry, performance tracking, activity timeline, activity event types, cost event redaction, career history, promotion/demotion, evaluation/ (five-pillar evaluation framework, pluggable pillar scoring strategies, EvaluationConfig), quality scoring (layered composite: CI signal + LLM judge + human override, QualityOverrideStore)
-notifications/ # NotificationSink protocol, NotificationDispatcher fan-out, Notification model (category + severity taxonomy), adapters/ (console, ntfy, slack, email), config
+notifications/ # NotificationSink protocol, NotificationDispatcher fan-out, Notification model (category taxonomy: approval/budget/security/stagnation/system/agent/health + severity taxonomy), adapters/ (console, ntfy, slack, email), config
 memory/ # Pluggable MemoryBackend, retrieval pipeline (hybrid dense+BM25 sparse with RRF fusion, MMR diversity re-ranking via apply_diversity_penalty with pre-computed bigram cache), tool-based injection strategy with iterative Search-and-Ask reformulation loop (fail-safe reformulator/sufficiency_checker), ToolRegistry memory tool wrappers (SearchMemoryTool, RecallMemoryTool), fail-closed memory filter, agentic query reformulation, org memory, backends/ (composite namespace-based routing, inmemory session-scoped, mem0 Qdrant+SQLite, EmbeddingCostConfig embedding cost tracking), consolidation/ (SimpleConsolidationStrategy, DualModeConsolidationStrategy density-aware, LLMConsolidationStrategy with parallel TaskGroup per-category processing + trajectory-context injection from distillation entries, LLMConsolidationConfig, DistillationRequest capture helper tagged "distillation" EPISODIC, retention, archival), embedding/ (LMEB-ranked model selection, embedder config resolution, fine-tuning pipeline with orchestrator, cancellation, checkpoint management), procedural/ (failure-driven auto-generation, proposer LLM pipeline, SKILL.md materialization, ProceduralMemoryConfig)
 persistence/ # Pluggable PersistenceBackend, SQLite, settings + user + artifact + project + preset + workflow definition + workflow execution + workflow version + fine-tune + decision record (append-only audit drop-box) repositories, artifact content storage (pluggable ArtifactStorageBackend, filesystem impl)
 observability/ # Structured logging, correlation tracking, redaction, third-party logger taming, log shipping (syslog, HTTP), compressed archival, events/
@@ -146,7 +146,7 @@ See `web/CLAUDE.md` for the full component inventory, design token rules, and po
 - **Every module** with business logic MUST have: `from synthorg.observability import get_logger` then `logger = get_logger(__name__)`
 - **Never** use `import logging` / `logging.getLogger()` / `print()` in application code (exception: `observability/setup.py`, `observability/sinks.py`, `observability/syslog_handler.py`, and `observability/http_handler.py` may use stdlib `logging` and `print(..., file=sys.stderr)` for handler construction, bootstrap, and error reporting code that runs before or during logging system configuration)
 - **Variable name**: always `logger` (not `_logger`, not `log`)
-- **Event names**: always use constants from the domain-specific module under `synthorg.observability.events` (e.g., `API_REQUEST_STARTED` from `events.api`, `TOOL_INVOKE_START` from `events.tool`, `GIT_COMMAND_START` from `events.git`, `CONTEXT_BUDGET_FILL_UPDATED` from `events.context_budget`, `BACKUP_STARTED` from `events.backup`, `SETUP_COMPLETED` from `events.setup`, `ROUTING_CANDIDATE_SELECTED` from `events.routing`, `SHIPPING_HTTP_BATCH_SENT` from `events.shipping`, `EVAL_REPORT_COMPUTED` from `events.evaluation`, `PROMPT_PROFILE_SELECTED` from `events.prompt`, `PROCEDURAL_MEMORY_START` from `events.procedural_memory`, `PERF_LLM_JUDGE_STARTED` from `events.performance`, `TASK_ENGINE_OBSERVER_FAILED` from `events.task_engine`, `WORKFLOW_EXEC_COMPLETED` from `events.workflow_execution`, `BLUEPRINT_INSTANTIATE_START` from `events.blueprint`, `WORKFLOW_DEF_ROLLED_BACK` from `events.workflow_definition`, `WORKFLOW_VERSION_SAVED` from `events.workflow_version`, `MEMORY_FINE_TUNE_STARTED` from `events.memory`, `REPORTING_GENERATION_STARTED` from `events.reporting`, `RISK_BUDGET_SCORE_COMPUTED` from `events.risk_budget`, `LLM_STRATEGY_SYNTHESIZED` and `DISTILLATION_CAPTURED` from `events.consolidation`, `MEMORY_DIVERSITY_RERANKED`, `MEMORY_DIVERSITY_RERANK_FAILED`, and `MEMORY_REFORMULATION_ROUND` from `events.memory`, `NOTIFICATION_DISPATCHED` and `NOTIFICATION_DISPATCH_FAILED` from `events.notification`). Each domain has its own module -- see `src/synthorg/observability/events/` for the full inventory of constants. Import directly: `from synthorg.observability.events.<domain> import EVENT_CONSTANT`
+- **Event names**: always use constants from the domain-specific module under `synthorg.observability.events` (e.g., `API_REQUEST_STARTED` from `events.api`, `TOOL_INVOKE_START` from `events.tool`, `GIT_COMMAND_START` from `events.git`, `CONTEXT_BUDGET_FILL_UPDATED` from `events.context_budget`, `BACKUP_STARTED` from `events.backup`, `SETUP_COMPLETED` from `events.setup`, `ROUTING_CANDIDATE_SELECTED` from `events.routing`, `SHIPPING_HTTP_BATCH_SENT` from `events.shipping`, `EVAL_REPORT_COMPUTED` from `events.evaluation`, `PROMPT_PROFILE_SELECTED` from `events.prompt`, `PROCEDURAL_MEMORY_START` from `events.procedural_memory`, `PERF_LLM_JUDGE_STARTED` from `events.performance`, `TASK_ENGINE_OBSERVER_FAILED` from `events.task_engine`, `WORKFLOW_EXEC_COMPLETED` from `events.workflow_execution`, `BLUEPRINT_INSTANTIATE_START` from `events.blueprint`, `WORKFLOW_DEF_ROLLED_BACK` from `events.workflow_definition`, `WORKFLOW_VERSION_SAVED` from `events.workflow_version`, `MEMORY_FINE_TUNE_STARTED` from `events.memory`, `REPORTING_GENERATION_STARTED` from `events.reporting`, `RISK_BUDGET_SCORE_COMPUTED` from `events.risk_budget`, `LLM_STRATEGY_SYNTHESIZED` and `DISTILLATION_CAPTURED` from `events.consolidation`, `MEMORY_DIVERSITY_RERANKED`, `MEMORY_DIVERSITY_RERANK_FAILED`, and `MEMORY_REFORMULATION_ROUND` from `events.memory`, `NOTIFICATION_DISPATCHED` and `NOTIFICATION_DISPATCH_FAILED` from `events.notification`, `QUALITY_STEP_CLASSIFIED` from `events.quality`, `HEALTH_TICKET_EMITTED` from `events.health`, `TRAJECTORY_SCORING_START` from `events.trajectory`, `COORD_METRICS_AMDAHL_COMPUTED` from `events.coordination_metrics`). Each domain has its own module -- see `src/synthorg/observability/events/` for the full inventory of constants. Import directly: `from synthorg.observability.events.<domain> import EVENT_CONSTANT`
 - **Structured kwargs**: always `logger.info(EVENT, key=value)` -- never `logger.info("msg %s", val)`
 - **All error paths** must log at WARNING or ERROR with context before raising
 - **All state transitions** must log at INFO
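
The logging rules in this hunk can be illustrated with a self-contained sketch. Because `synthorg.observability` is not importable outside the project, the block substitutes a small stand-in: `StubLogger`, `classify_step`, the constant `QUALITY_STEP_REJECTED`, and both event string values are hypothetical. Only the name `QUALITY_STEP_CLASSIFIED` and the `logger.info(EVENT, key=value)` call shape come from the text above.

```python
from typing import Any

# Hypothetical string values. In the project the constant would be imported:
# from synthorg.observability.events.quality import QUALITY_STEP_CLASSIFIED
QUALITY_STEP_CLASSIFIED = "quality.step.classified"
QUALITY_STEP_REJECTED = "quality.step.rejected"  # hypothetical constant


class StubLogger:
    """Stand-in for the object returned by get_logger(__name__)."""

    def __init__(self) -> None:
        self.records: list[tuple[str, str, dict[str, Any]]] = []

    def info(self, event: str, **kwargs: Any) -> None:
        self.records.append(("info", event, kwargs))

    def warning(self, event: str, **kwargs: Any) -> None:
        self.records.append(("warning", event, kwargs))


# The variable is always named `logger`.
# Real code would use: logger = get_logger(__name__)
logger = StubLogger()


def classify_step(step_index: int, tool_succeeded: bool) -> str:
    """Hypothetical business function that follows the logging rules."""
    if step_index < 0:
        # Error path: log at WARNING with context before raising.
        logger.warning(
            QUALITY_STEP_REJECTED,
            step_index=step_index,
            reason="negative index",
        )
        raise ValueError("step_index must be non-negative")
    quality = "correct" if tool_succeeded else "incorrect"
    # Structured kwargs: event constant first, then key=value pairs.
    logger.info(QUALITY_STEP_CLASSIFIED, step_index=step_index, quality=quality)
    return quality
```

Passing the event constant and keyword arguments, rather than a formatted message string, keeps each record machine-parseable, which is what the "Structured kwargs" rule asks for.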

src/synthorg/budget/__init__.py

Lines changed: 8 additions & 0 deletions
@@ -22,12 +22,16 @@
     OrchestrationAlertThresholds,
 )
 from synthorg.budget.coordination_metrics import (
+    AmdahlCeiling,
     CoordinationEfficiency,
     CoordinationMetrics,
     CoordinationOverhead,
     ErrorAmplification,
     MessageDensity,
+    MessageOverhead,
     RedundancyRate,
+    StragglerGap,
+    TokenSpeedupRatio,
 )
 from synthorg.budget.cost_record import CostRecord
 from synthorg.budget.cost_tiers import (
@@ -131,6 +135,7 @@
     "AgentEfficiency",
     "AgentPerformanceSummary",
     "AgentSpending",
+    "AmdahlCeiling",
     "AnomalyDetectionResult",
     "AnomalySeverity",
     "AnomalyType",
@@ -174,6 +179,7 @@
     "ErrorTaxonomyConfig",
     "LLMCallCategory",
     "MessageDensity",
+    "MessageOverhead",
     "ModelDistribution",
     "OrchestrationAlertLevel",
     "OrchestrationAlertThresholds",
@@ -209,10 +215,12 @@
     "SpendingAnomaly",
     "SpendingReport",
     "SpendingSummary",
+    "StragglerGap",
     "SubscriptionConfig",
     "TaskCompletionReport",
     "TaskSpending",
     "TeamBudget",
+    "TokenSpeedupRatio",
     "billing_period_start",
     "classify_model_tier",
     "compute_rebalance",
