Implement LLM call categorization, coordination metrics suite, and orchestration tracking (DESIGN_SPEC §10.5 M4) #135
Closed
Labels
- prio:medium (Should do, but not blocking)
- scope:medium (1-3 days of work)
- spec:agent-system (DESIGN_SPEC Section 3 - Agent System)
- spec:budget (DESIGN_SPEC Section 10 - Cost & Budget Management)
- spec:providers (DESIGN_SPEC Section 9 - Model Provider Layer)
- spec:task-workflow (DESIGN_SPEC Section 6 - Task & Workflow Engine)
- type:feature (New feature implementation)
- type:test (Test coverage, test infrastructure)
Description
Context
M3 ships proxy metrics (turns/tokens/cost per task). M4 adds call categorization — classifying each LLM call by purpose to detect orchestration overhead — and a coordination metrics suite with 5 empirically-grounded metrics for data-driven tuning of multi-agent configurations. This builds on existing CostRecord infrastructure.
Acceptance Criteria
Call Categorization
- Each LLM call tagged with a category: productive (direct task work), coordination (delegation, status checks), system (planning, self-reflection, error recovery)
- Category stored as a field on CostRecord (or companion model)
- Categorization happens at the call site (engine/agent level), not in the provider layer
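A minimal sketch of what the tagging above could look like, assuming a Python codebase. Only the three category names and the CostRecord name come from this issue; the enum name `CallCategory`, the `record_call` helper, and every CostRecord field other than `category` are hypothetical placeholders.

```python
from dataclasses import dataclass
from enum import Enum


class CallCategory(Enum):
    """Purpose of an LLM call (category names taken from the issue text)."""
    PRODUCTIVE = "productive"      # direct task work
    COORDINATION = "coordination"  # delegation, status checks
    SYSTEM = "system"              # planning, self-reflection, error recovery


@dataclass(frozen=True)
class CostRecord:
    # Hypothetical fields: the real CostRecord may look different.
    agent_id: str
    task_id: str
    input_tokens: int
    output_tokens: int
    cost_usd: float
    category: CallCategory  # the new field this issue asks for


def record_call(agent_id: str, task_id: str, input_tokens: int,
                output_tokens: int, cost_usd: float,
                category: CallCategory) -> CostRecord:
    """Build a record at the call site (engine/agent level).

    The caller passes the category because only the call site knows why
    the call was made; the provider layer never sets it.
    """
    return CostRecord(agent_id, task_id, input_tokens, output_tokens,
                      cost_usd, category)


# Example: an agent delegating a subtask tags the call as coordination.
rec = record_call("agent-1", "task-9", 120, 40, 0.002,
                  CallCategory.COORDINATION)
```

Passing the category as an explicit argument keeps the provider layer unaware of call purpose, which matches the last criterion above.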
Orchestration Overhead Ratio
- orchestration_ratio = (coordination + system) / total, computed per task and per agent
- Ratio available in spending summaries and task completion metadata
- Tiered orchestration ratio alerts: info (>30%), warn (>50%), critical (>70%)
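The ratio and its alert tiers can be sketched as follows. The issue does not say whether "total" counts calls, tokens, or cost; this sketch assumes call counts, and the function names `orchestration_ratio` and `alert_tier` are illustrative only.

```python
from collections import Counter
from typing import Iterable, Optional


def orchestration_ratio(categories: Iterable[str]) -> float:
    """(coordination + system) / total, counting calls (an assumption)."""
    counts = Counter(categories)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no calls recorded yet, so no overhead to report
    return (counts["coordination"] + counts["system"]) / total


def alert_tier(ratio: float) -> Optional[str]:
    """Map a ratio to the tiers in the issue: >30% info, >50% warn, >70% critical."""
    if ratio > 0.70:
        return "critical"
    if ratio > 0.50:
        return "warn"
    if ratio > 0.30:
        return "info"
    return None  # at or below 30%: no alert


# Example: 6 productive, 3 coordination, 1 system call on one task.
calls = ["productive"] * 6 + ["coordination"] * 3 + ["system"]
ratio = orchestration_ratio(calls)  # 4 / 10
tier = alert_tier(ratio)            # "info"
```

The same functions work per task or per agent, depending on which calls are passed in. Weighting by tokens or cost instead of call count would only change how the numerator and denominator are summed.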
Coordination Metrics Suite (§10.5 — new)
- Coordination efficiency (Ec): success_rate / (turns / turns_sas); ROI of coordination
- Coordination overhead (O%): (turns_mas - turns_sas) / turns_sas × 100%; optimal band 200–300%
- Error amplification (Ae): error_rate_mas / error_rate_sas; error propagation factor
- Message density (c): inter-agent messages per reasoning turn
- Redundancy rate (R): mean cosine similarity of agent output embeddings
- All 5 metrics opt-in via coordination_metrics.enabled config
- Ec and O% are cheap (turn counting); Ae requires SAS baseline; c and R require semantic analysis
- Configurable baseline_window for establishing SAS comparison data
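The five formulas above can be written out as plain functions. This is a sketch under stated assumptions, not the spec's implementation: "turns" in Ec is taken to be the multi-agent turn count (turns_mas), R is taken to be the mean over all pairs of agent outputs, and all function names are hypothetical. Embeddings are assumed to arrive as precomputed lists of floats.

```python
import math
from itertools import combinations
from typing import Sequence


def coordination_efficiency(success_rate: float, turns_mas: int, turns_sas: int) -> float:
    """Ec = success_rate / (turns / turns_sas); assumes turns means turns_mas."""
    return success_rate / (turns_mas / turns_sas)


def coordination_overhead_pct(turns_mas: int, turns_sas: int) -> float:
    """O% = (turns_mas - turns_sas) / turns_sas * 100."""
    return (turns_mas - turns_sas) / turns_sas * 100.0


def error_amplification(error_rate_mas: float, error_rate_sas: float) -> float:
    """Ae = error_rate_mas / error_rate_sas; needs a non-zero SAS baseline."""
    return error_rate_mas / error_rate_sas


def message_density(inter_agent_messages: int, reasoning_turns: int) -> float:
    """c = inter-agent messages per reasoning turn."""
    return inter_agent_messages / reasoning_turns


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def redundancy_rate(embeddings: Sequence[Sequence[float]]) -> float:
    """R = mean pairwise cosine similarity of agent output embeddings."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        return 0.0  # fewer than two outputs: nothing to compare
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)


# Example: a multi-agent run taking 30 turns against a 10-turn SAS baseline.
overhead = coordination_overhead_pct(30, 10)  # 200.0, the low end of the stated band
```

None of these functions guard against a zero baseline (turns_sas or error_rate_sas equal to 0); a caller would need to skip the metric until baseline_window has produced SAS comparison data.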
Analytics Queries
- Query: breakdown by category for a given task
- Query: breakdown by category for a given agent over time
- Query: company-wide orchestration ratio
- Query: coordination metrics (Ec, O%, Ae, c, R) per task and per agent
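The category breakdown queries could be served by one aggregation with optional filters, sketched below over in-memory records. The record shape (plain dicts with `task_id`, `agent_id`, `category`, `cost_usd` keys) and the function name `breakdown_by_category` are assumptions for illustration; a real implementation would likely query whatever store holds CostRecord.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping, Optional


def breakdown_by_category(records: Iterable[Mapping],
                          task_id: Optional[str] = None,
                          agent_id: Optional[str] = None) -> Dict[str, float]:
    """Sum cost per category, optionally restricted to one task or one agent.

    With no filters this yields the company-wide breakdown.
    """
    totals: Dict[str, float] = defaultdict(float)
    for rec in records:
        if task_id is not None and rec["task_id"] != task_id:
            continue
        if agent_id is not None and rec["agent_id"] != agent_id:
            continue
        totals[rec["category"]] += rec["cost_usd"]
    return dict(totals)


# Hypothetical sample data.
records = [
    {"task_id": "t1", "agent_id": "a1", "category": "productive", "cost_usd": 0.5},
    {"task_id": "t1", "agent_id": "a2", "category": "coordination", "cost_usd": 0.25},
    {"task_id": "t2", "agent_id": "a1", "category": "system", "cost_usd": 0.125},
]

per_task = breakdown_by_category(records, task_id="t1")
per_agent = breakdown_by_category(records, agent_id="a1")
company_wide = breakdown_by_category(records)
```

An "over time" view for an agent would add a timestamp filter or bucket, which is omitted here because the issue does not specify the time granularity.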
Testing
- Unit tests for categorization logic
- Unit tests for all 5 coordination metrics calculations
- Integration test: multi-agent task with delegation → verify category breakdown
- Integration test: verify coordination metrics collection with opt-in config
Dependencies
- Implement per-call cost tracking and usage logging #7 — Per-call cost tracking (done)
- Implement single-task execution lifecycle (assign, execute, complete) #21 — Task lifecycle with proxy metrics (M3)
- Multi-agent execution (M4 prerequisite for coordination calls to exist)
Design Spec Reference
- §10.5 — LLM Call Analytics (M4: Call Categorization + Coordination Metrics Suite)
- §16.3 — Agent Scaling Research (Kim et al., 2025 — empirical basis for metrics)