fix: address PR review items from CodeRabbit, Gemini, and Copilot

Aureliolo · Aureliolo · commit ffb8b571c600 · 2026-04-07T20:19:09.000+02:00
- engine.md: fix CompactionConfig field name threshold_percent to fill_threshold_percent
- engine.md: expand TF-IDF proxy implementation details in Phase 3 roadmap
- agent-controlled-compaction.md: correct invoke_compaction() argument order and add missing turn_number
- control-plane-audit.md: fix approval endpoints (decide split into approve + reject)
- control-plane-audit.md: fix analytics endpoint (overview/trends/forecast)
- control-plane-audit.md: fix budget endpoints (config, records, agents/{id})
- acg-formalism-evaluation.md: clarify CoordinationResult frozen=True integration approach
- acg-formalism-evaluation.md: specify API surface for quality-cost tradeoff exposure
- acg-formalism-evaluation.md: add language tag to PruningEvaluation code block
- acg-formalism-evaluation.md: specify all required ApprovalItem fields for pruning gate
- acg-formalism-evaluation.md: add backward compat note for optional TurnRecord.node_types
- multi-agent-failure-audit.md: add language tag to fallback-chain code block
- communication.md: document concrete circuit breaker exponential backoff options
diff --git a/docs/design/communication.md b/docs/design/communication.md
@@ -492,8 +492,20 @@ Five mechanisms protect against swarm drift (`communication/loop_prevention/guar
 
 **Known risk -- circuit breaker bounce count reset**: After cooldown, the state entry is
 evicted entirely, resetting the bounce count to 0. Slow-burn delegation patterns (>60s
-between delegations) can bypass all five guards after each cooldown expiry. Mitigation:
-use exponential backoff on cooldown resets or a non-resetting global bounce counter.
+between delegations) can bypass all five guards after each cooldown expiry.
+
+Recommended mitigation -- two options:
+
+1. **Exponential backoff on cooldown**: instead of evicting the entry, retain it and
+   apply `cooldown_seconds = base_cooldown * 2 ** bounce_count`. Each bounce extends the
+   cooldown duration exponentially, making slow-burn bypass progressively harder.
+2. **Non-resetting global bounce counter**: store a per-pair lifetime bounce count
+   separate from the per-window circuit breaker. Once the lifetime count exceeds a
+   threshold (e.g., 10), escalate to a permanent circuit-open state requiring manual
+   reset.
+
+Option 1 is simpler to implement within `circuit_breaker.py` without breaking the
+existing eviction model. Option 2 is more robust against very long-horizon patterns.
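
As a rough illustration of Option 1, the sketch below retains a per-pair entry and derives the cooldown from its bounce count. `PairState`, `is_open`, and the `max_cooldown` cap are hypothetical names introduced here for illustration; they are not taken from `circuit_breaker.py`, and the cap is an added assumption to keep the cooldown bounded.

```python
from dataclasses import dataclass


@dataclass
class PairState:
    """Hypothetical retained circuit breaker entry for one delegation pair."""

    bounce_count: int = 0
    opened_at: float = 0.0  # timestamp (seconds) when the breaker last opened


def cooldown_seconds(
    base_cooldown: float, bounce_count: int, max_cooldown: float = 3600.0
) -> float:
    """Return base_cooldown * 2 ** bounce_count, capped at max_cooldown."""
    # Clamp the exponent so a very large count cannot overflow the float.
    exponent = min(bounce_count, 32)
    return min(base_cooldown * (2**exponent), max_cooldown)


def is_open(state: PairState, now: float, base_cooldown: float) -> bool:
    """True while the pair is still inside its (growing) cooldown window."""
    elapsed = now - state.opened_at
    return elapsed < cooldown_seconds(base_cooldown, state.bounce_count)
```

With a 60-second base, two recorded bounces give a 240-second window, so a delegation attempted 200 seconds after the breaker opened would still be blocked, whereas the current evict-on-expiry behaviour would already have let it through.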
 
 **Known risk -- in-memory state**: All guard state (circuit breaker, dedup window, rate
 limiter) is in-memory. Service restart resets all guardrails. Consider persisting circuit
diff --git a/docs/design/engine.md b/docs/design/engine.md
@@ -1500,7 +1500,7 @@ Key claims from the ACG survey validated against SynthOrg's architecture:
 *Research findings from #687. See also: `docs/research/agent-controlled-compaction.md`.*
 
 Context compaction is invoked at turn boundaries when context fill exceeds the configured
-threshold (`CompactionConfig.threshold_percent`, default 80%). The `invoke_compaction()`
+threshold (`CompactionConfig.fill_threshold_percent`, default 80%). The `invoke_compaction()`
 helper in `engine/loop_helpers.py` is shared across all three execution loops.
 
 ### Current Implementation
@@ -1550,6 +1550,11 @@ reasoning artifacts.
 **Phase 3**: Evaluate surprisal-based token cost (arXiv:2603.08462) -- per-token cost
 weighted by surprisal under a frozen base model. Empirical results: 41% token reduction,
 <1.5% accuracy drop. **Not recommended for Phase 1/2**: inference cost (forward pass
-per token) is not justified until Phase 2 data validates the need. TF-IDF importance
-weighting is the recommended lighter proxy if semantic token cost is needed before
-Phase 3.
+per token) is not justified until Phase 2 data validates the need.
+
+If semantic token cost is needed before Phase 3, the recommended lighter proxy is
+**TF-IDF importance weighting**: build a TF-IDF corpus from the current context turns,
+score each token, and treat low-scoring tokens (below a tunable percentile threshold)
+as compressible filler. The resulting importance map can drive selective truncation in
+`_build_summary()` without any additional model inference -- a significantly cheaper
+approximation of the surprisal signal.
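
A minimal sketch of the TF-IDF proxy described above, assuming each context turn is already tokenized into a list of strings and is treated as one document. The function names, the smoothed IDF formula, and the nearest-rank percentile cutoff are illustrative assumptions, not part of the existing `_build_summary()` code.

```python
import math
from collections import Counter


def tfidf_scores(turns: list[list[str]]) -> list[dict[str, float]]:
    """Per-turn TF-IDF score for each distinct token (one turn = one document)."""
    n_turns = len(turns)
    # Document frequency: number of turns in which each token appears.
    doc_freq = Counter(tok for turn in turns for tok in set(turn))
    scores = []
    for turn in turns:
        counts = Counter(turn)
        total = len(turn) or 1
        scores.append(
            {
                # Smoothed IDF: a token present in every turn scores 0.
                tok: (cnt / total) * math.log((1 + n_turns) / (1 + doc_freq[tok]))
                for tok, cnt in counts.items()
            }
        )
    return scores


def compressible_tokens(scores: dict[str, float], percentile: float = 25.0) -> set[str]:
    """Tokens whose score is at or below the given percentile within the turn."""
    if not scores:
        return set()
    ranked = sorted(scores.values())
    # Nearest-rank percentile: index of the cutoff score in the sorted list.
    idx = max(0, math.ceil(percentile / 100 * len(ranked)) - 1)
    cutoff = ranked[idx]
    return {tok for tok, score in scores.items() if score <= cutoff}
```

Tokens that occur in every turn receive a score of zero under this weighting, so they are the first candidates for truncation; the percentile is the tunable threshold mentioned above.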
diff --git a/docs/research/acg-formalism-evaluation.md b/docs/research/acg-formalism-evaluation.md
@@ -131,7 +131,10 @@ quality-cost tradeoff. The `DegradationConfig` and quota degradation strategies
 (Amdahl ceiling, straggler gap) provide efficiency bounds.
 
 **Implication**: The existing budget architecture is sound. The missing piece is exposing
-the quality-cost tradeoffs in the API (see #688 coordination metrics gap).
+the quality-cost tradeoffs via the REST API: specifically, the `GET /tasks/{id}` response
+and the `CoordinationResult` Python type should surface cost, quality, and efficiency
+metadata (estimated cost, actual cost, quality score, Amdahl ceiling, straggler gap).
+See #688 coordination metrics gap (Gap G4) for the full scoping.
 
 ---
 
@@ -159,7 +162,14 @@ SynthOrg currently attributes all failure information to the executing agent's
 
 ### Proposed Design
 
-**AgentContribution model** -- extend `CoordinationResult`:
+**AgentContribution model** -- integrate with `CoordinationResult`:
+
+Note: `CoordinationResult` has `model_config = ConfigDict(frozen=True)`. Adding
+`agent_contributions` directly is a breaking change. The recommended approach is a
+separate wrapper: `CoordinationResultWithAttribution(result: CoordinationResult,
+agent_contributions: tuple[AgentContribution, ...])`, stored and returned in place of
+the bare result by `_post_execution_pipeline`. This preserves immutability and avoids
+migrating existing persisted `CoordinationResult` records.
 
 ```python
 class AgentContribution(BaseModel):
@@ -231,7 +241,7 @@ Four signal categories that should drive pruning recommendations:
 
 ### Proposed Protocol
 
-```
+```python
 PruningEvaluation (new model)
   agent_id: str
   pruning_score: float   # 0.0 = retain, 1.0 = prune
@@ -253,9 +263,19 @@ PruningService (new service)
 ```
 
 **Human approval gate**: Any `PruningEvaluation` with `recommendation="PRUNE"` creates an
-`ApprovalItem` with `action_type="org:prune"` and `ApprovalRiskLevel.MEDIUM`. This follows
-the same approval pattern used by the hiring and promotion pipelines. Pruning is never
-fully automated -- it is recommendation + human approval.
+`ApprovalItem` following the same approval pattern used by the hiring and promotion
+pipelines. Required fields:
+
+- `id`: unique UUID per `PruningEvaluation`
+- `title`: short summary, e.g. `"Prune agent {agent_id} ({reason})"`
+- `description`: rationale from `PruningEvaluation.signals` (quality decline slope,
+  utilization, Jaccard overlap), affected team, and safety constraint check results
+- `requested_by`: the `PruningService` identifier or calling system
+- `action_type`: `"org:prune"`
+- `risk_level`: `ApprovalRiskLevel.MEDIUM`
+- `created_at`: ISO 8601 timestamp
+
+Pruning is never fully automated -- it is always a recommendation followed by human approval.
 
 ### Safety Constraints
 
@@ -322,6 +342,13 @@ node types executed in that turn would improve execution trace analysis without
 significant refactoring. This is optional but would directly enable structural credit
 assignment (knowing which node type failed).
 
+**Backward compatibility**: `TurnRecord` is part of execution traces and may be
+persisted. The `node_types` field must be added as **optional with a default** (e.g.,
+`node_types: tuple[NodeType, ...] = ()`) so existing records remain valid without
+migration. Serialization/deserialization must tolerate the absent field. Consumers
+(trace analyzers, evaluation pipelines) should treat an empty tuple as "unknown
+composition" rather than erroring.
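
The compatibility rule above can be sketched as follows. A frozen dataclass stands in for the real `TurnRecord` model, and the `NodeType` members, the `turn_number` field, and both helper functions are hypothetical names used only to show the defaulting behaviour.

```python
from dataclasses import dataclass
from enum import Enum


class NodeType(Enum):
    """Hypothetical node types; the real enum members may differ."""

    LLM_CALL = "llm_call"
    TOOL_CALL = "tool_call"


@dataclass(frozen=True)
class TurnRecord:
    """Stand-in for the real model, reduced to the fields needed here."""

    turn_number: int
    node_types: tuple[NodeType, ...] = ()  # optional, defaults to empty


def turn_record_from_dict(data: dict) -> TurnRecord:
    """Deserialize a record, tolerating payloads written before node_types existed."""
    raw = data.get("node_types", ())
    return TurnRecord(
        turn_number=data["turn_number"],
        node_types=tuple(NodeType(value) for value in raw),
    )


def describe_composition(record: TurnRecord) -> str:
    """Treat an empty tuple as unknown composition instead of raising."""
    if not record.node_types:
        return "unknown composition"
    return ", ".join(node_type.value for node_type in record.node_types)
```

A record persisted before the field was introduced loads with an empty tuple and is reported as unknown composition, so no migration step is needed for existing traces.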
+
 ---
 
 ## Summary of Recommendations
diff --git a/docs/research/agent-controlled-compaction.md b/docs/research/agent-controlled-compaction.md
@@ -311,7 +311,9 @@ compaction directive in the results and applies compaction via `invoke_compactio
 ```python
 # In loop_helpers.py execute_tool_calls() or in the loop's per-turn handler:
 if any(r.metadata.get("compaction_directive") for r in tool_results):
-    ctx = await invoke_compaction(compaction_callback, ctx)
+    compacted = await invoke_compaction(ctx, compaction_callback, turn_number)
+    if compacted is not None:
+        ctx = compacted
 ```
 
 This preserves the immutable context pattern: the tool signals intent, the loop applies the
diff --git a/docs/research/control-plane-audit.md b/docs/research/control-plane-audit.md
@@ -99,7 +99,8 @@ agents at runtime, without per-agent configuration.
 | Get autonomy config | `GET /agents/{id}/autonomy` | Per-agent override |
 | Set autonomy config | `PUT /agents/{id}/autonomy` | Write access |
 | List pending approvals | `GET /approvals` | Approval queue for escalated actions |
-| Decide approval | `POST /approvals/{id}/decide` | CEO/manager/board |
+| Approve pending approval | `POST /approvals/{approval_id}/approve` | CEO/manager/board approval action |
+| Reject pending approval | `POST /approvals/{approval_id}/reject` | CEO/manager/board rejection action |
 
 **"Write Once, Enforce Everywhere" Validation**:
 
@@ -159,19 +160,20 @@ queryable history and enforcement at multiple boundaries.
 
 | Operation | Endpoint | Notes |
 |---|---|---|
-| Current budget status | `GET /budget` | Total spent, remaining, utilization % |
-| Spending history | `GET /budget/history` | Historical records |
-| Set/update budget | `POST /budget` | Write access |
+| Budget configuration | `GET /budget/config` | Budget settings, thresholds, and enforcement config |
+| Spending records | `GET /budget/records` | Paginated cost records with optional `agent_id`/`task_id` filters, plus daily/period summaries |
+| Agent budget records | `GET /budget/agents/{agent_id}` | Per-agent total spending |
 | Generate report | `POST /reports/generate` | Spending, performance, risk trends |
 
-**Coverage**: Basic budget queries are covered. `GET /budget` returns utilization percentage
-and alert status. `GET /budget/history` returns spending records.
+**Coverage**: Basic budget queries are covered. `GET /budget/config` exposes budget
+configuration. `GET /budget/records` returns paginated spending records with daily and period
+summaries. `GET /budget/agents/{agent_id}` provides per-agent cost totals.
 
 **Gap -- G6**: `CostTracker` is in-memory with TTL eviction; it is not a durable time-series
 store. Budget history granularity is limited -- the tracker supports `get_agent_cost(agent_id,
-start=)` and `get_total_cost(start=)` but the API endpoint does not expose multi-dimensional
+start=)` and `get_total_cost(start=)` but the API does not expose multi-dimensional
 queries (e.g., spending by provider X for agent Y during period Z). External cost dashboards
-need this level of attribution. The persistence layer backing `GET /budget/history` needs
+need this level of attribution. The persistence layer backing `GET /budget/records` needs
 inspection to confirm whether it provides richer query semantics than the in-memory tracker.
 
 **Gap -- G4**: The 9 coordination metrics (`budget/coordination_metrics.py`) are computed
@@ -205,7 +207,9 @@ formats.
 
 | Operation | Endpoint | Notes |
 |---|---|---|
-| Analytics dashboard | `GET /analytics` | Summary metrics |
+| Analytics overview | `GET /analytics/overview` | Summary metrics (task counts, cost totals, budget status) |
+| Analytics trends | `GET /analytics/trends` | Time-series cost and task-completion trends |
+| Analytics forecast | `GET /analytics/forecast` | Forward-looking spend projections |
 | Generate report | `POST /reports/generate` | Spending, performance, task completion |
 | List log sinks | `GET /settings/observability/sinks` | Current sink configuration |
 | Test sink connectivity | `POST /settings/observability/sinks/_test` | CEO/manager |
diff --git a/docs/research/multi-agent-failure-audit.md b/docs/research/multi-agent-failure-audit.md
@@ -171,7 +171,7 @@ Single LLM review call. If the winner matches a participant, auto-resolves. On a
 
 ### Complete Fallback Chain
 
-```
+```text
 HybridResolver
   └─ clear winner found  ─────────────────────────────→ RESOLVED_BY_HYBRID
   └─ ambiguous + escalate_on_ambiguity=True  ──────────→ ESCALATED_TO_HUMAN (stub)