Skip to content

Commit ffb8b57

Browse files
committed
fix: address PR review items from CodeRabbit, Gemini, and Copilot
- engine.md: fix CompactionConfig field name threshold_percent to fill_threshold_percent - engine.md: expand TF-IDF proxy implementation details in Phase 3 roadmap - agent-controlled-compaction.md: correct invoke_compaction() argument order and add missing turn_number - control-plane-audit.md: fix approval endpoints (decide split into approve + reject) - control-plane-audit.md: fix analytics endpoint (overview/trends/forecast) - control-plane-audit.md: fix budget endpoints (config, records, agents/{id}) - acg-formalism-evaluation.md: clarify CoordinationResult frozen=True integration approach - acg-formalism-evaluation.md: specify API surface for quality-cost tradeoff exposure - acg-formalism-evaluation.md: add language tag to PruningEvaluation code block - acg-formalism-evaluation.md: specify all required ApprovalItem fields for pruning gate - acg-formalism-evaluation.md: add backward compat note for optional TurnRecord.node_types - multi-agent-failure-audit.md: add language tag to fallback-chain code block - communication.md: document concrete circuit breaker exponential backoff options
1 parent 72298ee commit ffb8b57

6 files changed

Lines changed: 73 additions & 23 deletions

File tree

docs/design/communication.md

Lines changed: 14 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -492,8 +492,20 @@ Five mechanisms protect against swarm drift (`communication/loop_prevention/guar
492492

493493
**Known risk -- circuit breaker bounce count reset**: After cooldown, the state entry is
494494
evicted entirely, resetting the bounce count to 0. Slow-burn delegation patterns (>60s
495-
between delegations) can bypass all five guards after each cooldown expiry. Mitigation:
496-
use exponential backoff on cooldown resets or a non-resetting global bounce counter.
495+
between delegations) can bypass all five guards after each cooldown expiry.
496+
497+
Recommended mitigation -- two options:
498+
499+
1. **Exponential backoff on cooldown**: instead of evicting the entry, retain it and
500+
apply `cooldown_seconds = base_cooldown * 2^bounce_count`. Each bounce extends the
501+
cooldown duration exponentially, making slow-burn bypass progressively harder.
502+
2. **Non-resetting global bounce counter**: store a per-pair lifetime bounce count
503+
separate from the per-window circuit breaker. Once the lifetime count exceeds a
504+
threshold (e.g., 10), escalate to a permanent circuit-open state requiring manual
505+
reset.
506+
507+
Option 1 is simpler to implement within `circuit_breaker.py` without breaking the
508+
existing eviction model. Option 2 is more robust against very long-horizon patterns.
497509

498510
**Known risk -- in-memory state**: All guard state (circuit breaker, dedup window, rate
499511
limiter) is in-memory. Service restart resets all guardrails. Consider persisting circuit

docs/design/engine.md

Lines changed: 9 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -1500,7 +1500,7 @@ Key claims from the ACG survey validated against SynthOrg's architecture:
15001500
*Research findings from #687. See also: `docs/research/agent-controlled-compaction.md`.*
15011501
15021502
Context compaction is invoked at turn boundaries when context fill exceeds the configured
1503-
threshold (`CompactionConfig.threshold_percent`, default 80%). The `invoke_compaction()`
1503+
threshold (`CompactionConfig.fill_threshold_percent`, default 80%). The `invoke_compaction()`
15041504
helper in `engine/loop_helpers.py` is shared across all three execution loops.
15051505
15061506
### Current Implementation
@@ -1550,6 +1550,11 @@ reasoning artifacts.
15501550
**Phase 3**: Evaluate surprisal-based token cost (arXiv:2603.08462) -- per-token cost
15511551
weighted by surprisal under a frozen base model. Empirical results: 41% token reduction,
15521552
<1.5% accuracy drop. **Not recommended for Phase 1/2**: inference cost (forward pass
1553-
per token) is not justified until Phase 2 data validates the need. TF-IDF importance
1554-
weighting is the recommended lighter proxy if semantic token cost is needed before
1555-
Phase 3.
1553+
per token) is not justified until Phase 2 data validates the need.
1554+
1555+
If semantic token cost is needed before Phase 3, the recommended lighter proxy is
1556+
**TF-IDF importance weighting**: build a TF-IDF corpus from the current context turns,
1557+
score each token, and treat low-scoring tokens (below a tunable percentile threshold)
1558+
as compressible filler. The resulting importance map can drive selective truncation in
1559+
`_build_summary()` without any additional model inference -- a significantly cheaper
1560+
approximation of the surprisal signal.

docs/research/acg-formalism-evaluation.md

Lines changed: 33 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -131,7 +131,10 @@ quality-cost tradeoff. The `DegradationConfig` and quota degradation strategies
131131
(Amdahl ceiling, straggler gap) provide efficiency bounds.
132132

133133
**Implication**: The existing budget architecture is sound. The missing piece is exposing
134-
the quality-cost tradeoffs in the API (see #688 coordination metrics gap).
134+
the quality-cost tradeoffs via the REST API: specifically, `GET /tasks/{id}` response
135+
and the `CoordinationResult` Python type should surface cost, quality, and efficiency
136+
metadata (estimated cost, actual cost, quality score, Amdahl ceiling, straggler gap).
137+
See #688 coordination metrics gap (Gap G4) for the full scoping.
135138

136139
---
137140

@@ -159,7 +162,14 @@ SynthOrg currently attributes all failure information to the executing agent's
159162

160163
### Proposed Design
161164

162-
**AgentContribution model** -- extend `CoordinationResult`:
165+
**AgentContribution model** -- integrate with `CoordinationResult`:
166+
167+
Note: `CoordinationResult` has `model_config = ConfigDict(frozen=True)`. Adding
168+
`agent_contributions` directly is a breaking change. The recommended approach is a
169+
separate wrapper: `CoordinationResultWithAttribution(result: CoordinationResult,
170+
agent_contributions: tuple[AgentContribution, ...])`, stored and returned in place of
171+
the bare result by `_post_execution_pipeline`. This preserves immutability and avoids
172+
migrating existing persisted `CoordinationResult` records.
163173

164174
```python
165175
class AgentContribution(BaseModel):
@@ -231,7 +241,7 @@ Four signal categories that should drive pruning recommendations:
231241

232242
### Proposed Protocol
233243

234-
```
244+
```python
235245
PruningEvaluation (new model)
236246
agent_id: str
237247
pruning_score: float # 0.0 = retain, 1.0 = prune
@@ -253,9 +263,19 @@ PruningService (new service)
253263
```
254264

255265
**Human approval gate**: Any `PruningEvaluation` with `recommendation="PRUNE"` creates an
256-
`ApprovalItem` with `action_type="org:prune"` and `ApprovalRiskLevel.MEDIUM`. This follows
257-
the same approval pattern used by the hiring and promotion pipelines. Pruning is never
258-
fully automated -- it is recommendation + human approval.
266+
`ApprovalItem` following the same approval pattern used by the hiring and promotion
267+
pipelines. Required fields:
268+
269+
- `id`: unique UUID per `PruningEvaluation`
270+
- `title`: short summary, e.g. `"Prune agent {agent_id} ({reason})"`
271+
- `description`: rationale from `PruningEvaluation.signals` (quality decline slope,
272+
utilization, Jaccard overlap), affected team, and safety constraint check results
273+
- `requested_by`: the `PruningService` identifier or calling system
274+
- `action_type`: `"org:prune"`
275+
- `risk_level`: `ApprovalRiskLevel.MEDIUM`
276+
- `created_at`: ISO 8601 timestamp
277+
278+
Pruning is never fully automated -- it is recommendation + human approval.
259279

260280
### Safety Constraints
261281

@@ -322,6 +342,13 @@ node types executed in that turn would improve execution trace analysis without
322342
significant refactoring. This is optional but would directly enable structural credit
323343
assignment (knowing which node type failed).
324344

345+
**Backward compatibility**: `TurnRecord` is part of execution traces and may be
346+
persisted. The `node_types` field must be added as **optional with a default** (e.g.,
347+
`node_types: tuple[NodeType, ...] = ()`) so existing records remain valid without
348+
migration. Serialization/deserialization must tolerate the absent field. Consumers
349+
(trace analyzers, evaluation pipelines) should treat an empty tuple as "unknown
350+
composition" rather than erroring.
351+
325352
---
326353

327354
## Summary of Recommendations

docs/research/agent-controlled-compaction.md

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -311,7 +311,9 @@ compaction directive in the results and applies compaction via `invoke_compactio
311311
```python
312312
# In loop_helpers.py execute_tool_calls() or in the loop's per-turn handler:
313313
if any(r.metadata.get("compaction_directive") for r in tool_results):
314-
ctx = await invoke_compaction(compaction_callback, ctx)
314+
compacted = await invoke_compaction(ctx, compaction_callback, turn_number)
315+
if compacted is not None:
316+
ctx = compacted
315317
```
316318

317319
This preserves the immutable context pattern: the tool signals intent, the loop applies the

docs/research/control-plane-audit.md

Lines changed: 13 additions & 9 deletions
Original file line numberDiff line numberDiff line change
@@ -99,7 +99,8 @@ agents at runtime, without per-agent configuration.
9999
| Get autonomy config | `GET /agents/{id}/autonomy` | Per-agent override |
100100
| Set autonomy config | `PUT /agents/{id}/autonomy` | Write access |
101101
| List pending approvals | `GET /approvals` | Approval queue for escalated actions |
102-
| Decide approval | `POST /approvals/{id}/decide` | CEO/manager/board |
102+
| Approve pending approval | `POST /approvals/{approval_id}/approve` | CEO/manager/board approval action |
103+
| Reject pending approval | `POST /approvals/{approval_id}/reject` | CEO/manager/board rejection action |
103104

104105
**"Write Once, Enforce Everywhere" Validation**:
105106

@@ -159,19 +160,20 @@ queryable history and enforcement at multiple boundaries.
159160

160161
| Operation | Endpoint | Notes |
161162
|---|---|---|
162-
| Current budget status | `GET /budget` | Total spent, remaining, utilization % |
163-
| Spending history | `GET /budget/history` | Historical records |
164-
| Set/update budget | `POST /budget` | Write access |
163+
| Budget configuration | `GET /budget/config` | Budget settings, thresholds, and enforcement config |
164+
| Spending records | `GET /budget/records` | Paginated cost records with optional `agent_id`/`task_id` filters + daily/period summaries |
165+
| Agent budget records | `GET /budget/agents/{agent_id}` | Per-agent total spending |
165166
| Generate report | `POST /reports/generate` | Spending, performance, risk trends |
166167

167-
**Coverage**: Basic budget queries are covered. `GET /budget` returns utilization percentage
168-
and alert status. `GET /budget/history` returns spending records.
168+
**Coverage**: Basic budget queries are covered. `GET /budget/config` exposes budget
169+
configuration. `GET /budget/records` returns paginated spending records with daily and period
170+
summaries. `GET /budget/agents/{agent_id}` provides per-agent cost totals.
169171

170172
**Gap -- G6**: `CostTracker` is in-memory with TTL eviction; it is not a durable time-series
171173
store. Budget history granularity is limited -- the tracker supports `get_agent_cost(agent_id,
172-
start=)` and `get_total_cost(start=)` but the API endpoint does not expose multi-dimensional
174+
start=)` and `get_total_cost(start=)` but the API does not expose multi-dimensional
173175
queries (e.g., spending by provider X for agent Y during period Z). External cost dashboards
174-
need this level of attribution. The persistence layer backing `GET /budget/history` needs
176+
need this level of attribution. The persistence layer backing `GET /budget/records` needs
175177
inspection to confirm whether it provides richer query semantics than the in-memory tracker.
176178

177179
**Gap -- G4**: The 9 coordination metrics (`budget/coordination_metrics.py`) are computed
@@ -205,7 +207,9 @@ formats.
205207

206208
| Operation | Endpoint | Notes |
207209
|---|---|---|
208-
| Analytics dashboard | `GET /analytics` | Summary metrics |
210+
| Analytics overview | `GET /analytics/overview` | Summary metrics (task counts, cost totals, budget status) |
211+
| Analytics trends | `GET /analytics/trends` | Time-series cost and task-completion trends |
212+
| Analytics forecast | `GET /analytics/forecast` | Forward-looking spend projections |
209213
| Generate report | `POST /reports/generate` | Spending, performance, task completion |
210214
| List log sinks | `GET /settings/observability/sinks` | Current sink configuration |
211215
| Test sink connectivity | `POST /settings/observability/sinks/_test` | CEO/manager |

docs/research/multi-agent-failure-audit.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -171,7 +171,7 @@ Single LLM review call. If the winner matches a participant, auto-resolves. On a
171171

172172
### Complete Fallback Chain
173173

174-
```
174+
```text
175175
HybridResolver
176176
└─ clear winner found ─────────────────────────────→ RESOLVED_BY_HYBRID
177177
└─ ambiguous + escalate_on_ambiguity=True ──────────→ ESCALATED_TO_HUMAN (stub)

0 commit comments

Comments
 (0)