feat: TrajectoryScorer for HybridLoop (best-of-K with uncertainty signals)

## Context

Deep dive on ["SRLM: Self-Reflective Program Search" (arXiv:2603.15653)](https://arxiv.org/abs/2603.15653). Core finding: trajectory selection quality matters more than recursion depth (reported +22% over RLMs, with no retraining).

## Three Uncertainty Signals

1. **Self-consistency filter**: Sample K candidates, then apply a majority-vote filter. Discard candidates that disagree with the majority output.
2. **Verbalized confidence (VC)**: Per-step structured confidence value `nu_t` (0-100). Log-space aggregation: `VC(p) = sum_t log(nu_t / 100)`, so `VC(p) <= 0`. A single low-confidence step penalizes the entire candidate.
3. **Trace length (Len)**: Total output tokens across steps. Shorter = more confident. Already tracked via `TurnRecord.output_tokens` -- zero new data collection.

**Joint scoring**: `s(p) = VC(p) * Len(p)`. Because `VC(p) <= 0` and `Len(p) > 0`, every score is non-positive -- the least-negative score wins (most confident + most concise).
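
A minimal sketch of the scoring above, to make the sign convention concrete. The `CandidateStep` fields and the clamping of confidence to `[1, 100]` (to avoid `log(0)`) are assumptions for illustration, not something the paper or this codebase prescribes.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateStep:
    """One step of a candidate trajectory (hypothetical field names)."""
    confidence: int     # verbalized confidence nu_t, 0-100
    output_tokens: int  # mirrors TurnRecord.output_tokens


def verbalized_confidence(steps: list[CandidateStep]) -> float:
    """VC(p) = sum_t log(nu_t / 100); always <= 0."""
    # Assumption: clamp to [1, 100] so a reported 0 does not yield log(0).
    return sum(math.log(min(max(s.confidence, 1), 100) / 100) for s in steps)


def trace_length(steps: list[CandidateStep]) -> int:
    """Len(p): total output tokens across all steps."""
    return sum(s.output_tokens for s in steps)


def score(steps: list[CandidateStep]) -> float:
    """s(p) = VC(p) * Len(p); non-positive, closer to zero is better."""
    return verbalized_confidence(steps) * trace_length(steps)


def select_best(candidates: list[list[CandidateStep]]) -> int:
    """Return the index of the candidate with the least-negative score."""
    return max(range(len(candidates)), key=lambda i: score(candidates[i]))
```

With this sketch, a short trace with confidences 90 and 80 beats a longer trace containing a single step at confidence 30, since both the log-sum and the length push the second score further below zero.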

## Implementation Blueprint

1. `TrajectoryConfig` dataclass: `enabled: bool = False`, `k_candidates: int = 2`, `complexity_gate: tuple[str, ...] = ("complex", "epic")`
2. New `engine/trajectory_scorer.py`: `CandidateStep` model, `_filter_consistent()`, `_score()`, `select_best()`
3. Verbalized confidence elicitation: structured `{"confidence": <0-100>}` JSON block in step generation prompt
4. `hybrid_loop.py` integration: `asyncio.TaskGroup` for K parallel candidates when enabled + budget allows
5. **Budget guard (mandatory)**: If K-candidate sampling would exceed remaining task `budget_limit`, fall back to single-candidate

## Design Decisions

- K=2-3 initially (not K=8 from paper -- 8x cost multiplier)
- Complex/epic tasks only -- never activate for ReAct (Loop 1) or simple/medium tasks
- Off by default (`enabled: False`)
- Complements stagnation detection: SRLM is forward-looking (pre-commit selection), stagnation detection is backward-looking (correcting past repetition)

## Risks

- **Cost multiplier**: K=8 means 8x provider calls. Budget gate is not optional.
- **Verbalized confidence calibration**: Many models are poorly calibrated (DCPO research). Scoring should degrade gracefully to Len-only when confidence values are missing or unreliable.
- **Self-consistency comparator**: Step-type-aware comparison needed (tool-call intent, not raw string equality).
- **Confident mistakes**: If all K candidates agree but are all wrong, selection offers no protection. Stagnation detection + review gates remain the backstop.
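
To illustrate the self-consistency comparator risk, here is one way a step-type-aware majority filter might be sketched. `StepOutput`, `intent_key`, and the text normalization are hypothetical; a real comparator would likely need per-tool rules for which arguments count as "the same intent".

```python
import json
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class StepOutput:
    """A candidate step's output (hypothetical shape)."""
    kind: str  # "tool_call" or "text"
    tool_name: str = ""
    arguments: dict = field(default_factory=dict)
    text: str = ""


def intent_key(step: StepOutput) -> str:
    """Canonical key so equivalent steps compare equal despite surface differences."""
    if step.kind == "tool_call":
        # Sort argument keys so JSON key order does not affect equality.
        args = json.dumps(step.arguments, sort_keys=True)
        return f"tool:{step.tool_name}:{args}"
    # Crude text normalization (assumption): lowercase, collapse whitespace.
    return "text:" + " ".join(step.text.lower().split())


def filter_consistent(candidates: list[StepOutput]) -> list[StepOutput]:
    """Keep only candidates whose intent key matches the majority key."""
    if not candidates:
        return []
    keys = [intent_key(c) for c in candidates]
    # On a tie, Counter.most_common returns the first-seen key.
    majority_key, _ = Counter(keys).most_common(1)[0]
    return [c for c, k in zip(candidates, keys) if k == majority_key]
```

With K=2 a disagreement is always a tie, so this sketch keeps whichever candidate was sampled first; the joint score would then have nothing to choose between.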

## References

- [arXiv:2603.15653](https://arxiv.org/abs/2603.15653)
- Related: #687 (agent-controlled compaction -- both touch HybridLoop decision points)
- Related: #697 (step-level quality signals -- complementary observability)
