Context
Deep dive on "SRLM: Self-Reflective Program Search" (arXiv:2603.15653). Core finding: trajectory selection quality matters more than recursion depth (+22% over RLMs, no retraining).
Three Uncertainty Signals
- Self-consistency filter: sample K candidates and majority-vote on their outputs; discard candidates that disagree with the majority.
- Verbalized confidence (VC): per-step structured confidence value (0-100), aggregated in log space: VC(p) = sum_t log(nu_t / 100). A single low-confidence step penalizes the entire candidate.
- Trace length (Len): total output tokens across steps; shorter = more confident. Already tracked via TurnRecord.output_tokens -- zero new data collection.
Joint scoring: s(p) = VC(p) * Len(p) -- since VC(p) <= 0 and Len(p) > 0, every score is non-positive; the least-negative candidate wins (most confident and most concise).
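A minimal sketch of the joint score under the definitions above. The CandidateStep shape and function names are illustrative assumptions, not the repo's actual API; the max(..., 1) guard against log(0) is also an added assumption.

```python
import math
from dataclasses import dataclass

@dataclass
class CandidateStep:
    confidence: int      # verbalized confidence, 0-100
    output_tokens: int   # this step's contribution to trace length

def vc(steps: list[CandidateStep]) -> float:
    # Log-space aggregation: one low-confidence step drags down the whole sum.
    # max(..., 1) is an assumed guard against log(0); not from the source.
    return sum(math.log(max(s.confidence, 1) / 100) for s in steps)

def trace_len(steps: list[CandidateStep]) -> int:
    return sum(s.output_tokens for s in steps)

def score(steps: list[CandidateStep]) -> float:
    # VC <= 0 and Len > 0, so s(p) <= 0; least-negative wins.
    return vc(steps) * trace_len(steps)

def select_best(candidates: list[list[CandidateStep]]) -> list[CandidateStep]:
    return max(candidates, key=score)
```

A fully confident candidate (all steps at 100) scores exactly 0, the best possible value; any hesitant or verbose candidate scores below it.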
Implementation Blueprint
- TrajectoryConfig dataclass: enabled: bool = False, k_candidates: int = 2, complexity_gate: tuple[str, ...] = ("complex", "epic")
- New engine/trajectory_scorer.py: CandidateStep model, _filter_consistent(), _score(), select_best()
- Verbalized confidence elicitation: structured {"confidence": <0-100>} JSON block in the step-generation prompt
- hybrid_loop.py integration: asyncio.TaskGroup for K parallel candidates when enabled and budget allows
- Budget guard (mandatory): if K-candidate sampling would exceed the remaining task budget_limit, fall back to single-candidate generation
Design Decisions
- K=2-3 initially (not K=8 from the paper -- an 8x cost multiplier)
- Complex/epic tasks only -- never activate for ReAct (Loop 1) or simple/medium
- Off by default (enabled: False)
- Complements stagnation detection: SRLM is forward-looking (pre-commit selection), stagnation is backward-looking (past repetition correction)
Risks
- Cost multiplier: K=8 = 8x provider calls. Budget gate is not optional.
- Verbalized confidence calibration: Many models are poorly calibrated (DCPO research). Scoring degrades gracefully to Len-only.
- Self-consistency comparator: Step-type-aware comparison needed (tool-call intent, not raw string equality).
- Confident mistakes: if all K candidates agree but are all wrong, scoring offers no protection. Stagnation detection + review gates remain the backstop.
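One way the step-type-aware comparator could look; a sketch only, since the step schema and tool-call normalization are assumptions rather than the repo's actual types. Tool calls compare on intent (tool name plus canonicalized args), plain-text steps on normalized text.

```python
import json
from collections import Counter

def step_key(step: dict) -> str:
    # Compare tool-call intent, not raw string equality.
    if step.get("type") == "tool_call":
        # sort_keys canonicalizes arg ordering so equivalent calls match.
        return json.dumps(
            {"tool": step["tool"], "args": step.get("args", {})},
            sort_keys=True,
        )
    # Plain-text steps: whitespace- and case-normalized comparison.
    return " ".join(step.get("text", "").lower().split())

def filter_consistent(candidates: list[list[dict]]) -> list[list[dict]]:
    # Majority vote on each candidate's normalized final step;
    # keep only the candidates that agree with the majority.
    keys = [step_key(c[-1]) for c in candidates]
    majority, _ = Counter(keys).most_common(1)[0]
    return [c for c, k in zip(candidates, keys) if k == majority]
```

Voting on the final step is an assumed simplification; a production filter might compare whole trajectories or intermediate tool calls.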
References
- SRLM: Self-Reflective Program Search (arXiv:2603.15653)