feat: TrajectoryScorer for HybridLoop (best-of-K with uncertainty signals) #705

@Aureliolo

Description

Context

Deep dive on "SRLM: Self-Reflective Program Search" (arXiv:2603.15653). Core finding: trajectory selection quality matters more than recursion depth (+22% over RLMs, no retraining).

Three Uncertainty Signals

  1. Self-consistency filter: Sample K candidates, majority-vote filter. Discard candidates disagreeing with majority output.
  2. Verbalized confidence (VC): Per-step structured confidence value (0-100). Log-space aggregation: VC(p) = sum_t log(nu_t / 100). Single low-confidence step penalizes entire candidate.
  3. Trace length (Len): Total output tokens across steps. Shorter = more confident. Already tracked via TurnRecord.output_tokens -- zero new data collection.
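A minimal sketch of the self-consistency filter (signal 1), assuming a hypothetical `Candidate` container that is not part of the codebase. It compares stripped output strings only to keep the example short; a real comparator would likely need to compare step intent rather than raw text.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical container for one sampled trajectory (illustration only)."""

    output: str                     # final output used for the majority vote
    confidences: tuple[int, ...]    # per-step verbalized confidence, 0-100
    output_tokens: tuple[int, ...]  # per-step output token counts


def filter_consistent(candidates: list[Candidate]) -> list[Candidate]:
    """Keep only candidates whose output matches the majority output.

    Ties resolve to the output seen first, so with K=2 and a disagreement
    the first candidate survives. That tie rule is an assumption.
    """
    if not candidates:
        return []
    votes = Counter(c.output.strip() for c in candidates)
    majority_output, _ = votes.most_common(1)[0]
    return [c for c in candidates if c.output.strip() == majority_output]
```

With K=2 the vote only removes a candidate when the two disagree, so the filter becomes more informative at K=3.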

Joint scoring: s(p) = VC(p) * Len(p). Because each nu_t / 100 <= 1, VC(p) <= 0, and Len(p) > 0, so s(p) <= 0 -- least-negative wins (most confident + most concise).
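The scoring rule can be sketched as follows. The function names are illustrative, and two edge-case choices are assumptions not stated above: a confidence of 0 maps to negative infinity, and the length is clamped to at least 1 so the product never becomes NaN.

```python
import math


def verbalized_confidence(confidences: list[float]) -> float:
    """VC(p) = sum_t log(nu_t / 100). Always <= 0; 0 only if every step is 100."""
    total = 0.0
    for nu in confidences:
        if nu <= 0:
            return -math.inf  # assumption: a zero-confidence step disqualifies
        total += math.log(min(nu, 100) / 100)
    return total


def trace_length(output_tokens: list[int]) -> int:
    """Len(p): total output tokens across steps."""
    return sum(output_tokens)


def joint_score(confidences: list[float], output_tokens: list[int]) -> float:
    """s(p) = VC(p) * Len(p); values closer to 0 are better."""
    length = max(1, trace_length(output_tokens))  # assumption: avoid -inf * 0
    return verbalized_confidence(confidences) * length


def select_best(candidates: list[tuple[list[float], list[int]]]) -> int:
    """Return the index of the least-negative joint score."""
    scores = [joint_score(conf, toks) for conf, toks in candidates]
    return max(range(len(scores)), key=scores.__getitem__)
```

For example, a candidate with confidences `[90, 80]` and 250 total tokens scores lower (more negative) than one with `[95, 95]` and 220 tokens, so the second is selected. Note that a candidate reporting 100 at every step scores exactly 0 regardless of length.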

Implementation Blueprint

  1. TrajectoryConfig dataclass: enabled: bool = False, k_candidates: int = 2, complexity_gate: tuple[str, ...] = ("complex", "epic")
  2. New engine/trajectory_scorer.py: CandidateStep model, _filter_consistent(), _score(), select_best()
  3. Verbalized confidence elicitation: structured {"confidence": <0-100>} JSON block in step generation prompt
  4. hybrid_loop.py integration: asyncio.TaskGroup for K parallel candidates when enabled + budget allows
  5. Budget guard (mandatory): If K-candidate sampling would exceed remaining task budget_limit, fall back to single-candidate
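A rough sketch of how the config, complexity gate, budget guard, and parallel sampling could fit together. The `TrajectoryConfig` fields come from the blueprint; `effective_k`, `sample_candidates`, and the cost parameters are hypothetical names, and the per-candidate cost estimate is an assumption about how budget would be checked. `asyncio.TaskGroup` requires Python 3.11+.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable


@dataclass(frozen=True)
class TrajectoryConfig:
    enabled: bool = False
    k_candidates: int = 2
    complexity_gate: tuple[str, ...] = ("complex", "epic")


def effective_k(
    config: TrajectoryConfig,
    complexity: str,
    est_cost_per_candidate: float,  # hypothetical estimate, same unit as budget
    budget_remaining: float,        # hypothetical: remaining task budget
) -> int:
    """Return how many candidates to sample; 1 means single-candidate fallback."""
    if not config.enabled or complexity not in config.complexity_gate:
        return 1
    if config.k_candidates * est_cost_per_candidate > budget_remaining:
        return 1  # budget guard: K-way sampling would exceed the budget
    return max(1, config.k_candidates)


async def sample_candidates(
    generate: Callable[[int], Awaitable[str]],  # hypothetical step generator
    k: int,
) -> list[str]:
    """Run k generations concurrently; any failure cancels the siblings."""
    async with asyncio.TaskGroup() as tg:
        tasks = [tg.create_task(generate(i)) for i in range(k)]
    return [t.result() for t in tasks]
```

Because `TaskGroup` cancels the remaining tasks when one raises, a single provider error aborts the whole K-way sample; whether that or a partial result is preferable is an open design choice.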

Design Decisions

  • K=2-3 initially (not K=8 from paper -- 8x cost multiplier)
  • Complex/epic tasks only -- never activate for ReAct (Loop 1) or simple/medium
  • Off by default (enabled: False)
  • Complements stagnation detection: SRLM is forward-looking (pre-commit selection), stagnation is backward-looking (past repetition correction)

Risks

  • Cost multiplier: K=8 = 8x provider calls. Budget gate is not optional.
  • Verbalized confidence calibration: Many models are poorly calibrated (DCPO research). Scoring degrades gracefully to Len-only.
  • Self-consistency comparator: Step-type-aware comparison needed (tool-call intent, not raw string equality).
  • Confident mistakes: If all K candidates agree but are all wrong, selection offers no protection. Stagnation detection + review gates remain the backstop.
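One way the Len-only degradation might look, as a sketch under stated assumptions: the `{"confidence": <0-100>}` block is extracted with a regex, and if any step in any candidate lacks a usable value, all candidates are ranked by length alone so that scores stay comparable. Function names and the all-or-nothing fallback rule are hypothetical.

```python
import math
import re
from typing import Optional

# Matches the structured block described in the blueprint, e.g. {"confidence": 85}
_CONF_RE = re.compile(r'\{\s*"confidence"\s*:\s*(\d{1,3})\s*\}')

# One step: (verbalized confidence or None if missing, output tokens)
Step = tuple[Optional[int], int]


def parse_confidence(step_text: str) -> Optional[int]:
    """Return the 0-100 confidence from a step, or None if absent or out of range."""
    match = _CONF_RE.search(step_text)
    if match is None:
        return None
    value = int(match.group(1))
    return value if 0 <= value <= 100 else None


def select_best(trajectories: list[list[Step]]) -> int:
    """Pick the best trajectory index (expects at least one trajectory).

    Uses VC * Len when every step has a confidence; otherwise Len-only.
    """
    def has_vc(traj: list[Step]) -> bool:
        return bool(traj) and all(conf is not None for conf, _ in traj)

    use_vc = all(has_vc(t) for t in trajectories)

    def score(traj: list[Step]) -> float:
        length = max(1, sum(tokens for _, tokens in traj))
        if not use_vc:
            return -float(length)  # Len-only: shorter trace wins
        vc = sum(math.log(c / 100) if c > 0 else -math.inf for c, _ in traj)
        return vc * length

    return max(range(len(trajectories)), key=lambda i: score(trajectories[i]))
```

Applying the fallback to the whole candidate set, rather than per candidate, avoids comparing a joint score against a raw negative length.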

Metadata

    Labels

    prio:high (Important, should be prioritized)
    scope:medium (1-3 days of work)
    spec:task-workflow (DESIGN_SPEC Section 6 - Task & Workflow Engine)
    type:feature (New feature implementation)
    v0.7 (Minor version v0.7)
    v0.7.8 (Patch release v0.7.8)
