feat: step-level quality signals + accuracy-effort observability metric #697

@Aureliolo

Description

Context

Two complementary findings on improving engine observability:

  1. AgentProcessBench -- 1,000 trajectories with 8,509 human-labeled step annotations. Uses ternary step labeling (correct / neutral-exploratory / incorrect) with error-propagation rules. Our current stagnation detection operates at the task level only.
  2. MADQA Benchmark -- Agents get trapped in loops even after already reaching an answer. Introduces a novel accuracy-effort trade-off metric (outcome quality vs. steps consumed).
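A minimal sketch of how the accuracy-effort ratio could be computed. The function name, the [0, 1] quality score, and normalizing by a step budget are all assumptions for illustration; neither benchmark pins down an exact formula here:

```python
def accuracy_effort_ratio(task_quality: float,
                          steps_used: int,
                          step_budget: int) -> float:
    """Outcome quality per unit of normalized effort (higher is better).

    Hypothetical formulation: task_quality is an outcome score in [0, 1],
    and effort is the fraction of the step budget actually consumed.
    """
    if steps_used <= 0 or step_budget <= 0:
        raise ValueError("steps_used and step_budget must be positive")
    normalized_steps = steps_used / step_budget
    return task_quality / normalized_steps
```

Under this sketch, an agent that loops and burns its whole budget scores lower than one reaching the same outcome quality in half the steps, which is exactly the trade-off the MADQA metric is meant to surface.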

Action Items

  • Add step-level quality signals to approval gate triggers (not just task-boundary outcomes)
  • Implement ternary step classification: correct / neutral-exploratory / incorrect
  • Expose accuracy-effort ratio in observability layer (task quality / normalized steps)
  • Wire accuracy-effort metric into budget module for cost-per-outcome analysis
  • Address the finding that weaker models terminate early, inflating "correct step" ratios (a trap for HR performance tracking)
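The ternary labels and a step-level approval-gate trigger from the items above might look like the following sketch. The window size, threshold, and function names are illustrative assumptions, not the engine's actual API:

```python
from collections import Counter
from enum import Enum


class StepLabel(Enum):
    """Ternary step classification per AgentProcessBench."""
    CORRECT = "correct"
    NEUTRAL_EXPLORATORY = "neutral-exploratory"
    INCORRECT = "incorrect"


def should_trigger_approval_gate(labels: list[StepLabel],
                                 window: int = 5,
                                 incorrect_threshold: float = 0.4) -> bool:
    """Fire the gate when recent steps are dominated by incorrect labels.

    Neutral-exploratory steps are excluded from the denominator so that
    legitimate exploration does not count against the agent.
    """
    recent = labels[-window:]
    counts = Counter(recent)
    decided = counts[StepLabel.CORRECT] + counts[StepLabel.INCORRECT]
    if decided == 0:
        return False  # nothing but exploration so far; no signal yet
    return counts[StepLabel.INCORRECT] / decided >= incorrect_threshold
```

This fires mid-task rather than at task boundaries, which is the core change the first action item asks for; the threshold would presumably be tuned against the human-labeled annotations.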

References

Metadata

Labels

  • prio:high -- Important, should be prioritized
  • scope:medium -- 1-3 days of work
  • spec:task-workflow -- DESIGN_SPEC Section 6 - Task & Workflow Engine
  • type:feature -- New feature implementation
  • v0.7 -- Minor version v0.7
  • v0.7.7 -- Patch release v0.7.7
