Context
Two complementary findings on improving engine observability:
- AgentProcessBench -- 1,000 trajectories with 8,509 human-labeled step annotations. Each step gets a ternary label (correct / neutral-exploratory / incorrect), with error-propagation rules applied across steps. Current stagnation detection operates at task level only.
- MADQA Benchmark -- Agents get trapped in loops despite already having the answer. Introduces a novel accuracy-effort trade-off metric (outcome quality vs. steps consumed).
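To make the two findings concrete, here is a minimal sketch of what step-level signals could look like. Everything in it is an assumption for illustration: the names `StepLabel`, `propagate_errors`, `is_stagnating`, and `quality_per_step` are hypothetical, the propagation rule (every step after the first incorrect one is treated as incorrect) is a simplified stand-in for AgentProcessBench's actual rules, and the quality-per-step ratio is a placeholder rather than the metric MADQA defines.

```python
from enum import Enum
from typing import List, Sequence


class StepLabel(Enum):
    """Ternary step label, mirroring correct / neutral-exploratory / incorrect."""
    CORRECT = 1
    NEUTRAL = 0
    INCORRECT = -1


def propagate_errors(labels: Sequence[StepLabel]) -> List[StepLabel]:
    """ASSUMED rule: once a step is incorrect, all later steps count as incorrect."""
    propagated: List[StepLabel] = []
    tainted = False
    for label in labels:
        if label is StepLabel.INCORRECT:
            tainted = True
        propagated.append(StepLabel.INCORRECT if tainted else label)
    return propagated


def is_stagnating(labels: Sequence[StepLabel], window: int = 3) -> bool:
    """Step-level check: True if none of the last `window` steps is CORRECT."""
    if window <= 0 or len(labels) < window:
        return False
    return all(label is not StepLabel.CORRECT for label in labels[-window:])


def quality_per_step(outcome_quality: float, steps_used: int) -> float:
    """HYPOTHETICAL accuracy-effort proxy: outcome quality divided by steps consumed."""
    if steps_used <= 0:
        raise ValueError("steps_used must be positive")
    return outcome_quality / steps_used


if __name__ == "__main__":
    trajectory = [
        StepLabel.CORRECT,
        StepLabel.NEUTRAL,
        StepLabel.INCORRECT,
        StepLabel.CORRECT,
    ]
    propagated = propagate_errors(trajectory)
    print([label.name for label in propagated])
    print(is_stagnating(propagated, window=2))
    print(quality_per_step(0.8, steps_used=len(trajectory)))
```

Because `is_stagnating` works on a sliding window of step labels, it can fire mid-trajectory, which a task-level check cannot do. The window size and the choice to count neutral steps as non-progress are tunable assumptions.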
Action Items
References