feat: Agentic On-Policy Distillation (OPD) environment by teknium1 · Pull Request #1149 · NousResearch/hermes-agent

teknium1 · 2026-03-13T09:45:39Z

Summary

First Atropos environment to populate distill_token_ids / distill_logprobs on ScoredDataGroup, enabling on-policy distillation (OPD) training for agentic tasks.

Background

Based on OpenClaw-RL (Princeton, March 2026), which proved that next-state signals from agent interactions contain two forms of wasted training data:

Evaluative signals — implicit +1/-1 scores (test pass/fail, error traces)
Directive signals — token-level correction info ("you should have checked the file first")

Their combined OPD+RL method improved personalization from 0.17 → 0.81 in 36 conversations, and tool-call accuracy by 76% using process rewards vs outcome-only.

What this PR adds

environments/agentic_opd_env.py — a new Atropos environment that:

OPD Pipeline (the novel part):

Walks the agent conversation to find (assistant_turn, next_state) pairs
Uses an LLM judge with majority voting to extract hindsight hints from next-state signals (tool results, error messages, test verdicts)
Builds enhanced prompts (original context + hint)
Scores student tokens under the enhanced distribution using Atropos's built-in get_logprobs API (VLLM prompt_logprobs)
Packages the teacher's top-K predictions as distill_token_ids / distill_logprobs on ScoredDataGroup

Task: Coding problems with test verification

8 built-in coding tasks (fizzbuzz, two_sum, merge_intervals, etc.)
HuggingFace dataset support for custom coding benchmarks
Rich next-state signals from test pass/fail, error traces, terminal output

Reward: Multi-signal scoring

Correctness (0.7): test pass/fail via ToolContext
Efficiency (0.15): fewer turns = better
Tool usage (0.15): appropriate use of terminal + file tools

Configuration (AgenticOPDConfig):

opd_enabled: toggle OPD pipeline (default: True)
distill_topk: top-K teacher predictions per position (default: 50)
prm_votes: majority voting count for hint judge (default: 3)
hint_max_next_state_chars: truncation for long tool outputs (default: 4000)

WandB Metrics:

opd/mean_hints_per_rollout, opd/mean_turns_scored, opd/hint_rate
Standard training/eval metrics (correctness, reward, pass_rate)

Architecture

AgenticOPDEnv(HermesAgentBaseEnv)
  └── collect_trajectories() override:
        1. super().collect_trajectories() → standard agentic rollouts
        2. _apply_opd_pipeline() for each rollout:
           a. _extract_turn_pairs() — find (assistant, tool_result) pairs
           b. _extract_hint() — LLM judge with majority voting
           c. Build enhanced prompt + tokenize
           d. server.get_logprobs() — VLLM prompt_logprobs scoring
           e. Map teacher top-K back to student token positions
        3. Set distill_token_ids / distill_logprobs on ScoredDataGroup

No external servers needed — the same VLLM backend that generates rollouts also scores teacher logprobs via prompt_logprobs.

Requirements

VLLM backend (server_type: vllm) for prompt logprob scoring
Phase 2 mode (ManagedServer) for token-level tracking

Test Plan

Import test passes ✅
Hint extraction helpers verified (parse, select, inject) ✅
Turn pair extraction logic verified ✅
Token span search verified ✅
Full test suite: 3376 passed (6 pre-existing failures unrelated) ✅

First Atropos environment to populate distill_token_ids / distill_logprobs on ScoredDataGroup, enabling on-policy distillation training. Based on OpenClaw-RL (Princeton, arXiv:2603.10165): - Extracts hindsight hints from next-state signals (tool results, errors) - Uses LLM judge with majority voting for hint extraction - Scores student tokens under hint-enhanced distribution via get_logprobs - Packages teacher's top-K predictions as distillation targets Architecture: - AgenticOPDEnv extends HermesAgentBaseEnv - Overrides collect_trajectories to add OPD pipeline after standard rollouts - Uses Atropos's built-in get_logprobs (VLLM prompt_logprobs) for teacher scoring - No external servers needed — same VLLM backend handles both rollouts and scoring Task: Coding problems with test verification (8 built-in tasks, HF dataset support) Reward: correctness (0.7) + efficiency (0.15) + tool usage (0.15) OPD: Per-turn hint extraction → enhanced prompt → teacher top-K logprobs Configurable: opd_enabled, distill_topk, prm_votes, hint truncation length Metrics: opd/mean_hints_per_rollout, opd/mean_turns_scored, opd/hint_rate

anthropic/claude-opus-4.6 (OpenRouter format) was being sent as claude-opus-4.6 to the Anthropic API, which expects claude-opus-4-6 (hyphens, not dots). normalize_model_name() now converts dots to hyphens after stripping the provider prefix, matching Anthropic's naming convention. Fixes 404: 'model: claude-opus-4.6 was not found'

…d28bf447 feat: Agentic On-Policy Distillation (OPD) environment

teknium1 added 2 commits March 13, 2026 02:45

teknium1 merged commit c097e56 into main Mar 13, 2026
1 check failed

angelburgosrosado pushed a commit to angelburgosrosado/hermes-agent that referenced this pull request Apr 27, 2026

Merge pull request NousResearch#1149 from NousResearch/hermes/hermes-…

82d1c4b

…d28bf447 feat: Agentic On-Policy Distillation (OPD) environment

02356abc pushed a commit to 02356abc/hermes-agent that referenced this pull request May 14, 2026

Merge pull request NousResearch#1149 from NousResearch/hermes/hermes-…

703a633

…d28bf447 feat: Agentic On-Policy Distillation (OPD) environment

olympus-terminal pushed a commit to olympus-terminal/hermes-agent that referenced this pull request May 16, 2026

Merge pull request NousResearch#1149 from NousResearch/hermes/hermes-…

8762400

…d28bf447 feat: Agentic On-Policy Distillation (OPD) environment

Egavasyug pushed a commit to Egavasyug/hermes-agent that referenced this pull request Jun 10, 2026

Merge pull request NousResearch#1149 from NousResearch/hermes/hermes-…

132daf0

…d28bf447 feat: Agentic On-Policy Distillation (OPD) environment

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Agentic On-Policy Distillation (OPD) environment#1149

feat: Agentic On-Policy Distillation (OPD) environment#1149
teknium1 merged 2 commits into
mainfrom
hermes/hermes-d28bf447

teknium1 commented Mar 13, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

teknium1 commented Mar 13, 2026

Summary

Background

What this PR adds

Architecture

Requirements

Test Plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant