feat: Agentic On-Policy Distillation (OPD) environment#1149

Merged
teknium1 merged 2 commits into main from hermes/hermes-d28bf447 on Mar 13, 2026

@teknium1

Summary

First Atropos environment to populate distill_token_ids / distill_logprobs on ScoredDataGroup, enabling on-policy distillation (OPD) training for agentic tasks.

Background

Based on OpenClaw-RL (Princeton, March 2026), which showed that the next-state signals produced by agent interactions carry two kinds of training signal that usually go unused:

  1. Evaluative signals — implicit +1/-1 scores (test pass/fail, error traces)
  2. Directive signals — token-level correction info ("you should have checked the file first")

Their combined OPD+RL method improved personalization from 0.17 → 0.81 within 36 conversations, and using process rewards instead of outcome-only rewards improved tool-call accuracy by 76%.

What this PR adds

environments/agentic_opd_env.py: a new Atropos environment made up of the following parts.

OPD Pipeline (the novel part):

  • Walks the agent conversation to find (assistant_turn, next_state) pairs
  • Uses an LLM judge with majority voting to extract hindsight hints from next-state signals (tool results, error messages, test verdicts)
  • Builds enhanced prompts (original context + hint)
  • Scores student tokens under the enhanced distribution using Atropos's built-in get_logprobs API (VLLM prompt_logprobs)
  • Packages the teacher's top-K predictions as distill_token_ids / distill_logprobs on ScoredDataGroup
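
The first two pipeline steps can be sketched as follows. This is a minimal illustration, not the code in agentic_opd_env.py: the function names `extract_turn_pairs` and `majority_vote`, the OpenAI-style message dicts, and the rule of keeping a hint only when a strict majority of judge votes produced one are all assumptions made for the example.

```python
from collections import Counter
from typing import Optional


def extract_turn_pairs(messages: list[dict]) -> list[tuple[int, str]]:
    """Pair each assistant turn with the tool results that directly follow it.

    Returns (assistant_message_index, next_state_text) tuples. Assistant turns
    with no following tool result are skipped because they carry no next-state
    signal.
    """
    pairs = []
    for i, msg in enumerate(messages):
        if msg.get("role") != "assistant":
            continue
        next_state = []
        for later in messages[i + 1:]:
            if later.get("role") != "tool":
                break
            next_state.append(str(later.get("content", "")))
        if next_state:
            pairs.append((i, "\n".join(next_state)))
    return pairs


def majority_vote(votes: list[Optional[str]]) -> Optional[str]:
    """Keep a hint only if a strict majority of judge calls returned one.

    Each vote is either a hint string or None (the judge saw nothing useful).
    Among the positive votes, the most common hint wins.
    """
    hints = [v for v in votes if v]
    if len(hints) * 2 <= len(votes):
        return None
    return Counter(hints).most_common(1)[0][0]
```

With `prm_votes` set to 3, the judge would be called three times per turn pair and `majority_vote` applied to the three results; the actual selection rule in the environment may differ.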

Task: Coding problems with test verification

  • 8 built-in coding tasks (fizzbuzz, two_sum, merge_intervals, etc.)
  • HuggingFace dataset support for custom coding benchmarks
  • Rich next-state signals from test pass/fail, error traces, terminal output

Reward: Multi-signal scoring

  • Correctness (0.7): test pass/fail via ToolContext
  • Efficiency (0.15): fewer turns = better
  • Tool usage (0.15): appropriate use of terminal + file tools
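
As a rough sketch of how the three weighted signals could combine: the 0.7 / 0.15 / 0.15 weights come from the list above, while the function name `combined_reward`, its arguments, and the formulas for the efficiency and tool-usage terms are assumptions for illustration, not the environment's actual scoring code.

```python
def combined_reward(
    tests_passed: bool,
    turns_used: int,
    max_turns: int,
    used_terminal: bool,
    used_file_tool: bool,
) -> float:
    """Weighted sum of correctness, efficiency, and tool-usage signals."""
    correctness = 1.0 if tests_passed else 0.0

    # Assumed efficiency shape: 1.0 for a single turn, falling linearly
    # to 0.0 when the whole turn budget is used.
    span = max(1, max_turns - 1)
    efficiency = min(1.0, max(0.0, 1.0 - (turns_used - 1) / span))

    # Assumed tool-usage shape: half credit for each tool family used.
    tool_usage = (int(used_terminal) + int(used_file_tool)) / 2

    return 0.7 * correctness + 0.15 * efficiency + 0.15 * tool_usage
```

Under these assumptions, a rollout that passes its tests but uses the entire turn budget still scores at least 0.7, so correctness dominates the other two terms.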

Configuration (AgenticOPDConfig):

  • opd_enabled: toggle OPD pipeline (default: True)
  • distill_topk: top-K teacher predictions per position (default: 50)
  • prm_votes: majority voting count for hint judge (default: 3)
  • hint_max_next_state_chars: truncation for long tool outputs (default: 4000)
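
For orientation, the four options and their defaults can be written as a plain dataclass. This is only a sketch: the real AgenticOPDConfig presumably extends the base environment config and has further fields, and both the class name `OPDConfigSketch` and the helper `truncate_next_state` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class OPDConfigSketch:
    """Hypothetical stand-in holding only the OPD options listed above."""

    opd_enabled: bool = True               # toggle the OPD pipeline
    distill_topk: int = 50                 # teacher predictions kept per position
    prm_votes: int = 3                     # judge calls per turn for majority voting
    hint_max_next_state_chars: int = 4000  # cap on tool output shown to the judge


def truncate_next_state(text: str, cfg: OPDConfigSketch) -> str:
    """Trim long tool output before it is sent to the hint judge."""
    limit = cfg.hint_max_next_state_chars
    return text if len(text) <= limit else text[:limit]
```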

WandB Metrics:

  • opd/mean_hints_per_rollout, opd/mean_turns_scored, opd/hint_rate
  • Standard training/eval metrics (correctness, reward, pass_rate)

Architecture

AgenticOPDEnv(HermesAgentBaseEnv)
  └── collect_trajectories() override:
        1. super().collect_trajectories() → standard agentic rollouts
        2. _apply_opd_pipeline() for each rollout:
           a. _extract_turn_pairs() — find (assistant, tool_result) pairs
           b. _extract_hint() — LLM judge with majority voting
           c. Build enhanced prompt + tokenize
           d. server.get_logprobs() — VLLM prompt_logprobs scoring
           e. Map teacher top-K back to student token positions
        3. Set distill_token_ids / distill_logprobs on ScoredDataGroup

No external servers needed — the same VLLM backend that generates rollouts also scores teacher logprobs via prompt_logprobs.
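
Steps (d) and (e) depend on locating the student's response tokens inside the tokenized enhanced prompt and reading off the teacher's top-K entries at those positions. The sketch below shows one way to do that in plain Python. It assumes the per-position logprobs arrive as a list with one `{token_id: logprob}` dict per prompt token (and `None` for the first position), which is similar in spirit to vLLM's prompt_logprobs output; the shape returned by `get_logprobs` in Atropos may differ. The names `find_span` and `align_topk` are hypothetical.

```python
from typing import Optional


def find_span(haystack: list[int], needle: list[int]) -> int:
    """Return the start index of the last occurrence of needle, or -1."""
    n = len(needle)
    if n == 0:
        return -1
    for start in range(len(haystack) - n, -1, -1):
        if haystack[start:start + n] == needle:
            return start
    return -1


def align_topk(
    enhanced_ids: list[int],
    student_ids: list[int],
    position_logprobs: list[Optional[dict[int, float]]],
    k: int,
) -> Optional[tuple[list[list[int]], list[list[float]]]]:
    """Collect the teacher's top-k token ids and logprobs per student token.

    Returns None when the student tokens cannot be found in the enhanced
    prompt, for example because tokenization merged tokens differently.
    """
    start = find_span(enhanced_ids, student_ids)
    if start < 0:
        return None
    token_ids, logprobs = [], []
    for pos in range(start, start + len(student_ids)):
        entry = position_logprobs[pos] or {}
        # Highest logprob first, truncated to k candidates.
        top = sorted(entry.items(), key=lambda kv: kv[1], reverse=True)[:k]
        token_ids.append([tok for tok, _ in top])
        logprobs.append([lp for _, lp in top])
    return token_ids, logprobs
```

The two returned lists have one row per student token, which matches the per-position layout that distill_token_ids / distill_logprobs imply. Searching from the end of the prompt is a design choice made here so that repeated text earlier in the context does not capture the match.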

Requirements

  • VLLM backend (server_type: vllm) for prompt logprob scoring
  • Phase 2 mode (ManagedServer) for token-level tracking

Test Plan

  • Import test passes ✅
  • Hint extraction helpers verified (parse, select, inject) ✅
  • Turn pair extraction logic verified ✅
  • Token span search verified ✅
  • Full test suite: 3376 passed (6 pre-existing failures unrelated) ✅

First Atropos environment to populate distill_token_ids / distill_logprobs
on ScoredDataGroup, enabling on-policy distillation training.

Based on OpenClaw-RL (Princeton, arXiv:2603.10165):
- Extracts hindsight hints from next-state signals (tool results, errors)
- Uses LLM judge with majority voting for hint extraction
- Scores student tokens under hint-enhanced distribution via get_logprobs
- Packages teacher's top-K predictions as distillation targets

Architecture:
- AgenticOPDEnv extends HermesAgentBaseEnv
- Overrides collect_trajectories to add OPD pipeline after standard rollouts
- Uses Atropos's built-in get_logprobs (VLLM prompt_logprobs) for teacher scoring
- No external servers needed — same VLLM backend handles both rollouts and scoring

Task: Coding problems with test verification (8 built-in tasks, HF dataset support)
Reward: correctness (0.7) + efficiency (0.15) + tool usage (0.15)
OPD: Per-turn hint extraction → enhanced prompt → teacher top-K logprobs

Configurable: opd_enabled, distill_topk, prm_votes, hint truncation length
Metrics: opd/mean_hints_per_rollout, opd/mean_turns_scored, opd/hint_rate

anthropic/claude-opus-4.6 (OpenRouter format) was being sent as
claude-opus-4.6 to the Anthropic API, which expects claude-opus-4-6
(hyphens, not dots).

normalize_model_name() now converts dots to hyphens after stripping
the provider prefix, matching Anthropic's naming convention.

Fixes 404: 'model: claude-opus-4.6 was not found'
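
The behavior described in this commit can be illustrated with a small stand-in. It is not the repository's `normalize_model_name()`: the name `normalize_model_name_sketch` is hypothetical, and the real function may handle provider prefixes and other model families differently.

```python
def normalize_model_name_sketch(model: str) -> str:
    """Strip an OpenRouter-style provider prefix, then turn dots into hyphens."""
    # "anthropic/claude-opus-4.6" -> "claude-opus-4.6"
    name = model.split("/", 1)[1] if "/" in model else model
    # "claude-opus-4.6" -> "claude-opus-4-6"
    return name.replace(".", "-")
```

Names that are already in hyphenated form pass through unchanged, so applying the function twice gives the same result as applying it once.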
@teknium1 teknium1 merged commit c097e56 into main Mar 13, 2026
1 check failed