feat: Agentic On-Policy Distillation (OPD) environment #1149
Merged
First Atropos environment to populate distill_token_ids / distill_logprobs on ScoredDataGroup, enabling on-policy distillation training.

Based on OpenClaw-RL (Princeton, arXiv:2603.10165):
- Extracts hindsight hints from next-state signals (tool results, errors)
- Uses LLM judge with majority voting for hint extraction
- Scores student tokens under hint-enhanced distribution via get_logprobs
- Packages teacher's top-K predictions as distillation targets

Architecture:
- AgenticOPDEnv extends HermesAgentBaseEnv
- Overrides collect_trajectories to add OPD pipeline after standard rollouts
- Uses Atropos's built-in get_logprobs (VLLM prompt_logprobs) for teacher scoring
- No external servers needed: the same VLLM backend handles both rollouts and scoring

Task: Coding problems with test verification (8 built-in tasks, HF dataset support)
Reward: correctness (0.7) + efficiency (0.15) + tool usage (0.15)
OPD: Per-turn hint extraction → enhanced prompt → teacher top-K logprobs
Configurable: opd_enabled, distill_topk, prm_votes, hint truncation length
Metrics: opd/mean_hints_per_rollout, opd/mean_turns_scored, opd/hint_rate
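The majority-voting and top-K packaging steps named in this commit message can be illustrated with a minimal, standalone sketch. It is not the code from this PR: the function names, the strict-majority rule, and the input shape (one dict of token id to logprob per scored position) are assumptions made for illustration, and the real layout of distill_token_ids / distill_logprobs on ScoredDataGroup may differ.

```python
import heapq
from collections import Counter
from typing import Dict, List, Optional, Tuple


def majority_vote_hint(votes: List[Optional[str]]) -> Optional[str]:
    """Return the hint backed by a strict majority of judge votes, else None.

    Each vote is a hint string, or None when the judge found no usable hint.
    (Hypothetical rule: the PR only states that majority voting is used.)
    """
    if not votes:
        return None
    winner, count = Counter(votes).most_common(1)[0]
    return winner if count > len(votes) / 2 else None


def topk_targets(
    position_logprobs: List[Dict[int, float]], k: int
) -> Tuple[List[List[int]], List[List[float]]]:
    """Keep the k most likely teacher tokens at each scored position.

    position_logprobs is an assumed input shape: one {token_id: logprob}
    mapping per position, as might be derived from prompt logprobs.
    """
    token_ids: List[List[int]] = []
    logprobs: List[List[float]] = []
    for dist in position_logprobs:
        # Sort candidates by logprob, highest first, and keep the top k.
        top = heapq.nlargest(k, dist.items(), key=lambda kv: kv[1])
        token_ids.append([tok for tok, _ in top])
        logprobs.append([lp for _, lp in top])
    return token_ids, logprobs


if __name__ == "__main__":
    hint = majority_vote_hint(["use pathlib", "use pathlib", None])
    ids, lps = topk_targets([{1: -0.1, 2: -2.0, 3: -1.0}], k=2)
    print(hint, ids, lps)
```

With prm_votes set to 3, two matching votes are enough to accept a hint under this assumed rule; a turn with no majority hint would simply contribute no distillation targets.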
anthropic/claude-opus-4.6 (OpenRouter format) was being sent as claude-opus-4.6 to the Anthropic API, which expects claude-opus-4-6 (hyphens, not dots). normalize_model_name() now converts dots to hyphens after stripping the provider prefix, matching Anthropic's naming convention. Fixes 404: 'model: claude-opus-4.6 was not found'
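The behaviour described in this commit message can be sketched as follows. This is an illustrative stand-in, not the repository's normalize_model_name(): the function name below is hypothetical, and it assumes the provider prefix is everything up to the first slash.

```python
def normalize_model_name_sketch(model: str) -> str:
    """Strip a provider prefix, then convert dots to hyphens."""
    # "anthropic/claude-opus-4.6" -> "claude-opus-4.6"
    name = model.split("/", 1)[-1]
    # "claude-opus-4.6" -> "claude-opus-4-6"
    return name.replace(".", "-")


if __name__ == "__main__":
    print(normalize_model_name_sketch("anthropic/claude-opus-4.6"))
```

Names that already use hyphens and carry no prefix pass through unchanged.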
angelburgosrosado pushed a commit to angelburgosrosado/hermes-agent that referenced this pull request (Apr 27, 2026): …d28bf447 feat: Agentic On-Policy Distillation (OPD) environment
02356abc pushed a commit to 02356abc/hermes-agent that referenced this pull request (May 14, 2026): …d28bf447 feat: Agentic On-Policy Distillation (OPD) environment
olympus-terminal pushed a commit to olympus-terminal/hermes-agent that referenced this pull request (May 16, 2026): …d28bf447 feat: Agentic On-Policy Distillation (OPD) environment
Egavasyug pushed a commit to Egavasyug/hermes-agent that referenced this pull request (Jun 10, 2026): …d28bf447 feat: Agentic On-Policy Distillation (OPD) environment
Summary
First Atropos environment to populate distill_token_ids / distill_logprobs on ScoredDataGroup, enabling on-policy distillation (OPD) training for agentic tasks.
Background
Based on OpenClaw-RL (Princeton, March 2026), which showed that next-state signals from agent interactions contain two forms of training data that would otherwise be wasted.
Their combined OPD+RL method improved personalization from 0.17 to 0.81 in 36 conversations, and process rewards improved tool-call accuracy by 76% compared with outcome-only rewards.
What this PR adds
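environments/agentic_opd_env.py, a new Atropos environment.
OPD Pipeline (the novel part):
- Extracts hindsight hints from next-state signals (tool results, errors)
- Uses an LLM judge with majority voting for hint extraction
- Scores student tokens under the hint-enhanced distribution via the get_logprobs API (VLLM prompt_logprobs)
- Packages the teacher's top-K predictions as distill_token_ids / distill_logprobs on ScoredDataGroup
Task: Coding problems with test verification
Reward: Multi-signal scoring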
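As a rough illustration of multi-signal scoring, the sketch below combines three component scores using the weights stated in the commit message (correctness 0.7, efficiency 0.15, tool usage 0.15). The function name, the assumption that each component lies in [0, 1], and the clamping are hypothetical; how the environment actually computes each component is not described here.

```python
def combined_reward(correctness: float, efficiency: float, tool_usage: float) -> float:
    """Weighted sum of three component scores, each assumed to lie in [0, 1]."""

    def clamp(x: float) -> float:
        # Guard against out-of-range component scores (an assumption).
        return max(0.0, min(1.0, x))

    return (
        0.7 * clamp(correctness)
        + 0.15 * clamp(efficiency)
        + 0.15 * clamp(tool_usage)
    )


if __name__ == "__main__":
    # A correct solution that was inefficient and used tools moderately well.
    print(combined_reward(correctness=1.0, efficiency=0.0, tool_usage=0.5))
```

Under these weights a correct but wasteful rollout still scores at least 0.7, so correctness dominates while efficiency and tool usage act as smaller shaping terms.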
Configuration (AgenticOPDConfig):
- opd_enabled: toggle OPD pipeline (default: True)
- distill_topk: top-K teacher predictions per position (default: 50)
- prm_votes: majority voting count for hint judge (default: 3)
- hint_max_next_state_chars: truncation for long tool outputs (default: 4000)
WandB Metrics: opd/mean_hints_per_rollout, opd/mean_turns_scored, opd/hint_rate
Architecture
No external servers needed: the same VLLM backend that generates rollouts also scores teacher logprobs via prompt_logprobs.
Requirements
- VLLM backend (server_type: vllm) for prompt logprob scoring
Test Plan