Problem
Current delegation policy in Hermes is static: the orchestrator picks a profile (#4928), a model tier (#7929), or a chain (#7481) based on configured rules or the agent's in-context judgment. There is no feedback loop from delegation outcomes back into future delegation decisions: if claude -p just wasted 90s on a task that the local model would have solved in 30s, nothing stops the orchestrator from picking the same route next time.
Hermes already has the pieces a learning policy needs; what's missing is an adaptive layer that consumes outcomes and biases future dispatch.
Proposal
An opt-in, profile-scoped delegation policy that maps task features → recommended dispatch.
At delegate_task call time the orchestrator asks the policy "given these features, which dispatch wins?"; at completion it reports (features, dispatch, wall_time_ms, cost_tokens, succeeded) back. The policy updates its posterior, so the next similar task is more likely to get a better pick.
Why it matters
Cost: the capability gap between orchestrator and worker is modest on easy-to-medium tasks, so uniformly delegating pays worker cost for no capability gain. An adaptive policy converges on delegating only where the capability gap actually pays off.
No-code personalization: different users have different models, hardware, and providers; a shared static config can't express "delegate to Claude Opus on hard-refactor, stay local for simple edits" for everyone at once. A learned policy adapts per user.
Cold start
There's already a real seed dataset: an internal benchmark we ran of 30 trials × 3 delegation strategies × 10 tasks (strategies A: never-delegate / B: always-delegate / C: rule-based SOUL.md). Results: A won 7/10, C won 2/10 (the two hardest), B won 1/10. That's 30 labeled (task_features, dispatch, wall_time, succeeded) rows ready to bootstrap a policy, plus a ready-made test bed for regression testing the policy head.
Happy to contribute the benchmark data as a delegation_seed.jsonl if this is useful.
Non-goals
Not replacing the agent's in-context reasoning: the policy's output is a prior, and the agent can still override it when it has task-specific knowledge the features don't capture.
Not a training pipeline: a Thompson-sampling / UCB posterior over a small discrete action set is enough; no model training, no GPU.
Not a new backend: persists via the existing Hindsight store so there's one source of truth.
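To make the "no training pipeline" point concrete, here is a minimal sketch of the ask/report loop from the Proposal as Beta-Bernoulli Thompson sampling, using only the Python standard library. Everything named here is hypothetical and not an existing Hermes API: the class ThompsonDelegationPolicy, its recommend / report / load_seed methods, the two dispatch labels, and the budget_ms threshold. The sketch assumes features are a small flat dict of categorical values, that a run counts as a win only if it succeeded within the time budget, and that the seed file's wall_time field is in milliseconds. Persistence to the Hindsight store is left out.

```python
import json
import random
from collections import defaultdict

# Hypothetical dispatch labels; a real action set would come from Hermes config.
DISPATCHES = ("local", "delegate")


class ThompsonDelegationPolicy:
    """Beta-Bernoulli Thompson sampling over a small discrete dispatch set."""

    def __init__(self, dispatches=DISPATCHES, budget_ms=60_000, seed=None):
        self.dispatches = tuple(dispatches)
        self.budget_ms = budget_ms  # assumed "fast enough" threshold
        self.rng = random.Random(seed)
        # (feature_key, dispatch) -> [alpha, beta], starting at Beta(1, 1).
        self.posterior = defaultdict(lambda: [1.0, 1.0])

    @staticmethod
    def feature_key(features):
        # Assumption: features is a flat dict with hashable, categorical values.
        return tuple(sorted(features.items()))

    def recommend(self, features):
        """Draw one sample per dispatch and return the dispatch with the largest draw."""
        key = self.feature_key(features)
        draws = {
            d: self.rng.betavariate(*self.posterior[(key, d)])
            for d in self.dispatches
        }
        return max(draws, key=draws.get)

    def report(self, features, dispatch, wall_time_ms, cost_tokens, succeeded):
        """Update the posterior of the dispatch that was actually used."""
        # cost_tokens is accepted to match the proposed tuple but is unused here.
        win = bool(succeeded) and wall_time_ms <= self.budget_ms
        counts = self.posterior[(self.feature_key(features), dispatch)]
        counts[0 if win else 1] += 1.0

    def load_seed(self, lines):
        """Bootstrap from JSONL rows shaped like the proposed seed dataset."""
        for line in lines:
            if not line.strip():
                continue
            row = json.loads(line)
            self.report(
                row["task_features"], row["dispatch"],
                row["wall_time"], 0, row["succeeded"],
            )


if __name__ == "__main__":
    policy = ThompsonDelegationPolicy(seed=0)
    features = {"difficulty": "easy"}
    choice = policy.recommend(features)
    # Pretend the chosen route finished in 30 s and succeeded.
    policy.report(features, choice, wall_time_ms=30_000, cost_tokens=0, succeeded=True)
    print(choice, dict(policy.posterior))
```

Because the output of recommend is only a sampled prior, the agent can still override it, which matches the first non-goal. Folding wall time into a binary win is one simple design choice; a cost-weighted reward or a UCB rule would fit the same two-method interface.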
Cross-refs