You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Hermes agents write and overwrite files (train.py, research.md) repeatedly without keeping track of whether each change actually improved anything. If you ask an agent to "research ML models" or "build a competitive analysis," it produces output but has no mechanism to:
Compare experiments against a baseline
Discard bad experiments and keep only what worked
Maintain progress across long-running background runs
Self-evaluate the quality of its own additions before merging
Why I want it
An agent should be able to run multiple experiments, objectively evaluate each one, and keep only the improvements - like a human researcher who tries approaches and only writes up the ones that work. The main branch should always hold the best version, preventing the destructive overwriting loop.
AC
Agent initializes research workspace with state files (config, status, plan, control, checkpoint)
Agent creates experiments, branches from main for each, evaluates via deterministic metric (ML) or structured rubric (knowledge)
Experiments that improve the target metric/score are merged into main; experiments that regress are reverted
Research runs autonomously via cron without blocking the active chat
User can pause/stop/resume mid-run via control.json
Budget enforcement (time, tokens, experiment hard cap) prevents runaway runs
Watchdog cron monitors progress and alerts on stalls (>30 min without update)
Final report auto-delivers with findings, experiment log, and token usage
Open questions
Should the self-evaluation rubric use a different threshold for domain-specific research (e.g., security audit needs evidence=4 minimum)?
Should parallel experiments run via delegate_task for faster iteration?
Proposed Solution
A git-based branch -> experiment -> evaluate -> merge/revert loop inspired by Karpathy's autoresearch. For each experiment, the agent:
git checkout -b exp_N from main
Does the work (modifies code, gathers data, trains models)
Evaluates via metric or rubric
If improved -> git merge exp_N into main (new baseline)
If worse -> git checkout -f main (discard branch, previous best preserved)
Supported flows:
Flow 1: ML/Code Optimization - Iterate on train.py, try different models/hyperparameters, keep only what beats baseline. Deterministic: metric comparison only.
Flow 2: Knowledge Research - Build research.md via web research. Self-eval rubric: evidence (1-5), accuracy (1-5), depth (1-5), relevance (1-5), net improvement (1-5). Total >= 13 with evidence >= 3 and relevance >= 3 -> merge.
Flow 3: Code Analysis/Security Audit - Use tree-sitter and static analysis to find issues. Must include file paths, line numbers, code snippets to merge.
Flow 4: Recurring Competitive Intelligence - Weekly cron that checks for competitor changes, merges only genuinely new findings.
Flow 5: Product Requirements / Spec Refinement - Start with rough PRD, iteratively refine via research on feasibility, competitors, market gaps.
Implementation Notes
WHERE
Already implemented on the fork branch feat/autoresearch-skill at tugrulguner/hermes-agent:
state.py: tempfile.mkstemp + os.replace for atomic writes. Budget checks compare elapsed time, tokens, experiments vs hard caps. control.json read before every loop iteration.
workspace.py: git init --initial-branch=main with fallback for older git. Outputs shell commands for agent to execute.
evaluate.py: two scoring paths - score(evidence, accuracy, depth, relevance, net_improvement) for knowledge, log-result-ml(metric_name, metric_value, prev_best) for ML.
report.py: parses results.log for merge/revert decisions, reads usage.json for token counts.
DEPENDENCIES
Zero new dependencies - all stdlib Python. Cron system already exists; messaging disabled for cron by design (handled by watchdog pattern).
Alternatives Considered
Using delegate_task instead of cronjob for background runs - ruled out because delegate_task blocks the conversation while cron is fire-and-forget, and cron has built-in auto-delivery of the final result.
Problem or Use Case
What Happens Now
Hermes agents write and overwrite files (train.py, research.md) repeatedly without keeping track of whether each change actually improved anything. If you ask an agent to "research ML models" or "build a competitive analysis," it produces output but has no mechanism to:
Why I want it
An agent should be able to run multiple experiments, objectively evaluate each one, and keep only the improvements - like a human researcher who tries approaches and only writes up the ones that work. The main branch should always hold the best version, preventing the destructive overwriting loop.
AC
Open questions
Proposed Solution
A git-based branch -> experiment -> evaluate -> merge/revert loop inspired by Karpathy's autoresearch. For each experiment, the agent:
Supported flows:
Flow 1: ML/Code Optimization - Iterate on train.py, try different models/hyperparameters, keep only what beats baseline. Deterministic: metric comparison only.
Flow 2: Knowledge Research - Build research.md via web research. Self-eval rubric: evidence (1-5), accuracy (1-5), depth (1-5), relevance (1-5), net improvement (1-5). Total >= 13 with evidence >= 3 and relevance >= 3 -> merge.
Flow 3: Code Analysis/Security Audit - Use tree-sitter and static analysis to find issues. Must include file paths, line numbers, code snippets to merge.
Flow 4: Recurring Competitive Intelligence - Weekly cron that checks for competitor changes, merges only genuinely new findings.
Flow 5: Product Requirements / Spec Refinement - Start with rough PRD, iteratively refine via research on feasibility, competitors, market gaps.
Implementation Notes
WHERE
Already implemented on the fork branch feat/autoresearch-skill at tugrulguner/hermes-agent:
HOW
Agent launches via: cronjob(action="create", name="research-", schedule="1m", skills=["autoresearch"], prompt=<filled_template>, deliver="origin")
Key implementation:
DEPENDENCIES
Zero new dependencies - all stdlib Python. Cron system already exists; messaging disabled for cron by design (handled by watchdog pattern).
Alternatives Considered
Using delegate_task instead of cronjob for background runs - ruled out because delegate_task blocks the conversation while cron is fire-and-forget, and cron has built-in auto-delivery of the final result.
Feature Type
Skill
Scope
Research, ML, Knowledge
Contribution