
[Feature]: Autoresearch skill - autonomous git-based experiment loop for ML optimization and knowledge research #5114

@tugrulguner

Problem or Use Case

What Happens Now

Hermes agents write and overwrite files (train.py, research.md) repeatedly without keeping track of whether each change actually improved anything. If you ask an agent to "research ML models" or "build a competitive analysis," it produces output but has no mechanism to:

  1. Compare experiments against a baseline
  2. Discard bad experiments and keep only what worked
  3. Maintain progress across long-running background runs
  4. Self-evaluate the quality of its own additions before merging

Why I want it

An agent should be able to run multiple experiments, objectively evaluate each one, and keep only the improvements - like a human researcher who tries approaches and only writes up the ones that work. The main branch should always hold the best version, preventing the destructive overwriting loop.

Acceptance Criteria

  • Agent initializes research workspace with state files (config, status, plan, control, checkpoint)
  • Agent creates experiments, branches from main for each, evaluates via deterministic metric (ML) or structured rubric (knowledge)
  • Experiments that improve the target metric/score are merged into main; experiments that regress are reverted
  • Research runs autonomously via cron without blocking the active chat
  • User can pause/stop/resume mid-run via control.json
  • Budget enforcement (time, tokens, experiment hard cap) prevents runaway runs
  • Watchdog cron monitors progress and alerts on stalls (>30 min without update)
  • Final report auto-delivers with findings, experiment log, and token usage

Open questions

  • Should the self-evaluation rubric use a different threshold for domain-specific research (e.g., security audit needs evidence=4 minimum)?
  • Should parallel experiments run via delegate_task for faster iteration?

Proposed Solution

A git-based branch -> experiment -> evaluate -> merge/revert loop inspired by Karpathy's autoresearch. For each experiment, the agent:

  1. Creates a branch from main (git checkout -b exp_N main)
  2. Does the work (modifies code, gathers data, trains models)
  3. Evaluates via metric or rubric
  4. If improved -> merges exp_N into main (git checkout main, then git merge exp_N), which becomes the new baseline
  5. If worse -> returns to main with git checkout -f main (the branch is abandoned; the previous best is preserved)
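The steps above map onto a small set of git commands. The sketch below only builds the command strings, in the spirit of a helper that emits shell commands for the agent to run; `experiment_commands` and its `improved` parameter are hypothetical and are not taken from workspace.py.

```python
def experiment_commands(n, improved=None):
    """Return the git commands for one phase of experiment `n`.

    improved=None  -> start the experiment (branch from main)
    improved=True  -> merge the branch into main (new baseline)
    improved=False -> force-checkout main, dropping uncommitted changes
    """
    branch = f"exp_{n}"
    if improved is None:
        return [f"git checkout -b {branch} main"]
    if improved:
        # Assumes the experiment's work was committed on the branch first.
        return ["git checkout main", f"git merge {branch}"]
    return ["git checkout -f main"]


if __name__ == "__main__":
    for phase in (None, True, False):
        print(phase, experiment_commands(3, phase))
```

Because a rejected experiment never touches main, the revert path needs no history rewriting: switching back is enough.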

Supported flows:

Flow 1: ML/Code Optimization - Iterate on train.py, try different models/hyperparameters, keep only what beats baseline. Deterministic: metric comparison only.
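A deterministic metric comparison for this flow might look like the sketch below. `beats_baseline` and `higher_is_better` are hypothetical names. Two behaviors here are assumptions rather than anything stated in this issue: a tie counts as no improvement, and the first result becomes the baseline.

```python
def beats_baseline(value, prev_best, higher_is_better=True):
    """Return True when `value` strictly improves on `prev_best`.

    Assumptions: a tie is not an improvement, and a missing baseline
    (prev_best is None) means the first result is always kept.
    """
    if prev_best is None:
        return True
    if higher_is_better:
        return value > prev_best
    return value < prev_best
```

For an accuracy-style metric the default applies; for a loss-style metric, pass `higher_is_better=False`.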

Flow 2: Knowledge Research - Build research.md via web research. Self-eval rubric: evidence (1-5), accuracy (1-5), depth (1-5), relevance (1-5), net improvement (1-5). Total >= 13 with evidence >= 3 and relevance >= 3 -> merge.
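The merge rule for this flow can be written down directly. The thresholds (total >= 13, evidence >= 3, relevance >= 3) and the five criteria are from the description above; the function name `rubric_decision` and its return shape are hypothetical and may differ from evaluate.py.

```python
CRITERIA = ("evidence", "accuracy", "depth", "relevance", "net_improvement")


def rubric_decision(evidence, accuracy, depth, relevance, net_improvement):
    """Score a knowledge experiment and decide whether to merge it."""
    scores = dict(zip(CRITERIA, (evidence, accuracy, depth, relevance, net_improvement)))
    for name, value in scores.items():
        if not 1 <= value <= 5:
            raise ValueError(f"{name} must be between 1 and 5, got {value}")
    total = sum(scores.values())
    # Merge only when the total and both per-criterion floors are met.
    merge = total >= 13 and evidence >= 3 and relevance >= 3
    return {"total": total, "merge": merge}
```

Note that a high total alone is not sufficient: scores of 2, 5, 5, 5, 5 sum to 22 but fail the evidence floor, so the experiment is reverted.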

Flow 3: Code Analysis/Security Audit - Use tree-sitter and static analysis to find issues. Must include file paths, line numbers, code snippets to merge.
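The merge precondition for this flow (file path, line number, and code snippet all present) could be enforced with a simple gate. This is an illustrative sketch; the `Finding` record and `is_mergeable` are hypothetical and not part of the fork.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    # Hypothetical record shape for one audit finding.
    file_path: str
    line_number: int
    snippet: str
    description: str = ""


def is_mergeable(finding):
    """A finding qualifies only if it cites a file, a 1-based line, and code."""
    return (
        bool(finding.file_path.strip())
        and finding.line_number >= 1
        and bool(finding.snippet.strip())
    )
```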

Flow 4: Recurring Competitive Intelligence - Weekly cron that checks for competitor changes, merges only genuinely new findings.

Flow 5: Product Requirements / Spec Refinement - Start with rough PRD, iteratively refine via research on feasibility, competitors, market gaps.

Implementation Notes

WHERE

Already implemented on the fork branch feat/autoresearch-skill at tugrulguner/hermes-agent:

skills/research/autoresearch/
├── SKILL.md                    # Skill definition and trigger conditions
├── scripts/
│   ├── state.py                # Atomic JSON I/O, budget enforcement
│   ├── plan.py                 # Experiment CRUD (investigate/deepen/verify/synthesize)
│   ├── evaluate.py             # Scoring rubric + ML metric comparison
│   ├── workspace.py            # Git branch/merge/revert operations
│   ├── report.py               # Markdown report generation from results.log
│   ├── registry.py             # Multi-user run tracking
│   └── usage.py                # Token/cost tracking via SessionDB
├── templates/
│   ├── cron_prompt.md          # Main loop (4 phases: setup, planning, loop, synthesis)
│   ├── watchdog_prompt.md      # Progress monitor (every 15 min)
│   └── resume_prompt.md        # Resume from checkpoint
├── test_e2e.py                 # 32 assertions
└── test_integration.py         # 40 tests, 9 classes

HOW

Agent launches via: cronjob(action="create", name="research-", schedule="1m", skills=["autoresearch"], prompt=<filled_template>, deliver="origin")

Key implementation:

  • state.py: tempfile.mkstemp + os.replace for atomic writes. Budget checks compare elapsed time, tokens, experiments vs hard caps. control.json read before every loop iteration.
  • workspace.py: git init --initial-branch=main with fallback for older git. Outputs shell commands for agent to execute.
  • evaluate.py: two scoring paths - score(evidence, accuracy, depth, relevance, net_improvement) for knowledge, log-result-ml(metric_name, metric_value, prev_best) for ML.
  • report.py: parses results.log for merge/revert decisions, reads usage.json for token counts.
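The atomic-write pattern named for state.py (tempfile.mkstemp followed by os.replace) can be sketched as below, together with a control.json read. This is an illustration under assumptions, not the code on the fork: `atomic_write_json` and `read_control` are hypothetical names, and the `"action"` key and the `"run"` default are invented for the example.

```python
import json
import os
import tempfile


def atomic_write_json(path, data):
    """Write JSON so that readers never observe a partially written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the target directory so that os.replace
    # stays on one filesystem and remains atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(data, handle, indent=2)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def read_control(path):
    """Return the requested action; defaults to 'run' (assumed key and default)."""
    try:
        with open(path, encoding="utf-8") as handle:
            return json.load(handle).get("action", "run")
    except FileNotFoundError:
        return "run"
```

Calling `read_control` at the top of every loop iteration means a pause or stop written by the user takes effect before the next experiment starts, and the atomic write ensures the loop never parses a half-written file.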

DEPENDENCIES

Zero new dependencies - all stdlib Python. Cron system already exists; messaging disabled for cron by design (handled by watchdog pattern).

Alternatives Considered

Using delegate_task instead of cronjob for background runs was ruled out because delegate_task blocks the conversation, whereas cron is fire-and-forget and has built-in auto-delivery of the final result.

Feature Type

Skill

Scope

Research, ML, Knowledge

Contribution

  • I'd like to implement this myself and submit a PR

Labels: P3 (Low: cosmetic, nice to have), tool/skills (Skills system: list, view, manage), type/feature (New feature or request)
