feat(skills): add autoresearch — autonomous git-based experiment loop #5175

Open
tugrulguner wants to merge 3 commits into NousResearch:main from tugrulguner:feat/autoresearch-v2

@tugrulguner tugrulguner commented Apr 5, 2026

Contributor

Summary

Adds the autoresearch pattern as a native Hermes skill: autonomous background research using a branch → experiment → evaluate → merge/revert git loop. Zero external dependencies.

What It Does

  1. Initialize — Creates run directory with state files (config, status, plan, control, checkpoint)
  2. Plan — Breaks research goal into experiments (investigate, deepen, verify, synthesize)
  3. Loop — For each experiment:
    • git checkout -b exp_N from main
    • Do the work (modify code, gather data, train ML)
    • Evaluate: deterministic metric (ML) or self-eval rubric (knowledge)
    • Merge if improved / Revert if worse
  4. Report — Markdown report with findings, experiment log, token usage
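
A minimal sketch of how step 3 could be turned into git commands for the agent to run. The function names are hypothetical and are not taken from workspace.py. It also assumes that "revert" means returning to main and deleting the experiment branch, which the description does not spell out.

```python
import shlex


def branch_command(n: int, base: str = "main") -> str:
    """Command that starts experiment n on a fresh branch cut from base."""
    return f"git checkout -b {shlex.quote(f'exp_{n}')} {shlex.quote(base)}"


def finish_commands(n: int, improved: bool, base: str = "main") -> list:
    """Commands that close experiment n: merge it if it improved, else drop it."""
    branch = shlex.quote(f"exp_{n}")
    base_q = shlex.quote(base)
    if improved:
        # Keep the work: fold the experiment branch back into the base branch.
        return [f"git checkout {base_q}", f"git merge --no-ff {branch}"]
    # Assumed revert strategy: discard the branch so the base stays untouched.
    return [f"git checkout {base_q}", f"git branch -D {branch}"]


if __name__ == "__main__":
    print(branch_command(1))
    for command in finish_commands(1, improved=False):
        print(command)
```

Returning command strings instead of executing them matches the note later in this description that workspace.py outputs commands for the agent to execute.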

Supported Flows

Flow 1 — ML/Code Optimization: Iterate on train.py, try different models/hyperparameters, keep only what beats baseline. Evaluation is deterministic: metric comparison (val_bpb, accuracy, latency). Example: "Optimize this random forest classifier on the wine dataset."
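
As an illustration of the deterministic comparison in Flow 1, here is a hedged sketch built around the score_ml(metric_value, prev_best) signature listed under Key Implementation Details. The higher_is_better flag and the "merge"/"revert" return strings are assumptions added for this example. A direction flag of some kind is needed because val_bpb and latency improve downward while accuracy improves upward.

```python
from typing import Optional


def score_ml(metric_value: float, prev_best: Optional[float],
             higher_is_better: bool = True) -> str:
    """Return "merge" if the new metric strictly beats the previous best."""
    if prev_best is None:
        # Assumption: the first measurement simply becomes the baseline.
        return "merge"
    if higher_is_better:
        improved = metric_value > prev_best
    else:
        improved = metric_value < prev_best
    # Ties count as "no improvement", so only a strict win is kept.
    return "merge" if improved else "revert"


if __name__ == "__main__":
    print(score_ml(0.97, 0.95))                          # accuracy went up
    print(score_ml(1.10, 1.05, higher_is_better=False))  # val_bpb went up
```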

Flow 2 — Knowledge Research: Build research.md via web research. Self-eval rubric: evidence (1-5), accuracy (1-5), depth (1-5), relevance (1-5), net improvement (1-5). Total >= 13 with evidence >= 3 and relevance >= 3 → merge. Example: "Research the AI coding agents market — pricing, features, market share."

Flow 3 — Code Analysis/Security Audit: Analyze a codebase for issues using static analysis. Experiments add findings with file paths, line numbers, and code snippets. Only merges with concrete evidence. Example: "Audit this repo for OWASP top 10 vulnerabilities."

Flow 4 — Recurring Competitive Intelligence: Weekly cron that checks for competitor changes, merges only genuinely new findings. Same knowledge rubric but experiments focus on delta from last run. Example: "Track weekly changes in competitor pricing pages."

Flow 5 — Product Requirements / Spec Refinement: Start with a rough PRD, iteratively refine via research on feasibility, competitors, and market gaps. Each experiment deepens a section. Example: "Refine this PRD for a developer tools product."

Architecture

skills/research/autoresearch/
├── SKILL.md                    # Skill definition and trigger conditions
├── scripts/                    # 8 helper scripts (stdlib Python only)
│   ├── _util.py                # Shared atomic write, JSON I/O, HERMES_HOME
│   ├── state.py                # Atomic JSON I/O, budget enforcement
│   ├── plan.py                 # Experiment CRUD (investigate/deepen/verify/synthesize)
│   ├── evaluate.py             # Scoring rubric + ML metric comparison
│   ├── workspace.py            # Git branch/merge/revert operations
│   ├── report.py               # Markdown report generation from results.log
│   ├── registry.py             # Multi-user run tracking
│   └── usage.py                # Token/cost tracking via SessionDB
├── templates/
│   ├── cron_prompt.md          # Main loop (4 phases: setup, planning, loop, synthesis)
│   ├── watchdog_prompt.md      # Progress monitor (every 15 min)
│   └── resume_prompt.md        # Resume from checkpoint
├── test_e2e.py                 # 32 assertions
└── test_integration.py         # 44 tests, 10 classes

Testing

  • test_integration.py: 44 tests, 10 classes — ALL PASSING
  • test_e2e.py: 32 assertions — ALL PASSING
  • Full loop simulation: 83 assertions (init → plan → 6 experiments → pause/resume → budget enforcement → report) — ALL PASSING
  • P1 fix tests: 17 assertions (token auto-read from usage.json, tier param enforcement) — ALL PASSING
  • Mode 1 (ML): Wine dataset RF baseline 100%, 6 experiments all correctly evaluated
  • Mode 2 (Knowledge): AI coding agents analysis, 6 experiments merged, 450-line report

What's tested: All helper scripts, the full Phase 1→4 loop (init, plan, branch, evaluate, merge/revert, pause/resume, budget enforcement, report generation, registry, usage tracking), template variable references, git history integrity.

What needs live validation: The actual cron scheduling and gateway delivery path, i.e. launching via cronjob(), the watchdog firing every 15 minutes, and final report auto-delivery. This requires a persistent Hermes gateway with cron enabled. The skill uses the existing cron system as-is (no modifications to it), so the integration risk is low.

How to Use

Launch from any Hermes Agent session:

cronjob(action="create", name="research-<id>", schedule="1m",
        skills=["autoresearch"], prompt=<filled_template>, deliver="origin")

Key Implementation Details

  • state.py: tempfile.mkstemp + os.replace for atomic writes. Budget checks auto-read usage.json for token enforcement. control.json read before every loop iteration.
  • workspace.py: git init with initial empty commit so main branch exists immediately. shlex.quote() on all shell-interpolated values. Outputs commands for agent to execute.
  • evaluate.py: Two scoring paths - score_knowledge(evidence, accuracy, depth, relevance, net_improvement) for knowledge, score_ml(metric_value, prev_best) for ML.
  • report.py: Parses results.log for merge/revert decisions, reads usage.json for token counts. Summary shows top findings when merged > 0.
  • registry.py / usage.py: Respect HERMES_HOME env var for non-default profiles and test environments.
  • cron_prompt.md: Passes --max-duration and --max-tokens to state.py init so Quick/Deep/Unlimited tiers get correct budget limits.
  • Zero dependencies: All stdlib Python. No changes to core Hermes.
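
For reviewers unfamiliar with the tempfile.mkstemp + os.replace pattern mentioned for state.py, a minimal sketch follows. The function name atomic_write_json is hypothetical, and the fsync call is an extra durability step that the description does not claim the skill performs.

```python
import json
import os
import tempfile


def atomic_write_json(path: str, data: dict) -> None:
    """Write data as JSON so readers never observe a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the target directory: os.replace is only
    # atomic when source and destination are on the same filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(data, handle, indent=2)
            handle.flush()
            os.fsync(handle.fileno())  # assumed extra step, not from the PR
        os.replace(tmp_path, path)  # atomic swap over any existing file
    except BaseException:
        # Clean up the temp file if anything failed before the swap.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as run_dir:
        atomic_write_json(os.path.join(run_dir, "status.json"), {"phase": "loop"})
        print(os.listdir(run_dir))
```

A watchdog or resumed agent that reads the file at any moment sees either the old version or the new one, which is the property the loop relies on for status.json and checkpoint.json.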

Related

Contribution

  • I have read the contributing guidelines
  • I have tested my changes locally
  • My changes do not introduce any new warnings or errors
  • I have added tests that prove my fix/feature works
  • New and existing tests pass locally with my changes

Closes #5114

@tugrulguner

Contributor Author

Q&A (from maintainer review)

Evaluation metric for text outputs

Karpathy's autoresearch uses val_bpb as a clear numerical objective. For research.md, the equivalent is a 5-criteria self-evaluation rubric in evaluate.py: evidence, accuracy, depth, relevance, and net improvement (each 1-5, total out of 25).

Honest answer: Yes, this is LLM self-evaluation. Unlike val_bpb, there is no ground-truth numerical metric for text quality. The rubric mitigates hallucination and lazy writing through hard gates:

  • Evidence >= 3 — Must cite real sources, provide concrete data/quotes. A merge with evidence=1 (no sourcing) is impossible — the scoring function auto-rejects it.
  • Net improvement >= 3 — Must add meaningful content. If the agent just rewrites without adding value, it scores 1 and auto-reverts.
  • Total >= 13 — The overall quality floor.
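
The gates above can be sketched as follows. This is an illustration, not the code in evaluate.py: only the five-argument score_knowledge signature comes from the PR, while the min_total and gates parameters and the returned dictionary are assumptions. The gates are configurable because the Flow 2 description names relevance >= 3 as a gate while this answer names net improvement >= 3.

```python
from typing import Dict, Optional


def score_knowledge(evidence: int, accuracy: int, depth: int, relevance: int,
                    net_improvement: int, min_total: int = 13,
                    gates: Optional[Dict[str, int]] = None) -> dict:
    """Apply the 5-criteria rubric: per-criterion hard gates plus a total floor."""
    scores = {
        "evidence": evidence,
        "accuracy": accuracy,
        "depth": depth,
        "relevance": relevance,
        "net_improvement": net_improvement,
    }
    for name, value in scores.items():
        if not 1 <= value <= 5:
            raise ValueError(f"{name} must be between 1 and 5, got {value}")
    if gates is None:
        # Gates as listed in this answer; pass a different dict to change them.
        gates = {"evidence": 3, "net_improvement": 3}
    total = sum(scores.values())
    failed = sorted(name for name, floor in gates.items() if scores[name] < floor)
    decision = "merge" if total >= min_total and not failed else "revert"
    return {"total": total, "failed_gates": failed, "decision": decision}


if __name__ == "__main__":
    # High total, but no sourcing: the evidence gate forces a revert.
    print(score_knowledge(1, 5, 5, 5, 5))
```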

In practice this worked on our AI coding agents test where 6 experiments all scored 19-24/25 and were merged with real pricing data, feature matrices, and market analysis.

Literature research testing

Not yet. We have validated on Mode 1 (wine dataset ML optimization, 6 experiments) and Mode 2 (AI coding agents competitive analysis, 6 experiments). The system can handle literature research because the agent has browser and web tools, but PDF parsing, source deduplication, and academic citation formatting are untested. The rubric would need tweaking for academic work (e.g., evidence >= 4 minimum, require DOI/URL citations).

Scope of changes

Entirely contained within the skill. Zero changes to core Hermes:

  • No modifications to gateway, cron scheduler, agent core, or delegation system
  • No new Hermes tools or APIs
  • All 8 scripts are stdlib Python, zero external dependencies

If you delete skills/research/autoresearch/, Hermes behaves identically to before.

Iteration limits vs. context window

The git loop does not reduce token consumption per run — the agent might still burn through 50 iterations. The benefit is different:

  • State persistence: checkpoint.json, status.json, and results.log allow the agent to resume after a context reset. If the cron job hits 50 iterations and dies, the next run starts from checkpoint and knows exactly which experiments are done/merged/reverted.
  • Main branch as permanent memory — Even if context is lost, accumulated knowledge lives in the git history. A resumed agent reads research.md on main and picks up from there.
  • Mid-run replanning — Every 5 experiments the agent re-reads stats and plan, which re-focuses the context and avoids drift.
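
To make the resume behaviour concrete, here is a hedged sketch. The checkpoint schema (a "completed" mapping from experiment id to outcome) and all function names are hypothetical. Only the file name checkpoint.json and the every-5-experiments replanning cadence come from the text above.

```python
import json
from pathlib import Path


def load_checkpoint(run_dir: str) -> dict:
    """Read checkpoint.json, or start empty if this is the first run."""
    path = Path(run_dir) / "checkpoint.json"
    if not path.exists():
        return {"completed": {}}
    return json.loads(path.read_text(encoding="utf-8"))


def remaining_experiments(plan_ids: list, checkpoint: dict) -> list:
    """Experiments from the plan that have no recorded outcome yet."""
    done = checkpoint.get("completed", {})
    return [exp_id for exp_id in plan_ids if exp_id not in done]


def should_replan(completed_count: int, every: int = 5) -> bool:
    """True after every block of `every` finished experiments."""
    return completed_count > 0 and completed_count % every == 0


if __name__ == "__main__":
    checkpoint = {"completed": {"exp_1": "merged", "exp_2": "reverted"}}
    print(remaining_experiments(["exp_1", "exp_2", "exp_3"], checkpoint))
```

Because the outcome of each finished experiment is recorded on disk, a fresh context only needs the plan and the checkpoint to decide where to continue.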

@tugrulguner force-pushed the feat/autoresearch-v2 branch from b7f5f82 to a8fe406 on April 5, 2026 13:33
Adds autonomous research skill inspired by Karpathy's autoresearch.
Branch → experiment → evaluate → merge/revert loop for ML optimization
and knowledge research. Zero external dependencies (stdlib Python only).

- 8 helper scripts: state, plan, evaluate, workspace, report, registry, usage, _util
- 3 prompt templates: cron agent, watchdog, resume
- Safety gates: time/token/experiment budgets, watchdog enforcement, stall detection
- Multi-user registry with HERMES_HOME support
- 44 integration tests + 32 e2e assertions, all passing
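
A minimal sketch of the time/token/experiment budget gate named in the commit message above. The function name, its parameters, and the convention that None means "no limit" (for an unlimited tier) are assumptions for illustration and are not taken from state.py.

```python
import time
from typing import List, Optional


def budget_exceeded(started_at: float, tokens_used: int, experiments_done: int,
                    *, max_duration_s: Optional[float] = None,
                    max_tokens: Optional[int] = None,
                    max_experiments: Optional[int] = None,
                    now: Optional[float] = None) -> List[str]:
    """Return the names of every budget that has been used up (empty = keep going)."""
    now = time.time() if now is None else now
    reasons = []
    if max_duration_s is not None and now - started_at >= max_duration_s:
        reasons.append("time")
    if max_tokens is not None and tokens_used >= max_tokens:
        reasons.append("tokens")
    if max_experiments is not None and experiments_done >= max_experiments:
        reasons.append("experiments")
    return reasons


if __name__ == "__main__":
    # 30 minutes into a run with a 20-minute limit and a 100k token limit.
    print(budget_exceeded(0.0, 40_000, 3, max_duration_s=1200,
                          max_tokens=100_000, now=1800.0))
```

Checking this before each loop iteration, next to the control.json read, lets the run stop between experiments and not partway through one.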
@tugrulguner force-pushed the feat/autoresearch-v2 branch from a8fe406 to bc827c0 on April 5, 2026 13:56
…main branch exists

workspace.py init() created the git repo but never committed, leaving
main as an unborn branch. merge/revert would fail with "pathspec 'main'
did not match" on the first experiment cycle. Add --allow-empty commit
so main exists immediately after init.
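
The fix described in this commit can be sketched as a command list. This is an assumed shape and not the actual init() in workspace.py: the function name, the commit message, and the symbolic-ref step that names the unborn branch are all illustrative choices. Only the --allow-empty commit is stated in the commit message.

```python
import shlex


def init_commands(repo_dir: str, base: str = "main") -> list:
    """Commands that create a repo whose base branch exists right after init."""
    repo = shlex.quote(repo_dir)
    return [
        f"git init {repo}",
        # Assumed step: point HEAD at the desired branch name, since the
        # default initial branch depends on the local git configuration.
        f"git -C {repo} symbolic-ref HEAD {shlex.quote('refs/heads/' + base)}",
        # An empty commit turns the unborn branch into a real one, so later
        # checkouts and merges against it can resolve.
        f"git -C {repo} commit --allow-empty -m {shlex.quote('Initial empty commit')}",
    ]


if __name__ == "__main__":
    for command in init_commands("runs/research-001"):
        print(command)
```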
@alt-glitch added the type/feature (New feature or request), P3 (Low: cosmetic, nice to have), and tool/skills (Skills system: list, view, manage) labels on May 1, 2026
@alt-glitch

Collaborator

Likely duplicate of #5112 (closed) — same autoresearch skill PR. Also competing with open #7911.


@tugrulguner

Contributor Author

Likely duplicate of #5112 (closed) — same autoresearch skill PR. Also competing with open #7911.

True, there was something wrong with that PR, so I closed it. This PR has been open for about a month :(


Development

Successfully merging this pull request may close these issues.

[Feature]: Autoresearch skill - autonomous git-based experiment loop for ML optimization and knowledge research