feat(skills): add autoresearch — autonomous git-based experiment loop by tugrulguner · Pull Request #5112 · NousResearch/hermes-agent

tugrulguner · 2026-04-04T20:18:12Z

Summary

Adds the autoresearch pattern as a native Hermes skill: autonomous background research using a branch → experiment → evaluate → merge/revert git loop. Zero external dependencies.

What It Does

Initialize — Creates run directory with state files (config, status, plan, control, checkpoint)
Plan — Breaks research goal into experiments (investigate, deepen, verify, synthesize)
Loop — For each experiment:
- git checkout -b exp_N from main
- Do the work (modify code, gather data, train ML)
- Evaluate: deterministic metric (ML) or self-eval rubric (knowledge)
- Merge if improved / Revert if worse
Report — Markdown report with findings, experiment log, token usage
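The loop step above can be sketched roughly as follows. This is a minimal illustration, not the contents of workspace.py: `run_experiment`, `real_git`, and the callback parameters are hypothetical names, and "revert" is assumed here to mean discarding the experiment branch without merging it.

```python
import subprocess
from typing import Callable, Sequence

# A git runner takes the arguments that follow "git" on the command line.
GitRunner = Callable[[Sequence[str]], object]


def real_git(args: Sequence[str]) -> None:
    """Run a git subcommand in the current directory, raising on failure."""
    subprocess.run(["git", *args], check=True)


def run_experiment(
    n: int,
    git: GitRunner,
    do_work: Callable[[], None],
    evaluate: Callable[[], float],
    baseline: float,
    higher_is_better: bool = True,
) -> float:
    """Branch, do the work, evaluate, then merge or discard.

    Returns the score that main holds afterwards: the new score if the
    experiment was merged, otherwise the unchanged baseline.
    """
    branch = f"exp_{n}"
    git(["checkout", "-b", branch, "main"])
    do_work()
    git(["add", "-A"])
    # --allow-empty keeps the commit from failing if the work changed nothing.
    git(["commit", "--allow-empty", "-m", f"experiment {n}"])
    score = evaluate()
    improved = score > baseline if higher_is_better else score < baseline
    git(["checkout", "main"])
    if improved:
        git(["merge", "--no-ff", branch])
        return score
    # Assumed meaning of "revert": drop the branch, leave main untouched.
    git(["branch", "-D", branch])
    return baseline
```

Passing the git runner in as a parameter keeps the control flow testable without a repository; in a real run, `real_git` would be passed as `git`.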

Architecture

skills/research/autoresearch/
|-- SKILL.md              Skill definition
|-- scripts/              7 helper scripts (stdlib Python only)
|   |-- state.py          Atomic JSON I/O, budget enforcement
|   |-- plan.py           Experiment CRUD
|   |-- evaluate.py       Scoring rubric + ML metric comparison
|   |-- workspace.py      Git branch/merge/revert
|   |-- report.py         Report generation
|   |-- registry.py       Multi-user tracking
|   |-- usage.py          Token/cost tracking
|-- templates/            Cron, watchdog, resume prompts
|-- test_e2e.py           32 assertions
|-- test_integration.py   40 tests, 9 classes
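For readers unfamiliar with the "atomic JSON I/O" noted for state.py, one common stdlib-only approach is to write to a temporary file in the same directory and then rename it over the target. The sketch below illustrates that technique under this assumption; `write_json_atomic` and `read_json` are hypothetical names and may not match what state.py actually exposes.

```python
import json
import os
import tempfile


def write_json_atomic(path: str, data: object) -> None:
    """Write JSON so readers see either the old file or the new one, never a partial one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(data, handle, indent=2)
            handle.flush()
            os.fsync(handle.fileno())
        # os.replace overwrites the destination in a single rename step.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def read_json(path: str, default: object = None) -> object:
    """Read a JSON state file, returning `default` if it does not exist yet."""
    try:
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    except FileNotFoundError:
        return default
```

A crash in the middle of a write then leaves the previous state file intact, which matters when a cron-driven run can be interrupted at any point.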

Testing

test_integration.py: 40 tests, 9 classes — ALL PASSING
test_e2e.py: 32 assertions — ALL PASSING
Mode 1 (ML): Wine dataset, random forest (RF) baseline at 100%; all 6 experiments correctly evaluated
Mode 2 (Knowledge): AI coding agents analysis, 6 experiments merged, 450-line report

How to Use

Launch from any Hermes Agent session:

cronjob(action="create", name="research-<id>", schedule="1m",
        skills=["autoresearch"], prompt=<filled_template>, deliver="origin")

Q&A (from maintainer review)

Evaluation metric for text outputs

Karpathy's autoresearch uses val_bpb as a clear numerical objective. For research.md, the equivalent is a 5-criteria self-evaluation rubric in evaluate.py: evidence, accuracy, depth, relevance, and net improvement (each 1-5, total out of 25).

Honest answer: Yes, this is LLM self-evaluation. Unlike val_bpb, there is no ground-truth numerical metric for text quality. The rubric mitigates hallucination and lazy writing through hard gates:

Evidence >= 3 — Must cite real sources, provide concrete data/quotes. A merge with evidence=1 (no sourcing) is impossible — the scoring function auto-rejects it.
Net improvement >= 3 — Must add meaningful content. If the agent just rewrites without adding value, it scores 1 and auto-reverts.
Total >= 13 — The overall quality floor.

In practice, this worked on our AI coding agents test: all 6 experiments scored 19-24/25 and were merged with real pricing data, feature matrices, and market analysis.
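The three hard gates described above can be expressed as a small decision function. This is a sketch under stated assumptions, not the code in evaluate.py: the function name `should_merge`, the key names, and the return shape are hypothetical; only the five criteria, the 1-5 scale, and the thresholds come from the description.

```python
from typing import Dict, Tuple

CRITERIA = ("evidence", "accuracy", "depth", "relevance", "net_improvement")

# Thresholds as stated in the rubric description.
MIN_EVIDENCE = 3
MIN_NET_IMPROVEMENT = 3
MIN_TOTAL = 13


def should_merge(scores: Dict[str, int]) -> Tuple[bool, int, str]:
    """Apply the hard gates to a rubric score; returns (merge?, total, reason)."""
    for name in CRITERIA:
        value = scores[name]
        if not 1 <= value <= 5:
            raise ValueError(f"{name} must be between 1 and 5, got {value}")
    total = sum(scores[name] for name in CRITERIA)
    if scores["evidence"] < MIN_EVIDENCE:
        return False, total, "evidence below minimum"
    if scores["net_improvement"] < MIN_NET_IMPROVEMENT:
        return False, total, "net improvement below minimum"
    if total < MIN_TOTAL:
        return False, total, "total below quality floor"
    return True, total, "all gates passed"
```

Note that the per-criterion gates are checked before the total, so an experiment with evidence=1 is rejected even when its other scores would carry it past 13.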

Literature research testing

Not yet. We have validated on Mode 1 (wine dataset ML optimization, 6 experiments) and Mode 2 (AI coding agents competitive analysis, 6 experiments). The system can handle literature research because the agent has browser and web tools, but PDF parsing, source deduplication, and academic citation formatting are untested. The rubric would need tweaking for academic work (e.g., evidence >= 4 minimum, require DOI/URL citations).

Scope of changes

Entirely contained within the skill. Zero changes to core Hermes:

No modifications to gateway, cron scheduler, agent core, or delegation system
No new Hermes tools or APIs
All 7 scripts are stdlib Python, zero external dependencies

If you delete skills/research/autoresearch/, Hermes behaves identically to before.

Iteration limits vs. context window

The git loop does not reduce token consumption per run — the agent might still burn through 50 iterations. The benefit is different:

State persistence — checkpoint.json, status.json, and results.log allow the agent to resume after a context reset. If the cron job hits 50 iterations and dies, the next run starts from checkpoint and knows exactly which experiments are done/merged/reverted.
Main branch as permanent memory — Even if context is lost, accumulated knowledge lives in the git history. A resumed agent reads research.md on main and picks up from there.
Mid-run replanning — Every 5 experiments the agent re-reads stats and plan, which re-focuses the context and avoids drift.
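The resume and replanning behavior above might look roughly like this. The checkpoint layout used here (an "experiments" list whose entries carry "id" and "status") is an assumption made for illustration; the real checkpoint.json schema is not shown in this PR description, and `next_pending` and `should_replan` are hypothetical helper names.

```python
from typing import Any, Dict, Optional

# Statuses that mean an experiment has been fully resolved (assumed values).
FINISHED = {"merged", "reverted"}


def next_pending(checkpoint: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return the first experiment that is not yet merged or reverted, if any."""
    for experiment in checkpoint.get("experiments", []):
        if experiment.get("status") not in FINISHED:
            return experiment
    return None


def should_replan(checkpoint: Dict[str, Any], every: int = 5) -> bool:
    """True when the count of finished experiments is a positive multiple of `every`."""
    finished = sum(
        1
        for experiment in checkpoint.get("experiments", [])
        if experiment.get("status") in FINISHED
    )
    return finished > 0 and finished % every == 0
```

A fresh cron run would load the checkpoint, call `next_pending` to find where to continue, and consult `should_replan` before starting that experiment.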

tugrulguner force-pushed the feat/autoresearch-skill branch 3 times, most recently from c7da4a3 to 7778259 Compare April 4, 2026 20:43

tugrulguner force-pushed the feat/autoresearch-skill branch from a66f62d to 5b666ce Compare April 4, 2026 22:22

feat(skills): add autoresearch - autonomous git-based experiment loop

dc22a92

tugrulguner closed this Apr 4, 2026

tugrulguner force-pushed the feat/autoresearch-skill branch from 5b666ce to dc22a92 Compare April 4, 2026 22:32

tugrulguner mentioned this pull request Apr 5, 2026

feat(skills): add autoresearch — autonomous git-based experiment loop #5175

Open

5 tasks

tugrulguner wants to merge 1 commit into NousResearch:main from tugrulguner:feat/autoresearch-skill
