feat(skills): add autoresearch — autonomous git-based experiment loop#5112

Closed
tugrulguner wants to merge 1 commit into NousResearch:main from tugrulguner:feat/autoresearch-skill

@tugrulguner tugrulguner commented Apr 4, 2026

Contributor

Summary

Adds the autoresearch pattern as a native Hermes skill: autonomous background research using a branch → experiment → evaluate → merge/revert git loop. Zero external dependencies.

What It Does

  1. Initialize — Creates run directory with state files (config, status, plan, control, checkpoint)
  2. Plan — Breaks research goal into experiments (investigate, deepen, verify, synthesize)
  3. Loop — For each experiment:
    • git checkout -b exp_N from main
    • Do the work (modify code, gather data, train ML)
    • Evaluate: deterministic metric (ML) or self-eval rubric (knowledge)
    • Merge if improved / Revert if worse
  4. Report — Markdown report with findings, experiment log, token usage
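A minimal sketch of the per-experiment decision in step 3, for illustration only. It is not the skill's `workspace.py`: the function names are hypothetical, and modeling "revert" as deleting the experiment branch (so `main` is never touched) is an assumption about one reasonable way to do it.

```python
from typing import List


def is_improved(baseline: float, candidate: float, higher_is_better: bool = True) -> bool:
    """Return True only when the candidate metric strictly beats the baseline."""
    if higher_is_better:
        return candidate > baseline
    return candidate < baseline


def git_commands_for_experiment(n: int, improved: bool) -> List[List[str]]:
    """List the git commands for one experiment, in order (hypothetical layout)."""
    branch = f"exp_{n}"
    commands = [
        ["git", "checkout", "main"],
        ["git", "checkout", "-b", branch],
        # The agent does the work and commits on the branch at this point.
        ["git", "checkout", "main"],
    ]
    if improved:
        # Keep the experiment: merge the branch into main.
        commands.append(["git", "merge", "--no-ff", branch])
    else:
        # Assumed revert strategy: discard the branch, leaving main unchanged.
        commands.append(["git", "branch", "-D", branch])
    return commands
```

Returning the commands as data rather than running them keeps the merge/revert decision easy to test without a real repository; a caller could pass each entry to `subprocess.run(cmd, check=True)`.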

Architecture

skills/research/autoresearch/
|-- SKILL.md              Skill definition
|-- scripts/              7 helper scripts (stdlib Python only)
|   |-- state.py          Atomic JSON I/O, budget enforcement
|   |-- plan.py           Experiment CRUD
|   |-- evaluate.py       Scoring rubric + ML metric comparison
|   |-- workspace.py      Git branch/merge/revert
|   |-- report.py         Report generation
|   |-- registry.py       Multi-user tracking
|   |-- usage.py          Token/cost tracking
|-- templates/            Cron, watchdog, resume prompts
|-- test_e2e.py           32 assertions
|-- test_integration.py   40 tests, 9 classes
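For readers unfamiliar with the "atomic JSON I/O" listed for `state.py`, here is a stdlib-only sketch of the usual pattern: write to a temporary file in the same directory, then swap it into place with `os.replace`. The function names `write_json_atomic` and `read_json` are hypothetical and may not match what `state.py` actually exposes.

```python
import json
import os
import tempfile


def write_json_atomic(path: str, data: dict) -> None:
    """Write JSON so readers never observe a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for os.replace to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(data, handle, indent=2)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the swap.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def read_json(path: str, default: dict) -> dict:
    """Read a JSON state file, falling back to a default if it does not exist."""
    try:
        with open(path, "r", encoding="utf-8") as handle:
            return json.load(handle)
    except FileNotFoundError:
        return default
```

This matters for a cron-driven loop because a run can be killed mid-write; with this pattern the next run sees either the old state or the new state, never a truncated file.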

Testing

  • test_integration.py: 40 tests, 9 classes — ALL PASSING
  • test_e2e.py: 32 assertions — ALL PASSING
  • Mode 1 (ML): Wine dataset RF baseline 100%, 6 experiments all correctly evaluated
  • Mode 2 (Knowledge): AI coding agents analysis, 6 experiments merged, 450-line report

How to Use

Launch from any Hermes Agent session:

cronjob(action="create", name="research-<id>", schedule="1m",
        skills=["autoresearch"], prompt=<filled_template>, deliver="origin")

@tugrulguner tugrulguner force-pushed the feat/autoresearch-skill branch 3 times, most recently from c7da4a3 to 7778259 Compare April 4, 2026 20:43
@tugrulguner

Contributor Author

Q&A (from maintainer review)

Evaluation metric for text outputs

Karpathy's autoresearch uses val_bpb as a clear numerical objective. For research.md, the equivalent is a 5-criteria self-evaluation rubric in evaluate.py: evidence, accuracy, depth, relevance, and net improvement (each 1-5, total out of 25).

Honest answer: Yes, this is LLM self-evaluation. Unlike val_bpb, there is no ground-truth numerical metric for text quality. The rubric mitigates hallucination and lazy writing through hard gates:

  • Evidence >= 3 — Must cite real sources, provide concrete data/quotes. A merge with evidence=1 (no sourcing) is impossible — the scoring function auto-rejects it.
  • Net improvement >= 3 — Must add meaningful content. If the agent just rewrites without adding value, it scores 1 and auto-reverts.
  • Total >= 13 — The overall quality floor.
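The three gates above can be sketched roughly as follows. The thresholds and the five criteria come from the description; the function name `rubric_decision`, the dictionary keys, and the return shape are assumptions for illustration and may differ from the real `evaluate.py`.

```python
from typing import Dict, Tuple

CRITERIA = ("evidence", "accuracy", "depth", "relevance", "net_improvement")


def rubric_decision(scores: Dict[str, int]) -> Tuple[str, int, str]:
    """Apply the hard gates and return (decision, total, reason)."""
    for name in CRITERIA:
        value = scores[name]
        if not 1 <= value <= 5:
            raise ValueError(f"{name} must be between 1 and 5, got {value}")

    total = sum(scores[name] for name in CRITERIA)

    # Gates are checked before the total so one weak criterion cannot be
    # compensated by high scores elsewhere.
    if scores["evidence"] < 3:
        return ("revert", total, "evidence below 3")
    if scores["net_improvement"] < 3:
        return ("revert", total, "net improvement below 3")
    if total < 13:
        return ("revert", total, "total below 13")
    return ("merge", total, "all gates passed")
```

For example, an experiment scoring 5 on everything except evidence=1 totals 21/25 but is still reverted, which is the behavior the evidence gate is meant to guarantee.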

In practice, this worked on our AI coding agents test, where all 6 experiments scored 19-24/25 and were merged with real pricing data, feature matrices, and market analysis.

Literature research testing

Not yet. We have validated on Mode 1 (wine dataset ML optimization, 6 experiments) and Mode 2 (AI coding agents competitive analysis, 6 experiments). The system can handle literature research because the agent has browser and web tools, but PDF parsing, source deduplication, and academic citation formatting are untested. The rubric would need tweaking for academic work (e.g., evidence >= 4 minimum, require DOI/URL citations).

Scope of changes

Entirely contained within the skill. Zero changes to core Hermes:

  • No modifications to gateway, cron scheduler, agent core, or delegation system
  • No new Hermes tools or APIs
  • All 7 scripts are stdlib Python, zero external dependencies

If you delete skills/research/autoresearch/, Hermes behaves identically to before.

Iteration limits vs. context window

The git loop does not reduce token consumption per run — the agent might still burn through 50 iterations. The benefit is different:

  • State persistence: checkpoint.json, status.json, and results.log allow the agent to resume after a context reset. If the cron job hits 50 iterations and dies, the next run starts from the checkpoint and knows exactly which experiments are done, merged, or reverted.
  • Main branch as permanent memory: even if context is lost, accumulated knowledge lives in the git history. A resumed agent reads research.md on main and picks up from there.
  • Mid-run replanning: every 5 experiments the agent re-reads stats and plan, which refocuses the context and avoids drift.
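To make the resume and replanning behavior concrete, here is a small sketch. The checkpoint layout (an `experiments` list with `id` and `status` fields) and both function names are hypothetical; the actual schema of checkpoint.json is not shown in this PR description.

```python
from typing import Optional

REPLAN_EVERY = 5  # From the description: replan every 5 experiments.
FINISHED = ("merged", "reverted")


def next_experiment(checkpoint: dict) -> Optional[str]:
    """Return the id of the first pending experiment, or None when all are done."""
    for exp in checkpoint.get("experiments", []):
        if exp["status"] == "pending":
            return exp["id"]
    return None


def should_replan(checkpoint: dict) -> bool:
    """Return True when the count of finished experiments is a positive multiple of 5."""
    finished = sum(
        1 for exp in checkpoint.get("experiments", []) if exp["status"] in FINISHED
    )
    return finished > 0 and finished % REPLAN_EVERY == 0
```

A resumed run would load the checkpoint, call `should_replan` first, and then continue with whatever `next_experiment` returns, so no conversational context from the previous run is needed.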

@tugrulguner tugrulguner force-pushed the feat/autoresearch-skill branch from a66f62d to 5b666ce Compare April 4, 2026 22:22