feat(skills): add autoresearch — autonomous git-based experiment loop by tugrulguner · Pull Request #5175 · NousResearch/hermes-agent

tugrulguner · 2026-04-05T04:32:04Z

Summary

Adds the autoresearch pattern as a native Hermes skill: autonomous background research using a branch → experiment → evaluate → merge/revert git loop. Zero external dependencies.

What It Does

Initialize — Creates run directory with state files (config, status, plan, control, checkpoint)
Plan — Breaks research goal into experiments (investigate, deepen, verify, synthesize)
Loop — For each experiment:
- git checkout -b exp_N from main
- Do the work (modify code, gather data, train ML)
- Evaluate: deterministic metric (ML) or self-eval rubric (knowledge)
- Merge if improved / Revert if worse
Report — Markdown report with findings, experiment log, token usage
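The loop step above can be sketched as a small command builder. This is a minimal illustration, not the code in workspace.py: the function names are hypothetical, and "revert" is assumed here to mean deleting the experiment branch after its work was committed there, which the description does not spell out.

```python
import shlex


def start_experiment_commands(exp_number, base="main"):
    """Return shell commands that create branch exp_N from the base branch."""
    branch = f"exp_{exp_number}"
    return [
        f"git checkout {shlex.quote(base)}",
        f"git checkout -b {shlex.quote(branch)}",
    ]


def finish_experiment_commands(exp_number, improved, base="main"):
    """Return shell commands that merge a successful experiment or drop a failed one.

    Assumption: a failed experiment is "reverted" by deleting its branch.
    """
    branch = f"exp_{exp_number}"
    commands = [f"git checkout {shlex.quote(base)}"]
    if improved:
        message = f"Merge {branch}"
        commands.append(
            f"git merge --no-ff -m {shlex.quote(message)} {shlex.quote(branch)}"
        )
    else:
        commands.append(f"git branch -D {shlex.quote(branch)}")
    return commands


if __name__ == "__main__":
    for line in start_experiment_commands(3) + finish_experiment_commands(3, True):
        print(line)
```

Returning strings instead of running git directly mirrors the "outputs commands for agent to execute" design mentioned under Key Implementation Details.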

Supported Flows

Flow 1 — ML/Code Optimization: Iterate on train.py, try different models/hyperparameters, keep only what beats baseline. Evaluation is deterministic: metric comparison (val_bpb, accuracy, latency). Example: "Optimize this random forest classifier on the wine dataset."
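A deterministic comparison for Flow 1 might look like the sketch below. The name is_improvement and the lower_is_better and min_delta parameters are assumptions for illustration; the PR only states that evaluate.py exposes score_ml(metric_value, prev_best). The direction flag matters because val_bpb and latency improve downward while accuracy improves upward.

```python
from typing import Optional


def is_improvement(
    metric_value: float,
    prev_best: Optional[float],
    lower_is_better: bool,
    min_delta: float = 0.0,
) -> bool:
    """Decide whether a new metric beats the best value seen so far.

    Hypothetical helper: the first measurement (prev_best is None) is
    treated as the baseline and always accepted.
    """
    if prev_best is None:
        return True
    if lower_is_better:
        return metric_value < prev_best - min_delta
    return metric_value > prev_best + min_delta


if __name__ == "__main__":
    # val_bpb: lower is better, so 0.98 beats 1.02
    print(is_improvement(0.98, 1.02, lower_is_better=True))
    # accuracy: higher is better, so a tie does not merge
    print(is_improvement(0.97, 0.97, lower_is_better=False))
```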

Flow 2 — Knowledge Research: Build research.md via web research. Self-eval rubric: evidence (1-5), accuracy (1-5), depth (1-5), relevance (1-5), net improvement (1-5). Total >= 13 with evidence >= 3 and relevance >= 3 → merge. Example: "Research the AI coding agents market — pricing, features, market share."

Flow 3 — Code Analysis/Security Audit: Analyze a codebase for issues using static analysis. Experiments add findings with file paths, line numbers, and code snippets. Only merges with concrete evidence. Example: "Audit this repo for OWASP top 10 vulnerabilities."

Flow 4 — Recurring Competitive Intelligence: Weekly cron that checks for competitor changes, merges only genuinely new findings. Same knowledge rubric but experiments focus on delta from last run. Example: "Track weekly changes in competitor pricing pages."

Flow 5 — Product Requirements / Spec Refinement: Start with a rough PRD, iteratively refine via research on feasibility, competitors, and market gaps. Each experiment deepens a section. Example: "Refine this PRD for a developer tools product."

Architecture

skills/research/autoresearch/
├── SKILL.md                    # Skill definition and trigger conditions
├── scripts/                    # 8 helper scripts (stdlib Python only)
│   ├── _util.py                # Shared atomic write, JSON I/O, HERMES_HOME
│   ├── state.py                # Atomic JSON I/O, budget enforcement
│   ├── plan.py                 # Experiment CRUD (investigate/deepen/verify/synthesize)
│   ├── evaluate.py             # Scoring rubric + ML metric comparison
│   ├── workspace.py            # Git branch/merge/revert operations
│   ├── report.py               # Markdown report generation from results.log
│   ├── registry.py             # Multi-user run tracking
│   └── usage.py                # Token/cost tracking via SessionDB
├── templates/
│   ├── cron_prompt.md          # Main loop (4 phases: setup, planning, loop, synthesis)
│   ├── watchdog_prompt.md      # Progress monitor (every 15 min)
│   └── resume_prompt.md        # Resume from checkpoint
├── test_e2e.py                 # 32 assertions
└── test_integration.py         # 44 tests, 10 classes

Testing

test_integration.py: 44 tests, 10 classes — ALL PASSING
test_e2e.py: 32 assertions — ALL PASSING
Full loop simulation: 83 assertions (init → plan → 6 experiments → pause/resume → budget enforcement → report) — ALL PASSING
P1 fix tests: 17 assertions (token auto-read from usage.json, tier param enforcement) — ALL PASSING
Mode 1 (ML): Wine dataset RF baseline 100%, 6 experiments all correctly evaluated
Mode 2 (Knowledge): AI coding agents analysis, 6 experiments merged, 450-line report

What's tested: All helper scripts, the full Phase 1→4 loop (init, plan, branch, evaluate, merge/revert, pause/resume, budget enforcement, report generation, registry, usage tracking), template variable references, git history integrity.

What needs live validation: The actual cron scheduling and gateway delivery path, i.e. launching via cronjob(), the watchdog firing every 15 min, and final report auto-delivery. This requires a persistent Hermes gateway with cron enabled. The skill uses the existing cron system as-is (no modifications to it), so the integration risk is low.

How to Use

Launch from any Hermes Agent session:

cronjob(action="create", name="research-<id>", schedule="1m",
        skills=["autoresearch"], prompt=<filled_template>, deliver="origin")

Key Implementation Details

state.py: tempfile.mkstemp + os.replace for atomic writes. Budget checks auto-read usage.json for token enforcement. control.json read before every loop iteration.
workspace.py: git init with initial empty commit so main branch exists immediately. shlex.quote() on all shell-interpolated values. Outputs commands for agent to execute.
evaluate.py: Two scoring paths - score_knowledge(evidence, accuracy, depth, relevance, net_improvement) for knowledge, score_ml(metric_value, prev_best) for ML.
report.py: Parses results.log for merge/revert decisions, reads usage.json for token counts. Summary shows top findings when merged > 0.
registry.py / usage.py: Respect HERMES_HOME env var for non-default profiles and test environments.
cron_prompt.md: Passes --max-duration and --max-tokens to state.py init so Quick/Deep/Unlimited tiers get correct budget limits.
Zero dependencies: All stdlib Python. No changes to core Hermes.
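The atomic-write pattern named for state.py (tempfile.mkstemp followed by os.replace) can be sketched as follows. The function name atomic_write_json is hypothetical, and the fsync call is an extra durability step assumed here, not something the PR states.

```python
import json
import os
import tempfile


def atomic_write_json(path, data):
    """Write JSON so readers see either the old file or the new one, never a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the target directory so os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(data, handle, indent=2)
            handle.flush()
            os.fsync(handle.fileno())  # assumption: flush to disk before the rename
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as run_dir:
        target = os.path.join(run_dir, "status.json")
        atomic_write_json(target, {"phase": "loop"})
        with open(target, encoding="utf-8") as handle:
            print(json.load(handle))
```

Because a watchdog and the main loop may touch the same state files, a rename-based write avoids a reader ever parsing a half-written JSON document.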

Contribution

I have read the contributing guidelines
I have tested my changes locally
My changes do not introduce any new warnings or errors
I have added tests that prove my fix/feature works
New and existing tests pass locally with my changes

Closes #5114

tugrulguner · 2026-04-05T04:33:46Z

Q&A (from maintainer review)

Evaluation metric for text outputs

Karpathy's autoresearch uses val_bpb as a clear numerical objective. For research.md, the equivalent is a 5-criteria self-evaluation rubric in evaluate.py: evidence, accuracy, depth, relevance, and net improvement (each 1-5, total out of 25).

Honest answer: Yes, this is LLM self-evaluation. Unlike val_bpb, there is no ground-truth numerical metric for text quality. The rubric mitigates hallucination and lazy writing through hard gates:

Evidence >= 3 — Must cite real sources, provide concrete data/quotes. A merge with evidence=1 (no sourcing) is impossible — the scoring function auto-rejects it.
Net improvement >= 3 — Must add meaningful content. If the agent just rewrites without adding value, it scores 1 and auto-reverts.
Total >= 13 — The overall quality floor.
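The hard gates listed above can be sketched as a small scoring function. This is an illustration under assumptions, not the body of score_knowledge in evaluate.py: the name, the return shape, and the gates parameter are hypothetical. The gated criteria are kept configurable because the PR summary names relevance, not net improvement, as the second gate.

```python
CRITERIA = ("evidence", "accuracy", "depth", "relevance", "net_improvement")


def score_knowledge_sketch(scores, gates=None, min_total=13):
    """Apply the 5-criteria rubric: each score is 1-5, total out of 25.

    gates maps a criterion to its minimum score; the default follows the
    Q&A text (evidence >= 3, net improvement >= 3) and is an assumption.
    """
    if gates is None:
        gates = {"evidence": 3, "net_improvement": 3}
    for name in CRITERIA:
        if not 1 <= scores[name] <= 5:
            raise ValueError(f"{name} must be between 1 and 5")
    total = sum(scores[name] for name in CRITERIA)
    failed_gates = [name for name, floor in gates.items() if scores[name] < floor]
    merge = total >= min_total and not failed_gates
    return {
        "total": total,
        "failed_gates": failed_gates,
        "decision": "merge" if merge else "revert",
    }


if __name__ == "__main__":
    unsourced = {"evidence": 1, "accuracy": 5, "depth": 5,
                 "relevance": 5, "net_improvement": 5}
    # High total, but the evidence gate fails, so the experiment is reverted.
    print(score_knowledge_sketch(unsourced))
```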

In practice this worked on our AI coding agents test where 6 experiments all scored 19-24/25 and were merged with real pricing data, feature matrices, and market analysis.

Literature research testing

Not yet. We have validated on Mode 1 (wine dataset ML optimization, 6 experiments) and Mode 2 (AI coding agents competitive analysis, 6 experiments). The system can handle literature research because the agent has browser and web tools, but PDF parsing, source deduplication, and academic citation formatting are untested. The rubric would need tweaking for academic work (e.g., evidence >= 4 minimum, require DOI/URL citations).

Scope of changes

Entirely contained within the skill. Zero changes to core Hermes:

No modifications to gateway, cron scheduler, agent core, or delegation system
No new Hermes tools or APIs
All 8 scripts are stdlib Python, zero external dependencies

If you delete skills/research/autoresearch/, Hermes behaves identically to before.

Iteration limits vs. context window

The git loop does not reduce token consumption per run — the agent might still burn through 50 iterations. The benefit is different:

State persistence — checkpoint.json, status.json, and results.log allow the agent to resume after a context reset. If the cron job hits 50 iterations and dies, the next run starts from checkpoint and knows exactly which experiments are done/merged/reverted.
Main branch as permanent memory — Even if context is lost, accumulated knowledge lives in the git history. A resumed agent reads research.md on main and picks up from there.
Mid-run replanning — Every 5 experiments the agent re-reads stats and plan, which re-focuses the context and avoids drift.
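Resuming from a checkpoint could work roughly like the sketch below. The checkpoint.json schema shown (an "experiments" list with "id" and "status" fields) is invented for illustration, since the PR does not document the file layout; only the file name and the every-5-experiments replanning interval come from the text above.

```python
import json
from pathlib import Path

REPLAN_EVERY = 5  # from the description: re-read stats and plan every 5 experiments


def resume_point(run_dir):
    """Work out where a fresh agent context should pick up.

    Assumed (hypothetical) schema:
    {"experiments": [{"id": 1, "status": "merged" | "reverted" | "pending"}, ...]}
    """
    checkpoint_path = Path(run_dir) / "checkpoint.json"
    checkpoint = json.loads(checkpoint_path.read_text(encoding="utf-8"))
    experiments = checkpoint["experiments"]
    finished = [e for e in experiments if e["status"] in ("merged", "reverted")]
    pending = [e for e in experiments if e["status"] == "pending"]
    return {
        "next_experiment": pending[0]["id"] if pending else None,
        "finished": len(finished),
        "replan": len(finished) > 0 and len(finished) % REPLAN_EVERY == 0,
    }
```

The point of the design is that nothing in this decision depends on the agent's conversation history: the files on disk and the main branch are enough to continue.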

Adds autonomous research skill inspired by Karpathy's autoresearch. Branch → experiment → evaluate → merge/revert loop for ML optimization and knowledge research. Zero external dependencies (stdlib Python only).

- 8 helper scripts: state, plan, evaluate, workspace, report, registry, usage, _util
- 3 prompt templates: cron agent, watchdog, resume
- Safety gates: time/token/experiment budgets, watchdog enforcement, stall detection
- Multi-user registry with HERMES_HOME support
- 44 integration tests + 32 e2e assertions, all passing

…main branch exists

workspace.py init() created the git repo but never committed, leaving main as an unborn branch. merge/revert would fail with "pathspec 'main' did not match" on the first experiment cycle. Add --allow-empty commit so main exists immediately after init.
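The fix described in this commit can be reproduced with a short script. This is a sketch under assumptions, not the code in workspace.py, which per the PR outputs commands for the agent to run instead of executing them: the function name, the commit identity, and the use of git symbolic-ref to name the branch main are choices made here so the example runs on older git versions too.

```python
import subprocess


def init_repo_with_main(path):
    """Create a git repo whose main branch exists immediately after init."""

    def git(*args):
        return subprocess.run(
            ["git", "-C", str(path), *args],
            check=True, capture_output=True, text=True,
        )

    git("init")
    # Point HEAD at main regardless of the configured default branch name.
    git("symbolic-ref", "HEAD", "refs/heads/main")
    # Without a commit, main is an unborn branch and cannot be checked out later.
    git(
        "-c", "user.name=autoresearch",               # placeholder identity (assumption)
        "-c", "user.email=autoresearch@example.invalid",
        "-c", "commit.gpgsign=false",
        "commit", "--allow-empty", "-m", "Initial empty commit",
    )
```

After this runs, git rev-parse --verify main succeeds, so the first experiment cycle can check out main and branch from it.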

alt-glitch · 2026-05-01T10:16:07Z

Likely duplicate of #5112 (closed) — same autoresearch skill PR. Also competing with open #7911.

alt-glitch · 2026-05-01T10:16:29Z

Likely duplicate of #5112 (closed) — same autoresearch skill PR. Also competing with open #7911.

tugrulguner · 2026-05-01T13:53:48Z

> Likely duplicate of #5112 (closed) — same autoresearch skill PR. Also competing with open #7911.

True, there was something wrong with that PR, so I closed it. This PR has been open for like a month :(

tugrulguner force-pushed the feat/autoresearch-v2 branch from b7f5f82 to a8fe406 on April 5, 2026 13:33

tugrulguner force-pushed the feat/autoresearch-v2 branch from a8fe406 to bc827c0 on April 5, 2026 13:56

tugrulguner added 2 commits April 5, 2026 13:50

Merge remote-tracking branch 'upstream/main' into feat/autoresearch-v2

916f86a

alt-glitch added labels on May 1, 2026: type/feature (New feature or request), P3 (Low: cosmetic, nice to have), tool/skills (Skills system: list, view, manage)

alt-glitch mentioned this pull request May 1, 2026

[Feature]: Autoresearch skill - autonomous git-based experiment loop for ML optimization and knowledge research #5114 (Open, 9 tasks)

feat(skills): add autoresearch — autonomous git-based experiment loop#5175
tugrulguner wants to merge 3 commits into NousResearch:main from tugrulguner:feat/autoresearch-v2
