feat(skills): add autoresearch — autonomous git-based experiment loop#5175
tugrulguner wants to merge 3 commits into
Q&A (from maintainer review)
Evaluation metric for text outputs
Karpathy's autoresearch uses
Honest answer: Yes, this is LLM self-evaluation. Unlike
In practice this worked on our AI coding agents test, where 6 experiments all scored 19-24/25 and were merged with real pricing data, feature matrices, and market analysis.
Literature research testing
Not yet. We have validated on Mode 1 (wine dataset ML optimization, 6 experiments) and Mode 2 (AI coding agents competitive analysis, 6 experiments). The system can handle literature research because the agent has browser and web tools, but PDF parsing, source deduplication, and academic citation formatting are untested. The rubric would need tweaking for academic work (e.g., evidence >= 4 minimum, require DOI/URL citations).
Scope of changes
Entirely contained within the skill. Zero changes to core Hermes:
If you delete
Iteration limits vs. context window
The git loop does not reduce token consumption per run; the agent might still burn through 50 iterations. The benefit is different:
b7f5f82 to a8fe406
Adds autonomous research skill inspired by Karpathy's autoresearch. Branch → experiment → evaluate → merge/revert loop for ML optimization and knowledge research. Zero external dependencies (stdlib Python only).
- 8 helper scripts: state, plan, evaluate, workspace, report, registry, usage, _util
- 3 prompt templates: cron agent, watchdog, resume
- Safety gates: time/token/experiment budgets, watchdog enforcement, stall detection
- Multi-user registry with HERMES_HOME support
- 44 integration tests + 32 e2e assertions, all passing
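The safety gates in this commit message (time, token, and experiment budgets plus stall detection) could be checked along the following lines. This is a minimal sketch rather than the skill's actual code: the `Budget` and `Usage` classes, their field names, and the idea of counting consecutive unmerged experiments as a stall are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Budget:
    # Hypothetical limits; the real skill may name or store these differently.
    max_seconds: float
    max_tokens: int
    max_experiments: int
    max_stalled: int  # consecutive experiments without a merge


@dataclass(frozen=True)
class Usage:
    # Hypothetical running totals for one research run.
    elapsed_seconds: float
    tokens_used: int
    experiments_run: int
    stalled_in_a_row: int


def stop_reason(budget: Budget, usage: Usage) -> Optional[str]:
    """Return why the loop should stop, or None if it may continue."""
    if usage.elapsed_seconds >= budget.max_seconds:
        return "time budget exhausted"
    if usage.tokens_used >= budget.max_tokens:
        return "token budget exhausted"
    if usage.experiments_run >= budget.max_experiments:
        return "experiment budget exhausted"
    if usage.stalled_in_a_row >= budget.max_stalled:
        return "stalled"
    return None
```

Returning a reason string instead of a bare boolean lets a watchdog or final report say which gate ended the run.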
a8fe406 to bc827c0
…main branch exists
workspace.py init() created the git repo but never committed, leaving main as an unborn branch. merge/revert would fail with "pathspec 'main' did not match" on the first experiment cycle. Add --allow-empty commit so main exists immediately after init.
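The fix described in this commit can be sketched as follows. This is an illustration under assumptions, not the contents of `workspace.py`: the function names `init_commands` and `init_workspace`, the commit message, and the use of `git symbolic-ref` to force the branch name are all hypothetical choices.

```python
import subprocess
from typing import Callable, List


def init_commands(message: str = "initial commit") -> List[List[str]]:
    """Git commands that leave `main` as a real branch with one commit."""
    return [
        ["git", "init"],
        # Point HEAD at main regardless of the configured default branch name.
        ["git", "symbolic-ref", "HEAD", "refs/heads/main"],
        # An empty commit gives main a first commit, so it is no longer unborn.
        ["git", "commit", "--allow-empty", "-m", message],
    ]


def init_workspace(path: str, run: Callable[..., object] = subprocess.run) -> None:
    """Run the init commands inside `path`, raising on the first failure."""
    for cmd in init_commands():
        run(cmd, cwd=path, check=True)
```

The `run` parameter exists only so the command sequence can be exercised with a fake runner in tests. A real run also needs a git identity (`user.name` and `user.email`) configured, otherwise the commit step fails.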
Summary
Adds the autoresearch pattern as a native Hermes skill: autonomous background research using a branch → experiment → evaluate → merge/revert git loop. Zero external dependencies.
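One cycle of the branch → experiment → evaluate → merge/revert loop might map onto git commands roughly like this. The helper names are hypothetical, and treating "revert" as deleting the unmerged experiment branch is an assumption; the skill's own scripts may do it differently.

```python
from typing import List


def start_experiment(n: int) -> List[List[str]]:
    """Create branch exp_<n> from main."""
    return [["git", "checkout", "-b", f"exp_{n}", "main"]]


def finish_experiment(n: int, keep: bool) -> List[List[str]]:
    """Merge the experiment into main if it is kept, otherwise discard it."""
    branch = f"exp_{n}"
    if keep:
        return [
            ["git", "checkout", "main"],
            ["git", "merge", "--no-ff", branch, "-m", f"merge {branch}"],
        ]
    return [
        ["git", "checkout", "main"],
        # -D deletes the branch even though it was never merged.
        ["git", "branch", "-D", branch],
    ]
```

Because every experiment lives on its own branch, main only ever contains work that passed evaluation, and a rejected experiment leaves no trace on it.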
What It Does
git checkout -b exp_N from main
Supported Flows
Flow 1 — ML/Code Optimization: Iterate on train.py, try different models/hyperparameters, keep only what beats baseline. Evaluation is deterministic: metric comparison (val_bpb, accuracy, latency). Example: "Optimize this random forest classifier on the wine dataset."
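The deterministic check in Flow 1 amounts to a strict comparison against the baseline. A minimal sketch, assuming a hypothetical `beats_baseline` helper and a caller-supplied direction flag:

```python
def beats_baseline(candidate: float, baseline: float, lower_is_better: bool) -> bool:
    """True only if the candidate is strictly better than the baseline."""
    if lower_is_better:
        return candidate < baseline
    return candidate > baseline
```

Metrics such as val_bpb and latency would be called with `lower_is_better=True`, accuracy with `lower_is_better=False`. A tie counts as no improvement, so the experiment would be reverted.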
Flow 2 — Knowledge Research: Build research.md via web research. Self-eval rubric: evidence (1-5), accuracy (1-5), depth (1-5), relevance (1-5), net improvement (1-5). Total >= 13 with evidence >= 3 and relevance >= 3 → merge. Example: "Research the AI coding agents market — pricing, features, market share."
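The Flow 2 merge rule can be written down directly from the thresholds above. The function name, the dictionary keys, and the range validation are assumptions for illustration; only the thresholds come from the description.

```python
from typing import Mapping

# Hypothetical key names for the five rubric criteria, each scored 1-5.
CRITERIA = ("evidence", "accuracy", "depth", "relevance", "net_improvement")


def should_merge(scores: Mapping[str, int]) -> bool:
    """Apply the rubric: total >= 13, evidence >= 3, relevance >= 3."""
    for name in CRITERIA:
        if not 1 <= scores[name] <= 5:
            raise ValueError(f"{name} must be between 1 and 5, got {scores[name]}")
    total = sum(scores[name] for name in CRITERIA)
    return total >= 13 and scores["evidence"] >= 3 and scores["relevance"] >= 3
```

The per-criterion floors mean a high total cannot compensate for weak evidence or an off-topic result.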
Flow 3 — Code Analysis/Security Audit: Analyze a codebase for issues using static analysis. Experiments add findings with file paths, line numbers, and code snippets. Only merges with concrete evidence. Example: "Audit this repo for OWASP top 10 vulnerabilities."
Flow 4 — Recurring Competitive Intelligence: Weekly cron that checks for competitor changes, merges only genuinely new findings. Same knowledge rubric but experiments focus on delta from last run. Example: "Track weekly changes in competitor pricing pages."
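Focusing on the delta from the last run could be approximated by filtering out findings already recorded. This is a sketch under assumptions: findings are modelled as plain strings, and the `normalize` and `new_findings` helpers are hypothetical, not part of the skill.

```python
from typing import Iterable, List


def normalize(finding: str) -> str:
    """Lowercase and collapse whitespace so trivial rewording is not 'new'."""
    return " ".join(finding.lower().split())


def new_findings(previous: Iterable[str], current: Iterable[str]) -> List[str]:
    """Return findings from `current` that were not seen in `previous`."""
    seen = {normalize(f) for f in previous}
    fresh = []
    for finding in current:
        key = normalize(finding)
        if key not in seen:
            seen.add(key)  # also drops duplicates within the current run
            fresh.append(finding)
    return fresh
```

String normalization only catches near-identical text. Deciding whether a reworded finding is genuinely new would still fall to the rubric's net improvement score.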
Flow 5 — Product Requirements / Spec Refinement: Start with a rough PRD, iteratively refine via research on feasibility, competitors, and market gaps. Each experiment deepens a section. Example: "Refine this PRD for a developer tools product."
Architecture
Testing
What's tested: All helper scripts, the full Phase 1→4 loop (init, plan, branch, evaluate, merge/revert, pause/resume, budget enforcement, report generation, registry, usage tracking), template variable references, git history integrity.
What needs live validation: The actual cron scheduling + gateway delivery path, i.e. launching via cronjob(), watchdog firing every 15 min, and final report auto-delivery. This requires a persistent Hermes gateway with cron enabled. The skill uses the existing cron system as-is (no modifications to it), so the integration risk is low.
How to Use
Launch from any Hermes Agent session:
Key Implementation Details
Related
Contribution
Closes #5114