[Feature]: Autoresearch skill - autonomous git-based experiment loop for ML optimization and knowledge research

### Problem or Use Case

#### What Happens Now
Hermes agents repeatedly write and overwrite files (`train.py`, `research.md`) without tracking whether each change actually improved anything. If you ask an agent to "research ML models" or "build a competitive analysis," it produces output but has no mechanism to:
1. Compare experiments against a baseline
2. Discard bad experiments and keep only what worked
3. Maintain progress across long-running background runs
4. Self-evaluate the quality of its own additions before merging

#### Why I Want It
An agent should be able to run multiple experiments, objectively evaluate each one, and keep only the improvements - like a human researcher who tries approaches and only writes up the ones that work. The main branch should always hold the best version, preventing the destructive overwriting loop.

### Acceptance Criteria (AC)

- [ ] Agent initializes research workspace with state files (config, status, plan, control, checkpoint)
- [ ] Agent creates experiments, branches from main for each, and evaluates each via a deterministic metric (ML) or a structured rubric (knowledge)
- [ ] Experiments that improve the target metric/score are merged into main; experiments that regress are reverted
- [ ] Research runs autonomously via cron without blocking the active chat
- [ ] User can pause/stop/resume mid-run via control.json
- [ ] Budget enforcement (time, tokens, experiment hard cap) prevents runaway runs
- [ ] Watchdog cron monitors progress and alerts on stalls (>30 min without update)
- [ ] Final report auto-delivers with findings, experiment log, and token usage
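
The budget and stall criteria above could be checked roughly as follows. This is a minimal sketch, not the code on the fork: the `Budget` field names, the function names, and the idea of passing timestamps in seconds are all assumptions made for illustration. Only the three caps (time, tokens, experiments) and the 30-minute stall threshold come from the criteria.

```python
from dataclasses import dataclass

# ">30 min without update" from the acceptance criteria, in seconds.
STALL_SECONDS = 30 * 60


@dataclass
class Budget:
    # Hypothetical field names; the real config schema is not shown in this issue.
    max_seconds: float
    max_tokens: int
    max_experiments: int


def budget_exceeded(budget, elapsed_seconds, tokens_used, experiments_run):
    """Return the names of every cap that has been reached (empty list = keep going)."""
    reasons = []
    if elapsed_seconds >= budget.max_seconds:
        reasons.append("time")
    if tokens_used >= budget.max_tokens:
        reasons.append("tokens")
    if experiments_run >= budget.max_experiments:
        reasons.append("experiments")
    return reasons


def is_stalled(last_update_ts, now_ts, threshold=STALL_SECONDS):
    """True when the last status update is older than the stall threshold."""
    return (now_ts - last_update_ts) > threshold
```

Returning the list of exhausted caps, rather than a bare boolean, would let the final report say why a run stopped.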

#### Open questions
- Should the self-evaluation rubric use a different threshold for domain-specific research (e.g., security audit needs evidence=4 minimum)?
- Should parallel experiments run via delegate_task for faster iteration?

### Proposed Solution

A git-based branch -> experiment -> evaluate -> merge/revert loop inspired by Karpathy's autoresearch. For each experiment, the agent:
1. Creates a branch from main: `git checkout -b exp_N main`
2. Does the work (modifies code, gathers data, trains models)
3. Evaluates via metric or rubric
4. If improved -> merges exp_N into main (`git checkout main`, then `git merge exp_N`); main becomes the new baseline
5. If worse -> `git checkout -f main` (the experiment is abandoned and the previous best on main is preserved)
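
The steps above can be sketched as a pair of helpers that emit the git commands for the agent to run. This is illustrative only: the function names are hypothetical, and the sketch assumes the experiment's work has already been committed on its branch before the merge.

```python
def branch_name(n):
    """Branch naming follows the exp_N convention used in the steps above."""
    return f"exp_{n}"


def start_commands(n):
    """Commands that open experiment n on a fresh branch cut from main."""
    return [f"git checkout -b {branch_name(n)} main"]


def finish_commands(n, improved):
    """Commands that close experiment n.

    Assumes the work is already committed on the experiment branch.
    improved=True  -> merge into main, which becomes the new baseline.
    improved=False -> force-checkout main, dropping uncommitted changes
                      and leaving main at the previous best.
    """
    if improved:
        return ["git checkout main", f"git merge {branch_name(n)}"]
    return ["git checkout -f main"]
```

For example, `finish_commands(3, improved=True)` yields `["git checkout main", "git merge exp_3"]`. Note that `git checkout -f main` leaves the `exp_N` branch in the repository; deleting it would need a separate `git branch -D exp_N`.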

Supported flows:

Flow 1: ML/Code Optimization - Iterate on train.py, try different models/hyperparameters, keep only what beats baseline. Deterministic: metric comparison only.
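
A minimal sketch of the metric comparison in Flow 1. The `higher_is_better` flag, the treatment of a missing baseline, and the rule that ties do not count as improvements are assumptions; the issue only states that the decision is a deterministic metric comparison.

```python
def beats_baseline(metric_value, prev_best, higher_is_better=True):
    """Decide whether an experiment's metric improves on the current baseline.

    prev_best=None means no baseline exists yet, so the first result is
    accepted as the baseline (an assumption, not stated in the issue).
    Ties are treated as "not improved" so main only moves on strict gains.
    """
    if prev_best is None:
        return True
    if higher_is_better:
        return metric_value > prev_best
    return metric_value < prev_best
```

Usage: `beats_baseline(0.91, 0.89)` for an accuracy-style metric, or `beats_baseline(0.42, 0.47, higher_is_better=False)` for a loss.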

Flow 2: Knowledge Research - Build research.md via web research. Self-eval rubric: evidence (1-5), accuracy (1-5), depth (1-5), relevance (1-5), net improvement (1-5). Merge when the total is >= 13 (out of 25) with evidence >= 3 and relevance >= 3.
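
The Flow 2 merge rule can be written down directly. In this sketch the thresholds come from the rubric above, while the function name, the dictionary keys, and the range validation are assumptions.

```python
CRITERIA = ("evidence", "accuracy", "depth", "relevance", "net_improvement")


def should_merge(scores):
    """Apply the rubric: total >= 13 with evidence >= 3 and relevance >= 3.

    `scores` maps each criterion name to an integer from 1 to 5.
    """
    for name in CRITERIA:
        if not 1 <= scores[name] <= 5:
            raise ValueError(f"{name} must be between 1 and 5, got {scores[name]}")
    total = sum(scores[name] for name in CRITERIA)
    return total >= 13 and scores["evidence"] >= 3 and scores["relevance"] >= 3
```

The two per-criterion floors matter: an experiment scoring 5 on everything except evidence (2) totals 22 but is still rejected.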

Flow 3: Code Analysis/Security Audit - Use tree-sitter and static analysis to find issues. Findings must include file paths, line numbers, and code snippets to be merged.

Flow 4: Recurring Competitive Intelligence - Weekly cron that checks for competitor changes, merges only genuinely new findings.

Flow 5: Product Requirements / Spec Refinement - Start with rough PRD, iteratively refine via research on feasibility, competitors, market gaps.

### Implementation Notes

#### WHERE
Already implemented on the fork branch `feat/autoresearch-skill` at `tugrulguner/hermes-agent`:

```
skills/research/autoresearch/
├── SKILL.md                    # Skill definition and trigger conditions
├── scripts/
│   ├── state.py                # Atomic JSON I/O, budget enforcement
│   ├── plan.py                 # Experiment CRUD (investigate/deepen/verify/synthesize)
│   ├── evaluate.py             # Scoring rubric + ML metric comparison
│   ├── workspace.py            # Git branch/merge/revert operations
│   ├── report.py               # Markdown report generation from results.log
│   ├── registry.py             # Multi-user run tracking
│   └── usage.py                # Token/cost tracking via SessionDB
├── templates/
│   ├── cron_prompt.md          # Main loop (4 phases: setup, planning, loop, synthesis)
│   ├── watchdog_prompt.md      # Progress monitor (every 15 min)
│   └── resume_prompt.md        # Resume from checkpoint
├── test_e2e.py                 # 32 assertions
└── test_integration.py         # 40 tests, 9 classes
```

#### HOW
Agent launches via: `cronjob(action="create", name="research-<id>", schedule="1m", skills=["autoresearch"], prompt=<filled_template>, deliver="origin")`

Key implementation:
- `state.py`: `tempfile.mkstemp` + `os.replace` for atomic writes. Budget checks compare elapsed time, tokens, and experiment count against hard caps. `control.json` is read before every loop iteration.
- `workspace.py`: `git init --initial-branch=main` with a fallback for older git. Outputs shell commands for the agent to execute.
- `evaluate.py`: two scoring paths: `score(evidence, accuracy, depth, relevance, net_improvement)` for knowledge, `log-result-ml(metric_name, metric_value, prev_best)` for ML.
- `report.py`: parses `results.log` for merge/revert decisions and reads `usage.json` for token counts.
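
The atomic-write pattern named for `state.py` typically looks like the sketch below. It is not the code from the fork: the function names are hypothetical, and the `"action"` key and its `"run"` default in `control.json` are assumed, since the file's schema is not shown here.

```python
import json
import os
import tempfile


def write_json_atomic(path, data):
    """Write JSON so readers see either the old file or the new one, never a partial write."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the target directory so os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(data, handle, indent=2)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)  # atomic rename over the destination
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def read_control(path):
    """Return the requested action; a missing control file is treated as "run" (assumed default)."""
    try:
        with open(path, encoding="utf-8") as handle:
            return json.load(handle).get("action", "run")
    except FileNotFoundError:
        return "run"
```

Calling `read_control` at the top of each loop iteration is what would let a user pause or stop a run by editing `control.json` while the cron job is active.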

#### DEPENDENCIES
Zero new dependencies: everything uses the Python standard library. The cron system already exists; messaging is disabled for cron jobs by design, which is handled by the watchdog pattern.

### Alternatives Considered

Using `delegate_task` instead of `cronjob` for background runs: ruled out because `delegate_task` blocks the conversation while cron is fire-and-forget, and cron has built-in auto-delivery of the final result.

### Feature Type

Skill

### Scope

Research, ML, Knowledge

### Contribution

- [x] I'd like to implement this myself and submit a PR
