Feature: YC-Bench long-horizon agent benchmark environment

## Overview

[YC-Bench](https://github.com/collinear-ai/yc-bench) by [Collinear AI](https://collinear.ai/) (founded by Nazneen Rajani, former Robustness Research Lead at Hugging Face) is a deterministic, long-horizon benchmark that tests LLM agents' ability to act as a tech startup CEO. The agent manages a simulated company over 1-3 years, making compounding decisions about resource allocation, cash flow, task management, and prestige specialization across 7 skill domains.

Unlike TerminalBench2 (which evaluates per-task coding ability with binary pass/fail), YC-Bench measures **long-term strategic coherence** — whether an agent can maintain consistent strategy, manage compounding consequences, and adapt plans over hundreds of turns. This fills a capability gap in our evaluation suite: we can benchmark per-task execution (TerminalBench2) but not sustained multi-turn decision-making.

Initial results show frontier models still struggle: Claude 3.5 Sonnet survived only 1 of 3 Hard runs, although Gemini 1.5 Flash swept Nightmare difficulty. The results suggest that the core gap for agents is "not reasoning ability but temporal coherence."

---

## Research Findings

### How YC-Bench Works

YC-Bench is a discrete-event simulation backed by SQLite. The agent starts with $250K seed capital and 5 employees, then manages:

- **7 skill domains**: system, research, data, frontend, backend, training, hardware
- **Prestige system** (1.0-10.0): Gates access to higher-paying tasks. Prestige in each domain must be earned incrementally.
- **Employee management**: Hire/assign employees with domain-specific skill rates. Salary bumps compound after each completed task.
- **Task pipeline**: Browse market → accept task → assign employees → dispatch → advance time → repeat
- **Financial pressure**: Biweekly payroll. Bankruptcy = game over.

**Agent interface**: The agent gets exactly ONE tool — `run_command(command: str)` — which executes `yc-bench <subcommand>` CLI calls that return JSON. Available commands:

| Category | Commands |
|----------|----------|
| Observe | `company status`, `employee list`, `market browse [--domain/--prestige-lte/--reward-min]`, `task list/inspect`, `finance ledger`, `report monthly` |
| Act | `task accept`, `task assign`, `task dispatch`, `task cancel`, `sim resume` |
| Memory | `scratchpad write/append/read/clear` |
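The single-tool interface can be sketched as a thin subprocess wrapper. This is a minimal illustration, not yc-bench's own implementation: the `base_cmd` and `timeout_s` parameters and the shape of the returned dictionary are assumptions made for this example.

```python
import json
import shlex
import subprocess
from typing import Any, Sequence


def run_command(
    command: str,
    base_cmd: Sequence[str] = ("yc-bench",),
    timeout_s: float = 60.0,
) -> dict[str, Any]:
    """Run one `yc-bench <subcommand>` call and parse its JSON output.

    `base_cmd` and `timeout_s` are hypothetical knobs for this sketch;
    they are not part of yc-bench.
    """
    # Split the agent-supplied string into argv without invoking a shell.
    argv = [*base_cmd, *shlex.split(command)]
    proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s)
    if proc.returncode != 0:
        # Return failures as data so the agent can read and react to them.
        return {"ok": False, "exit_code": proc.returncode, "stderr": proc.stderr.strip()}
    try:
        return {"ok": True, "result": json.loads(proc.stdout)}
    except json.JSONDecodeError:
        return {"ok": False, "exit_code": 0, "stderr": "non-JSON output", "raw": proc.stdout}
```

Passing an argument list to `subprocess.run` (rather than `shell=True`) keeps the agent's command string from being interpreted by a shell, which matters when the string is model-generated.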

**Scoring** is multi-dimensional:
- Survival (binary): Did the company avoid bankruptcy?
- Final funds: Total capital at horizon end
- Task win rate: percentage of accepted tasks completed on time (a rate above 58% is associated with survival, below 40% with bankruptcy)
- Prestige profile: Radar chart of domain-specific prestige levels
- API cost: USD spent on LLM calls

**9 difficulty presets** from `tutorial` (1yr, 3 employees, easy) to `default` (3yr, 10 employees, hardened), testing progressively harder skills:
- Tutorial/Easy: basic game loop, don't over-parallelize
- Medium: prestige climbing + domain specialization
- Hard: precise ETA computation + deadline reasoning
- Nightmare: sustained perfection under compounding payroll pressure

**Determinism**: SHA256-based RNG streams with seed control. Same seed + same preset = same world. Market task replacements are also deterministic.
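One common way to build hash-derived RNG streams is sketched below. It is an assumption about the general technique, not a copy of yc-bench's code: the key format and the stream names (`market`, `salaries`) are hypothetical.

```python
import hashlib
import random


def rng_stream(seed: int, preset: str, stream: str) -> random.Random:
    """Derive an independent, reproducible RNG for one named stream.

    The "seed:preset:stream" key layout is a placeholder for this sketch.
    """
    key = f"{seed}:{preset}:{stream}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes of the digest as the integer seed.
    return random.Random(int.from_bytes(digest[:8], "big"))


market_a = rng_stream(42, "medium", "market")
market_b = rng_stream(42, "medium", "market")
salaries = rng_stream(42, "medium", "salaries")

draws_a = [market_a.random() for _ in range(8)]
draws_b = [market_b.random() for _ in range(8)]
draws_s = [salaries.random() for _ in range(8)]
```

Because each stream is seeded from its own hash, drawing extra values from one stream (for example, market replacements) does not shift the values produced by another, which is what makes "same seed + same preset = same world" hold even when agents take different actions.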

### Key Design Decisions

1. **CLI-only interface**: All interaction goes through subprocess commands that return JSON, with no direct DB or API access. This maps directly onto Hermes Agent's `terminal` tool.
2. **SQLite state persistence**: Full game state in a single DB file per run. Enables replay, analysis, and clean isolation.
3. **Scratchpad for persistent memory**: The agent gets a scratchpad that persists through context truncation — tests whether agents proactively use external memory for strategic planning (only Claude 3.5 Sonnet did in their benchmarks).
4. **Business calendar**: Weekday-only 9AM-6PM simulation with proper payroll boundaries. Adds realistic time management pressure.
5. **Compounding mechanics**: Salary bumps, prestige gating, and task deadlines create cascading consequences from early decisions.
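To make the business-calendar decision concrete, the sketch below advances a timestamp by working hours only, skipping nights and weekends. It assumes task durations are counted in 9AM-6PM weekday hours; yc-bench's actual calendar rules (holidays, payroll boundaries) may differ, and the function names are hypothetical.

```python
from datetime import datetime, time, timedelta

OPEN, CLOSE = time(9, 0), time(18, 0)


def _next_open(t: datetime) -> datetime:
    """Move t forward to the next instant inside business hours."""
    if t.time() >= CLOSE:
        t = datetime.combine(t.date() + timedelta(days=1), OPEN)
    elif t.time() < OPEN:
        t = datetime.combine(t.date(), OPEN)
    while t.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        t = datetime.combine(t.date() + timedelta(days=1), OPEN)
    return t


def add_business_hours(start: datetime, hours: float) -> datetime:
    """Advance `start` by `hours` of working time, skipping nights and weekends."""
    t = _next_open(start)
    remaining = timedelta(hours=hours)
    while remaining > timedelta(0):
        end_of_day = datetime.combine(t.date(), CLOSE)
        step = min(remaining, end_of_day - t)
        t += step
        remaining -= step
        if remaining > timedelta(0):
            # Day is used up; continue at the next business-day opening.
            t = _next_open(t)
    return t
```

For example, four working hours starting Friday 2024-01-05 at 16:00 finish on Monday 2024-01-08 at 11:00, which is the kind of ETA arithmetic the Hard preset is said to stress.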

---

## Current State in Hermes Agent

### Existing Benchmark Infrastructure
- `environments/benchmarks/terminalbench_2/` — eval-only benchmark for per-task coding challenges
- `environments/hermes_base_env.py` — abstract base class (`HermesAgentBaseEnv`)
- `environments/agent_loop.py` — reusable multi-turn agent engine (`HermesAgentLoop`)
- `environments/tool_context.py` — per-rollout tool access for verification

### Pattern for New Benchmarks
```
HermesAgentBaseEnv -> YCBenchEvalEnv
  - config_init() -> default config + server configs
  - setup() -> load evaluation matrix (presets x seeds)
  - evaluate() -> run all eval items
  - rollout_and_score_eval() -> per-item: init sim, run agent loop, extract score
```

### Gap
No existing benchmark tests long-horizon strategic decision-making. TerminalBench2 evaluates single-task execution. YC-Bench would measure a fundamentally different capability dimension.

---

## Implementation Plan

### Classification

This is an **Atropos environment** (neither a skill nor a tool). It would live under `environments/benchmarks/yc_bench/`, following the same pattern as TerminalBench2.

### Integration Strategy: Direct CLI (bypass yc-bench's agent loop)

The key architectural insight: yc-bench's agent interacts with the simulation **entirely through CLI subprocess calls**. All state lives in SQLite. This means we can completely bypass their built-in agent loop and LiteLLM integration, having our own `HermesAgentLoop` drive the interaction instead.

```
Hermes Agent (via HermesAgentLoop)
  -> terminal tool -> subprocess("yc-bench market browse") -> JSON output
  -> terminal tool -> subprocess("yc-bench task accept --task-id X") -> JSON output  
  -> terminal tool -> subprocess("yc-bench sim resume") -> JSON output (advance time)
  -> ... (100-500 turns per run)
```

### What We'd Need

1. **yc-bench** installed as a pip dependency (or cloned and installed editable)
2. **Environment class** (`yc_bench_env.py`) extending `HermesAgentBaseEnv`
3. **Config class** with yc-bench-specific fields (presets, seeds, horizon, max_turns)
4. **System prompt** adapted from yc-bench's `agent/prompt.py` — explains the CEO role and CLI commands
5. **Scoring function** that reads the SQLite DB after each run to extract results
6. **Evaluation matrix** defining which (preset, seed) combinations to run
7. **default.yaml** and **run_eval.sh** convenience scripts
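Items 3 and 6 could look roughly like the following. This is a standalone sketch: the class names, field names, and default values are placeholders, and a real config would extend whatever base config `HermesAgentBaseEnv` expects.

```python
from dataclasses import dataclass, field
from itertools import product


@dataclass(frozen=True)
class EvalItem:
    """One cell of the evaluation matrix."""
    preset: str
    seed: int


@dataclass
class YCBenchEvalConfig:
    # All field names and defaults here are hypothetical.
    presets: list[str] = field(default_factory=lambda: ["medium"])
    seeds: list[int] = field(default_factory=lambda: [1, 2, 3])
    max_turns: int = 200


def build_eval_matrix(cfg: YCBenchEvalConfig) -> list[EvalItem]:
    """Expand the config into every (preset, seed) combination."""
    return [EvalItem(p, s) for p, s in product(cfg.presets, cfg.seeds)]


matrix = build_eval_matrix(YCBenchEvalConfig(presets=["medium", "hard"]))
```

With two presets and three seeds, `matrix` holds six items, ordered preset-first so that all seeds of one preset run together.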

### Phased Rollout

**Phase 1: Eval-only benchmark environment**
- `environments/benchmarks/yc_bench/yc_bench_env.py` — eval environment
- `environments/benchmarks/yc_bench/default.yaml` — default config (medium preset, 3 seeds)
- `environments/benchmarks/yc_bench/run_eval.sh` — convenience script
- Scoring: survival (0.0/1.0) + normalized funds (0.0-1.0 scale relative to initial capital)
- Aggregate score: `0.5 * survival + 0.5 * normalized_funds` (tunable)
- Per-difficulty and per-seed results in JSONL log
- Support `fast_test` preset for quick validation runs (50-turn cap)
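
The Phase 1 aggregate score might be computed as below. The log-scale normalization and the `cap_multiple` of 10x initial capital are assumptions chosen for illustration; the proposal leaves the normalization open.

```python
import math


def normalize_funds(
    final_funds: float,
    initial_funds: float = 250_000.0,
    cap_multiple: float = 10.0,
) -> float:
    """Map final funds to [0, 1] on a log scale relative to initial capital.

    Funds at or below the initial capital score 0.0; funds at or above
    `cap_multiple` times the initial capital score 1.0. Both the log scale
    and the cap are placeholder choices.
    """
    if final_funds <= 0 or initial_funds <= 0:
        return 0.0
    ratio = final_funds / initial_funds
    score = math.log(ratio) / math.log(cap_multiple)
    return max(0.0, min(1.0, score))


def aggregate_score(survived: bool, final_funds: float, w_survival: float = 0.5) -> float:
    """Weighted sum of survival (0.0/1.0) and normalized funds."""
    return w_survival * float(survived) + (1.0 - w_survival) * normalize_funds(final_funds)
```

Under these assumptions, a run that survives but only breaks even scores 0.5, and a bankrupt run scores 0.0, so survival alone never outranks survival with growth.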

**Phase 2: Multi-difficulty evaluation suite**
- Run all presets (easy -> nightmare) as a difficulty ladder
- Aggregate scoring across difficulty tiers
- Radar chart output of domain-specific prestige profiles
- Comparison framework against yc-bench's published baseline results (Claude, GPT-4o, Gemini)
- wandb integration for tracking runs

**Phase 3: RL training environment (stretch)**
- Convert from eval-only to training-capable environment
- Dense reward signal: per-task completion events, financial health checkpoints
- Episode termination on bankruptcy -> negative reward trajectory
- Curriculum learning: start on `easy`, advance to `hard` as agent improves
- This would be a novel use of yc-bench — they only do eval, not RL
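
A dense reward for Phase 3 could be shaped along these lines. The event kinds, the `StepEvent` type, and every reward magnitude are hypothetical placeholders; none of them come from yc-bench.

```python
from dataclasses import dataclass


@dataclass
class StepEvent:
    # kind is one of: "task_completed", "task_failed", "checkpoint", "bankrupt"
    kind: str
    funds: float = 0.0


def dense_reward(event: StepEvent, initial_funds: float = 250_000.0) -> float:
    """Return a per-event reward; all magnitudes are illustrative only."""
    if event.kind == "task_completed":
        return 0.1
    if event.kind == "task_failed":
        return -0.1
    if event.kind == "checkpoint":
        # Financial health: relative change vs. initial capital, clipped to [-1, 1].
        change = (event.funds - initial_funds) / initial_funds
        return 0.2 * max(-1.0, min(1.0, change))
    if event.kind == "bankrupt":
        return -1.0  # Terminal penalty; the episode ends here.
    return 0.0


trajectory = [
    StepEvent("task_completed"),
    StepEvent("checkpoint", funds=500_000.0),
    StepEvent("task_failed"),
    StepEvent("bankrupt"),
]
episode_return = sum(dense_reward(e) for e in trajectory)
```

Keeping the bankruptcy penalty larger than any sum of small per-task rewards is one way to discourage policies that farm early completions while drifting toward insolvency.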

---

## Pros & Cons

### Pros
- **Fills a real gap**: No existing Hermes benchmark tests long-horizon strategic coherence
- **Clean integration**: CLI-only interface maps directly to our `terminal` tool — no custom Python hooks needed
- **Deterministic and reproducible**: Same seed = same world, enabling apples-to-apples model comparisons
- **Difficulty ladder**: 9 presets from trivial to brutal, useful for progressive evaluation
- **Tests scratchpad/memory usage**: Reveals whether agents proactively use persistent memory — directly relevant to Hermes's memory system
- **Compounding decisions**: Tests whether agents can reason about cascading consequences, not just one-shot tasks
- **Active development**: Maintained by a credible AI research team (ex-HuggingFace, PhD researchers)
- **RL training potential**: Could become a long-horizon RL training environment (novel contribution)

### Cons / Risks
- **Cost per run**: Each evaluation run is 100-500+ LLM turns. At ~$0.01-0.05/turn, a 500-turn Hard preset run could cost $5-25 in API calls. A full eval suite (9 presets x 3 seeds = 27 runs) could cost $100-500+.
- **Wall-clock time**: Each run takes 30-60+ minutes, so a full 27-run suite executed sequentially would take roughly 14-27+ hours.
- **External dependency**: yc-bench is young (12 GitHub stars at the time of writing) and may change APIs or become unmaintained. We'd be coupling to their CLI interface.
- **Narrow domain**: Simulates a specific business scenario — may not generalize to broader agent capabilities. However, the underlying skills tested (resource allocation, temporal reasoning, financial planning) are broadly applicable.
- **Additional dependencies**: Adds sqlalchemy, matplotlib, litellm to the environments dependency chain (litellm may already be present via tinker-atropos).
- **Small eval matrix**: Only ~30 meaningful (preset x seed) combinations vs. TerminalBench2's hundreds of tasks. Statistical power is limited.
- **Scoring subjectivity**: Unlike binary pass/fail, the multi-dimensional scoring (survival, funds, prestige) requires design decisions about weighting and normalization.
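
The cost and wall-clock bounds are simple range arithmetic, which the sketch below makes explicit so the inputs can be varied. The `Range` helper and `suite_budget` function are hypothetical names for this example; the input figures are the per-turn and per-run estimates quoted in this list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    low: float
    high: float

    def scale(self, factor: float) -> "Range":
        """Multiply both bounds by a constant."""
        return Range(self.low * factor, self.high * factor)

    def times(self, other: "Range") -> "Range":
        """Multiply two non-negative ranges bound by bound."""
        return Range(self.low * other.low, self.high * other.high)


def suite_budget(
    runs: int, turns: Range, usd_per_turn: Range, minutes_per_run: Range
) -> tuple[Range, Range]:
    """Return (total cost in USD, total sequential wall-clock hours)."""
    cost = turns.times(usd_per_turn).scale(runs)
    hours = minutes_per_run.scale(runs / 60)
    return cost, hours


cost, hours = suite_budget(27, Range(100, 500), Range(0.01, 0.05), Range(30, 60))
```

Taking both extremes at once, 27 sequential runs come to roughly $27-675 and 13.5-27 hours; the open-ended "+" on turns and minutes can push the upper bounds higher.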

---

## Open Questions

1. **Which presets to include in default eval?** Running all 9 presets x 3 seeds is expensive. Suggest `fast_test` (quick validation) + `medium` + `hard` (3 seeds each = 9 runs) as default.
2. **Scoring normalization**: How to normalize final funds into a 0-1 score? Options: (a) log-scale relative to initial capital, (b) relative to bot baseline performance, (c) percentile against published results.
3. **Max turns cap**: yc-bench's `fast_test` preset caps at 50 turns. For our eval, should we impose a global turn cap (e.g., 200) to control cost?
4. **Terminal backend**: Should each run use Modal sandboxing (like TerminalBench2) for isolation, or is local execution sufficient given that state lives in separate SQLite files?
5. **Context management**: yc-bench's own agent loop truncates to last 20 conversation rounds. Should we replicate this truncation strategy in our `HermesAgentLoop`, or let Hermes's native context compression handle it?
6. **Should we vendor the system prompt?** Adapting yc-bench's CEO system prompt into our environment keeps us decoupled from upstream changes, but adds maintenance burden.
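
For question 5, a last-N-rounds truncation could be sketched as follows. It assumes a "round" is one assistant message plus the tool results that follow it, and that system messages are always kept; how yc-bench itself defines a round is not specified here, and `truncate_rounds` is a hypothetical name.

```python
from typing import TypedDict


class Message(TypedDict):
    role: str
    content: str


def truncate_rounds(messages: list[Message], keep_rounds: int = 20) -> list[Message]:
    """Keep system messages plus the last `keep_rounds` conversation rounds.

    A round is assumed to start at each assistant message and to include
    every non-system message up to the next assistant message.
    """
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    if keep_rounds <= 0:
        return system
    starts = [i for i, m in enumerate(rest) if m["role"] == "assistant"]
    if len(starts) <= keep_rounds:
        return system + rest  # Nothing to drop yet.
    cut = starts[-keep_rounds]
    return system + rest[cut:]
```

Note that this drops the opening user message once the window fills, so any standing instructions would need to live in the system prompt or the scratchpad.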

---

## References

- [collinear-ai/yc-bench GitHub](https://github.com/collinear-ai/yc-bench)
- [Collinear AI](https://collinear.ai/) — company behind yc-bench
- [Twitter announcement thread](https://x.com/CollinearAI/status/2027531502234570768)
- [Nazneen Rajani (founder)](https://www.nazneenrajani.com/) — ex-HuggingFace Robustness Lead, MIT Innovators Under 35
- [TerminalBench2 env](https://github.com/NousResearch/hermes-agent/tree/main/environments/benchmarks/terminalbench_2) — existing benchmark pattern to follow
- [Hermes Agent environments/](https://github.com/NousResearch/hermes-agent/tree/main/environments) — Atropos RL integration layer

