Overview
YC-Bench by Collinear AI (founded by Nazneen Rajani, former Robustness Research Lead at Hugging Face) is a deterministic, long-horizon benchmark that tests LLM agents' ability to act as a tech startup CEO. The agent manages a simulated company over 1-3 years, making compounding decisions about resource allocation, cash flow, task management, and prestige specialization across 7 skill domains.
Unlike TerminalBench2 (which evaluates per-task coding ability with binary pass/fail), YC-Bench measures long-term strategic coherence — whether an agent can maintain consistent strategy, manage compounding consequences, and adapt plans over hundreds of turns. This fills a capability gap in our evaluation suite: we can benchmark per-task execution (TerminalBench2) but not sustained multi-turn decision-making.
Initial results show frontier models still struggle: Claude 3.5 Sonnet survives only 1/3 Hard runs, while Gemini 1.5 Flash swept Nightmare difficulty. The benchmark reveals that the core gap for agents is "not reasoning ability but temporal coherence."
Research Findings
How YC-Bench Works
YC-Bench is a discrete-event simulation backed by SQLite. The agent starts with $250K seed capital and 5 employees, then manages:
- 7 skill domains: system, research, data, frontend, backend, training, hardware
- Prestige system (1.0-10.0): Gates access to higher-paying tasks. Prestige in each domain must be earned incrementally.
- Employee management: Hire/assign employees with domain-specific skill rates. Salary bumps compound after each completed task.
- Task pipeline: Browse market → accept task → assign employees → dispatch → advance time → repeat
- Financial pressure: Biweekly payroll. Bankruptcy = game over.
Agent interface: The agent gets exactly ONE tool — run_command(command: str) — which executes yc-bench <subcommand> CLI calls that return JSON. Available commands:
| Category | Commands |
|---|---|
| Observe | company status, employee list, market browse [--domain/--prestige-lte/--reward-min], task list/inspect, finance ledger, report monthly |
| Act | task accept, task assign, task dispatch, task cancel, sim resume |
| Memory | scratchpad write/append/read/clear |
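To make the single-tool interface concrete, here is a minimal sketch of what a run_command wrapper could look like on our side. It assumes only what is stated above (commands start with yc-bench and print JSON to stdout); the shape of the returned dict, the injectable runner parameter, and the raw_stdout fallback are illustrative choices, not yc-bench's actual implementation.

```python
import json
import shlex
import subprocess
from typing import Any, Callable, Dict


def run_command(command: str, runner: Callable[..., Any] = subprocess.run) -> Dict[str, Any]:
    """Run one yc-bench CLI call and return its parsed JSON output.

    `runner` is injectable so the wrapper can be exercised without the
    real CLI installed (an assumption made for testability).
    """
    argv = shlex.split(command)
    # Refuse anything that is not a yc-bench invocation.
    if not argv or argv[0] != "yc-bench":
        raise ValueError("only yc-bench commands are allowed")

    proc = runner(argv, capture_output=True, text=True, check=False)

    # The CLI is described as returning JSON; fall back to the raw text
    # if a call ever prints something else (hypothetical safeguard).
    try:
        payload = json.loads(proc.stdout)
    except json.JSONDecodeError:
        payload = {"raw_stdout": proc.stdout}

    return {"returncode": proc.returncode, "stderr": proc.stderr, "output": payload}
```

In the Hermes integration the terminal tool would play this role, so a wrapper like this is mainly useful for local smoke tests and for scoring scripts.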
Scoring is multi-dimensional:
- Survival (binary): Did the company avoid bankruptcy?
- Final funds: Total capital at horizon end
- Task win rate: % of accepted tasks completed on-time (>58% survives, <40% bankrupt)
- Prestige profile: Radar chart of domain-specific prestige levels
- API cost: USD spent on LLM calls
9 difficulty presets from tutorial (1yr, 3 employees, easy) to default (3yr, 10 employees, hardened), testing progressively harder skills:
- Tutorial/Easy: basic game loop, don't over-parallelize
- Medium: prestige climbing + domain specialization
- Hard: precise ETA computation + deadline reasoning
- Nightmare: sustained perfection under compounding payroll pressure
Determinism: SHA256-based RNG streams with seed control. Same seed + same preset = same world. Market task replacements are also deterministic.
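As an illustration of how hash-derived RNG streams can give "same seed + same preset = same world", the sketch below derives an independent generator per named stream from a SHA256 digest. The key format (seed:preset:stream) and the stream names are assumptions for illustration; yc-bench's actual derivation may differ.

```python
import hashlib
import random


def stream_rng(seed: int, preset: str, stream: str) -> random.Random:
    """Return a deterministic RNG for one named stream of one world.

    The "seed:preset:stream" key layout is hypothetical.
    """
    key = f"{seed}:{preset}:{stream}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes of the digest as the integer seed.
    return random.Random(int.from_bytes(digest[:8], "big"))


# Two generators built from the same inputs produce the same sequence,
# while a different stream name yields an independent sequence.
market_a = stream_rng(42, "medium", "market")
market_b = stream_rng(42, "medium", "market")
employees = stream_rng(42, "medium", "employees")
```

Separate streams mean that drawing extra numbers for one subsystem (for example, market replacements) does not shift the values another subsystem sees.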
Key Design Decisions
- CLI-only interface: All interaction through subprocess JSON commands, with no direct DB or API access. This maps perfectly to Hermes Agent's terminal tool.
- SQLite state persistence: Full game state in a single DB file per run. Enables replay, analysis, and clean isolation.
- Scratchpad for persistent memory: The agent gets a scratchpad that persists through context truncation — tests whether agents proactively use external memory for strategic planning (only Claude 3.5 Sonnet did in their benchmarks).
- Business calendar: Weekday-only 9AM-6PM simulation with proper payroll boundaries. Adds realistic time management pressure.
- Compounding mechanics: Salary bumps, prestige gating, and task deadlines create cascading consequences from early decisions.
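The business calendar is what makes ETA reasoning non-trivial: work hours only accrue on weekdays between 9AM and 6PM. The sketch below shows one way to compute a completion time under that rule. It is a simplified model that ignores payroll boundaries and holidays, and the function names are hypothetical rather than taken from yc-bench.

```python
from datetime import datetime, timedelta

WORK_START_HOUR = 9   # 9AM
WORK_END_HOUR = 18    # 6PM


def _next_work_instant(t: datetime) -> datetime:
    """Move t forward to the nearest moment inside business hours."""
    while True:
        if t.weekday() >= 5 or t.hour >= WORK_END_HOUR:
            # Weekend or after closing: jump to 9AM the next calendar day.
            t = (t + timedelta(days=1)).replace(
                hour=WORK_START_HOUR, minute=0, second=0, microsecond=0)
            continue
        if t.hour < WORK_START_HOUR:
            # Before opening on a weekday: wait until 9AM.
            return t.replace(hour=WORK_START_HOUR, minute=0, second=0, microsecond=0)
        return t


def add_business_hours(start: datetime, hours: float) -> datetime:
    """Return the time at which `hours` of work finishes if begun at `start`."""
    current = _next_work_instant(start)
    remaining = timedelta(hours=hours)
    while remaining > timedelta(0):
        end_of_day = current.replace(
            hour=WORK_END_HOUR, minute=0, second=0, microsecond=0)
        available = end_of_day - current
        if remaining <= available:
            return current + remaining
        # Use up the rest of today and continue on the next working day.
        remaining -= available
        current = _next_work_instant(end_of_day)
    return current
```

For example, four hours of work started on a Friday at 4PM finishes on the following Monday at 11AM, which is the kind of deadline slip an agent has to anticipate before accepting a task.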
Current State in Hermes Agent
Existing Benchmark Infrastructure
environments/benchmarks/terminalbench_2/ — eval-only benchmark for per-task coding challenges
environments/hermes_base_env.py — abstract base class (HermesAgentBaseEnv)
environments/agent_loop.py — reusable multi-turn agent engine (HermesAgentLoop)
environments/tool_context.py — per-rollout tool access for verification
Pattern for New Benchmarks
HermesAgentBaseEnv -> YCBenchEvalEnv
- config_init() -> default config + server configs
- setup() -> load evaluation matrix (presets x seeds)
- evaluate() -> run all eval items
- rollout_and_score_eval() -> per-item: init sim, run agent loop, extract score
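The pattern above could be sketched roughly as follows. To keep the example runnable on its own, HermesAgentBaseEnv is replaced by an empty stand-in and the rollout is a placeholder; the real base class in environments/hermes_base_env.py likely has different (possibly async) signatures, so treat the method shapes here as assumptions.

```python
from itertools import product
from typing import Any, Dict, List, Sequence, Tuple


class HermesAgentBaseEnv:
    """Stand-in stub so this sketch runs standalone (not the real base class)."""


class YCBenchEvalEnv(HermesAgentBaseEnv):
    def __init__(self, presets: Sequence[str], seeds: Sequence[int]) -> None:
        self.presets = list(presets)
        self.seeds = list(seeds)
        self.eval_items: List[Tuple[str, int]] = []

    @classmethod
    def config_init(cls) -> Dict[str, Any]:
        # Hypothetical defaults: medium preset with three seeds.
        return {"presets": ["medium"], "seeds": [1, 2, 3]}

    def setup(self) -> None:
        # Evaluation matrix: every (preset, seed) combination.
        self.eval_items = list(product(self.presets, self.seeds))

    def rollout_and_score_eval(self, item: Tuple[str, int]) -> Dict[str, Any]:
        preset, seed = item
        # Placeholder: the real method would init the sim, run the agent
        # loop, and extract a score from the run's SQLite file.
        return {"preset": preset, "seed": seed, "score": None}

    def evaluate(self) -> List[Dict[str, Any]]:
        return [self.rollout_and_score_eval(item) for item in self.eval_items]
```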
Gap
No existing benchmark tests long-horizon strategic decision-making. TerminalBench2 evaluates single-task execution. YC-Bench would measure a fundamentally different capability dimension.
Implementation Plan
Classification
This is an Atropos environment (neither a skill nor a tool). It lives under environments/benchmarks/yc_bench/ following the exact pattern of TerminalBench2.
Integration Strategy: Direct CLI (bypass yc-bench's agent loop)
The key architectural insight: yc-bench's agent interacts with the simulation entirely through CLI subprocess calls. All state lives in SQLite. This means we can completely bypass their built-in agent loop and LiteLLM integration, having our own HermesAgentLoop drive the interaction instead.
Hermes Agent (via HermesAgentLoop)
-> terminal tool -> subprocess("yc-bench market browse") -> JSON output
-> terminal tool -> subprocess("yc-bench task accept --task-id X") -> JSON output
-> terminal tool -> subprocess("yc-bench sim resume") -> JSON output (advance time)
-> ... (100-500 turns per run)
What We'd Need
- yc-bench installed as a pip dependency (or cloned and installed editable)
- Environment class (yc_bench_env.py) extending HermesAgentBaseEnv
- Config class with yc-bench-specific fields (presets, seeds, horizon, max_turns)
- System prompt adapted from yc-bench's agent/prompt.py, which explains the CEO role and CLI commands
- Scoring function that reads the SQLite DB after each run to extract results
- Evaluation matrix defining which (preset, seed) combinations to run
- default.yaml and run_eval.sh convenience scripts
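As a rough idea of the config class, the sketch below uses a plain dataclass with the four fields named above. The field types, the default seed values, the 200-turn default (borrowed from the turn-cap question under Open Questions), and the use of None to mean "use the preset's own horizon" are all assumptions; the real class would extend whatever config base HermesAgentBaseEnv expects.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class YCBenchEvalConfig:
    # Which difficulty presets to run (default mirrors the planned default.yaml).
    presets: List[str] = field(default_factory=lambda: ["medium"])
    # Seeds per preset; the concrete values are placeholders.
    seeds: List[int] = field(default_factory=lambda: [1, 2, 3])
    # Simulated horizon in years; None means "keep the preset's horizon".
    horizon: Optional[int] = None
    # Global cap on agent turns per run, to bound API cost.
    max_turns: int = 200

    def __post_init__(self) -> None:
        if not self.presets or not self.seeds:
            raise ValueError("at least one preset and one seed are required")
        if self.max_turns < 1:
            raise ValueError("max_turns must be positive")
```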
Phased Rollout
Phase 1: Eval-only benchmark environment
environments/benchmarks/yc_bench/yc_bench_env.py — eval environment
environments/benchmarks/yc_bench/default.yaml — default config (medium preset, 3 seeds)
environments/benchmarks/yc_bench/run_eval.sh — convenience script
- Scoring: survival (0.0/1.0) + normalized funds (0.0-1.0 scale relative to initial capital)
- Aggregate score: 0.5 * survival + 0.5 * normalized_funds (tunable)
- Per-difficulty and per-seed results in JSONL log
- Support fast_test preset for quick validation runs (50-turn cap)
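The Phase 1 scoring could be sketched as below. The aggregate formula is the one stated above; the funds normalization is only one candidate (a log scale relative to the $250K initial capital, capped at a hypothetical 10x multiple), since the normalization itself is still an open question.

```python
import math

INITIAL_CAPITAL = 250_000.0  # seed capital stated in the benchmark description


def normalized_funds(final_funds: float,
                     initial: float = INITIAL_CAPITAL,
                     max_multiple: float = 10.0) -> float:
    """Map final funds to [0, 1] on a log scale.

    Break-even or worse scores 0.0; `max_multiple` times the initial
    capital (an assumed cap) or more scores 1.0.
    """
    if final_funds <= initial:
        return 0.0
    ratio = final_funds / initial
    return min(1.0, math.log(ratio) / math.log(max_multiple))


def aggregate_score(survived: bool, funds_score: float,
                    survival_weight: float = 0.5) -> float:
    """Weighted sum of survival (0.0/1.0) and normalized funds."""
    return survival_weight * float(survived) + (1.0 - survival_weight) * funds_score
```

With these choices a company that survives but ends below its starting capital scores 0.5, which keeps survival and growth visibly separate in the aggregate.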
Phase 2: Multi-difficulty evaluation suite
- Run all presets (easy -> nightmare) as a difficulty ladder
- Aggregate scoring across difficulty tiers
- Radar chart output of domain-specific prestige profiles
- Comparison framework against yc-bench's published baseline results (Claude, GPT-4o, Gemini)
- wandb integration for tracking runs
Phase 3: RL training environment (stretch)
- Convert from eval-only to training-capable environment
- Dense reward signal: per-task completion events, financial health checkpoints
- Episode termination on bankruptcy -> negative reward trajectory
- Curriculum learning: start on easy, advance to hard as agent improves
- This would be a novel use of yc-bench — they only do eval, not RL
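For Phase 3, a dense reward along the lines described above might look like the following sketch. The weights, the funds scaling, and the size of the bankruptcy penalty are placeholder values chosen for illustration and would need tuning; nothing here comes from yc-bench itself.

```python
def step_reward(tasks_completed: int,
                funds_before: float,
                funds_after: float,
                bankrupt: bool,
                *,
                task_bonus: float = 1.0,
                funds_scale: float = 1e-5,
                bankruptcy_penalty: float = -10.0) -> float:
    """Reward for one simulation step (all coefficients are hypothetical).

    Bankruptcy ends the episode with a fixed negative reward; otherwise
    the reward combines completed tasks with the change in funds.
    """
    if bankrupt:
        return bankruptcy_penalty
    funds_delta = funds_after - funds_before
    return task_bonus * tasks_completed + funds_scale * funds_delta
```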
Pros & Cons
Pros
- Fills a real gap: No existing Hermes benchmark tests long-horizon strategic coherence
- Clean integration: CLI-only interface maps directly to our terminal tool, with no custom Python hooks needed
- Deterministic and reproducible: Same seed = same world, enabling apples-to-apples model comparisons
- Difficulty ladder: 9 presets from trivial to brutal, useful for progressive evaluation
- Tests scratchpad/memory usage: Reveals whether agents proactively use persistent memory — directly relevant to Hermes's memory system
- Compounding decisions: Tests whether agents can reason about cascading consequences, not just one-shot tasks
- Active development: Maintained by a credible AI research team (ex-HuggingFace, PhD researchers)
- RL training potential: Could become a long-horizon RL training environment (novel contribution)
Cons / Risks
- Cost per run: Each evaluation run is 100-500+ LLM turns. At ~$0.01-0.05/turn, a single Hard preset run at the upper end (~500 turns) could cost $5-25 in API calls. A full eval suite (9 presets x 3 seeds = 27 runs) could cost $100-500+.
- Wall-clock time: Each run takes 30-60+ minutes. Full eval suite = 15-30 hours.
- External dependency: yc-bench is young (12 GitHub stars) and may change APIs or become unmaintained. We'd be coupling to their CLI interface.
- Narrow domain: Simulates a specific business scenario — may not generalize to broader agent capabilities. However, the underlying skills tested (resource allocation, temporal reasoning, financial planning) are broadly applicable.
- Additional dependencies: Adds sqlalchemy, matplotlib, litellm to the environments dependency chain (litellm may already be present via tinker-atropos).
- Small eval matrix: Only ~30 meaningful (preset x seed) combinations vs. TerminalBench2's hundreds of tasks. Statistical power is limited.
- Scoring subjectivity: Unlike binary pass/fail, the multi-dimensional scoring (survival, funds, prestige) requires design decisions about weighting and normalization.
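The cost and wall-clock estimates above are back-of-the-envelope figures, and a small helper makes it easy to redo them under different assumptions. The per-turn prices and run durations are the rough ranges quoted above; the function names are hypothetical, and sequential execution is assumed for the time estimate.

```python
from typing import Tuple


def run_cost_range(turns: int,
                   low_per_turn: float = 0.01,
                   high_per_turn: float = 0.05) -> Tuple[float, float]:
    """Return (low, high) API cost in USD for one run of `turns` turns."""
    return turns * low_per_turn, turns * high_per_turn


def suite_cost_range(runs: int, turns: int) -> Tuple[float, float]:
    """Return (low, high) API cost in USD for `runs` runs of equal length."""
    low, high = run_cost_range(turns)
    return runs * low, runs * high


def suite_hours_range(runs: int,
                      min_minutes: float = 30.0,
                      max_minutes: float = 60.0) -> Tuple[float, float]:
    """Return (low, high) wall-clock hours, assuming runs execute one at a time."""
    return runs * min_minutes / 60.0, runs * max_minutes / 60.0
```

Because actual turn counts vary by preset, plugging in measured turn counts from a few pilot runs would give a tighter budget than the ranges listed here.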
Open Questions
- Which presets to include in default eval? Running all 9 presets x 3 seeds is expensive. Suggest fast_test (quick validation) + medium + hard (3 seeds each = 9 runs) as default.
- Scoring normalization: How to normalize final funds into a 0-1 score? Options: (a) log-scale relative to initial capital, (b) relative to bot baseline performance, (c) percentile against published results.
- Max turns cap: yc-bench's fast_test preset caps at 50 turns. For our eval, should we impose a global turn cap (e.g., 200) to control cost?
- Terminal backend: Should each run use Modal sandboxing (like TerminalBench2) for isolation, or is local execution sufficient given that state lives in separate SQLite files?
- Context management: yc-bench's own agent loop truncates to last 20 conversation rounds. Should we replicate this truncation strategy in our HermesAgentLoop, or let Hermes's native context compression handle it?
- Should we vendor the system prompt? Adapting yc-bench's CEO system prompt into our environment keeps us decoupled from upstream changes, but adds maintenance burden.
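To ground the context-management question, here is a minimal sketch of a "last N rounds" truncation. It assumes chat-style messages with a role key, treats each assistant message plus the messages that follow it as one round, and always keeps whatever precedes the first assistant message (typically the system prompt and initial instructions). How yc-bench's own loop defines a round is not confirmed here.

```python
from typing import Dict, List

Message = Dict[str, str]


def truncate_to_last_rounds(messages: List[Message], max_rounds: int = 20) -> List[Message]:
    """Keep the leading prompt messages plus the last `max_rounds` rounds."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")

    # A round starts at each assistant message (assumed definition).
    round_starts = [i for i, m in enumerate(messages) if m["role"] == "assistant"]
    if len(round_starts) <= max_rounds:
        return list(messages)

    prefix = messages[:round_starts[0]]          # system prompt, first user turn
    recent = messages[round_starts[-max_rounds]:]  # the newest rounds
    return prefix + recent
```

Anything the agent wants to remember beyond this window has to live in the scratchpad, which is exactly the behavior the benchmark is probing.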