Feature: YC-Bench long-horizon agent benchmark environment #340

@teknium1

Overview

YC-Bench by Collinear AI (founded by Nazneen Rajani, former Robustness Research Lead at Hugging Face) is a deterministic, long-horizon benchmark that tests LLM agents' ability to act as a tech startup CEO. The agent manages a simulated company over 1-3 years, making compounding decisions about resource allocation, cash flow, task management, and prestige specialization across 7 skill domains.

Unlike TerminalBench2 (which evaluates per-task coding ability with binary pass/fail), YC-Bench measures long-term strategic coherence — whether an agent can maintain consistent strategy, manage compounding consequences, and adapt plans over hundreds of turns. This fills a capability gap in our evaluation suite: we can benchmark per-task execution (TerminalBench2) but not sustained multi-turn decision-making.

Initial results show frontier models still struggle: Claude 3.5 Sonnet survived only 1 of 3 Hard runs, while Gemini 1.5 Flash swept Nightmare difficulty. The benchmark suggests that the core gap for agents is "not reasoning ability but temporal coherence."


Research Findings

How YC-Bench Works

YC-Bench is a discrete-event simulation backed by SQLite. The agent starts with $250K seed capital and 5 employees, then manages:

  • 7 skill domains: system, research, data, frontend, backend, training, hardware
  • Prestige system (1.0-10.0): Gates access to higher-paying tasks. Prestige in each domain must be earned incrementally.
  • Employee management: Hire/assign employees with domain-specific skill rates. Salary bumps compound after each completed task.
  • Task pipeline: Browse market → accept task → assign employees → dispatch → advance time → repeat
  • Financial pressure: Biweekly payroll. Bankruptcy = game over.

Agent interface: The agent gets exactly ONE tool — run_command(command: str) — which executes yc-bench <subcommand> CLI calls that return JSON. Available commands:

Category | Commands
Observe  | company status, employee list, market browse [--domain/--prestige-lte/--reward-min], task list/inspect, finance ledger, report monthly
Act      | task accept, task assign, task dispatch, task cancel, sim resume
Memory   | scratchpad write/append/read/clear

Scoring is multi-dimensional:

  • Survival (binary): Did the company avoid bankruptcy?
  • Final funds: Total capital at horizon end
  • Task win rate: % of accepted tasks completed on time (runs above 58% survive; runs below 40% go bankrupt)
  • Prestige profile: Radar chart of domain-specific prestige levels
  • API cost: USD spent on LLM calls

Nine difficulty presets range from tutorial (1 yr, 3 employees, easy) to default (3 yr, 10 employees, hardened), each testing progressively harder skills:

  • Tutorial/Easy: basic game loop, don't over-parallelize
  • Medium: prestige climbing + domain specialization
  • Hard: precise ETA computation + deadline reasoning
  • Nightmare: sustained perfection under compounding payroll pressure

Determinism: SHA256-based RNG streams with seed control. Same seed + same preset = same world. Market task replacements are also deterministic.
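
To make the determinism property concrete, here is a minimal sketch of one way a SHA256-derived RNG stream could work. This is an illustration only, not yc-bench's actual code: the function name `stream_rng`, the key format, and the stream names are all assumptions.

```python
import hashlib
import random


def stream_rng(seed: int, preset: str, stream: str) -> random.Random:
    """Derive an independent, reproducible RNG for one named stream.

    Hypothetical helper: the key layout "seed:preset:stream" is an
    assumption made for this sketch.
    """
    key = f"{seed}:{preset}:{stream}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes of the digest as the integer seed for this stream.
    return random.Random(int.from_bytes(digest[:8], "big"))


# Same seed + same preset + same stream name -> identical draws.
a = stream_rng(42, "medium", "market")
b = stream_rng(42, "medium", "market")
same_world = [a.random() for _ in range(3)] == [b.random() for _ in range(3)]
```

Hashing a per-stream key means draws from one stream (say, market replacements) do not shift the values another stream produces, which is what keeps replays stable when the agent takes a different path.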

Key Design Decisions

  1. CLI-only interface: All interaction goes through subprocess commands that return JSON, with no direct DB or API access. This maps directly onto Hermes Agent's terminal tool.
  2. SQLite state persistence: Full game state in a single DB file per run. Enables replay, analysis, and clean isolation.
  3. Scratchpad for persistent memory: The agent gets a scratchpad that persists through context truncation. This tests whether agents proactively use external memory for strategic planning (only Claude 3.5 Sonnet did in the yc-bench authors' benchmarks).
  4. Business calendar: Weekday-only 9AM-6PM simulation with proper payroll boundaries. Adds realistic time management pressure.
  5. Compounding mechanics: Salary bumps, prestige gating, and task deadlines create cascading consequences from early decisions.

Current State in Hermes Agent

Existing Benchmark Infrastructure

  • environments/benchmarks/terminalbench_2/ — eval-only benchmark for per-task coding challenges
  • environments/hermes_base_env.py — abstract base class (HermesAgentBaseEnv)
  • environments/agent_loop.py — reusable multi-turn agent engine (HermesAgentLoop)
  • environments/tool_context.py — per-rollout tool access for verification

Pattern for New Benchmarks

HermesAgentBaseEnv -> YCBenchEvalEnv
  - config_init() -> default config + server configs
  - setup() -> load evaluation matrix (presets x seeds)
  - evaluate() -> run all eval items
  - rollout_and_score_eval() -> per-item: init sim, run agent loop, extract score
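
The outline above could be sketched roughly as follows. This is a shape illustration only: `BaseEnvStandIn` is a hypothetical stand-in so the snippet runs on its own, the config field names and default seeds are assumptions, and the real HermesAgentBaseEnv signatures may differ (for example, they may be async and may also return server configs from config_init()).

```python
from dataclasses import dataclass, field
from itertools import product


@dataclass
class YCBenchEvalConfig:
    # Field names and defaults are illustrative, not the real config schema.
    presets: list[str] = field(default_factory=lambda: ["medium"])
    seeds: list[int] = field(default_factory=lambda: [1, 2, 3])
    max_turns: int = 200


class BaseEnvStandIn:
    """Hypothetical stand-in for HermesAgentBaseEnv so the sketch runs alone."""

    def __init__(self, config: YCBenchEvalConfig):
        self.config = config
        self.items: list[tuple[str, int]] = []


class YCBenchEvalEnv(BaseEnvStandIn):
    @classmethod
    def config_init(cls) -> YCBenchEvalConfig:
        return YCBenchEvalConfig()

    def setup(self) -> None:
        # Evaluation matrix: every (preset, seed) pair.
        self.items = list(product(self.config.presets, self.config.seeds))

    def rollout_and_score_eval(self, item: tuple[str, int]) -> dict:
        preset, seed = item
        # Placeholder: init the sim, run the agent loop, then extract the score.
        return {"preset": preset, "seed": seed, "score": None}

    def evaluate(self) -> list[dict]:
        return [self.rollout_and_score_eval(item) for item in self.items]


env = YCBenchEvalEnv(YCBenchEvalEnv.config_init())
env.setup()
results = env.evaluate()
```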

Gap

No existing benchmark tests long-horizon strategic decision-making. TerminalBench2 evaluates single-task execution. YC-Bench would measure a fundamentally different capability dimension.


Implementation Plan

Classification

This is an Atropos environment (neither a skill nor a tool). It lives under environments/benchmarks/yc_bench/ following the exact pattern of TerminalBench2.

Integration Strategy: Direct CLI (bypass yc-bench's agent loop)

The key architectural insight: yc-bench's agent interacts with the simulation entirely through CLI subprocess calls. All state lives in SQLite. This means we can completely bypass their built-in agent loop and LiteLLM integration, having our own HermesAgentLoop drive the interaction instead.

Hermes Agent (via HermesAgentLoop)
  -> terminal tool -> subprocess("yc-bench market browse") -> JSON output
  -> terminal tool -> subprocess("yc-bench task accept --task-id X") -> JSON output  
  -> terminal tool -> subprocess("yc-bench sim resume") -> JSON output (advance time)
  -> ... (100-500 turns per run)
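
For illustration, a thin wrapper around one such subprocess call might look like the sketch below. The helper name `run_cli` and the shape of the returned dict are assumptions made here; they are not part of yc-bench or Hermes Agent, and in practice the terminal tool would own this step.

```python
import json
import shlex
import subprocess
import sys


def run_cli(command: str, timeout: float = 60.0) -> dict:
    """Run one CLI command and parse its stdout as JSON (hypothetical helper)."""
    try:
        proc = subprocess.run(
            shlex.split(command),
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired) as exc:
        # Report launch failures and timeouts as data so the agent can react.
        return {"ok": False, "returncode": None, "stderr": str(exc)}
    if proc.returncode != 0:
        return {"ok": False, "returncode": proc.returncode, "stderr": proc.stderr.strip()}
    try:
        return {"ok": True, "data": json.loads(proc.stdout)}
    except json.JSONDecodeError:
        return {"ok": False, "returncode": 0, "stderr": "non-JSON output"}
```

With yc-bench installed, a call such as `run_cli("yc-bench market browse")` would return the parsed JSON under `data`. Returning errors as structured values rather than raising keeps a single bad command from ending a run that may span hundreds of turns.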

What We'd Need

  1. yc-bench installed as a pip dependency (or cloned and installed editable)
  2. Environment class (yc_bench_env.py) extending HermesAgentBaseEnv
  3. Config class with yc-bench-specific fields (presets, seeds, horizon, max_turns)
  4. System prompt adapted from yc-bench's agent/prompt.py — explains the CEO role and CLI commands
  5. Scoring function that reads the SQLite DB after each run to extract results
  6. Evaluation matrix defining which (preset, seed) combinations to run
  7. default.yaml config and run_eval.sh convenience script

Phased Rollout

Phase 1: Eval-only benchmark environment

  • environments/benchmarks/yc_bench/yc_bench_env.py — eval environment
  • environments/benchmarks/yc_bench/default.yaml — default config (medium preset, 3 seeds)
  • environments/benchmarks/yc_bench/run_eval.sh — convenience script
  • Scoring: survival (0.0/1.0) + normalized funds (0.0-1.0 scale relative to initial capital)
  • Aggregate score: 0.5 * survival + 0.5 * normalized_funds (tunable)
  • Per-difficulty and per-seed results in JSONL log
  • Support fast_test preset for quick validation runs (50-turn cap)
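
A minimal sketch of the Phase 1 aggregate score, under stated assumptions: funds are normalized linearly and capped at a hypothetical `cap_multiple` of the $250K seed capital. The cap value of 4x and the function names are placeholders for illustration; the actual normalization is still an open design question.

```python
def normalized_funds(final_funds: float,
                     initial_funds: float = 250_000.0,
                     cap_multiple: float = 4.0) -> float:
    """Map final funds to [0, 1]; cap_multiple x initial capital scores 1.0.

    ASSUMPTION: linear scaling with a 4x cap is a placeholder choice.
    """
    if final_funds <= 0:
        return 0.0
    return min(final_funds / (initial_funds * cap_multiple), 1.0)


def aggregate_score(survived: bool, final_funds: float,
                    w_survival: float = 0.5, w_funds: float = 0.5) -> float:
    """Weighted sum: w_survival * survival + w_funds * normalized_funds."""
    survival = 1.0 if survived else 0.0
    return w_survival * survival + w_funds * normalized_funds(final_funds)


# A surviving run that doubles the seed capital to $500K:
example = aggregate_score(survived=True, final_funds=500_000.0)  # 0.5 + 0.5 * 0.5
```

Keeping the weights as parameters makes the "tunable" part explicit, so changing the 0.5/0.5 split does not require touching the normalization.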

Phase 2: Multi-difficulty evaluation suite

  • Run all presets (easy -> nightmare) as a difficulty ladder
  • Aggregate scoring across difficulty tiers
  • Radar chart output of domain-specific prestige profiles
  • Comparison framework against yc-bench's published baseline results (Claude, GPT-4o, Gemini)
  • wandb integration for tracking runs

Phase 3: RL training environment (stretch)

  • Convert from eval-only to training-capable environment
  • Dense reward signal: per-task completion events, financial health checkpoints
  • Episode termination on bankruptcy -> negative reward trajectory
  • Curriculum learning: start on easy, advance to hard as agent improves
  • This would be a novel use of yc-bench: upstream supports eval only, not RL

Pros & Cons

Pros

  • Fills a real gap: No existing Hermes benchmark tests long-horizon strategic coherence
  • Clean integration: CLI-only interface maps directly to our terminal tool — no custom Python hooks needed
  • Deterministic and reproducible: Same seed = same world, enabling apples-to-apples model comparisons
  • Difficulty ladder: 9 presets from trivial to brutal, useful for progressive evaluation
  • Tests scratchpad/memory usage: Reveals whether agents proactively use persistent memory — directly relevant to Hermes's memory system
  • Compounding decisions: Tests whether agents can reason about cascading consequences, not just one-shot tasks
  • Active development: Maintained by a credible AI research team (ex-HuggingFace, PhD researchers)
  • RL training potential: Could become a long-horizon RL training environment (novel contribution)

Cons / Risks

  • Cost per run: Each evaluation run is 100-500+ LLM turns. At ~$0.01-0.05/turn, a single Hard preset run could cost $5-25 in API calls. A full eval suite (9 presets x 3 seeds = 27 runs) could cost $100-500+.
  • Wall-clock time: Each run takes 30-60+ minutes. Full eval suite = 15-30 hours.
  • External dependency: yc-bench is young (12 GitHub stars); its CLI may change or the project may become unmaintained, and we'd be coupled to that CLI.
  • Narrow domain: Simulates a specific business scenario — may not generalize to broader agent capabilities. However, the underlying skills tested (resource allocation, temporal reasoning, financial planning) are broadly applicable.
  • Additional dependencies: Adds sqlalchemy, matplotlib, litellm to the environments dependency chain (litellm may already be present via tinker-atropos).
  • Small eval matrix: Only ~30 meaningful (preset x seed) combinations vs. TerminalBench2's hundreds of tasks. Statistical power is limited.
  • Scoring subjectivity: Unlike binary pass/fail, the multi-dimensional scoring (survival, funds, prestige) requires design decisions about weighting and normalization.
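
As a quick check on the cost and wall-clock figures above, the bounds implied by the stated ranges can be computed directly. This is back-of-the-envelope arithmetic using only the numbers from the list; the helper name `bounds` is made up for this sketch, and sequential execution is assumed for the time estimate.

```python
def bounds(per_unit: tuple[float, float],
           units: tuple[float, float],
           runs: int = 1) -> tuple[float, float]:
    """Low/high product of two ranges, scaled by the number of runs."""
    low = per_unit[0] * units[0] * runs
    high = per_unit[1] * units[1] * runs
    return low, high


TURNS = (100, 500)            # LLM turns per run
USD_PER_TURN = (0.01, 0.05)   # approximate API cost per turn
MINUTES_PER_RUN = (30, 60)    # wall-clock minutes per run
RUNS = 9 * 3                  # 9 presets x 3 seeds = 27 runs

per_run_usd = bounds(USD_PER_TURN, TURNS)            # about $1 to $25 per run
suite_usd = bounds(USD_PER_TURN, TURNS, runs=RUNS)   # about $27 to $675
suite_hours = tuple(m * RUNS / 60 for m in MINUTES_PER_RUN)  # 13.5 to 27 hours
```

The estimates quoted in the list fall inside or close to these bounds; runs that exceed 500 turns or 60 minutes would push the upper end higher.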

Open Questions

  1. Which presets to include in default eval? Running all 9 presets x 3 seeds is expensive. Suggest fast_test (quick validation) + medium + hard (3 seeds each = 9 runs) as default.
  2. Scoring normalization: How to normalize final funds into a 0-1 score? Options: (a) log-scale relative to initial capital, (b) relative to bot baseline performance, (c) percentile against published results.
  3. Max turns cap: yc-bench's fast_test preset caps at 50 turns. For our eval, should we impose a global turn cap (e.g., 200) to control cost?
  4. Terminal backend: Should each run use Modal sandboxing (like TerminalBench2) for isolation, or is local execution sufficient given that state lives in separate SQLite files?
  5. Context management: yc-bench's own agent loop truncates to last 20 conversation rounds. Should we replicate this truncation strategy in our HermesAgentLoop, or let Hermes's native context compression handle it?
  6. Should we vendor the system prompt? Adapting yc-bench's CEO system prompt into our environment keeps us decoupled from upstream changes, but adds maintenance burden.
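
For question 5, a "keep the last 20 rounds" policy could be sketched as below. This is an assumption about what such truncation might look like, not yc-bench's actual implementation: the message format (dicts with a "role" key), the definition of a round (one assistant message plus the tool results that follow it), and the name `truncate_rounds` are all illustrative.

```python
def truncate_rounds(messages: list[dict], keep_rounds: int = 20) -> list[dict]:
    """Keep system messages plus the last `keep_rounds` assistant-led rounds."""
    if keep_rounds < 1:
        raise ValueError("keep_rounds must be at least 1")
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    # A round starts at each assistant message; following tool results belong to it.
    starts = [i for i, m in enumerate(rest) if m["role"] == "assistant"]
    if len(starts) <= keep_rounds:
        return system + rest
    cut = starts[-keep_rounds]
    # Everything before the cut, including the initial user message, is dropped.
    return system + rest[cut:]
```

Cutting at an assistant message keeps every tool result paired with the call that produced it. Anything the agent wants to retain beyond the window has to live in the scratchpad, which is exactly the behavior the benchmark is probing.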
