Feature: Best-of-N Competitive Evaluation — Judge-Based Selection from Parallel Agent Outputs (inspired by Blackbox AI Chairman)


## Overview

[Blackbox AI](https://www.blackbox.ai/)'s flagship architectural feature is the **"Chairman" pattern**: dispatch the same coding task to N different agents/models in parallel, then have an independent judge LLM evaluate all outputs side-by-side and select the best implementation. This is distinct from consensus voting (#412, where agents vote on each other's work) and single-agent quality gating (#356, where a judge evaluates one output against criteria). The Chairman pattern is specifically about **competitive evaluation** — N implementations of the same task, ranked by an independent evaluator.

Blackbox runs this pattern at scale (30M+ users, Fortune 500 companies) via their [Multi-Agent Execution API](https://docs.blackbox.ai/features/blackbox-cloud-multi-agent), dispatching to Claude, Codex, Gemini, and Blackbox agents simultaneously. Their [CLI](https://github.com/blackboxaicode/cli) also runs this locally with a built-in judge mechanism.

This issue proposes implementing the Best-of-N competitive evaluation pattern natively in Hermes Agent by extending `delegate_task` batch mode with an optional judge step, so that any Hermes user can run N implementations (with different models, prompts, or approaches) and get the best one back without manual comparison.

**Related issues:**
- #344 — Multi-Agent Architecture (provides the workflow DAG infrastructure this builds on)
- #412 — Consensus & Voting Engine (complementary — voting is agents deciding together; this is an external judge deciding for them)
- #356 — Acceptance Criteria & Independent Judge (single-agent quality gate; this extends the judge concept to multi-output comparison)
- #413 — Cross-CLI Agent Orchestration (Best-of-N is especially powerful when different CLI agents compete)
- #475 — Blackbox CLI Skill (the platform that inspired this pattern)

---

## Research Findings

### How Blackbox's Chairman Pattern Works

Blackbox's multi-agent execution follows this pipeline:

1. **Dispatch** — Same prompt sent to 2-5 agents simultaneously (e.g., Claude Sonnet 4.5, GPT-5.2 Codex, Gemini 2.5 Pro, Blackbox Pro)
2. **Parallel execution** — Each agent works independently in its own sandbox, producing a complete implementation
3. **Side-by-side monitoring** — Users can watch execution logs from all agents in real-time
4. **Judge evaluation** — An "AI Judge" (specialized LLM) analyzes all outputs for:
   - Code quality and correctness
   - Efficiency and performance characteristics
   - Error handling and edge case coverage
   - Adherence to the original prompt
   - Overall implementation approach
5. **Winner selection** — The judge picks the best implementation, which becomes the final PR/result
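
The pipeline above can be sketched in a few lines of Python. This is a toy illustration rather than Blackbox's implementation: `best_of_n`, the two stand-in agents, and `toy_judge` are hypothetical names, and the "judge" is a trivial heuristic standing in for an LLM call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

Agent = Callable[[str], str]             # prompt -> implementation text
Judge = Callable[[str, List[str]], int]  # (prompt, outputs) -> index of the winner


def best_of_n(prompt: str, agents: Dict[str, Agent], judge: Judge) -> Tuple[str, str]:
    """Send the same prompt to every agent in parallel, then let the judge pick one."""
    if not agents:
        raise ValueError("need at least one agent")
    names = list(agents)
    with ThreadPoolExecutor(max_workers=len(names)) as pool:
        # Dispatch + parallel execution; map() returns results in input order.
        outputs = list(pool.map(lambda name: agents[name](prompt), names))
    # Judge evaluation + winner selection.
    winner = judge(prompt, outputs)
    return names[winner], outputs[winner]


# Stand-in agents and judge (assumptions) so the sketch runs without API access.
def terse_agent(prompt: str) -> str:
    return "def solve():\n    return 1"


def careful_agent(prompt: str) -> str:
    return "def solve():\n    try:\n        return 1\n    except ValueError:\n        return 0"


def toy_judge(prompt: str, outputs: List[str]) -> int:
    # Toy rubric: prefer the output with the most error-handling branches.
    scores = [out.count("except") for out in outputs]
    return scores.index(max(scores))


name, output = best_of_n(
    "add retry logic",
    {"terse": terse_agent, "careful": careful_agent},
    toy_judge,
)
```

The judge is passed in as a separate callable, so none of the competing agents evaluates its own output.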

**API Example:**
```bash
curl -X POST https://cloud.blackbox.ai/api/tasks \
  -H "Authorization: Bearer bb_xxxxx" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Implement JWT authentication with refresh tokens",
    "repoUrl": "https://github.com/org/repo.git",
    "selectedAgents": [
      {"agent": "claude", "model": "blackboxai/anthropic/claude-sonnet-4.5"},
      {"agent": "codex", "model": "gpt-5.2-codex"},
      {"agent": "gemini", "model": "gemini-2.5-pro"}
    ]
  }'
```

### Key Design Decisions in Blackbox's Approach

1. **Independent judge, not self-evaluation** — The judge is a separate LLM call, not one of the competing agents. This prevents bias.
2. **Same prompt, different models** — The variation comes from model diversity, not prompt variation. Different models bring different architectural intuitions.
3. **Winner-take-all** — One implementation is selected; others are discarded. This is simpler than trying to merge the best parts from each.
4. **Credit cost scales linearly** — Each agent execution costs credits independently. Users trade cost for quality.

### Why This Matters: Best-of-N is a Known Quality Multiplier

The Best-of-N pattern is well-established in ML as a simple but effective quality improvement technique:

- **LLM research** — Best-of-N sampling with a reward model consistently outperforms single-shot generation (Stiennon et al., 2020; Nakano et al., 2022)
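- **Code generation** — AlphaCode samples a very large number of candidate programs per problem and filters them by executing the problem's example tests (Li et al., 2022); the same generate-then-filter idea applies at small N (3-5), though the size of the gain depends on the task and on how good the evaluator is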
- **Agent reliability** — Multiple independent attempts at a task followed by evaluation is more robust than a single attempt with self-correction

The key insight: **generating N independent solutions and picking the best one is often more effective than iterating on a single solution**, especially when the task has multiple valid approaches.
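
A rough way to see why independent attempts help is an idealized back-of-the-envelope model. It assumes each attempt succeeds independently with the same probability `p` and that the judge always recognizes a correct solution; neither assumption holds exactly in practice, so treat the result as an upper bound, not a measurement.

```python
def p_at_least_one_success(p: float, n: int) -> float:
    """Probability that at least one of n independent attempts succeeds.

    Idealized: attempts are independent with equal success probability p,
    and the judge never misses a correct candidate.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 - (1.0 - p) ** n


single = p_at_least_one_success(0.5, 1)  # 0.5
triple = p_at_least_one_success(0.5, 3)  # 0.875
```

Correlated failures (all models making the same mistake) and an imperfect judge both reduce the real benefit below this figure.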

---

## Current State in Hermes Agent

**`delegate_task` batch mode (today):**
```python
delegate_task(tasks=[
    {"goal": "Implement feature using approach A"},
    {"goal": "Implement feature using approach B"},
    {"goal": "Implement feature using approach C"},
])
```
- Returns all 3 results to the parent agent
- Parent must manually read, compare, and decide
- No structured evaluation criteria
- No judge step — just "here are 3 summaries, you figure it out"

**`mixture_of_agents`:**
- Queries 4 frontier models in parallel, aggregator **synthesizes** (merges perspectives)
- This is synthesis, not competitive selection — fundamentally different goal
- One-shot reasoning, not multi-step agent execution with tool use

**Gap:** No mechanism for "run N competing implementations → evaluate with criteria → return the winner." The parent agent currently does this manually, which means:
- The evaluation happens in the parent's context window (expensive, takes up space)
- No structured rubric — the parent applies ad-hoc judgment
- N full implementation summaries must fit in context simultaneously
- Can't apply domain-specific evaluation (run tests, check types, measure performance)

---

## Implementation Plan

### Skill vs. Tool Classification

This should be a **codebase change** extending `delegate_tool.py` because:
- It requires deterministic evaluation logic (scoring rubric, winner selection) that must execute precisely
- It extends the core delegation infrastructure, not an external CLI wrapper
- It involves multi-step orchestration (fan-out → collect → judge → return) that can't be expressed as skill instructions
- The judge step needs access to all agent outputs simultaneously with structured comparison

### What We'd Need

1. **`judge` parameter on `delegate_task`** — When `True` or a criteria string, adds a judge evaluation step after batch completion
2. **Judge prompt template** — Structured evaluation rubric that the judge LLM uses to compare outputs
3. **Evaluation dimensions** — Configurable criteria (correctness, efficiency, completeness, code quality, etc.)
4. **Model diversity support** — Allow each batch task to specify a different `model` override so competing agents use different LLMs
5. **Winner extraction** — Parse the judge's evaluation to identify the winning output and return only that one (with evaluation reasoning)
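
As one possible shape for the judge prompt template, here is a minimal sketch. The function name matches the `build_judge_prompt` helper used elsewhere in this proposal, but its signature, the `DEFAULT_CRITERIA` values, and the JSON reply format are assumptions made for illustration.

```python
from typing import Any, Dict, List, Sequence

# Assumed defaults, taken from the evaluation dimensions listed above.
DEFAULT_CRITERIA = ("correctness", "efficiency", "completeness", "code quality")


def build_judge_prompt(
    task_description: str,
    implementations: List[Dict[str, Any]],
    criteria: Sequence[str] = DEFAULT_CRITERIA,
) -> str:
    """Render all candidate outputs into one side-by-side comparison prompt."""
    lines = [
        f"You are evaluating {len(implementations)} implementations of the same task.",
        f"Task: {task_description}",
        "Compare them on: " + ", ".join(criteria) + ".",
        "",
    ]
    for idx, impl in enumerate(implementations):
        lines.append(f"### Candidate {idx} (model: {impl.get('model', 'default')})")
        lines.append(str(impl["output"]))
        lines.append("")
    # Asking for machine-readable output keeps winner extraction simple.
    lines.append(
        'Reply with JSON only: {"winner": <candidate index>, '
        '"reasoning": "<why>", "scores": {"<index>": <1-10>}}'
    )
    return "\n".join(lines)


prompt = build_judge_prompt(
    "Implement JWT auth for Express API",
    [
        {"model": "model-a", "output": "summary of implementation A"},
        {"model": "model-b", "output": "summary of implementation B"},
    ],
    criteria=["security", "test coverage"],
)
```

Whether the judge should see model names at all is debatable; omitting them from the candidate headers would be a simple way to reduce brand bias.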

### Phased Rollout

**Phase 1: Simple Best-of-N with Judge (Standalone)**

Add an optional `judge` parameter to `delegate_task` batch mode:

```python
delegate_task(
    tasks=[
        {"goal": "Implement JWT auth for Express API", "model": "claude-sonnet-4.5"},
        {"goal": "Implement JWT auth for Express API", "model": "gpt-5.2"},
        {"goal": "Implement JWT auth for Express API", "model": "gemini-2.5-pro"},
    ],
    judge=True  # or judge="Evaluate for security, test coverage, and code clarity"
)
```

Implementation:
- After all N tasks complete, construct a judge prompt with all N outputs
- Judge prompt: "You are evaluating N implementations of the same task. Compare them on [criteria]. Select the best one and explain why."
- Use the parent's configured model (or a configurable judge model) for evaluation
- Return: winning output's summary + judge's reasoning + scores for all candidates
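
Because `judge` may arrive as `True`, a criteria string, or (in Phase 2) a dict, the implementation would likely normalize it once up front. A minimal sketch, assuming a hypothetical `normalize_judge_config` helper and the config keys `criteria`, `model`, and `keep_all`:

```python
from typing import Any, Dict, Optional, Union

JudgeParam = Union[bool, str, Dict[str, Any], None]

# Assumed defaults; the real values would be decided during implementation.
DEFAULT_CRITERIA = ["correctness", "efficiency", "completeness", "code quality"]


def normalize_judge_config(judge: JudgeParam) -> Optional[Dict[str, Any]]:
    """Return a fully populated config dict, or None if no judge step was requested."""
    if judge is None or judge is False:
        return None
    # Explicit defaults keep the dict non-empty, so `if judge_config:` stays truthy.
    config: Dict[str, Any] = {
        "criteria": list(DEFAULT_CRITERIA),
        "model": None,       # None means "use the parent's configured model"
        "keep_all": False,
    }
    if judge is True:
        return config
    if isinstance(judge, str):
        config["criteria"] = [judge]  # free-form criteria string
        return config
    if isinstance(judge, dict):
        config.update(judge)
        return config
    raise TypeError(f"unsupported judge value: {judge!r}")


from_flag = normalize_judge_config(True)
from_text = normalize_judge_config("Evaluate for security and code clarity")
from_dict = normalize_judge_config({"model": "some-judge-model"})
```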

**Phase 2: Structured Evaluation & Model Diversity**
- Rubric-based scoring: judge rates each output on multiple dimensions (1-10)
- Configurable judge model: `judge={"model": "claude-opus-4.5", "criteria": [...]}`
- Integration with #413 (Cross-CLI Orchestration): compete Blackbox vs Claude Code vs Codex on the same task
- Evaluation augmentation: optionally run tests, type checks, or linters on each output before judging
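
Rubric-based scoring needs some rule for combining per-dimension ratings into a ranking. One simple option is a weighted average, sketched below; the weights, dimension names, and helper names are illustrative assumptions and not part of the proposal.

```python
from typing import Mapping


def weighted_total(scores: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Weighted average of per-dimension scores (each score on a 1-10 scale)."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[dim] * w for dim, w in weights.items()) / total_weight


def pick_winner(
    candidates: Mapping[str, Mapping[str, float]],
    weights: Mapping[str, float],
) -> str:
    """Return the name of the candidate with the highest weighted total."""
    return max(candidates, key=lambda name: weighted_total(candidates[name], weights))


weights = {"correctness": 5, "security": 3, "clarity": 2}
candidates = {
    "candidate_a": {"correctness": 8, "security": 6, "clarity": 9},  # total 7.6
    "candidate_b": {"correctness": 9, "security": 7, "clarity": 6},  # total 7.8
}
winner = pick_winner(candidates, weights)
```

Keeping the aggregation in deterministic code means the judge LLM only has to produce the per-dimension ratings, not the final ranking.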

**Phase 3: Adaptive Best-of-N**
- Dynamic N: start with 2, add more agents if the first round produces close scores
- Cost-aware: estimate token cost per agent and let users set a budget
- History-based: track which models win most often for which task types, use for future routing (#157)
- Integration with #344 workflow DAG: Best-of-N as a step type in multi-agent workflows
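
The dynamic-N idea could look roughly like the loop below. This is a sketch under simplifying assumptions: `run_candidate(i)` is a hypothetical callback that runs agent `i` and returns its judge score, and "close" is defined as the top two scores differing by less than `margin`.

```python
from typing import Callable, List, Tuple


def adaptive_best_of_n(
    run_candidate: Callable[[int], float],
    start_n: int = 2,
    max_n: int = 5,
    margin: float = 1.0,
) -> Tuple[int, List[float]]:
    """Add candidates one at a time until there is a clear leader or max_n is reached."""
    if start_n < 2 or max_n < start_n:
        raise ValueError("need 2 <= start_n <= max_n")
    scores = [run_candidate(i) for i in range(start_n)]
    while len(scores) < max_n:
        top, runner_up = sorted(scores, reverse=True)[:2]
        if top - runner_up >= margin:
            break  # clear leader, stop spending
        scores.append(run_candidate(len(scores)))
    winner = max(range(len(scores)), key=scores.__getitem__)
    return winner, scores


# Canned scores stand in for real agent runs plus judge evaluation.
fake_scores = [7.0, 7.5, 9.0, 6.0, 8.0]
winner, scores = adaptive_best_of_n(lambda i: fake_scores[i])
```

In this example the first two candidates score within 0.5 of each other, so a third is run; it leads by 1.5 and the loop stops without spending on candidates four and five.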

### Minimal Implementation Sketch

```python
# In delegate_tool.py, after batch results are collected.
# Assumes judge_config has already been normalized to a non-empty dict
# (a bare judge=True or criteria string would break the .get() calls below)
# and that common_goal holds the goal shared by all tasks.
# build_judge_prompt, run_judge, and parse_judge_verdict are helpers this change would add.

if judge_config:
    judge_prompt = build_judge_prompt(
        task_description=common_goal,
        implementations=[
            {"agent": i, "model": t.get("model", "default"), "output": result}
            for i, (t, result) in enumerate(zip(tasks, results))
        ],
        criteria=judge_config.get("criteria", DEFAULT_CRITERIA),
    )

    # Run judge as a lightweight subagent (no tools needed, just reasoning)
    judge_result = run_judge(judge_prompt, model=judge_config.get("model"))

    # Extract winner index and reasoning
    winner = parse_judge_verdict(judge_result)
    return {
        "winner": results[winner.index],
        "reasoning": winner.reasoning,
        "scores": winner.scores,  # Per-candidate scores
        "all_results": results if judge_config.get("keep_all") else None,
    }
```
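
Winner extraction is the step most likely to fail on malformed judge output, so it is worth sketching separately. The version below assumes the judge was asked to reply with a JSON object containing `winner`, `reasoning`, and `scores`; that format, the `Verdict` dataclass, and the extra `n_candidates` argument are assumptions for illustration.

```python
import json
import re
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Verdict:
    index: int
    reasoning: str
    scores: Dict[int, float] = field(default_factory=dict)


def parse_judge_verdict(judge_result: str, n_candidates: int) -> Verdict:
    """Pull the JSON verdict out of the judge's reply and validate the winner index."""
    # Greedy match from the first "{" to the last "}" tolerates prose around the JSON.
    match = re.search(r"\{.*\}", judge_result, flags=re.DOTALL)
    if match is None:
        raise ValueError("judge reply contains no JSON object")
    data = json.loads(match.group(0))
    index = int(data["winner"])
    if not 0 <= index < n_candidates:
        raise ValueError(f"winner index {index} is out of range")
    scores = {int(k): float(v) for k, v in data.get("scores", {}).items()}
    return Verdict(index=index, reasoning=str(data.get("reasoning", "")), scores=scores)


reply = (
    'Verdict: {"winner": 1, "reasoning": "handles token expiry", '
    '"scores": {"0": 6, "1": 9}} End of evaluation.'
)
verdict = parse_judge_verdict(reply, n_candidates=2)
```

Raising on an unparseable or out-of-range verdict lets the caller decide whether to retry the judge or fall back to returning all results.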

---

## Pros & Cons

### Pros
- **Proven quality multiplier** — Best-of-N is one of the simplest and most reliable ways to improve LLM output quality, backed by extensive research
- **Model diversity leverage** — Different LLMs have different strengths; competitive evaluation exploits this without the user needing to know which model is best for what
- **Minimal infrastructure** — Builds on existing `delegate_task` batch mode; the new code is primarily the judge prompt and winner extraction logic
- **Composable** — Works with native Hermes subagents today, extends to external CLI agents when #413 lands
- **Transparent** — Judge reasoning explains why one implementation won, which is valuable for learning and debugging
- **Cost-quality tradeoff knob** — Users choose N based on their cost tolerance; N=2 is cheap, N=5 is thorough

### Cons / Risks
- **Cost multiplication** — Running N agents costs N times more tokens/time. For expensive models, this adds up.
- **Judge reliability** — The judge LLM may have its own biases (e.g., preferring verbose code). Calibrating the judge is non-trivial.
- **Diminishing returns** — For simple tasks, N=1 with a good model is usually sufficient. Best-of-N shines on ambiguous/complex tasks.
- **Latency** — Total time = max(all agent times) + judge time. For time-sensitive workflows, this may be too slow.
- **Context pressure** — The judge needs to see N full outputs simultaneously. For large implementations, this could exceed context limits. May need summary-based judging for large outputs.
- **"Same task" assumption** — The pattern works best when all agents receive identical prompts. If prompts vary, the comparison becomes less fair.
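
The cost and latency points above reduce to simple arithmetic: agents run in parallel, so wall-clock time is set by the slowest agent plus the judge, while token cost is the sum over all agents plus the judge. The numbers below are made up for illustration.

```python
from typing import Sequence, Tuple


def best_of_n_estimate(
    agent_seconds: Sequence[float],
    agent_tokens: Sequence[int],
    judge_seconds: float,
    judge_tokens: int,
) -> Tuple[float, int]:
    """Return (wall-clock seconds, total tokens) for one Best-of-N run."""
    latency = max(agent_seconds) + judge_seconds  # parallel agents, then the judge
    tokens = sum(agent_tokens) + judge_tokens     # every agent is paid for
    return latency, tokens


latency, tokens = best_of_n_estimate(
    agent_seconds=[40, 65, 50],
    agent_tokens=[12_000, 15_000, 9_000],
    judge_seconds=8,
    judge_tokens=4_000,
)
# latency == 73, tokens == 40_000
```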

---

## Open Questions

- Should the judge use the parent's model or a dedicated judge model? Using a different (potentially stronger) model for judging avoids self-preference bias.
- How do we handle large outputs that exceed the judge's context window? Options: summary-based judging, chunked evaluation, or execution-based evaluation (run tests instead of reading code).
- Should we support "top-K" in addition to "winner-take-all"? Some workflows might want the top 2-3 candidates, not just the best one.
- How does this interact with #412 (Consensus)? Could agents vote first (cheap), then a judge evaluates the top candidates (expensive)?
- Should the evaluation include execution-based signals (do the implementations actually work)? This would require running tests, which adds complexity but significantly improves evaluation quality.
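
For the execution-based option, one lightweight approach would be to run the same check command in each candidate's working directory and pass the pass/fail result to the judge. The sketch below assumes each candidate's output lives in its own directory; `run_checks` and the `check.py` files are hypothetical stand-ins for a real test suite.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Dict, List


def run_checks(
    candidate_dirs: Dict[str, Path],
    command: List[str],
    timeout: float = 60.0,
) -> Dict[str, bool]:
    """Run `command` inside each candidate directory; True means exit code 0."""
    results: Dict[str, bool] = {}
    for name, workdir in candidate_dirs.items():
        try:
            proc = subprocess.run(
                command, cwd=workdir, capture_output=True, timeout=timeout
            )
            results[name] = proc.returncode == 0
        except subprocess.TimeoutExpired:
            results[name] = False  # a hung test run counts as a failure
    return results


# Demo with two throwaway directories whose "tests" exit with 0 and 1.
with tempfile.TemporaryDirectory() as root:
    dirs: Dict[str, Path] = {}
    for name, exit_code in (("passing", 0), ("failing", 1)):
        workdir = Path(root) / name
        workdir.mkdir()
        (workdir / "check.py").write_text(f"import sys\nsys.exit({exit_code})\n")
        dirs[name] = workdir
    check_results = run_checks(dirs, [sys.executable, "check.py"])
```

A real implementation would also need sandboxing, since the command executes code written by the competing agents.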

---

## References

- [Blackbox AI Multi-Agent Execution](https://docs.blackbox.ai/features/blackbox-cloud-multi-agent) — Production implementation of the Chairman/Judge pattern
- [Blackbox AI Multi-Agent Task API](https://docs.blackbox.ai/api-reference/multi-agent-task) — API reference for dispatching to 2-5 agents with judge selection
- [Blackbox CLI (GitHub)](https://github.com/blackboxaicode/cli) — Open-source CLI with built-in judge mechanism
- [Blackbox AI Review 2026 (Banani)](https://www.banani.co/blog/blackbox-ai-review) — Independent review covering Chairman workflow
- Stiennon et al., "Learning to summarize from human feedback" (2020) — Best-of-N with reward models
- Li et al., "Competition-level code generation with AlphaCode" (2022) — Massive Best-of-N for code
- Related Hermes issues: #344, #412, #356, #413, #475