Feature: Best-of-N Competitive Evaluation — Judge-Based Selection from Parallel Agent Outputs (inspired by Blackbox AI Chairman) #479

@teknium1

Overview

Blackbox AI's flagship architectural feature is the "Chairman" pattern: dispatch the same coding task to N different agents/models in parallel, then have an independent judge LLM evaluate all outputs side-by-side and select the best implementation. This is distinct from consensus voting (#412, where agents vote on each other's work) and single-agent quality gating (#356, where a judge evaluates one output against criteria). The Chairman pattern is specifically about competitive evaluation — N implementations of the same task, ranked by an independent evaluator.

Blackbox runs this pattern at scale (30M+ users, Fortune 500 companies) via their Multi-Agent Execution API, dispatching to Claude, Codex, Gemini, and Blackbox agents simultaneously. Their CLI also runs this locally with a built-in judge mechanism.

This issue proposes implementing the Best-of-N competitive evaluation pattern natively in Hermes Agent by extending `delegate_task` batch mode with an optional judge step — enabling any Hermes user to run N implementations (with different models, prompts, or approaches) and get the best one back without manual comparison.

Related issues: #344, #412, #356, #413, #475

Research Findings

How Blackbox's Chairman Pattern Works

Blackbox's multi-agent execution follows this pipeline:

  1. Dispatch — Same prompt sent to 2-5 agents simultaneously (e.g., Claude Sonnet 4.5, GPT-5.2 Codex, Gemini 2.5 Pro, Blackbox Pro)
  2. Parallel execution — Each agent works independently in its own sandbox, producing a complete implementation
  3. Side-by-side monitoring — Users can watch execution logs from all agents in real-time
  4. Judge evaluation — An "AI Judge" (specialized LLM) analyzes all outputs for:
    • Code quality and correctness
    • Efficiency and performance characteristics
    • Error handling and edge case coverage
    • Adherence to the original prompt
    • Overall implementation approach
  5. Winner selection — The judge picks the best implementation, which becomes the final PR/result
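
The five steps above can be sketched in a few lines of Python. This is a minimal illustration, not Blackbox's implementation: `run_best_of_n`, the toy agents, and the toy judge are all hypothetical stand-ins. A real agent would call a model and use tools in its own sandbox, and a real judge would be a separate LLM call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# Hypothetical types: an agent maps a prompt to an output,
# a judge maps (prompt, outputs) to the index of the winning output.
Agent = Callable[[str], str]
Judge = Callable[[str, list[str]], int]


def run_best_of_n(prompt: str, agents: dict[str, Agent], judge: Judge) -> dict:
    """Fan out one prompt to every agent, then let an independent judge pick a winner."""
    names = list(agents)
    # Steps 1-2: dispatch the same prompt; each agent runs independently.
    with ThreadPoolExecutor(max_workers=len(names)) as pool:
        outputs = list(pool.map(lambda name: agents[name](prompt), names))
    # Steps 4-5: the judge sees all outputs side by side and returns one index.
    winner = judge(prompt, outputs)
    return {"winner": names[winner], "output": outputs[winner], "candidates": len(outputs)}


# Toy agents and a toy judge, only to make the sketch executable.
toy_agents = {
    "agent_a": lambda p: f"def solve(): pass  # {p}",
    "agent_b": lambda p: (
        "def solve():\n    try:\n        return 1\n"
        f"    except Exception:\n        raise  # {p}"
    ),
}


def toy_judge(prompt: str, outputs: list[str]) -> int:
    # Placeholder heuristic: prefer the output that contains error handling.
    scores = [("try:" in out) + ("except" in out) for out in outputs]
    return scores.index(max(scores))


result = run_best_of_n("Implement JWT authentication", toy_agents, toy_judge)
print(result["winner"])
```

The judge is passed in as a separate callable rather than being one of the agents, which mirrors the "independent judge" property of the pattern.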

API Example:
```bash
# Line continuations are required so the headers and body belong to one curl command
curl -X POST https://cloud.blackbox.ai/api/tasks \
  -H "Authorization: Bearer bb_xxxxx" \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Implement JWT authentication with refresh tokens",
    "repoUrl": "https://github.com/org/repo.git",
    "selectedAgents": [
      {"agent": "claude", "model": "blackboxai/anthropic/claude-sonnet-4.5"},
      {"agent": "codex", "model": "gpt-5.2-codex"},
      {"agent": "gemini", "model": "gemini-2.5-pro"}
    ]
  }'
```

Key Design Decisions in Blackbox's Approach

  1. Independent judge, not self-evaluation — The judge is a separate LLM call, not one of the competing agents. This prevents bias.
  2. Same prompt, different models — The variation comes from model diversity, not prompt variation. Different models bring different architectural intuitions.
  3. Winner-take-all — One implementation is selected; others are discarded. This is simpler than trying to merge the best parts from each.
  4. Credit cost scales linearly — Each agent execution costs credits independently. Users trade cost for quality.

Why This Matters: Best-of-N is a Known Quality Multiplier

The Best-of-N pattern is well-established in ML as a simple but effective quality improvement technique:

  • LLM research — Best-of-N sampling with a reward model consistently outperforms single-shot generation (Stiennon et al., 2020; Nakano et al., 2022)
  • Code generation — AlphaCode generates thousands of candidates and filters with execution-based evaluation; even small N (3-5) gives significant quality gains
  • Agent reliability — Multiple independent attempts at a task followed by evaluation is more robust than a single attempt with self-correction

The key insight: generating N independent solutions and picking the best one is often more effective than iterating on a single solution, especially when the task has multiple valid approaches.
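
A back-of-the-envelope way to see why this works, under two idealized assumptions that are mine rather than the cited papers': each attempt succeeds independently with probability `p`, and the judge always recognizes a correct solution when one exists. In that case the chance that at least one of N attempts is correct is `1 - (1 - p)^N`. The helper name below is hypothetical.

```python
def best_of_n_success(p: float, n: int) -> float:
    """Probability that at least one of n independent attempts succeeds.

    Idealized: assumes independent attempts and a judge that never misses
    a correct candidate, so this is an upper bound on real-world gains.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 - (1.0 - p) ** n


for n in (1, 2, 3, 5):
    print(n, round(best_of_n_success(0.4, n), 3))
```

With `p = 0.4`, this gives 0.64 for N=2 and roughly 0.92 for N=5. Real attempts from similar models are correlated and real judges make mistakes, so actual gains will be smaller.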


Current State in Hermes Agent

`delegate_task` batch mode (today):
```python
delegate_task(tasks=[
    {"goal": "Implement feature using approach A"},
    {"goal": "Implement feature using approach B"},
    {"goal": "Implement feature using approach C"},
])
```

  • Returns all 3 results to the parent agent
  • Parent must manually read, compare, and decide
  • No structured evaluation criteria
  • No judge step — just "here are 3 summaries, you figure it out"

`mixture_of_agents`:

  • Queries 4 frontier models in parallel, aggregator synthesizes (merges perspectives)
  • This is synthesis, not competitive selection — fundamentally different goal
  • One-shot reasoning, not multi-step agent execution with tool use

Gap: No mechanism for "run N competing implementations → evaluate with criteria → return the winner." The parent agent currently does this manually, which means:

  • The evaluation happens in the parent's context window (expensive, takes up space)
  • No structured rubric — the parent applies ad-hoc judgment
  • N full implementation summaries must fit in context simultaneously
  • Can't apply domain-specific evaluation (run tests, check types, measure performance)
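
To make the last gap concrete, here is a toy sketch of what an execution-based signal might look like: each candidate is scored by how many test cases it passes, and candidates are ranked by that count. Everything here is hypothetical (`passes_tests`, the `solve` entry point, the test format), and `exec()` on untrusted model output is unsafe outside a sandbox; a real version would run inside the subagent's isolated environment.

```python
def passes_tests(source: str, tests: list[tuple[tuple, object]], func_name: str = "solve") -> int:
    """Count how many (args, expected) cases a candidate passes.

    Toy only: exec() on untrusted code is unsafe without sandboxing.
    """
    namespace: dict = {}
    try:
        exec(source, namespace)  # load the candidate's definitions
        func = namespace[func_name]
    except Exception:
        return 0  # candidate does not even load
    passed = 0
    for args, expected in tests:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crashing case counts as a failure
    return passed


# Two hypothetical candidate implementations of the same task.
candidates = {
    "a": "def solve(x, y):\n    return x + y\n",
    "b": "def solve(x, y):\n    return x - y\n",
}
tests = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]

# Rank candidates by number of passing tests, best first.
ranking = sorted(candidates, key=lambda name: passes_tests(candidates[name], tests), reverse=True)
print(ranking)
```

A signal like this could be handed to the judge alongside the code, or used on its own as a cheap first filter.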

Implementation Plan

Skill vs. Tool Classification

This should be a codebase change extending `delegate_tool.py` because:

  • It requires deterministic evaluation logic (scoring rubric, winner selection) that must execute precisely
  • It extends the core delegation infrastructure, not an external CLI wrapper
  • It involves multi-step orchestration (fan-out → collect → judge → return) that can't be expressed as skill instructions
  • The judge step needs access to all agent outputs simultaneously with structured comparison

What We'd Need

  1. `judge` parameter on `delegate_task` — When `True` or a criteria string, adds a judge evaluation step after batch completion
  2. Judge prompt template — Structured evaluation rubric that the judge LLM uses to compare outputs
  3. Evaluation dimensions — Configurable criteria (correctness, efficiency, completeness, code quality, etc.)
  4. Model diversity support — Allow each batch task to specify a different `model` override so competing agents use different LLMs
  5. Winner extraction — Parse the judge's evaluation to identify the winning output and return only that one (with evaluation reasoning)
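
Items 2, 3, and 5 could look roughly like the following. This is a sketch under assumptions: the prompt wording, the JSON reply format, and the default criteria are illustrative choices, not an existing Hermes API, and the function signatures are simplified for a standalone example.

```python
import json
import re

# Assumed default evaluation dimensions; configurable per call.
DEFAULT_CRITERIA = ("correctness", "efficiency", "completeness", "code quality")


def build_judge_prompt(task: str, outputs: list[str], criteria=DEFAULT_CRITERIA) -> str:
    """Render all candidate outputs into one side-by-side evaluation prompt."""
    sections = [f"### Candidate {i}\n{out}" for i, out in enumerate(outputs)]
    return (
        f"You are evaluating {len(outputs)} implementations of the same task.\n"
        f"Task: {task}\n"
        f"Compare them on: {', '.join(criteria)}.\n\n"
        + "\n\n".join(sections)
        + '\n\nReply with JSON only: {"winner": <candidate index>, '
        '"scores": [<1-10 per candidate>], "reasoning": "<why>"}'
    )


def parse_judge_verdict(reply: str, n_candidates: int) -> dict:
    """Extract and validate the JSON verdict from the judge's reply."""
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if match is None:
        raise ValueError("judge reply contained no JSON object")
    verdict = json.loads(match.group(0))
    winner = verdict.get("winner")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(winner, bool) or not isinstance(winner, int):
        raise ValueError(f"winner must be an integer index, got {winner!r}")
    if not 0 <= winner < n_candidates:
        raise ValueError(f"winner index out of range: {winner}")
    return verdict


prompt = build_judge_prompt("Implement JWT auth", ["code A", "code B"])
reply = 'Verdict:\n{"winner": 1, "scores": [6, 8], "reasoning": "B validates token expiry."}'
verdict = parse_judge_verdict(reply, n_candidates=2)
print(verdict["winner"], verdict["scores"])
```

Validating the winner index before using it matters because the judge is an LLM and may return malformed or out-of-range output; failing loudly lets the caller retry or fall back to returning all results.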

Phased Rollout

Phase 1: Simple Best-of-N with Judge (Standalone)

Add an optional `judge` parameter to `delegate_task` batch mode:

```python
delegate_task(
    tasks=[
        {"goal": "Implement JWT auth for Express API", "model": "claude-sonnet-4.5"},
        {"goal": "Implement JWT auth for Express API", "model": "gpt-5.2"},
        {"goal": "Implement JWT auth for Express API", "model": "gemini-2.5-pro"},
    ],
    judge=True,  # or judge="Evaluate for security, test coverage, and code clarity"
)
```

Implementation:
- After all N tasks complete, construct a judge prompt with all N outputs
- Judge prompt: "You are evaluating N implementations of the same task. Compare them on [criteria]. Select the best one and explain why."
- Use the parent's configured model (or a configurable judge model) for evaluation
- Return: winning output's summary + judge's reasoning + scores for all candidates

**Phase 2: Structured Evaluation & Model Diversity**
- Rubric-based scoring: judge rates each output on multiple dimensions (1-10)
- Configurable judge model: `judge={"model": "claude-opus-4.5", "criteria": [...]}`
- Integration with #413 (Cross-CLI Orchestration): compete Blackbox vs Claude Code vs Codex on the same task
- Evaluation augmentation: optionally run tests, type checks, or linters on each output before judging

**Phase 3: Adaptive Best-of-N**
- Dynamic N: start with 2, add more agents if the first round produces close scores
- Cost-aware: estimate token cost per agent and let users set a budget
- History-based: track which models win most often for which task types, use for future routing (#157)
- Integration with #344 workflow DAG: Best-of-N as a step type in multi-agent workflows

### Minimal Implementation Sketch

```python
# In delegate_tool.py, after batch results are collected.
# Assumes judge_config has already been normalized to a dict;
# judge=True or a criteria string would need converting first.

if judge_config:
    judge_prompt = build_judge_prompt(
        task_description=common_goal,
        implementations=[
            {"agent": i, "model": t.get("model", "default"), "output": result}
            for i, (t, result) in enumerate(zip(tasks, results))
        ],
        criteria=judge_config.get("criteria", DEFAULT_CRITERIA),
    )

    # Run judge as a lightweight subagent (no tools needed, just reasoning)
    judge_result = run_judge(judge_prompt, model=judge_config.get("model"))

    # Extract winner index and reasoning
    winner = parse_judge_verdict(judge_result)
    return {
        "winner": results[winner.index],
        "reasoning": winner.reasoning,
        "scores": winner.scores,  # Per-candidate scores
        "all_results": results if judge_config.get("keep_all") else None,
    }
```

---

## Pros & Cons

### Pros
- **Proven quality multiplier** — Best-of-N is one of the simplest and most reliable ways to improve LLM output quality, backed by extensive research
- **Model diversity leverage** — Different LLMs have different strengths; competitive evaluation exploits this without the user needing to know which model is best for what
- **Minimal infrastructure**: Builds on existing `delegate_task` batch mode; the new code is primarily the judge prompt and winner extraction logic
- **Composable** — Works with native Hermes subagents today, extends to external CLI agents when #413 lands
- **Transparent** — Judge reasoning explains why one implementation won, which is valuable for learning and debugging
- **Cost-quality tradeoff knob** — Users choose N based on their cost tolerance; N=2 is cheap, N=5 is thorough

### Cons / Risks
- **Cost multiplication** — Running N agents costs N times more tokens/time. For expensive models, this adds up.
- **Judge reliability** — The judge LLM may have its own biases (e.g., preferring verbose code). Calibrating the judge is non-trivial.
- **Diminishing returns** — For simple tasks, N=1 with a good model is usually sufficient. Best-of-N shines on ambiguous/complex tasks.
- **Latency** — Total time = max(all agent times) + judge time. For time-sensitive workflows, this may be too slow.
- **Context pressure** — The judge needs to see N full outputs simultaneously. For large implementations, this could exceed context limits. May need summary-based judging for large outputs.
- **"Same task" assumption** — The pattern works best when all agents receive identical prompts. If prompts vary, the comparison becomes less fair.

---

## Open Questions

- Should the judge use the parent's model or a dedicated judge model? Using a different (potentially stronger) model for judging avoids self-preference bias.
- How do we handle large outputs that exceed the judge's context window? Options: summary-based judging, chunked evaluation, or execution-based evaluation (run tests instead of reading code).
- Should we support "top-K" in addition to "winner-take-all"? Some workflows might want the top 2-3 candidates, not just the best one.
- How does this interact with #412 (Consensus)? Could agents vote first (cheap), then a judge evaluates the top candidates (expensive)?
- Should the evaluation include execution-based signals (do the implementations actually work)? This would require running tests, which adds complexity but significantly improves evaluation quality.

---

## References

- [Blackbox AI Multi-Agent Execution](https://docs.blackbox.ai/features/blackbox-cloud-multi-agent) — Production implementation of the Chairman/Judge pattern
- [Blackbox AI Multi-Agent Task API](https://docs.blackbox.ai/api-reference/multi-agent-task) — API reference for dispatching to 2-5 agents with judge selection
- [Blackbox CLI (GitHub)](https://github.com/blackboxaicode/cli) — Open-source CLI with built-in judge mechanism
- [Blackbox AI Review 2026 (Banani)](https://www.banani.co/blog/blackbox-ai-review) — Independent review covering Chairman workflow
- Stiennon et al., "Learning to summarize from human feedback" (2020) — Best-of-N with reward models
- Li et al., "Competition-level code generation with AlphaCode" (2022) — Massive Best-of-N for code
- Related Hermes issues: #344, #412, #356, #413, #475