Run evals at K=10 and K=20 to bridge K=5 (ours) vs K=100 (literature) gap

An external research review by u/mxriverlynn pointed out that the academic evidence for divergent-convergent separation (A4: *CreativeDC*, arXiv 2512.23601, reporting 51.5–63.5% novelty improvement and 72% diversity advantage) is measured at K=100 parallel samples, while ADHD's current evals run at K=5. The gap is real: the paper implicitly leans on A4-style numbers without running at A4-style K.

**Action:**
- Extend the eval harness to support configurable K (already trivially possible via `framesPerRun`).
- Re-run the same six-problem suite at K=5, K=10, and K=20 with the same LLM-as-judge methodology.
- Add a new table to `EVALS.md` reporting win-rate as a function of K.
- If the win rate flattens or degrades above K=5, document that honestly. If it scales, ADHD's positioning gains a quantitative claim.
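One way the win-rate table could be produced is sketched below. This is a minimal sketch, not the harness's actual code: it assumes the judge output can be flattened into `(k, won)` pairs, and the names `win_rate_by_k` and `to_markdown` are hypothetical. The sample judgments in the usage note are placeholders, not eval results.

```python
from collections import defaultdict


def win_rate_by_k(judgments):
    """Aggregate judge verdicts into per-K (wins, judged, win_rate).

    `judgments` is an iterable of (k, won) pairs. This shape is an
    assumption for illustration, not the harness's real output format.
    """
    wins = defaultdict(int)
    totals = defaultdict(int)
    for k, won in judgments:
        totals[k] += 1
        wins[k] += int(won)
    return {k: (wins[k], totals[k], wins[k] / totals[k]) for k in sorted(totals)}


def to_markdown(rows):
    """Render the aggregated rows as a markdown table for EVALS.md."""
    lines = [
        "| K | Wins | Judged | Win rate |",
        "|---|------|--------|----------|",
    ]
    for k, (w, n, rate) in rows.items():
        lines.append(f"| {k} | {w} | {n} | {rate:.0%} |")
    return "\n".join(lines)
```

Usage, with placeholder verdicts only: `print(to_markdown(win_rate_by_k([(5, True), (5, False), (10, True)])))`. Reporting the judged count next to the rate keeps the small sample size (six problems per K) visible in the table.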

**Cost note:** at K=20, each run makes ~25 LLM calls, so six problems come to ~150 calls. Treating all three conditions (K=5/10/20) at the K=20 rate gives an upper bound of ~450 calls, since the K=5 and K=10 runs need fewer. Roughly $15-30 in API costs at current Sonnet pricing. Feasible.
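The call budget in the cost note can be checked with quick arithmetic. The sketch below reproduces the ~450 figure and, under a loudly labeled assumption, a tighter estimate. The assumption is that calls per run equal K plus a fixed overhead of 5 (inferred only from "~25 calls at K=20"); the real harness may scale differently.

```python
PROBLEMS = 6
K_VALUES = (5, 10, 20)
CALLS_PER_RUN_AT_K20 = 25  # figure quoted in the cost note


def upper_bound_calls():
    """Charge every condition at the K=20 rate, as the cost note does."""
    return CALLS_PER_RUN_AT_K20 * PROBLEMS * len(K_VALUES)


def estimated_calls(overhead=5):
    """Tighter estimate. ASSUMPTION: calls per run = K + fixed overhead.

    The overhead of 5 is back-derived from ~25 calls at K=20 and is not
    confirmed by the harness.
    """
    return sum((k + overhead) * PROBLEMS for k in K_VALUES)


print(upper_bound_calls())  # 450
print(estimated_calls())    # 300
```

If the linear assumption holds, the sweep comes in under the quoted budget, so the $15-30 range reads as a ceiling rather than a midpoint.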

**Risks worth measuring:**
- Critic context overload at high K (already tracked as #7) may become the dominant cost/quality bottleneck before any win-rate gains materialize.
- Likely interaction: K helps until the critic saturates, then win rate degrades. Finding the inflection point is itself a finding.
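Locating the inflection point from a win-rate-by-K mapping could look like the sketch below. The function name `find_inflection` and the `tolerance` parameter are hypothetical, and with only three K values and six problems per K any detected drop may be noise, so this is a starting point rather than a statistical test.

```python
def find_inflection(rates, tolerance=0.0):
    """Return the K just before win rate first drops, or None if it never drops.

    `rates` maps K -> win rate in [0, 1]. A drop counts only if it exceeds
    `tolerance`, which gives a crude guard against judge noise.
    """
    ks = sorted(rates)
    for prev_k, next_k in zip(ks, ks[1:]):
        if rates[next_k] < rates[prev_k] - tolerance:
            return prev_k
    return None
```

Called on placeholder values such as `{5: 0.5, 10: 0.6, 20: 0.55}`, it returns 10; with `tolerance=0.1` the same input returns `None`, because the drop is smaller than the allowed noise band.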

---

*Raised by u/mxriverlynn in [adhd-application-to-han.md](https://github.com/testdouble/han/blob/adhd-swarm-research/docs/research/adhd-application-to-han.md), validation point V8.*
