External research review by u/mxriverlynn pointed out that the academic evidence for divergent-convergent separation (A4: CreativeDC, arXiv 2512.23601, reporting 51.5–63.5% novelty improvement and 72% diversity advantage) is measured at K=100 parallel samples. ADHD's current evals are at K=5. The K-gap is real and the paper implicitly leans on A4-style numbers without running at A4-style K.
Action:
- Extend the eval harness to support configurable K (already trivially possible via `framesPerRun`).
- Re-run the same six-problem suite at K=5, K=10, K=20 with the same LLM-as-judge methodology.
- Add a new table to `EVALS.md` reporting win-rate as a function of K.
- If the win rate flattens or degrades above K=5, document that honestly. If it scales, ADHD's positioning gains a quantitative claim.
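A minimal sketch of how per-K win rates could be aggregated into that table, assuming the harness can emit one `(k, won)` pair per judged comparison. The function names and the record shape are hypothetical, not the harness's actual API, and the sample verdicts are placeholders rather than eval results.

```python
from collections import defaultdict


def win_rate_by_k(results):
    """Aggregate judged comparisons into a win rate per K.

    `results` is an iterable of (k, won) pairs: k is the number of
    parallel samples for the run, won is True when the judge preferred
    the ADHD output. This record shape is an assumption.
    """
    wins = defaultdict(int)
    totals = defaultdict(int)
    for k, won in results:
        totals[k] += 1
        if won:
            wins[k] += 1
    # Sort by K so the table reads in increasing order.
    return {k: wins[k] / totals[k] for k in sorted(totals)}


def to_markdown_table(rates):
    """Render {k: win_rate} as a markdown table."""
    lines = ["| K | Win rate |", "|---|---|"]
    for k, rate in rates.items():
        lines.append(f"| {k} | {rate:.1%} |")
    return "\n".join(lines)


# Placeholder verdicts for illustration only, NOT real eval data.
sample = [(5, True), (5, False), (10, True), (10, True), (20, True), (20, False)]
rates = win_rate_by_k(sample)
print(to_markdown_table(rates))
```

Keeping aggregation separate from rendering means the same `rates` mapping can also be used to check whether the win rate flattens or degrades above K=5.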
Cost note: at K=20, the per-run LLM call count is ~25. Across six problems, that is ~150 calls per condition. Counting all three conditions (K=5/10/20) at the K=20 rate gives an upper bound of ~450 calls, since the lower-K runs need fewer calls. Roughly $15-30 in API costs at current Sonnet pricing. Feasible.
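The arithmetic behind that estimate can be checked with a short script. It uses only the figures from the cost note and charges every condition at the K=20 call count, so the total is an upper bound. The implied per-call cost is derived from the quoted $15-30 range, not from a published price list.

```python
# Figures taken from the cost note above (all approximate).
CALLS_PER_RUN_AT_K20 = 25
PROBLEMS = 6
CONDITIONS = (5, 10, 20)

# Calls for one condition across the whole suite.
calls_per_condition = CALLS_PER_RUN_AT_K20 * PROBLEMS

# Upper bound: every condition is charged at the K=20 call count.
upper_bound_calls = calls_per_condition * len(CONDITIONS)

# Per-call cost implied by the quoted $15-30 total.
COST_LOW_USD, COST_HIGH_USD = 15.0, 30.0
per_call_low = COST_LOW_USD / upper_bound_calls
per_call_high = COST_HIGH_USD / upper_bound_calls

print(f"calls per condition: {calls_per_condition}")
print(f"upper-bound total calls: {upper_bound_calls}")
print(f"implied cost per call: ${per_call_low:.3f} to ${per_call_high:.3f}")
```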
Risks worth measuring:
Raised by u/mxriverlynn in adhd-application-to-han.md, validation point V8.