
Run evals at K=10 and K=20 to bridge K=5 (ours) vs K=100 (literature) gap #18

@UditAkhourii


An external research review by u/mxriverlynn pointed out that the academic evidence for divergent-convergent separation (A4: CreativeDC, arXiv 2512.23601, reporting 51.5–63.5% novelty improvement and 72% diversity advantage) was measured at K=100 parallel samples, whereas ADHD's current evals run at K=5. The K-gap matters because the paper implicitly leans on A4-style numbers without having been evaluated at A4-style K.

Action:

  • Extend the eval harness to support configurable K (already trivially possible via framesPerRun).
  • Re-run the same six-problem suite at K=5, K=10, K=20 with the same LLM-as-judge methodology.
  • Add a new table to EVALS.md reporting win-rate as a function of K.
  • If the win rate flattens or degrades above K=5, document that honestly. If it scales, ADHD's positioning gains a quantitative claim.
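The reporting step above could be sketched roughly as follows. This is a minimal illustration, not the harness's actual code: `win_rate_by_k` and `to_markdown_table` are hypothetical helper names, and the input is assumed to be a flat list of `(k, won)` pairs where `won` records whether the LLM judge preferred the ADHD output in one comparison. The sample data in the usage line is placeholder input, not a result.

```python
from collections import defaultdict


def win_rate_by_k(results):
    """Aggregate judge verdicts into a win rate per K.

    `results` is an iterable of (k, won) pairs. Assumption: `won` is True
    when the judge preferred the ADHD output in that comparison.
    """
    wins = defaultdict(int)
    totals = defaultdict(int)
    for k, won in results:
        totals[k] += 1
        wins[k] += int(won)
    # Sort by K so the table rows come out in ascending order.
    return {k: wins[k] / totals[k] for k in sorted(totals)}


def to_markdown_table(rates):
    """Render {k: win_rate} as a two-column markdown table for EVALS.md."""
    lines = ["| K | Win rate |", "|---|---|"]
    for k, rate in rates.items():
        lines.append(f"| {k} | {rate:.1%} |")
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder verdicts for illustration only; not real eval output.
    placeholder = [(5, True), (5, False), (10, True), (20, False)]
    print(to_markdown_table(win_rate_by_k(placeholder)))
```

Keeping the aggregation separate from the rendering makes it straightforward to see whether the rate flattens, degrades, or scales as K grows before anything is written to EVALS.md.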

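Cost note: at K=20, the per-run LLM call count is ~25 calls. Across six problems, that is ~150 calls per condition. Applying the K=20 rate to all three conditions (K=5/10/20) gives ~450 calls, which is likely an upper bound since the K=5 and K=10 runs should need fewer calls. Roughly $15-30 in API costs at current Sonnet pricing. Feasible.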
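The arithmetic in the cost note can be checked with a short sketch. The constants come from the note itself and are approximate. The linear model (`linear_model_calls`) is an assumption added for illustration: it supposes each run costs K frame calls plus a fixed overhead, with an overhead of 5 chosen only because it reproduces the ~25 calls quoted at K=20. The real per-run breakdown is not stated in this issue.

```python
# Approximate figures taken from the cost note.
CALLS_PER_RUN_AT_K20 = 25
NUM_PROBLEMS = 6
K_VALUES = (5, 10, 20)
BUDGET_USD = (15.0, 30.0)  # quoted cost range for the whole sweep


def upper_bound_calls():
    """Total calls if every condition is charged at the K=20 rate."""
    return CALLS_PER_RUN_AT_K20 * NUM_PROBLEMS * len(K_VALUES)


def linear_model_calls(overhead_calls=5):
    """Total calls under a hypothetical 'K + fixed overhead' model.

    ASSUMPTION: per-run calls = K + overhead_calls. This is not confirmed
    by the issue; overhead_calls=5 merely matches ~25 calls at K=20.
    """
    return sum((k + overhead_calls) * NUM_PROBLEMS for k in K_VALUES)


def implied_cost_per_call(total_calls):
    """Per-call cost range implied by the quoted budget."""
    low, high = BUDGET_USD
    return low / total_calls, high / total_calls


if __name__ == "__main__":
    print("upper bound calls:", upper_bound_calls())
    print("linear model calls:", linear_model_calls())
    print("implied USD per call:", implied_cost_per_call(upper_bound_calls()))
```

Under the hypothetical linear model the sweep would need fewer calls than the quoted total, which suggests the estimate in the cost note is on the conservative side.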

Risks worth measuring:


Raised by u/mxriverlynn in adhd-application-to-han.md, validation point V8.

Labels: evals (Eval harness and methodology), methodology (Methodology critique or clarification), paper (Preprint paper updates)
