
[Non Record] Online Curriculum Learning #737

Open
SPThole wants to merge 7 commits into openai:main from SPThole:non_record_2

Conversation


@SPThole SPThole commented Mar 25, 2026

Summary

Implements online sequence-level curriculum learning that scores and filters sequences within each batch by unigram entropy, following a V-shaped difficulty schedule aligned with LR warmdown and SWA phases. Zero extra parameters. Built upon PR #623.

Motivation

Standard training feeds random batches regardless of training phase. In a 600-second window (~1100 steps), the model benefits from different data at different stages:

  • Early training (high LR): easy sequences → stable gradients, fast initial convergence
  • Mid training: hard sequences → push the model's frontier while LR is still meaningful
  • Late training (SWA region): easy sequences → coherent checkpoint averaging

Method

Per-sequence difficulty score — unigram entropy:

H(s) = -Σ_t p_s(t) · log₂ p_s(t), where p_s(t) is the empirical frequency of token t within sequence s (length 2048)
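A minimal sketch of this score, using the empirical token frequencies within one sequence (the function name `unigram_entropy` is illustrative, not taken from the PR's code):

```python
import math
from collections import Counter

def unigram_entropy(tokens):
    """Unigram entropy H(s) in bits for one token sequence.

    p_s(t) is the empirical frequency of token t within the sequence,
    so a sequence of 2048 identical tokens scores 0 and maximally
    diverse sequences score highest.
    """
    n = len(tokens)
    counts = Counter(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

Higher entropy is treated as higher difficulty; since the score uses only token counts, it adds no parameters and no forward passes.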

V-shaped target — maps training progress to difficulty percentile d ∈ [0,1]:

d(step) = step / (0.45 · T) if step ≤ 0.45·T
d(step) = 1 - (step/T - 0.45) / (1 - 0.45) otherwise
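The two branches above can be written as one small schedule function (a sketch; the name `difficulty_percentile` and the `peak` parameter are mine, with `peak=0.45` matching the 0.45·T breakpoint in the formulas):

```python
def difficulty_percentile(step, total_steps, peak=0.45):
    """Target difficulty percentile d(step) in [0, 1].

    Rises linearly from 0 to 1 over the first peak*T steps (easy -> hard),
    then falls linearly back to 0 by step T (hard -> easy for SWA).
    """
    frac = step / total_steps
    if frac <= peak:
        return frac / peak
    return 1.0 - (frac - peak) / (1.0 - peak)
```

At step = 0, d = 0 (easiest sequences); at step = 0.45·T, d = 1 (hardest); at step = T, d = 0 again, aligning the easy tail with the SWA averaging window.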

Selection: Load 2× sequences per batch, sort by entropy, select the half centered around percentile d(step). The V-shape completes within each batch — no dependence on shard ordering.
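The selection step could look roughly like this (a sketch under the PR's description; the helper name and clamping at the batch edges are my assumptions):

```python
def select_half(sequences, entropies, d):
    """From a 2x-oversampled batch, keep the half centered at percentile d.

    Sequences are sorted by entropy (ascending difficulty); a contiguous
    window of half the batch is taken around percentile d, clamped so the
    window stays inside the batch at d near 0 or 1.
    """
    order = sorted(range(len(sequences)), key=lambda i: entropies[i])
    keep = len(sequences) // 2                      # half of the 2x oversample
    center = round(d * (len(sequences) - 1))        # index at percentile d
    start = min(max(center - keep // 2, 0), len(sequences) - keep)
    return [sequences[i] for i in order[start:start + keep]]
```

Because sorting and windowing happen within each loaded batch, the schedule needs no global pass over the data and is insensitive to shard ordering, at the cost of the 2× dataloader oversampling noted in the results below.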

Results

  • val_bpb: 1.3557 (post int6+zstd, 1×H100, seed=42)
  • Pre-quant: 1.3280 | Quant penalty: 0.0277
  • 1,021 steps in 600s (588 ms/step) | 15.25MB artifact
  • Run on 1×H100 due to compute constraints

Observation

Worse than baseline (1.3345). The 2× oversampling adds ~50ms/step overhead (588ms vs 540ms), costing ~80 training steps. The curriculum signal doesn't compensate for lost steps. Implication: curriculum at this scale must be zero-overhead (precomputed ordering, not runtime filtering).

abaybektursun added a commit to abaybektursun/parameter-golf that referenced this pull request Mar 25, 2026
Two-stage investigation into training data selection for Parameter Golf:

Stage 1 (shard-level): 8 scoring methods, validated M5 (val-CE) as most
reliable (rho=0.984). But all 80 shards have nearly identical bigram
statistics (CE spread: 0.018 bits). Shard reordering: -0.001 BPB (noise).

Stage 2 (chunk-level): Scored 244K chunks at 32K granularity. Within-shard
variance is 535x larger than between-shard. Selected top 12% by bigram CE
and by 17M-param neural proxy. Both made val_bpb worse (+0.007, +0.006).

Curriculum learning (8xH100, 3 seeds): Hardest-first ordering by model
perplexity. Mean delta: -0.0006, one seed regressed. 95% CI spans zero.

Conclusion: On FineWeb (already filtered), hard data selection trades
diversity for match quality, and diversity wins. Corroborated by PRs openai#737,
openai#623, openai#333 and Sachdeva et al. (ICLR 2025).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
