
[Non Record] Online Curriculum Learning #737

Open
SPThole wants to merge 7 commits into openai:main from SPThole:non_record_2

Conversation


@SPThole SPThole commented Mar 25, 2026

Summary

Implements online sequence-level curriculum learning that scores and filters sequences within each batch by unigram entropy, following a V-shaped difficulty schedule aligned with LR warmdown and SWA phases. Zero extra parameters. Built upon PR #623.

Motivation

Standard training feeds random batches regardless of training phase. In a 600-second window (~1100 steps), the model benefits from different data at different stages:

  • Early training (high LR): easy sequences → stable gradients, fast initial convergence
  • Mid training: hard sequences → push the model's frontier while LR is still meaningful
  • Late training (SWA region): easy sequences → coherent checkpoint averaging

Method

Per-sequence difficulty score — unigram entropy:

H(s) = -Σ_t p_s(t) · log₂ p_s(t), where p_s(t) is the empirical frequency of token t within sequence s (length 2048)
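A minimal sketch of this score, using the empirical token frequencies within one sequence (the function name `unigram_entropy` is illustrative, not taken from the PR's code):

```python
import math
from collections import Counter

def unigram_entropy(tokens):
    """Unigram entropy H(s) in bits for one token sequence.

    p_s(t) is the empirical frequency of token t within the sequence,
    so a sequence of 2048 identical tokens scores 0 and maximally
    diverse sequences score highest.
    """
    n = len(tokens)
    counts = Counter(tokens)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

Higher entropy is treated as higher difficulty; since the score uses only token counts, it adds no parameters and no forward passes.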

V-shaped target — maps training progress to difficulty percentile d ∈ [0,1]:

d(step) = step / (0.45 · T) if step ≤ 0.45·T
d(step) = 1 - (step/T - 0.45) / (1 - 0.45) otherwise
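The two branches above can be written as one small schedule function (a sketch; the name `difficulty_percentile` and the `peak` parameter are mine, with `peak=0.45` matching the 0.45·T breakpoint in the formulas):

```python
def difficulty_percentile(step, total_steps, peak=0.45):
    """Target difficulty percentile d(step) in [0, 1].

    Rises linearly from 0 to 1 over the first peak*T steps (easy -> hard),
    then falls linearly back to 0 by step T (hard -> easy for SWA).
    """
    frac = step / total_steps
    if frac <= peak:
        return frac / peak
    return 1.0 - (frac - peak) / (1.0 - peak)
```

At step = 0, d = 0 (easiest sequences); at step = 0.45·T, d = 1 (hardest); at step = T, d = 0 again, aligning the easy tail with the SWA averaging window.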

Selection: Load 2× sequences per batch, sort by entropy, select the half centered around percentile d(step). The V-shape completes within each batch — no dependence on shard ordering.
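The selection step could look roughly like this (a sketch under the PR's description; the helper name and clamping at the batch edges are my assumptions):

```python
def select_half(sequences, entropies, d):
    """From a 2x-oversampled batch, keep the half centered at percentile d.

    Sequences are sorted by entropy (ascending difficulty); a contiguous
    window of half the batch is taken around percentile d, clamped so the
    window stays inside the batch at d near 0 or 1.
    """
    order = sorted(range(len(sequences)), key=lambda i: entropies[i])
    keep = len(sequences) // 2                      # half of the 2x oversample
    center = round(d * (len(sequences) - 1))        # index at percentile d
    start = min(max(center - keep // 2, 0), len(sequences) - keep)
    return [sequences[i] for i in order[start:start + keep]]
```

Because sorting and windowing happen within each loaded batch, the schedule needs no global pass over the data and is insensitive to shard ordering, at the cost of the 2× dataloader oversampling noted in the results below.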

Results

  • val_bpb: 1.3557 (post int6+zstd, 1×H100, seed=42)
  • Pre-quant: 1.3280 | Quant penalty: 0.0277
  • 1,021 steps in 600s (588 ms/step) | 15.25MB artifact
  • Run on 1×H100 due to compute constraints

Observation

Worse than baseline (1.3345). The 2× oversampling adds ~50ms/step overhead (588ms vs 540ms), costing ~80 training steps. The curriculum signal doesn't compensate for lost steps. Implication: curriculum at this scale must be zero-overhead (precomputed ordering, not runtime filtering).

abaybektursun added a commit to abaybektursun/parameter-golf that referenced this pull request Mar 25, 2026
Two-stage investigation into training data selection for Parameter Golf:

Stage 1 (shard-level): 8 scoring methods, validated M5 (val-CE) as most
reliable (rho=0.984). But all 80 shards have nearly identical bigram
statistics (CE spread: 0.018 bits). Shard reordering: -0.001 BPB (noise).

Stage 2 (chunk-level): Scored 244K chunks at 32K granularity. Within-shard
variance is 535x larger than between-shard. Selected top 12% by bigram CE
and by 17M-param neural proxy. Both made val_bpb worse (+0.007, +0.006).

Curriculum learning (8xH100, 3 seeds): Hardest-first ordering by model
perplexity. Mean delta: -0.0006, one seed regressed. 95% CI spans zero.

Conclusion: On FineWeb (already filtered), hard data selection trades
diversity for match quality, and diversity wins. Corroborated by PRs openai#737,
openai#623, openai#333 and Sachdeva et al. (ICLR 2025).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
