
Record: Seq2048 training + eval (val_bpb=1.2101) #136

Open

ibarrajo wants to merge 1 commit into openai:main from ibarrajo:submission/seq2048

Conversation


@ibarrajo ibarrajo commented Mar 19, 2026

Seq2048 Training + Eval (val_bpb: 1.2101)

val_bpb: 1.2101 (post-quant int8+zlib roundtrip) | 15.87 MB | 8xH100 SXM, 11,417 steps in 600s

Approach

One change to the baseline: training and evaluating at sequence length 2048 instead of 1024. The model learns real long-range dependencies during training rather than relying on RoPE position extrapolation at eval time.

Why This Works

At seq1024, the model only sees 1024-token windows during training. At eval time with longer context, the model extrapolates RoPE positions it never trained on — the attention patterns are untested. Training at seq2048 means the model has practiced using 2048 tokens of context, so eval at 2048 is interpolation, not extrapolation.
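The interpolation-vs-extrapolation point can be made concrete with standard RoPE math. The sketch below is illustrative, not code from the submission: `dim` and `base` are assumed typical values, and it simply shows that every rotary frequency band at an eval position beyond the training window produces rotation angles the model never saw during training.

```python
import numpy as np

def rope_angles(pos, dim=64, base=10000.0):
    """Rotation angles RoPE applies at a given token position
    (standard formulation; dim and base are illustrative)."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return pos * inv_freq

# A model trained at seq_len=1024 has only produced angles for
# positions 0..1023. Evaluating at position 1500 yields rotations
# beyond that range in every frequency band.
train_max = rope_angles(1023)
eval_angles = rope_angles(1500)
print((eval_angles > train_max).sum(), "of", len(eval_angles),
      "frequency bands exceed the trained range")
```

Training at seq2048 moves all eval positions back inside the trained range, so attention over long context is exercised, not extrapolated.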

Each training step still processes the same total tokens (524K) — just in 256 sequences of 2048 instead of 512 sequences of 1024. Step time is identical.
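The constant-token-budget arithmetic above is easy to verify; the numbers come straight from the PR description (variable names are illustrative):

```python
# Tokens per optimizer step stay fixed when the batch is halved
# and the sequence length is doubled.
baseline = {"batch": 512, "seq_len": 1024}
this_run = {"batch": 256, "seq_len": 2048}

def tokens_per_step(cfg):
    return cfg["batch"] * cfg["seq_len"]

assert tokens_per_step(baseline) == tokens_per_step(this_run) == 524_288
print(tokens_per_step(this_run))  # 524288 (~524K tokens per step)
```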

Results

| Metric | Value |
| --- | --- |
| Post-quant val_bpb | 1.2101 |
| Pre-quant val_bpb | 1.2033 |
| Improvement vs baseline | -0.0143 BPB |
| Quantization gap | 0.0068 BPB |
| Training steps | 11,417 (wallclock capped at 600s) |
| Step avg | 52.56 ms |
| Artifact size | 15,869,065 bytes (15.87 MB) |
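The headline number is scored after an int8+zlib roundtrip. A minimal sketch of such a roundtrip on a single weight tensor, assuming simple symmetric per-tensor quantization (the submission's actual scheme may differ, e.g. per-channel scales):

```python
import zlib
import numpy as np

def int8_zlib_roundtrip(w: np.ndarray):
    """Quantize to int8, compress with zlib, then fully reconstruct.
    Returns (restored float32 weights, compressed size in bytes)."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    blob = zlib.compress(q.tobytes())  # counts toward artifact size
    restored = np.frombuffer(zlib.decompress(blob), dtype=np.int8)
    return restored.astype(np.float32).reshape(w.shape) * scale, len(blob)

w = np.random.randn(512, 512).astype(np.float32)
w_hat, nbytes = int8_zlib_roundtrip(w)
print(f"max abs error: {np.abs(w - w_hat).max():.4f}, compressed: {nbytes} bytes")
```

Evaluating val_bpb on the restored weights (not the originals) is what makes the 0.0068 BPB quantization gap visible.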

Development Context

This was validated through systematic experimentation:

  • 20+ configs tested on 1xH100 (SwiGLU, recurrence, MTP, QAT, TTT, batch sizes, seq lengths)
  • seq2048 was the single biggest win (-0.023 BPB on 1xH100), larger than SwiGLU (-0.017) or QAT
  • SwiGLU showed negligible improvement on 8xH100 (1.2130 vs 1.2101) because the larger model gets fewer steps
  • Full experiment log: 26+ runs documented in ops/FINDINGS.md

Command

```shell
NCCL_IB_DISABLE=1 RUN_ID=sub_8x_relu2_seq2048 \
TRAIN_SEQ_LEN=2048 EVAL_SEQ_LEN=2048 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```
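A hypothetical sketch of how env-var overrides like `TRAIN_SEQ_LEN` might be consumed inside a training script, while keeping the per-step token budget fixed; the real `train_gpt.py` may do this differently:

```python
import os

# Hypothetical override pattern for the env vars in the command above.
train_seq_len = int(os.environ.get("TRAIN_SEQ_LEN", "1024"))
eval_seq_len = int(os.environ.get("EVAL_SEQ_LEN", "1024"))

# Shrink the batch as seq_len grows so tokens/step stays at 524,288.
base_batch, base_seq = 512, 1024
batch_size = base_batch * base_seq // train_seq_len
print(train_seq_len, batch_size)
```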

🤖 Generated with Claude Code

Training and evaluating at sequence length 2048 instead of 1024.
No architecture changes — same 9-layer 512-dim baseline.

8xH100 SXM, 11,417 steps in 600s, 15.87MB artifact.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ThomAub pushed a commit to ThomAub/parameter-golf that referenced this pull request Mar 22, 2026
Many TTT submissions (openai#136, openai#152, openai#254, openai#264, openai#338, openai#398, openai#417, openai#421, openai#442)
flagged as potentially invalid for adapting on eval tokens BEFORE scoring them.
Added correct score-then-adapt protocol with implementation guide.

https://claude.ai/code/session_01M5XTtyz2Zdq5BDeh9qNn9y