
Record: Seq2048 training + eval (val_bpb=1.2101) #136

Open

ibarrajo wants to merge 1 commit into openai:main from ibarrajo:submission/seq2048

Conversation


@ibarrajo ibarrajo commented Mar 19, 2026

Seq2048 Training + Eval (val_bpb: 1.2101)

val_bpb: 1.2101 (post-quant int8+zlib roundtrip) | 15.87 MB | 8xH100 SXM, 11,417 steps in 600s

Approach

One change to the baseline: training and evaluating at sequence length 2048 instead of 1024. The model learns real long-range dependencies during training rather than relying on RoPE position extrapolation at eval time.

Why This Works

At seq1024, the model only sees 1024-token windows during training. At eval time with longer context, the model extrapolates RoPE positions it never trained on — the attention patterns are untested. Training at seq2048 means the model has practiced using 2048 tokens of context, so eval at 2048 is interpolation, not extrapolation.
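The interpolation-vs-extrapolation point can be made concrete with standard RoPE math. The sketch below is illustrative, not code from the submission: `dim` and `base` are assumed typical values, and it simply shows that every rotary frequency band at an eval position beyond the training window produces rotation angles the model never saw during training.

```python
import numpy as np

def rope_angles(pos, dim=64, base=10000.0):
    """Rotation angles RoPE applies at a given token position
    (standard formulation; dim and base are illustrative)."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return pos * inv_freq

# A model trained at seq_len=1024 has only produced angles for
# positions 0..1023. Evaluating at position 1500 yields rotations
# beyond that range in every frequency band.
train_max = rope_angles(1023)
eval_angles = rope_angles(1500)
print((eval_angles > train_max).sum(), "of", len(eval_angles),
      "frequency bands exceed the trained range")
```

Training at seq2048 moves all eval positions back inside the trained range, so attention over long context is exercised, not extrapolated.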

Each training step still processes the same total tokens (524K) — just in 256 sequences of 2048 instead of 512 sequences of 1024. Step time is identical.
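The constant-token-budget arithmetic above is easy to verify; the numbers come straight from the PR description (variable names are illustrative):

```python
# Tokens per optimizer step stay fixed when the batch is halved
# and the sequence length is doubled.
baseline = {"batch": 512, "seq_len": 1024}
this_run = {"batch": 256, "seq_len": 2048}

def tokens_per_step(cfg):
    return cfg["batch"] * cfg["seq_len"]

assert tokens_per_step(baseline) == tokens_per_step(this_run) == 524_288
print(tokens_per_step(this_run))  # 524288 (~524K tokens per step)
```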

Results

| Metric | Value |
| --- | --- |
| Post-quant val_bpb | 1.2101 |
| Pre-quant val_bpb | 1.2033 |
| Improvement vs baseline | -0.0143 BPB |
| Quantization gap | 0.0068 BPB |
| Training steps | 11,417 (wallclock capped at 600s) |
| Step avg | 52.56 ms |
| Artifact size | 15,869,065 bytes (15.87 MB) |
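The headline number is scored after an int8+zlib roundtrip. A minimal sketch of such a roundtrip on a single weight tensor, assuming simple symmetric per-tensor quantization (the submission's actual scheme may differ, e.g. per-channel scales):

```python
import zlib
import numpy as np

def int8_zlib_roundtrip(w: np.ndarray):
    """Quantize to int8, compress with zlib, then fully reconstruct.
    Returns (restored float32 weights, compressed size in bytes)."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    blob = zlib.compress(q.tobytes())  # counts toward artifact size
    restored = np.frombuffer(zlib.decompress(blob), dtype=np.int8)
    return restored.astype(np.float32).reshape(w.shape) * scale, len(blob)

w = np.random.randn(512, 512).astype(np.float32)
w_hat, nbytes = int8_zlib_roundtrip(w)
print(f"max abs error: {np.abs(w - w_hat).max():.4f}, compressed: {nbytes} bytes")
```

Evaluating val_bpb on the restored weights (not the originals) is what makes the 0.0068 BPB quantization gap visible.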

Development Context

This was validated through systematic experimentation:

  • 20+ configs tested on 1xH100 (SwiGLU, recurrence, MTP, QAT, TTT, batch sizes, seq lengths)
  • seq2048 was the single biggest win (-0.023 BPB on 1xH100), larger than SwiGLU (-0.017) or QAT
  • SwiGLU showed negligible improvement on 8xH100 (1.2130 vs 1.2101) because the larger model gets fewer steps
  • Full experiment log: 26+ runs documented in ops/FINDINGS.md

Command

```shell
NCCL_IB_DISABLE=1 RUN_ID=sub_8x_relu2_seq2048 \
TRAIN_SEQ_LEN=2048 EVAL_SEQ_LEN=2048 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```
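A hypothetical sketch of how env-var overrides like `TRAIN_SEQ_LEN` might be consumed inside a training script, while keeping the per-step token budget fixed; the real `train_gpt.py` may do this differently:

```python
import os

# Hypothetical override pattern for the env vars in the command above.
train_seq_len = int(os.environ.get("TRAIN_SEQ_LEN", "1024"))
eval_seq_len = int(os.environ.get("EVAL_SEQ_LEN", "1024"))

# Shrink the batch as seq_len grows so tokens/step stays at 524,288.
base_batch, base_seq = 512, 1024
batch_size = base_batch * base_seq // train_seq_len
print(train_seq_len, batch_size)
```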

🤖 Generated with Claude Code

Training and evaluating at sequence length 2048 instead of 1024.
No architecture changes — same 9-layer 512-dim baseline.

8xH100 SXM, 11,417 steps in 600s, 15.87MB artifact.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ThomAub pushed a commit to ThomAub/parameter-golf that referenced this pull request Mar 22, 2026
Many TTT submissions (openai#136, openai#152, openai#254, openai#264, openai#338, openai#398, openai#417, openai#421, openai#442)
flagged as potentially invalid for adapting on eval tokens BEFORE scoring them.
Added correct score-then-adapt protocol with implementation guide.

https://claude.ai/code/session_01M5XTtyz2Zdq5BDeh9qNn9y