Record: PR #1105 + window attn + mixed seq_len - 1.1084 bpb (3-seed mean) #1219

Open
Gusanidas wants to merge 1 commit into openai:main from Gusanidas:apr_1

@Gusanidas

Based on PR #1105 (abaybektursun), with these changes:

  • Causal n-gram fix (within_hint/word_hint prefix-only)
  • Window attention (size=512) on layers 2,4,6,8,10 via FA3
  • Mixed seq_len training: 5 GPUs at 2048x36 + 3 GPUs at 6144x10
  • Train-data GPTQ calibration (14s vs 220s AR self-gen)
  • Auto eval_seq_len detection from max train seq_len
  • Sliding window eval at seq_len=6144, stride=128
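
To make the last bullet concrete, here is a minimal sketch of how a sliding-window evaluation with seq_len=6144 and stride=128 could partition a token stream. It assumes the common convention that the first window scores all of its tokens and each later window scores only its newest `stride` tokens, using the rest as context. The helper name `sliding_window_spans` is hypothetical and is not taken from the PR's code.

```python
def sliding_window_spans(n_tokens, seq_len=6144, stride=128):
    """Return (window_start, window_end, score_start) triples.

    Tokens in [score_start, window_end) are scored; earlier tokens in the
    window are context only. Assumed convention, not the PR's actual code.
    """
    spans = []
    scored_upto = 0  # index of the first token that has not been scored yet
    while scored_upto < n_tokens:
        # The first window is a full seq_len; later windows advance by stride.
        step = seq_len if scored_upto == 0 else stride
        end = min(scored_upto + step, n_tokens)
        start = max(0, end - seq_len)  # keep at most seq_len tokens in view
        spans.append((start, end, scored_upto))
        scored_upto = end
    return spans


if __name__ == "__main__":
    for start, end, score_start in sliding_window_spans(6400):
        print(f"window [{start}, {end}) scores [{score_start}, {end})")
```

Under this convention every token is scored exactly once, and every token after the first window is predicted with at least seq_len - stride = 6016 tokens of preceding context, at the cost of roughly seq_len / stride = 48 forward passes per token position.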

3-seed results (sliding window bpb):
seed 1337: 1.1077
seed 42: 1.1083
seed 7: 1.1091
mean: 1.1084 (vs leader 1.1147)
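
The reported mean and the margin over the leader can be checked directly from the three per-seed numbers above. This is only an arithmetic check; the variable names are illustrative.

```python
from statistics import mean

# Per-seed sliding-window bpb values as reported in this PR.
seed_bpb = {1337: 1.1077, 42: 1.1083, 7: 1.1091}
leader_bpb = 1.1147  # leader score quoted in this PR

mean_bpb = mean(seed_bpb.values())
gap = leader_bpb - mean_bpb
spread = max(seed_bpb.values()) - min(seed_bpb.values())

print(f"mean bpb:       {mean_bpb:.4f}")  # 1.1084
print(f"gap to leader:  {gap:.4f}")       # 0.0063
print(f"seed spread:    {spread:.4f}")    # 0.0014
```

The gap to the leader (0.0063 bpb) is several times larger than the spread across the three seeds (0.0014 bpb).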

There is still plenty of room for further optimization.

Based on PR openai#1105 (abaybektursun) with improvements:
- Window attention (size=512) on layers 2,4,6,8,10 via FA3
- Mixed seq_len training: 5 GPUs at 2048x36 + 3 GPUs at 6144x10
- Train-data GPTQ calibration (14s vs 220s AR self-gen)
- Auto eval_seq_len detection from max train seq_len
- Causal n-gram fix (within_hint/word_hint prefix-only)
- Sliding window eval at seq_len=6144, stride=128

3-seed results (sliding window bpb):
  seed 1337: 1.1077
  seed 42:   1.1083
  seed 7:    1.1091
  mean:      1.1084 (vs leader 1.1147)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>