
Keep tied embeddings in fp32 #10

Closed
LJX2017 wants to merge 1 commit into openai:main from LJX2017:codex/fp32-tied-embeddings

Conversation


@LJX2017 LJX2017 commented Mar 18, 2026

Summary

  • keep tok_emb.weight as an fp32 master parameter in both the CUDA and MLX trainers
  • cast embedding activations and tied-head weights back to bf16 only at compute time
  • align tied embeddings with the existing fp32-master treatment already used for linear weights

Why

The tied embedding is one of the highest-leverage parameters in this baseline because it is both the input embedding table and the output head. The baseline currently trains it directly in bf16, unlike the linear weights, which keep fp32 master weights and cast on use.
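The fp32-master pattern the bullets describe can be illustrated with a small numeric sketch. This is plain NumPy standing in for the trainers, not code from this repo: `to_bf16` simulates a bf16 cast by round-to-nearest on the high 16 bits of a float32. A parameter stored directly in bf16 can stall under small optimizer updates, while an fp32 master accumulates them and is cast down only at compute time.

```python
import numpy as np

def to_bf16(x: np.ndarray) -> np.ndarray:
    """Simulate a bf16 cast: keep the high 16 bits of the float32
    encoding, rounding to nearest (even) before truncating."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    rounded = bits + np.uint32(0x7FFF) + ((bits >> np.uint32(16)) & np.uint32(1))
    return (rounded & np.uint32(0xFFFF0000)).view(np.float32)

lr, steps, grad = 1e-1, 1000, np.float32(1e-3)  # update of 1e-4 per step

# Variant A: parameter stored directly in (simulated) bf16.
# Each 1e-4 update rounds away against a weight of 1.0, so it never moves.
w_bf16 = to_bf16(np.float32([1.0]))
for _ in range(steps):
    w_bf16 = to_bf16(w_bf16 - lr * grad)

# Variant B: fp32 master parameter; updates land in fp32,
# and the bf16 cast happens only at compute time.
w_master = np.float32([1.0])
for _ in range(steps):
    w_master = w_master - lr * grad
w_compute = to_bf16(w_master)

print(float(w_bf16[0]), float(w_compute[0]))  # bf16-only weight is stuck at 1.0
```

The same asymmetry is why the linear weights in this baseline already keep fp32 masters; this PR just extends that treatment to the tied embedding.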

Local test

I ran the MLX path locally on Apple Silicon with a fixed smoke config and a patched subset-validation harness (20 steps, a 4x256 model, the first 16 validation sequences) to compare runs that are otherwise identical.

Baseline log:

  • pre-quant: val_bpb 3.7256
  • int8 roundtrip: val_bpb 3.73939058
  • quantized artifact: 1824906 bytes

This patch:

  • pre-quant: val_bpb 3.7250
  • int8 roundtrip: val_bpb 3.73832186
  • quantized artifact: 1825080 bytes

So this local smoke test improved both pre-quant and post-quant validation bpb while leaving the compressed artifact essentially unchanged (174 bytes larger).
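For context on the "int8 roundtrip" numbers, the usual shape of such a check is a per-tensor symmetric int8 quantize/dequantize pass. The sketch below is a generic illustration under that assumption, not this repo's actual exporter:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)  # stand-in weight tensor

# Per-tensor symmetric quantization: one scale maps max |w| onto 127.
scale = np.abs(w).max() / 127.0
q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)

# Dequantize; a "post-quant" validation pass would run on this tensor.
w_rt = q.astype(np.float32) * scale

# Worst-case roundtrip error is bounded by half a quantization step.
max_err = np.abs(w - w_rt).max()
print(float(max_err), float(scale))
```

Under this scheme a small pre-quant bpb gain can survive (or not) the roundtrip, which is why the PR reports both numbers.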

LJX2017 (Author) commented Mar 18, 2026

sorry folks codex opened this PR without asking for my confirmation :(((

@LJX2017 LJX2017 closed this Mar 18, 2026
South-33 added a commit to South-33/parameter-golf that referenced this pull request Mar 19, 2026
- add the PR openai#10 tied-embedding training nuance to project memory so this branch is tracked as training-side plus export-side precision handling
- add the Issue openai#43 tokenizer-artifact accounting note so tokenizer work is not under-ranked by an overly strict byte model
- extend ideas.md research memory with the PR openai#1-openai#35 audit and issue audit so future research passes do not repeat low-signal early PR review
- update the ranked backlog wording to reflect the stronger tokenizer and tied-embedding evidence
South-33 added a commit to South-33/parameter-golf that referenced this pull request Mar 19, 2026
- add an opt-in TIED_EMB_FP32_MASTER path that keeps the tied embedding/output head parameter in fp32 during training while casting only for compute
- record corrected local long-context probes showing TRAIN_SEQ_LEN=2048 is feasible on the 4060 but looks slower and worse than the matched 1024 reference on the short proxy
- record that the training-side tied-embedding fp32-master variant did not produce a free local win, while exporter-side tied-embedding protection remains the stronger sub-signal
- update AGENTS.md and ideas.md so future loops treat long-context as strategically real but locally unattractive here, and avoid over-trusting the PR openai#10 nuance without better eval evidence
