Closed

Author:
sorry folks, Codex opened this PR without asking for my confirmation :(((
South-33 added a commit to South-33/parameter-golf that referenced this pull request on Mar 19, 2026:

- add the PR openai#10 tied-embedding training nuance to project memory so this branch is tracked as training-side plus export-side precision handling
- add the Issue openai#43 tokenizer-artifact accounting note so tokenizer work is not under-ranked by an overly strict byte model
- extend ideas.md research memory with the PR openai#1-openai#35 audit and issue audit so future research passes do not repeat low-signal early PR review
- update the ranked backlog wording to reflect the stronger tokenizer and tied-embedding evidence
South-33 added a commit to South-33/parameter-golf that referenced this pull request on Mar 19, 2026:

- add an opt-in TIED_EMB_FP32_MASTER path that keeps the tied embedding/output head parameter in fp32 during training while casting only for compute
- record corrected local long-context probes showing TRAIN_SEQ_LEN=2048 is feasible on the 4060 but looks slower and worse than the matched 1024 reference on the short proxy
- record that the training-side tied-embedding fp32-master variant did not produce a free local win, while exporter-side tied-embedding protection remains the stronger sub-signal
- update AGENTS.md and ideas.md so future loops treat long-context as strategically real but locally unattractive here, and avoid over-trusting the PR openai#10 nuance without better eval evidence
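The commit above describes the fp32-master variant as opt-in via an environment flag. A minimal sketch of how such a flag could gate the per-parameter dtype policy, assuming the flag reads `1` to enable and that the tied parameter is named `tok_emb.weight` (the actual trainer wiring may differ):

```python
import os

def pick_param_dtype(name, fp32_master=None):
    """Choose a training dtype for a parameter by name.

    Hypothetical gating sketch: when the TIED_EMB_FP32_MASTER flag is set,
    only the tied embedding/output head stays in fp32; everything else
    follows the baseline bf16 policy described in the PR.
    """
    if fp32_master is None:
        fp32_master = os.environ.get("TIED_EMB_FP32_MASTER", "0") == "1"
    if fp32_master and name == "tok_emb.weight":
        return "float32"
    return "bfloat16"
```

With the flag off (or unset), every parameter gets the baseline bf16 treatment, so the default run is unchanged.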
Summary

Keep tok_emb.weight as an fp32 master parameter in both the CUDA and MLX trainers.

Why
The tied embedding is one of the highest-leverage parameters in this baseline because it is both the input embedding table and the output head. The baseline currently trains it directly in bf16, unlike the linear weights, which keep fp32 master weights and cast on use.
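The fp32-master-with-cast-on-use pattern described above can be sketched as follows. This is an illustrative stand-in, not the trainer's actual code: class and method names are invented, and NumPy's fp16 stands in for bf16, which NumPy does not support.

```python
import numpy as np

class TiedEmbedding:
    """Hypothetical sketch of an fp32 master weight with low-precision compute."""

    def __init__(self, vocab_size, dim, rng):
        # fp32 master copy: optimizer updates accumulate here, so tiny
        # updates below low-precision resolution are not rounded away.
        self.master = rng.standard_normal((vocab_size, dim)).astype(np.float32) * 0.02

    def compute_weight(self):
        # Cast on use only (fp16 here stands in for bf16).
        return self.master.astype(np.float16)

    def embed(self, token_ids):
        # Input side of the tied parameter: embedding table lookup.
        return self.compute_weight()[token_ids]

    def logits(self, hidden):
        # Output side of the tied parameter: same weight as the head.
        return hidden @ self.compute_weight().T

    def apply_grad(self, grad, lr):
        # Gradients are applied to the fp32 master, not the cast copy.
        self.master -= lr * grad.astype(np.float32)

emb = TiedEmbedding(256, 64, np.random.default_rng(0))
h = emb.embed(np.array([3, 7]))   # low-precision activations, shape (2, 64)
scores = emb.logits(h)            # tied-head logits, shape (2, 256)
```

The point of the pattern is the asymmetry: forward/backward compute sees the cheap low-precision copy, but the parameter state that accumulates many small updates stays in fp32, matching how the baseline already treats its linear weights.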
Local test
I ran the MLX path locally on Apple Silicon with a fixed smoke config and a patched subset validation harness (20 steps, 4x256 model, first 16 validation sequences) to compare directionally identical runs.

Baseline log: val_bpb 3.7256 (pre-quant), val_bpb 3.739390581 (post-quant), 824906 bytes.
This patch: val_bpb 3.7250 (pre-quant), val_bpb 3.738321861 (post-quant), 825080 bytes.

So the local smoke improved both pre-quant and post-quant validation while keeping the compressed artifact essentially unchanged.
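For readers comparing the val_bpb numbers above: bits-per-byte is conventionally derived from the model's cross-entropy in nats divided by ln 2 and by the byte length of the evaluated text. A minimal sketch under that assumption (the project's harness may compute it differently):

```python
import math

def bits_per_byte(total_nats, total_bytes):
    """Convert total cross-entropy (nats) over a text to bits per byte."""
    return total_nats / math.log(2) / total_bytes

# Illustrative numbers only: 1000 byte-level tokens at 2.6 nats each
# over 1000 bytes of validation text.
print(round(bits_per_byte(2.6 * 1000, 1000), 4))
```

This also shows why small val_bpb deltas like 3.7256 vs 3.7250 are meaningful on a fixed validation subset: the denominator (byte count) is identical across runs, so the delta reflects only the change in model loss.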