Notable Non-Record: Universal Transformer — 1.2249 BPB — Depth Recurrence with Iteration Embeddings#1110
Open
gowtham0992 wants to merge 2 commits into openai:main from
Notable Non-Record: Universal Transformer — 1.2249 BPB — Depth Recurrence with Iteration Embeddings
3 Unique Blocks × 4 Iterations = 12 Effective Layers + Per-Iteration Embeddings + 70% Param Savings + 4.95 MB Artifact
val_bpb: 1.2249 (seed=42) | 4.95 MB artifact | 8×H100 SXM, 555s training + 81s eval
Results (seed=42, 8×H100 SXM)
What's Novel vs Existing Depth Recurrence (PR #363)
PR #363 (Evangeline Kamin) already explored depth recurrence extensively with 250+ hours of experiments. Our submission adds two features from the original Universal Transformer paper (Dehghani et al., 2018) that PR #363 did not implement:
Per-iteration learnable embeddings (timestep encoding): a learnable vector added to the hidden state before each block execution, telling the model which iteration it is on. Without this, all iterations are computationally identical — the model has no way to differentiate pass 1 from pass 4.
Per-iteration learnable scales: these modulate the residual contribution of each effective layer, allowing different iterations to have different impact magnitudes.
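The two additions slot naturally into the depth-recurrent loop. A minimal PyTorch sketch (class and parameter names are illustrative assumptions, not the PR's exact code): 3 unique blocks reused over 4 iterations give 12 effective layers, with one learnable embedding per iteration and one learnable scale per effective layer.

```python
import torch
import torch.nn as nn


class DepthRecurrentStack(nn.Module):
    """Hypothetical sketch: depth recurrence with per-iteration
    embeddings and per-effective-layer scales (Universal Transformer
    style). `block_factory` builds the residual branch of one block
    (e.g. attention + MLP without the residual add)."""

    def __init__(self, block_factory, d_model, n_unique=3, n_iters=4):
        super().__init__()
        self.blocks = nn.ModuleList(block_factory() for _ in range(n_unique))
        self.n_unique, self.n_iters = n_unique, n_iters
        n_effective = n_unique * n_iters  # 3 x 4 = 12 effective layers
        # Timestep encoding: one learnable vector per iteration, added to
        # the hidden state so the model can tell pass 1 from pass 4.
        self.iter_emb = nn.Parameter(torch.zeros(n_iters, d_model))
        # One learnable scale per effective layer, modulating how much
        # each pass contributes to the residual stream.
        self.iter_scale = nn.Parameter(torch.ones(n_effective))

    def forward(self, x):
        for eff in range(self.n_unique * self.n_iters):
            it = eff // self.n_unique            # iteration index 0..3
            block = self.blocks[eff % self.n_unique]  # shared-block selection
            h = x + self.iter_emb[it]            # inject timestep encoding
            x = x + self.iter_scale[eff] * block(h)  # scaled residual update
        return x
```

The `eff % n_unique` indexing reproduces the sharing pattern described under Architecture: effective layers 0, 3, 6, and 9 all execute block 0.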
Comparison with PR #363
Our BPB is worse because we have far fewer unique parameters (8.1M vs ~15M). But our artifact is 3x smaller, and the iteration embeddings are a principled addition from the paper.
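The ~70% savings figure follows from the sharing pattern: only 3 of the 12 effective layers carry unique block weights, so block parameters shrink by 1 − 3/12 = 75%, and the unshared embeddings pull the whole-model savings a bit lower. A back-of-envelope check (the per-block count is an illustrative assumption, not the PR's exact number):

```python
# Back-of-envelope parameter accounting for depth recurrence.
# Counts are illustrative assumptions, not the PR's exact numbers.
n_effective = 12        # effective layers
n_unique = 3            # unique (shared) blocks
per_block = 1_000_000   # hypothetical parameters per transformer block

baseline_blocks = n_effective * per_block  # if every layer were unique
shared_blocks = n_unique * per_block       # with weight sharing

block_savings = 1 - shared_blocks / baseline_blocks
print(f"block-parameter savings: {block_savings:.0%}")  # 75%

# Embeddings and the tiny per-iteration vectors are not shared, so the
# whole-model savings land somewhat below 75% -- roughly the 70% quoted.
```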
Architecture
`effective_layer % 3` selects the shared block: block(0) is block(3) is block(6) is block(9).

Command
Compliance
train_gpt.py

References
Included Files
- `train_gpt.py` — full training script
- `train_seed42.txt` — training log
- `submission.json` — metadata
- `run.sh` — reproduction script
- `requirements.txt` — dependencies