Non-record: Shared-weight transformer with extended warmdown (1.1454 val_bpb)#470
Open
leofeasby wants to merge 1 commit into openai:main from
Conversation
This is a non-record submission to the 16MB track.
We study a shared-weight transformer in which a single transformer block is reused across depth (9 passes), forming a recurrent-style stack with U-Net skip connections.
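The reuse pattern described above can be sketched as follows. This is a hypothetical illustration, not the submission's code: the block here is a stand-in MLP rather than a full attention block, and the skip-merge wiring (concatenate-then-project on the mirrored "up" passes) is an assumption about how the U-Net connections are wired.

```python
import torch
import torch.nn as nn

class SharedWeightStack(nn.Module):
    """One block reused for n_passes, with U-Net skips from the first
    half of passes into the mirrored second half (sketch, not the
    submission's implementation)."""

    def __init__(self, dim: int, n_passes: int = 9):
        super().__init__()
        assert n_passes % 2 == 1, "odd depth: down path, bottleneck, up path"
        self.n_passes = n_passes
        # Stand-in for a full transformer block (attention + MLP);
        # a single parameter set, applied n_passes times.
        self.block = nn.Sequential(
            nn.LayerNorm(dim),
            nn.Linear(dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )
        half = n_passes // 2
        # Learned projections to merge each skip activation on the up path.
        self.skip_proj = nn.ModuleList(
            nn.Linear(2 * dim, dim) for _ in range(half)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        half = self.n_passes // 2
        saved = []
        for _ in range(half):             # "down" path: save activations
            x = x + self.block(x)
            saved.append(x)
        x = x + self.block(x)             # bottleneck pass
        for proj in self.skip_proj:       # "up" path: merge mirrored skips
            x = proj(torch.cat([x, saved.pop()], dim=-1))
            x = x + self.block(x)
        return x
```

Because the block's parameters are shared, depth (9 passes) adds compute but almost no parameters, which is what makes this interesting for a size-constrained track.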
Result:
The model reaches 1.1454 val_bpb after ~2.3 hours on 8×H100, with loss still decreasing at the end of training. Training terminated due to schedule constraints (LR→0), not convergence.
Key observation:
The majority of improvement occurs during extended warmdown. The model continues improving steadily throughout the low-LR phase, with no plateau observed within the explored horizon.
This behaviour is consistent with a regime in which performance is strongly influenced by schedule alignment, potentially more so than parameter capacity for this architecture. We do not claim this as a universal property, but as an observed characteristic of this shared-weight setup.
Notable components:
- Step-based warmdown trigger (WARMDOWN_START_STEP) to decouple the schedule from wallclock

This submission targets long-horizon optimisation behaviour rather than the 10-minute constraint, and aims to highlight differences in convergence dynamics between shared-weight and standard transformers.
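A step-based trigger of this kind might look like the sketch below. All constants here are placeholder values, not the submission's configuration; only the WARMDOWN_START_STEP name comes from the description above, and the warmup/plateau shape is an assumption.

```python
# Placeholder values -- the submission's actual constants live in its config.
WARMUP_STEPS = 256
WARMDOWN_START_STEP = 2000   # step-based trigger, independent of wallclock
TOTAL_STEPS = 8000
BASE_LR = 3e-4

def lr_at(step: int) -> float:
    """Linear warmup, constant plateau, then linear warmdown to zero.

    Keying the warmdown to a step count (rather than elapsed time) makes
    the schedule reproducible across hardware with different throughput.
    """
    if step < WARMUP_STEPS:
        return BASE_LR * (step + 1) / WARMUP_STEPS
    if step < WARMDOWN_START_STEP:
        return BASE_LR
    # Extended warmdown: LR decays linearly, reaching 0 at TOTAL_STEPS.
    frac = (TOTAL_STEPS - step) / (TOTAL_STEPS - WARMDOWN_START_STEP)
    return BASE_LR * max(frac, 0.0)
```

Under this shape, the "majority of improvement during extended warmdown" observation corresponds to the interval from WARMDOWN_START_STEP to TOTAL_STEPS, where the LR falls smoothly to zero.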