Non-record: Shared-weight transformer with extended warmdown (1.1454 val_bpb)#470

Open
leofeasby wants to merge 1 commit into openai:main from leofeasby:shared-weight-nonrecord-clean

Conversation

@leofeasby

This is a non-record submission to the 16MB track.

We study a shared-weight transformer in which a single transformer block is reused across depth (9 passes), forming a recurrent-style stack with U-Net skip connections.
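The pass structure described above can be sketched in plain Python. This is an illustrative sketch only, not the submission's actual code: the names (`block`, `shared_core_forward`, `NUM_PASSES`) and the 0.5 mixing coefficient for the skip connections are assumptions; the real block would be a full attention + MLP transformer layer.

```python
# Structural sketch: one shared block applied 9 times, with activations
# from the early passes saved and mixed back in during the late passes,
# U-Net style. All names and the mixing weight are hypothetical.

NUM_PASSES = 9

def block(x, pass_idx):
    # Stand-in for the shared transformer block (attention + MLP).
    # In the real model, per-pass scaling would be applied here.
    return x + 1.0  # placeholder computation

def shared_core_forward(x):
    skips = []
    half = NUM_PASSES // 2
    for i in range(NUM_PASSES):
        if i < half:
            skips.append(x)               # "encoder" half: save activations
        elif skips:
            x = 0.5 * (x + skips.pop())   # "decoder" half: mix in skip
        x = block(x, i)                   # same weights on every pass
    return x
```

The point of the sketch is the control flow: depth comes from reuse of a single block, while the skip stack gives later passes direct access to earlier activations.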

Result:
The model reaches 1.1454 val_bpb after ~2.3 hours on 8×H100, with loss still decreasing at the end of training. Training terminated due to schedule constraints (LR→0), not convergence.

Key observation:
The majority of improvement occurs during extended warmdown. The model continues improving steadily throughout the low-LR phase, with no plateau observed within the explored horizon.

This behaviour is consistent with a regime in which performance is strongly influenced by schedule alignment, potentially more so than by parameter capacity, for this architecture. We do not claim this as a universal property, but as an observed characteristic of this shared-weight setup.

Notable components:

  • Shared-core transformer (full weight sharing across depth)
  • Per-layer scaling (attention, MLP, residual mixing) to break symmetry
  • U-Net style skip connections across passes
  • Step-based warmdown control (WARMDOWN_START_STEP) to decouple schedule from wallclock
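The step-based warmdown control might look like the following minimal sketch. Only `WARMDOWN_START_STEP` is named in the submission; the linear-decay shape, the specific step counts, and the base learning rate here are assumptions for illustration.

```python
# Hypothetical step-based warmdown: constant LR until WARMDOWN_START_STEP,
# then linear decay to zero at TOTAL_STEPS. Keying the schedule to steps
# rather than wallclock decouples it from hardware throughput.

WARMDOWN_START_STEP = 6000   # hypothetical value
TOTAL_STEPS = 10000          # hypothetical value
BASE_LR = 3e-4               # hypothetical value

def lr_at(step):
    """Learning rate at a given optimizer step."""
    if step < WARMDOWN_START_STEP:
        return BASE_LR
    frac = (step - WARMDOWN_START_STEP) / (TOTAL_STEPS - WARMDOWN_START_STEP)
    return BASE_LR * max(0.0, 1.0 - frac)
```

Because the decay is pinned to a step index rather than elapsed time, extending the warmdown is a one-constant change, which matters given the observation that most of the improvement occurs in the low-LR phase.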

This submission targets long-horizon optimisation behaviour rather than the 10-minute constraint, and aims to highlight differences in convergence dynamics between shared-weight and standard transformers.

