
Add depth recurrence + SwiGLU submission (Apple M3 8GB) #8

Closed
iranzithierry wants to merge 2 commits into openai:main from iranzithierry:main

Conversation

@iranzithierry

Summary

  • Non-record submission exploring depth recurrence (weight sharing) and SwiGLU MLPs
  • 4 unique transformer blocks reused 3× = 12 effective layers at 640 dim
  • SwiGLU MLP replaces the relu² MLP for better parameter efficiency
  • Per-recurrence learnable gate scalars let the shared blocks specialize (see the sketch after this list)
  • Trained on an Apple M3 with 8GB RAM (hardware-limited, so results are directional)
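
For context, here is a minimal PyTorch sketch of the two ideas above. The submission's actual script (train_gpt_mlx.py) is written in MLX, and class names here (`SwiGLUMLP`, `RecurrentStack`) are illustrative, not taken from it:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUMLP(nn.Module):
    """SwiGLU feed-forward: down(silu(gate(x)) * up(x))."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))

class RecurrentStack(nn.Module):
    """Depth recurrence: a few unique blocks applied repeatedly
    (4 blocks x 3 passes = 12 effective layers), with one learnable
    gate scalar per recurrence so the reused weights can specialize."""
    def __init__(self, blocks: nn.ModuleList, n_recur: int = 3):
        super().__init__()
        self.blocks = blocks                       # weights shared across passes
        self.n_recur = n_recur
        self.gates = nn.Parameter(torch.ones(n_recur))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for r in range(self.n_recur):
            g = self.gates[r]
            for block in self.blocks:
                # block(x) is the residual-branch output; the gate
                # scales each recurrence's contribution to the stream
                x = x + g * block(x)
        return x
```

The appeal for a parameter-budgeted track: only the 4 unique blocks count toward the parameter budget, while the model computes at 12-layer depth.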

Changes

  • records/track_non_record_16mb/2026-03-18_M3_DepthRecurrence_SwiGLU/train_gpt_mlx.py
  • records/track_non_record_16mb/2026-03-18_M3_DepthRecurrence_SwiGLU/submission.json
  • records/track_non_record_16mb/2026-03-18_M3_DepthRecurrence_SwiGLU/README.md

Limitations

Trained on consumer hardware (M3, 8GB RAM); the score reflects hardware constraints, not the approach's ceiling. The same script can be run on 8×H100 for competitive results.

Add a non-record leaderboard submission exploring depth recurrence and SwiGLU MLPs, trained on an Apple M3 (8GB). Includes a README with architecture/hyperparameter notes, submission.json metadata, and a full training script (train_gpt_mlx.py) implementing:

  • 4 unique transformer blocks reused 3× (12 effective layers)
  • SwiGLU MLP in a wider 640-dim model
  • per-recurrence gates and U-Net skips
  • gradient clipping and split optimizers (Muon + Adam)
  • token streaming
  • int8+zlib quantization/roundtrip (sketched below)

Notes the limitations imposed by the hardware and the guidance to run on larger hardware for competitive results.
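
The int8+zlib roundtrip in that list is the part that ties training to the size budget: quantize the weights to int8, zlib-compress the bytes to measure the submission size, then dequantize to check the quantized model's quality. A minimal NumPy sketch, assuming per-tensor symmetric scaling (the script's exact scheme may differ):

```python
import zlib
import numpy as np

def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Per-tensor symmetric int8 quantization (assumed scheme)."""
    scale = max(float(np.abs(w).max()), 1e-8) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def compressed_size(tensors: dict) -> int:
    """Total zlib-compressed size of all int8 tensors, in bytes."""
    return sum(
        len(zlib.compress(quantize_int8(w)[0].tobytes(), level=9))
        for w in tensors.values()
    )

def roundtrip(w: np.ndarray) -> np.ndarray:
    """Dequantize back to float32 to evaluate the quantized model."""
    q, scale = quantize_int8(w)
    return q.astype(np.float32) * scale
```

The compressed byte count, not the raw float size, is presumably what the 16MB track budget is measured against, which is why the roundtrip belongs in the training script.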
@0hq
Collaborator

0hq commented Mar 18, 2026

You need a train.log and a val_bpb for a non-record submission!

0hq closed this Mar 19, 2026
gb250e referenced this pull request in gb250e/parameter-golf Mar 21, 2026
dhruvjatkar pushed a commit to dhruvjatkar/parameter-golf that referenced this pull request Mar 25, 2026
PR openai#672 maxes TTT at 30 epochs (590s/600s eval budget), so all future
improvements must be orthogonal to TTT. This update:
- Sets 1.0781 BPB (PR openai#672) as the new target to beat
- Reorders Top 8 directions: XSA-all confirmed at #1, Full GPTQ #2,
  SwiGLU #3, Muon-VS #4, aggressive quant #5, MASA #6,
  depth recurrence #7 with int6 risk warning, AdEMAMix #8
- Deprioritizes TTT-related directions already exploited by PR openai#672
- Collapses ~1000 lines of stale Round 0-3.9 session logs into a
  concise historical summary
- Removes resolved blockers (flash_attn, SSH hangs, local runtime)
- Adds fresh Round 1 section with 5 submitted experiments

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>