Non-record: Random Linear Maps + Learned Adapters (val_bpb=1.93, 2.4MB artifact)#874
Open
fielding wants to merge 5 commits into openai:main from
Conversation
size-matched control note to README
Author
Updated to include 3 full runs on the H100 along with my exploratory runs.
update submission
distillation file, fix artifact bytes and README consistency
Random Linear Maps + Learned Adapters
val_bpb: 1.93 (10 min, 8xH100 SXM) | 1.607 (3.75hr, 4xH200) | 2.4 MB artifact
The Idea
What if 90% of your model's weights were just noise?
Each linear layer gets a random base weight matrix, generated at init time from a fixed seed like `42` or `1337`. Those base weights cost zero bytes in the artifact because they're regenerated from the seed at eval. Only small rank-16 adapters (LoRA-style A and B matrices) are learned and stored. Think of it like giving someone a house made of random LEGO bricks and a small bag of correct ones... they have to figure out which random bricks are useful and nudge the rest into place with the adapters.

A 512-dim, 5-layer model normally stores around 25M parameters. This approach stores 2.2M; the other ~90% are deterministic noise from a seed. The artifact is ~2.4 MB (10 min, 8xH100) or 1.9 MB (3.75hr long run), well under the 16 MB budget.
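As a sanity check on that storage split, here's back-of-envelope arithmetic for the transformer blocks only at the 512-dim, 5-layer setting. The 4x MLP width is an assumption, and embeddings/norms (fully learned) aren't counted, which is why the block-level ratio comes out above 90%:

```python
# Illustrative parameter arithmetic -- transformer blocks only.
# Assumed dimensions: 512-dim, 5 layers, 4x MLP width, rank-16 adapters.
d_model, n_layers, rank = 512, 5, 16
d_mlp = 4 * d_model

# Frozen random base weights per layer: Q, K, V, O projections + MLP fc/proj.
base_per_layer = 4 * d_model * d_model + 2 * d_model * d_mlp

# A LoRA-style adapter for an (in, out) linear stores A (r x in) and B (out x r).
def adapter_params(d_in, d_out, r):
    return r * (d_in + d_out)

adapter_per_layer = (
    4 * adapter_params(d_model, d_model, rank)   # attention projections
    + adapter_params(d_model, d_mlp, rank)       # MLP fc
    + adapter_params(d_mlp, d_model, rank)       # MLP proj
)

base_total = n_layers * base_per_layer        # regenerated from seed, 0 bytes stored
adapter_total = n_layers * adapter_per_layer  # the only block weights in the artifact
print(base_total, adapter_total)  # 15728640 vs 737280
```

So the blocks carry ~15.7M seed-regenerated parameters against ~737K stored adapter parameters; the remaining stored budget goes to embeddings and other small fully-learned tensors.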
Results
Depth Sweep (fixed 40-min training budget, rank=16, 512-dim)
4-5 layers is the sweet spot. Shallower models train faster and rack up more steps, which matters more than depth when the base weights carry no learned information. Go too shallow (3L) and there isn't enough compositional depth for language. Go too deep (20L) and gradients straight up can't propagate through that many random projections. The model with 20 layers learned nothing.
Rank Sweep (768-dim, 11L, fixed 40-min budget)
This one's counterintuitive. Smaller adapters win. Rank 16 crushes rank 32 and 64 because the larger adapters need more training steps to converge, and the fixed time budget punishes them for it. This sweep was run at 768-dim/11L, a harder setting than the depth sweep. The directional finding (smaller rank wins) should hold at 512-dim, but the absolute BPB numbers aren't comparable to the depth table.
Scaling with Training Time (5L, rank=16, 512-dim)
Sliding BPB keeps improving with more steps, though with diminishing returns. The float BPB at 200K (1.66) is slightly worse than 50K (1.64), likely from training instability mid-run (loss spiked around step 104K) that the warmdown didn't fully recover from. Sliding window evaluation smooths this out, which is why the sliding BPB still improved. The model hasn't fully converged.
Architecture
`persistent=False` means the base weight never hits the state_dict. At load time, `__init__` regenerates it from the seed, and each layer gets a unique seed from its index.

Every attention projection (Q, K, V, output) and MLP layer (fc, proj) uses `RandomLinearWithAdapter`. Embeddings, norms, and other small parameters are fully learned.
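A minimal PyTorch sketch of what such a layer could look like. The class name follows the description above, but the argument names, init scaling, and seed scheme are illustrative assumptions, not the exact submission code:

```python
import torch
import torch.nn as nn

class RandomLinearWithAdapter(nn.Module):
    """Frozen random base weight (regenerated from a seed) plus a learned
    rank-r LoRA-style adapter. Sketch only, not the submission's exact code."""

    def __init__(self, d_in, d_out, rank=16, seed=1337, layer_index=0):
        super().__init__()
        # Deterministic base weight: the same seed yields the same matrix
        # at every load, so it never needs to be saved.
        g = torch.Generator().manual_seed(seed + layer_index)
        base = torch.randn(d_out, d_in, generator=g) / d_in ** 0.5
        # persistent=False keeps the buffer out of state_dict, so the base
        # weight contributes zero bytes to the artifact.
        self.register_buffer("base", base, persistent=False)
        # Only A and B are learned and stored. B starts at zero, so the
        # layer initially acts as a pure random projection.
        self.A = nn.Parameter(torch.randn(rank, d_in) / d_in ** 0.5)
        self.B = nn.Parameter(torch.zeros(d_out, rank))

    def forward(self, x):
        # Effective weight = frozen noise + low-rank learned delta.
        return x @ (self.base + self.B @ self.A).t()
```

Because the base is a non-persistent buffer, `state_dict()` contains only `A` and `B`, which is exactly why the artifact stays tiny.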
Model Configuration
What I Found
Depth has a sweet spot with random projections. 4-5 layers wins at a fixed time budget. More depth means fewer training steps, and step count is king when base weights carry no information. 20 layers learned nothing... literally.
Smaller adapters optimize better. Rank 16 beats 32 and 64. There's a capacity-optimization tradeoff here... bigger adapters have more capacity but need more steps to figure out how to use it.
Random projections can do language modeling. A ~2.4 MB model with 90% random weights hits 1.93 BPB in 10 minutes on 8xH100 (1.607 with extended training). The naive baseline (fully learned, 13.5 MB) hits 1.224 BPB. The gap is real, but the fact that it works at all is the interesting part. A natural follow-up is comparing against a size-matched fully learned model at ~2.4 MB to isolate the contribution of random maps vs model capacity. That experiment is planned but not yet run.
The artifact is hilariously small. ~2.4 MB is 15% of the 16 MB budget. You could fit six of these models in one artifact. Ensembles, multi-model voting, whatever you want... there's room.
3-Seed Validation (8xH100 SXM, 600s)
Seeds used for random base weight generation are derived from SEED + layer_index. They were chosen arbitrarily, not searched.
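A quick sketch of the property this derivation buys. The `SEED` value here is a stand-in (the actual constant isn't part of this note): regeneration is exactly reproducible per layer, and the `layer_index` offset gives each layer distinct noise.

```python
import torch

SEED = 1337  # stand-in value; per-layer seeds are derived as SEED + layer_index

def base_weight(layer_index, shape=(4, 4)):
    # A freshly seeded Generator makes regeneration at load time
    # deterministic; the offset gives each layer its own noise matrix.
    g = torch.Generator().manual_seed(SEED + layer_index)
    return torch.randn(*shape, generator=g)
```

`base_weight(i)` returns the same tensor on every call and every load, so it is never stored; different layer indices give different matrices.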
Run Commands