
Non-record: MoE exploration + multi-bit quantization analysis#480

Open
imyesung wants to merge 3 commits into openai:main from imyesung:moe-quant-analysis

Conversation


@imyesung imyesung commented Mar 23, 2026

Summary

Non-record submission with two negative results under the 16MB artifact cap:

  • Preliminary MoE negative result: a 2-expert soft-routing MoE (2 × 1.5x MLP) underperforms the dense control throughout the observed training window. I added moe_train_partial.log, the surviving partial 8xH100 SXM log; the RunPod pod died at step 2000, so the MoE conclusion should be interpreted as preliminary rather than a fully converged final result.
  • Leaderboard-relevant multi-bit quantization comparison: the dense control reaches 1.1456 val_bpb, which is within 0.0028 BPB of the March 20, 2026 leaderboard leader (1.1428). On that same trained dense model, int5 MLP costs +0.0068 BPB while int4 MLP costs +0.0655 BPB, making aggressive quantization unattractive for MoE expansion at this scale.

Included evidence

  • README.md with updated explanation and MoE-vs-dense checkpoint table
  • submission.json with updated metadata
  • train.log for the dense control / quantization comparison
  • moe_train_partial.log for the surviving MoE run
  • train_gpt.py
  • quant_comparison.png

Quantization Comparison Results

Config       Attn   MLP    Artifact   val_bpb   vs baseline
attn6_mlp6   int6   int6   15.14 MB   1.1456    baseline
attn6_mlp5   int6   int5   13.39 MB   1.1524    +0.0068
attn6_mlp4   int6   int4   11.51 MB   1.2111    +0.0655
attn5_mlp5   int5   int5   13.05 MB   1.1559    +0.0103
attn5_mlp4   int5   int4   11.29 MB   1.2183    +0.0727
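
For intuition on why the int5-to-int4 step is so much more damaging than int6-to-int5, here is a minimal symmetric per-tensor fake-quantization sketch. This is an illustration only: the PR's actual quantization scheme (rounding mode, any per-channel scaling, attention vs MLP handling) lives in train_gpt.py, and the function name here is made up.

```python
import numpy as np

def quantize_dequantize(w, bits):
    """Hypothetical sketch: symmetric per-tensor round-to-nearest
    quantization to a signed int-`bits` grid, then dequantize.
    Each halving of the grid roughly doubles the rounding error."""
    qmax = 2 ** (bits - 1) - 1              # e.g. 31 for int6, 15 for int5, 7 for int4
    scale = np.abs(w).max() / qmax          # one scale for the whole tensor
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(1024, 1024)).astype(np.float32)
for bits in (6, 5, 4):
    err = np.abs(quantize_dequantize(w, bits) - w).mean()
    print(f"int{bits}: mean abs weight error {err:.5f}")
```

The mean reconstruction error roughly doubles per bit removed, consistent with the table's pattern of a small int6-to-int5 BPB cost and a much larger int5-to-int4 cost, though the actual BPB impact also depends on how sensitive each layer is to weight noise.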

MoE Observed Checkpoints

Step   Dense control val_bpb   MoE val_bpb   Delta
500    1.4058                  1.4115        +0.0057
1000   1.3286                  1.3386        +0.0100
1500   1.3024                  1.3163        +0.0139
2000   1.2709                  1.2866        +0.0157
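
To make "2-expert soft-routing MoE" concrete, the layer shape being compared can be sketched as below. All names, dimensions, and the gelu approximation are assumptions for illustration; the real layer is defined in train_gpt.py. Soft routing means every token gets a gate-weighted mix of both experts, with no top-k expert dropping.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class SoftMoEMLP:
    """Hypothetical 2-expert soft-routing MLP block.
    Each expert is a gelu MLP with hidden width 1.5x the usual 4*d_model
    (the "2 x 1.5x MLP" configuration described in the summary)."""
    def __init__(self, d_model, n_experts=2, mult=1.5, seed=0):
        rng = np.random.default_rng(seed)
        h = int(4 * d_model * mult)
        self.w_gate = rng.normal(0, 0.02, (d_model, n_experts))
        self.w_in = rng.normal(0, 0.02, (n_experts, d_model, h))
        self.w_out = rng.normal(0, 0.02, (n_experts, h, d_model))

    def __call__(self, x):                       # x: (tokens, d_model)
        gates = softmax(x @ self.w_gate)         # (tokens, n_experts), rows sum to 1
        # Soft routing: run every expert on every token, then mix by gate weight.
        hidden = np.einsum('td,edh->eth', x, self.w_in)
        hidden = 0.5 * hidden * (1 + np.tanh(0.7978845608 * (hidden + 0.044715 * hidden**3)))
        outs = np.einsum('eth,ehd->etd', hidden, self.w_out)
        return np.einsum('te,etd->td', gates, outs)

moe = SoftMoEMLP(d_model=64)
y = moe(np.random.default_rng(1).normal(size=(8, 64)))
print(y.shape)  # (8, 64)
```

Note that because both experts run on every token, this layout pays the full compute and parameter cost of both experts, which is part of why it competes poorly with a dense control under a fixed artifact budget.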

…n analysis

Negative result showing MoE is structurally disadvantaged below 500M params
under the 16MB constraint. A multi-bit quantization comparison (int4/5/6) on
the same trained dense model shows that int4 MLP incurs +0.065 BPB degradation,
closing the MoE parameter-expansion path.
