This repository implements the Default MoE from the paper "Dense Backpropagation Improves Training for Sparse Mixture-of-Experts".
It builds on the GPT-NeoX library, specifically on a pull request implementing Dropless MoE. The main changes include:
- Default MoE implementation with an EMA update and filling in of missing expert outputs (see the sketch after this list).
- New "default_vector" MoE config type that adds a buffer containing default expert outputs.
- Additional arguments in the MoE forward pass for computing the default vector update.
- Minor changes to integrate the load-balancing loss and to keep the first layer dense (not using MoE).
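The core idea, per the paper, is that experts a token is not routed to still contribute a "default" output (an EMA of that expert's recent outputs), so the router receives a dense gradient. Below is a minimal, self-contained PyTorch sketch of that idea; the module name, buffer name, and per-expert loop are illustrative assumptions, not the repository's MegaBlocks-based implementation.

```python
# Minimal sketch of the Default MoE idea (illustration only, not the repo's code):
# each expert keeps a "default vector" (an EMA of its recent outputs) in a buffer;
# tokens not routed to an expert use that default vector in place of the missing
# expert output, so the router's softmax receives a dense gradient.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DefaultMoESketch(nn.Module):
    def __init__(self, hidden_size, num_experts, top_k=2, ema_decay=0.99):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k
        self.ema_decay = ema_decay
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        # Toy per-expert MLPs; the real implementation uses MegaBlocks grouped GEMMs.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden_size, 4 * hidden_size),
                          nn.GELU(),
                          nn.Linear(4 * hidden_size, hidden_size))
            for _ in range(num_experts)
        )
        # One default output vector per expert, updated by EMA rather than by gradients.
        self.register_buffer("default_vector", torch.zeros(num_experts, hidden_size))

    def forward(self, x):  # x: [tokens, hidden]
        logits = self.router(x)                      # [tokens, experts]
        probs = F.softmax(logits, dim=-1)
        _, top_idx = probs.topk(self.top_k, dim=-1)

        routed = torch.zeros_like(probs, dtype=torch.bool)
        routed.scatter_(1, top_idx, True)

        # Compute real outputs only for routed (token, expert) pairs.
        expert_out = torch.zeros(x.size(0), self.num_experts, x.size(1),
                                 device=x.device, dtype=x.dtype)
        for e, expert in enumerate(self.experts):
            mask = routed[:, e]
            if mask.any():
                out_e = expert(x[mask])
                expert_out[mask, e] = out_e
                # EMA update of this expert's default vector (no gradient flows here).
                with torch.no_grad():
                    self.default_vector[e].mul_(self.ema_decay).add_(
                        out_e.detach().mean(0), alpha=1 - self.ema_decay)

        # Fill in missing expert outputs with the default vectors, then combine with
        # the full softmax so every expert contributes to the router gradient.
        filled = torch.where(routed.unsqueeze(-1), expert_out,
                             self.default_vector.unsqueeze(0).to(x.dtype))
        return torch.einsum("te,teh->th", probs, filled)
```

Because the default vectors are updated under `torch.no_grad()`, they add essentially no backward cost; the only change to backpropagation is that the router's softmax over all experts now receives a dense gradient.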
Install PyTorch, DeepSpeed, and the GPT-NeoX requirements:
```bash
pip install --no-cache-dir --no-build-isolation torch==2.4.1 --index-url https://download.pytorch.org/whl/cu124
pip install --no-cache-dir --no-build-isolation -r gpt-neox/requirements/requirements.txt
pip install --no-cache-dir --no-build-isolation deepspeed==0.14.4
```
Adjust the PyTorch --index-url (cu124 above) to match the CUDA version for your GPUs.
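As an optional sanity check (this snippet is a suggestion, not part of the repository), confirm that the installed torch build sees your GPUs before continuing:

```python
# Quick environment check: verifies the torch build and GPU visibility.
import torch

print(torch.__version__)            # e.g. 2.4.1+cu124 for the CUDA 12.4 wheel
print(torch.cuda.is_available())    # should print True on a GPU node
print(torch.cuda.device_count())    # number of visible GPUs
```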
The implementation we build on uses MegaBlocks for dropless MoEs:
```bash
pip install --no-cache-dir --no-build-isolation megablocks==0.5.1
pip install --no-cache-dir --no-build-isolation grouped_gemm==0.1.6
```
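Likewise, a quick import check (again a suggestion, not part of the repository) confirms that MegaBlocks and grouped_gemm built correctly against your torch/CUDA install:

```python
# Import check for the MoE kernel dependencies; failures here usually indicate
# a torch/CUDA version mismatch with the compiled extensions.
import torch
import megablocks
import grouped_gemm

print("CUDA available:", torch.cuda.is_available())
print("megablocks and grouped_gemm imported successfully")
```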
See an example config in configs/default-moe-2B.yml. You will need to edit the train data paths to match your own dataset.
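For reference, the data-path entries typically look like the hedged excerpt below; treat it as an illustration in the usual GPT-NeoX config style with placeholder paths, and take the exact key names and values from configs/default-moe-2B.yml itself.

```yaml
# Illustrative excerpt only; key names follow the common GPT-NeoX convention and
# the placeholder paths must point to your own preprocessed dataset prefixes.
{
  "train-data-paths": ["/path/to/your/train_text_document"],
  "valid-data-paths": ["/path/to/your/valid_text_document"],
  "test-data-paths": ["/path/to/your/test_text_document"]
}
```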
Launch a distributed training run with the following command:
```bash
python deepy.py train.py configs/default-moe-2B.yml
```
See eval.py and generate.py from the original GPT-NeoX library for examples of evaluation and generation with trained model checkpoints.
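For instance (a hedged illustration; the exact arguments and any additional configs, such as a text-generation config, follow upstream GPT-NeoX and may differ), both scripts use the same launcher pattern as training:

```bash
python deepy.py eval.py configs/default-moe-2B.yml
python deepy.py generate.py configs/default-moe-2B.yml
```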