@siddharth9820 siddharth9820 commented Jun 27, 2022

This is the same fix as this PR, but for CPU offloading. The corresponding PR in the Megatron-DeepSpeed repo, which adds CPU-offloading support, is here: link

Here is the loss curve after the fix, compared with ZeRO stage 0, for the following setting:

Base Model - 1.3B
Number of Experts - 8
Batch Size - 256
Machine - Azure A100 40GB
Number of GPUs - 8
Dataset - BookCorpus
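
For anyone who wants to reproduce the comparison, the two runs could be configured roughly as sketched below. This is not the config used for the curve. ZeRO stage 2 is inferred from the branch name, mapping "Batch Size - 256" to `train_batch_size` (the global batch size) is an assumption, and `build_ds_config` is a hypothetical helper. The model size and expert count are set on the Megatron-DeepSpeed side and are not shown.

```python
import json


def build_ds_config(zero_stage: int, cpu_offload: bool,
                    global_batch_size: int = 256) -> dict:
    """Hypothetical helper: build a minimal DeepSpeed-style config dict."""
    zero = {"stage": zero_stage}
    if cpu_offload:
        # Keep optimizer state in host memory instead of on the GPU.
        zero["offload_optimizer"] = {"device": "cpu"}
    return {
        # Assumption: 256 is the global batch size across all 8 GPUs.
        "train_batch_size": global_batch_size,
        "zero_optimization": zero,
    }


# Baseline run: ZeRO stage 0, no offloading.
baseline = build_ds_config(zero_stage=0, cpu_offload=False)
# Run exercising the fixed path: ZeRO stage 2 with CPU offloading.
offload = build_ds_config(zero_stage=2, cpu_offload=True)

print(json.dumps(offload, indent=2))
```

If the fix is correct, the loss curves of the two runs should track each other closely, since offloading changes where optimizer state lives and not the math of the update.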

[image: loss curve after the fix compared with ZeRO stage 0]

@jeffra jeffra merged commit b3388e1 into master Jul 7, 2022
@jeffra jeffra deleted the siddharth/moe-z2+offload-bug-fix branch July 7, 2022 20:58