@siddharth9820
Contributor

After the optimizer step, ZeRO stage 2 (Z2) uses incorrect partition ids when copying the updated fp32 params to their fp16 counterparts. As a result, only a fraction of the non-expert parameters are actually trained. For reference, here are the loss curves before and after the bug fix, along with the setup used.

Base Model - 1.3B
Number of Experts - 8
Batch Size - 256
Machine - Azure A100 40GB
Number of GPUs - 8
Dataset - BookCorpus

[image: loss curves before and after the bug fix]
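
To illustrate the failure mode described above, here is a toy sketch in plain Python. It is not the DeepSpeed code: the function name, the list-based "partitions", the group sizes, and the rank ids are all hypothetical, and it assumes the bug amounts to reusing one partition id for every param group instead of looking it up per group.

```python
def copy_fp32_to_fp16(fp16_groups, fp32_partitions, partition_ids):
    """Write each group's updated fp32 partition into its fp16 slot.

    fp16_groups[g][p]  : fp16 values of partition p in param group g
    fp32_partitions[g] : this rank's updated fp32 master copy for group g
    partition_ids[g]   : slot this rank is assumed to own in group g
    """
    for g, fp32_partition in enumerate(fp32_partitions):
        fp16_groups[g][partition_ids[g]] = list(fp32_partition)
    return fp16_groups


def make_fp16_groups():
    # Group 0: non-expert params split over 4 data-parallel ranks (assumed).
    # Group 1: expert params split over a smaller group of 2 ranks (assumed).
    return [[[0.0], [0.0], [0.0], [0.0]], [[0.0], [0.0]]]


# Hypothetical rank: id 3 in the non-expert group, id 1 in the expert group.
updated_fp32 = [[1.0], [2.0]]

# Buggy variant: a single id (here the expert-group id) reused for all groups.
buggy = copy_fp32_to_fp16(make_fp16_groups(), updated_fp32, [1, 1])

# Fixed variant: the id is looked up separately for each param group.
fixed = copy_fp32_to_fp16(make_fp16_groups(), updated_fp32, [3, 1])

print("buggy non-expert group:", buggy[0])
print("fixed non-expert group:", fixed[0])
```

In the buggy variant the rank's own non-expert slot never receives the updated values, which is one way a subset of the non-expert parameters could end up effectively untrained.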

@siddharth9820 changed the title from "Wrong partition_id in the fp32_param -> fp16 param copying in Z2 for MoE" to "Wrong partition_id while copying fp32_params -> fp16 params in Z2 for MoE" on Jun 27, 2022
@siddharth9820
Contributor Author

Here's the comparison with ZeRO disabled.
[image: loss comparison with ZeRO disabled]

Collaborator

@jeffra left a comment

Excellent catch, @siddharth9820. We also discussed offline the existing issue with the offload code path, which we agreed to address in a follow-up PR.

3 participants