[debug] DebugUnderflowOverflow doesn't work with DP #12816

Merged
stas00 merged 1 commit into huggingface:master from stas00:debug-underflow-dp on Jul 21, 2021

@stas00 commented Jul 21, 2021

As reported in #12815, DebugUnderflowOverflow breaks under DP: on replication the model gets new references to its sub-modules/params, while the old references are needed to track the model layer names.
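To illustrate the failure mode, here is a minimal pure-Python sketch. It assumes the debugger keeps a lookup from module object to layer name; the `Module` class is a hypothetical stand-in (not `torch.nn.Module`), and `copy.deepcopy` only stands in for DP replication, which works differently in detail.

```python
import copy


class Module:
    """Hypothetical stand-in for a framework module (not torch.nn.Module)."""

    def __init__(self, **children):
        self.children = children

    def named_modules(self, prefix=""):
        # Yield (qualified_name, module) for this module and all descendants.
        yield prefix, self
        for name, child in self.children.items():
            child_prefix = f"{prefix}.{name}" if prefix else name
            yield from child.named_modules(child_prefix)


model = Module(encoder=Module(layer0=Module()), head=Module())

# Lookup keyed by the module objects themselves (identity-based hashing).
module_names = {m: name for name, m in model.named_modules()}

# Stand-in for replication: the replica is built from new objects.
replica = copy.deepcopy(model)

original_layer = model.children["encoder"].children["layer0"]
replica_layer = replica.children["encoder"].children["layer0"]

original_lookup = module_names.get(original_layer)  # name is found
replica_lookup = module_names.get(replica_layer)    # not in the lookup
```

The original sub-module resolves to its layer name, while the replica's sub-module is a different object and is missing from the lookup, which is the kind of stale-reference problem described above.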

A workaround might be possible, most likely by overriding torch.nn.parallel.data_parallel.replicate to refresh the model references after replication, but this is not required at the moment, since DDP (or a single GPU) works just fine.

So this PR adds a clean assert when DP is used, instead of a confusing exception. It also updates the docs.
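A guard of this kind could look roughly like the sketch below. This is not the code merged in this PR: the function name `ensure_debug_supported`, its parameters, and the error text are hypothetical, and it assumes that more than one visible GPU without distributed mode means the model will be wrapped in DP.

```python
def ensure_debug_supported(n_gpu: int, is_distributed: bool) -> None:
    """Fail early with a readable message if the debugger would run under DP.

    Hypothetical helper: `n_gpu` is the number of GPUs used by this process,
    `is_distributed` is True when running under DDP.
    """
    # Assumption: several GPUs in a single non-distributed process implies DP.
    if n_gpu > 1 and not is_distributed:
        raise ValueError(
            "DebugUnderflowOverflow is not supported under DP. "
            "Use DDP or a single GPU instead."
        )


# Single GPU and DDP pass silently; DP is rejected with a clear message.
ensure_debug_supported(n_gpu=1, is_distributed=False)
ensure_debug_supported(n_gpu=2, is_distributed=True)
```

Raising before the debugger is attached means the user sees one explicit message rather than a failure from deep inside a forward hook.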

Fixes: #12815

@sgugger

@sgugger left a comment

Thanks for fixing!

@stas00 merged commit cf0755a into huggingface:master Jul 21, 2021
@stas00 deleted the debug-underflow-dp branch July 21, 2021 16:36

Development

Successfully merging this pull request may close these issues.

DebugUnderflowOverflow crashes with Multi-GPU training