
[Feature] Detect anomalous parameters#1547

Merged
ZwwWayne merged 11 commits into open-mmlab:master from jshilong:debug_dead_lock
Dec 14, 2021

Conversation

@jshilong
Collaborator

@jshilong jshilong commented Dec 1, 2021

Motivation

Sometimes there are anomalous parameters in the training phase, mainly in two cases:

  • Some parameters are not used in the forward pass due to an incorrect implementation, which causes an error when calling `loss.backward()`
  • Some parameters are not used to produce the loss, which causes a deadlock when calling `loss.backward()`

In both cases, the anomalous parameters are not included in the computational graph rooted at the loss.
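The "not included in the computational graph rooted at the loss" check can be sketched as a simple reachability traversal. The toy `Node` class below stands in for PyTorch's autograd graph (walking `loss.grad_fn` and its `next_functions` in the real case); all names here are illustrative, not the PR's actual code.

```python
# Sketch of anomalous-parameter detection as graph reachability:
# walk from the loss node toward the leaves, collect every
# parameter reached; parameters never reached are "anomalous".

class Node:
    """Toy stand-in for an autograd graph node."""
    def __init__(self, name, parents=()):
        self.name = name
        self.parents = list(parents)  # edges toward the leaves

def reachable_params(loss_node, all_params):
    """Return the subset of all_params reachable from loss_node."""
    seen, stack, used = set(), [loss_node], set()
    while stack:
        node = stack.pop()
        if id(node) in seen:
            continue
        seen.add(id(node))
        if node.name in all_params:
            used.add(node.name)
        stack.extend(node.parents)
    return used

# Toy graph: loss <- add <- (w1, w2); w3 never feeds into the loss.
w1, w2, w3 = Node("w1"), Node("w2"), Node("w3")
loss = Node("loss", [Node("add", [w1, w2])])

params = {"w1", "w2", "w3"}
used = reachable_params(loss, params)
anomalous = params - used
print(sorted(anomalous))  # parameters outside the loss graph
```

In the real hook the same idea applies, except the traversal follows `grad_fn.next_functions` and a parameter that never appears in that walk is reported before `loss.backward()` is called.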

Modification

I added a debug option named `detect_anomalous_params` to `OptimizerHook`, which can help you find anomalous parameters.
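Enabling the option would look something like the MMCV-style config fragment below (a sketch; the `grad_clip` value is illustrative and the exact hook construction may differ from the merged code).

```python
# MMCV-style optimizer hook config with the new debug flag enabled.
# When detect_anomalous_params is True, OptimizerHook reports
# parameters that did not contribute to the loss; it is meant only
# for debugging, since the extra graph traversal slows training.
optimizer_config = dict(grad_clip=None, detect_anomalous_params=True)
print(optimizer_config["detect_anomalous_params"])
```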

BC-breaking (Optional)

None

@jshilong jshilong requested a review from ZwwWayne December 1, 2021 11:20
@jshilong jshilong changed the title from [Feature]Detect anomalous parameters to [Feature] Detect anomalous parameters Dec 1, 2021
@jshilong
Collaborator Author

jshilong commented Dec 2, 2021

Ping @ZwwWayne

@zhouzaida
Collaborator

After the above comments are resolved and CI passes, this PR can be merged.

@ZwwWayne
Collaborator

ZwwWayne commented Dec 8, 2021

Can we check whether it slows down the training speed if we set `detect_anomalous_params=True`?
This PR can be merged after checking the influence on training speed and updating this information in the docstring.

@zhouzaida
Collaborator

> Can we check whether it slows down the training speed if we set `detect_anomalous_params=True`? This PR can be merged after checking the influence on training speed and updating this information in the docstring.

It probably slows down the training speed, but the flag should only be enabled for debugging.

@zhouzaida
Collaborator

Please merge the upstream master branch into this PR; this will resolve the error when building the documentation.

@jshilong
Collaborator Author

jshilong commented Dec 9, 2021

> > Can we check whether it slows down the training speed if we set `detect_anomalous_params=True`? This PR can be merged after checking the influence on training speed and updating this information in the docstring.
>
> It probably slows down the training speed, but the flag should only be enabled for debugging.

It is only used for debugging

@jshilong
Collaborator Author

jshilong commented Dec 9, 2021

> Can we check whether it slows down the training speed if we set `detect_anomalous_params=True`? This PR can be merged after checking the influence on training speed and updating this information in the docstring.

I will add a notice in the docstring to indicate this is only used for debugging

@zhouzaida
Collaborator

> I will add a notice in the docstring to indicate this is only used for debugging

Hi @jshilong , we will merge the PR after you add a notice in the docstring.

@jshilong
Collaborator Author

> > I will add a notice in the docstring to indicate this is only used for debugging
>
> Hi @jshilong , we will merge the PR after you add a notice in the docstring.

Done

@ZwwWayne ZwwWayne merged commit 22e73d6 into open-mmlab:master Dec 14, 2021