vision classification QAT tutorial: fix for DDP by vkuzo · Pull Request #2191 · pytorch/vision

vkuzo · 2020-05-07T00:23:33Z

Stack from ghstack:

#2191 vision classification QAT tutorial: fix for DDP

Summary:

Makes the classification QAT tutorial not crash when used
with DDP. There were two issues:

1. The model was moved to GPU before the observers were added, and they
   are created on CPU. In the context of this repo, the fix is to finalize
   the model before moving to GPU. We can potentially follow up with a
   better error message in the future, in a separate PR.
2. The QAT conversion was running on the DDP'ed model, which had various
   problems. The fix is to unwrap the model from DDP before cloning it for
   evaluation.

There is still work to do on verifying that BN is working correctly in
QAT + DDP, but saving that for a separate PR.
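
For readers who want to see the two fixes side by side, here is a minimal sketch of the ordering described in the summary. It is not the diff from this PR: `TinyNet`, `unwrap`, `build_qat_model` and `quantized_eval_copy` are hypothetical names made up for illustration, and the sketch assumes the eager-mode `torch.quantization` API (newer PyTorch releases expose the same functions under `torch.ao.quantization`). The DDP wrapping step is left as a comment so the snippet runs in a single process.

```python
import copy

import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel


class TinyNet(nn.Module):
    """Hypothetical stand-in for the tutorial's quantizable model."""

    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()
        self.conv = nn.Conv2d(3, 4, kernel_size=3)
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.conv(self.quant(x)))


def unwrap(model):
    # DDP keeps the original network under `.module`.
    return model.module if isinstance(model, DistributedDataParallel) else model


def build_qat_model(device):
    model = TinyNet()
    model.train()
    # Use the qconfig that matches this build's quantized engine.
    model.qconfig = torch.quantization.get_default_qat_qconfig(
        torch.backends.quantized.engine
    )
    # Issue 1: add observers / fake-quant modules while the model is on CPU ...
    torch.quantization.prepare_qat(model, inplace=True)
    # ... and only then move the whole model, observers included, to the device.
    model.to(device)
    return model


def quantized_eval_copy(model):
    # Issue 2: convert a CPU clone of the unwrapped model, not the DDP wrapper.
    eval_model = copy.deepcopy(unwrap(model))
    eval_model.eval()
    eval_model.to(torch.device("cpu"))
    torch.quantization.convert(eval_model, inplace=True)
    return eval_model


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = build_qat_model(device)
# In a distributed run the model would be wrapped at this point, e.g.
# model = DistributedDataParallel(model, device_ids=[local_rank])
model(torch.randn(2, 3, 8, 8, device=device))  # one pass so observers see data
quantized = quantized_eval_copy(model)
```

Because `quantized_eval_copy` works on a deep copy, the training model (and its DDP wrapper, if any) is left untouched and can keep training after each evaluation.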

Test Plan:

```
python -m torch.distributed.launch --use_env references/classification/train_quantization.py --data-path {path_to_imagenet_1k} --output_dir {output_dir}
```
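
As background on the `--use_env` flag in the command above: with it, `torch.distributed.launch` passes the per-process settings through environment variables (`RANK`, `WORLD_SIZE`, `LOCAL_RANK`) instead of a `--local_rank` command-line argument. The sketch below shows one way a training script could read them. The function name `distributed_env` is hypothetical and the actual parsing in the reference script may differ.

```python
import os


def distributed_env(environ=os.environ):
    """Read the per-process settings exported by the launcher (hypothetical helper)."""
    if "RANK" not in environ or "WORLD_SIZE" not in environ:
        # Not started by the launcher: treat as a single-process run.
        return None
    return {
        "rank": int(environ["RANK"]),
        "world_size": int(environ["WORLD_SIZE"]),
        # LOCAL_RANK selects the GPU on the current machine.
        "local_rank": int(environ.get("LOCAL_RANK", 0)),
    }


settings = distributed_env()
if settings is None:
    print("not running under a distributed launcher")
else:
    print("rank {rank} of {world_size}, local rank {local_rank}".format(**settings))
```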

Reviewers:

Subscribers:

Tasks:

Tags:

ghstack-source-id: 7738a37
Pull Request resolved: #2191

codecov-io · 2020-05-07T00:54:27Z

Codecov Report

Merging #2191 into gh/vkuzo/1/base will not change coverage.
The diff coverage is n/a.

@@               Coverage Diff               @@
##           gh/vkuzo/1/base   #2191   +/-   ##
===============================================
  Coverage             0.42%   0.42%           
===============================================
  Files                   92      92           
  Lines                 7448    7448           
  Branches              1138    1138           
===============================================
  Hits                    32      32           
  Misses                7408    7408           
  Partials                 8       8

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 8c95855...c33e889.

fmassa

Thanks!

fmassa · 2020-05-11T12:54:15Z

@vkuzo The PR you sent went to the wrong branch -- ghstack only works for the PyTorch repo. Can you send another PR targeting the master branch?

vkuzo · 2020-05-12T20:31:26Z

> @vkuzo The PR you sent went to the wrong branch -- ghstack only works for the PyTorch repo. Can you send another PR targeting the master branch?

ah, sorry about that, didn't realize ghstack doesn't work with this repo. Will do!

Summary: Redo of #2191

vkuzo mentioned this pull request May 7, 2020

Broadcasting does not work for Quantization aware training with multiple GPUs pytorch/pytorch#37270

Closed

vkuzo requested a review from raghuramank100 May 7, 2020 00:27

zhangguanheng66 added the module: models.quantization label May 7, 2020

fmassa approved these changes May 11, 2020

fmassa merged commit 4e39a43 into gh/vkuzo/1/base May 11, 2020

fmassa deleted the gh/vkuzo/1/head branch May 11, 2020 12:52

vkuzo mentioned this pull request May 18, 2020

vision classification QAT tutorial: fix for DDP (redo) #2230

Merged
