vision classification QAT tutorial: fix for DDP #2191

Merged: fmassa merged 1 commit into gh/vkuzo/1/base from gh/vkuzo/1/head on May 11, 2020
Conversation

vkuzo (Contributor) commented May 7, 2020

Stack from ghstack:

Summary:

Makes the classification QAT tutorial not crash when used
with DDP. There were two issues:

  1. The model was moved to GPU before the observers were added, and
    observers are created on CPU. In the context of this repo, the fix is
    to finalize the model before moving it to GPU. We can potentially
    follow up with a better error message in a separate PR.
  2. The QAT conversion was running on the DDP-wrapped model, which caused
    various problems. The fix is to unwrap the model from DDP before
    cloning it for evaluation.

There is still work to do on verifying that BN works correctly in
QAT + DDP, but that is saved for a separate PR.
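
The two fixes above can be sketched roughly as follows. This is a minimal illustration and not the code from the PR: `TinyNet`, `build_qat_model`, `unwrap`, and `clone_for_eval` are hypothetical names, and the sketch assumes that "finalizing" the model means calling `prepare_qat` on it. `nn.DataParallel` is accepted next to `DistributedDataParallel` only so the helper can be tried without initializing a process group.

```python
import copy

import torch
import torch.nn as nn
import torch.quantization as tq  # also exposed as torch.ao.quantization in later releases


class TinyNet(nn.Module):
    """Hypothetical stand-in for the tutorial's quantizable model."""

    def __init__(self):
        super().__init__()
        self.quant = tq.QuantStub()
        self.conv = nn.Conv2d(3, 4, kernel_size=3)
        self.relu = nn.ReLU()
        self.dequant = tq.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.relu(self.conv(self.quant(x))))


def build_qat_model(device):
    """Issue 1: add observers / fake-quant modules first, then move to the device."""
    model = TinyNet()
    model.train()  # prepare_qat expects a model in training mode
    model.qconfig = tq.get_default_qat_qconfig("fbgemm")
    tq.prepare_qat(model, inplace=True)  # observers are created on CPU here
    model.to(device)  # moving afterwards carries the observers along
    return model


def unwrap(model):
    """Issue 2, step 1: return the underlying module if `model` is wrapped."""
    wrappers = (nn.parallel.DistributedDataParallel, nn.DataParallel)
    return model.module if isinstance(model, wrappers) else model


def clone_for_eval(model):
    """Issue 2, step 2: deep-copy the unwrapped module and put the copy in eval mode."""
    eval_model = copy.deepcopy(unwrap(model))
    eval_model.eval()
    return eval_model
```

Under these assumptions, the QAT conversion (`torch.quantization.convert`) would be applied to the copy returned by `clone_for_eval`, leaving the wrapped training model untouched.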

Test Plan:

python -m torch.distributed.launch --use_env references/classification/train_quantization.py --data-path {path_to_imagenet_1k} --output_dir {output_dir}
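
With `--use_env`, the launcher passes the local rank through the `LOCAL_RANK` environment variable instead of a `--local_rank` argument, alongside `RANK` and `WORLD_SIZE`. As a rough sketch of how a training script might read those values (the helper name `read_dist_env` is hypothetical and is not taken from the tutorial):

```python
import os


def read_dist_env(environ=os.environ):
    """Read the rank settings that the launcher exports as environment variables."""
    if "RANK" not in environ or "WORLD_SIZE" not in environ:
        return None  # not started by a distributed launcher
    return {
        "rank": int(environ["RANK"]),
        "world_size": int(environ["WORLD_SIZE"]),
        "local_rank": int(environ.get("LOCAL_RANK", 0)),
    }
```

Returning `None` when the variables are absent lets the same script fall back to single-process training.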

Reviewers:

Subscribers:

Tasks:

Tags:

[ghstack-poisoned]
vkuzo added a commit that referenced this pull request May 7, 2020
ghstack-source-id: 7738a37
Pull Request resolved: #2191
@vkuzo vkuzo requested a review from raghuramank100 May 7, 2020 00:27
codecov-io commented May 7, 2020

Codecov Report

Merging #2191 into gh/vkuzo/1/base will not change coverage.
The diff coverage is n/a.

@@               Coverage Diff               @@
##           gh/vkuzo/1/base   #2191   +/-   ##
===============================================
  Coverage             0.42%   0.42%           
===============================================
  Files                   92      92           
  Lines                 7448    7448           
  Branches              1138    1138           
===============================================
  Hits                    32      32           
  Misses                7408    7408           
  Partials                 8       8           

Last update 8c95855...c33e889.

zhangguanheng66 added the "module: models.quantization" label May 7, 2020
fmassa (Member) left a comment

Thanks!

@fmassa fmassa merged commit 4e39a43 into gh/vkuzo/1/base May 11, 2020
@fmassa fmassa deleted the gh/vkuzo/1/head branch May 11, 2020 12:52
fmassa (Member) commented May 11, 2020

@vkuzo The PR you sent went to the wrong branch -- ghstack only works for the PyTorch repo. Can you send another PR targeting the master branch?

vkuzo (Contributor, Author) commented May 12, 2020

> @vkuzo The PR you sent went to the wrong branch -- ghstack only works for the PyTorch repo. Can you send another PR targeting the master branch?

Ah, sorry about that, I didn't realize ghstack doesn't work with this repo. Will do!

vkuzo added a commit that referenced this pull request May 18, 2020
Summary:

Redo of #2191
fmassa pushed a commit that referenced this pull request May 18, 2020
Labels

module: models.quantization Issues related to the quantizable/quantized models
