torchvision QAT tutorial: update for QAT with DDP #2280
Conversation
fmassa
left a comment
Thanks!
Wow, I just realized that there might have been an issue with the location of the .to(device) in the previous implementation.
Also, can't you convert the model to SyncBatchNorm before creating the optimizer? It could maybe be safer, if it works?
```
model.to(device)
if args.distributed and args.sync_bn:
    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)
```
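The ordering being suggested (move to device, then convert to SyncBatchNorm, then create the optimizer) can be sketched as below. The `distributed`/`sync_bn` flags and the toy model are hypothetical stand-ins for the reference script's `args` and real model:

```python
import torch

# hypothetical stand-ins for the reference script's args and model
distributed = False   # args.distributed
sync_bn = False       # args.sync_bn
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.BatchNorm2d(8))

# 1. move the model to the target device first
model = model.to(device)

# 2. swap BatchNorm for SyncBatchNorm (only meaningful under DDP);
#    pytorch/pytorch#38729 makes the converted modules keep the device
if distributed and sync_bn:
    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)

# 3. create the optimizer last, so it tracks the final set of parameters
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
```

Creating the optimizer after all model transformations guarantees its param groups reference exactly the parameters the training loop will update.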
At some point in time, converting to SyncBatchNorm would (erroneously) create some tensors on the CPU, despite the model being on the GPU. I suppose this has been fixed? Otherwise we would need to call model.to(device) again after the SyncBatchNorm conversion.
Now that I think about it a bit more, it looks like finetuning was broken before because of the location of the to(device) call: it was done after creating the optimizer?
yep, the SyncBatchNorm device affinity was fixed in pytorch/pytorch#38729, and this diff assumes the fix is present in torch.
#2230 (by me) originally moved the to(device) call down to unbreak the tutorial, because at that time device affinity was broken with QAT. This PR moves it back to its original place, since the bugs are fixed.
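A quick way to confirm that a module-swapping transform respected device affinity is to scan all parameters and buffers for strays. This is a generic sketch, not code from the PR; `device` is CPU here so it runs anywhere:

```python
import torch

# toy model standing in for the prepared/converted model
model = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.BatchNorm1d(4))
device = torch.device("cpu")  # stands in for the training GPU
model = model.to(device)

# scan for tensors a module-swapping transform may have left on the
# wrong device (the failure mode fixed in pytorch/pytorch#38729)
stray = [name for name, t in
         list(model.named_parameters()) + list(model.named_buffers())
         if t.device != device]
print(stray)  # [] when device affinity is respected
```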
Summary:

We've made two recent changes to QAT in PyTorch core:

1. add support for SyncBatchNorm
2. make eager mode QAT prepare scripts respect device affinity

This PR updates the torchvision QAT reference script to take advantage of both of these. This should be landed after pytorch/pytorch#39337 (the last PT fix) to avoid compatibility issues.

Test Plan:

```
python -m torch.distributed.launch --nproc_per_node 8 --use_env references/classification/train_quantization.py --data-path {imagenet1k_subset} --output-dir {tmp} --sync-bn
```

Reviewers:

Subscribers:

Tasks:

Tags:
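As a rough illustration of the eager-mode QAT prepare flow the summary refers to, here is a minimal sketch (assuming the default `fbgemm` QAT qconfig; the real script in references/classification/train_quantization.py does much more, and the 2020-era code used the older `torch.quantization` namespace):

```python
import torch
import torch.ao.quantization as tq

# toy model; the reference script uses a torchvision quantizable model
model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU())
model.train()  # prepare_qat requires the model to be in training mode
model.qconfig = tq.get_default_qat_qconfig("fbgemm")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# with pytorch/pytorch#39337, prepare_qat keeps the newly inserted
# observer / fake-quant modules on the model's device
tq.prepare_qat(model, inplace=True)
```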
good point, fixed