[wip] Replace optimizers in torch.optim with the ones from torch.optim._multi_tensor #49039

izdeby wants to merge 52 commits into gh/izdeby/69/base
Conversation
💊 CI failures summary and remediations
As of commit f456748 (more details on the Dr. CI page):
🕵️ 16 new failures recognized by patterns
The following CI failures do not appear to be due to upstream breakages:
| Job | Step | Action |
|---|---|---|
|  | Ensure correct trailing newlines | 🔁 rerun |
ci.pytorch.org: 1 failed
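The failing step is a lint that requires tracked files to end with a newline. As a rough illustration only (the actual lint's file set and rules are not shown here), a stand-in check could look like:

```python
import pathlib

def missing_trailing_newline(root="."):
    # Report Python files whose last byte is not a newline.
    # The "*.py" glob is an assumption; the real lint may cover more files.
    bad = []
    for p in pathlib.Path(root).rglob("*.py"):
        data = p.read_bytes()
        if data and not data.endswith(b"\n"):
            bad.append(str(p))
    return bad
```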
This comment was automatically generated by Dr. CI.
….optim._multi_tensor"

Differential Revision: [D25406490](https://our.internmc.facebook.com/intern/diff/D25406490)

------

### Benchmark results

| Optimizer (config) | Current | Foreach |
|---|---|---|
| SGD (lr=1e-3, momentum=1, dampening=0, weight_decay=1, nesterov=True) | 201.63 ms | 56.99 ms |
| Adam (weight_decay=1., amsgrad=True) | 233.27 ms | 46.89 ms |
| AdamW (weight_decay=1., amsgrad=True) | 371.18 ms | 121.04 ms |
| RMSprop (weight_decay=1, momentum=1, centered=True) | 364.88 ms | 47.52 ms |
| Rprop (lr=1e-2, etas=(0.5, 1.2), step_sizes=(1e-6, 50)) | 1.43 s | 1.26 s |
| ASGD (weight_decay=1) | 165.39 ms | 40.61 ms |
| Adamax (weight_decay=1) | 374.42 ms | 291.06 ms |
| Adadelta (weight_decay=1) | 252.64 ms | 29.62 ms |

### Benchmark script

```python
import torch
import torch.optim as optim
import torch.nn as nn
import torchvision
import torch.utils.benchmark as benchmark_utils

model = torchvision.models.resnet.resnet101(pretrained=True).to("cuda")
criterion = nn.CrossEntropyLoss()

# Optimizers under comparison: the current single-tensor implementation and
# its multi-tensor (foreach) counterpart, with identical hyperparameters.
params = dict(weight_decay=1)
optimizer = optim.Adadelta(model.parameters(), **params)
optimizer_mta = optim._multi_tensor.Adadelta(model.parameters(), **params)

# One warm-up forward/backward/step so optimizer state is materialized
# before timing.
running_loss = 0.0
target = torch.empty(128, dtype=torch.long, device="cuda").random_(5)
optimizer.zero_grad()
inputs = torch.rand(128, 3, 100, 100, device="cuda", requires_grad=True)
outputs = model(inputs)
loss = criterion(outputs, target)
loss.backward()
optimizer.step()
running_loss += loss.item()

def main():
    # torch.cuda.synchronize() inside stmt makes each measurement wait for
    # outstanding CUDA work, so the step itself is what gets timed.
    timer = benchmark_utils.Timer(
        stmt="torch.cuda.synchronize(); optimizer.step()",
        globals=globals(),
        label=str(optimizer),
    )
    print(f"autorange:\n{timer.blocked_autorange()}\n\n")

    timer_mta = benchmark_utils.Timer(
        stmt="torch.cuda.synchronize(); optimizer_mta.step()",
        globals=globals(),
        label=str(optimizer_mta),
    )
    print(f"autorange:\n{timer_mta.blocked_autorange()}\n\n")

if __name__ == "__main__":
    main()
```

[ghstack-poisoned]
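The speedups above come from replacing one kernel launch per parameter tensor with `torch._foreach_*` list ops that process the whole parameter list at once. The snippet below is a minimal sketch of that idea for plain SGD, using only the public `torch._foreach_add` / `torch._foreach_add_` primitives; it illustrates the technique and is not the implementation in this PR.

```python
import torch

def foreach_sgd_step(params, grads, lr=1e-3, weight_decay=0.0):
    # Each _foreach_ call operates on the entire tensor list (fused into a
    # handful of kernels on CUDA) instead of looping one tensor at a time.
    if weight_decay != 0.0:
        # grad <- grad + weight_decay * param, across the whole list
        grads = torch._foreach_add(grads, params, alpha=weight_decay)
    # param <- param - lr * grad, in place, across the whole list
    torch._foreach_add_(params, grads, alpha=-lr)

# Hypothetical usage on a small list of parameter tensors:
ps = [torch.zeros(3) for _ in range(4)]
gs = [torch.ones(3) for _ in range(4)]
foreach_sgd_step(ps, gs, lr=0.1, weight_decay=1.0)
```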
Hi @izdeby! Thank you for your pull request. We require contributors to sign our Contributor License Agreement, and yours needs attention. You currently have a record in our system, but the CLA is no longer valid and will need to be resubmitted.

Process: In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with `CLA Signed`. If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!
Stack from ghstack:
Differential Revision: D25406490
### Benchmark results
| Optimizer (config) | Current | Foreach |
|---|---|---|
| SGD (lr=1e-3, momentum=1, dampening=0, weight_decay=1, nesterov=True) | 201.63 ms | 56.99 ms |
| Adam (weight_decay=1., amsgrad=True) | 233.27 ms | 46.89 ms |
| AdamW (weight_decay=1., amsgrad=True) | 371.18 ms | 121.04 ms |
| RMSprop (weight_decay=1, momentum=1, centered=True) | 364.88 ms | 47.52 ms |
| Rprop (lr=1e-2, etas=(0.5, 1.2), step_sizes=(1e-6, 50)) | 1.43 s | 1.26 s |
| ASGD (weight_decay=1) | 165.39 ms | 40.61 ms |
| Adamax (weight_decay=1) | 374.42 ms | 291.06 ms |
| Adadelta (weight_decay=1) | 252.64 ms | 29.62 ms |
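For reference, the multi-tensor variants take the same constructor arguments as the stock optimizers, so switching is a one-line change. A minimal sketch, assuming the private `torch.optim._multi_tensor` namespace this stack exposes (private API, subject to change):

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 10)

# Current single-tensor implementation:
opt = optim.Adam(model.parameters(), weight_decay=1., amsgrad=True)

# Multi-tensor (foreach) implementation, same hyperparameters:
opt_mta = optim._multi_tensor.Adam(model.parameters(), weight_decay=1., amsgrad=True)
```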
### Benchmark script