Use the latest philox_cuda_state API for stochastic rounding #493
Closed
jianyuh wants to merge 1 commit into pytorch:master
This pull request was exported from Phabricator. Differential Revision: D26038596
Summary:
Pull Request resolved: pytorch/pytorch#51004

Pull Request resolved: pytorch#493

Follow up on the failure case on FP16 stochastic rounding:
- pytorch/pytorch#50148
- D26006041 (pytorch@ceb16c9)

From Natalia:
- pytorch/pytorch#50916 is the fix, philox_engine_inputs is deprecated btw so if you could refactor it to use philox_cuda_state that would be great.
- instructions to change the call https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/CUDAGeneratorImpl.h#L48-L83, it will be important to use philox_cuda_state with graph capture.

Benchmark:
- Before this Diff:
```
(base) [jianyuhuang@devgpu017.atn5.facebook.com: ~/fbsource/fbcode/hpc/ops/benchmarks] $ buck run mode/opt //hpc/ops/benchmarks:split_table_batched_embeddings_benchmark device -- --fp16 --stoc 2>&1 | tee before_diff.log
PARSING BUCK FILES: FINISHED IN 0.4s
CREATING ACTION GRAPH: FINISHED IN 0.0s
DOWNLOADED 0 ARTIFACTS, 0.00 BYTES, 0.0% CACHE MISS
BUILDING: FINISHED IN 5.3s (100%) 6474/6474 JOBS, 0 UPDATED
BUILD SUCCEEDED
DEBUG:root:Using fused exact_row_wise_adagrad with optimizer_args=OptimizerArgs(stochastic_rounding=True, gradient_clipping=False, max_gradient=1.0, learning_rate=0.1, eps=0.1, beta1=0.9, beta2=0.999, weight_decay=0.0, eta=0.001, momentum=0.9)
INFO:root:Embedding parameters: 0.41 GParam, 0.82GB
INFO:root:Accessed weights per batch: 83.89MB
INFO:root:Forward, B: 512, E: 100000, T: 32, D: 128, L: 20, W: False, BW: 607.48GB/s, T: 138us
INFO:root:ForwardBackward, B: 512, E: 100000, T: 32, D: 128, L: 20, BW: 220.85GB/s, T: 1139us
```
- After this Diff:
```
(base) [jianyuhuang@devgpu017.atn5.facebook.com: ~/fbsource/fbcode/hpc/ops/benchmarks] $ buck run mode/opt //hpc/ops/[5/1935] ks:split_table_batched_embeddings_benchmark device -- --fp16 --stoc 2>&1 | tee after_diff.log
PARSING BUCK FILES: FINISHED IN 1.1s
CREATING ACTION GRAPH: FINISHED IN 0.0s
DEBUG:root:Using fused exact_row_wise_adagrad with optimizer_args=OptimizerArgs(stochastic_rounding=True, gradient_clipping=False, max_gradient=1.0, learning_rate=0.1, eps=0.1, beta1=0.9, beta2=0.999, weight_decay=0.0, eta=0.001, momentum=0.9)
INFO:root:Embedding parameters: 0.41 GParam, 0.82GB
INFO:root:Accessed weights per batch: 83.89MB
INFO:root:Forward, B: 512, E: 100000, T: 32, D: 128, L: 20, W: False, BW: 608.80GB/s, T: 138us
INFO:root:ForwardBackward, B: 512, E: 100000, T: 32, D: 128, L: 20, BW: 229.17GB/s, T: 1098us
```

Differential Revision: D26038596

fbshipit-source-id: 0154ba22688747e717d4c630e958938eff739b24
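For readers unfamiliar with the term, FP16 stochastic rounding can be sketched on the CPU as below. This is only an illustration of the general technique, not the kernel changed in this pull request: the function names are hypothetical, the value is assumed to be finite and within FP16's normal range, and the random bits are passed in by the caller (in a CUDA kernel they would presumably come from a Philox stream seeded via `philox_cuda_state`).

```python
import struct


def f32_bits(x: float) -> int:
    """Bit pattern of x as an IEEE-754 binary32 value (unsigned 32-bit int)."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_f32(b: int) -> float:
    """Inverse of f32_bits: reinterpret a 32-bit pattern as a float."""
    return struct.unpack("<f", struct.pack("<I", b))[0]


def to_fp16(x: float) -> float:
    """Round-trip x through binary16 using the default round-to-nearest-even."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def stochastic_round_fp16(x: float, rand_bits: int) -> float:
    """Hypothetical helper: stochastically round an FP32 value to FP16.

    FP32 keeps 23 mantissa bits and FP16 keeps 10, so 13 bits are discarded.
    Adding a uniform 13-bit random integer before truncating carries into the
    kept bits with probability (discarded fraction) / 2**13, which makes the
    result equal to x in expectation.
    Assumption: x is finite and inside FP16's normal range.
    """
    bits = f32_bits(x)
    bits = (bits + (rand_bits & 0x1FFF)) & 0xFFFFE000  # add noise, then truncate
    return to_fp16(bits_f32(bits))  # exact: the low 13 mantissa bits are zero


if __name__ == "__main__":
    x = 1.0 + 2.0 ** -12  # a quarter of the way between two adjacent FP16 values
    outs = [stochastic_round_fp16(x, r) for r in range(1 << 13)]
    print("round-to-nearest:", to_fp16(x))
    print("mean over all 13-bit draws:", sum(outs) / len(outs))
```

Round-to-nearest always maps this `x` to the lower neighbour, while averaging the stochastic result over every possible 13-bit draw recovers `x`, which is why the technique helps when small optimizer updates are accumulated into FP16 weights.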
1ea2300 to ecff64e
This pull request has been merged in d244512.
facebook-github-bot pushed a commit to pytorch/pytorch that referenced this pull request Feb 1, 2021
…ding (#51004)
Test Plan: CI
Reviewed By: ngimel
Differential Revision: D26038596
fbshipit-source-id: 5360395c1c3b1a062b38e5695239258e892c63c4
pytorch-bot Bot pushed a commit that referenced this pull request Feb 26, 2026
Reviewed By: ngimel
Differential Revision: D26038596
fbshipit-source-id: 5360395c1c3b1a062b38e5695239258e892c63c4
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
…ding (pytorch#51004)
Test Plan: CI
Reviewed By: ngimel
Differential Revision: D26038596
fbshipit-source-id: 5360395c1c3b1a062b38e5695239258e892c63c4