
Use the latest philox_cuda_state API for stochastic rounding#493

Closed
jianyuh wants to merge 1 commit into pytorch:master from jianyuh:export-D26038596

Conversation

@jianyuh (Member) commented Jan 24, 2021

Summary:
Follow up on the failure case on FP16 stochastic rounding:
- pytorch/pytorch#50148
- D26006041

From Natalia:
- pytorch/pytorch#50916 is the fix; `philox_engine_inputs` is deprecated, so this refactors the call sites to use `philox_cuda_state`.
- Instructions for changing the call: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/CUDAGeneratorImpl.h#L48-L83. Using `philox_cuda_state` will be important for graph capture.

Differential Revision: D26038596

@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D26038596

Summary:
Pull Request resolved: pytorch/pytorch#51004

Pull Request resolved: pytorch#493

Follow up on the failure case on FP16 stochastic rounding:
- pytorch/pytorch#50148
- D26006041 (pytorch@ceb16c9)
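For context, FP16 stochastic rounding rounds a higher-precision value to one of its two nearest representable FP16 neighbors, picking the upper neighbor with probability equal to the value's fractional position between them, so the rounding error is unbiased in expectation. A minimal NumPy sketch of the idea (illustrative only; the PR itself touches a fused CUDA kernel, not this code):

```python
import numpy as np

def stochastic_round_fp16(x: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Round FP32 values to FP16, picking the upper FP16 neighbor with
    probability equal to the fractional position between the two neighbors."""
    x = np.asarray(x, dtype=np.float32)
    lo = x.astype(np.float16)
    # astype rounds to nearest; step down one ULP if it overshot, so lo <= x
    lo = np.where(lo.astype(np.float32) > x,
                  np.nextafter(lo, np.float16(-np.inf)), lo)
    hi = np.nextafter(lo, np.float16(np.inf))
    gap = hi.astype(np.float32) - lo.astype(np.float32)
    frac = np.where(gap > 0,
                    (x - lo.astype(np.float32)) / np.where(gap > 0, gap, 1.0),
                    0.0)
    return np.where(rng.random(x.shape) < frac, hi, lo).astype(np.float16)
```

Exactly representable values round deterministically (`frac` is 0), while in-between values land on either neighbor, which is what makes the Philox counter state matter: every thread needs an independent, reproducible random stream.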

From Natalia:
- pytorch/pytorch#50916 is the fix. `philox_engine_inputs` is deprecated, by the way, so it would be great to refactor this to use `philox_cuda_state`.
- Instructions for changing the call: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/CUDAGeneratorImpl.h#L48-L83. Using `philox_cuda_state` will be important for graph capture.
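The migration follows the pattern documented in that header: the host reads a `PhiloxCudaState` under the generator's mutex and passes it to the kernel by value, which unpacks it with `at::cuda::philox::unpack`. A rough sketch of the before/after shape (the kernel body, launch helper, and include paths are illustrative assumptions, not this PR's actual diff; header locations vary across PyTorch versions):

```cuda
#include <ATen/CUDAGeneratorImpl.h>
#include <ATen/cuda/CUDAGraphsUtils.cuh>  // at::cuda::philox::unpack
#include <curand_kernel.h>
#include <mutex>

// Old (deprecated): the host called gen->philox_engine_inputs(counter_offset)
// and passed the resulting std::pair<uint64_t, uint64_t> seeds to the kernel.

// New: pass PhiloxCudaState by value and unpack it inside the kernel; this
// indirection is what makes the kernel usable under CUDA graph capture.
__global__ void stochastic_rounding_kernel(float* w, int n,
                                           at::PhiloxCudaState philox_args) {
  auto seeds = at::cuda::philox::unpack(philox_args);
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  curandStatePhilox4_32_10_t state;
  curand_init(std::get<0>(seeds), idx, std::get<1>(seeds), &state);
  // ... use curand_uniform4(&state) for the per-element rounding decisions ...
}

void launch(float* w, int n, at::CUDAGeneratorImpl* gen,
            uint64_t counter_offset) {
  at::PhiloxCudaState rng_engine_inputs;
  {
    // Generator state must be read under its mutex.
    std::lock_guard<std::mutex> lock(gen->mutex_);
    rng_engine_inputs = gen->philox_cuda_state(counter_offset);
  }
  stochastic_rounding_kernel<<<(n + 255) / 256, 256>>>(w, n, rng_engine_inputs);
}
```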

Benchmark:
- Before this Diff:
```
(base) [jianyuhuang@devgpu017.atn5.facebook.com: ~/fbsource/fbcode/hpc/ops/benchmarks] $  buck run mode/opt //hpc/ops/benchmarks:split_table_batched_embeddings_benchmark device -- --fp16 --stoc 2>&1 | tee before_diff.log
PARSING BUCK FILES: FINISHED IN 0.4s
CREATING ACTION GRAPH: FINISHED IN 0.0s
DOWNLOADED 0 ARTIFACTS, 0.00 BYTES, 0.0% CACHE MISS
BUILDING: FINISHED IN 5.3s (100%) 6474/6474 JOBS, 0 UPDATED
BUILD SUCCEEDED
DEBUG:root:Using fused exact_row_wise_adagrad with optimizer_args=OptimizerArgs(stochastic_rounding=True, gradient_clipping=False, max_gradient=1.0, learning_rate=0.1, eps=0.1, beta1=0.9, beta2=0.999, weight_decay=0.0, eta=0.001, momentum=0.9)
INFO:root:Embedding parameters:  0.41 GParam,  0.82GB
INFO:root:Accessed weights per batch:  83.89MB
INFO:root:Forward, B: 512, E: 100000, T: 32, D: 128, L: 20, W: False, BW:  607.48GB/s, T: 138us
INFO:root:ForwardBackward, B: 512, E: 100000, T: 32, D: 128, L: 20, BW:  220.85GB/s, T: 1139us
```

- After this Diff:
```
(base) [jianyuhuang@devgpu017.atn5.facebook.com: ~/fbsource/fbcode/hpc/ops/benchmarks] $  buck run mode/opt //hpc/ops/benchmarks:split_table_batched_embeddings_benchmark device -- --fp16 --stoc 2>&1 | tee after_diff.log
PARSING BUCK FILES: FINISHED IN 1.1s
CREATING ACTION GRAPH: FINISHED IN 0.0s
DEBUG:root:Using fused exact_row_wise_adagrad with optimizer_args=OptimizerArgs(stochastic_rounding=True, gradient_clipping=False, max_gradient=1.0, learning_rate=0.1, eps=0.1, beta1=0.9, beta2=0.999, weight_decay=0.0, eta=0.001, momentum=0.9)
INFO:root:Embedding parameters:  0.41 GParam,  0.82GB
INFO:root:Accessed weights per batch:  83.89MB
INFO:root:Forward, B: 512, E: 100000, T: 32, D: 128, L: 20, W: False, BW:  608.80GB/s, T: 138us
INFO:root:ForwardBackward, B: 512, E: 100000, T: 32, D: 128, L: 20, BW:  229.17GB/s, T: 1098us
```
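From the two runs above, forward time is unchanged while the fused forward-backward pass gets slightly faster; a quick sanity check of the reported deltas:

```python
# ForwardBackward figures copied from the benchmark logs above.
before_us, before_gbps = 1139, 220.85
after_us, after_gbps = 1098, 229.17

speedup = before_us / after_us
bw_gain = after_gbps / before_gbps
print(f"fwd+bwd speedup: {speedup:.3f}x, bandwidth gain: {bw_gain:.3f}x")
# → fwd+bwd speedup: 1.037x, bandwidth gain: 1.038x
```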

Differential Revision: D26038596

fbshipit-source-id: 0154ba22688747e717d4c630e958938eff739b24

@facebook-github-bot (Contributor)

This pull request has been merged in d244512.

facebook-github-bot pushed a commit to pytorch/pytorch that referenced this pull request Feb 1, 2021
Use the latest philox_cuda_state API for stochastic rounding (#51004)

Test Plan: CI

Reviewed By: ngimel

Differential Revision: D26038596

fbshipit-source-id: 5360395c1c3b1a062b38e5695239258e892c63c4
pytorch-bot Bot pushed a commit that referenced this pull request Feb 26, 2026
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
Use the latest philox_cuda_state API for stochastic rounding (pytorch#51004)

