[fbgemm_gpu] Use the latest philox_cuda_state API for stochastic rounding#51004

Closed
jianyuh wants to merge 1 commit intopytorch:masterfrom
jianyuh:export-D26038596

Conversation

@jianyuh
Member

@jianyuh jianyuh commented Jan 24, 2021

Summary:
Follow up on the failure case on FP16 stochastic rounding:
- pytorch#50148
- D26006041

From Natalia:
- pytorch#50916 is the fix; philox_engine_inputs is deprecated, so refactoring to use philox_cuda_state would be great.
- Instructions for changing the call: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/CUDAGeneratorImpl.h#L48-L83. It will be important to use philox_cuda_state with graph capture.
Test Plan: CI

Differential Revision: D26038596

…ding

Summary:
Follow up on the failure case on FP16 stochastic rounding:
- pytorch#50148
- D26006041

From Natalia:
- pytorch#50916 is the fix; philox_engine_inputs is deprecated, so refactoring to use philox_cuda_state would be great.
- Instructions for changing the call: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/CUDAGeneratorImpl.h#L48-L83. It will be important to use philox_cuda_state with graph capture.
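For illustration, here is a minimal NumPy sketch of FP16 stochastic rounding itself, the technique this kernel exercises. This is not fbgemm_gpu's CUDA implementation, which draws its random bits from the Philox state discussed above; the function name and the positive-values-only restriction are ours:

```python
import numpy as np

def stochastic_round_fp16(x: float, u: float) -> np.float16:
    """Round a positive float to fp16, rounding up with probability
    proportional to the distance from the lower fp16 neighbor.
    u is a uniform sample in [0, 1)."""
    assert x > 0, "sketch handles positive finite values only"
    lo = np.float16(x)                  # round-to-nearest as a starting point
    if float(lo) > x:                   # nearest rounding went up: step back one ulp
        lo = np.uint16(lo.view(np.uint16) - 1).view(np.float16)
    if float(lo) == x:                  # exactly representable: nothing to randomize
        return lo
    hi = np.uint16(lo.view(np.uint16) + 1).view(np.float16)
    p = (x - float(lo)) / (float(hi) - float(lo))
    return hi if u < p else lo

# Round-to-nearest maps this x to 1.0 every time; the mean of many
# stochastic roundings instead converges to the unrepresentable input.
rng = np.random.default_rng(0)
x = 1.0 + 1.0 / 4096                    # a quarter-ulp above 1.0 in fp16
mean = np.mean([float(stochastic_round_fp16(x, rng.random()))
                for _ in range(20000)])
```

This unbiasedness is why small FP16 weight updates survive in expectation, and why the RNG state plumbing (philox_engine_inputs vs. philox_cuda_state) matters for training accuracy.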

Test Plan: CI

Differential Revision: D26038596

fbshipit-source-id: 9f12f9a18895e458c691227b454d4db8ad026d6c
@facebook-github-bot
Contributor

facebook-github-bot commented Jan 24, 2021

💊 CI failures summary and remediations

As of commit 4d88723 (more details on the Dr. CI page):


  • 1/1 failures possibly* introduced in this PR
    • 1/1 non-CircleCI failure(s)

This comment was automatically generated by Dr. CI.

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D26038596

@codecov

codecov Bot commented Jan 24, 2021

Codecov Report

Merging #51004 (4d88723) into master (1f40f2a) will not change coverage.
The diff coverage is n/a.

```
@@           Coverage Diff           @@
##           master   #51004   +/-   ##
=======================================
  Coverage   80.91%   80.91%
=======================================
  Files        1924     1924
  Lines      210014   210014
=======================================
  Hits       169924   169924
  Misses      40090    40090
```

jianyuh added a commit to jianyuh/FBGEMM that referenced this pull request Jan 25, 2021
Summary:
Pull Request resolved: pytorch/pytorch#51004

Pull Request resolved: pytorch#493

Follow up on the failure case on FP16 stochastic rounding:
- pytorch/pytorch#50148
- D26006041 (pytorch@ceb16c9)

From Natalia:
- pytorch/pytorch#50916 is the fix, philox_engine_inputs is deprecated btw so if you could refactor it to use philox_cuda_state that would be great.
- instructions to change the call https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/CUDAGeneratorImpl.h#L48-L83, it will be important to use philox_cuda_state with graph capture.

Benchmark:
- Before this Diff:
```
(base) [jianyuhuang@devgpu017.atn5.facebook.com: ~/fbsource/fbcode/hpc/ops/benchmarks] $  buck run mode/opt //hpc/ops/benchmarks:split_table_batched_embeddings_benchmark device -- --fp16 --stoc 2>&1 | tee before_diff.log
PARSING BUCK FILES: FINISHED IN 0.4s
CREATING ACTION GRAPH: FINISHED IN 0.0s
DOWNLOADED 0 ARTIFACTS, 0.00 BYTES, 0.0% CACHE MISS
BUILDING: FINISHED IN 5.3s (100%) 6474/6474 JOBS, 0 UPDATED
BUILD SUCCEEDED
DEBUG:root:Using fused exact_row_wise_adagrad with optimizer_args=OptimizerArgs(stochastic_rounding=True, gradient_clipping=False, max_gradient=1.0, learning_rate=0.1, eps=0.1, beta1=0.9, beta2=0.999, weight_decay=0.0, eta=0.001, momentum=0.9)
INFO:root:Embedding parameters:  0.41 GParam,  0.82GB
INFO:root:Accessed weights per batch:  83.89MB
INFO:root:Forward, B: 512, E: 100000, T: 32, D: 128, L: 20, W: False, BW:  607.48GB/s, T: 138us
INFO:root:ForwardBackward, B: 512, E: 100000, T: 32, D: 128, L: 20, BW:  220.85GB/s, T: 1139us
```

- After this Diff:
```
(base) [jianyuhuang@devgpu017.atn5.facebook.com: ~/fbsource/fbcode/hpc/ops/benchmarks] $  buck run mode/opt //hpc/ops/benchmarks:split_table_batched_embeddings_benchmark device -- --fp16 --stoc 2>&1 | tee after_diff.log
PARSING BUCK FILES: FINISHED IN 1.1s
CREATING ACTION GRAPH: FINISHED IN 0.0s
DEBUG:root:Using fused exact_row_wise_adagrad with optimizer_args=OptimizerArgs(stochastic_rounding=True, gradient_clipping=False, max_gradient=1.0, learning_rate=0.1, eps=0.1, beta1=0.9, beta2=0.999, weight_decay=0.0, eta=0.001, momentum=0.9)
INFO:root:Embedding parameters:  0.41 GParam,  0.82GB
INFO:root:Accessed weights per batch:  83.89MB
INFO:root:Forward, B: 512, E: 100000, T: 32, D: 128, L: 20, W: False, BW:  608.80GB/s, T: 138us
INFO:root:ForwardBackward, B: 512, E: 100000, T: 32, D: 128, L: 20, BW:  229.17GB/s, T: 1098us
```
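As a sanity check, the logged bandwidth figures are consistent with the other numbers on the same lines. The factor of 3 for ForwardBackward (one read in the forward pass, a read plus a write in the backward pass) is our assumption, not something the log states:

```python
# Figures taken from the "Before this Diff" log above.
accessed_bytes = 83.89e6    # "Accessed weights per batch: 83.89MB"

# 83.89 MB in 138 us -> ~607.9 GB/s, matching the logged 607.48 GB/s.
fwd_bw = accessed_bytes / 138e-6 / 1e9

# Assuming ~3x the traffic over 1139 us -> ~221.0 GB/s vs logged 220.85 GB/s.
fwd_bwd_bw = 3 * accessed_bytes / 1139e-6 / 1e9

# After-diff ForwardBackward drops from 1139 us to 1098 us: ~3.7% faster.
speedup = 1139 / 1098
```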

Differential Revision: D26038596

fbshipit-source-id: 0154ba22688747e717d4c630e958938eff739b24
@facebook-github-bot
Contributor

This pull request has been merged in d244512.

facebook-github-bot pushed a commit that referenced this pull request Feb 1, 2021
…ding (#51004)

Summary:
Pull Request resolved: #51004

Pull Request resolved: pytorch/FBGEMM#493

(Same summary, benchmark, and test plan as the commit above.)

Test Plan: CI

Reviewed By: ngimel

Differential Revision: D26038596

fbshipit-source-id: 5360395c1c3b1a062b38e5695239258e892c63c4
pytorch-bot Bot pushed a commit to pytorch/FBGEMM that referenced this pull request Feb 26, 2026
Summary:
Pull Request resolved: pytorch/pytorch#51004

Pull Request resolved: #493

(Same summary and benchmark as the commit above; D26006041 landed as 05873a3.)

Reviewed By: ngimel

Differential Revision: D26038596

fbshipit-source-id: 5360395c1c3b1a062b38e5695239258e892c63c4
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
…ding (pytorch#51004)

Summary:
Pull Request resolved: pytorch#51004

Pull Request resolved: pytorch/FBGEMM#493

(Same summary, benchmark, and test plan as the commit above.)

Test Plan: CI

Reviewed By: ngimel

Differential Revision: D26038596

fbshipit-source-id: 5360395c1c3b1a062b38e5695239258e892c63c4