Skip to content

philox_engine_inputs should also round increment to a multiple of 4#50916

Closed
mcarilli wants to merge 1 commit intopytorch:masterfrom
mcarilli:philox_engine_inputs_increment
Closed

philox_engine_inputs should also round increment to a multiple of 4#50916
mcarilli wants to merge 1 commit intopytorch:masterfrom
mcarilli:philox_engine_inputs_increment

Conversation

@mcarilli
Copy link
Copy Markdown
Collaborator

philox_engine_inputs() is deprecated. Callers should refactor to use philox_cuda_state(), and afaik all call sites in aten have already been refactored, but in the meantime on behalf of other consumers (ie extensions, possibly some lingering call sites in jit), philox_engine_inputs should handle the increment the same way philox_cuda_state does.

@facebook-github-bot
Copy link
Copy Markdown
Contributor

facebook-github-bot commented Jan 21, 2021

💊 CI failures summary and remediations

As of commit dda47b3 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI (expand for details).Follow this link to opt-out of these comments for your Pull Requests.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

@mcarilli mcarilli requested a review from ngimel January 21, 2021 22:45
@ejguan ejguan added module: cuda Related to torch.cuda, and CUDA support in general triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module labels Jan 21, 2021
@codecov
Copy link
Copy Markdown

codecov Bot commented Jan 22, 2021

Codecov Report

Merging #50916 (dda47b3) into master (884fb48) will decrease coverage by 0.00%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #50916      +/-   ##
==========================================
- Coverage   81.02%   81.02%   -0.01%     
==========================================
  Files        1916     1916              
  Lines      209349   209349              
==========================================
- Hits       169624   169618       -6     
- Misses      39725    39731       +6     

Copy link
Copy Markdown
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ngimel has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Copy link
Copy Markdown
Contributor

@ngimel merged this pull request in ffc8a26.

jianyuh added a commit to jianyuh/FBGEMM that referenced this pull request Jan 24, 2021
Summary:
Follow up on the failure case on FP16 stochastic rounding:
- pytorch/pytorch#50148
- D26006041 (pytorch@ceb16c9)

From Natalia:
- pytorch/pytorch#50916 is the fix, philox_engine_inputs is deprecated btw so if you could refactor it to use philox_cuda_state that would be great.
- instructions to change the call https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/CUDAGeneratorImpl.h#L48-L83, it will be important to use philox_cuda_state with graph capture.

Differential Revision: D26038596

fbshipit-source-id: f42bce9e5893eb9478b63cd4ca45c4da003dcf2d
jianyuh added a commit to jianyuh/FBGEMM that referenced this pull request Jan 25, 2021
Summary:
Pull Request resolved: pytorch/pytorch#51004

Pull Request resolved: pytorch#493

Follow up on the failure case on FP16 stochastic rounding:
- pytorch/pytorch#50148
- D26006041 (pytorch@ceb16c9)

From Natalia:
- pytorch/pytorch#50916 is the fix, philox_engine_inputs is deprecated btw so if you could refactor it to use philox_cuda_state that would be great.
- instructions to change the call https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/CUDAGeneratorImpl.h#L48-L83, it will be important to use philox_cuda_state with graph capture.

Benchmark:
- Before this Diff:
```
(base) [jianyuhuang@devgpu017.atn5.facebook.com: ~/fbsource/fbcode/hpc/ops/benchmarks] $  buck run mode/opt //hpc/ops/benchmarks:split_table_batched_embeddings_benchmark device -- --fp16 --stoc 2>&1 | tee before_diff.log
PARSING BUCK FILES: FINISHED IN 0.4s
CREATING ACTION GRAPH: FINISHED IN 0.0s
DOWNLOADED 0 ARTIFACTS, 0.00 BYTES, 0.0% CACHE MISS
BUILDING: FINISHED IN 5.3s (100%) 6474/6474 JOBS, 0 UPDATED
BUILD SUCCEEDED
DEBUG:root:Using fused exact_row_wise_adagrad with optimizer_args=OptimizerArgs(stochastic_rounding=True, gradient_clipping=False, max_gradient=1.0, learning_rate=0.1, eps=0.1, beta1=0.9, beta2=0.999, weight_decay=0.0, eta=0.001, momentum=0.9)
INFO:root:Embedding parameters:  0.41 GParam,  0.82GB
INFO:root:Accessed weights per batch:  83.89MB
INFO:root:Forward, B: 512, E: 100000, T: 32, D: 128, L: 20, W: False, BW:  607.48GB/s, T: 138us
INFO:root:ForwardBackward, B: 512, E: 100000, T: 32, D: 128, L: 20, BW:  220.85GB/s, T: 1139us
```

- After this Diff:
```
(base) [jianyuhuang@devgpu017.atn5.facebook.com: ~/fbsource/fbcode/hpc/ops/benchmarks] $  buck run mode/opt //hpc/ops/[5/1935]
ks:split_table_batched_embeddings_benchmark device -- --fp16 --stoc 2>&1 | tee after_diff.log
PARSING BUCK FILES: FINISHED IN 1.1s
CREATING ACTION GRAPH: FINISHED IN 0.0s
DEBUG:root:Using fused exact_row_wise_adagrad with optimizer_args=OptimizerArgs(stochastic_rounding=True, gradient_clipping=Fal
se, max_gradient=1.0, learning_rate=0.1, eps=0.1, beta1=0.9, beta2=0.999, weight_decay=0.0, eta=0.001, momentum=0.9)           INFO:root:Embedding parameters:  0.41 GParam,  0.82GB
INFO:root:Accessed weights per batch:  83.89MB
INFO:root:Forward, B: 512, E: 100000, T: 32, D: 128, L: 20, W: False, BW:  608.80GB/s, T: 138us
INFO:root:ForwardBackward, B: 512, E: 100000, T: 32, D: 128, L: 20, BW:  229.17GB/s, T: 1098us
```

Differential Revision: D26038596

fbshipit-source-id: 0154ba22688747e717d4c630e958938eff739b24
facebook-github-bot pushed a commit to pytorch/FBGEMM that referenced this pull request Feb 1, 2021
Summary:
Pull Request resolved: pytorch/pytorch#51004

Pull Request resolved: #493

Follow up on the failure case on FP16 stochastic rounding:
- pytorch/pytorch#50148
- D26006041 (ceb16c9)

From Natalia:
- pytorch/pytorch#50916 is the fix, philox_engine_inputs is deprecated btw so if you could refactor it to use philox_cuda_state that would be great.
- instructions to change the call https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/CUDAGeneratorImpl.h#L48-L83, it will be important to use philox_cuda_state with graph capture.

Benchmark:
- Before this Diff:
```
(base) [jianyuhuang@devgpu017.atn5.facebook.com: ~/fbsource/fbcode/hpc/ops/benchmarks] $  buck run mode/opt //hpc/ops/benchmarks:split_table_batched_embeddings_benchmark device -- --fp16 --stoc 2>&1 | tee before_diff.log
PARSING BUCK FILES: FINISHED IN 0.4s
CREATING ACTION GRAPH: FINISHED IN 0.0s
DOWNLOADED 0 ARTIFACTS, 0.00 BYTES, 0.0% CACHE MISS
BUILDING: FINISHED IN 5.3s (100%) 6474/6474 JOBS, 0 UPDATED
BUILD SUCCEEDED
DEBUG:root:Using fused exact_row_wise_adagrad with optimizer_args=OptimizerArgs(stochastic_rounding=True, gradient_clipping=False, max_gradient=1.0, learning_rate=0.1, eps=0.1, beta1=0.9, beta2=0.999, weight_decay=0.0, eta=0.001, momentum=0.9)
INFO:root:Embedding parameters:  0.41 GParam,  0.82GB
INFO:root:Accessed weights per batch:  83.89MB
INFO:root:Forward, B: 512, E: 100000, T: 32, D: 128, L: 20, W: False, BW:  607.48GB/s, T: 138us
INFO:root:ForwardBackward, B: 512, E: 100000, T: 32, D: 128, L: 20, BW:  220.85GB/s, T: 1139us
```

- After this Diff:
```
(base) [jianyuhuang@devgpu017.atn5.facebook.com: ~/fbsource/fbcode/hpc/ops/benchmarks] $  buck run mode/opt //hpc/ops/[5/1935]
ks:split_table_batched_embeddings_benchmark device -- --fp16 --stoc 2>&1 | tee after_diff.log
PARSING BUCK FILES: FINISHED IN 1.1s
CREATING ACTION GRAPH: FINISHED IN 0.0s
DEBUG:root:Using fused exact_row_wise_adagrad with optimizer_args=OptimizerArgs(stochastic_rounding=True, gradient_clipping=Fal
se, max_gradient=1.0, learning_rate=0.1, eps=0.1, beta1=0.9, beta2=0.999, weight_decay=0.0, eta=0.001, momentum=0.9)           INFO:root:Embedding parameters:  0.41 GParam,  0.82GB
INFO:root:Accessed weights per batch:  83.89MB
INFO:root:Forward, B: 512, E: 100000, T: 32, D: 128, L: 20, W: False, BW:  608.80GB/s, T: 138us
INFO:root:ForwardBackward, B: 512, E: 100000, T: 32, D: 128, L: 20, BW:  229.17GB/s, T: 1098us
```

Reviewed By: ngimel

Differential Revision: D26038596

fbshipit-source-id: 5360395c1c3b1a062b38e5695239258e892c63c4
facebook-github-bot pushed a commit that referenced this pull request Feb 1, 2021
…ding (#51004)

Summary:
Pull Request resolved: #51004

Pull Request resolved: pytorch/FBGEMM#493

Follow up on the failure case on FP16 stochastic rounding:
- #50148
- D26006041

From Natalia:
- #50916 is the fix, philox_engine_inputs is deprecated btw so if you could refactor it to use philox_cuda_state that would be great.
- instructions to change the call https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/CUDAGeneratorImpl.h#L48-L83, it will be important to use philox_cuda_state with graph capture.

Benchmark:
- Before this Diff:
```
(base) [jianyuhuang@devgpu017.atn5.facebook.com: ~/fbsource/fbcode/hpc/ops/benchmarks] $  buck run mode/opt //hpc/ops/benchmarks:split_table_batched_embeddings_benchmark device -- --fp16 --stoc 2>&1 | tee before_diff.log
PARSING BUCK FILES: FINISHED IN 0.4s
CREATING ACTION GRAPH: FINISHED IN 0.0s
DOWNLOADED 0 ARTIFACTS, 0.00 BYTES, 0.0% CACHE MISS
BUILDING: FINISHED IN 5.3s (100%) 6474/6474 JOBS, 0 UPDATED
BUILD SUCCEEDED
DEBUG:root:Using fused exact_row_wise_adagrad with optimizer_args=OptimizerArgs(stochastic_rounding=True, gradient_clipping=False, max_gradient=1.0, learning_rate=0.1, eps=0.1, beta1=0.9, beta2=0.999, weight_decay=0.0, eta=0.001, momentum=0.9)
INFO:root:Embedding parameters:  0.41 GParam,  0.82GB
INFO:root:Accessed weights per batch:  83.89MB
INFO:root:Forward, B: 512, E: 100000, T: 32, D: 128, L: 20, W: False, BW:  607.48GB/s, T: 138us
INFO:root:ForwardBackward, B: 512, E: 100000, T: 32, D: 128, L: 20, BW:  220.85GB/s, T: 1139us
```

- After this Diff:
```
(base) [jianyuhuang@devgpu017.atn5.facebook.com: ~/fbsource/fbcode/hpc/ops/benchmarks] $  buck run mode/opt //hpc/ops/[5/1935]
ks:split_table_batched_embeddings_benchmark device -- --fp16 --stoc 2>&1 | tee after_diff.log
PARSING BUCK FILES: FINISHED IN 1.1s
CREATING ACTION GRAPH: FINISHED IN 0.0s
DEBUG:root:Using fused exact_row_wise_adagrad with optimizer_args=OptimizerArgs(stochastic_rounding=True, gradient_clipping=Fal
se, max_gradient=1.0, learning_rate=0.1, eps=0.1, beta1=0.9, beta2=0.999, weight_decay=0.0, eta=0.001, momentum=0.9)           INFO:root:Embedding parameters:  0.41 GParam,  0.82GB
INFO:root:Accessed weights per batch:  83.89MB
INFO:root:Forward, B: 512, E: 100000, T: 32, D: 128, L: 20, W: False, BW:  608.80GB/s, T: 138us
INFO:root:ForwardBackward, B: 512, E: 100000, T: 32, D: 128, L: 20, BW:  229.17GB/s, T: 1098us
```

Test Plan: CI

Reviewed By: ngimel

Differential Revision: D26038596

fbshipit-source-id: 5360395c1c3b1a062b38e5695239258e892c63c4
pytorch-bot Bot pushed a commit to pytorch/FBGEMM that referenced this pull request Feb 26, 2026
Summary:
Pull Request resolved: pytorch/pytorch#51004

Pull Request resolved: #493

Follow up on the failure case on FP16 stochastic rounding:
- pytorch/pytorch#50148
- D26006041 (05873a3)

From Natalia:
- pytorch/pytorch#50916 is the fix, philox_engine_inputs is deprecated btw so if you could refactor it to use philox_cuda_state that would be great.
- instructions to change the call https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/CUDAGeneratorImpl.h#L48-L83, it will be important to use philox_cuda_state with graph capture.

Benchmark:
- Before this Diff:
```
(base) [jianyuhuang@devgpu017.atn5.facebook.com: ~/fbsource/fbcode/hpc/ops/benchmarks] $  buck run mode/opt //hpc/ops/benchmarks:split_table_batched_embeddings_benchmark device -- --fp16 --stoc 2>&1 | tee before_diff.log
PARSING BUCK FILES: FINISHED IN 0.4s
CREATING ACTION GRAPH: FINISHED IN 0.0s
DOWNLOADED 0 ARTIFACTS, 0.00 BYTES, 0.0% CACHE MISS
BUILDING: FINISHED IN 5.3s (100%) 6474/6474 JOBS, 0 UPDATED
BUILD SUCCEEDED
DEBUG:root:Using fused exact_row_wise_adagrad with optimizer_args=OptimizerArgs(stochastic_rounding=True, gradient_clipping=False, max_gradient=1.0, learning_rate=0.1, eps=0.1, beta1=0.9, beta2=0.999, weight_decay=0.0, eta=0.001, momentum=0.9)
INFO:root:Embedding parameters:  0.41 GParam,  0.82GB
INFO:root:Accessed weights per batch:  83.89MB
INFO:root:Forward, B: 512, E: 100000, T: 32, D: 128, L: 20, W: False, BW:  607.48GB/s, T: 138us
INFO:root:ForwardBackward, B: 512, E: 100000, T: 32, D: 128, L: 20, BW:  220.85GB/s, T: 1139us
```

- After this Diff:
```
(base) [jianyuhuang@devgpu017.atn5.facebook.com: ~/fbsource/fbcode/hpc/ops/benchmarks] $  buck run mode/opt //hpc/ops/[5/1935]
ks:split_table_batched_embeddings_benchmark device -- --fp16 --stoc 2>&1 | tee after_diff.log
PARSING BUCK FILES: FINISHED IN 1.1s
CREATING ACTION GRAPH: FINISHED IN 0.0s
DEBUG:root:Using fused exact_row_wise_adagrad with optimizer_args=OptimizerArgs(stochastic_rounding=True, gradient_clipping=Fal
se, max_gradient=1.0, learning_rate=0.1, eps=0.1, beta1=0.9, beta2=0.999, weight_decay=0.0, eta=0.001, momentum=0.9)           INFO:root:Embedding parameters:  0.41 GParam,  0.82GB
INFO:root:Accessed weights per batch:  83.89MB
INFO:root:Forward, B: 512, E: 100000, T: 32, D: 128, L: 20, W: False, BW:  608.80GB/s, T: 138us
INFO:root:ForwardBackward, B: 512, E: 100000, T: 32, D: 128, L: 20, BW:  229.17GB/s, T: 1098us
```

Reviewed By: ngimel

Differential Revision: D26038596

fbshipit-source-id: 5360395c1c3b1a062b38e5695239258e892c63c4
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
…ytorch#50916)

Summary:
`philox_engine_inputs()` is deprecated.  Callers should refactor to use `philox_cuda_state()`, and afaik all call sites in aten have already been refactored, but in the meantime on behalf of other consumers (ie extensions, possibly some lingering call sites in jit), `philox_engine_inputs` should handle the increment the same way `philox_cuda_state` does.

Pull Request resolved: pytorch#50916

Reviewed By: mrshenli

Differential Revision: D26022618

Pulled By: ngimel

fbshipit-source-id: 17178ad099ddc17d3596b9508ae4dce729b44f57
laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
…ding (pytorch#51004)

Summary:
Pull Request resolved: pytorch#51004

Pull Request resolved: pytorch/FBGEMM#493

Follow up on the failure case on FP16 stochastic rounding:
- pytorch#50148
- D26006041

From Natalia:
- pytorch#50916 is the fix, philox_engine_inputs is deprecated btw so if you could refactor it to use philox_cuda_state that would be great.
- instructions to change the call https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/CUDAGeneratorImpl.h#L48-L83, it will be important to use philox_cuda_state with graph capture.

Benchmark:
- Before this Diff:
```
(base) [jianyuhuang@devgpu017.atn5.facebook.com: ~/fbsource/fbcode/hpc/ops/benchmarks] $  buck run mode/opt //hpc/ops/benchmarks:split_table_batched_embeddings_benchmark device -- --fp16 --stoc 2>&1 | tee before_diff.log
PARSING BUCK FILES: FINISHED IN 0.4s
CREATING ACTION GRAPH: FINISHED IN 0.0s
DOWNLOADED 0 ARTIFACTS, 0.00 BYTES, 0.0% CACHE MISS
BUILDING: FINISHED IN 5.3s (100%) 6474/6474 JOBS, 0 UPDATED
BUILD SUCCEEDED
DEBUG:root:Using fused exact_row_wise_adagrad with optimizer_args=OptimizerArgs(stochastic_rounding=True, gradient_clipping=False, max_gradient=1.0, learning_rate=0.1, eps=0.1, beta1=0.9, beta2=0.999, weight_decay=0.0, eta=0.001, momentum=0.9)
INFO:root:Embedding parameters:  0.41 GParam,  0.82GB
INFO:root:Accessed weights per batch:  83.89MB
INFO:root:Forward, B: 512, E: 100000, T: 32, D: 128, L: 20, W: False, BW:  607.48GB/s, T: 138us
INFO:root:ForwardBackward, B: 512, E: 100000, T: 32, D: 128, L: 20, BW:  220.85GB/s, T: 1139us
```

- After this Diff:
```
(base) [jianyuhuang@devgpu017.atn5.facebook.com: ~/fbsource/fbcode/hpc/ops/benchmarks] $  buck run mode/opt //hpc/ops/[5/1935]
ks:split_table_batched_embeddings_benchmark device -- --fp16 --stoc 2>&1 | tee after_diff.log
PARSING BUCK FILES: FINISHED IN 1.1s
CREATING ACTION GRAPH: FINISHED IN 0.0s
DEBUG:root:Using fused exact_row_wise_adagrad with optimizer_args=OptimizerArgs(stochastic_rounding=True, gradient_clipping=Fal
se, max_gradient=1.0, learning_rate=0.1, eps=0.1, beta1=0.9, beta2=0.999, weight_decay=0.0, eta=0.001, momentum=0.9)           INFO:root:Embedding parameters:  0.41 GParam,  0.82GB
INFO:root:Accessed weights per batch:  83.89MB
INFO:root:Forward, B: 512, E: 100000, T: 32, D: 128, L: 20, W: False, BW:  608.80GB/s, T: 138us
INFO:root:ForwardBackward, B: 512, E: 100000, T: 32, D: 128, L: 20, BW:  229.17GB/s, T: 1098us
```

Test Plan: CI

Reviewed By: ngimel

Differential Revision: D26038596

fbshipit-source-id: 5360395c1c3b1a062b38e5695239258e892c63c4
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla signed Merged module: cuda Related to torch.cuda, and CUDA support in general open source triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants