make order btw div and mul in adagrad consistent #286
Closed
jspark1105 wants to merge 1 commit into pytorch:master from
Conversation
Summary: Re-attempt of D18805426. Decided to be consistent with PyTorch Adagrad.

There was an inconsistency in the order of operations between the scalar and SIMD code when computing Adagrad. This diff makes them consistent by doing `w += lr * grad / (sqrt(moment) + epsilon)` in Adagrad and `w += lr / (sqrt(moment) + epsilon) * grad` in RowWiseSparseAdagrad.

The Adagrad order is consistent with PyTorch (see the `addcmul_cpu_kernel` function in aten/src/ATen/native/cpu/PointwiseOpsKernel.cpp). The RowWiseSparseAdagrad order makes the computation more efficient: `lr / (sqrt(moment) + epsilon)` is shared among all elements in the row, so it only needs to be computed once per row. Also, we're not going to use FMA, to be consistent with PyTorch (even though it would provide a small accuracy benefit).

Differential Revision: D19342865

fbshipit-source-id: 1db79e9b66ab634ba6cf2b700c7cbb01bc0a7180
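As an illustration of the two orders, here is a minimal scalar sketch in the spirit of the kernels being changed. It is not the actual FBGEMM code: the function names are hypothetical, and the per-row moment update (averaging the squared gradients over the row) is an assumption about the rowwise variant.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical scalar sketch of the Adagrad order: multiply by the gradient
// first, then divide, matching PyTorch's addcmul-based update.
void adagrad_update(float* w, float* moment, const float* grad,
                    std::size_t n, float lr, float epsilon) {
  for (std::size_t i = 0; i < n; ++i) {
    moment[i] += grad[i] * grad[i];
    // w += lr * grad / (sqrt(moment) + epsilon)
    w[i] += lr * grad[i] / (std::sqrt(moment[i]) + epsilon);
  }
}

// Hypothetical scalar sketch of the RowWiseSparseAdagrad order: divide first,
// so the quotient is computed once and reused for every element of the row.
void rowwise_adagrad_update_row(float* w, float* moment_row, const float* grad,
                                std::size_t n, float lr, float epsilon) {
  float sum_sq = 0.0f;
  for (std::size_t i = 0; i < n; ++i) {
    sum_sq += grad[i] * grad[i];
  }
  // Assumption: the single per-row moment tracks the average squared gradient.
  *moment_row += sum_sq / static_cast<float>(n);
  // w += lr / (sqrt(moment) + epsilon) * grad, with the quotient hoisted
  // out of the loop.
  const float scale = lr / (std::sqrt(*moment_row) + epsilon);
  for (std::size_t i = 0; i < n; ++i) {
    w[i] += scale * grad[i];  // plain mul + add; no std::fma, matching PyTorch
  }
}
```

Because float rounding happens after every operation, `lr * grad / d` and `(lr / d) * grad` can differ in the last bit, so the scalar and SIMD paths have to commit to the same order to produce bit-identical results.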
Contributor
This pull request was exported from Phabricator. Differential Revision: D19342865
facebook-github-bot pushed a commit to pytorch/pytorch that referenced this pull request on Feb 17, 2020
Summary: Pull Request resolved: #32974. Pull Request resolved: pytorch/FBGEMM#286. (Same summary as the PR description above.) Test Plan: CI. Reviewed By: wx1988. Differential Revision: D19342865. fbshipit-source-id: e950c16f2e1c4a2f2a3ef53b1705db373c67f341
Contributor
This pull request has been merged in 19c040c.
ttumiel pushed a commit to ttumiel/pytorch that referenced this pull request on Mar 4, 2020
Summary: Pull Request resolved: pytorch#32974. Pull Request resolved: pytorch/FBGEMM#286. (Same summary as the PR description above.) Test Plan: CI. Reviewed By: wx1988. Differential Revision: D19342865. fbshipit-source-id: e950c16f2e1c4a2f2a3ef53b1705db373c67f341
jspark1105 added a commit that referenced this pull request on Mar 21, 2020
Summary: Pull Request resolved: pytorch/pytorch#32974. Pull Request resolved: #286. (Same summary as the PR description above.) Reviewed By: wx1988. Differential Revision: D19342865. fbshipit-source-id: e950c16f2e1c4a2f2a3ef53b1705db373c67f341
pytorch-bot pushed a commit that referenced this pull request on Feb 26, 2026
Summary: Pull Request resolved: pytorch/pytorch#32974. Pull Request resolved: #286. (Same summary as the PR description above.) Reviewed By: wx1988. Differential Revision: D19342865. fbshipit-source-id: e950c16f2e1c4a2f2a3ef53b1705db373c67f341