
Fix precision issues in CPU remainder #38293

Closed
peterbell10 wants to merge 4 commits into pytorch:master from peterbell10:remainder-fix

Conversation

@peterbell10
Collaborator

@peterbell10 peterbell10 commented May 12, 2020

Together with #37758, this fixes #37743 and fixes #24861.

This follows the CUDA fix in #37758, vectorised using a blendv to replace the if conditionals.

Most of the complication is from remainder supporting at::Half where fmod doesn't. I've now got fmod working on Vec256<at::Half> as well as enabling half dispatch for fmod so it matches remainder.

I also added fmod support to Vec256<at::BFloat16> before realising that remainder doesn't support BFloat16 anyway. I could also enable BFloat16 if that's desirable. If not, I don't think Vec256<BFloat16> should be missing fmod anyway.
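For anyone skimming, here is a rough scalar sketch of the corrected logic (illustrative only, and the function name is mine; the actual kernel works on Vec256 lanes and expresses the branch as a mask fed to blendv):

```cpp
#include <cmath>

// Scalar sketch of the fix: start from fmod, then fold the result into the
// divisor's sign range. The old a - b * (a / b).floor() form loses precision
// and can overflow when a / b is large.
template <typename T>
T remainder_sketch(T a, T b) {
  T mod = std::fmod(a, b);
  // Add the divisor back when mod is nonzero and its sign differs from b's;
  // the vectorised kernel performs this selection per lane with blendv.
  if (mod != T(0) && ((b < T(0)) != (mod < T(0)))) {
    mod += b;
  }
  return mod;
}
```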

@dr-ci

dr-ci Bot commented May 12, 2020

💊 CI failures summary and remediations

As of commit 91fa644 (more details on the Dr. CI page):


  • 2/2 failures possibly* introduced in this PR
    • 2/2 non-CircleCI failure(s)

ci.pytorch.org: 2 failed



std::integral_constant<bool,
std::is_floating_point<T>::value ||
std::is_same<T, at::Half>::value> {
};
Contributor

Hmmmm... are you sure we don't already have a concept for this elsewhere in the codebase?

Collaborator Author

Not as far as I could see. Grepping for is_floating_point didn't show anything, and I also had a look in Half.h.
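For reference, here is a self-contained sketch of such a trait (the name `is_floating_point_or_half` is illustrative, not necessarily what the PR uses):

```cpp
#include <type_traits>
#include <c10/util/Half.h>  // at::Half

// Combines the standard floating-point check with at::Half, which
// std::is_floating_point does not cover.
template <typename T>
struct is_floating_point_or_half
    : std::integral_constant<bool,
          std::is_floating_point<T>::value ||
          std::is_same<T, at::Half>::value> {};
```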

static_assert(std::is_same<float_t_abs, T>::value, "float_t_abs must be T");
// Specifically deal with floating-point because the generic code above won't handle -0.0 (which should result in
// 0.0) properly.
return map(std::abs);
Contributor

huh, what's going on here?

Collaborator Author

There is no std::abs(at::Half) so this allows implicit conversions to float.
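A minimal standalone illustration of that conversion (not the PR's code):

```cpp
#include <cmath>
#include <c10/util/Half.h>

// at::Half converts implicitly to float, so a float-taking expression can be
// applied to Half values even though std::abs has no at::Half overload.
float abs_via_float(at::Half h) {
  float f = h;         // implicit at::Half -> float conversion
  return std::abs(f);  // resolves to std::abs(float)
}
```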

@ezyang
Contributor

ezyang commented May 12, 2020

It would be great if we could have a little before and after perf comparison. The new implementation is going to be slower but it would be good to have a little napkin saying how much slower.

@ezyang ezyang requested a review from xwang233 May 12, 2020 15:32
@ezyang
Contributor

ezyang commented May 12, 2020

@xwang233 do you think you could help review this?

@ezyang
Contributor

ezyang commented May 12, 2020

cc @andreaskoepf perhaps, too

@ailzhang ailzhang added the triaged label (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) May 12, 2020
Comment thread aten/src/ATen/native/cpu/BinaryOpsKernel.cpp Outdated
@peterbell10
Collaborator Author

My benchmark shows the new version is ~50-60% slower for both float and double tensors.
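Roughly this kind of setup can reproduce the comparison (a sketch using the ATen C++ API; not necessarily the benchmark used here, and the tensor size and iteration count are arbitrary):

```cpp
#include <ATen/ATen.h>
#include <chrono>
#include <cstdio>

// Times at::remainder on a large float tensor; run against builds with and
// without this PR to compare throughput.
int main() {
  auto a = at::randn({1 << 24});
  auto b = at::randn({1 << 24}).abs() + 1.0;  // keep divisors away from zero
  auto r = at::remainder(a, b);               // warm-up
  const int iters = 100;
  auto t0 = std::chrono::steady_clock::now();
  for (int i = 0; i < iters; ++i) {
    r = at::remainder(a, b);
  }
  auto t1 = std::chrono::steady_clock::now();
  std::printf("%.3f ms/iter\n",
      std::chrono::duration<double, std::milli>(t1 - t0).count() / iters);
  return 0;
}
```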

Collaborator

@xwang233 xwang233 left a comment

Thanks for the PR! Since the remainder tests pass, this LGTM. I like the use of XOR instead of != in the mask.
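For readers unfamiliar with the trick: Vec256 comparisons produce per-lane all-ones/all-zeros masks, so XOR-ing two comparison results flags the lanes where exactly one condition holds. Roughly (Vec256-style pseudocode; operator availability is assumed and this is not the exact kernel code):

```cpp
// Lanes where mod is nonzero and its sign differs from b's get mod + b;
// all other lanes keep mod. blendv selects per lane based on the mask.
const auto zero = Vec256<scalar_t>(0);
auto mod = a.fmod(b);
auto mask = (mod != zero) & ((b < zero) ^ (mod < zero));
mod = Vec256<scalar_t>::blendv(mod, mod + b, mask);
```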

Comment thread aten/src/ATen/native/cpu/BinaryOpsKernel.cpp
Vec256<BFloat16> expm1() const {
  return map(Sleef_expm1f8_u10);
}
Vec256<BFloat16> fmod(const Vec256<BFloat16>& q) const {
Collaborator

Is the BFloat16 fmod specialization used somewhere? I saw only the dispatch code in BinaryOpsKernel.cpp for kHalf but not for kBFloat16?

Collaborator Author

It's not currently used, but since it's a floating point type it should have that operation available. I can also enable BFloat16 dispatch if that's desirable.
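If that were wanted, the dispatch side could look roughly like this (a sketch mirroring how other CPU kernels are wired up; the function name, macro choice, and iterator usage are assumptions, not code from this PR):

```cpp
#include <cmath>
#include <ATen/Dispatch.h>
#include <ATen/native/TensorIterator.h>
#include <ATen/native/cpu/Loops.h>

namespace at { namespace native {

// Hypothetical sketch: enable BFloat16 (alongside Half) for fmod's CPU dispatch.
void fmod_kernel_sketch(TensorIterator& iter) {
  AT_DISPATCH_FLOATING_TYPES_AND2(kHalf, kBFloat16, iter.dtype(), "fmod_cpu", [&]() {
    cpu_kernel(iter, [](scalar_t a, scalar_t b) -> scalar_t {
      return std::fmod(a, b);  // Half/BFloat16 convert implicitly to float here
    });
  });
}

}} // namespace at::native
```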

- Vec256<scalar_t> r = a - b * (a / b).floor();
- return r;
+ auto mod = a.fmod(b);
+ const auto zero = Vec256<scalar_t>(0);
Collaborator

I noticed that other functions in BinaryOpsKernel.cpp declare constants outside the lambda, e.g. before the cpu_kernel_vec() call, but I doubt this has a perf impact...

@ezyang
Contributor

ezyang commented May 13, 2020

The perf loss is regrettable, but correctness first!

Contributor

@facebook-github-bot facebook-github-bot left a comment

@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@ezyang merged this pull request in 0a159b0.

laurentdupin pushed a commit to laurentdupin/pytorch that referenced this pull request Apr 24, 2026
Summary:
Together with pytorch#37758, this fixes pytorch#37743 and fixes pytorch#24861.

This follows the CUDA fix in pytorch#37758, vectorised using a `blendv` to replace the if conditionals.

Most of the complication is from `remainder` supporting `at::Half` where `fmod` doesn't. I've now got `fmod` working on `Vec256<at::Half>` as well as enabling half dispatch for `fmod` so it matches `remainder`.

I also added `fmod` support to `Vec256<at::BFloat16>` before realising that `remainder` doesn't support `BFloat16` anyway. I could also enable `BFloat16` if that's desirable. If not, I don't think `Vec256<BFloat16>` should be missing `fmod` anyway.
Pull Request resolved: pytorch#38293

Differential Revision: D21539801

Pulled By: ezyang

fbshipit-source-id: abac6a3ed2076932adc459174cd3d8d510f3e1d5

Labels

Merged · open source · triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

torch.remainder gives a remainder larger than the divisor
Float overflow in torch.remainder

7 participants