Replace (val / 2) with (val * 0.5) in Jit #24584
|
FWIW I tried to do this or a similar FP optimization in the past but didn't bother with it because it wasn't very clear how useful it would be; the framework doesn't have a lot of FP code. But now we have WPF in .NET Core: its PresentationCore.dll and PresentationFramework.dll contain a bit more than 100 hits. So it's quite useful; while many developers do this optimization manually, there are still enough opportunities. I haven't noticed any regressions. In theory this can block CSE because VN doesn't understand that |
|
@mikedn @tannergooding sorry for the delayed response. I've refactored it to |
|
Also, I did some testing locally, e.g. https://gist.github.com/EgorBo/866a49334291c1ac3b108eb9341681ae (similar for double and for non-power-of-two constants) |
|
@mikedn @tannergooding could you please take a look one more time? I think I handled all the cases. Also added a test. Also: should I take care of Big Endian? |
|
A small Roslyn-based script to find places where this optimization can be applied (found some in various math/graphics related C# repositories): https://gist.github.com/EgorBo/74b034fe1936c43fcd0b42934322557c |
Erm, why do you need such a contraption? Perhaps you're not aware how to run JIT diffs? |
|
@mikedn I wanted to quickly find such places without even downloading repositories (via HttpClient) 🙂. Also I made a list of patterns that LLVM is able to optimize (InstCombine transforms) and looking for them in those repos. |
That's not going to find the interesting case, where expressions like Anyway, here's an x64 FX diff: As I mentioned in a previous post, it's a bit of a strange case because |
|
@mikedn yeah, but as you mentioned earlier there are cases in |
|
Unfortunately running diffs on wpf assemblies is a bit more tricky at the moment so I haven't done it again. The ~100 hits estimation from my previous post on the matter likely still stands. |
sandreenko
left a comment
The change looks good, thanks @EgorBo.
However, I am not 100% sure that it is worth taking (with the current morph state and the lack of a separated expression transformation optimizer), @BruceForstall?
    return (bits < 0x7FF0000000000000) && (bits != 0) && ((bits & 0x7FF0000000000000) != 0);
}

bool FloatingPointUtils::isNormal(float x)
I do not like these bit checks, but without C++ isnormal I do not see any better solution.
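For reference, the single-precision counterpart of the bit check quoted above could look roughly like the sketch below. It is an illustration only, not the code from this PR: the helper names isNormalBits and agreesWithStdIsNormal are made up here, and the sketch assumes a 32-bit IEEE 754 float and that the sign bit is masked off before the comparisons.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>

// Hypothetical helper (not FloatingPointUtils::isNormal from the PR).
// Returns true when x is neither zero, subnormal, infinite nor NaN.
bool isNormalBits(float x)
{
    static_assert(sizeof(float) == sizeof(std::uint32_t), "expects a 32-bit IEEE 754 float");

    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof(bits)); // well-defined way to read the bit pattern
    bits &= 0x7FFFFFFFu;                  // assumption: drop the sign bit first

    // bits <  0x7F800000        -> exponent is not all ones (rules out Inf/NaN)
    // bits != 0                 -> not +/-0
    // (bits & 0x7F800000) != 0  -> exponent is not all zeros (rules out subnormals)
    return (bits < 0x7F800000u) && (bits != 0) && ((bits & 0x7F800000u) != 0);
}

// Convenience check against the standard library classification.
bool agreesWithStdIsNormal(float x)
{
    return isNormalBits(x) == std::isnormal(x);
}
```

Calling agreesWithStdIsNormal on values such as 1.0f, 0.0f, the smallest subnormal, infinity and NaN is a quick way to sanity-check the masks.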
|
For the non-power-of-two float (32-bit) divides it would also always be a win to perform a reciprocal multiply operation using a (64-bit) multiply and conversion back to float (32-bit). Turns out it isn't a win :-( |
Why would you do that rather than just doing the 32-bit reciprocal multiplication? For cases where it is known to be equivalent, just keeping it as a single-precision float would be more efficient. |
|
It was to cover the non-power-of-two divide by constant case: However it turns out that the convert instructions are pretty slow, so I believe that this transformation loses: |
|
@briansull nice try anyway 🙂 |
{
    oper = GT_MUL;
    tree->ChangeOper(oper);
    op2->AsDblCon()->gtDconVal = 1.0 / divisor;
It might be worth noting that this is safe and doing the single operation in single precision isn't required.
The paper Innocuous Double Rounding of Basic Arithmetic Operations provides a proof that a single primitive operation done to at least twice the precision of the target format does not incur error due to double-rounding (hence for float divisor doing (float)(1.0 / divisor) is the same as (1.0f / divisor)).
This is not a safe thing to do across multiple operations (you must downcast back to float after each individual operation) nor is it safe if one of the inputs could not be exactly represented as a float (e.g. if you have double divisor and (float)divisor != divisor).
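The single-operation claim can be spot-checked with a small sketch like the one below. It is only an illustration under assumptions: the function names are hypothetical, the divisors are an arbitrary sample of float values, and the code must be built without fast-math style flags so that both expressions are evaluated as written.

```cpp
#include <cassert>

// Reciprocal computed in double precision, then rounded once to float.
float reciprocalViaDouble(float divisor)
{
    return static_cast<float>(1.0 / static_cast<double>(divisor));
}

// Reciprocal computed directly in single precision.
float reciprocalInSingle(float divisor)
{
    return 1.0f / divisor;
}

// Counts sampled float divisors for which the two reciprocals differ.
// Per the double-rounding result cited above, this should stay at zero
// because the divisor is exactly representable as a float.
int countReciprocalMismatches(int count)
{
    assert(count > 0);
    int mismatches = 0;
    for (int i = 1; i <= count; ++i)
    {
        const float divisor = static_cast<float>(i) * 0.37f; // arbitrary non-power-of-two samples
        if (reciprocalViaDouble(divisor) != reciprocalInSingle(divisor))
        {
            ++mismatches;
        }
    }
    return mismatches;
}
```

Note that this only covers one division per rounding step; chaining several double-precision operations before converting back to float is exactly the case the comment above warns about.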
|
Thanks @EgorBo. |
vmulss/vmulsd has better latency and throughput than vdivss/vdivsd, at least on the hardware I have, e.g. on my MacBook's Haswell:
So if a divisor is a constant power of two we can optimize it, e.g.:
See https://godbolt.org/z/rz9h4E (clang, gcc, msvc, x86, AMD64, AArch64 - everywhere this optimization is applied. Btw, LLVM also helps Mono to optimize this case for C#)
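To illustrate why the rewrite is restricted to power-of-two divisors, here is a rough C++ sketch (not the JIT code from this PR; isPowerOfTwo and countMismatches are hypothetical helper names). It compares x / c against x * (1 / c) over a range of integer-valued float inputs: for c = 2^n the reciprocal is exact, so both forms round identically, while for a divisor such as 3 the rounded reciprocal already changes some results (for example x = 5).

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: true when |c| is exactly 2^n for a normal, finite c.
bool isPowerOfTwo(float c)
{
    if (!std::isnormal(c))
    {
        return false;
    }
    int exponent = 0;
    // frexp returns a mantissa in [0.5, 1); it is exactly 0.5 only for powers of two.
    return std::fabs(std::frexp(c, &exponent)) == 0.5f;
}

// Counts x in 1..count (as floats) where dividing by `divisor` and
// multiplying by its single-precision reciprocal give different results.
int countMismatches(float divisor, int count)
{
    assert(divisor != 0.0f);
    const float reciprocal = 1.0f / divisor;
    int mismatches = 0;
    for (int i = 1; i <= count; ++i)
    {
        const float x = static_cast<float>(i);
        if (x / divisor != x * reciprocal)
        {
            ++mismatches;
        }
    }
    return mismatches;
}
```

With these helpers, countMismatches(8.0f, 1000) is expected to return 0, whereas countMismatches(3.0f, 1000) returns a non-zero count, which is why a compiler cannot apply the multiply form to arbitrary constants without relaxed floating-point semantics.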
I wrote a small benchmark:
and the results are (Haswell):
/cc: @tannergooding