Turn float comparisons into primitive operations #9945
Conversation
For reference, here's what current C compilers produce: https://godbolt.org/z/3xYK39

Yes, I took a look at that too, but the problem is that eq and neq need to be treated specially because of NaN: there are no flags available that give you those conditions directly. So for simplicity, I wanted to generate the same code for all of the comparisons. It's also nice that icc produces this code for all of them. I can probably do some benchmarks, but I'm tempted to rely on icc to guide us a little bit.
The code looks good to me, although I haven't reviewed the non-x86 backends in detail. I'm still not quite convinced by the output, though: the code generated by icc is highly dependent on the exact input. When I change the test functions to store their results in an integer (see https://godbolt.org/z/x4bbjd), icc often generates branching code. Whether it generates branches or not depends on the signedness of the result type (!), despite the answer always being 0 or 1. So I don't think we can use it as an oracle for good code.

Finally, would you mind re-running your original benchmark using an array whose elements are all equal? You've shown a significant speed-up in the case where the branch predictor performs badly (random data), but I'd like to also verify that there's not (much) slowdown in the more common case where it performs well.
Sounds like we should be reporting bugs for the Intel compiler 😄 I guess we'll have to do benchmarks across the following categories -- Predictable: Yes/No. Hopefully that will give us enough coverage to make a decision.
I did some of these benchmarks. Sadly, I don't have a current non-laptop Intel box available right now, so you'll have to live with the laptop Kaby Lake benchmarks (the variance is pretty high on those). The Zen 2 numbers are super consistent, though.
And now some numbers for another Intel processor (Gold 6244). The variance on Intel processors seems to be higher; I don't know why that is the case (I may try a much larger sample size to figure this one out). Note that "det" is the previous gist with the random input replaced by a constant.
I redid everything with 10 samples each now. I still don't trust the Intel numbers because of the high variance; if someone has a better method to try, that would be awesome.
@stedolan: any thoughts about these results?
ping @stedolan
Sorry this took me so long to get to! Those benchmarks are pretty convincing; this looks like a clear improvement over what's there.
@alainfrisch, since you generally care a lot about float performance as well, do you want to review this patch?
alainfrisch left a comment
I reviewed only the x86/amd64 backends (having no knowledge of other architectures), and the PR looks ok to me. I've left minor superficial remarks (plus one question). Thanks for providing the results with the micro-benchmark; they look great, and I'm in favor of merging the PR.
How much testing/benchmarking has been done on non-Intel architectures?
asmcomp/amd64/arch.ml (Outdated)

(* Certain float conditions aren't represented directly in the opcode for
   float comparison, so we have to swap the arguments. The swap information
   is also needed downstream because one of the arguments is clobbered. *)
let float_cond_and_swap cond a0 a1 =
Nitpicking, but it seems better to have something like:

let float_cond_and_need_swap = function
  | CFeq -> EQf, false
  | CFneq -> NEQf, false
  ...
  | CFnge -> NLef, true

which avoids the risk of incoherence between deciding whether to swap and actually swapping. There are only two call sites for the function: one doesn't need to perform the actual swapping (and passes unit arguments), so it can use this function directly; the other can do the swapping itself.
- let max_register_pressure = function
+ let max_register_pressure =
+   let consumes ~int ~float =
This is a nice approach, but even with strong enough inlining, the compiler will allocate those arrays every time consumes is called (because arrays are mutable). The impact on compile-time performance is probably negligible, but it might be worth confirming it empirically.
I believe this function is called from add_superpressure_regs in spill.ml (and nowhere else?), and there we also have an Array.make of the same length -- so I guess we'd allocate a bit more, but we're already doing array allocation each time we call max_register_pressure. I think it's probably fine not to do additional benchmarks for this, unless people feel strongly about it.
I don't think benchmarks would show any difference, because the previous code also allocates a new array each time. We don't have immutable arrays in the language, so even constant arrays must be re-allocated each time if they're not defined at toplevel.
asmcomp/amd64/emit.mlp (Outdated)

let cond, a0, a1, _ = float_cond_and_swap cmp (arg i 0) (arg i 1) in
I.cmpsd cond a1 a0;
I.movd a0 (res i 0);
I.neg (res i 0);
Sorry for the very dumb question (showing, btw, that I should really not be the one reviewing this PR...), but why is the negation necessary?
cmpsd's result is either all one bits (-1) or zero, but we actually want zero or one, so neg is the easiest way to do the conversion (and, IIRC, the instruction encoding of neg is also pretty short).
Thanks! (It was pretty obvious from https://www.felixcloutier.com/x86/cmpsd but I looked at the wrong pseudo code...)
Other than anything done by the automated checks (which I guess is not much), I haven't really done any non-Intel testing. What do people typically do? I don't have any environments where I could easily test.
@alainfrisch: I think the non-Intel architectures will be checked automatically when merged. I would think that performance testing is not necessary, since we're mostly reducing the instruction count (or leaving instructions unchanged).
This seems to be a good candidate for INRIA's CI (cf. https://github.com/ocaml/ocaml/blob/trunk/HACKING.adoc#inrias-continuous-integration-ci).
I managed to sign up and get permissions for the ocaml project, but there's no "Build with parameters" at https://ci.inria.fr/ocaml/job/precheck/, like the instructions say -- maybe they need to be updated?
Strange, it seems to be there for me. Anyway, I launched the precheck job: https://ci.inria.fr/ocaml/job/precheck/554/
Thanks! If I'm reading that right, it looks like it was successful.
Yes, that's correct (the only supported architecture that is not tested in the precheck job is ...). Was there any other outstanding issue apart from testing on non-Intel architectures?
No, I don't believe there are any other outstanding things.
I haven't looked at the PR in detail, but both @stedolan and @alainfrisch were in favor of merging and precheck has passed, so let's merge. Do you want to keep the history, or shall we just squash everything together?
I don't see any test in the PR. Are we sure every case of every port has been exercised? My general take is that this treatment of FP comparisons is good to have on the platforms where it is easy to support (x86-64, ARM), but the i386 code is a mess and PowerPC is so-so. So I would have preferred a per-platform implementation based on platform-specific instructions.
Co-authored-by: Alain Frisch <alain@frisch.fr>
Co-authored-by: Vincent Laviron <vincent.laviron@gmail.com>
Rebased and all checks pass. Thanks!
@lthls As the person who reviewed this PR most recently, do you mind giving an official approval? We can then move to merge.
@smuenzel do you mind if we squash-merge?
Squash sounds great. Thanks!
let cond, need_swap = float_cond_and_need_swap cmp in
let a0, a1 = if need_swap then arg i 1, arg i 0 else arg i 0, arg i 1 in
I.cmpsd cond a1 a0;
I.movd a0 (res i 0);
This should be movq not movd when argument and result are in 64-bit registers.
I believe movd is right, strange as it sounds. But let me try to dig out a reference since I'm currently relying on some hazy memory.
I think this bug report is the correct reference:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43215
Relevant quote:

"However, since AMD64 uses movd instead of movq, some non-GNU assemblers may not support movq. Why change it when nothing is broken?"
So apparently assembler support for movd is more universal than movq, since movq wasn't in the initial amd64 spec.
But I guess

movd eax, xmm0
neg eax

is even better, since it has a shorter encoding (by two bytes, I think?). Probably doesn't make enough of a difference to warrant another PR.
OK, thanks for checking.
Currently, float comparisons that produce a bool result are compiled into jumps producing the correct bool value.
Further, due to the NaN conditions, we actually end up with multiple jumps and condition checks, not just one.
This patch eliminates these jumps by directly using float comparison operations available on the target architecture.
It is a work in progress, since support for all architectures still needs to be added.
On the following benchmark on amd64, the patched version is about twice as fast as the previous one on an AMD Ryzen 3700X. On an Intel Kaby Lake, the performance improvement is more modest, around 20 to 30% (I'm not adding those numbers here, since they were measured on my laptop and I don't trust the power-saving logic).
https://gist.github.com/smuenzel/77c2fe63072f78237a7b419d548e1292
with patch:
0m1.770s
0m1.756s
0m1.770s
0m1.757s
without patch:
0m3.395s
0m3.391s
0m3.400s
0m3.426s