
Turn float comparisons into primitive operations#9945

Merged
nojb merged 24 commits into ocaml:trunk from smuenzel:perf--compf
Jan 24, 2023

Conversation

@smuenzel
Contributor

Currently, float comparisons that produce a bool result are compiled into conditional jumps that materialize the correct bool value.
Further, because of the NaN conditions, we actually end up with multiple jumps and condition checks, not just one.
This patch eliminates these jumps by directly using the float comparison operations available on the target architecture.
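As a minimal OCaml sketch (hypothetical code, not the benchmark gist linked below): a loop where the bool result of a float comparison is consumed as a value, which is exactly the shape that previously forced jump-based lowering.

```ocaml
(* Hypothetical sketch: the result of [x < y] is used as a value, not a
   branch target. Before this patch, materializing the bool required
   conditional jumps (plus extra NaN checks); with it, the comparison
   can lower to a branch-free float-compare instruction on supported
   targets. *)
let count_less (xs : float array) (ys : float array) : int =
  let count = ref 0 in
  for i = 0 to Array.length xs - 1 do
    (* [Bool.to_int] keeps the accumulation itself branch-free too. *)
    count := !count + Bool.to_int (xs.(i) < ys.(i))
  done;
  !count
```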

It is a work in progress, since support for all architectures still needs to be added.

On the following benchmark on amd64, it's about twice as fast as the previous version when using an AMD Ryzen 3700X. On an Intel Kaby Lake, the performance improvement is more modest, around 20 to 30% (I'm not adding the numbers here, since that was done on my laptop, and I don't trust the power saving logic).

https://gist.github.com/smuenzel/77c2fe63072f78237a7b419d548e1292

with patch:
0m1.770s
0m1.756s
0m1.770s
0m1.757s

without patch:
0m3.395s
0m3.391s
0m3.400s
0m3.426s

@stedolan
Contributor

Adding Icompf to match Icomp looks useful, but for x86 have you considered using ucomisd + setcc instead of cmpsd + movd + neg? You don't need the extra register, you'd be able to share more code with the existing float-comparison logic, and https://uops.info/table.html suggests it's probably faster as well.

@smuenzel
Contributor Author

For reference, here's what current C compilers produce: https://godbolt.org/z/3xYK39

Yes, I took a look at that too, but eq and neq need to be treated specially because of NaN -- there aren't any flag combinations available that give you those conditions directly. So for simplicity, I wanted to generate the same code for all of the comparisons. It's also nice that icc produces this code for all of them.
(When I say "need" to be treated specially, I mean that ideally they should have the same performance as the other conditions. In the godbolt link, clang tries to do that by using the cmpsd method for eq and neq, and setcc otherwise.)
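For reference, a small OCaml check of the NaN semantics in question: every comparison involving NaN is false, except `<>`, which is true.

```ocaml
(* NaN is unordered: every float comparison involving it yields false,
   except [<>], which yields true. This is why eq and neq need extra
   handling -- no single condition code captures "equal and ordered". *)
let () =
  assert (not (nan = nan));
  assert (nan <> nan);
  assert (not (nan < 0.0));
  assert (not (nan >= 0.0));
  (* [compare], by contrast, treats NaN as equal to itself. *)
  assert (compare nan nan = 0)
```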

I can probably do some benchmarks -- but I'm tempted to rely on icc to guide us a little bit.

@smuenzel smuenzel changed the title [WIP] Turn float comparisons into primitive operations Turn float comparisons into primitive operations Sep 29, 2020
@stedolan
Contributor

stedolan commented Oct 7, 2020

The code looks good to me, although I haven't reviewed the non-x86 backends in detail. I'm still not quite convinced by the output, though:

The code generated by icc is highly dependent on the exact input. When I change the test functions to store their results in an integer (see https://godbolt.org/z/x4bbjd), icc often generates branching code. Whether it generates branches or not depends on the signedness of the result type (!), despite the answer always being 0 or 1. So I don't think we can use it as an oracle for good code.

Since the output of cmpsd consumes an extra register, it can potentially cause a spill in situations where ucomisd would not. In the absence of serious benchmarking of one versus the other (and honestly, I can't imagine there's much difference), I'd lean towards the one that spills less.

Finally, would you mind re-running your original benchmark using an array whose elements are all equal? You've shown a significant speed-up in the case where the branch predictor performs badly (random data), but I'd like to also verify that there's not (much) slowdown in the more common case where it performs well.

@smuenzel
Contributor Author

smuenzel commented Oct 8, 2020

Sounds like we should be reporting bugs for the intel compiler 😄

I guess we'll have to do the combination of benchmarks in the following categories:

Predictable: Yes/No
Implementation: CMPSD/SETcc/Jcc
Architecture: Zen2/Skylake

Hopefully that will give us enough coverage to make a decision.

@smuenzel
Contributor Author

smuenzel commented Nov 6, 2020

I did some of these benchmarks. Sadly, I don't have a current non-laptop Intel box available right now, so you'll have to live with the laptop Kaby Lake benchmarks (the variance is pretty high on those). The Zen2 numbers are super consistent, though.
All benchmarks below are median of 5, runtime in seconds.
current is the base of my PR, compsd is this PR, and setcc is https://github.com/smuenzel/ocaml/tree/perf--compf-setcc
Note that setcc still has branches for eq and neq. Maybe that's also why it's a bit slower (on Zen2; on Kaby, it's within variance).

revision| Kaby (det)| Kaby (rand)| Zen2 (det)| Zen2 (rand)
--------+-----------+------------+-----------+------------
current |    3.79   |     4.65   |   2.329   |    3.403
setcc   |    3.73   |     3.61   |   1.858   |    1.851
compsd  |    3.63   |     3.75   |   1.770   |    1.797

@smuenzel
Contributor Author

smuenzel commented Nov 7, 2020

And now some numbers for another Intel processor (Gold 6244). The variance on Intel processors seems to be higher; I don't know why that is the case (I may try to get a much larger sample size to figure this one out). Note that "det" is the previous gist with the random data replaced by a constant.

revision| Cascade (det)| Cascade (rand)| 
--------+--------------+---------------+
current |      2.566   |       3.476   |
setcc   |      2.359   |       2.340   |
compsd  |      2.345   |       2.428   |

@smuenzel
Contributor Author

smuenzel commented Nov 7, 2020

I redid everything with 10 samples each now. I still don't trust the Intel numbers because of the high variance; if someone has a better method to try, that would be awesome.
On Zen2, compsd is clearly better than setcc, but only by something like 3%. It is pretty nice to not have any branches in this code at all though.

revision| Cascade (det)| Cascade (rand)| Kaby (det)  | Kaby (rand) | Zen2 (det)  | Zen2 (rand)
--------+--------------+---------------+-------------+-------------+-------------+------------
current |  2.556±0.030 |   3.541±0.134 | 3.569±0.117 | 4.530±0.200 | 2.314±0.005 | 3.413±0.007
compsd  |  2.437±0.129 |   2.407±0.119 | 3.372±0.275 | 3.654±0.176 | 1.802±0.006 | 1.793±0.057
setcc   |  2.414±0.109 |   2.439±0.134 | 3.367±0.099 | 3.272±0.128 | 1.853±0.005 | 1.853±0.005

@smuenzel
Contributor Author

@stedolan : any thoughts about these results?

@smuenzel
Contributor Author

ping @stedolan

@stedolan stedolan marked this pull request as ready for review January 20, 2021 18:06
@stedolan
Contributor

Sorry this took me so long to get to! Those benchmarks are pretty convincing, this looks like a clear improvement over what's there.

@smuenzel smuenzel force-pushed the perf--compf branch 2 times, most recently from a24349c to 0f35000 Compare January 25, 2021 03:23
@smuenzel
Contributor Author

@alainfrisch , since you generally care a lot about float performance as well, do you want to review this patch?

Contributor

@alainfrisch alainfrisch left a comment

I reviewed only the x86/amd64 backends (having no knowledge of other architectures), and the PR looks ok to me. I've left minor superficial remarks (plus one question). Thanks for providing the results with the micro-benchmark; they look great, and I'm in favor of merging the PR.

How much testing/benchmarking has been done on non-Intel architectures?

(* Certain float conditions aren't represented directly in the opcode for
float comparison, so we have to swap the arguments. The swap information
is also needed downstream because one of the arguments is clobbered. *)
let float_cond_and_swap cond a0 a1 =
Contributor

Nitpicking, but it seems better to have something like:

let float_cond_and_need_swap = function
  | CFeq -> EQf, false
  | CFneq -> NEQf, false
  ...
  | CFnge -> NLef, true

which avoids the risk of being incoherent between the need to swap and the actual swapping. There are only two call sites for the function: one doesn't need to perform the actual swapping (and passes unit arguments), so it can use this function directly; the other can do the swapping itself.

Contributor Author

done


let max_register_pressure = function
let max_register_pressure =
let consumes ~int ~float =
Contributor

This is a nice approach, but even with strong enough inlining, the compiler will allocate those arrays every time consumes is called (because arrays are mutable). The impact on compilation-time performance is probably negligible, but it might be worth confirming it empirically.

Contributor Author

I believe this function is called from add_superpressure_regs in spill.ml (and nowhere else?), and there, we also have an Array.make of the same length -- so I guess we'd allocate a bit more, but we're already doing array allocation each time we call max_register_pressure. I think it's probably fine to not do additional benchmarks for this, unless people feel strongly about it.

Contributor

I don't think benchmarks would show any difference, because the previous code also allocates a new array each time. We don't have immutable arrays in the language, so even constant arrays must be re-allocated each time if they're not defined at toplevel.
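A standalone sketch of this point, observing the allocation via physical equality:

```ocaml
(* An array literal in a function body is re-allocated on every call,
   because arrays are mutable; only a toplevel definition is shared. *)
let fresh () = [| 13; 9 |]   (* new array on each call *)
let shared = [| 13; 9 |]     (* allocated once *)

let () =
  assert (fresh () != fresh ());   (* two physically distinct blocks *)
  assert (shared == shared)        (* the same block every time *)
```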

let cond, a0, a1, _ = float_cond_and_swap cmp (arg i 0) (arg i 1) in
I.cmpsd cond a1 a0;
I.movd a0 (res i 0);
I.neg (res i 0);
Contributor

Sorry for the very dummy question (showing, btw, that I should really not be the one reviewing this PR...), but why is the negation necessary?

Contributor Author

cmpsd's result is either all one bits (-1) or zero, but we actually want zero or one, so neg is the easiest way to do the conversion (and, iirc, the instruction encoding of neg is also pretty short)
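In integer terms, a sketch of that conversion (illustrative OCaml, not compiler code):

```ocaml
(* cmpsd writes an all-ones mask (-1 as a signed integer) for true and
   0 for false; the desired encoding is 1 for true and 0 for false, so
   a single negation converts the mask to the expected value. *)
let bool_of_mask (mask : int) : int = - mask

let () =
  assert (bool_of_mask (-1) = 1);
  assert (bool_of_mask 0 = 0)
```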

Contributor

Thanks! (It was pretty obvious from https://www.felixcloutier.com/x86/cmpsd but I looked at the wrong pseudo code...)

@smuenzel
Contributor Author

Other than anything done by the automated checks (which I guess is not much), I haven't really done any non-Intel testing. What do people typically do? I don't have any environments where I could easily test.

@smuenzel
Contributor Author

@alainfrisch: I think the non-Intel architectures will be checked automatically when merged. I would think that performance testing is not necessary, since we're mostly reducing instruction count (or not changing instructions).

@alainfrisch
Contributor

This seems to be a good candidate for INRIA's CI (cf https://github.com/ocaml/ocaml/blob/trunk/HACKING.adoc#inrias-continuous-integration-ci).

@smuenzel
Contributor Author

I managed to sign up and get permissioned for the ocaml project, but there's no "Build with parameters" at https://ci.inria.fr/ocaml/job/precheck/, like it says in the instructions -- maybe they need to be updated?

@smuenzel smuenzel force-pushed the perf--compf branch 2 times, most recently from 5af23b3 to 2c9073f Compare March 30, 2021 01:32
@nojb
Contributor

nojb commented Mar 30, 2021

I managed to sign up and get permissioned for the ocaml project, but there's no "Build with parameters" at https://ci.inria.fr/ocaml/job/precheck/, like it says in the instructions -- maybe they need to be updated?

Strange, it seems to be there for me. Anyway, I launched the precheck job: https://ci.inria.fr/ocaml/job/precheck/554/

@smuenzel
Contributor Author

Thanks! If I'm reading that right, it looks like it was successful

@nojb
Contributor

nojb commented Mar 30, 2021

Thanks! If I'm reading that right, it looks like it was successful

Yes, that's correct (the only supported architecture that is not tested in the precheck is RISC-V, which is not yet plugged into it -- that one will be tested when merged).

Was there any other outstanding issue apart from testing on non-intel architectures?

@smuenzel
Contributor Author

No, I don't believe that there are any other outstanding things.

@nojb
Contributor

nojb commented Mar 30, 2021

I haven't looked at the PR in detail, but both @stedolan and @alainfrisch were in favor of merging and precheck has passed, so let's merge. Do you want to keep the history, or shall we just squash everything together?

@xavierleroy
Contributor

I don't see any test in the PR. Are we sure every case of every port has been exercised?

My general take is that this treatment of FP comparisons is good to have for the platforms where it is easy to support (x86-64, ARM) but the i386 code is a mess and PowerPC is so-so. So I would have preferred a per-platform implementation based on platform-specific instructions.

@smuenzel
Contributor Author

FYI: #11904 has now been merged.

Rebased and all checks pass. Thanks!

@nojb
Contributor

nojb commented Jan 23, 2023

@lthls As the person that reviewed this PR most recently, do you mind giving an official approval? We can then move to merge.

@nojb
Contributor

nojb commented Jan 24, 2023

@smuenzel do you mind if we squash-merge?

@smuenzel
Contributor Author

Squash sounds great. Thanks!

@nojb nojb merged commit ebc23f1 into ocaml:trunk Jan 24, 2023
let cond, need_swap = float_cond_and_need_swap cmp in
let a0, a1 = if need_swap then arg i 1, arg i 0 else arg i 0, arg i 1 in
I.cmpsd cond a1 a0;
I.movd a0 (res i 0);
Contributor

This should be movq not movd when argument and result are in 64-bit registers.

Contributor Author

I believe movd is right, strange as it sounds. But let me try to dig out a reference since I'm currently relying on some hazy memory.

Contributor Author

I think this bug report is the correct reference:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43215

Relevant quote:

However, since AMD64 uses movd instead
of movq, some non-GNU assemblers may not support movq. Why change it
when nothing is broken?

So apparently assembler support for movd is more universal than movq, since movq wasn't in the initial amd64 spec.

But I guess

movd eax,xmm0
neg eax

is even better, since it has a shorter encoding (by two bytes, I think?). Probably doesn't make enough of a difference to be worth another PR, though.

Contributor

ok, thanks for checking.
