
Turn float comparisons into primitive operations#9945

Merged
nojb merged 24 commits into ocaml:trunk from smuenzel:perf--compf
Jan 24, 2023

Conversation

@smuenzel
Contributor

Currently, float comparisons that produce a bool result are compiled into conditional jumps that materialize the correct bool value.
Further, because of the NaN conditions, we actually end up with multiple jumps and condition checks, not just one.
This patch eliminates these jumps by directly using the float comparison operations available on the target architecture.
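As a minimal OCaml sketch (hypothetical code, not the benchmark gist linked below): a loop where the bool result of a float comparison is consumed as a value, which is exactly the shape that previously forced jump-based lowering.

```ocaml
(* Hypothetical sketch: the result of [x < y] is used as a value, not a
   branch target. Before this patch, materializing the bool required
   conditional jumps (plus extra NaN checks); with it, the comparison
   can lower to a branch-free float-compare instruction on supported
   targets. *)
let count_less (xs : float array) (ys : float array) : int =
  let count = ref 0 in
  for i = 0 to Array.length xs - 1 do
    (* [Bool.to_int] keeps the accumulation itself branch-free too. *)
    count := !count + Bool.to_int (xs.(i) < ys.(i))
  done;
  !count
```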

It is a work in progress, since support for all architectures still needs to be added.

On the following benchmark on amd64, it's about twice as fast as the previous version when using an AMD Ryzen 3700X. On an Intel Kaby Lake, the performance improvement is more modest, around 20 to 30% (I'm not adding the numbers here, since that was done on my laptop, and I don't trust the power saving logic).

https://gist.github.com/smuenzel/77c2fe63072f78237a7b419d548e1292

with patch:
0m1.770s
0m1.756s
0m1.770s
0m1.757s

without patch:
0m3.395s
0m3.391s
0m3.400s
0m3.426s

@stedolan
Contributor

Adding Icompf to match Icomp looks useful, but for x86 have you considered using ucomisd + setcc instead of cmpsd + movd + neg? You don't need the extra register, you'd be able to share more code with the existing float-comparison logic, and https://uops.info/table.html suggests it's probably faster as well.

@smuenzel
Contributor Author

For reference, here's what current C compilers produce: https://godbolt.org/z/3xYK39

Yes, I took a look at that too, but eq and neq need to be treated specially because of NaN -- there aren't any flag combinations available that give you those conditions directly. So for simplicity, I wanted to generate the same code for all of the comparisons. It's also nice that icc produces this code for all of them.
(When I say "need" to be treated specially, I mean that ideally they should have the same performance as the other conditions. In the godbolt link, clang tries to do that by using the cmpsd method for eq and neq, and setcc otherwise.)
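For reference, a small OCaml check of the NaN semantics in question: every comparison involving NaN is false, except `<>`, which is true.

```ocaml
(* NaN is unordered: every float comparison involving it yields false,
   except [<>], which yields true. This is why eq and neq need extra
   handling -- no single condition code captures "equal and ordered". *)
let () =
  assert (not (nan = nan));
  assert (nan <> nan);
  assert (not (nan < 0.0));
  assert (not (nan >= 0.0));
  (* [compare], by contrast, treats NaN as equal to itself. *)
  assert (compare nan nan = 0)
```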

I can probably do some benchmarks -- but I'm tempted to rely on icc to guide us a little bit.

@smuenzel smuenzel changed the title [WIP] Turn float comparisons into primitive operations Turn float comparisons into primitive operations Sep 29, 2020
@stedolan
Contributor

stedolan commented Oct 7, 2020

The code looks good to me, although I haven't reviewed the non-x86 backends in detail. I'm still not quite convinced by the output, though:

The code generated by icc is highly dependent on the exact input. When I change the test functions to store their results in an integer (see https://godbolt.org/z/x4bbjd), icc often generates branching code. Whether it generates branches or not depends on the signedness of the result type (!), despite the answer always being 0 or 1. So I don't think we can use it as an oracle for good code.

Since the output of cmpsd consumes an extra register, it can potentially cause a spill in situations where ucomisd would not. In the absence of serious benchmarking of one versus the other (and honestly, I can't imagine there's much difference), I'd lean towards the one that spills less.

Finally, would you mind re-running your original benchmark using an array whose elements are all equal? You've shown a significant speed-up in the case where the branch predictor performs badly (random data), but I'd like to also verify that there's not (much) slowdown in the more common case where it performs well.

@smuenzel
Contributor Author

smuenzel commented Oct 8, 2020

Sounds like we should be reporting bugs for the intel compiler 😄

I guess we'll have to do the combination of benchmarks in the following categories:

Predictable: Yes/No
Implementation: CMPSD/SETcc/Jcc
Architecture: Zen2/Skylake

Hopefully that will give us enough coverage to make a decision.

@smuenzel
Contributor Author

smuenzel commented Nov 6, 2020

I did some of these benchmarks. Sadly, I don't have a current non-laptop Intel box available right now, so you'll have to live with the laptop Kaby Lake benchmarks (the variance is pretty high on those). The Zen2 numbers are super consistent, though.
All benchmarks below are median of 5, runtime in seconds.
current is the base of my PR, compsd is this PR, and setcc is https://github.com/smuenzel/ocaml/tree/perf--compf-setcc
Note that setcc still has branches for eq and neq. Maybe that's also why it's a bit slower (on Zen2; on Kaby, it's within variance).

revision| Kaby (det)| Kaby (rand)| Zen2 (det)| Zen2 (rand)
--------+-----------+------------+-----------+------------
current |    3.79   |     4.65   |   2.329   |    3.403
setcc   |    3.73   |     3.61   |   1.858   |    1.851
compsd  |    3.63   |     3.75   |   1.770   |    1.797

@smuenzel
Contributor Author

smuenzel commented Nov 7, 2020

And now some numbers for another Intel processor (Gold 6244). The variance on Intel processors seems to be higher; I don't know why that is the case (I may try to get a much larger sample size to figure this one out). Note that "det" is the previous gist with the random data replaced by a constant.

revision| Cascade (det)| Cascade (rand)| 
--------+--------------+---------------+
current |      2.566   |       3.476   |
setcc   |      2.359   |       2.340   |
compsd  |      2.345   |       2.428   |

@smuenzel
Contributor Author

smuenzel commented Nov 7, 2020

I redid everything with 10 samples each now. I still don't trust the Intel numbers because of the high variance; if someone has a better method to try, that would be awesome.
On Zen2, compsd is clearly better than setcc, but only by something like 3%. It is pretty nice to not have any branches in this code at all though.

revision| Cascade (det)| Cascade (rand)| Kaby (det)  | Kaby (rand) | Zen2 (det)  | Zen2 (rand)
--------+--------------+---------------+-------------+-------------+-------------+------------
current |  2.556±0.030 |   3.541±0.134 | 3.569±0.117 | 4.530±0.200 | 2.314±0.005 | 3.413±0.007
compsd  |  2.437±0.129 |   2.407±0.119 | 3.372±0.275 | 3.654±0.176 | 1.802±0.006 | 1.793±0.057
setcc   |  2.414±0.109 |   2.439±0.134 | 3.367±0.099 | 3.272±0.128 | 1.853±0.005 | 1.853±0.005

@smuenzel
Contributor Author

@stedolan : any thoughts about these results?

@smuenzel
Contributor Author

ping @stedolan

@stedolan stedolan marked this pull request as ready for review January 20, 2021 18:06
@stedolan
Contributor

Sorry this took me so long to get to! Those benchmarks are pretty convincing, this looks like a clear improvement over what's there.

@smuenzel smuenzel force-pushed the perf--compf branch 2 times, most recently from a24349c to 0f35000 Compare January 25, 2021 03:23
@smuenzel
Contributor Author

@alainfrisch , since you generally care a lot about float performance as well, do you want to review this patch?

Contributor

@alainfrisch alainfrisch left a comment

I reviewed only the x86/amd64 backends (having no knowledge of other architectures), and the PR looks ok to me. I've left minor superficial remarks (plus one question). Thanks for providing the results with the micro-benchmark; they look great, and I'm in favor of merging the PR.

How much testing/benchmarking has been done on non-Intel architectures?

(* Certain float conditions aren't represented directly in the opcode for
float comparison, so we have to swap the arguments. The swap information
is also needed downstream because one of the arguments is clobbered. *)
let float_cond_and_swap cond a0 a1 =
Contributor

Nitpicking, but it seems better to have something like:

let float_cond_and_need_swap = function
  | CFeq -> EQf, false
  | CFneq -> NEQf, false
  ...
  | CFnge -> NLef, true

which avoids the risk of being incoherent between the need to swap and the actual swapping. There are only two call sites for the function: one doesn't need to perform the actual swapping (and passes unit arguments), so it can use this function directly; the other can do the swapping itself.

Contributor Author

done


let max_register_pressure = function
let max_register_pressure =
let consumes ~int ~float =
Contributor

This is a nice approach, but even with strong enough inlining, the compiler will allocate those arrays every time consumes is called (because arrays are mutable). The impact on compilation-time performance is probably negligible, but it might be worth confirming it empirically.

Contributor Author

I believe this function is called from add_superpressure_regs in spill.ml (and nowhere else?), and there, we also have an Array.make of the same length -- so I guess we'd allocate a bit more, but we're already doing array allocation each time we call max_register_pressure. I think it's probably fine to not do additional benchmarks for this, unless people feel strongly about it.

Contributor

I don't think benchmarks would show any difference, because the previous code also allocates a new array each time. We don't have immutable arrays in the language, so even constant arrays must be re-allocated each time if they're not defined at toplevel.
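A standalone sketch of this point, observing the allocation via physical equality:

```ocaml
(* An array literal in a function body is re-allocated on every call,
   because arrays are mutable; only a toplevel definition is shared. *)
let fresh () = [| 13; 9 |]   (* new array on each call *)
let shared = [| 13; 9 |]     (* allocated once *)

let () =
  assert (fresh () != fresh ());   (* two physically distinct blocks *)
  assert (shared == shared)        (* the same block every time *)
```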

let cond, a0, a1, _ = float_cond_and_swap cmp (arg i 0) (arg i 1) in
I.cmpsd cond a1 a0;
I.movd a0 (res i 0);
I.neg (res i 0);
Contributor

Sorry for the very dummy question (showing, btw, that I should really not be the one reviewing this PR...), but why is the negation necessary?

Contributor Author

cmpsd's result is either all one bits (-1) or zero, but we actually want zero or one, so neg is the easiest way to do the conversion (and, iirc, the instruction encoding of neg is also pretty short)
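In integer terms, a sketch of that conversion (illustrative OCaml, not compiler code):

```ocaml
(* cmpsd writes an all-ones mask (-1 as a signed integer) for true and
   0 for false; the desired encoding is 1 for true and 0 for false, so
   a single negation converts the mask to the expected value. *)
let bool_of_mask (mask : int) : int = - mask

let () =
  assert (bool_of_mask (-1) = 1);
  assert (bool_of_mask 0 = 0)
```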

Contributor

Thanks! (It was pretty obvious from https://www.felixcloutier.com/x86/cmpsd but I looked at the wrong pseudo code...)

@smuenzel
Contributor Author

Other than anything done by the automated checks (which I guess is not much), I haven't really done any non-Intel testing. What do people typically do? I don't have any environments where I could easily test.

@smuenzel
Contributor Author

@alainfrisch: I think the non-Intel architectures will be checked automatically when merged. I would think that performance testing is not necessary, since we're mostly reducing instruction count (or not changing instructions).

@alainfrisch
Contributor

This seems to be a good candidate for INRIA's CI (cf https://github.com/ocaml/ocaml/blob/trunk/HACKING.adoc#inrias-continuous-integration-ci).

@smuenzel
Contributor Author

I managed to sign up and get permissioned for the ocaml project, but there's no "Build with parameters" at https://ci.inria.fr/ocaml/job/precheck/, like it says in the instructions -- maybe they need to be updated?

@smuenzel smuenzel force-pushed the perf--compf branch 2 times, most recently from 5af23b3 to 2c9073f Compare March 30, 2021 01:32
@nojb
Contributor

nojb commented Mar 30, 2021

I managed to sign up and get permissioned for the ocaml project, but there's no "Build with parameters" at https://ci.inria.fr/ocaml/job/precheck/, like it says in the instructions -- maybe they need to be updated?

Strange, it seems to be there for me. Anyway, I launched the precheck job: https://ci.inria.fr/ocaml/job/precheck/554/

@smuenzel
Contributor Author

Thanks! If I'm reading that right, it looks like it was successful

@nojb
Contributor

nojb commented Mar 30, 2021

Thanks! If I'm reading that right, it looks like it was successful

Yes, that's correct (the only supported architecture that is not tested in the precheck is RISC-V, which is not yet plugged into it -- that one will be tested when merged).

Was there any other outstanding issue apart from testing on non-intel architectures?

@smuenzel
Contributor Author

No, I don't believe that there are any other outstanding things.

@nojb
Contributor

nojb commented Mar 30, 2021

I haven't looked at the PR in detail, but both @stedolan and @alainfrisch were in favor of merging and precheck has passed, so let's merge. Do you want to keep the history, or shall we just squash everything together?

@xavierleroy
Contributor

I don't see any test in the PR. Are we sure every case of every port has been exercised?

My general take is that this treatment of FP comparisons is good to have for the platforms where it is easy to support (x86-64, ARM) but the i386 code is a mess and PowerPC is so-so. So I would have preferred a per-platform implementation based on platform-specific instructions.

@smuenzel
Contributor Author

FYI: #11904 has now been merged.

Rebased and all checks pass. Thanks!

@nojb
Contributor

nojb commented Jan 23, 2023

@lthls As the person that reviewed this PR most recently, do you mind giving an official approval? We can then move to merge.

@nojb
Contributor

nojb commented Jan 24, 2023

@smuenzel do you mind if we squash-merge?

@smuenzel
Contributor Author

Squash sounds great. Thanks!

@nojb nojb merged commit ebc23f1 into ocaml:trunk Jan 24, 2023
let cond, need_swap = float_cond_and_need_swap cmp in
let a0, a1 = if need_swap then arg i 1, arg i 0 else arg i 0, arg i 1 in
I.cmpsd cond a1 a0;
I.movd a0 (res i 0);
Contributor

This should be movq not movd when argument and result are in 64-bit registers.

Contributor Author

I believe movd is right, strange as it sounds. But let me try to dig out a reference since I'm currently relying on some hazy memory.

Contributor Author

I think this bug report is the correct reference:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=43215

Relevant quote:

However, since AMD64 uses movd instead
of movq, some non-GNU assemblers may not support movq. Why change it
when nothing is broken?

So apparently assembler support for movd is more universal than movq, since movq wasn't in the initial amd64 spec.

But I guess

movd eax,xmm0
neg eax

is even better, since it has a shorter encoding (by two bytes, I think?). Probably doesn't make enough of a difference to be worth another PR, though.

Contributor

ok, thanks for checking.
