Optimize 32->64 sign-extension for AMD64 by LemonBoy · Pull Request #1631 · ocaml/ocaml

LemonBoy · 2018-02-25T15:46:13Z

Before:

48 c1 e3 20            shl    rbx,0x20
48 c1 fb 20            sar    rbx,0x20

After:

48 63 db               movsxd rbx,ebx
-- or, if the value is in eax --
48 98                  cdqe

The same can be done for AArch64 using the sxtw op.
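
The equivalence behind this rewrite can be checked at the source level. The sketch below is only an illustration, not code from the PR: the function names are made up, and Int64 values stand in for 64-bit registers.

```ocaml
(* Illustrative model only: Int64 values stand in for 64-bit registers. *)

(* What the old sequence computes: shl by 32, then arithmetic sar by 32. *)
let sign_extend_by_shifts (x : int64) : int64 =
  Int64.shift_right (Int64.shift_left x 32) 32

(* What movsxd computes: keep the low 32 bits and sign-extend them. *)
let sign_extend_direct (x : int64) : int64 =
  Int64.of_int32 (Int64.to_int32 x)

let () =
  (* Both formulations agree on values around the 32-bit boundaries. *)
  List.iter
    (fun x -> assert (sign_extend_by_shifts x = sign_extend_direct x))
    [ 0L; 1L; (-1L); 0x7FFF_FFFFL; 0x8000_0000L; 0x1_2345_6789L; Int64.min_int ]
```

Both functions discard the upper 32 bits of the input and replicate bit 31 into them, which is why a single instruction can replace the shift pair.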

alainfrisch · 2018-02-27T22:04:53Z

Can the difference be observed (even on a micro-benchmark)?

LemonBoy · 2018-02-28T09:44:21Z

A completely unscientific benchmark, running the code below under time, gives the following results:

Version	Time
trunk	2.828
trunk+flambda	3.944
trunk+PR	1.714
trunk+PR+flambda	2.273

The speedup and the reduction in code size are proportional to the number of operations performed on 32-bit integers.
It is also worth noting that flambda produces one extra (useless) sign extension:

(let Paddbint_44/1259
   (>>s (<< (+ (>>s (<< Foo_y/1257 32) 32) 1) 32) 32)
  (assign Foo_y/1257 (>>s (<< Paddbint_44/1259 32) 32)))

(loop (assign y/1235 (>>s (<< (+ (>>s (<< y/1235 32) 32) 1) 32) 32))
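
To see why the extra sign extension is redundant, here is a small model of the two Cmm shapes above. It is a sketch under assumptions: `norm`, `step` and `step_flambda` are hypothetical names, and Int64 values stand in for the 64-bit register holding the unboxed Int32.

```ocaml
(* Illustrative model: an Int32 kept in a 64-bit register, as in the Cmm above. *)

(* The (>>s (<< x 32) 32) pattern: sign-extend the low 32 bits. *)
let norm (x : int64) : int64 = Int64.shift_right (Int64.shift_left x 32) 32

(* Loop body without flambda: normalize the operand, add 1, normalize the result. *)
let step (x : int64) : int64 = norm (Int64.add (norm x) 1L)

(* Shape produced with flambda: the let-bound result is normalized a second
   time when it is assigned back. *)
let step_flambda (x : int64) : int64 =
  let paddbint = norm (Int64.add (norm x) 1L) in
  norm paddbint

let () =
  (* norm is idempotent, so the second normalization never changes the value. *)
  List.iter
    (fun x -> assert (step x = step_flambda x))
    [ 0L; (-1L); 0x7FFF_FFFFL; 0xFFFF_FFFFL ]
```

Because normalizing an already normalized value is a no-op, the outer sign extension in the flambda output costs instructions without changing the result.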

let y = ref (Int32.of_int 0) in
for i = 0 to 1000000000 do
  y := Int32.(succ !y)
done;
print_int (Int32.to_int !y);
print_newline ()

alainfrisch · 2018-02-28T09:56:10Z

asmcomp/amd64/emit.mlp

+  | Lop(Ispecific(Isextend)) ->
+      let src = res i 0 in
+      begin match src with
+      | Reg64 x when x = RAX -> I.cdqe ()


| Reg64 RAX -> ...
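
The snippet after the hunk appears to suggest matching on the constructor directly instead of using a guard. A minimal sketch of the two forms follows; the `reg64` and `arg` types here are simplified stand-ins, not the assembler's real definitions, and the function names are hypothetical.

```ocaml
(* Simplified stand-ins for the assembler operand types; not the real ones. *)
type reg64 = RAX | RBX | RCX
type arg = Reg64 of reg64 | Imm of int64

(* Guarded form, as written in the diff. *)
let is_rax_guarded = function
  | Reg64 x when x = RAX -> true
  | _ -> false

(* Nested constructor pattern, as in the suggested replacement. *)
let is_rax_nested = function
  | Reg64 RAX -> true
  | _ -> false

let () =
  (* The two forms accept exactly the same operands. *)
  List.iter
    (fun a -> assert (is_rax_guarded a = is_rax_nested a))
    [ Reg64 RAX; Reg64 RBX; Reg64 RCX; Imm 0L ]
```

The nested pattern is shorter and avoids binding a variable that is only used in the guard.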

gasche · 2018-02-28T09:57:54Z

(Edit: the benchmark above has been much improved, so the message below isn't relevant anymore.)

The benchmark does not give the numbers I would be looking for:

- What about the state of trunk just before this PR? Comparing to 4.06.1 risks measuring other trunk changes in the difference.
- Could we get the "diff" of this PR (compared to trunk) both with and without flambda?

(I'm not saying you should re-do those benchmarks; I can't judge whether the benchmarks matter at all to evaluate the PR. I'm just commenting on the proposed numbers.)

alainfrisch · 2018-02-28T09:58:16Z

asmcomp/amd64/selection.ml

  | Iintop_imm((Iadd|Isub|Imul|Iand|Ior|Ixor|Ilsl|Ilsr|Iasr), _)
  | Iabsf | Inegf
-  | Ispecific(Ibswap (32|64)) ->
+  | Ispecific(Ibswap (32|64)|Isextend) ->


The movsxd would support different source and target registers, no?

Yes, I've lifted this restriction in the latest commit.

Is there a risk that the restriction was actually useful, in that it allowed the RAX case to be used more often, resulting in better performance?

I don't think so: the gains from using a single cdqe are often overshadowed by the gains from generating fewer mov instructions in insert_op_debug (the worst case is 2 mov reg,reg for a whopping 6 bytes).
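
The size argument can be made concrete with a little arithmetic. This is a sketch that only uses the byte counts quoted in this thread; `cdqe_sequence_bytes` is a hypothetical helper, and the 3-byte size of a mov reg,reg is inferred from the "2 mov reg,reg for 6 bytes" remark.

```ocaml
(* Byte sizes taken from the encodings quoted in this thread. *)
let movsxd_bytes = 3      (* 48 63 db *)
let cdqe_bytes = 2        (* 48 98 *)
let mov_reg_reg_bytes = 3 (* inferred from "2 mov reg,reg for 6 bytes" *)

(* cdqe only operates on rax, so a value living in another register may
   need a move into rax and a move back out. *)
let cdqe_sequence_bytes ~extra_movs =
  cdqe_bytes + extra_movs * mov_reg_reg_bytes

let () =
  Printf.printf "best case:  cdqe path %d bytes, movsxd %d bytes\n"
    (cdqe_sequence_bytes ~extra_movs:0) movsxd_bytes;
  Printf.printf "worst case: cdqe path %d bytes, movsxd %d bytes\n"
    (cdqe_sequence_bytes ~extra_movs:2) movsxd_bytes
```

Under these assumptions cdqe saves one byte when the value is already in rax, while forcing the value through rax can cost several bytes more than a single movsxd.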

alainfrisch · 2018-02-28T10:00:39Z

> I can't judge whether the benchmarks matter at all to evaluate the PR

Personally, I cannot judge an "optimization" PR without any numbers. There is always a cost to review and a risk inherent to any change, and the numbers are the justification for these.

alainfrisch · 2018-02-28T10:01:45Z

Cc:ing @chambart @mshinwell @lpw25 about the flambda "regression" reported above. Perhaps worth keeping track of it independently of this PR, though.

LemonBoy · 2018-02-28T11:00:22Z

> The benchmark does not give the numbers I would be looking for

I've re-run the test with four different compilers; the results should be slightly more meaningful now.

alainfrisch · 2018-02-28T11:55:00Z

Thanks, the numbers are convincing and I'm in favor of accepting the change. I'll give a few days to let other developers comment if they wish. In the meantime, can you address my inline comments and also:

- Please add a Changelog entry.
- Can you confirm that both generated instructions are exercised in the testsuite, and if not, add a dedicated test?

LemonBoy · 2018-03-01T09:36:19Z

> Can you confirm that both generated instructions are exercised in the testsuite, and if not, add a dedicated test?

The lib-digest/md5.ml test generates both instructions and executes correctly.

alainfrisch · 2018-03-06T22:29:54Z

This is touching the backend, so let's ask the boss: @xavierleroy, no opposition from your side on this PR?

xavierleroy

Two suggestions below.

xavierleroy · 2018-03-07T08:43:56Z

asmcomp/amd64/arch.ml

  | Isqrtf                             (* Float square root *)
  | Ifloatsqrtf of addressing_mode     (* Float square root from memory *)
+  | Isextend                           (* Convert value with sign extension *)
 and float_operation =


The name Isextend and the comment are not precise enough. There are instructions to sign-extend 8, 16 and 32-bit quantities.

Is Isextend32 better?

xavierleroy · 2018-03-07T08:44:41Z

asmcomp/amd64/emit.mlp

+  | Lop(Ispecific(Isextend)) ->
+      begin match (arg i 0, res i 0) with
+      | (Reg64 RAX, Reg64 RAX) -> I.cdqe ()
+      | (_, dst)               -> I.movsxd (arg32 i 0) dst


I'm skeptical that it is worth recognizing and generating the cdqe special-case instruction. Emitting movsxd in all cases would be just as efficient in practice and keep the code simpler.

Roger that, I'll update the PR as soon as possible.
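
The simplification under discussion can be sketched as follows. This is a hypothetical mini-emitter that produces assembly text; the real emitter uses the compiler's own instruction DSL, and the `reg` type, `name64`, `name32` and both emit functions are made up for illustration.

```ocaml
(* Hypothetical mini-emitter producing assembly text; not the compiler's DSL. *)
type reg = RAX | RBX | RCX | RDX

let name64 = function RAX -> "rax" | RBX -> "rbx" | RCX -> "rcx" | RDX -> "rdx"
let name32 = function RAX -> "eax" | RBX -> "ebx" | RCX -> "ecx" | RDX -> "edx"

(* With the special case: cdqe when both operands are rax, movsxd otherwise. *)
let emit_with_cdqe ~src ~dst =
  match src, dst with
  | RAX, RAX -> "cdqe"
  | _ -> Printf.sprintf "movsxd %s,%s" (name64 dst) (name32 src)

(* Simplified version: movsxd in every case, with no register-specific branch. *)
let emit_movsxd_only ~src ~dst =
  Printf.sprintf "movsxd %s,%s" (name64 dst) (name32 src)

let () =
  print_endline (emit_with_cdqe ~src:RAX ~dst:RAX);
  print_endline (emit_movsxd_only ~src:RAX ~dst:RAX);
  print_endline (emit_movsxd_only ~src:RCX ~dst:RBX)
```

Dropping the special case removes a branch from the emitter at the cost of one extra byte in the rax-to-rax case.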

LemonBoy · 2018-05-09T13:22:54Z

Amended and rebased, sorry for the delay.

alainfrisch · 2018-05-11T10:53:56Z

LGTM.

Can you add @xavierleroy as a reviewer to the Changes entry?

Will merge in a few days, unless @xavierleroy or someone else objects to it.

Drop the cdqe optimization and always use movsxd instead.

alainfrisch · 2018-05-28T13:39:07Z

Merged, thanks!

alainfrisch reviewed Feb 28, 2018

alainfrisch self-assigned this Feb 28, 2018

LemonBoy force-pushed the signextend branch from ed85400 to 6d09cb6 Compare March 1, 2018 09:29

alainfrisch approved these changes Mar 1, 2018

xavierleroy reviewed Mar 7, 2018

Optimize 32->64 sign-extension for AMD64

86f9e1c

LemonBoy force-pushed the signextend branch from 70796f3 to 87a8e02 Compare May 9, 2018 09:25

Rename Isextend to Isextend32

8c10ca1

Drop the cdqe optimization and always use movsxd instead.

LemonBoy force-pushed the signextend branch from 87a8e02 to 8c10ca1 Compare May 12, 2018 08:15

Won't be in 4.07

d022ac9

alainfrisch merged commit 86d1f0d into ocaml:trunk May 28, 2018

stedolan mentioned this pull request Oct 1, 2019

Int32 code generation improvements #9006

Merged

EmileTrotignon pushed a commit to EmileTrotignon/ocaml that referenced this pull request Jan 12, 2024

New Left Sidebar Accordion Nav for Exercises page. (ocaml#1631)

dee8971
