Skip to content

Optimize 32->64 sign-extension for AMD64#1631

Merged
alainfrisch merged 3 commits intoocaml:trunkfrom
LemonBoy:signextend
May 28, 2018
Merged

Optimize 32->64 sign-extension for AMD64#1631
alainfrisch merged 3 commits intoocaml:trunkfrom
LemonBoy:signextend

Conversation

@LemonBoy
Copy link
Copy Markdown
Contributor

Before:

48 c1 e3 20            shl    rbx,0x20
48 c1 fb 20            sar    rbx,0x20

After:

48 63 db               movsxd rbx,ebx
-- or, if the value is in eax --
48 98                  cdqe

The same can be done for AArch64 using the sxtw op.

@alainfrisch
Copy link
Copy Markdown
Contributor

Can the difference be observed (even on a micro-benchmark)?

@LemonBoy
Copy link
Copy Markdown
Contributor Author

LemonBoy commented Feb 28, 2018

A completely unscientific benchmark using the following code and using time gives the following results:

Version Time
trunk 2.828
trunk+flambda 3.944
trunk+PR 1.714
trunk+PR+flambda 2.273

The speedup and the reduction of code size is proportional to the operations made on 32bit integers.
It may also be useful to notice that flambda produces one more (useless) sign extension:

(let Paddbint_44/1259
   (>>s (<< (+ (>>s (<< Foo_y/1257 32) 32) 1) 32) 32)
  (assign Foo_y/1257 (>>s (<< Paddbint_44/1259 32) 32)))
(loop (assign y/1235 (>>s (<< (+ (>>s (<< y/1235 32) 32) 1) 32) 32))
let  y = ref (Int32.of_int 0) in                                            
for i=0 to 1000000000 do                                                      
    y := Int32.(succ !y)
done;                                               
print_int (Int32.to_int !y);                                             
print_newline ()

| Lop(Ispecific(Isextend)) ->
let src = res i 0 in
begin match src with
| Reg64 x when x = RAX -> I.cdqe ()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

| Reg64 RAX -> ...

@gasche
Copy link
Copy Markdown
Member

gasche commented Feb 28, 2018

(Edit: the benchmark above has been much improved, so the message above isn't relevant anymore.)

The benchmark does not give the numbers I would be looking for:

  • what about the state of trunk just before this PR? Comparing to 4.06.1 risks measuring other trunk changes in the difference.
  • could we get the "diff" of this PR (compared to trunk) both with and without flambda?

(I'm not saying you should re-do those benchmarks, I can't judge whether the benchmarks matter at all to evaluate the PR, just commenting on the proposed numbres.)

| Iintop_imm((Iadd|Isub|Imul|Iand|Ior|Ixor|Ilsl|Ilsr|Iasr), _)
| Iabsf | Inegf
| Ispecific(Ibswap (32|64)) ->
| Ispecific(Ibswap (32|64)|Isextend) ->
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The movsxd would support different source and target registers, no?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I've lifted this restriction in the latest commit.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there a risk that the restriction was actually useful, in that it allowed to use more often the RAX case, resulting in better performance?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so, the gains coming from the use of a single cdqe are often shadowed by the gains brought by generating less mov in insert_op_debug (the worst case is 2 mov reg,reg for a whopping 6 bytes)

@alainfrisch
Copy link
Copy Markdown
Contributor

I can't judge whether the benchmarks matter at all to evaluate the PR

Personally, I cannot judge an "optimization" PR without any number. There is always a cost to review and a risk inherent to any change, and the numbers are the justification for these.

@alainfrisch
Copy link
Copy Markdown
Contributor

Cc:ing @chambart @mshinwell @lpw25 about the flambda "regression" reported above. Perhaps worth keeping track of it independently of this PR, though.

@LemonBoy
Copy link
Copy Markdown
Contributor Author

The benchmark does not give the numbers I would be looking for

I've re-run the test with four different compilers, the results should be slightly more meaningful now.

@alainfrisch
Copy link
Copy Markdown
Contributor

Thanks, the numbers are convincing and I'm in favor of accepting the change. I'll give a few days to let other developers comment if they wish. In the meantime, can you address my inline comments and also:

  • Please add a Changelog entry.

  • Can you confirm that both generated instructions are exercised in the testsuite, and if not, add a dedicated test?

@LemonBoy
Copy link
Copy Markdown
Contributor Author

LemonBoy commented Mar 1, 2018

Can you confirm that both generated instructions are exercised in the testsuite, and if not, add a dedicated test?

The lib-digest/md5.ml test generates both the instructions and executes correctly.

@alainfrisch
Copy link
Copy Markdown
Contributor

This is touching the backend, so let's ask the boss: @xavierleroy, no opposition from your side on this PR?

Copy link
Copy Markdown
Contributor

@xavierleroy xavierleroy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two suggestions below.

| Isqrtf (* Float square root *)
| Ifloatsqrtf of addressing_mode (* Float square root from memory *)
| Isextend (* Convert value with sign extension *)
and float_operation =
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name Isextend and the comment are not precise enough. There are instructions to sign-extend 8, 16 and 32-bit quantities.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is Isextend32 better?

| Lop(Ispecific(Isextend)) ->
begin match (arg i 0, res i 0) with
| (Reg64 RAX, Reg64 RAX) -> I.cdqe ()
| (_, dst) -> I.movsxd (arg32 i 0) dst
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm skeptical that it is worth recognizing and generating the cdqe special-case instruction. Emitting movsxd in all cases would be just as efficient in practice and keep the code simpler.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Roger that, I'll update the PR as soon as possible.

@LemonBoy
Copy link
Copy Markdown
Contributor Author

LemonBoy commented May 9, 2018

Amended and rebased, sorry for the delay.

@alainfrisch
Copy link
Copy Markdown
Contributor

LGTM.

Can you add @xavierleroy as a reviewer to the Changes entry?

Will merge in a few days, unless @xavierleroy or someone else objects to it.

Drop the cdqe optimization and always use movsxd instead.
@alainfrisch alainfrisch merged commit 86d1f0d into ocaml:trunk May 28, 2018
@alainfrisch
Copy link
Copy Markdown
Contributor

Merged, thanks!

EmileTrotignon pushed a commit to EmileTrotignon/ocaml that referenced this pull request Jan 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants