Optimize 32->64 sign-extension for AMD64#1631
Conversation
|
Can the difference be observed (even on a micro-benchmark)? |
|
A completely unscientific benchmark using the following code and using
The speedup and the reduction of code size is proportional to the operations made on 32bit integers. (let Paddbint_44/1259
(>>s (<< (+ (>>s (<< Foo_y/1257 32) 32) 1) 32) 32)
(assign Foo_y/1257 (>>s (<< Paddbint_44/1259 32) 32)))(loop (assign y/1235 (>>s (<< (+ (>>s (<< y/1235 32) 32) 1) 32) 32))let y = ref (Int32.of_int 0) in
for i=0 to 1000000000 do
y := Int32.(succ !y)
done;
print_int (Int32.to_int !y);
print_newline () |
asmcomp/amd64/emit.mlp
Outdated
| | Lop(Ispecific(Isextend)) -> | ||
| let src = res i 0 in | ||
| begin match src with | ||
| | Reg64 x when x = RAX -> I.cdqe () |
|
(Edit: the benchmark above has been much improved, so the message above isn't relevant anymore.) The benchmark does not give the numbers I would be looking for:
(I'm not saying you should re-do those benchmarks, I can't judge whether the benchmarks matter at all to evaluate the PR, just commenting on the proposed numbres.) |
asmcomp/amd64/selection.ml
Outdated
| | Iintop_imm((Iadd|Isub|Imul|Iand|Ior|Ixor|Ilsl|Ilsr|Iasr), _) | ||
| | Iabsf | Inegf | ||
| | Ispecific(Ibswap (32|64)) -> | ||
| | Ispecific(Ibswap (32|64)|Isextend) -> |
There was a problem hiding this comment.
The movsxd would support different source and target registers, no?
There was a problem hiding this comment.
Yes, I've lifted this restriction in the latest commit.
There was a problem hiding this comment.
If there a risk that the restriction was actually useful, in that it allowed to use more often the RAX case, resulting in better performance?
There was a problem hiding this comment.
I don't think so, the gains coming from the use of a single cdqe are often shadowed by the gains brought by generating less mov in insert_op_debug (the worst case is 2 mov reg,reg for a whopping 6 bytes)
Personally, I cannot judge an "optimization" PR without any number. There is always a cost to review and a risk inherent to any change, and the numbers are the justification for these. |
|
Cc:ing @chambart @mshinwell @lpw25 about the flambda "regression" reported above. Perhaps worth keeping track of it independently of this PR, though. |
I've re-run the test with four different compilers, the results should be slightly more meaningful now. |
|
Thanks, the numbers are convincing and I'm in favor of accepting the change. I'll give a few days to let other developers comment if they wish. In the meantime, can you address my inline comments and also:
|
The |
|
This is touching the backend, so let's ask the boss: @xavierleroy, no opposition from your side on this PR? |
xavierleroy
left a comment
There was a problem hiding this comment.
Two suggestions below.
| | Isqrtf (* Float square root *) | ||
| | Ifloatsqrtf of addressing_mode (* Float square root from memory *) | ||
| | Isextend (* Convert value with sign extension *) | ||
| and float_operation = |
There was a problem hiding this comment.
The name Isextend and the comment are not precise enough. There are instructions to sign-extend 8, 16 and 32-bit quantities.
There was a problem hiding this comment.
Is Isextend32 better?
asmcomp/amd64/emit.mlp
Outdated
| | Lop(Ispecific(Isextend)) -> | ||
| begin match (arg i 0, res i 0) with | ||
| | (Reg64 RAX, Reg64 RAX) -> I.cdqe () | ||
| | (_, dst) -> I.movsxd (arg32 i 0) dst |
There was a problem hiding this comment.
I'm skeptical that it is worth recognizing and generating the cdqe special-case instruction. Emitting movsxd in all cases would be just as efficient in practice and keep the code simpler.
There was a problem hiding this comment.
Roger that, I'll update the PR as soon as possible.
|
Amended and rebased, sorry for the delay. |
|
LGTM. Can you add @xavierleroy as a reviewer to the Changes entry? Will merge in a few days, unless @xavierleroy or someone else objects to it. |
Drop the cdqe optimization and always use movsxd instead.
|
Merged, thanks! |
Before:
After:
The same can be done for AArch64 using the
sxtwop.