Int32 code generation improvements#9006
Conversation
|
This is a neat idea. The first commit (machine-independent changes in Cmmgen) looks good to me, even though I wasn't quite able to follow what's going on with unboxed Clet-bound variables. Someone more familiar with this part of the code should review too. For the second commit (the zextend32 operation for amd64), I'm of two minds. You could avoid adding a specific operation by recognizing |
I did try this approach, but it doesn't work out simply. The issue is that 0xFFFF_FFFF is too large to be a 32-bit immediate constant (because of sign-extension), so instruction selection won't generate an Iintop_imm, but instead an Iintop and an explicit move to a register. We could hack around this in various ways (e.g. allowing instruction selection to generate the immediate 0xFFFF_FFFF when the operation is Iand, or matching the move-and sequence in emit.ml), but it seemed cleaner to just add a new |
|
You're right, This said, I'm not pushing this alternative and I'm fine with |
xavierleroy
left a comment
There was a problem hiding this comment.
After a second reading of the code, I'm pretty confident that it is good. A second opinion would be even better. @alainfrisch perhaps?
b580b8e to
ff279e0
Compare
alainfrisch
left a comment
There was a problem hiding this comment.
LGTM.
In transl_catch, I think you want to use unbox_number_low to avoid sign-extending the argument on each staticraise site (it will be sign-extended upon read). If I'm not wrong, unbox_number is then only used in unbox_number_low, so perhaps the functions should be merged.
On possible downside of the approach is that sign-extension might need to be done more often (dynamically), but this happens in an already bad case where the same id will be boxed several time, but this is probably ok.
I wonder if one could also avoid some sign-extension when calling C functions taking unboxed integers, but I
asmcomp/cmmgen.ml
Outdated
| | Boxed_float _ -> unbox_float dbg arg | ||
| | Boxed_integer (bi, _) -> unbox_int dbg bi arg | ||
|
|
||
| (* Same as unbox_number, but may return garbage in the upper bits *) |
There was a problem hiding this comment.
The comment could be more explicit, indicating that this may return garbage in the 32 upper bits when applied to an int32 boxed integer on a 64-bit platform.
| | Cop(Calloc, | ||
| [hdr; ops; | ||
| Cop(Clsl, [contents; Cconst_int (32, _)], dbg')], _dbg) | ||
| Cop(Clsl, [contents; Cconst_int (32, _)], _dbg')], _dbg) |
There was a problem hiding this comment.
This is now throwing dbg' away. I've no idea if we care; perhaps @mshinwell can comment on that.
There was a problem hiding this comment.
Checked with @mshinwell, he says it's fine.
Thanks for the suggestion, done.
Agreed. If we're doing redundant boxing, the cost of doing redundant sign-extension as well isn't going to be noticeable.
I think your comment was zero-extended! I'm avoiding optimising sign-extensions for C functions, as C ABIs are a bit vague about whether they expect numbers to be sign-extended, zero-extended, or have garbage in the upper bits. (More clearly specifying this is an open issue in the amd64 sysv ABI. I think the consensus is that 32-bit values may have garbage in the upper bits but <32-bit values should be sign-extended to 32-bits, which is odd but matches the oddness of x86. See here and here for more). |
:-) I was going to write "but I'm not sure what the ABI says about it". |
c075915 to
0852266
Compare
|
I think this is a good optimization to have in ocamlopt, and there is no reason to not have it in 4.10. So, let's merge now. |
This PR improves code generation for programs using int32, by avoiding redundant sign-extensions and generating better code for zero-extensions.
This came up in an implementation of Jenkins' 32-bit hash function which performed poorly. With this patch, the
mixfunction below gets 3.2x faster (1374ms vs. 424ms for the benchmark), and its code gets 2.8x smaller (457 bytes vs. 166 bytes). The source and assembly output for the benchmark are below (click to expand):Mix routine of Jenkins' hash function and benchmark
amd64 assembly on trunk
amd64 assembly with patch
There are two separate issues with the code that trunk generates: there are too many sign-extensions, and the code generated for zero-extension is too verbose. These are fixed separately in two commits.
1. Generating fewer sign-extensions
Many arithmetic operations have the property that the the low k bits of the result depends only on the low k bits of the inputs. This is true for addition, subtraction, multiplication, left shift and bitwise operators, but not for division, comparison or right shift. A consequence of this is that if the result is sign-extended, then the inputs need not be, as sign-extension only affects the high bits, giving:
In other words, if the result will be sign-extended anyway, then the contents of the high bits of the input registers is irrelevant (for certain operations).
So, this patch modifies
cmmgen.mlto add a new functiontransl_unbox_int_low, which works the same astransl_unbox_int, except when returning a 32-bit value it makes no guarantees about the contents of the high bits. It does this by taking the result oftransl_unbox_int, and discarding any sign- or zero-extensions.The same change is also applied to the unboxed int32 values (
unboxed_ids) - these are no longer maintained in sign-extended form, and instead sign-extended when used.2. Better code-generation for zero-extension (amd64 only)
Zero-extension amounts to masking with 0xFFFFFFFF, and may be written in the source or generated by the compiler when compiling right logical shifts. Unfortunately, this generates quite verbose code on trunk, e.g.:
The problem is that 0xFFFFFFFF is too large to be an immediate to the
andinstruction, so causes a largemovabsinstruction to be generated to build the constant. This patch follows the same lines as #1631, detecting zero-extensions in instruction-selection and emitting a shorter sequence for them.