
Int32 code generation improvements#9006

Merged
xavierleroy merged 2 commits into ocaml:trunk from stedolan:int32-compilation
Oct 14, 2019

Conversation

@stedolan
Contributor

@stedolan stedolan commented Oct 1, 2019

This PR improves code generation for programs using int32, by avoiding redundant sign-extensions and generating better code for zero-extensions.

This came up in an implementation of Jenkins' 32-bit hash function which performed poorly. With this patch, the mix function below gets 3.2x faster (1374ms vs. 424ms for the benchmark), and its code gets 2.8x smaller (457 bytes vs. 166 bytes). The source and assembly output for the benchmark are below (click to expand):

Mix routine of Jenkins' hash function and benchmark
open Int32
let (-) = sub
let (lsr) = shift_right_logical
let (lxor) = logxor
let (land) = logand
let (lsl) = shift_left

let fa a b c shift =
  let a = a - b in
  let a = a - c in
  (a lxor (c lsr shift))

let fb a b c shift =
  let b = b - c in
  let b = b - a in
  (b lxor (a lsl shift))

let fc a b c shift =
  let c = c - a in
  let c = c - b in
  (c lxor (b lsr shift))

let [@inline never] mix a b c =
  let a = fa a b c 13 in
  let b = fb a b c 8 in
  let c = fc a b c 13 in
  let a = fa a b c 12 in
  let b = fb a b c 16 in
  let c = fc a b c 5 in
  let a = fa a b c 3 in
  let b = fb a b c 10 in
  let c = fc a b c 15 in
  to_int c

let () =
  Random.init 0;
  let rand () = Random.int32 0x7fff_ffffl in
  let a, b, c = rand (), rand (), rand () in
  for i = 1 to 100000000 do
    ignore (mix a b c : int)
  done
amd64 assembly on trunk
0000000000402f50 <camlJenk__mix_148>:
  402f50:       48 63 5b 08             movslq 0x8(%rbx),%rbx
  402f54:       48 63 40 08             movslq 0x8(%rax),%rax
  402f58:       48 29 d8                sub    %rbx,%rax
  402f5b:       48 63 f0                movslq %eax,%rsi
  402f5e:       48 63 47 08             movslq 0x8(%rdi),%rax
  402f62:       48 63 fe                movslq %esi,%rdi
  402f65:       48 29 c7                sub    %rax,%rdi
  402f68:       48 63 ff                movslq %edi,%rdi
  402f6b:       48 be ff ff ff ff 00    movabs $0xffffffff,%rsi
  402f72:       00 00 00 
  402f75:       48 89 c2                mov    %rax,%rdx
  402f78:       48 21 f2                and    %rsi,%rdx
  402f7b:       48 c1 ea 0d             shr    $0xd,%rdx
  402f7f:       48 63 f2                movslq %edx,%rsi
  402f82:       48 63 f6                movslq %esi,%rsi
  402f85:       48 63 ff                movslq %edi,%rdi
  402f88:       48 31 f7                xor    %rsi,%rdi
  402f8b:       48 63 ff                movslq %edi,%rdi
  402f8e:       48 29 c3                sub    %rax,%rbx
  402f91:       48 63 f3                movslq %ebx,%rsi
  402f94:       48 63 df                movslq %edi,%rbx
  402f97:       48 63 fe                movslq %esi,%rdi
  402f9a:       48 29 df                sub    %rbx,%rdi
  402f9d:       48 63 ff                movslq %edi,%rdi
  402fa0:       48 89 de                mov    %rbx,%rsi
  402fa3:       48 c1 e6 08             shl    $0x8,%rsi
  402fa7:       48 63 f6                movslq %esi,%rsi
  402faa:       48 63 f6                movslq %esi,%rsi
  402fad:       48 63 ff                movslq %edi,%rdi
  402fb0:       48 31 f7                xor    %rsi,%rdi
  402fb3:       48 63 ff                movslq %edi,%rdi
  402fb6:       48 29 d8                sub    %rbx,%rax
  402fb9:       48 63 c0                movslq %eax,%rax
  402fbc:       48 63 ff                movslq %edi,%rdi
  402fbf:       48 63 c0                movslq %eax,%rax
  402fc2:       48 29 f8                sub    %rdi,%rax
  402fc5:       48 63 c0                movslq %eax,%rax
  402fc8:       48 be ff ff ff ff 00    movabs $0xffffffff,%rsi
  402fcf:       00 00 00 
  402fd2:       48 89 fa                mov    %rdi,%rdx
  402fd5:       48 21 f2                and    %rsi,%rdx
  402fd8:       48 c1 ea 0d             shr    $0xd,%rdx
  402fdc:       48 63 f2                movslq %edx,%rsi
  402fdf:       48 63 f6                movslq %esi,%rsi
  402fe2:       48 63 c0                movslq %eax,%rax
  402fe5:       48 31 f0                xor    %rsi,%rax
  402fe8:       48 63 c0                movslq %eax,%rax
  402feb:       48 29 fb                sub    %rdi,%rbx
  402fee:       48 63 db                movslq %ebx,%rbx
  402ff1:       48 63 c0                movslq %eax,%rax
  402ff4:       48 63 db                movslq %ebx,%rbx
  402ff7:       48 29 c3                sub    %rax,%rbx
  402ffa:       48 63 db                movslq %ebx,%rbx
  402ffd:       48 be ff ff ff ff 00    movabs $0xffffffff,%rsi
  403004:       00 00 00 
  403007:       48 89 c2                mov    %rax,%rdx
  40300a:       48 21 f2                and    %rsi,%rdx
  40300d:       48 c1 ea 0c             shr    $0xc,%rdx
  403011:       48 63 f2                movslq %edx,%rsi
  403014:       48 63 f6                movslq %esi,%rsi
  403017:       48 63 db                movslq %ebx,%rbx
  40301a:       48 31 f3                xor    %rsi,%rbx
  40301d:       48 63 db                movslq %ebx,%rbx
  403020:       48 29 c7                sub    %rax,%rdi
  403023:       48 63 ff                movslq %edi,%rdi
  403026:       48 63 db                movslq %ebx,%rbx
  403029:       48 63 ff                movslq %edi,%rdi
  40302c:       48 29 df                sub    %rbx,%rdi
  40302f:       48 63 ff                movslq %edi,%rdi
  403032:       48 89 de                mov    %rbx,%rsi
  403035:       48 c1 e6 10             shl    $0x10,%rsi
  403039:       48 63 f6                movslq %esi,%rsi
  40303c:       48 63 f6                movslq %esi,%rsi
  40303f:       48 63 ff                movslq %edi,%rdi
  403042:       48 31 f7                xor    %rsi,%rdi
  403045:       48 63 ff                movslq %edi,%rdi
  403048:       48 29 d8                sub    %rbx,%rax
  40304b:       48 63 c0                movslq %eax,%rax
  40304e:       48 63 ff                movslq %edi,%rdi
  403051:       48 63 c0                movslq %eax,%rax
  403054:       48 29 f8                sub    %rdi,%rax
  403057:       48 63 c0                movslq %eax,%rax
  40305a:       48 be ff ff ff ff 00    movabs $0xffffffff,%rsi
  403061:       00 00 00 
  403064:       48 89 fa                mov    %rdi,%rdx
  403067:       48 21 f2                and    %rsi,%rdx
  40306a:       48 c1 ea 05             shr    $0x5,%rdx
  40306e:       48 63 f2                movslq %edx,%rsi
  403071:       48 63 f6                movslq %esi,%rsi
  403074:       48 63 c0                movslq %eax,%rax
  403077:       48 31 f0                xor    %rsi,%rax
  40307a:       48 63 c0                movslq %eax,%rax
  40307d:       48 29 fb                sub    %rdi,%rbx
  403080:       48 63 db                movslq %ebx,%rbx
  403083:       48 63 c0                movslq %eax,%rax
  403086:       48 63 db                movslq %ebx,%rbx
  403089:       48 29 c3                sub    %rax,%rbx
  40308c:       48 63 db                movslq %ebx,%rbx
  40308f:       48 be ff ff ff ff 00    movabs $0xffffffff,%rsi
  403096:       00 00 00 
  403099:       48 89 c2                mov    %rax,%rdx
  40309c:       48 21 f2                and    %rsi,%rdx
  40309f:       48 c1 ea 03             shr    $0x3,%rdx
  4030a3:       48 63 f2                movslq %edx,%rsi
  4030a6:       48 63 f6                movslq %esi,%rsi
  4030a9:       48 63 db                movslq %ebx,%rbx
  4030ac:       48 31 f3                xor    %rsi,%rbx
  4030af:       48 63 db                movslq %ebx,%rbx
  4030b2:       48 29 c7                sub    %rax,%rdi
  4030b5:       48 63 ff                movslq %edi,%rdi
  4030b8:       48 63 db                movslq %ebx,%rbx
  4030bb:       48 63 ff                movslq %edi,%rdi
  4030be:       48 29 df                sub    %rbx,%rdi
  4030c1:       48 63 ff                movslq %edi,%rdi
  4030c4:       48 89 de                mov    %rbx,%rsi
  4030c7:       48 c1 e6 0a             shl    $0xa,%rsi
  4030cb:       48 63 f6                movslq %esi,%rsi
  4030ce:       48 63 f6                movslq %esi,%rsi
  4030d1:       48 63 ff                movslq %edi,%rdi
  4030d4:       48 31 f7                xor    %rsi,%rdi
  4030d7:       48 63 ff                movslq %edi,%rdi
  4030da:       48 29 d8                sub    %rbx,%rax
  4030dd:       48 63 c0                movslq %eax,%rax
  4030e0:       48 63 df                movslq %edi,%rbx
  4030e3:       48 63 c0                movslq %eax,%rax
  4030e6:       48 29 d8                sub    %rbx,%rax
  4030e9:       48 63 c0                movslq %eax,%rax
  4030ec:       48 bf ff ff ff ff 00    movabs $0xffffffff,%rdi
  4030f3:       00 00 00 
  4030f6:       48 21 fb                and    %rdi,%rbx
  4030f9:       48 c1 eb 0f             shr    $0xf,%rbx
  4030fd:       48 63 db                movslq %ebx,%rbx
  403100:       48 63 db                movslq %ebx,%rbx
  403103:       48 63 c0                movslq %eax,%rax
  403106:       48 31 d8                xor    %rbx,%rax
  403109:       48 63 c0                movslq %eax,%rax
  40310c:       48 c1 e0 20             shl    $0x20,%rax
  403110:       48 c1 f8 1f             sar    $0x1f,%rax
  403114:       48 83 c8 01             or     $0x1,%rax
  403118:       c3                      retq   
amd64 assembly with patch
0000000000402ef0 <camlJenk__mix_148>:
  402ef0:       48 89 fe                mov    %rdi,%rsi
  402ef3:       48 63 5b 08             movslq 0x8(%rbx),%rbx
  402ef7:       48 63 78 08             movslq 0x8(%rax),%rdi
  402efb:       48 29 df                sub    %rbx,%rdi
  402efe:       48 63 46 08             movslq 0x8(%rsi),%rax
  402f02:       48 29 c7                sub    %rax,%rdi
  402f05:       89 c6                   mov    %eax,%esi
  402f07:       48 c1 ee 0d             shr    $0xd,%rsi
  402f0b:       48 31 f7                xor    %rsi,%rdi
  402f0e:       48 29 c3                sub    %rax,%rbx
  402f11:       48 29 fb                sub    %rdi,%rbx
  402f14:       48 89 fe                mov    %rdi,%rsi
  402f17:       48 c1 e6 08             shl    $0x8,%rsi
  402f1b:       48 31 f3                xor    %rsi,%rbx
  402f1e:       48 29 f8                sub    %rdi,%rax
  402f21:       48 29 d8                sub    %rbx,%rax
  402f24:       89 de                   mov    %ebx,%esi
  402f26:       48 c1 ee 0d             shr    $0xd,%rsi
  402f2a:       48 31 f0                xor    %rsi,%rax
  402f2d:       48 29 df                sub    %rbx,%rdi
  402f30:       48 29 c7                sub    %rax,%rdi
  402f33:       89 c6                   mov    %eax,%esi
  402f35:       48 c1 ee 0c             shr    $0xc,%rsi
  402f39:       48 31 f7                xor    %rsi,%rdi
  402f3c:       48 29 c3                sub    %rax,%rbx
  402f3f:       48 29 fb                sub    %rdi,%rbx
  402f42:       48 89 fe                mov    %rdi,%rsi
  402f45:       48 c1 e6 10             shl    $0x10,%rsi
  402f49:       48 31 f3                xor    %rsi,%rbx
  402f4c:       48 29 f8                sub    %rdi,%rax
  402f4f:       48 29 d8                sub    %rbx,%rax
  402f52:       89 de                   mov    %ebx,%esi
  402f54:       48 c1 ee 05             shr    $0x5,%rsi
  402f58:       48 31 f0                xor    %rsi,%rax
  402f5b:       48 29 df                sub    %rbx,%rdi
  402f5e:       48 29 c7                sub    %rax,%rdi
  402f61:       89 c6                   mov    %eax,%esi
  402f63:       48 c1 ee 03             shr    $0x3,%rsi
  402f67:       48 31 f7                xor    %rsi,%rdi
  402f6a:       48 29 c3                sub    %rax,%rbx
  402f6d:       48 29 fb                sub    %rdi,%rbx
  402f70:       48 89 fe                mov    %rdi,%rsi
  402f73:       48 c1 e6 0a             shl    $0xa,%rsi
  402f77:       48 31 f3                xor    %rsi,%rbx
  402f7a:       48 29 f8                sub    %rdi,%rax
  402f7d:       48 29 d8                sub    %rbx,%rax
  402f80:       89 db                   mov    %ebx,%ebx
  402f82:       48 c1 eb 0f             shr    $0xf,%rbx
  402f86:       48 31 d8                xor    %rbx,%rax
  402f89:       48 c1 e0 20             shl    $0x20,%rax
  402f8d:       48 c1 f8 1f             sar    $0x1f,%rax
  402f91:       48 83 c8 01             or     $0x1,%rax
  402f95:       c3                      retq   

There are two separate issues with the code that trunk generates: there are too many sign-extensions, and the code generated for zero-extension is too verbose. These are fixed separately in two commits.

1. Generating fewer sign-extensions

Many arithmetic operations have the property that the low k bits of the result depend only on the low k bits of the inputs. This is true for addition, subtraction, multiplication, left shift and the bitwise operators, but not for division, comparison or right shift. A consequence is that if the result is sign-extended, then the inputs need not be, as sign-extension only affects the high bits, giving:

sext(sext(a) + sext(b)) = sext(a + b)

In other words, if the result will be sign-extended anyway, then the contents of the high bits of the input registers are irrelevant (for these operations).
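For intuition, the identity can be checked directly on OCaml's native ints (Sys.int_size = 63 on 64-bit platforms), writing sign-extension from 32 bits as a shift pair. This is an illustration only, not compiler code:

```ocaml
(* Sign-extend the low 32 bits of a native int.  Shifting left by
   int_size - 32 and back with an arithmetic shift replicates bit 31
   of the low word into the high bits. *)
let sext x =
  let k = Sys.int_size - 32 in
  (x lsl k) asr k

(* sext (sext a + sext b) = sext (a + b): extensions on the inputs
   are redundant when the result is extended anyway. *)
let () =
  let check a b = assert (sext (sext a + sext b) = sext (a + b)) in
  check 0x7FFF_FFFF 1;
  check 0xDEAD_BEEF 0xCAFE_F00D;
  check (-123456789) 987654321
```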

So, this patch modifies cmmgen.ml to add a new function transl_unbox_int_low, which works the same as transl_unbox_int, except when returning a 32-bit value it makes no guarantees about the contents of the high bits. It does this by taking the result of transl_unbox_int, and discarding any sign- or zero-extensions.

The same change is also applied to the unboxed int32 values (unboxed_ids) - these are no longer maintained in sign-extended form, and instead sign-extended when used.
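A toy model of the idea behind transl_unbox_int_low (the constructors below are simplified stand-ins, not the real Cmm type from asmcomp): the "low-bits" translation takes the ordinary translation and peels off a trailing extension, leaving the high bits unspecified for the consumer to re-extend.

```ocaml
(* Simplified stand-in for a fragment of the Cmm expression type. *)
type expr =
  | Var of string
  | Sext32 of expr   (* stand-in for the Clsl 32 / Casr 32 pair *)
  | Zext32 of expr   (* stand-in for Cand 0xFFFF_FFFF *)

(* Discard a sign- or zero-extension: the low 32 bits are unchanged,
   and the consumer will re-extend if it needs the high bits. *)
let low_bits = function
  | Sext32 e | Zext32 e -> e
  | e -> e

let () =
  assert (low_bits (Sext32 (Var "x")) = Var "x");
  assert (low_bits (Var "y") = Var "y")
```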

2. Better code-generation for zero-extension (amd64 only)

Zero-extension amounts to masking with 0xFFFFFFFF, and may be written in the source or generated by the compiler when compiling right logical shifts. Unfortunately, this generates quite verbose code on trunk, e.g.:

  402f6b:       48 be ff ff ff ff 00    movabs $0xffffffff,%rsi
  402f72:       00 00 00 
  402f75:       48 89 c2                mov    %rax,%rdx
  402f78:       48 21 f2                and    %rsi,%rdx

The problem is that 0xFFFFFFFF is too large to be an immediate to the and instruction, so causes a large movabs instruction to be generated to build the constant. This patch follows the same lines as #1631, detecting zero-extensions in instruction-selection and emitting a shorter sequence for them.
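On x86-64 a 32-bit mov to a register clears the upper 32 bits, so a single two-byte `mov %ebx,%ebx` computes the same result as the movabs + and pair. A small OCaml model of the two semantics (illustration only, not compiler code):

```ocaml
(* Two equivalent ways to zero-extend the low 32 bits of a native
   int: masking (what the movabs + and sequence computes) and a
   logical shift pair (a model of what a 32-bit mov leaves in the
   destination register). *)
let zext_mask x = x land 0xFFFF_FFFF

let zext_shift x =
  let k = Sys.int_size - 32 in
  (x lsl k) lsr k

let () =
  List.iter (fun x -> assert (zext_mask x = zext_shift x))
    [0; -1; 0x1234_5678; min_int; max_int]
```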

@stedolan stedolan changed the title Int32 compilation Int32 compilation improvements Oct 1, 2019
@stedolan stedolan changed the title Int32 compilation improvements Int32 code generation improvements Oct 1, 2019
@xavierleroy
Contributor

This is a neat idea.

The first commit (machine-independent changes in Cmmgen) looks good to me, even though I wasn't quite able to follow what's going on with unboxed Clet-bound variables. Someone more familiar with this part of the code should review too.

For the second commit (the zextend32 operation for amd64), I'm of two minds. You could avoid adding a specific operation by recognizing Iintop_imm(Iand, 0xFFFF_FFFF) in emit.mlp and generating the 32-bit move there, instead of recognizing it in selection.ml. The one slight advantage of having a specific operation is that the source and result registers can differ, while Iintop_imm forces them to be the same.

@stedolan
Contributor Author

stedolan commented Oct 7, 2019

You could avoid adding a specific operation by recognizing Iintop_imm(Iand, 0xFFFF_FFFF) in emit.mlp and generating the 32-bit move there, instead of recognizing it in selection.ml.

I did try this approach, but it doesn't work out simply. The issue is that 0xFFFF_FFFF is too large to be a 32-bit immediate constant (because immediates are sign-extended), so instruction selection won't generate an Iintop_imm, but instead an Iintop and an explicit move of the constant to a register. We could hack around this in various ways (e.g. allowing instruction selection to generate the immediate 0xFFFF_FFFF when the operation is Iand, or matching the move-and sequence in emit.mlp), but it seemed cleaner to just add a new Ispecific Izextend32, especially as Ispecific Isextend32 already exists.
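The constraint being described: amd64 arithmetic immediates are 32-bit values sign-extended to 64 bits, so only the range [-2^31, 2^31 - 1] is representable, and 0xFFFF_FFFF falls outside it. A simplified sketch of the check (the real is_immediate lives in asmcomp/amd64/selection.ml):

```ocaml
(* amd64 immediates are sign-extended 32-bit values, so the
   representable range is [-2^31, 2^31 - 1].  Simplified model. *)
let is_immediate n = n <= 0x7FFF_FFFF && n >= -0x8000_0000

let () =
  assert (is_immediate 0x7FFF_FFFF);
  assert (not (is_immediate 0xFFFF_FFFF))  (* forces movabs on trunk *)
```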

@xavierleroy
Contributor

You're right, is_immediate behaves the same regardless of the operation. However we could still add a special-case "Cand with 0xFFFF_FFFF" to asmcomp/amd64/selection.ml that generates an Iintop_imm(Iand, 0xFFFF_FFFF). Just like your code but without the specific operation.

This said, I'm not pushing this alternative and I'm fine with Ispecific Izextend32.

Contributor

@xavierleroy xavierleroy left a comment

After a second reading of the code, I'm pretty confident that it is good. A second opinion would be even better. @alainfrisch perhaps?

Contributor

@alainfrisch alainfrisch left a comment

LGTM.

In transl_catch, I think you want to use unbox_number_low to avoid sign-extending the argument on each staticraise site (it will be sign-extended upon read). If I'm not wrong, unbox_number is then only used in unbox_number_low, so perhaps the functions should be merged.

One possible downside of the approach is that sign-extension might need to be done more often (dynamically), but this only happens in the already bad case where the same id is boxed several times, so it is probably ok.

I wonder if one could also avoid some sign-extension when calling C functions taking unboxed integers, but I

| Boxed_float _ -> unbox_float dbg arg
| Boxed_integer (bi, _) -> unbox_int dbg bi arg

(* Same as unbox_number, but may return garbage in the upper bits *)
Contributor

The comment could be more explicit, indicating that this may return garbage in the 32 upper bits when applied to an int32 boxed integer on a 64-bit platform.

Contributor Author

Fixed

| Cop(Calloc,
[hdr; ops;
Cop(Clsl, [contents; Cconst_int (32, _)], dbg')], _dbg)
Cop(Clsl, [contents; Cconst_int (32, _)], _dbg')], _dbg)
Contributor

This is now throwing dbg' away. I've no idea if we care; perhaps @mshinwell can comment on that.

Contributor Author

Checked with @mshinwell, he says it's fine.

@stedolan
Contributor Author

In transl_catch, I think you want to use unbox_number_low to avoid sign-extending the argument on each staticraise site (it will be sign-extended upon read). If I'm not wrong, unbox_number is then only used in unbox_number_low, so perhaps the functions should be merged.

Thanks for the suggestion, done.

One possible downside of the approach is that sign-extension might need to be done more often (dynamically), but this only happens in the already bad case where the same id is boxed several times, so it is probably ok.

Agreed. If we're doing redundant boxing, the cost of doing redundant sign-extension as well isn't going to be noticeable.

I wonder if one could also avoid some sign-extension when calling C functions taking unboxed integers, but I

I think your comment was zero-extended!

I'm avoiding optimising sign-extensions for C functions, as C ABIs are a bit vague about whether they expect numbers to be sign-extended, zero-extended, or to have garbage in the upper bits. (More clearly specifying this is an open issue in the amd64 SysV ABI. I think the consensus is that 32-bit values may have garbage in the upper bits but <32-bit values should be sign-extended to 32 bits, which is odd but matches the oddness of x86. See here and here for more.)

@alainfrisch
Contributor

I think your comment was zero-extended!

:-) I was going to write "but I'm not sure what the ABI says about it".

@xavierleroy
Contributor

I think this is a good optimization to have in ocamlopt, and there is no reason to not have it in 4.10. So, let's merge now.
