Skip to content

Tweaks to amd64 emitter#1490

Closed
xclerc wants to merge 645 commits intoocaml:trunkfrom
xclerc:amd64-emit-tweaks
Closed

Tweaks to amd64 emitter#1490
xclerc wants to merge 645 commits intoocaml:trunkfrom
xclerc:amd64-emit-tweaks

Conversation

@xclerc
Copy link
Copy Markdown
Contributor

@xclerc xclerc commented Nov 23, 2017

This PR suggests the following changes to amd64/emit.mlp in order to
emit shorter instructions when possible:

  • when moving a known constant to a register;
  • when and-ing a register with a known constant;
  • when xor-ing a register with itself.

These changes are based on the fact that when using a 32-bit register,
the lower part is used to performed the actual computation while the
higher part is cleared.

This PR also disables inc/dec instructions, as suggested by the Intel
manual because, contrary to add/sub instructions, they do not set all
flags, which can in turn add data dependencies.

(The PR is split into multiple commit to easily apply only a subset.)

@alainfrisch
Copy link
Copy Markdown
Contributor

Does this come from aesthetic considerations, or did you observe noticeable reductions in code size? If so, please share your findings!

@xclerc
Copy link
Copy Markdown
Contributor Author

xclerc commented Nov 23, 2017

These changes are part of a larger patch we successfully
tested against the critical path of one of our applications.
However we did not test of these changes individually;
I will add a variant on [1] to ensure there is no regression.

[1] https://github.com/OCamlPro/ocamlbench-repo

@xclerc
Copy link
Copy Markdown
Contributor Author

xclerc commented Nov 24, 2017

It just occurred to me that the following rewrites should probably
be enabled iff fast_code is false:

  • mov ~> mov+cdqe;
  • mov ~> mov+sal.

(We could also ban inc/dec instructions iff fast_code is true.)

@xavierleroy
Copy link
Copy Markdown
Contributor

These changes are part of a larger patch we successfully tested against the critical path of one of our applications.

Good to know, but that wasn't @alainfrisch's question. Does all this reduce the code size significantly? By how much?

@xclerc
Copy link
Copy Markdown
Contributor Author

xclerc commented Nov 24, 2017

Sorry, I misread Alain's question.

Here are the sizes for the binaries from the distribution:

  before (bytes) after (bytes) change (bytes) change (percentage)
./tools/ocamlobjinfo.opt 8974040 8952760 -21280 -0.24
./tools/read_cmt.opt 8324456 8307312 -17144 -0.21
./tools/ocamlmktop.opt 1765248 1761032 -4216 -0.24
./tools/ocamlmklib.opt 1640024 1635816 -4208 -0.26
./tools/ocamloptp.opt 2074832 2070608 -4224 -0.20
./tools/ocamlprof.opt 2747760 2739320 -8440 -0.31
./tools/cmpbyt.opt 8304136 8282896 -21240 -0.26
./tools/dumpobj.opt 1803112 1798888 -4224 -0.23
./tools/ocamldep.opt 8293872 8272632 -21240 -0.26
./tools/stripdebug.opt 8295408 8274168 -21240 -0.26
./tools/ocamlcp.opt 2059496 2055272 -4224 -0.21
./tools/primreq.opt 1344272 1344208 -64 -0.00
./ocamldoc/ocamldoc.opt 10808664 10779192 -29472 -0.27
./ocamltest/ocamltest.opt 8525824 8504584 -21240 -0.25
./lex/ocamllex.opt 1724056 1723960 -96 -0.01
./ocamlc.opt 8573992 8552640 -21352 -0.25
./ocamlopt.opt 11621056 11597560 -23496 -0.20

@alainfrisch
Copy link
Copy Markdown
Contributor

And do these tiny reductions lead to any observable performance improvement? Smaller code is usually better, except if you start using less-optimized instructions.

I'm trying to find reasons to add complexity to the compiler and risk performance regressions (on specific versions of the CPU).

@xclerc
Copy link
Copy Markdown
Contributor Author

xclerc commented Nov 24, 2017

I will measure the gains in tight loops; in the meantime, here is an illustration
of the gain (or loss in the case of inc/dec) for each rewrite:

  # use 32-bit legacy instructions when data size is 32 bits -- section 9.2.1
  48 31 c0             	xor    %rax,%rax
  31 c0                	xor    %eax,%eax

  # ibid
  48 25 ff ff ff 7f    	and    $0x7fffffff,%rax
  25 ff ff ff 7f       	and    $0x7fffffff,%eax

  # ibid
  48 c7 c0 01 00 00 00 	mov    $0x1,%rax
  b8 01 00 00 00       	mov    $0x1,%eax

  # ibid
  48 c7 c0 ff ff ff 7f 	mov    $0x7fffffff,%rax
  b8 ff ff ff 7f       	mov    $0x7fffffff,%eax

  # avoid instruction that are eight+ bytes in length -- section 3.4.2.7
  48 b8 ff ff ff ff 00 	movabs $0xffffffff,%rax
  00 00 00 
  b8 ff ff ff ff       	mov    $0xffffffff,%eax
  48 98                	cltq   

  # ibid
  48 b8 00 00 00 00 01 	movabs $0x100000000,%rax
  00 00 00 
  b8 01 00 00 00       	mov    $0x1,%eax
  48 c1 e0 20          	shl    $0x20,%rax

  # avoid inc -- section 3.5.1.1
  48 ff c0             	inc    %rax
  48 83 c0 01          	add    $0x1,%rax

  # ibid
  48 ff c8             	dec    %rax
  48 83 e8 01          	sub    $0x1,%rax

The sections refer to "Intel 64 and IA-32 Architectures Optimization Manual",
available here.

@xavierleroy
Copy link
Copy Markdown
Contributor

I had a quick look at the new instruction patterns.

Using movl $n, %R32 instead of movq $n, %R64 when n is in [0...0xFFFF_FFFF] is a clear win: two bytes saved, very common idiom, no risk of performance regression.

The other patterns for "load integer constant" are less convincing to me. Replacing one movabsq instruction by two dependent instructions is smaller but might run slower! Plus, I suspect those cases of "load integer constant" to occur rarely.

@xclerc
Copy link
Copy Markdown
Contributor Author

xclerc commented Nov 24, 2017

Using movl $n, %R32 instead of movq $n, %R64 when n is in [0...0xFFFF_FFFF] is a clear win: two bytes saved, very common idiom, no risk of performance regression.

(The submitted patch considered only the values in the range of Int32.t.)
Your example leads to an even larger win:

movq $0xffffffff, %rax ~> 48 b8 ff ff ff ff 00 00 00 00
movl $0xffffffff, %eax ~> b8 ff ff ff ff

@xclerc
Copy link
Copy Markdown
Contributor Author

xclerc commented Nov 24, 2017

@xclerc
Copy link
Copy Markdown
Contributor Author

xclerc commented Dec 6, 2017

An instrumented version of the compiler gives the following distribution
of Iconst_int instructions for a run of make world.opt:

compilation percentage
32-bit xor 7.1
32-bit mov 89.2
mov + cdqe 0.3
mov + sal 0.3
no optimization 3.1

As @xavierleroy pointed out, there is no risk of performance regression if we
restrict the patch to only the first two cases.

Would you consider such a patch for merge?

@xclerc
Copy link
Copy Markdown
Contributor Author

xclerc commented Jan 11, 2018

Sorry to insist, but would you accept a patch reduced to the following changes?

  1. the use of 32-bit instructions for xor / mov / and when the operand
    is in [0..0xFFFF_FFFF]
  2. the use of add / sub instructions instead of incr / decr ones

@xavierleroy
Copy link
Copy Markdown
Contributor

the use of 32-bit instructions for xor / mov / and when the operand is in [0..0xFFFF_FFFF]

Definitely yes.

the use of add / sub instructions instead of incr / decr ones

Yes, but this is trading increased code size for reduced dependencies, so perhaps keep inc / dec if fast_code is false.

@xclerc
Copy link
Copy Markdown
Contributor Author

xclerc commented Jan 16, 2018

the use of 32-bit instructions for xor / mov / and when the operand is in [0..0xFFFF_FFFF]

Definitely yes.

I have removed the complex patterns around the mov instruction.

the use of add / sub instructions instead of incr / decr ones

Yes, but this is trading increased code size for reduced dependencies, so perhaps keep inc / dec if fast_code is false.

As suggested, I have made the code emitting inc / dec depend on
the value of fastcode_flag.

@xclerc
Copy link
Copy Markdown
Contributor Author

xclerc commented Jan 16, 2018

Travis is not happy. The check_all_arch fails because the last
commit uses a constant (0x01_0000_0000n) that is invalid on
a 32-bit architecture. See #1571.

@xavierleroy
Copy link
Copy Markdown
Contributor

the last commit uses a constant (0x01_0000_0000n) that is invalid on a 32-bit architecture.

Just write n <= 0xFFFF_FFFFn instead of n < 0x01_0000_0000. Just as clear + syntactically valid on a 32-bit system.

@xclerc
Copy link
Copy Markdown
Contributor Author

xclerc commented Jan 16, 2018

Just write n <= 0xFFFF_FFFFn instead of n < 0x01_0000_0000. Just as clear + syntactically valid on a 32-bit system.

Of course. My gut feeling was that it would hide a real issue.
I guess my understanding was slightly wrong: the code built
my check_all_arches is only expected to be compiled but not
to be executed (some operations over nativeint values
in cmmgen.ml would probably produce invalid results on
32-bit platforms anyway).

@xclerc
Copy link
Copy Markdown
Contributor Author

xclerc commented Jan 16, 2018

Travis is still not happy, but this time it is not even clear why.
The last two lines of the log are:

========= CHECKING asmcomp/arm64 ==============
make: *** [check_all_arches] Error 1

@xavierleroy
Copy link
Copy Markdown
Contributor

You have another occurrence of 0x01_0000_0000 left.

@xclerc
Copy link
Copy Markdown
Contributor Author

xclerc commented Jan 16, 2018

How embarrassing... I guess I was puzzled by the change/lack
of error message.

However, isn't this occurrence a bit more annoying? As it is an
int value on a 32-bit machine, I will probably not be allowed to
write 0xFFFF_FFFF. Of course, I can promote the value to be
compared to int64, but that does not feel right...

shindere and others added 9 commits February 8, 2018 15:04
…1578)

Before, cyclic dependencies were reported as a warning, and ocamldep -sort would exit with code 0.
Now, the message says "error" and the exit code is nonzero.
Except for the Camlinternal* modules and the new Stdlib module, all
modules in the stdlib now compile to Stdlib__<module>.

Pervasives is renamed to Stdlib and now contains a list of aliases
from the long names to the short ones, so that from inside and outside
the stdlib we can refer to the standard modules as just List or
Stdlib.List rather than Stdlib__list.

In order to avoid printing the long names in error messages and in the
toplevel, the following heuristic is added to Printtyp: given a path
Foo__bar, if Foo.Bar exists and is a direct or indirect alias to
Foo__bar, then prefer Foo.Bar.

A bootstrap step was required to replace Pervasives by Stdlib as the
module opened by default.
awk is symbolic link in Cygwin, which means it can't be used in -pp for
a native Windows build. Just use gawk instead, as no other package
provides the awk command on Cygwin.
which is used by both ocamldoc and the reference manual.
shindere and others added 20 commits March 18, 2018 22:24
Since the reflector auxiliary program used by this test is now
written in OCaml rather than in C, the redirections father process
must take care to pass on OCAMLRUNPARAM, othewise the test fails
when run with the debug runtime.
The function argument ocamlrunparam should actually be called systemenv.
@smuenzel-js
Copy link
Copy Markdown
Contributor

smuenzel-js commented Mar 27, 2019

@xclerc: Looks like you merged trunk into your branch, and this resulted in the pull request being messed up. Can you remove the merge commit and rebase instead?

Some other ideas (for a future PR):

  • For Iand: when n = 0xFF, use movzx
25 ff 00 00 00       	and    $0xff,%eax
0f b6 c0             	movzbl %al,%eax
  • For other operations with 8-bit immediates: We may get a shorter encoding if we use 8-bit registers. This transformation is valid when the high 56-bits are not affected by the operation. IIRC this does not result in an extra merge micro-op.
48 83 cb 01          	or     $0x1,%rbx
80 cb 01             	or     $0x1,%bl
  • We may want to consider two-instruction sequences that start with zero-cost operations (xor r32,r32) if they result in a shorter overall encoding.
48 c7 c0 01 00 00 00    mov $0x1,%rax
b8 01 00 00 00          mov $0x1,%eax
31 c0 b0 01             xor %eax,%eax; mov %0x1,%al

@smuenzel
Copy link
Copy Markdown
Contributor

ping?

@damiendoligez
Copy link
Copy Markdown
Member

ping @xclerc

@gasche
Copy link
Copy Markdown
Member

gasche commented Apr 18, 2021

Ping again. What is the status here? If @xclerc is not interested in working on this anymore, maybe @smuenzel you would be interested in rebasing the restricted version of this PR and proposing it again, as a new PR?

(I am planning to close next time I encounter this PR if there has not been any progress.)

@gasche
Copy link
Copy Markdown
Member

gasche commented Apr 28, 2021

Closing. We can always reopen if there is interest.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.