Tweaks to amd64 emitter by xclerc · Pull Request #1490 · ocaml/ocaml

xclerc · 2017-11-23T13:57:58Z

This PR suggests the following changes to amd64/emit.mlp in order to
emit shorter instructions when possible:

- when moving a known constant to a register;
- when and-ing a register with a known constant;
- when xor-ing a register with itself.

These changes are based on the fact that when an instruction writes to
a 32-bit register, the computation is performed on the lower 32 bits and
the upper 32 bits of the corresponding 64-bit register are cleared.
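The zero-extension rule can be modelled in a few lines of OCaml. This is only an illustrative sketch: `write_r32` is a hypothetical helper (it is not taken from emit.mlp), and registers are modelled as plain `int64` values.

```ocaml
(* Hypothetical model of an amd64 register write, for illustration only.
   Writing a 32-bit register (e.g. %eax) stores the 32-bit result in the
   low half and clears the upper 32 bits of the 64-bit register (%rax). *)
let write_r32 (_old_r64 : int64) (result32 : int32) : int64 =
  (* Int64.of_int32 sign-extends, so mask back down to 32 bits. *)
  Int64.logand (Int64.of_int32 result32) 0xFFFF_FFFFL

let () =
  let rax = 0x1234_5678_9ABC_DEF0L in
  (* xor %eax,%eax: the 32-bit result is 0, so all of %rax becomes 0,
     exactly as with xor %rax,%rax. *)
  assert (write_r32 rax 0l = 0L);
  (* mov $0x7fffffff,%eax leaves the same value in %rax as the 64-bit mov. *)
  assert (write_r32 rax 0x7FFF_FFFFl = 0x7FFF_FFFFL)
```

The previous contents of the register never reach the result, which is why the shorter 32-bit forms can stand in for the 64-bit ones when the constant fits in 32 unsigned bits.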

This PR also disables inc/dec instructions, as suggested by the Intel
manual because, contrary to add/sub instructions, they do not set all
flags, which can in turn add data dependencies.

(The PR is split into multiple commits so that a subset can easily be applied.)

alainfrisch · 2017-11-23T14:15:38Z

Does this come from aesthetic considerations, or did you observe noticeable reductions in code size? If so, please share your findings!

xclerc · 2017-11-23T14:34:46Z

These changes are part of a larger patch we successfully
tested against the critical path of one of our applications.
However, we did not test each of these changes individually;
I will add a variant on [1] to ensure there is no regression.

[1] https://github.com/OCamlPro/ocamlbench-repo

xclerc · 2017-11-24T09:07:30Z

It just occurred to me that the following rewrites should probably
be enabled iff fast_code is false:

mov ~> mov+cdqe;
mov ~> mov+sal.

(We could also ban inc/dec instructions iff fast_code is true.)

xavierleroy · 2017-11-24T09:44:15Z

> These changes are part of a larger patch we successfully tested against the critical path of one of our applications.

Good to know, but that wasn't @alainfrisch's question. Does all this reduce the code size significantly? By how much?

xclerc · 2017-11-24T10:32:00Z

Sorry, I misread Alain's question.

Here are the sizes for the binaries from the distribution:

| Binary | Before (bytes) | After (bytes) | Change (bytes) | Change (%) |
|---|---|---|---|---|
| ./tools/ocamlobjinfo.opt | 8974040 | 8952760 | -21280 | -0.24 |
| ./tools/read_cmt.opt | 8324456 | 8307312 | -17144 | -0.21 |
| ./tools/ocamlmktop.opt | 1765248 | 1761032 | -4216 | -0.24 |
| ./tools/ocamlmklib.opt | 1640024 | 1635816 | -4208 | -0.26 |
| ./tools/ocamloptp.opt | 2074832 | 2070608 | -4224 | -0.20 |
| ./tools/ocamlprof.opt | 2747760 | 2739320 | -8440 | -0.31 |
| ./tools/cmpbyt.opt | 8304136 | 8282896 | -21240 | -0.26 |
| ./tools/dumpobj.opt | 1803112 | 1798888 | -4224 | -0.23 |
| ./tools/ocamldep.opt | 8293872 | 8272632 | -21240 | -0.26 |
| ./tools/stripdebug.opt | 8295408 | 8274168 | -21240 | -0.26 |
| ./tools/ocamlcp.opt | 2059496 | 2055272 | -4224 | -0.21 |
| ./tools/primreq.opt | 1344272 | 1344208 | -64 | -0.00 |
| ./ocamldoc/ocamldoc.opt | 10808664 | 10779192 | -29472 | -0.27 |
| ./ocamltest/ocamltest.opt | 8525824 | 8504584 | -21240 | -0.25 |
| ./lex/ocamllex.opt | 1724056 | 1723960 | -96 | -0.01 |
| ./ocamlc.opt | 8573992 | 8552640 | -21352 | -0.25 |
| ./ocamlopt.opt | 11621056 | 11597560 | -23496 | -0.20 |

alainfrisch · 2017-11-24T10:43:11Z

And do these tiny reductions lead to any observable performance improvement? Smaller code is usually better, except if you start using less-optimized instructions.

I'm trying to find reasons to add complexity to the compiler and risk performance regressions (on specific versions of the CPU).

xclerc · 2017-11-24T12:44:20Z

I will measure the gains in tight loops; in the meantime, here is an illustration
of the gain (or loss in the case of inc/dec) for each rewrite:

  # use 32-bit legacy instructions when data size is 32 bits -- section 9.2.1
  48 31 c0             	xor    %rax,%rax
  31 c0                	xor    %eax,%eax

  # ibid
  48 25 ff ff ff 7f    	and    $0x7fffffff,%rax
  25 ff ff ff 7f       	and    $0x7fffffff,%eax

  # ibid
  48 c7 c0 01 00 00 00 	mov    $0x1,%rax
  b8 01 00 00 00       	mov    $0x1,%eax

  # ibid
  48 c7 c0 ff ff ff 7f 	mov    $0x7fffffff,%rax
  b8 ff ff ff 7f       	mov    $0x7fffffff,%eax

  # avoid instructions that are eight+ bytes in length -- section 3.4.2.7
  48 b8 ff ff ff ff 00 	movabs $0xffffffff,%rax
  00 00 00 
  b8 ff ff ff ff       	mov    $0xffffffff,%eax
  48 98                	cltq   

  # ibid
  48 b8 00 00 00 00 01 	movabs $0x100000000,%rax
  00 00 00 
  b8 01 00 00 00       	mov    $0x1,%eax
  48 c1 e0 20          	shl    $0x20,%rax

  # avoid inc -- section 3.5.1.1
  48 ff c0             	inc    %rax
  48 83 c0 01          	add    $0x1,%rax

  # ibid
  48 ff c8             	dec    %rax
  48 83 e8 01          	sub    $0x1,%rax

The sections refer to "Intel 64 and IA-32 Architectures Optimization Manual",
available here.
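To make the per-rewrite gain or loss explicit, the byte counts can be tallied mechanically. The sketch below copies the encodings from the listing in this comment; `byte_count`, `rewrites` and `deltas` are names introduced here for illustration only.

```ocaml
(* Count the bytes in a space-separated hex dump such as "48 31 c0". *)
let byte_count (hex : string) : int =
  String.split_on_char ' ' hex
  |> List.filter (fun s -> s <> "")
  |> List.length

(* (rewrite, encoding before, encoding after), copied from the listing. *)
let rewrites = [
  ("xor, 32-bit form",   "48 31 c0",                      "31 c0");
  ("and, 32-bit form",   "48 25 ff ff ff 7f",             "25 ff ff ff 7f");
  ("mov, 32-bit form",   "48 c7 c0 01 00 00 00",          "b8 01 00 00 00");
  ("movabs ~> mov+cltq", "48 b8 ff ff ff ff 00 00 00 00", "b8 ff ff ff ff 48 98");
  ("movabs ~> mov+shl",  "48 b8 00 00 00 00 01 00 00 00", "b8 01 00 00 00 48 c1 e0 20");
  ("inc ~> add",         "48 ff c0",                      "48 83 c0 01");
  ("dec ~> sub",         "48 ff c8",                      "48 83 e8 01");
]

(* Negative delta: the rewrite is shorter; positive: it is longer. *)
let deltas =
  List.map
    (fun (name, before, after) -> (name, byte_count after - byte_count before))
    rewrites

let () =
  List.iter (fun (name, d) -> Printf.printf "%-20s %+d bytes\n" name d) deltas
```

Only the inc/dec replacement makes the code longer (one byte each); every other rewrite saves between one and three bytes per instruction.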

xavierleroy · 2017-11-24T15:52:13Z

I had a quick look at the new instruction patterns.

Using movl $n, %R32 instead of movq $n, %R64 when n is in [0...0xFFFF_FFFF] is a clear win: two bytes saved, very common idiom, no risk of performance regression.

The other patterns for "load integer constant" are less convincing to me. Replacing one movabsq instruction by two dependent instructions is smaller but might run slower! Plus, I suspect those cases of "load integer constant" to occur rarely.

xclerc · 2017-11-24T17:12:12Z

> Using movl $n, %R32 instead of movq $n, %R64 when n is in [0...0xFFFF_FFFF] is a clear win: two bytes saved, very common idiom, no risk of performance regression.

(The submitted patch considered only the values in the range of Int32.t.)
Your example leads to an even larger win:

movq $0xffffffff, %rax ~> 48 b8 ff ff ff ff 00 00 00 00
movl $0xffffffff, %eax ~> b8 ff ff ff ff
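The choice between the three encodings discussed here can be sketched as a small classifier. This is a hypothetical illustration, not the code of the patch: `mov_form`, `choose_mov` and `encoded_length` are made-up names, the constant is held in an `int64` so the sketch does not depend on the host word size, and the lengths are those shown in this thread for `%rax`/`%eax` (a destination in `%r8`..`%r15` would need an extra prefix byte for the 32-bit form).

```ocaml
(* Hypothetical classification of "load constant n into a 64-bit register". *)
type mov_form =
  | Movl_imm32    (* mov $n,%r32: upper half is zero-extended *)
  | Movq_simm32   (* mov $n,%r64 with a sign-extended 32-bit immediate *)
  | Movabs_imm64  (* movabs $n,%r64 with a full 64-bit immediate *)

let choose_mov (n : int64) : mov_form =
  if n >= 0L && n <= 0xFFFF_FFFFL then Movl_imm32
  else if n >= Int64.of_int32 Int32.min_int
       && n <= Int64.of_int32 Int32.max_int then Movq_simm32
  else Movabs_imm64

(* Encoded lengths in bytes, as shown in the thread for %rax / %eax. *)
let encoded_length = function
  | Movl_imm32 -> 5
  | Movq_simm32 -> 7
  | Movabs_imm64 -> 10

let () =
  (* 0xffffffff does not fit a signed 32-bit immediate, so without the
     32-bit form it would need the 10-byte movabs. *)
  assert (choose_mov 0xFFFF_FFFFL = Movl_imm32);
  assert (encoded_length (choose_mov 0xFFFF_FFFFL) = 5)
```

Under these assumptions the saving is two bytes for constants in [0..0x7FFF_FFFF] and five bytes for constants in [0x8000_0000..0xFFFF_FFFF].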

xclerc · 2017-11-24T17:46:17Z

Here is the comparison against trunk mentioned above:
http://bench.flambda.ocamlpro.com/compare?test=2017-11-24-1724%2F4.07.0%2Bamd64tweaks%2Bbench&reference=2017-11-24-1724%2F4.07.0%2Btrunk%2Bbench

xclerc · 2017-12-06T12:59:36Z

An instrumented version of the compiler gives the following distribution
of Iconst_int instructions for a run of make world.opt:

| Compilation | Percentage |
|---|---|
| 32-bit `xor` | 7.1 |
| 32-bit `mov` | 89.2 |
| `mov` + `cdqe` | 0.3 |
| `mov` + `sal` | 0.3 |
| no optimization | 3.1 |

As @xavierleroy pointed out, there is no risk of performance regression if we
restrict the patch to only the first two cases.

Would you consider such a patch for merge?

xclerc · 2018-01-11T13:02:58Z

Sorry to insist, but would you accept a patch reduced to the following changes?

- the use of 32-bit instructions for xor / mov / and when the operand
  is in [0..0xFFFF_FFFF]
- the use of add / sub instructions instead of inc / dec ones

xavierleroy · 2018-01-14T17:45:26Z

> the use of 32-bit instructions for xor / mov / and when the operand is in [0..0xFFFF_FFFF]

Definitely yes.

> the use of add / sub instructions instead of inc / dec ones

Yes, but this is trading increased code size for reduced dependencies, so perhaps keep inc / dec if fast_code is false.

xclerc · 2018-01-16T12:28:43Z

> > the use of 32-bit instructions for xor / mov / and when the operand is in [0..0xFFFF_FFFF]
>
> Definitely yes.

I have removed the complex patterns around the mov instruction.

> > the use of add / sub instructions instead of inc / dec ones
>
> Yes, but this is trading increased code size for reduced dependencies, so perhaps keep inc / dec if fast_code is false.

As suggested, I have made the code emitting inc / dec depend on
the value of fastcode_flag.
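The trade-off agreed on here can be sketched as follows. This is a hypothetical stand-alone illustration: `emit_add_const` and its `~fast_code` argument are invented for the example and only mimic the idea of consulting a flag; the real emitter works on its own instruction representation rather than on strings.

```ocaml
(* Hypothetical sketch: pick the instruction used to add the constant n
   to a register. inc/dec are one byte shorter but do not update all
   flags, so they are kept only when optimizing for size (fast_code
   false). The min_int corner case is ignored for simplicity. *)
let emit_add_const ~(fast_code : bool) (reg : string) (n : int) : string =
  if n = 1 && not fast_code then Printf.sprintf "inc %s" reg
  else if n = -1 && not fast_code then Printf.sprintf "dec %s" reg
  else if n < 0 then Printf.sprintf "sub $%d, %s" (-n) reg
  else Printf.sprintf "add $%d, %s" n reg

let () =
  (* Size-oriented code keeps the shorter instruction... *)
  print_endline (emit_add_const ~fast_code:false "%rax" 1);
  (* ...while speed-oriented code avoids the partial flag update. *)
  print_endline (emit_add_const ~fast_code:true "%rax" 1)
```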

xclerc · 2018-01-16T14:35:18Z

Travis is not happy. The check_all_arches target fails because the last
commit uses a constant (0x01_0000_0000n) that is invalid on
a 32-bit architecture. See #1571.

xavierleroy · 2018-01-16T15:14:25Z

> the last commit uses a constant (0x01_0000_0000n) that is invalid on a 32-bit architecture.

Just write n <= 0xFFFF_FFFFn instead of n < 0x01_0000_0000. Just as clear + syntactically valid on a 32-bit system.

xclerc · 2018-01-16T15:53:40Z

> Just write n <= 0xFFFF_FFFFn instead of n < 0x01_0000_0000. Just as clear + syntactically valid on a 32-bit system.

Of course. My gut feeling was that it would hide a real issue.
I guess my understanding was slightly wrong: the code built
by check_all_arches is only expected to be compiled but not
to be executed (some operations over nativeint values
in cmmgen.ml would probably produce invalid results on
32-bit platforms anyway).

xclerc · 2018-01-16T16:13:39Z

Travis is still not happy, but this time it is not even clear why.
The last two lines of the log are:

========= CHECKING asmcomp/arm64 ==============
make: *** [check_all_arches] Error 1

xavierleroy · 2018-01-16T16:37:06Z

You have another occurrence of 0x01_0000_0000 left.

xclerc · 2018-01-16T16:53:34Z

How embarrassing... I guess I was puzzled by the change/lack
of error message.

However, isn't this occurrence a bit more annoying? As it is an
int value on a 32-bit machine, I will probably not be allowed to
write 0xFFFF_FFFF. Of course, I can promote the value to be
compared to int64, but that does not feel right...
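For reference, the int64 promotion mentioned above could look like the sketch below. `int_fits_in_u32` is a hypothetical name, and this is only one way to write the test so that it compiles on both 32-bit and 64-bit hosts; it is not necessarily what the patch ended up doing.

```ocaml
(* Hypothetical helper: does the OCaml int n lie in [0..0xFFFF_FFFF]?
   The bound is written as an int64 literal, which is valid whatever the
   host word size, so no out-of-range int literal appears in the source. *)
let int_fits_in_u32 (n : int) : bool =
  let n64 = Int64.of_int n in
  n64 >= 0L && n64 <= 0xFFFF_FFFFL

let () =
  assert (int_fits_in_u32 0);
  assert (int_fits_in_u32 0xFFFF);
  assert (not (int_fits_in_u32 (-1)))
```

On a 32-bit host every non-negative int passes this test, since ints there are only 31 bits wide; the promotion costs a conversion per comparison, which is the part that "does not feel right".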

smuenzel-js · 2019-03-27T03:20:14Z

@xclerc: Looks like you merged trunk into your branch, and this resulted in the pull request being messed up. Can you remove the merge commit and rebase instead?

Some other ideas (for a future PR):

For Iand: when n = 0xFF, use movzx

25 ff 00 00 00       	and    $0xff,%eax
0f b6 c0             	movzbl %al,%eax

For other operations with 8-bit immediates: We may get a shorter encoding if we use 8-bit registers. This transformation is valid when the high 56 bits are not affected by the operation. IIRC this does not result in an extra merge micro-op.

48 83 cb 01          	or     $0x1,%rbx
80 cb 01             	or     $0x1,%bl
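The validity condition ("the high 56 bits are not affected") can be checked on a small model. The sketch below is an illustration under stated assumptions: `or_r64_imm8` and `or_r8_imm8` are hypothetical helpers modelling the two encodings on `int64` register values, with the immediate given as an int in [-128..127].

```ocaml
(* Model of "or $imm8,%r64": the 8-bit immediate is sign-extended to
   64 bits before the operation (Int64.of_int preserves the sign). *)
let or_r64_imm8 (r : int64) (imm : int) : int64 =
  Int64.logor r (Int64.of_int imm)

(* Model of "or $imm8,%r8": only the low byte is computed and written;
   the high 56 bits of the register are left untouched. *)
let or_r8_imm8 (r : int64) (imm : int) : int64 =
  let low = Int64.logand (Int64.logor r (Int64.of_int imm)) 0xFFL in
  Int64.logor (Int64.logand r (Int64.lognot 0xFFL)) low

let () =
  let rbx = 0x1234_5678_9ABC_DE00L in
  (* Non-negative immediate: both forms agree. *)
  assert (or_r64_imm8 rbx 1 = or_r8_imm8 rbx 1);
  (* Negative immediate: sign extension sets the high bits in the 64-bit
     form only, so the rewrite would not be valid. *)
  assert (or_r64_imm8 rbx (-128) <> or_r8_imm8 rbx (-128))
```

In this model the 8-bit form matches the 64-bit one for `or` exactly when the immediate is non-negative; other operations would need their own condition.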

We may want to consider two-instruction sequences that start with zero-cost operations (xor r32,r32) if they result in a shorter overall encoding.

48 c7 c0 01 00 00 00    mov $0x1,%rax
b8 01 00 00 00          mov $0x1,%eax
31 c0 b0 01             xor %eax,%eax; mov $0x1,%al

smuenzel · 2020-04-15T05:40:23Z

ping?

damiendoligez · 2020-09-08T14:30:39Z

ping @xclerc

gasche · 2021-04-18T14:46:32Z

Ping again. What is the status here? If @xclerc is not interested in working on this anymore, maybe @smuenzel you would be interested in rebasing the restricted version of this PR and proposing it again, as a new PR?

(I am planning to close next time I encounter this PR if there has not been any progress.)

gasche · 2021-04-28T12:29:09Z

Closing. We can always reopen if there is interest.

damiendoligez mentioned this pull request Feb 5, 2018

Make the check_all_arches target no-op on 32-bit architectures #1571

Closed

shindere and others added 9 commits February 8, 2018 15:04

ocamltest: fix typo in previous commit

01d436b

Fix minor typo

3899948

Merge pull request ocaml#1601 from delamonpansie/fix-doc-typo

f8b145b

Fix minor typo

MPR#7710: cyclic dependencies are an error for ocamldep -sort (ocaml#…

e2e928f

…1578) Before, cyclic dependencies were reported as a warning, and ocamldep -sort would exit with code 0. Now, the message says "error" and the exit code is nonzero.

Use gawk on Windows in the build system

4d79045

awk is symbolic link in Cygwin, which means it can't be used in -pp for a native Windows build. Just use gawk instead, as no other package provides the awk command on Cygwin.

Alter awk scripts to cope with CRLF checkout

5e35fd0

Minimal fix for the manual build pocess

e95b0c4

Factorize the build of the unprefixed stdlib

8f2a153

which is used by both ocamldoc and the reference manual.

shindere and others added 20 commits March 18, 2018 22:24

Migrate the typing-objects-bugs tests to ocamltest

df471e0

Migrate the typing-poly-bugs tests to ocamltest

516bfd1

Migrate the typing-polyvariants-bugs tests to ocamltest

79dd370

Migrate the typing-private-bugs tests to ocamltest

76bb01a

Migrate the typing-recmod tests to ocamltest

36e611e

Migrate the typing-rectypes-bugs tests to ocamltest

ca4fa29

ocamltest: export environment to more external commands

2160aac

Fix the lib-unix/common/redirections.ml test

7a0c95c

Since the reflector auxiliary program used by this test is now written in OCaml rather than in C, the redirections father process must take care to pass on OCAMLRUNPARAM, othewise the test fails when run with the debug runtime.

Improve naming in testsuite/tests/lib-unix/common/redirections.ml

7d5e40c

The function argument ocamlrunparam should actually be called systemenv.

Add support for CDQE instruction.

ed78153

Use smaller MOV instructions when possible.

d89b6c7

Use smaller AND instructions when possible.

960f81b

Use smaller XOR instructions when possible.

0a60973

Avoid INC/DEC instructions.

cc12fc6

Keep INC/DEC instructions when fast_code is false.

986da2c

Remove support for CDQE instruction.

f0eea94

Delete the complex patterns around the MOV instruction.

dce4c23

Make the code valid on 32-bit architectures.

18abce6

Add change entry for GPR#1490.

3588014

Merge branch 'amd64-emit-tweaks' of https://github.com/xclerc/ocaml i…

f901a35

…nto amd64-emit-tweaks

ctk21 mentioned this pull request Sep 27, 2019

amd64: Emit 32bit registers for Iconst_int when we can #8990

Merged

gasche closed this Apr 28, 2021

xclerc mentioned this pull request May 10, 2021

Amd64 emit tweaks oxcaml/oxcaml#25

Merged

gretay-js mentioned this pull request Jan 10, 2024

Always use x86 inc/dec for add/sub with 1 oxcaml/oxcaml#2211

Merged

EmileTrotignon pushed a commit to EmileTrotignon/ocaml that referenced this pull request Jan 12, 2024

Fix link to heading around UTop recommendation (ocaml#1490)

494c88d
