MachInst lowering logic: allow effectful instructions to merge.#2366

Merged
cfallin merged 1 commit into bytecodealliance:main from cfallin:load-isel on Nov 16, 2020

Conversation

@cfallin
Member

@cfallin cfallin commented Nov 5, 2020

This PR updates the "coloring" scheme that accounts for side-effects in
the MachInst lowering logic. As a result, the new backends will now be
able to merge effectful operations (such as memory loads) *into* other
operations; previously, only the other way (pure ops merged into
effectful ops) was possible. This will allow, for example, a load + ALU-op
combination, as is common on x86. It should even allow a load + ALU-op +
store sequence to merge into one lowered instruction.

The scheme arose from many fruitful discussions with @julian-seward1
(thanks!); significant credit is due to him for the insights here.

The first insight is that given the right basic conditions, i.e. that
the root instruction is the only use of an effectful instruction's
result, all we need is that the "color" of the effectful instruction is
*one less* than the color of the current instruction. It's easier to
think about colors on the program points between instructions: if the
color coming *out* of the first (effectful def) instruction and the
color coming *in* to the second (effectful or effect-free use)
instruction are the same, then the two can merge. Basically the color
denotes a version of global state; if the same, then no other effectful
ops happened in the meantime.

The second insight is that we can keep state as we scan, tracking the
"current color", and *update* this when we sink (merge) an op. Hence
when we sink a load into another op, we effectively *re-color* every
instruction it moved over; this may allow further sinks.

Consider the following example (and assume that loads are treated as
effectful in order to conservatively ensure a strong memory model;
otherwise, substitute other effectful value-producing insts):

```
  v0 = load x
  v1 = load y
  v2 = add v0, 1
  v3 = add v1, 1
```

Scanning from bottom to top, we first see the add producing `v3` and we
can sink the load producing `v1` into it, producing a load + ALU-op
machine instruction. This is legal because `v1` moves over only `v2`,
which is a pure instruction. Consider, though, `v2`: under a simple
scheme that has no other context, `v0` could not sink to `v2` because it
would move over `v1`, another load. But because we already sunk `v1`
down to `v3`, we are free to sink `v0` to `v2`; the update of the
"current color" during the scan allows this.

This PR also cleans up the `LowerCtx` interface a bit at the same time:
whereas previously it always gave some subset of (constant, mergeable
inst, register) directly from `LowerCtx::get_input()`, it now returns
zero or more of (constant, mergeable inst) from
`LowerCtx::maybe_get_input_as_source_or_const()`, and returns the
register only from `LowerCtx::put_input_in_reg()`. This removes the need
to explicitly denote uses of the register, so it's a little safer.
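
Sketched as code, the split looks roughly like this; the types and
signatures below are paraphrased from the description above, not copied
from the real trait:

```rust
/// Stand-in types; the real backend uses Cranelift's own.
pub struct Reg(pub u8);
pub struct InsnInput {
    pub insn: usize,
    pub input: usize,
}

/// What a lowering rule may learn about an input without touching its
/// register: zero or more of a constant value and a mergeable source
/// instruction.
pub struct SourceOrConst {
    pub constant: Option<u64>,
    pub mergeable_inst: Option<usize>,
}

pub trait LowerCtxSketch {
    /// Reports the constant and/or mergeable defining instruction, if
    /// any. Crucially, calling this does *not* count as a register
    /// use, so probing for a merge opportunity has no side effects.
    fn maybe_get_input_as_source_or_const(&mut self, input: InsnInput) -> SourceOrConst;

    /// The only way to obtain the input in a register; the register
    /// use is recorded here, so it can no longer be forgotten.
    fn put_input_in_reg(&mut self, input: InsnInput) -> Reg;
}
```

A lowering rule first probes `maybe_get_input_as_source_or_const()` for
a constant or a sinkable source instruction, and falls back to
`put_input_in_reg()` only when it actually needs the value in a
register.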

Note that this PR does not actually make use of the new ability to merge
loads into other ops; that will come in future PRs, especially to
optimize the `x64` backend by using direct-memory operands.

Testing: existing filetests ensure the DCE-while-lowering and existing
merging pattern-matches continue to work; merging of effectful ops
will be verified in a subsequent PR with load+op pattern-matching.

Fixes #2340 ("Allow loads to merge into other operations during
instruction selection in MachInst backends").

@cfallin cfallin added the cranelift:area:machinst (Issues related to instruction selection and the new MachInst backend) label Nov 5, 2020
@github-actions github-actions bot added the cranelift (Issues related to the Cranelift code generator), cranelift:area:aarch64 (Issues related to AArch64 backend), and cranelift:area:x64 (Issues related to x64 codegen) labels Nov 5, 2020
cfallin added a commit to cfallin/wasmtime that referenced this pull request Nov 7, 2020
This was added as an incremental step to improve AArch64 code quality in
PR bytecodealliance#2278. At the time, we did not have a way to pattern-match the load +
splat opcode sequence that the relevant Wasm opcodes lowered to.
However, now with PR bytecodealliance#2366, we can merge effectful instructions such as
loads into other ops, and so we can do this pattern matching directly.
The pattern-matching update will come in a subsequent commit.
@cfallin cfallin force-pushed the load-isel branch 2 times, most recently from ea5e7b3 to 61a629f on November 10, 2020
cfallin added a commit to cfallin/wasmtime that referenced this pull request Nov 11, 2020
This PR makes use of the support in bytecodealliance#2366 for sinking effectful
instructions and merging them with consumers. In particular, on x86, we
want to make use of the ability of many instructions to load one operand
directly from memory. That is, instead of this:

```
    movq 0(%rdi), %rax
    addq %rax, %rbx
```

we want to generate this:

```
    addq 0(%rdi), %rbx
```

As described in more detail in bytecodealliance#2366, sinking and merging the load is
only possible under certain conditions. In particular, we need to ensure
that the use is the *only* use (otherwise the load happens more than
once), and we need to ensure that it does not move across other
effectful ops (see bytecodealliance#2366 for how we ensure this).

This change is actually fairly simple, given that all the framework is
in place: we simply pattern-match a load on one operand of an ALU
instruction that takes an RMI (reg, mem, or immediate) operand, and
generate the mem form when we match.

Also makes a drive-by improvement in the x64 backend to use
statically-monomorphized `LowerCtx` types rather than a `&mut dyn
LowerCtx`.

On `bz2.wasm`, this results in ~1% instruction-count reduction. More is
likely possible by following up with other instructions that can merge
memory loads as well.
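
A rough illustration of that pattern match follows; `Amode`,
`RegMemImm`, and `InputDef` are hypothetical stand-ins, not the actual
x64 backend types:

```rust
/// A memory address: here just a base register plus an offset.
#[derive(Clone, Copy)]
pub struct Amode {
    pub base: u8,
    pub offset: i32,
}

/// The three forms an RMI operand can take.
pub enum RegMemImm {
    Reg(u8),
    Mem(Amode),
    Imm(u32),
}

/// One ALU operand, as the matcher sees it after the single-use and
/// color checks described above.
pub enum InputDef {
    /// Defined by a load whose only use is this instruction and whose
    /// color permits sinking.
    SinkableLoad(Amode),
    /// Anything else: read it out of a register.
    Other(u8),
}

/// Generate the memory form when the match succeeds, e.g. turning
/// `movq 0(%rdi), %rax; addq %rax, %rbx` into `addq 0(%rdi), %rbx`.
pub fn alu_rmi_operand(input: InputDef) -> RegMemImm {
    match input {
        InputDef::SinkableLoad(addr) => RegMemImm::Mem(addr),
        InputDef::Other(reg) => RegMemImm::Reg(reg),
    }
}
```
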
@cfallin
Member Author

cfallin commented Nov 11, 2020

Any volunteers to review? @julian-seward1 has let me know he's deep in fire-fighting at the moment; @fitzgen or @peterhuene perhaps?

@fitzgen
Member

fitzgen commented Nov 11, 2020

I can review this if no one else with expertise in this part of the code base has time.

Quick question: will this (eventually) enable collapsing

```
v0 = load.i32 $addr
v1 = iadd_imm v0, 1
store v1, $addr
```

into

```
add $addr, 1
```

on x64? We emit CLIF like that in our refcounting GC barriers for externrefs right now >.<

@cfallin
Member Author

cfallin commented Nov 12, 2020

@fitzgen Thanks very much!

> Quick question: will this (eventually) enable collapsing
>
> ```
> v0 = load.i32 $addr
> v1 = iadd_imm v0, 1
> store v1, $addr
> ```
>
> into
>
> ```
> add $addr, 1
> ```
>
> on x64? We emit CLIF like that in our refcounting GC barriers for externrefs right now >.<

Indeed, that's possible to build once this lands; it would be the next step after #2389 (which so far just merges the load and add).

@cfallin cfallin requested review from fitzgen and removed request for julian-seward1 November 12, 2020 00:26
Member

@fitzgen fitzgen left a comment

This makes sense, although I think that if we assume plain loads have effects then my refcounting example won't collapse to `add $mem, 1`, because the load and the store would have different colors. We will probably want separate instructions for atomic memory operations (or to parameterize memory operations with an ordering). It is unfortunate that we treat Wasm linear-memory loads, stack-slot loads, internal VM-context loads, etc. identically.

A few inline nitpicks and soft suggestions below. Feel free to take em or leave em.

@cfallin cfallin force-pushed the load-isel branch 2 times, most recently from d74a205 to 7c88af5 on November 16, 2020
@cfallin cfallin merged commit 2150a53 into bytecodealliance:main Nov 16, 2020
@cfallin cfallin deleted the load-isel branch January 6, 2021 18:03