MachInst lowering logic: allow effectful instructions to merge.#2366

Merged
cfallin merged 1 commit into bytecodealliance:main from cfallin:load-isel on Nov 16, 2020

Conversation

@cfallin
Member

@cfallin cfallin commented Nov 5, 2020

This PR updates the "coloring" scheme that accounts for side-effects in
the MachInst lowering logic. As a result, the new backends will now be
able to merge effectful operations (such as memory loads) *into* other
operations; previously, only the other way (pure ops merged into
effectful ops) was possible. This will allow, for example, a load + ALU-op
combination, as is common on x86. It should even allow a load + ALU-op +
store sequence to merge into one lowered instruction.

The scheme arose from many fruitful discussions with @julian-seward1
(thanks!); significant credit is due to him for the insights here.

The first insight is that given the right basic conditions, i.e. that
the root instruction is the only use of an effectful instruction's
result, all we need is that the "color" of the effectful instruction is
*one less* than the color of the current instruction. It's easier to
think about colors on the program points between instructions: if the
color coming *out* of the first (effectful def) instruction and the
color coming *in* to the second (effectful or effect-free use)
instruction are the same, then the two can merge. Basically the color
denotes a version of global state; if the same, then no other effectful
ops happened in the meantime.

The second insight is that we can keep state as we scan, tracking the
"current color", and *update* this when we sink (merge) an op. Hence
when we sink a load into another op, we effectively *re-color* every
instruction it moved over; this may allow further sinks.

Consider the following example (and assume that loads are treated as
effectful in order to conservatively ensure a strong memory model;
otherwise, substitute other effectful value-producing insts):

```
  v0 = load x
  v1 = load y
  v2 = add v0, 1
  v3 = add v1, 1
```

Scanning from bottom to top, we first see the add producing `v3` and we
can sink the load producing `v1` into it, producing a load + ALU-op
machine instruction. This is legal because `v1` moves over only `v2`,
which is a pure instruction. Consider, though, `v2`: under a simple
scheme that has no other context, `v0` could not sink to `v2` because it
would move over `v1`, another load. But because we already sunk `v1`
down to `v3`, we are free to sink `v0` to `v2`; the update of the
"current color" during the scan allows this.

This PR also cleans up the `LowerCtx` interface a bit at the same time:
whereas previously it always gave some subset of (constant, mergeable
inst, register) directly from `LowerCtx::get_input()`, it now returns
zero or more of (constant, mergeable inst) from
`LowerCtx::maybe_get_input_as_source_or_const()`, and returns the
register only from `LowerCtx::put_input_in_reg()`. This removes the need
to explicitly denote uses of the register, so it's a little safer.
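
Sketched as code, the split looks roughly like this; the types and
signatures below are paraphrased from the description above, not copied
from the real trait:

```rust
/// Stand-in types; the real backend uses Cranelift's own.
pub struct Reg(pub u8);
pub struct InsnInput {
    pub insn: usize,
    pub input: usize,
}

/// What a lowering rule may learn about an input without touching its
/// register: zero or more of a constant value and a mergeable source
/// instruction.
pub struct SourceOrConst {
    pub constant: Option<u64>,
    pub mergeable_inst: Option<usize>,
}

pub trait LowerCtxSketch {
    /// Reports the constant and/or mergeable defining instruction, if
    /// any. Crucially, calling this does *not* count as a register
    /// use, so probing for a merge opportunity has no side effects.
    fn maybe_get_input_as_source_or_const(&mut self, input: InsnInput) -> SourceOrConst;

    /// The only way to obtain the input in a register; the register
    /// use is recorded here, so it can no longer be forgotten.
    fn put_input_in_reg(&mut self, input: InsnInput) -> Reg;
}
```

A lowering rule first probes `maybe_get_input_as_source_or_const()` for
a constant or a sinkable source instruction, and falls back to
`put_input_in_reg()` only when it actually needs the value in a
register.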

Note that this PR does not actually make use of the new ability to merge
loads into other ops; that will come in future PRs, especially to
optimize the `x64` backend by using direct-memory operands.

Testing: existing filetests ensure the DCE-while-lowering and existing
merging pattern-matches continue to work; merging of effectful ops
will be verified in a subsequent PR with load+op pattern-matching.

Fixes #2340 ("Allow loads to merge into other operations during
instruction selection in MachInst backends").

@cfallin cfallin added the cranelift:area:machinst (Issues related to instruction selection and the new MachInst backend) label Nov 5, 2020
@github-actions github-actions bot added the cranelift (Issues related to the Cranelift code generator), cranelift:area:aarch64 (Issues related to AArch64 backend), and cranelift:area:x64 (Issues related to x64 codegen) labels Nov 5, 2020
cfallin added a commit to cfallin/wasmtime that referenced this pull request Nov 7, 2020
This was added as an incremental step to improve AArch64 code quality in
PR bytecodealliance#2278. At the time, we did not have a way to pattern-match the load +
splat opcode sequence that the relevant Wasm opcodes lowered to.
However, now with PR bytecodealliance#2366, we can merge effectful instructions such as
loads into other ops, and so we can do this pattern matching directly.
The pattern-matching update will come in a subsequent commit.
@cfallin cfallin force-pushed the load-isel branch 2 times, most recently from ea5e7b3 to 61a629f on November 10, 2020
cfallin added a commit to cfallin/wasmtime that referenced this pull request Nov 11, 2020
This PR makes use of the support in bytecodealliance#2366 for sinking effectful
instructions and merging them with consumers. In particular, on x86, we
want to make use of the ability of many instructions to load one operand
directly from memory. That is, instead of this:

```
    movq 0(%rdi), %rax
    addq %rax, %rbx
```

we want to generate this:

```
    addq 0(%rdi), %rbx
```

As described in more detail in bytecodealliance#2366, sinking and merging the load is
only possible under certain conditions. In particular, we need to ensure
that the use is the *only* use (otherwise the load happens more than
once), and we need to ensure that it does not move across other
effectful ops (see bytecodealliance#2366 for how we ensure this).

This change is actually fairly simple, given that all the framework is
in place: we simply pattern-match a load on one operand of an ALU
instruction that takes an RMI (reg, mem, or immediate) operand, and
generate the mem form when we match.

Also makes a drive-by improvement in the x64 backend to use
statically-monomorphized `LowerCtx` types rather than a `&mut dyn
LowerCtx`.

On `bz2.wasm`, this results in ~1% instruction-count reduction. More is
likely possible by following up with other instructions that can merge
memory loads as well.
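
A rough illustration of that pattern match follows; `Amode`,
`RegMemImm`, and `InputDef` are hypothetical stand-ins, not the actual
x64 backend types:

```rust
/// A memory address: here just a base register plus an offset.
#[derive(Clone, Copy)]
pub struct Amode {
    pub base: u8,
    pub offset: i32,
}

/// The three forms an RMI operand can take.
pub enum RegMemImm {
    Reg(u8),
    Mem(Amode),
    Imm(u32),
}

/// One ALU operand, as the matcher sees it after the single-use and
/// color checks described above.
pub enum InputDef {
    /// Defined by a load whose only use is this instruction and whose
    /// color permits sinking.
    SinkableLoad(Amode),
    /// Anything else: read it out of a register.
    Other(u8),
}

/// Generate the memory form when the match succeeds, e.g. turning
/// `movq 0(%rdi), %rax; addq %rax, %rbx` into `addq 0(%rdi), %rbx`.
pub fn alu_rmi_operand(input: InputDef) -> RegMemImm {
    match input {
        InputDef::SinkableLoad(addr) => RegMemImm::Mem(addr),
        InputDef::Other(reg) => RegMemImm::Reg(reg),
    }
}
```
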
@cfallin
Member Author

cfallin commented Nov 11, 2020

Any volunteers to review? @julian-seward1 has let me know he's deep in fire-fighting at the moment; @fitzgen or @peterhuene perhaps?

@fitzgen
Member

fitzgen commented Nov 11, 2020

I can review this if no one else with expertise in this part of the code base has time.

Quick question: will this (eventually) enable collapsing

```
v0 = load.i32 $addr
v1 = iadd_imm v0, 1
store v1, $addr
```

into

```
add $addr, 1
```

on x64? We emit CLIF like that in our refcounting GC barriers for externrefs right now >.<

@cfallin
Member Author

cfallin commented Nov 12, 2020

@fitzgen Thanks very much!

> Quick question: will this (eventually) enable collapsing
>
> ```
> v0 = load.i32 $addr
> v1 = iadd_imm v0, 1
> store v1, $addr
> ```
>
> into
>
> ```
> add $addr, 1
> ```
>
> on x64? We emit CLIF like that in our refcounting GC barriers for externrefs right now >.<

Indeed, that's possible to build once this lands; it would be the next step after #2389 (which so far just merges the load and add).

@cfallin cfallin requested review from fitzgen and removed request for julian-seward1 November 12, 2020 00:26
Member

@fitzgen fitzgen left a comment

This makes sense, although I think that if we assume plain loads have effects then my refcounting example won't collapse to `add $mem, 1`, because the load and the store would have different colors. We will probably want separate instructions for atomic memory operations (or to parameterize memory operations with an ordering). It is unfortunate that we treat Wasm linear-memory loads, stack-slot loads, internal VM-context loads, etc. identically.

A few inline nitpicks and soft suggestions below. Feel free to take em or leave em.

@cfallin cfallin force-pushed the load-isel branch 2 times, most recently from d74a205 to 7c88af5 on November 16, 2020
@cfallin cfallin merged commit 2150a53 into bytecodealliance:main Nov 16, 2020
@cfallin cfallin deleted the load-isel branch January 6, 2021 18:03