MachInst lowering logic: allow effectful instructions to merge. #2366
cfallin merged 1 commit into bytecodealliance:main
Conversation
This was added as an incremental step to improve AArch64 code quality in PR bytecodealliance#2278. At the time, we did not have a way to pattern-match the load + splat opcode sequence that the relevant Wasm opcodes lowered to. However, now with PR bytecodealliance#2366, we can merge effectful instructions such as loads into other ops, and so we can do this pattern matching directly. The pattern-matching update will come in a subsequent commit.
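Once loads can merge into their consumers, the load + splat pair can be matched directly during lowering. The sketch below illustrates the shape of that pattern match; `IrInst` and `MachInst` here are simplified stand-ins invented for this example, not Cranelift's actual data structures, and the target instruction is the kind of single load-and-replicate operation (e.g. LD1R on AArch64) such a match could emit.

```rust
#[derive(Debug, PartialEq)]
enum IrInst {
    Load { addr: u32 },
    Splat { input: Box<IrInst> },
}

#[derive(Debug, PartialEq)]
enum MachInst {
    LoadReplicate { addr: u32 }, // e.g. LD1R: load one element, splat to all lanes
    DupFromReg,                  // fallback: splat a value already in a register
}

// Assumes the matched load is known to be mergeable (its result has a
// single use and no other effectful op intervenes) -- the condition the
// new coloring scheme checks before allowing the merge.
fn lower_splat(inst: &IrInst) -> MachInst {
    match inst {
        IrInst::Splat { input } => match &**input {
            IrInst::Load { addr } => MachInst::LoadReplicate { addr: *addr },
            _ => MachInst::DupFromReg,
        },
        _ => MachInst::DupFromReg,
    }
}
```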
Force-pushed from ea5e7b3 to 61a629f
This PR makes use of the support in bytecodealliance#2366 for sinking effectful instructions and merging them with consumers. In particular, on x86, we want to make use of the ability of many instructions to load one operand directly from memory. That is, instead of this:

```
movq 0(%rdi), %rax
addq %rax, %rbx
```

we want to generate this:

```
addq 0(%rdi), %rbx
```

As described in more detail in bytecodealliance#2366, sinking and merging the load is only possible under certain conditions. In particular, we need to ensure that the use is the *only* use (otherwise the load happens more than once), and we need to ensure that it does not move across other effectful ops (see bytecodealliance#2366 for how we ensure this).

This change is actually fairly simple, given that all the framework is in place: we simply pattern-match a load on one operand of an ALU instruction that takes an RMI (reg, mem, or immediate) operand, and generate the mem form when we match.

Also makes a drive-by improvement in the x64 backend to use statically-monomorphized `LowerCtx` types rather than a `&mut dyn LowerCtx`.

On `bz2.wasm`, this results in a ~1% instruction-count reduction. More is likely possible by following up with other instructions that can merge memory loads as well.
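The operand-selection logic described above can be sketched roughly as follows. This is an illustrative stand-in, not the actual x64 backend code: `InputInfo`, the exact shape of `RegMemImm`, and `input_to_reg_mem_imm` are simplified inventions for this example (the real backend obtains this information by querying `LowerCtx`).

```rust
#[derive(Debug, PartialEq)]
enum RegMemImm {
    Reg { reg: u8 },
    Mem { base: u8, offset: i32 }, // merged load: operand read directly from memory
    Imm { value: u32 },
}

// What the lowering context can tell us about one ALU input.
struct InputInfo {
    constant: Option<u32>,
    // Some((base, offset)) if the input is produced by a load that is
    // mergeable: single-use, and not separated from this instruction by
    // other effectful ops.
    mergeable_load: Option<(u8, i32)>,
    reg: u8, // fallback: the register holding the value
}

fn input_to_reg_mem_imm(input: &InputInfo) -> RegMemImm {
    if let Some(value) = input.constant {
        // Constants can be encoded directly as immediates.
        return RegMemImm::Imm { value };
    }
    if let Some((base, offset)) = input.mergeable_load {
        // Sink the load into the ALU op: `addq off(base), dst` instead
        // of a separate `movq` plus a register-register `addq`.
        return RegMemImm::Mem { base, offset };
    }
    // Otherwise, use the register form.
    RegMemImm::Reg { reg: input.reg }
}
```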
Any volunteers to review? @julian-seward1 has let me know he's deep in fire-fighting at the moment; @fitzgen or @peterhuene perhaps?

I can review this if no one else with expertise in this part of the code base has time. Quick question: will this (eventually) enable collapsing a load + add + store sequence into a single memory-operand instruction on x64? We emit clif like that in our refcounting gc barriers for externrefs right now >.<

@fitzgen Thanks very much!

Indeed, that's possible to build once this lands; it would be the next step after #2389 (which so far just merges the load and add).
fitzgen left a comment
This makes sense, although I think that if we assume plain loads have effects then my refcounting example won't collapse to `add $mem, 1`, because the load and the store would have different colors. We will probably want separate instructions for atomic memory operations (or to parametrize memory operations with an ordering). It is unfortunate that we treat Wasm linear memory loads, stack slot loads, internal VM context loads, etc. identically.
A few inline nitpicks and soft suggestions below. Feel free to take em or leave em.
Force-pushed from d74a205 to 7c88af5
This PR updates the "coloring" scheme that accounts for side-effects in the MachInst lowering logic. As a result, the new backends will now be able to merge effectful operations (such as memory loads) *into* other operations; previously, only the other way (pure ops merged into effectful ops) was possible. This will allow, for example, a load + ALU-op combination, as is common on x86. It should even allow a load + ALU-op + store sequence to merge into one lowered instruction.

The scheme arose from many fruitful discussions with @julian-seward1 (thanks!); significant credit is due to him for the insights here.

The first insight is that given the right basic conditions, i.e. that the root instruction is the only use of an effectful instruction's result, all we need is that the "color" of the effectful instruction is *one less* than the color of the current instruction. It's easier to think about colors on the program points between instructions: if the color coming *out* of the first (effectful def) instruction and *in* to the second (effectful or effect-free use) instruction are the same, then they can merge. Basically, the color denotes a version of global state; if the colors are the same, then no other effectful ops happened in the meantime.

The second insight is that we can keep state as we scan, tracking the "current color", and *update* this when we sink (merge) an op. Hence when we sink a load into another op, we effectively *re-color* every instruction it moved over; this may allow further sinks. Consider the example (and assume that we consider loads effectful in order to conservatively ensure a strong memory model; otherwise, replace with other effectful value-producing insts):

```
v0 = load x
v1 = load y
v2 = add v0, 1
v3 = add v1, 1
```

Scanning from bottom to top, we first see the add producing `v3`, and we can sink the load producing `v1` into it, producing a load + ALU-op machine instruction. This is legal because `v1` moves over only `v2`, which is a pure instruction.

Consider, though, `v2`: under a simple scheme that has no other context, `v0` could not sink to `v2` because it would move over `v1`, another load. But because we already sunk `v1` down to `v3`, we are free to sink `v0` to `v2`; the update of the "current color" during the scan allows this.

This PR also cleans up the `LowerCtx` interface a bit at the same time: whereas previously it always gave some subset of (constant, mergeable inst, register) directly from `LowerCtx::get_input()`, it now returns zero or more of (constant, mergeable inst) from `LowerCtx::maybe_get_input_as_source_or_const()`, and returns the register only from `LowerCtx::put_input_in_reg()`. This removes the need to explicitly denote uses of the register, so it's a little safer.

Note that this PR does not actually make use of the new ability to merge loads into other ops; that will come in future PRs, especially to optimize the `x64` backend by using direct-memory operands.
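The scan-and-re-color idea above can be modeled in a few dozen lines. The following is a self-contained sketch under that description's assumptions, not Cranelift's implementation: `Inst`, `lower_block`, and the color bookkeeping are simplified stand-ins, and a value is identified by the index of its defining instruction.

```rust
struct Inst {
    name: &'static str,
    effectful: bool,  // e.g. a load, treated as effectful as in the example above
    uses: Vec<usize>, // indices of the instructions whose results this inst uses
}

// Scans bottom-up and returns the (consumer, sunk producer) pairs that merged.
fn lower_block(insts: &[Inst]) -> Vec<(&'static str, &'static str)> {
    let n = insts.len();

    // A sink is only legal if the consumer is the *only* use of the def.
    let mut use_count = vec![0u32; n];
    for inst in insts {
        for &d in &inst.uses {
            use_count[d] += 1;
        }
    }

    // Color program points top-down: each effectful inst advances the
    // "version of global state". color_out[i] is the color after inst i.
    let mut color_out = vec![0u32; n];
    let mut cur = 0u32;
    for (i, inst) in insts.iter().enumerate() {
        if inst.effectful {
            cur += 1;
        }
        color_out[i] = cur;
    }

    let mut merges = Vec::new();
    let mut sunk = vec![false; n];

    // Bottom-up scan.
    for i in (0..n).rev() {
        // Color at the point just above inst i, reflecting sinks so far.
        let color_in_i = if i == 0 { 0 } else { color_out[i - 1] };
        for &d in &insts[i].uses {
            // Merge is legal iff the def is the single-use result of an
            // effectful inst and no *other* effectful op sits between def
            // and use: the def's color-out equals the use's color-in.
            let mergeable = insts[d].effectful
                && !sunk[d]
                && use_count[d] == 1
                && color_out[d] == color_in_i;
            if mergeable {
                merges.push((insts[i].name, insts[d].name));
                sunk[d] = true;
                // Re-color: the sunk effect no longer happens at d, so every
                // point from d up to (but excluding) the consumer drops one
                // state version. This is what enables further sinks.
                for c in color_out.iter_mut().take(i).skip(d) {
                    *c -= 1;
                }
            }
        }
    }
    merges
}
```

On the four-instruction example above, the scan first sinks the load defining `v1` into `v3`'s add; the re-coloring of the point between `v1` and `v2` then makes it legal to sink `v0`'s load into `v2` as well.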
Testing: existing filetests ensure the DCE-while-lowering and existing merging pattern-matches continue to work; merging of effectful ops will be verified in a subsequent PR with load+op pattern-matching.

Fixes #2340.