Implement SIMD widening instructions for x86#1994
Merged
abrown merged 6 commits intobytecodealliance:mainfrom Jul 15, 2020
Merged
Implement SIMD widening instructions for x86#1994abrown merged 6 commits intobytecodealliance:mainfrom
abrown merged 6 commits intobytecodealliance:mainfrom
Conversation
Subscribe to Label Actioncc @bnjbvr DetailsThis issue or pull request has been labeled: "cranelift", "cranelift:area:aarch64", "cranelift:meta", "cranelift:wasm"Thus the following users have been cc'd because of the following labels:
To subscribe or unsubscribe from this label, edit the |
alexcrichton
added a commit
to alexcrichton/stdarch
that referenced
this pull request
Jul 13, 2020
Lots of time and lots of things have happened since the simd128 support was first added to this crate. Things are starting to settle down now so this commit syncs the Rust intrinsic definitions with the current specification (https://github.com/WebAssembly/simd). Unfortuantely not everything can be enabled just yet but everything is in the pipeline for getting enabled soon. This commit also applies a major revamp to how intrinsics are tested. The intention is that the setup should be much more lightweight and/or easy to work with after this commit. At a high-level, the changes here are: * Testing with node.js and `#[wasm_bindgen]` has been removed. Instead intrinsics are tested with Wasmtime which has a nearly complete implementation of the SIMD spec (and soon fully complete!) * Testing is switched to `wasm32-wasi` to make idiomatic Rust bits a bit easier to work with (e.g. `panic!)` * Testing of this crate's simd128 feature for wasm is re-enabled. This will run on CI and both compile and execute intrinsics. This should bring wasm intrinsics to the same level of parity as x86 intrinsics, for example. * New wasm intrinsics have been added: * `iNNxMM_loadAxA_{s,u}` * `vNNxMM_load_splat` * `v8x16_swizzle` * `v128_andnot` * `iNNxMM_abs` * `iNNxMM_narrow_*_{u,s}` * `iNNxMM_bitmask` - commented out until LLVM is updated to LLVM 11 * `iNNxMM_widen_*_{u,s}` - commented out until bytecodealliance/wasmtime#1994 lands * `iNNxMM_{max,min}_{u,s}` * `iNNxMM_avgr_u` * Some wasm intrinsics have been removed: * `i64x2_trunc_*` * `f64x2_convert_*` * `i8x16_mul` * The `v8x16.shuffle` instruction is exposed. This is done through a `macro` (not `macro_rules!`, but `macro`). This is intended to be somewhat experimental and unstable until we decide otherwise. This instruction has 16 immediate-mode expressions and is as a result unsuited to the existing `constify_*` logic of this crate. I'm hoping that we can game out over time what a macro might look like and/or look for better solutions. For now, though, what's implemented is the first of its kind in this crate (an architecture-specific macro), so some extra scrutiny looking at it would be appreciated. * Lots of `assert_instr` annotations have been fixed for wasm. * All wasm simd128 tests are uncommented and passing now. This is still missing tests for new intrinsics and it's also missing tests for various corner cases. I hope to get to those later as the upstream spec itself gets closer to stabilization. In the meantime, however, I went ahead and updated the `hex.rs` example with a wasm implementation using intrinsics. With it I got some very impressive speedups using Wasmtime: test benches::large_default ... bench: 213,961 ns/iter (+/- 5,108) = 4900 MB/s test benches::large_fallback ... bench: 3,108,434 ns/iter (+/- 75,730) = 337 MB/s test benches::small_default ... bench: 52 ns/iter (+/- 0) = 2250 MB/s test benches::small_fallback ... bench: 358 ns/iter (+/- 0) = 326 MB/s or otherwise using Wasmtime hex encoding using SIMD is 15x faster on 1MB chunks or 7x faster on small <128byte chunks. All of these intrinsics are still unstable and will continue to be so presumably until the simd proposal in wasm itself progresses to a later stage. Additionaly we'll still want to sync with clang on intrinsic names (or decide not to) at some point in the future.
alexcrichton
added a commit
to alexcrichton/stdarch
that referenced
this pull request
Jul 13, 2020
Lots of time and lots of things have happened since the simd128 support was first added to this crate. Things are starting to settle down now so this commit syncs the Rust intrinsic definitions with the current specification (https://github.com/WebAssembly/simd). Unfortuantely not everything can be enabled just yet but everything is in the pipeline for getting enabled soon. This commit also applies a major revamp to how intrinsics are tested. The intention is that the setup should be much more lightweight and/or easy to work with after this commit. At a high-level, the changes here are: * Testing with node.js and `#[wasm_bindgen]` has been removed. Instead intrinsics are tested with Wasmtime which has a nearly complete implementation of the SIMD spec (and soon fully complete!) * Testing is switched to `wasm32-wasi` to make idiomatic Rust bits a bit easier to work with (e.g. `panic!)` * Testing of this crate's simd128 feature for wasm is re-enabled. This will run on CI and both compile and execute intrinsics. This should bring wasm intrinsics to the same level of parity as x86 intrinsics, for example. * New wasm intrinsics have been added: * `iNNxMM_loadAxA_{s,u}` * `vNNxMM_load_splat` * `v8x16_swizzle` * `v128_andnot` * `iNNxMM_abs` * `iNNxMM_narrow_*_{u,s}` * `iNNxMM_bitmask` - commented out until LLVM is updated to LLVM 11 * `iNNxMM_widen_*_{u,s}` - commented out until bytecodealliance/wasmtime#1994 lands * `iNNxMM_{max,min}_{u,s}` * `iNNxMM_avgr_u` * Some wasm intrinsics have been removed: * `i64x2_trunc_*` * `f64x2_convert_*` * `i8x16_mul` * The `v8x16.shuffle` instruction is exposed. This is done through a `macro` (not `macro_rules!`, but `macro`). This is intended to be somewhat experimental and unstable until we decide otherwise. This instruction has 16 immediate-mode expressions and is as a result unsuited to the existing `constify_*` logic of this crate. I'm hoping that we can game out over time what a macro might look like and/or look for better solutions. For now, though, what's implemented is the first of its kind in this crate (an architecture-specific macro), so some extra scrutiny looking at it would be appreciated. * Lots of `assert_instr` annotations have been fixed for wasm. * All wasm simd128 tests are uncommented and passing now. This is still missing tests for new intrinsics and it's also missing tests for various corner cases. I hope to get to those later as the upstream spec itself gets closer to stabilization. In the meantime, however, I went ahead and updated the `hex.rs` example with a wasm implementation using intrinsics. With it I got some very impressive speedups using Wasmtime: test benches::large_default ... bench: 213,961 ns/iter (+/- 5,108) = 4900 MB/s test benches::large_fallback ... bench: 3,108,434 ns/iter (+/- 75,730) = 337 MB/s test benches::small_default ... bench: 52 ns/iter (+/- 0) = 2250 MB/s test benches::small_fallback ... bench: 358 ns/iter (+/- 0) = 326 MB/s or otherwise using Wasmtime hex encoding using SIMD is 15x faster on 1MB chunks or 7x faster on small <128byte chunks. All of these intrinsics are still unstable and will continue to be so presumably until the simd proposal in wasm itself progresses to a later stage. Additionaly we'll still want to sync with clang on intrinsic names (or decide not to) at some point in the future.
alexcrichton
added a commit
to alexcrichton/stdarch
that referenced
this pull request
Jul 13, 2020
Lots of time and lots of things have happened since the simd128 support was first added to this crate. Things are starting to settle down now so this commit syncs the Rust intrinsic definitions with the current specification (https://github.com/WebAssembly/simd). Unfortuantely not everything can be enabled just yet but everything is in the pipeline for getting enabled soon. This commit also applies a major revamp to how intrinsics are tested. The intention is that the setup should be much more lightweight and/or easy to work with after this commit. At a high-level, the changes here are: * Testing with node.js and `#[wasm_bindgen]` has been removed. Instead intrinsics are tested with Wasmtime which has a nearly complete implementation of the SIMD spec (and soon fully complete!) * Testing is switched to `wasm32-wasi` to make idiomatic Rust bits a bit easier to work with (e.g. `panic!)` * Testing of this crate's simd128 feature for wasm is re-enabled. This will run on CI and both compile and execute intrinsics. This should bring wasm intrinsics to the same level of parity as x86 intrinsics, for example. * New wasm intrinsics have been added: * `iNNxMM_loadAxA_{s,u}` * `vNNxMM_load_splat` * `v8x16_swizzle` * `v128_andnot` * `iNNxMM_abs` * `iNNxMM_narrow_*_{u,s}` * `iNNxMM_bitmask` - commented out until LLVM is updated to LLVM 11 * `iNNxMM_widen_*_{u,s}` - commented out until bytecodealliance/wasmtime#1994 lands * `iNNxMM_{max,min}_{u,s}` * `iNNxMM_avgr_u` * Some wasm intrinsics have been removed: * `i64x2_trunc_*` * `f64x2_convert_*` * `i8x16_mul` * The `v8x16.shuffle` instruction is exposed. This is done through a `macro` (not `macro_rules!`, but `macro`). This is intended to be somewhat experimental and unstable until we decide otherwise. This instruction has 16 immediate-mode expressions and is as a result unsuited to the existing `constify_*` logic of this crate. I'm hoping that we can game out over time what a macro might look like and/or look for better solutions. For now, though, what's implemented is the first of its kind in this crate (an architecture-specific macro), so some extra scrutiny looking at it would be appreciated. * Lots of `assert_instr` annotations have been fixed for wasm. * All wasm simd128 tests are uncommented and passing now. This is still missing tests for new intrinsics and it's also missing tests for various corner cases. I hope to get to those later as the upstream spec itself gets closer to stabilization. In the meantime, however, I went ahead and updated the `hex.rs` example with a wasm implementation using intrinsics. With it I got some very impressive speedups using Wasmtime: test benches::large_default ... bench: 213,961 ns/iter (+/- 5,108) = 4900 MB/s test benches::large_fallback ... bench: 3,108,434 ns/iter (+/- 75,730) = 337 MB/s test benches::small_default ... bench: 52 ns/iter (+/- 0) = 2250 MB/s test benches::small_fallback ... bench: 358 ns/iter (+/- 0) = 326 MB/s or otherwise using Wasmtime hex encoding using SIMD is 15x faster on 1MB chunks or 7x faster on small <128byte chunks. All of these intrinsics are still unstable and will continue to be so presumably until the simd proposal in wasm itself progresses to a later stage. Additionaly we'll still want to sync with clang on intrinsic names (or decide not to) at some point in the future.
julian-seward1
approved these changes
Jul 15, 2020
| pub static PADDUSW: [u8; 3] = [0x66, 0x0f, 0xdd]; | ||
|
|
||
| /// Concatenate destination and source operands, extract byte-aligned result shifted to the right by | ||
| /// the constant value in imm8 into xmm1 (SSSE3). |
Contributor
There was a problem hiding this comment.
Please specify whether the shift is by imm8 bits or imm8 bytes. I think it's the latter but I don't exactly remember.
… of lanes (i.e. merging) Certain operations (e.g. widening) will have operands with types like `NxM` but will return results with types like `(N*2)x(M/2)` (double the lane width, halve the number of lanes; maintain the same number of vector bits). This is equivalent to applying two `DerivedFunction`s to the type: `DerivedFunction::DoubleWidth` then `DerivedFunction::HalfVector`. Since there is no easy way to apply multiple `DerivedFunction`s (e.g. most of the logic is one-level deep, https://github.com/bytecodealliance/wasmtime/blob/1d5a678124e0f035f7614cafe43066c834a5113b/cranelift/codegen/meta/src/gen_inst.rs#L618-L621), I added `DerivedFunction::MergeLanes` to do the necessary type conversion.
This instruction is necessary for implementing `[s|u]widen_high`.
Use `x86_palignr` and `[u|s]widen_low` for legalizing this instruction.
Amanieu
pushed a commit
to rust-lang/stdarch
that referenced
this pull request
Jul 18, 2020
Lots of time and lots of things have happened since the simd128 support was first added to this crate. Things are starting to settle down now so this commit syncs the Rust intrinsic definitions with the current specification (https://github.com/WebAssembly/simd). Unfortuantely not everything can be enabled just yet but everything is in the pipeline for getting enabled soon. This commit also applies a major revamp to how intrinsics are tested. The intention is that the setup should be much more lightweight and/or easy to work with after this commit. At a high-level, the changes here are: * Testing with node.js and `#[wasm_bindgen]` has been removed. Instead intrinsics are tested with Wasmtime which has a nearly complete implementation of the SIMD spec (and soon fully complete!) * Testing is switched to `wasm32-wasi` to make idiomatic Rust bits a bit easier to work with (e.g. `panic!)` * Testing of this crate's simd128 feature for wasm is re-enabled. This will run on CI and both compile and execute intrinsics. This should bring wasm intrinsics to the same level of parity as x86 intrinsics, for example. * New wasm intrinsics have been added: * `iNNxMM_loadAxA_{s,u}` * `vNNxMM_load_splat` * `v8x16_swizzle` * `v128_andnot` * `iNNxMM_abs` * `iNNxMM_narrow_*_{u,s}` * `iNNxMM_bitmask` - commented out until LLVM is updated to LLVM 11 * `iNNxMM_widen_*_{u,s}` - commented out until bytecodealliance/wasmtime#1994 lands * `iNNxMM_{max,min}_{u,s}` * `iNNxMM_avgr_u` * Some wasm intrinsics have been removed: * `i64x2_trunc_*` * `f64x2_convert_*` * `i8x16_mul` * The `v8x16.shuffle` instruction is exposed. This is done through a `macro` (not `macro_rules!`, but `macro`). This is intended to be somewhat experimental and unstable until we decide otherwise. This instruction has 16 immediate-mode expressions and is as a result unsuited to the existing `constify_*` logic of this crate. I'm hoping that we can game out over time what a macro might look like and/or look for better solutions. For now, though, what's implemented is the first of its kind in this crate (an architecture-specific macro), so some extra scrutiny looking at it would be appreciated. * Lots of `assert_instr` annotations have been fixed for wasm. * All wasm simd128 tests are uncommented and passing now. This is still missing tests for new intrinsics and it's also missing tests for various corner cases. I hope to get to those later as the upstream spec itself gets closer to stabilization. In the meantime, however, I went ahead and updated the `hex.rs` example with a wasm implementation using intrinsics. With it I got some very impressive speedups using Wasmtime: test benches::large_default ... bench: 213,961 ns/iter (+/- 5,108) = 4900 MB/s test benches::large_fallback ... bench: 3,108,434 ns/iter (+/- 75,730) = 337 MB/s test benches::small_default ... bench: 52 ns/iter (+/- 0) = 2250 MB/s test benches::small_fallback ... bench: 358 ns/iter (+/- 0) = 326 MB/s or otherwise using Wasmtime hex encoding using SIMD is 15x faster on 1MB chunks or 7x faster on small <128byte chunks. All of these intrinsics are still unstable and will continue to be so presumably until the simd proposal in wasm itself progresses to a later stage. Additionaly we'll still want to sync with clang on intrinsic names (or decide not to) at some point in the future. * wasm: Unconditionally expose SIMD functions This commit unconditionally exposes SIMD functions from the `wasm32` module. This is done in such a way that the standard library does not need to be recompiled to access SIMD intrinsics and use them. This, hopefully, is the long-term story for SIMD in WebAssembly in Rust. It's unlikely that all WebAssembly runtimes will end up implementing SIMD so the standard library is unlikely to use SIMD any time soon, but we want to make sure it's easily available to folks! This commit enables all this by ensuring that SIMD is available to the standard library, regardless of compilation flags. This'll come with the same caveats as x86 support, where it doesn't make sense to call these functions unless you're enabling simd support one way or another locally. Additionally, as with x86, if you don't call these functions then the instructions won't show up in your binary. While I was here I went ahead and expanded the WebAssembly-specific documentation for the wasm32 module as well, ensuring that the current state of SIMD/Atomics are documented.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
After #1990 is merged [edit: it's merged!], this change adds the remaining instructions necessary for passing all of the current SIMD spec tests. This implementation uses
PALIGNR,PMOVSX*andPMOVZX*for lowering the Wasm SIMD widen instructions. It also adds a mechanism for merging lanes in the CDSL for calculating the various type conversions (e.g.i16x8is widened toi32x4).