Discuss implementing constants for SIMD#2
Discuss implementing constants for SIMD#2abrown wants to merge 28 commits intoadd-extract-lane-x86-rebasedfrom
Conversation
This refactor moves the calculation of the number of lanes to code closer to where the Instruction/BoundInstruction is bound; if in the future higher (or different) bit-widths are needed this may need to be made explicit again but for now, assuming 128-bit vectors, this avoids the opportunity for programmer errors in the SIMD code.
In talking to @sunfishcode, he preferred to avoid the confusion of more ISA predicates by eliminating SSE2. SSE2 was released with the Pentium 4 in 2000 so it is unlikely that current CPUs would have SIMD enabled and not have this feature. I tried to note the SSE2-specific instructions with comments in the code.
…ta arms
InstructionData match arms for immediates look like `InstructionData::SomeType{opcode, imm}`; when `imm` is a boxed value, Rust will error complaining that `imm` has been moved and the boxed type does not implement the copy trait. This change matches against `ref imm` (e.g.) and expects the match body to use `imm` accordingly.
Need to decide whether to do this here--currently impossible due to BTreeSet--or in Function::create_constant
|
I can only echo @sunfishcode here -- keep the constant pool with the function and emit a value so that reloc can be done predictably. I believe you are right though that we should be emitting the initial delta, as for jump tables, not zero -- there should be a need to fixup only if the consumer moves the rodata relative to the code, as SpiderMonkey does to make room for its own epilogue code that is pasted in late. When I recently made it possible to move the jump table in this way it seemed like it should generalize in a straightforward way to constants. |
|
(replying to the IRC comment about constant pools) From a security stand-point, I am for banning constant pools in the text section and move all the user-provided constant into a rodata section with no executable right. As Cranelift is meant to be used for generating code from untrusted sources, having any predictable sequences of raw bytes mapped as executable is a huge security issue. Instead of having program constants being allocated in a constant pool, I am in favor of having only artifact constant encoded in constant pools, including the pointer to the rodata where untrusted constant would be stored, without the need for constant blinding. Thus SpiderMonkey or other will just provide a way to map this rodata section, when generating the code. |
|
@nbp, just so we don't mix up two separate points here: At the moment, Cranelift generates code that does not require keeping constant pools with the code. As a practical matter the code and the pool are emitted next to each other (constants following code) in a bytevector by Cranelift, but the pool can subsequently be moved a substantial distance, certainly it'll be possible to move it to a page that does not have executable rights, if an embedder of Cranelift wishes to do so; the relocation information ensures that this will be possible. Most crucially for your point, there are no fragments of constant pools in the middle of the code with jumps across them, say. I agree we should keep it that way. Whether an embedder actually moves the rodata somewhere else is an orthogonal matter, and I think it should remain so. |
|
@lars-t-hansen, @nbp, thanks for your comments; I suspect @sunfishcode, @bnjbvr, and @bjorn3 may have comments as well. Looking at the disconnected comments in the code above is not the best and I have now had some more time to think about the above and with @jlb6740's help am going to try to put up some sort of plan to discuss: MotivationThe motivation behind discussing constant pools at all is to implement SIMD's This instruction moves 128 bits from a memory location to an XMM register. The Intel manual also documents an unaligned Prior to this point, cranelift had no need for constant pooling of any kind since all its WASM/CLIF immediates fit as immediates in the emitted instructions. Not anymore: the APIsTo clarify what I understand about cranelift (and what I don't), let me explain: cranelift emits instructions and data through a How an embedder of cranelift implements where Pooling: Function vs Higher?If I were to follow how jump tables are implemented in cranelift, constants would be pooled at the function level. However, it seems like it might be better to pool constants at a higher level (WASM module? cranelift context?):
ImplementationIn any case, whether constants are pooled at the function level or higher, we need a mechanism for tracking the constants and their sizes internally. After having tried several things, I propose: Whenever we create a constant (whether from WASM or CLIF directly) we call The I am requesting comments on this before I proceed and still need help with the following questions:
|
| impl From<u64> for Uimm128 { | ||
| fn from(x: u64) -> Self { | ||
| let mut buffer: [u8; 16] = [0; 16]; // zero-fill | ||
| (0..8).for_each(|byte| buffer[15 - byte] = (x >> (byte as u64 * 8) & 0xff) as u8); // insert each byte from the u64 into v after byte position 8 |
There was a problem hiding this comment.
I hadn't really thought much about this until you brought it up; thanks! I had always assumed that the [u8; 16] would be big-endian. To me that made sense because if I wrote 0x01 in CLIF I would assume that 01 byte is the least significant and rightmost in the array: [0, 0, ..., 1]. The code above might be confusing because I am trying to fill in bytes from right-to-left to leave 0s in the more significant bytes. If I wrote them correctly, the Uimm128 tests I wrote at the bottom of this file should show big-endianness.
Your comment made me start worrying that I might need the flip the order the bytes are emitted in the rodata section. I looked up MOVAPS and MOVUPS and the Intel manual has these semantics:
(V)MOVUPS (128-bit store-form version)
DEST[127:0] <-- SRC[127:0]
So it seems to me that I shouldn't fiddle around with the order too much because in Uimm128 I don't have to pack/unpack lanes into the 16 bytes; if a user of cranelift wanted to do a vconst.i32x4 I think they would need to structure their SIMD appropriately for the ISA.
|
@abrown, as @bjorn3 said I think we should stick to per-function pools here, so that we can move the pool with the code when we need to and relocate the pool when we need to do that (as we must in Firefox). Indeed you would hook your code in after the call to begin_rodata(). Regarding alignment, we could simply assume that the embedder will move the rodata to a location where it is properly aligned (which will be possible given the relocs). Or we could change the MemoryCodeSink to properly align the sections, for example its constructor could accept an isa or a block of parameters with the alignment requirements for the sections on the platform. |
|
@lars-t-hansen Hi .. thanks for your comments. Obviously the consensus from those who understand a little better is that this will make more sense if it is implemented at the function level and not the module level. Would it make sense/be feasible to think of a solution where there could be an option for module scope (not the default) for the constant pool for implementation cases outside of the browser. I am thinking for cases, maybe AOT, where memory is at premium (small form factor devices or even embedded that could support SIMD) any memory sharing between functions could be useful. @abrown I think whatever the final design we converge on want to be able to support types for future potential SIMD proposals for 256-bit and 512-bit. |
| ebb0: | ||
| [-, %xmm2] v0 = vconst.b8x16 0x00 ; bin: 0f 10 15 fffffff9 PCRelRodata4(0) | ||
| [-, %xmm3] v1 = vconst.b8x16 0x01 ; bin: 0f 10 1d 00000002 PCRelRodata4(16) | ||
| return |
There was a problem hiding this comment.
Can we add a way to check the readonly data emitted?
There was a problem hiding this comment.
How, though? I can't figure out how to get at the binemitted rodata after the function.
There was a problem hiding this comment.
I mean adding a new kind of directive to filetest.
There was a problem hiding this comment.
Let me know what you think of af2edc9--especially the formatting. Seems like comparing long vectors could get unwieldy.
1b00fa6 to
2ead474
Compare
2ead474 to
8bd1ff4
Compare
The x86 ISA has (at least) two encodings for PEXTRW: 1. in the SSE2 opcode (66 0f c5) the XMM operand uses r/m and the GPR operand uses reg 2. in the SSE4.1 opcode (66 0f 3a 15) the XMM operand uses reg and the GPR operand uses r/m This changes the 16-bit x86_pextr encoding from #1 to #2 to match the other PEXTR* implementations (all #2 style).
The x86 ISA has (at least) two encodings for PEXTRW: 1. in the SSE2 opcode (66 0f c5) the XMM operand uses r/m and the GPR operand uses reg 2. in the SSE4.1 opcode (66 0f 3a 15) the XMM operand uses reg and the GPR operand uses r/m This changes the 16-bit x86_pextr encoding from #1 to #2 to match the other PEXTR* implementations (all #2 style).
The x86 ISA has (at least) two encodings for PEXTRW: 1. in the SSE2 opcode (66 0f c5) the XMM operand uses r/m and the GPR operand uses reg 2. in the SSE4.1 opcode (66 0f 3a 15) the XMM operand uses reg and the GPR operand uses r/m This changes the 16-bit x86_pextr encoding from #1 to #2 to match the other PEXTR* implementations (all #2 style).
The x86 ISA has (at least) two encodings for PEXTRW: 1. in the SSE2 opcode (66 0f c5) the XMM operand uses r/m and the GPR operand uses reg 2. in the SSE4.1 opcode (66 0f 3a 15) the XMM operand uses reg and the GPR operand uses r/m This changes the 16-bit x86_pextr encoding from #1 to #2 to match the other PEXTR* implementations (all #2 style).
The x86 ISA has (at least) two encodings for PEXTRW: 1. in the SSE2 opcode (66 0f c5) the XMM operand uses r/m and the GPR operand uses reg 2. in the SSE4.1 opcode (66 0f 3a 15) the XMM operand uses reg and the GPR operand uses r/m This changes the 16-bit x86_pextr encoding from #1 to #2 to match the other PEXTR* implementations (all #2 style).
The x86 ISA has (at least) two encodings for PEXTRW: 1. in the SSE2 opcode (66 0f c5) the XMM operand uses r/m and the GPR operand uses reg 2. in the SSE4.1 opcode (66 0f 3a 15) the XMM operand uses reg and the GPR operand uses r/m This changes the 16-bit x86_pextr encoding from #1 to #2 to match the other PEXTR* implementations (all #2 style).
The x86 ISA has (at least) two encodings for PEXTRW: 1. in the SSE2 opcode (66 0f c5) the XMM operand uses r/m and the GPR operand uses reg 2. in the SSE4.1 opcode (66 0f 3a 15) the XMM operand uses reg and the GPR operand uses r/m This changes the 16-bit x86_pextr encoding from 1 to 2 to match the other PEXTR* implementations (all #2 style).
The x86 ISA has (at least) two encodings for PEXTRW: 1. in the SSE2 opcode (66 0f c5) the XMM operand uses r/m and the GPR operand uses reg 2. in the SSE4.1 opcode (66 0f 3a 15) the XMM operand uses reg and the GPR operand uses r/m This changes the 16-bit x86_pextr encoding from 1 to 2 to match the other PEXTR* implementations (all #2 style).
This is still a WIP and not intended for merging. Some high-level questions I hope to answer:
Uimm128) or generic size (e.g. seeConstant) or both (converting as appropriate)?