
Discuss implementing constants for SIMD#2

Closed
abrown wants to merge 28 commits into add-extract-lane-x86-rebased from add-simd-const-x86-rebased

Conversation

Owner

@abrown abrown commented Jul 17, 2019

This is still a WIP and not intended for merging. Some high-level questions I hope to answer:

  • should we be pooling constants at the function level or at the program level?
  • should we be using constants of specific size (e.g. see Uimm128) or generic size (e.g. see Constant) or both (converting as appropriate)?
  • when should we calculate the constant offsets: when translating to CLIF? in a separate pass?

abrown added 16 commits July 16, 2019 17:41
This refactor moves the calculation of the number of lanes closer to where the Instruction/BoundInstruction is bound. If higher (or different) bit widths are needed in the future, this may need to be made explicit again, but for now, assuming 128-bit vectors, it removes an opportunity for programmer error in the SIMD code.
In talking to @sunfishcode, I learned he preferred to avoid the confusion of more ISA predicates by eliminating SSE2. SSE2 was released with the Pentium 4 in 2000, so it is unlikely that a current CPU would have SIMD enabled but lack this feature. I tried to note the SSE2-specific instructions with comments in the code.
…ta arms

InstructionData match arms for immediates look like `InstructionData::SomeType { opcode, imm }`; when `imm` is a boxed value, Rust will error, complaining that `imm` has been moved because the boxed type does not implement the Copy trait. This change matches against `ref imm` (e.g.) and expects the match body to use `imm` accordingly.
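The move-vs-borrow issue described above can be sketched with a stand-in type (the enum below is a simplified stand-in, not the real Cranelift InstructionData):

```rust
// Simplified stand-in for Cranelift's InstructionData; the real enum has many
// more variants. The point is the boxed immediate, which is not Copy.
enum InstructionData {
    UnaryImm128 { opcode: &'static str, imm: Box<[u8; 16]> },
}

fn first_byte(inst: &InstructionData) -> u8 {
    match *inst {
        // `ref imm` borrows the box instead of moving it out of `inst`;
        // a plain `imm` binding here would fail to compile for a non-Copy value.
        InstructionData::UnaryImm128 { opcode: _, ref imm } => imm[0],
    }
}

fn main() {
    let inst = InstructionData::UnaryImm128 { opcode: "vconst", imm: Box::new([7; 16]) };
    assert_eq!(first_byte(&inst), 7);
}
```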
Need to decide whether to do this here--currently impossible due to BTreeSet--or in Function::create_constant
@abrown abrown changed the title Add simd const x86 rebased Discuss implementing constants for SIMD Jul 17, 2019
@lars-t-hansen

I can only echo @sunfishcode here -- keep the constant pool with the function and emit a value so that relocation can be done predictably. I believe you are right, though, that we should be emitting the initial delta, as for jump tables, not zero -- there should be a need to fix up only if the consumer moves the rodata relative to the code, as SpiderMonkey does to make room for its own epilogue code that is pasted in late. When I recently made it possible to move the jump table in this way, it seemed like it should generalize in a straightforward way to constants.

@nbp

nbp commented Jul 18, 2019

(replying to the IRC comment about constant pools)

From a security standpoint, I am for banning constant pools in the text section and moving all user-provided constants into a rodata section with no execute permission. As Cranelift is meant to be used for generating code from untrusted sources, having any predictable sequence of raw bytes mapped as executable is a huge security issue.

Instead of having program constants allocated in a constant pool, I am in favor of having only compiler-generated (artifact) constants encoded in constant pools, including the pointers to the rodata where untrusted constants would be stored, without the need for constant blinding.

Thus SpiderMonkey or other embedders would just provide a way to map this rodata section when generating the code.

@lars-t-hansen

@nbp, just so we don't mix up two separate points here: at the moment, Cranelift generates code that does not require keeping constant pools with the code. As a practical matter the code and the pool are emitted next to each other (constants following code) in a bytevector by Cranelift, but the pool can subsequently be moved a substantial distance; certainly it will be possible to move it to a page that does not have execute rights, if an embedder of Cranelift wishes to do so, and the relocation information ensures that this remains possible. Most crucially for your point, there are no fragments of constant pools in the middle of the code with jumps across them, say. I agree we should keep it that way.

Whether an embedder actually moves the rodata somewhere else is an orthogonal matter, and I think it should remain so.

@abrown
Owner Author

abrown commented Jul 18, 2019

@lars-t-hansen, @nbp, thanks for your comments; I suspect @sunfishcode, @bnjbvr, and @bjorn3 may have comments as well. Looking at the disconnected comments in the code above is not ideal, so, having had some more time to think and with @jlb6740's help, I am going to put up a plan to discuss:

Motivation

The motivation behind discussing constant pools at all is to implement SIMD's const instruction with:

MOVAPS 0x[some offset](%rip), %xmm[0-7]

This instruction moves 128 bits from a memory location to an XMM register. The Intel manual also documents an unaligned MOVUPS but I would guess MOVAPS is preferable.

Prior to this point, cranelift had no need for constant pooling of any kind since all its WASM/CLIF immediates fit as immediates in the emitted instructions. Not anymore: the MOVAPS expects the value to be in memory. It seems likely that cranelift might need to put constants of other sizes (not just the 128 bits I need) in some form of read-only data section, so I started looking into the changes required to do so (see code above).
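To make the RIP-relative addressing concrete, here is the displacement arithmetic a consumer would perform once the constant's final location is known (a sketch; the assumption that the 4-byte displacement field is the last thing before the next instruction holds for MOVAPS, which has no trailing immediate):

```rust
// Compute the rel32 displacement for a RIP-relative MOVAPS. On x86-64, RIP
// at execution time points just past the current instruction, i.e. just past
// the 4-byte displacement field when that field ends the instruction.
fn rel32(disp_offset: u32, rodata_start: u32, constant_offset: u32) -> i32 {
    let target = (rodata_start + constant_offset) as i64; // where the constant lives
    let next_ip = (disp_offset + 4) as i64;               // RIP after the displacement
    (target - next_ip) as i32
}

fn main() {
    // Example: rodata starts at byte 0x40 of the emitted buffer, the
    // displacement field starts at byte 0x10, and we want the second
    // 16-byte constant in the pool.
    assert_eq!(rel32(0x10, 0x40, 16), 0x3c);
}
```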

APIs

To clarify what I understand about cranelift (and what I don't), let me explain: cranelift emits instructions and data through a CodeSink API. So the emit_function code looks like:

for ebb in func.layout.ebbs() {
  // emit instructions to sink
}

sink.begin_jumptables();

// emit jump tables to sink

sink.begin_rodata();

// I believe here is where I should write the constants to the sink

sink.end_codegen();

How an embedder of cranelift implements begin_rodata affects where those constant bytes are written. However, in the emitted instructions we need to have an accurate offset to the constant value so the RelocSink API is used to inform the embedder of offsets that they may need to fix up. My addition to RelocSink is:

fn reloc_constant(&mut self, code_offset: CodeOffset, reloc: Reloc, constant_offset: ConstantOffset); 

where CodeOffset and ConstantOffset are equivalent to u32s. code_offset points to the instruction in the function that needs an offset fix and constant_offset is the number of bytes at which to find the constant value inside its pool. I'm not sure why/if reloc is needed but all the other methods have it. Using this, the embedder should be able to overwrite the offset of MOVAPS to the correct 4-byte relative address.
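To illustrate how an embedder might consume reloc_constant, here is a minimal recording sink (the trait below is a trimmed stand-in for Cranelift's RelocSink, with Reloc reduced to a single placeholder variant named after the PCRelRodata4 annotation that appears in the filetests later in this thread):

```rust
type CodeOffset = u32;
type ConstantOffset = u32;

// Trimmed stand-ins for Cranelift types, just enough to show the flow.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Reloc { X86PCRelRodata4 }

trait RelocSink {
    fn reloc_constant(&mut self, code_offset: CodeOffset, reloc: Reloc, constant_offset: ConstantOffset);
}

// An embedder-side sink that records each fixup for a later relocation pass
// (e.g. after moving the rodata away from the code).
#[derive(Default)]
struct RecordingSink {
    fixups: Vec<(CodeOffset, Reloc, ConstantOffset)>,
}

impl RelocSink for RecordingSink {
    fn reloc_constant(&mut self, code_offset: CodeOffset, reloc: Reloc, constant_offset: ConstantOffset) {
        self.fixups.push((code_offset, reloc, constant_offset));
    }
}

fn main() {
    let mut sink = RecordingSink::default();
    // The compiler would call this for each instruction that loads a constant.
    sink.reloc_constant(0x10, Reloc::X86PCRelRodata4, 16);
    assert_eq!(sink.fixups.len(), 1);
    assert_eq!(sink.fixups[0].2, 16);
}
```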

Pooling: Function vs Higher?

If I were to follow how jump tables are implemented in cranelift, constants would be pooled at the function level. However, it seems like it might be better to pool constants at a higher level (WASM module? cranelift context?):

  • obviously, some memory savings if functions are using the same constants
  • more straightforward read-only data: currently cranelift expects each function to write its own rodata section, but presumably embedders could prefer to have a single rodata section to maintain

I don't know exactly whether higher-level pooling would work when cranelift is used as a JIT, and I suspect that objections to a higher-level pool are on those grounds; however, perhaps there is some way around this, following other VM implementations?

Implementation

In any case, whether constants are pooled at the function level or higher, we need a mechanism for tracking the constants and their sizes internally. After having tried several things, I propose:

// in ir/entities.rs
struct Constant(u32);

// then, in ir/constant.rs
type ConstantOffset = u32;

struct ConstantPool {
  // we need some set-like properties and need to avoid changing the order so that get_offset returns a consistent offset; @sunfishcode suggested BTreeSet? 
}

impl ConstantPool {
  pub fn new() -> Self;

  /// Insert a constant value in to the pool and return a handle for referencing it; if the value already exists, return the same handle
  pub fn insert(&mut self, value: &[u8]) -> Constant;  
 
  /// Retrieve the number of bytes before the given constant; perhaps here we may need a way to align the offsets?
  pub fn get_offset(&self, constant: Constant) -> ConstantOffset;
}

Whenever we create a constant (whether from WASM or CLIF directly) we call ConstantPool::insert, and when we need to emit the deltas during encoding (as @lars-t-hansen mentioned) we use ConstantPool::get_offset. Thus, to answer one of my initial questions, this approach calculates offsets on demand, just before encoding them. It also means we store a Constant -- representing a constant of any size but itself occupying only 4 bytes -- in the InstructionData, instead of the entire constant value or a Box (as I tried in the code above).

One issue I do have with this approach is that it does not enforce a size constraint on the constant, which I would like to have for my specific case: I am only concerned with 128-bit constants for WASM's v128.const. To solve this I propose using a Uimm128 type (like the other types in ir/immediates.rs) as the formatting and operand kind for a vconst CLIF instruction; I would unwrap the Uimm128 as a slice to pass to ConstantPool::insert.

The ConstantPool approach above could internally keep track of the sizes of different constants and try to organize and align them appropriately but until something like strings are necessary the values would only be 128 bits long. Therefore, I would likely make ConstantPool very simple initially until we need something more complex.
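A minimal sketch of such a pool, using a Vec for stable insertion order plus a HashMap for deduplication (one possible internal layout, not a commitment; the real implementation might instead use the BTreeSet @sunfishcode suggested):

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct Constant(u32);

type ConstantOffset = u32;

#[derive(Default)]
struct ConstantPool {
    values: Vec<Vec<u8>>,              // insertion order fixes the offsets
    dedup: HashMap<Vec<u8>, Constant>, // same bytes -> same handle
}

impl ConstantPool {
    fn new() -> Self { Self::default() }

    /// Insert a constant value and return its handle; duplicates share one handle.
    fn insert(&mut self, value: &[u8]) -> Constant {
        if let Some(&c) = self.dedup.get(value) {
            return c;
        }
        let c = Constant(self.values.len() as u32);
        self.values.push(value.to_vec());
        self.dedup.insert(value.to_vec(), c);
        c
    }

    /// Byte offset of `constant` from the start of the pool (no alignment yet).
    fn get_offset(&self, constant: Constant) -> ConstantOffset {
        self.values[..constant.0 as usize].iter().map(|v| v.len() as u32).sum()
    }
}

fn main() {
    let mut pool = ConstantPool::new();
    let a = pool.insert(&[1; 16]);
    let b = pool.insert(&[2; 16]);
    assert_eq!(pool.insert(&[1; 16]), a); // deduplicated
    assert_eq!(pool.get_offset(a), 0);
    assert_eq!(pool.get_offset(b), 16);
}
```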

I am requesting comments on this before I proceed and still need help with the following questions:

  • should we do constant pooling at a higher level than the function?
  • should we worry about alignment at all? This would presuppose that embedders start their rodata aligned...
  • where do I need to call ConstantPool::insert? Somewhere in cranelift-reader/src/parser.rs and in cranelift-wasm/src/code_translator.rs?

impl From<u64> for Uimm128 {
    fn from(x: u64) -> Self {
        let mut buffer: [u8; 16] = [0; 16]; // zero-fill
        // insert each byte of the u64 into the buffer, low byte at index 15
        (0..8).for_each(|byte| buffer[15 - byte] = ((x >> (byte as u64 * 8)) & 0xff) as u8);

Is this little or big endian?

Owner Author

I hadn't really thought much about this until you brought it up; thanks! I had always assumed that the [u8; 16] would be big-endian. To me that made sense because if I wrote 0x01 in CLIF, I would assume that the 01 byte is the least significant and rightmost in the array: [0, 0, ..., 1]. The code above might be confusing because I am trying to fill in bytes from right to left to leave 0s in the more significant bytes. If I wrote them correctly, the Uimm128 tests at the bottom of this file should show the big-endianness.

Your comment made me start worrying that I might need to flip the order in which the bytes are emitted in the rodata section. I looked up MOVAPS and MOVUPS, and the Intel manual has these semantics:

(V)MOVUPS (128-bit store-form version)
DEST[127:0]  <--  SRC[127:0]

So it seems to me that I shouldn't fiddle with the order too much, because in Uimm128 I don't have to pack/unpack lanes into the 16 bytes; if a user of cranelift wanted to do a vconst.i32x4, I think they would need to structure their SIMD values appropriately for the ISA.
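The big-endian layout discussed above can be checked directly; the loop below reproduces the byte fill from the From<u64> impl in this review thread (as a free function, for demonstration only):

```rust
// Reproduces the byte fill from the `From<u64> for Uimm128` impl above:
// the low byte of the u64 lands at index 15, i.e. the array is big-endian.
fn fill(x: u64) -> [u8; 16] {
    let mut buffer = [0u8; 16];
    (0..8).for_each(|byte| buffer[15 - byte] = ((x >> (byte as u64 * 8)) & 0xff) as u8);
    buffer
}

fn main() {
    let mut expected = [0u8; 16];
    expected[15] = 1;
    assert_eq!(fill(0x01), expected); // [0, 0, ..., 0, 1]

    let b = fill(0x0102_0304);
    assert_eq!(&b[12..], &[1, 2, 3, 4]); // most significant byte first
}
```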

@lars-t-hansen

@abrown, as @bjorn3 said I think we should stick to per-function pools here, so that we can move the pool with the code when we need to and relocate the pool when we need to do that (as we must in Firefox). Indeed you would hook your code in after the call to begin_rodata(). Regarding alignment, we could simply assume that the embedder will move the rodata to a location where it is properly aligned (which will be possible given the relocs). Or we could change the MemoryCodeSink to properly align the sections, for example its constructor could accept an isa or a block of parameters with the alignment requirements for the sections on the platform.

@jlb6740

jlb6740 commented Jul 20, 2019

@lars-t-hansen Hi, thanks for your comments. Clearly the consensus from those who understand this better is that it makes more sense implemented at the function level and not the module level. Would it make sense/be feasible to offer an option for module-scoped constant pools (not the default) for implementation cases outside the browser? I am thinking of cases, maybe AOT, where memory is at a premium (small-form-factor or even embedded devices that could support SIMD) and any memory sharing between functions could be useful. @abrown, whatever final design we converge on, I think we want to be able to support types for potential future SIMD proposals for 256-bit and 512-bit.

ebb0:
[-, %xmm2] v0 = vconst.b8x16 0x00 ; bin: 0f 10 15 fffffff9 PCRelRodata4(0)
[-, %xmm3] v1 = vconst.b8x16 0x01 ; bin: 0f 10 1d 00000002 PCRelRodata4(16)
return

Can we add a way to check the readonly data emitted?

Owner Author

How, though? I can't figure out how to get at the binemitted rodata after the function.


I mean adding a new kind of directive to filetest.

Owner Author

Let me know what you think of af2edc9--especially the formatting. Seems like comparing long vectors could get unwieldy.

@abrown abrown force-pushed the add-extract-lane-x86-rebased branch 3 times, most recently from 1b00fa6 to 2ead474 on July 26, 2019 at 22:37
@abrown abrown force-pushed the add-extract-lane-x86-rebased branch from 2ead474 to 8bd1ff4 on August 19, 2019 at 22:45
@abrown abrown closed this Aug 20, 2019
@abrown abrown deleted the add-simd-const-x86-rebased branch August 21, 2019 16:15
abrown added a commit that referenced this pull request Sep 18, 2019
The x86 ISA has (at least) two encodings for PEXTRW:
 1. in the SSE2 opcode (66 0f c5) the XMM operand uses r/m and the GPR operand uses reg
 2. in the SSE4.1 opcode (66 0f 3a 15) the XMM operand uses reg and the GPR operand uses r/m

This changes the 16-bit x86_pextr encoding from #1 to #2 to match the other PEXTR* implementations (all #2 style).
abrown added further commits that referenced this pull request on Sep 20 (twice), Sep 25, Sep 27, and Sep 30 (three times), 2019, each with the same PEXTRW encoding change described above.
5 participants