
Discuss implementing constants for SIMD#2

Closed
abrown wants to merge 28 commits into add-extract-lane-x86-rebased from add-simd-const-x86-rebased

Conversation

Owner

@abrown abrown commented Jul 17, 2019

This is still a WIP and not intended for merging. Some high-level questions I hope to answer:

  • should we be pooling constants at the function level or at the program level?
  • should we be using constants of specific size (e.g. see Uimm128) or generic size (e.g. see Constant) or both (converting as appropriate)?
  • when should we calculate the constant offsets: when translating to CLIF? in a separate pass?

abrown added 16 commits July 16, 2019 17:41
This refactor moves the calculation of the number of lanes closer to where the Instruction/BoundInstruction is bound. If higher (or different) bit widths are needed in the future, this may need to be made explicit again, but for now, assuming 128-bit vectors, it removes an opportunity for programmer error in the SIMD code.
In talking to @sunfishcode, I learned he preferred to avoid the confusion of more ISA predicates by eliminating SSE2. SSE2 was released with the Pentium 4 in 2000, so it is unlikely that a current CPU would have SIMD enabled but lack this feature. I tried to note the SSE2-specific instructions with comments in the code.
…ta arms

InstructionData match arms for immediates look like `InstructionData::SomeType { opcode, imm }`; when `imm` is a boxed value, Rust will error, complaining that `imm` has been moved because the boxed type does not implement the Copy trait. This change matches against `ref imm` (e.g.) and expects the match body to use `imm` accordingly.
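The move-vs-borrow issue described above can be sketched with a stand-in type (the enum below is a simplified stand-in, not the real Cranelift InstructionData):

```rust
// Simplified stand-in for Cranelift's InstructionData; the real enum has many
// more variants. The point is the boxed immediate, which is not Copy.
enum InstructionData {
    UnaryImm128 { opcode: &'static str, imm: Box<[u8; 16]> },
}

fn first_byte(inst: &InstructionData) -> u8 {
    match *inst {
        // `ref imm` borrows the box instead of moving it out of `inst`;
        // a plain `imm` binding here would fail to compile for a non-Copy value.
        InstructionData::UnaryImm128 { opcode: _, ref imm } => imm[0],
    }
}

fn main() {
    let inst = InstructionData::UnaryImm128 { opcode: "vconst", imm: Box::new([7; 16]) };
    assert_eq!(first_byte(&inst), 7);
}
```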
Need to decide whether to do this here--currently impossible due to BTreeSet--or in Function::create_constant
@abrown abrown changed the title Add simd const x86 rebased Discuss implementing constants for SIMD Jul 17, 2019
@lars-t-hansen

I can only echo @sunfishcode here -- keep the constant pool with the function and emit a value so that relocation can be done predictably. I believe you are right, though, that we should be emitting the initial delta, as for jump tables, not zero -- there should be a need to fix up only if the consumer moves the rodata relative to the code, as SpiderMonkey does to make room for its own epilogue code that is pasted in late. When I recently made it possible to move the jump table in this way, it seemed like it should generalize in a straightforward way to constants.

@nbp

nbp commented Jul 18, 2019

(replying to the IRC comment about constant pools)

From a security standpoint, I am for banning constant pools in the text section and moving all user-provided constants into a rodata section with no execute permission. As Cranelift is meant to be used for generating code from untrusted sources, having any predictable sequence of raw bytes mapped as executable is a huge security issue.

Instead of having program constants allocated in a constant pool, I am in favor of having only compiler-generated (artifact) constants encoded in constant pools, including the pointers to the rodata where untrusted constants would be stored, without the need for constant blinding.

Thus SpiderMonkey or other embedders would just provide a way to map this rodata section when generating the code.

@lars-t-hansen

@nbp, just so we don't mix up two separate points here: at the moment, Cranelift generates code that does not require keeping constant pools with the code. As a practical matter the code and the pool are emitted next to each other (constants following code) in a bytevector by Cranelift, but the pool can subsequently be moved a substantial distance; certainly it will be possible to move it to a page that does not have execute rights, if an embedder of Cranelift wishes to do so, and the relocation information ensures that this remains possible. Most crucially for your point, there are no fragments of constant pools in the middle of the code with jumps across them, say. I agree we should keep it that way.

Whether an embedder actually moves the rodata somewhere else is an orthogonal matter, and I think it should remain so.

@abrown
Owner Author

abrown commented Jul 18, 2019

@lars-t-hansen, @nbp, thanks for your comments; I suspect @sunfishcode, @bnjbvr, and @bjorn3 may have comments as well. Looking at the disconnected comments in the code above is not ideal, so, having had some more time to think and with @jlb6740's help, I am going to put up a plan to discuss:

Motivation

The motivation behind discussing constant pools at all is to implement SIMD's const instruction with:

MOVAPS 0x[some offset](%rip), %xmm[0-7]

This instruction moves 128 bits from a memory location to an XMM register. The Intel manual also documents an unaligned MOVUPS but I would guess MOVAPS is preferable.

Prior to this point, cranelift had no need for constant pooling of any kind since all its WASM/CLIF immediates fit as immediates in the emitted instructions. Not anymore: the MOVAPS expects the value to be in memory. It seems likely that cranelift might need to put constants of other sizes (not just the 128 bits I need) in some form of read-only data section, so I started looking into the changes required to do so (see code above).
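To make the RIP-relative addressing concrete, here is the displacement arithmetic a consumer would perform once the constant's final location is known (a sketch; the assumption that the 4-byte displacement field is the last thing before the next instruction holds for MOVAPS, which has no trailing immediate):

```rust
// Compute the rel32 displacement for a RIP-relative MOVAPS. On x86-64, RIP
// at execution time points just past the current instruction, i.e. just past
// the 4-byte displacement field when that field ends the instruction.
fn rel32(disp_offset: u32, rodata_start: u32, constant_offset: u32) -> i32 {
    let target = (rodata_start + constant_offset) as i64; // where the constant lives
    let next_ip = (disp_offset + 4) as i64;               // RIP after the displacement
    (target - next_ip) as i32
}

fn main() {
    // Example: rodata starts at byte 0x40 of the emitted buffer, the
    // displacement field starts at byte 0x10, and we want the second
    // 16-byte constant in the pool.
    assert_eq!(rel32(0x10, 0x40, 16), 0x3c);
}
```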

APIs

To clarify what I understand about cranelift (and what I don't), let me explain: cranelift emits instructions and data through a CodeSink API. So the emit_function code looks like:

for ebb in func.layout.ebbs() {
  // emit instructions to sink
}

sink.begin_jumptables();

// emit jump tables to sink

sink.begin_rodata();

// I believe here is where I should write the constants to the sink

sink.end_codegen();

How an embedder of cranelift implements begin_rodata affects where those constant bytes are written. However, in the emitted instructions we need to have an accurate offset to the constant value so the RelocSink API is used to inform the embedder of offsets that they may need to fix up. My addition to RelocSink is:

fn reloc_constant(&mut self, code_offset: CodeOffset, reloc: Reloc, constant_offset: ConstantOffset); 

where CodeOffset and ConstantOffset are equivalent to u32s. code_offset points to the instruction in the function that needs an offset fix and constant_offset is the number of bytes at which to find the constant value inside its pool. I'm not sure why/if reloc is needed but all the other methods have it. Using this, the embedder should be able to overwrite the offset of MOVAPS to the correct 4-byte relative address.
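To illustrate how an embedder might consume reloc_constant, here is a minimal recording sink (the trait below is a trimmed stand-in for Cranelift's RelocSink, with Reloc reduced to a single placeholder variant named after the PCRelRodata4 annotation that appears in the filetests later in this thread):

```rust
type CodeOffset = u32;
type ConstantOffset = u32;

// Trimmed stand-ins for Cranelift types, just enough to show the flow.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Reloc { X86PCRelRodata4 }

trait RelocSink {
    fn reloc_constant(&mut self, code_offset: CodeOffset, reloc: Reloc, constant_offset: ConstantOffset);
}

// An embedder-side sink that records each fixup for a later relocation pass
// (e.g. after moving the rodata away from the code).
#[derive(Default)]
struct RecordingSink {
    fixups: Vec<(CodeOffset, Reloc, ConstantOffset)>,
}

impl RelocSink for RecordingSink {
    fn reloc_constant(&mut self, code_offset: CodeOffset, reloc: Reloc, constant_offset: ConstantOffset) {
        self.fixups.push((code_offset, reloc, constant_offset));
    }
}

fn main() {
    let mut sink = RecordingSink::default();
    // The compiler would call this for each instruction that loads a constant.
    sink.reloc_constant(0x10, Reloc::X86PCRelRodata4, 16);
    assert_eq!(sink.fixups.len(), 1);
    assert_eq!(sink.fixups[0].2, 16);
}
```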

Pooling: Function vs Higher?

If I were to follow how jump tables are implemented in cranelift, constants would be pooled at the function level. However, it seems like it might be better to pool constants at a higher level (WASM module? cranelift context?):

  • obviously, some memory savings if functions are using the same constants
  • more straightforward read-only data: currently cranelift expects each function to write its own rodata section, but presumably embedders could prefer to have a single rodata section to maintain

I don't know exactly whether higher-level pooling would work when cranelift is used as a JIT, and I suspect that objections to a higher-level pool are on those grounds; however, perhaps there is some way around this, following other VM implementations?

Implementation

In any case, whether constants are pooled at the function level or higher, we need a mechanism for tracking the constants and their sizes internally. After having tried several things, I propose:

// in ir/entities.rs
struct Constant(u32);

// then, in ir/constant.rs
type ConstantOffset = u32;

struct ConstantPool {
  // we need some set-like properties and need to avoid changing the order so that get_offset returns a consistent offset; @sunfishcode suggested BTreeSet? 
}

impl ConstantPool {
  pub fn new() -> Self;

  /// Insert a constant value in to the pool and return a handle for referencing it; if the value already exists, return the same handle
  pub fn insert(&mut self, value: &[u8]) -> Constant;  
 
  /// Retrieve the number of bytes before the given constant; perhaps here we may need a way to align the offsets?
  pub fn get_offset(&self, constant: Constant) -> ConstantOffset;
}

Whenever we create a constant (whether from WASM or CLIF directly) we call ConstantPool::insert, and when we need to emit the deltas during encoding (as @lars-t-hansen mentioned) we use ConstantPool::get_offset. Thus, to answer one of my initial questions, this approach calculates offsets on demand, just before encoding them. It also means we store a Constant -- representing a constant of any size but itself occupying only 4 bytes -- in the InstructionData, instead of the entire constant value or a Box (as I tried in the code above).

One issue I do have with this approach is that it does not enforce a size constraint on the constant, which I would like to have for my specific case: I am only concerned with 128-bit constants for WASM's v128.const. To solve this I propose using a Uimm128 type (like the other types in ir/immediates.rs) as the formatting and operand kind for a vconst CLIF instruction; I would unwrap the Uimm128 as a slice to pass to ConstantPool::insert.

The ConstantPool approach above could internally keep track of the sizes of different constants and try to organize and align them appropriately but until something like strings are necessary the values would only be 128 bits long. Therefore, I would likely make ConstantPool very simple initially until we need something more complex.
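A minimal sketch of such a pool, using a Vec for stable insertion order plus a HashMap for deduplication (one possible internal layout, not a commitment; the real implementation might instead use the BTreeSet @sunfishcode suggested):

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct Constant(u32);

type ConstantOffset = u32;

#[derive(Default)]
struct ConstantPool {
    values: Vec<Vec<u8>>,              // insertion order fixes the offsets
    dedup: HashMap<Vec<u8>, Constant>, // same bytes -> same handle
}

impl ConstantPool {
    fn new() -> Self { Self::default() }

    /// Insert a constant value and return its handle; duplicates share one handle.
    fn insert(&mut self, value: &[u8]) -> Constant {
        if let Some(&c) = self.dedup.get(value) {
            return c;
        }
        let c = Constant(self.values.len() as u32);
        self.values.push(value.to_vec());
        self.dedup.insert(value.to_vec(), c);
        c
    }

    /// Byte offset of `constant` from the start of the pool (no alignment yet).
    fn get_offset(&self, constant: Constant) -> ConstantOffset {
        self.values[..constant.0 as usize].iter().map(|v| v.len() as u32).sum()
    }
}

fn main() {
    let mut pool = ConstantPool::new();
    let a = pool.insert(&[1; 16]);
    let b = pool.insert(&[2; 16]);
    assert_eq!(pool.insert(&[1; 16]), a); // deduplicated
    assert_eq!(pool.get_offset(a), 0);
    assert_eq!(pool.get_offset(b), 16);
}
```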

I am requesting comments on this before I proceed and still need help with the following questions:

  • should we do constant pooling at a higher level than the function?
  • should we worry about alignment at all? This would presuppose that embedders start their rodata aligned...
  • where do I need to call ConstantPool::insert? Somewhere in cranelift-reader/src/parser.rs and in cranelift-wasm/src/code_translator.rs?

impl From<u64> for Uimm128 {
    fn from(x: u64) -> Self {
        let mut buffer: [u8; 16] = [0; 16]; // zero-fill
        // insert each byte of the u64 into the buffer, low byte at index 15
        (0..8).for_each(|byte| buffer[15 - byte] = ((x >> (byte as u64 * 8)) & 0xff) as u8);

Is this little or big endian?

Owner Author

I hadn't really thought much about this until you brought it up; thanks! I had always assumed that the [u8; 16] would be big-endian. To me that made sense because if I wrote 0x01 in CLIF, I would assume that the 01 byte is the least significant and rightmost in the array: [0, 0, ..., 1]. The code above might be confusing because I am trying to fill in bytes from right to left to leave 0s in the more significant bytes. If I wrote them correctly, the Uimm128 tests at the bottom of this file should show the big-endianness.

Your comment made me start worrying that I might need to flip the order in which the bytes are emitted in the rodata section. I looked up MOVAPS and MOVUPS, and the Intel manual has these semantics:

(V)MOVUPS (128-bit store-form version)
DEST[127:0]  <--  SRC[127:0]

So it seems to me that I shouldn't fiddle with the order too much, because in Uimm128 I don't have to pack/unpack lanes into the 16 bytes; if a user of cranelift wanted to do a vconst.i32x4, I think they would need to structure their SIMD values appropriately for the ISA.
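The big-endian layout discussed above can be checked directly; the loop below reproduces the byte fill from the From<u64> impl in this review thread (as a free function, for demonstration only):

```rust
// Reproduces the byte fill from the `From<u64> for Uimm128` impl above:
// the low byte of the u64 lands at index 15, i.e. the array is big-endian.
fn fill(x: u64) -> [u8; 16] {
    let mut buffer = [0u8; 16];
    (0..8).for_each(|byte| buffer[15 - byte] = ((x >> (byte as u64 * 8)) & 0xff) as u8);
    buffer
}

fn main() {
    let mut expected = [0u8; 16];
    expected[15] = 1;
    assert_eq!(fill(0x01), expected); // [0, 0, ..., 0, 1]

    let b = fill(0x0102_0304);
    assert_eq!(&b[12..], &[1, 2, 3, 4]); // most significant byte first
}
```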

@lars-t-hansen

@abrown, as @bjorn3 said I think we should stick to per-function pools here, so that we can move the pool with the code when we need to and relocate the pool when we need to do that (as we must in Firefox). Indeed you would hook your code in after the call to begin_rodata(). Regarding alignment, we could simply assume that the embedder will move the rodata to a location where it is properly aligned (which will be possible given the relocs). Or we could change the MemoryCodeSink to properly align the sections, for example its constructor could accept an isa or a block of parameters with the alignment requirements for the sections on the platform.

@jlb6740

jlb6740 commented Jul 20, 2019

@lars-t-hansen Hi, thanks for your comments. Clearly the consensus from those who understand this better is that it makes more sense implemented at the function level and not the module level. Would it make sense/be feasible to offer an option for module-scoped constant pools (not the default) for implementation cases outside the browser? I am thinking of cases, maybe AOT, where memory is at a premium (small-form-factor or even embedded devices that could support SIMD) and any memory sharing between functions could be useful. @abrown, whatever final design we converge on, I think we want to be able to support types for potential future SIMD proposals for 256-bit and 512-bit.

ebb0:
[-, %xmm2] v0 = vconst.b8x16 0x00 ; bin: 0f 10 15 fffffff9 PCRelRodata4(0)
[-, %xmm3] v1 = vconst.b8x16 0x01 ; bin: 0f 10 1d 00000002 PCRelRodata4(16)
return

Can we add a way to check the readonly data emitted?

Owner Author

How, though? I can't figure out how to get at the binemitted rodata after the function.


I mean adding a new kind of directive to filetest.

Owner Author

Let me know what you think of af2edc9--especially the formatting. Seems like comparing long vectors could get unwieldy.

@abrown abrown force-pushed the add-extract-lane-x86-rebased branch 3 times, most recently from 1b00fa6 to 2ead474 on July 26, 2019 at 22:37
@abrown abrown force-pushed the add-extract-lane-x86-rebased branch from 2ead474 to 8bd1ff4 on August 19, 2019 at 22:45
@abrown abrown closed this Aug 20, 2019
@abrown abrown deleted the add-simd-const-x86-rebased branch August 21, 2019 16:15
abrown added a commit that referenced this pull request Sep 18, 2019
The x86 ISA has (at least) two encodings for PEXTRW:
 1. in the SSE2 opcode (66 0f c5) the XMM operand uses r/m and the GPR operand uses reg
 2. in the SSE4.1 opcode (66 0f 3a 15) the XMM operand uses reg and the GPR operand uses r/m

This changes the 16-bit x86_pextr encoding from #1 to #2 to match the other PEXTR* implementations (all #2 style).
abrown added further commits that referenced this pull request on Sep 20 (twice), Sep 25, Sep 27, and Sep 30 (three times), 2019, each with the same PEXTRW encoding change described above.
5 participants