Wasm tests: add typed-funcref test showing example of desirable optimizations. #8158
…izations.
In order to have fast IC (inline cache) chains in AOT-compiled dynamic language Wasms, it would be great if we could make the "call to a typed funcref at a constant table index" pattern fast.
This use-case was discussed at the most recent Wasmtime biweekly and @jameysharp is working on some optimizations; the intent of this PR is to provide a concrete test-case whose blessed output we can see improve over time.
In particular, the following opts are still desirable:
- With the use of non-nullable typed funcrefs, there shouldn't be a null check (there currently is, as noted by a comment in the code due to lack of type information at the right spot).
- With the use of a constant table size and a constant index to the `table.get`, we should be able to load from the table without a bounds-check or any Spectre masking.
Other further optimizations for this pattern might be possible if we rearrange the table and function-reference data structures, and the lazy-initialization scheme thereof, but the above should be agnostic to that.
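To make the two desired optimizations concrete, here is a minimal Rust model of the checks involved. It is a sketch under stated assumptions, not Wasmtime or Cranelift code: `IcStub`, `Trap`, `double`, `call_dynamic`, and `call_constant` are hypothetical names, the stub signature is invented, and Spectre masking and lazy initialization are not modeled.

```rust
/// Hypothetical IC stub signature; the name and arity are illustrative only.
type IcStub = fn(i32) -> i32;

/// Example stub used to populate the tables in this sketch.
fn double(x: i32) -> i32 {
    x * 2
}

/// Ways a call through the table can trap in this model.
#[derive(Debug, PartialEq)]
enum Trap {
    OutOfBounds,
    NullReference,
}

/// General path: the index is only known at run time and slots are nullable,
/// so both a bounds check and a null check are executed on every call.
fn call_dynamic(table: &[Option<IcStub>], index: usize, arg: i32) -> Result<i32, Trap> {
    let slot = *table.get(index).ok_or(Trap::OutOfBounds)?; // bounds check
    let func = slot.ok_or(Trap::NullReference)?; // null check
    Ok(func(arg))
}

/// Desired path: the element type has no null value, and both the table size
/// and the index are compile-time constants. The comparison `INDEX < SIZE`
/// involves only constants, so an optimizer can fold it away and leave a
/// plain load followed by a call.
fn call_constant<const SIZE: usize, const INDEX: usize>(table: &[IcStub; SIZE], arg: i32) -> i32 {
    table[INDEX](arg)
}
```

For example, `call_constant::<3, 1>(&table, 21)` on a three-element table of `double` needs no `Result`, because neither trap can occur in this model.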
jameysharp
left a comment
"Chris," I whine to myself, "stop working on a Saturday evening," as I approve your PR at 10pm
Thanks for this! It's definitely the kind of thing I was hoping for.
This example module initializes element zero of the table to point to a function, then calls through elements one and two. What is that supposed to do? Or am I misreading it?
Oh this syntax is actually initializing the entire table with
Somewhat orthogonal question as well: I forget if I asked this in the past, but is the reason to use a typed table instead of typed globals to get the lazy initialization? If I change the above test to use two typed function globals it ends up codegenning (today)
which is pretty good given that we've got #5291 in our pocket
Thanks Alex, now I understand! I'd forgotten about #5291; we just added a different use of memflags to indicate trap code so we have precedent now. But besides that, if I understood Chris correctly, this example is supposed to be typed as a non-nullable function reference. So if we thread that type information through correctly then we should be able to elide the null-check even without #5291, right?
Indeed, this initializes everything then calls ICs 1 and 2, which were meant to be arbitrary -- I'll add some comments to make it clearer!
not quite; I realized once I progressed past "IC caller" logic to "IC update" logic that while AOT-compiling can produce bodies that have statically different code locations per IC head, the update logic is shared (polymorphic over IC index) and there's no "set global N to V" instruction. I could potentially finagle the IC-stub ABI to return a new funcref and always update in the statically-unique callsite sequence or something, but that's a lot of overhead if these are frequent; IMHO this is what tables are made for :-)
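The difference between updating a global and updating a table slot can be sketched in Rust as follows. This is an illustrative model only: `IcGlobals`, `update_table`, and the stub functions are hypothetical names, and two fields stand in for what would be one global per IC site. It relies on the fact that Wasm's `global.set` names its target as an immediate, while `table.set` takes the element index as a runtime operand.

```rust
/// Hypothetical IC stub signature, for illustration only.
type IcStub = fn(i32) -> i32;

fn fallback(x: i32) -> i32 {
    x
}

fn specialized(x: i32) -> i32 {
    x + 100
}

/// Globals model: one separately named slot per IC site. Because the target
/// of a global store is fixed in the instruction, a shared updater has to
/// branch to a distinct store for every site.
struct IcGlobals {
    site0: IcStub,
    site1: IcStub,
}

impl IcGlobals {
    fn update(&mut self, site: usize, stub: IcStub) {
        match site {
            0 => self.site0 = stub,
            1 => self.site1 = stub,
            _ => panic!("no global for IC site {}", site),
        }
    }
}

/// Table model: the slot index is an ordinary runtime value, so one store
/// serves every IC site and the updater stays polymorphic over the index.
fn update_table(table: &mut [IcStub], site: usize, stub: IcStub) {
    table[site] = stub;
}
```

In the globals model the `match` grows with the number of IC sites, whereas `update_table` stays a single indexed store no matter how many sites exist.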
Ah so the
True! Thinking a bit more on this, if the
This has inspired me to go off and do #8159 to handle the null check
Right; the hope/aspiration is that that constant index can trickle through a carefully-placed series of optimizations so this turns into something closer to the good codegen we see with globals. (Two other points that came to mind later re: use of tables: scalability is another factor -- the max of 100k (?) globals is a real limit if we use one for every IC site in a large program; and also, your point about lazy init, wherein we don't lazy-init globals and that'd be a big regression of instantiation latency on said ~100k-IC programs. The flipside is that we have the lazy-init dynamic checks on table load...)
Ah! I hadn't realized that
Thinking a bit more about lazy init: the overall behavior we want is that for a very large program with many ICs, we have fast instantiation and, once ICs are warmed up, fast IC invocation with as few dynamic checks as possible. IC sites always start as linked to the "fallback IC"; my planned use for this "fast IC head" mechanism was to have a conditional at every IC site testing whether the IC-stub struct is a fallback IC, invoking a traditional (C++) function pointer if so, and then (hitting a weval intrinsic that leads to) doing this typed-funcref thing if not. The upshot of that is that we always have some slowpath action (attaching an IC) before we do the typed funcref invocation. I guess what I'm getting at is: could we avoid the lazy-init checks if we had a nullable typed funcref table, with null as the default value? Then we do the usual init-the-anyfunc-before-you-take-its-address thing at the
…e table.
This is based on discussion in bytecodealliance#8158:
- We can use `call_indirect` rather than `table.get` + `call_ref`, even on typed funcrefs. TIL; updated the test!
- As noted in bytecodealliance#8160, if we use a nullable typed funcref table instead (and given that we know we'll initialize a particular slot before use on the application side, so we won't actually call a null ref), and if we have a null-ref default value, we should be able to avoid the lazy table-init mechanism entirely. (Ignore the part where this module doesn't actually have any update logic that would set non-null refs anywhere; it's a compile-test, not a runtest!)
Once bytecodealliance#8159 is merged and bytecodealliance#8160 is implemented, we should see zero branches in this test.
This is based on discussion in bytecodealliance#8158: as noted in bytecodealliance#8160, if we use a nullable typed funcref table instead (and given that we know we'll initialize a particular slot before use on the application side, so we won't actually call a null ref), and if we have a null-ref default value, we should be able to avoid the lazy table-init mechanism entirely. (Ignore the part where this module doesn't actually have any update logic that would set non-null refs anywhere; it's a compile-test, not a runtest!) Once bytecodealliance#8159 is merged and bytecodealliance#8160 is implemented, we should see zero branches in this test.
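The nullable-table scheme described in this thread can be modeled roughly as below. This is a hedged sketch, not the actual engine or runtime code: `IcState`, `attach`, `invoke`, and the stub functions are hypothetical, and a `Vec<bool>` stands in for the "is this the fallback IC" test on the IC-stub struct. The point it illustrates is that every slot starts out null, and the slow path always fills a slot before the fast path reads it.

```rust
/// Hypothetical IC stub signature, for illustration only.
type IcStub = fn(i32) -> i32;

/// Stand-in for the traditional fallback path.
fn fallback_ic(arg: i32) -> i32 {
    arg
}

/// Stand-in for a specialized stub attached once the IC warms up.
fn attached_stub(arg: i32) -> i32 {
    arg * 2
}

struct IcState {
    /// Nullable typed-funcref table: every slot defaults to null (`None`),
    /// so no per-slot function reference is built at instantiation.
    table: Vec<Option<IcStub>>,
    /// Per-site flag modeling the fallback test at each IC site.
    is_fallback: Vec<bool>,
}

impl IcState {
    fn new(sites: usize) -> Self {
        IcState {
            table: vec![None; sites],
            is_fallback: vec![true; sites],
        }
    }

    /// Slow path: attach a stub. This always runs before the fast path
    /// for the same site, which is the invariant the scheme relies on.
    fn attach(&mut self, site: usize, stub: IcStub) {
        self.table[site] = Some(stub);
        self.is_fallback[site] = false;
    }

    /// IC site: take the fallback while nothing is attached, otherwise call
    /// through the table. A null slot here would correspond to a trap.
    fn invoke(&self, site: usize, arg: i32) -> i32 {
        if self.is_fallback[site] {
            return fallback_ic(arg);
        }
        let stub = self.table[site].expect("IC slot called before attach");
        stub(arg)
    }
}
```

Creating `IcState::new(4)` touches no stubs at all; only sites that are actually attached ever hold a non-null reference.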
Makes sense! It's always possible we can use the same optimization techniques on globals as well as tables, nothing saying we have to keep everything as-is for example. If tables work then there's no need to change though.
I like this idea!