Switch to direct-mapped indirect-call cache. #8870
cfallin wants to merge 3 commits into bytecodealliance:main
Conversation
Currently, the indirect-call cache operates on a basis of *one slot per
callsite*: each `call_indirect` instruction in a module, up to a limit,
has its own slot of storage in the VM context struct that caches a
called-table-index to called-raw-function-pointer mapping.
This is fine but means the storage requirement scales with the module
size; hence "up to a limit" above. It also means that each callsite
needs to "warm up" separately, whereas we could in theory reuse the
resolved index->code pointer mapping for the same index across
callsites.
This PR switches instead to a "direct-mapped cache": that is, we have a
fixed number of cache slots per table per instance, of user-configurable
count, and we look in a slot selected by the called table index (modulo
the cache size). As before, if the "tag" (cache key, called table index)
matches, we use the "value" (raw code pointer).
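As a rough illustration of the scheme (names and layout here are invented for the sketch, not Wasmtime's actual internals), the per-table lookup amounts to a classic direct-mapped cache:

```rust
// Sketch of a direct-mapped indirect-call cache: one tag array and one
// value array per table, with the slot chosen by callee index modulo the
// cache size. Illustrative only; not Wasmtime's actual data structures.
struct IndirectCallCache {
    tags: Vec<u32>,      // cached callee table indices (u32::MAX = empty sentinel)
    values: Vec<usize>,  // cached raw code pointers
}

impl IndirectCallCache {
    fn new(size: usize) -> Self {
        IndirectCallCache {
            tags: vec![u32::MAX; size],
            values: vec![0; size],
        }
    }

    /// A hit requires the tag (cached callee index) to match exactly.
    fn lookup(&self, callee_index: u32) -> Option<usize> {
        let slot = callee_index as usize % self.tags.len();
        if self.tags[slot] == callee_index {
            Some(self.values[slot])
        } else {
            None
        }
    }

    /// On a miss, the slow path resolves the callee and fills the slot,
    /// evicting whatever previously aliased there.
    fn fill(&mut self, callee_index: u32, code_ptr: usize) {
        let slot = callee_index as usize % self.tags.len();
        self.tags[slot] = callee_index;
        self.values[slot] = code_ptr;
    }
}
```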
The main advantage of this scheme, and my motivation for making the
switch, is that the storage size is fixed and quite small, even for
arbitrarily-large modules: for example, on a 64-bit platform with
12-byte slots(*) (4-byte key, 8-byte resolved pointer), for a module with
one funcptr table, a 1K-entry cache uses 12KiB per instance. That's much
smaller than the total VMFuncRef array size in large modules and should
be no problem. My goal in getting to this constant-size footprint is that
turning this on by default eventually will be easier to justify, and
that we won't have unexpected perf cliffs for callsites beyond a certain
index.
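As a quick check of the storage arithmetic in the example above (a sketch, not Wasmtime code):

```rust
// Per-table cache storage: each logical slot is a 4-byte u32 tag plus an
// 8-byte code pointer (64-bit platform), stored in separate arrays.
fn cache_bytes(slots: usize) -> usize {
    let tag_bytes = 4;   // u32 cache key (called table index)
    let value_bytes = 8; // raw code pointer
    slots * (tag_bytes + value_bytes)
}
```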
This also means that if one callsite resolves index 23 to some raw code
pointer, other callsites that call index 23 also receive a "hit" from
that warmup. This could be beneficial when there are many callsites but
a relatively smaller pool of called functions (e.g., ICs).
The downside to caching indexed on callee rather than callsite is that
if there are a large number of callees, we can expect more cache
conflicts and hence misses. (If funcref table indices 1 and 1025 are
both frequently called, a 1024-entry direct-mapped cache will thrash.)
But I expect with ICs in particular to have a lot of callsites and
relatively few (shared) callees.
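The aliasing in the thrashing example above is just the modulo slot selection (illustrative helper, not Wasmtime code):

```rust
// Direct-mapped slot selection: indices that are congruent modulo the
// cache size compete for the same slot.
fn slot(callee_index: u32, cache_slots: u32) -> u32 {
    callee_index % cache_slots
}
```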
On Octane-compiled-to-Wasm with my JS AOT compilation tooling using
`call_indirect` for all ICs, I see: baseline score (higher is better,
proportional to runtime speed) of 2406, score with old
one-entry-per-callsite scheme of 2479, score with this scheme of
2509. So it's slightly faster as well, probably due to a combination of
the warmup benefit and a smaller cache footprint, even with the more
involved logic to compute the slot address. (This also tells me the
benefit of this cache is smaller than I had originally measured on a
microbenchmark (20%) -- at 5% on all of Octane -- but that's still worth
it, IMHO.)
(*) Note that slots are not actually contiguous: I did a
struct-of-arrays trick, separating cache tags from cache values, so
that the assembly lowering can use scaling amodes (`vmctx + offset +
4*idx` for u32 accesses, and `8*idx` for u64 accesses) for more
efficient code.
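The struct-of-arrays address computation can be sketched as follows; the offsets and helper names are hypothetical, not Wasmtime's actual VMContext layout:

```rust
// Separating tags from values keeps each access a simple
// base + offset + scale*index addressing mode.
fn tag_addr(vmctx: usize, tags_offset: usize, slot: usize) -> usize {
    vmctx + tags_offset + 4 * slot // u32 tags: 4-byte scale
}

fn value_addr(vmctx: usize, values_offset: usize, slot: usize) -> usize {
    vmctx + values_offset + 8 * slot // u64 code pointers: 8-byte scale
}
```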
alexcrichton
left a comment
Before I dive too deep into this, I don't think that what's implemented is sufficient. For example this code:

```wat
(module
  (func (export "_start")
    i32.const 0
    call_indirect
    i32.const 0
    call_indirect (result i32)
    drop
  )
  (func $a)
  (table 10 10 funcref)
  (elem (offset (i32.const 1)) func $a)
)
```

runs as:
```
$ cargo run -q run -O cache-call-indirects=n foo.wat
Error: failed to run main module `foo.wat`

Caused by:
    0: failed to invoke command default
    1: error while executing at wasm backtrace:
           0: 0x3a - <unknown>!<wasm function 0>
    2: wasm trap: uninitialized element

$ cargo run -q run -O cache-call-indirects=y foo.wat
zsh: segmentation fault  cargo run -q run -O cache-call-indirects=y foo.wat
```
Notably, I think that the cache key can't just be the index but must also include the type used as well? I'm not sure how much the load of the `VMSharedTypeIndex` will affect perf, so I figure it's probably best to figure this out first before diving too deep into review.
Er, I actually flubbed the test a bit and discovered a bug I didn't mean to discover. This is what I intended:

```wat
(module
  (func (export "_start")
    i32.const 1 ;; not 0 unlike above
    call_indirect
    i32.const 1 ;; not 0 unlike above
    call_indirect (result i32)
    drop
  )
  (func $a)
  (table 10 10 funcref)
  (elem (offset (i32.const 1)) func $a)
)
```

which runs successfully with

Out of curiosity, I also wondered, had I not caught this, whether fuzzing would catch it. Forcing differential execution of Wasmtime-against-Wasmtime and module generation with
Ah, could you say more about why the type is necessary? In my (possibly naive) view, we are choosing the cache slot to look at in a different way, but nothing else should change. (Obviously there is some breakage here which I'll take a look at -- team meeting the next two days, so a slight delay.)
The basic tl;dr, though, is that this PR as-is violates Wasm semantics and isn't memory-safe (insofar as you can successfully call a function with the wrong signature).
Ah, I completely missed the dynamic signature check -- will have to incorporate that, probably by baking it into the cache key somehow. (I don't think we'll need a load on the hot path at all.) Solution TBD, thanks!
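One hypothetical shape for such a combined key (purely illustrative; these field names are not Wasmtime's) would make a hit require both the callee index and the signature to match:

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
struct CacheTag {
    callee_index: u32, // called table index
    sig_id: u32,       // stands in for a VMSharedTypeIndex-style type id
}

// A signature mismatch at the same index is treated as a cache miss and
// falls back to the slow path, which performs the full dynamic check.
fn is_hit(entry: CacheTag, callee_index: u32, expected_sig: u32) -> bool {
    entry == CacheTag { callee_index, sig_id: expected_sig }
}
```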
fitzgen
left a comment
Overall, modulo Alex's point about types, this looks great! I like this a lot better than having a cliff where call sites beyond the cliff don't get caching.
In addition to the point Alex made about types, I think it is worth thinking about (and testing!) the case where we have two tables, both sharing the same cache entries, but where one is a table of untyped `funcref`s and the other is a table of `(ref $some_specific_func_type)`.
In general, I think it is also worth adding a few combinatorial-y unit tests where we have `cache_size_log2 == 0` (i.e., a single, shared cache entry) and we hammer on that cache entry with a variety of different calls from different sites with the same and different expected types and via different tables with different types of funcrefs.
```diff
     .generate_address_map(self.wasmtime.generate_address_map)
     .cache_call_indirects(self.wasmtime.cache_call_indirects)
-    .max_call_indirect_cache_slots(self.wasmtime.max_call_indirect_cache_slots);
+    .call_indirect_cache_size_log2(self.wasmtime.call_indirect_cache_size_log2);
```
Might want to clamp this value to a reasonable range?
Ah, it's already clamped in the method on `Config`, so I preferred to let fuzzing maximally poke the API surface here (no special limits). I guess we could do something that avoids biasing the randomness (as-is, the `Arbitrary` range on the `u8` maps 0..=255 to 0..=16, with 15/16ths of the space going to 16), but I don't think it's too important to evenly spread effort across sizes that way...
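The bias described above can be sketched as a simple clamp (illustrative; not the actual `Config` clamping code):

```rust
// An arbitrary u8 in 0..=255 clamped to a log2 cache size in 0..=16:
// the 240 inputs in 16..=255 all land on 16, i.e. 15/16 of the space.
fn clamp_log2(raw: u8) -> u8 {
    raw.min(16)
}
```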
Excellent finds on both -- fixes pushed to (i) tag cache entries with the signature ID as well, and (ii) check for a null code pointer (the feature on
On testing the newest version of this PR, which correctly keeps a cache tag for the signature ID and also correctly checks for null(*), I'm now seeing a 2.5% slowdown on Octane in AOT JS relative to no caching at all. So it seems this has crossed the instruction-count tradeoff threshold (the benchmarks have quite high IPC, so it's about the bulk of the instruction stream, not cache misses). Given the smaller delta I'm seeing now with indirect-call caching in full benchmarks, vs. the microbenchmark I was using to guide me at the time, I think I will actually opt to remove indirect-call caching altogether: the null handling in

(*) I had previously incorrectly remembered the SIGSEGV-for-call-to-null-funcref mechanism as working by catching a call to address 0; actually it works by catching a SIGSEGV on a load from the
In the original development of this feature, guided by JS AOT compilation to Wasm of a microbenchmark heavily focused on IC sites, I was seeing a ~20% speedup. However, in more recent measurements, on full programs (e.g., the Octane benchmark suite), the benefit is more like 5%.

Moreover, in bytecodealliance#8870, I attempted to switch over to a direct-mapped cache, to address a current shortcoming of the design, namely that it has a hard-capped number of callsites it can apply to (50k) to limit impact on VMContext struct size. With all of the needed checks for correctness, though, that change results in a 2.5% slowdown relative to no caching at all, so it was dropped.

In the process of thinking through that, I discovered the current design on `main` incorrectly handles null funcrefs: it invokes a null code pointer, rather than loading a field from a null struct pointer. The latter was specifically designed to cause the necessary Wasm trap in bytecodealliance#8159, but I had missed that the call to a null code pointer would not have the same effect. As a result, we actually can crash the VM (safely at least, but still no good vs. a proper Wasm trap!) with the feature enabled. (It's off by default still.) That could be fixed too, but at this point, with the small benefit on real programs together with the limitation on module size for full benefit, I think I'd rather opt for simplicity and remove the cache entirely.

Thus, this PR removes call-indirect caching. It's not a direct revert because the original PR refactored the call-indirect generation into smaller helpers, and IMHO it's a bit nicer to keep that. But otherwise all traces of the setting, the code pre-scan during compilation and special conditions tracked on tables, and the codegen changes are gone.