Lazily allocate the bump-alloc chunk in the externref table by alexcrichton · Pull Request #3739 · bytecodealliance/wasmtime

alexcrichton · 2022-01-28T20:29:04Z

This commit updates the allocation of a VMExternRefActivationsTable
structure to perform zero malloc memory allocations. Previously it would
allocate a page-sized chunk plus some space in hash sets for future
insertions. The main trick implemented here is that the fast-path chunk
is only allocated and configured after the first gc, during the slow
path.
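As a rough illustration of the trick described above, here is a minimal sketch of a bump table whose chunk starts out unallocated. It is not Wasmtime's actual `VMExternRefActivationsTable`: the type `LazyBumpTable`, the constant `CHUNK_SLOTS`, the `u64` slot type, and the "gc" that simply clears the chunk are all assumptions made for the example. The point it tries to show is that the fast path keeps a single bounds check, which fails naturally while the chunk is empty and so falls through to the slow path without an extra "is it allocated?" test.

```rust
/// Assumed chunk size for this sketch; the real table uses a page-sized chunk.
const CHUNK_SLOTS: usize = 512;

/// Hypothetical stand-in for an activations table, not Wasmtime's real type.
struct LazyBumpTable {
    /// Bump chunk. `Vec::new()` performs no heap allocation.
    chunk: Vec<u64>,
    /// Number of slots the fast path may use; 0 until the chunk is configured.
    capacity: usize,
    /// Number of collections performed, tracked only for demonstration.
    gcs: usize,
}

impl LazyBumpTable {
    /// Creating the table allocates nothing.
    fn new() -> Self {
        LazyBumpTable { chunk: Vec::new(), capacity: 0, gcs: 0 }
    }

    /// Fast path: one bounds check, identical before and after the chunk exists.
    fn insert(&mut self, value: u64) {
        if self.chunk.len() < self.capacity {
            self.chunk.push(value);
            return;
        }
        self.insert_slow(value);
    }

    /// Slow path: collect first, then allocate the chunk if it is still missing.
    #[cold]
    fn insert_slow(&mut self, value: u64) {
        self.gc();
        if self.capacity == 0 {
            self.chunk.reserve_exact(CHUNK_SLOTS);
            self.capacity = CHUNK_SLOTS;
        }
        self.chunk.push(value);
    }

    /// Toy collection: pretend nothing in the chunk is still live.
    fn gc(&mut self) {
        self.gcs += 1;
        self.chunk.clear();
    }
}
```

In this sketch a table that never sees an insertion never allocates, which mirrors the case of modules that do not use reference types.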

The motivation for this PR is that, given our recent work to further
refine and optimize the instantiation process, this allocation started to
show up in a nontrivial fashion. Most modules today never touch this
table anyway, as almost none of them use reference types, so the time
spent allocating and deallocating the table per-store was largely wasted
time.

Concretely on a microbenchmark this PR speeds up instantiation of a
module with one function by 30%, decreasing the instantiation cost from
1.8us to 1.2us. Overall a pretty minor win but when the instantiation
times we're measuring start being in the single-digit microseconds this
win ends up getting magnified!

github-actions · 2022-01-28T20:45:17Z

Subscribe to Label Action

cc @fitzgen

Details

This issue or pull request has been labeled: "wasmtime:ref-types"

Thus the following users have been cc'd because of the following labels:

fitzgen: wasmtime:ref-types

To subscribe or unsubscribe from this label, edit the .github/subscribe-to-label.json configuration file.

crates/runtime/src/externref.rs

fitzgen

Nice, I was a little worried that this would involve additional null checks in the inline barriers, and I'm pleased to see that is not the case :)

This is another PR along the lines of "let's squeeze all possible performance we can out of instantiation". Before this PR we would copy, by value, the contents of `VMBuiltinFunctionsArray` into each `VMContext` allocated. This array of function pointers is modestly-sized but growing over time as we add various intrinsics. Additionally it's the exact same for all `VMContext` allocations.

This PR attempts to speed up instantiation slightly by instead storing an indirection to the function array. This means that calling a builtin intrinsic is a tad bit slower since it requires two loads instead of one (one to get the base pointer, another to get the actual address). Otherwise though `VMContext` initialization is now simply setting one pointer instead of doing a `memcpy` from one location to another.

With some macro-magic this commit also replaces the previous implementation with one that's more `const`-friendly which also gets us compile-time type-checks of libcalls as well as compile-time verification that all libcalls are defined.

Overall, as with bytecodealliance#3739, the win is very modest here. Locally I measured a speedup from 1.9us to 1.7us taken to instantiate an empty module with one function. While small at these scales it's still a 10% improvement!
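The by-value versus by-pointer trade-off described for `VMBuiltinFunctionsArray` can be sketched roughly as follows. This is a toy model, not Wasmtime's code: `BuiltinFunctions`, `ContextByValue`, `ContextByPointer`, and the two toy builtins are hypothetical names invented for the example, and a real `VMContext` is laid out quite differently.

```rust
/// Signature shared by the toy builtins in this sketch.
type Builtin = fn(u32) -> u32;

// Toy builtins standing in for real libcalls (hypothetical).
fn toy_add_one(x: u32) -> u32 { x + 1 }
fn toy_double(x: u32) -> u32 { x * 2 }

/// Hypothetical table of function pointers, identical for every context.
#[derive(Clone, Copy)]
#[repr(C)]
struct BuiltinFunctions {
    add_one: Builtin,
    double: Builtin,
}

/// One shared, statically initialized instance of the table.
static BUILTINS: BuiltinFunctions = BuiltinFunctions {
    add_one: toy_add_one,
    double: toy_double,
};

/// Old shape: every context carries its own copy of the whole table.
struct ContextByValue {
    builtins: BuiltinFunctions,
}

/// New shape: every context stores a single pointer to the shared table.
struct ContextByPointer {
    builtins: &'static BuiltinFunctions,
}

impl ContextByValue {
    /// Initialization copies every entry, and grows with the table.
    fn new() -> Self {
        ContextByValue { builtins: BUILTINS }
    }
}

impl ContextByPointer {
    /// Initialization stores one pointer, regardless of table size.
    fn new() -> Self {
        ContextByPointer { builtins: &BUILTINS }
    }

    /// Calling a builtin now takes two loads: the table pointer, then the entry.
    fn call_double(&self, x: u32) -> u32 {
        (self.builtins.double)(x)
    }
}
```

The by-pointer context stays one pointer wide no matter how many builtins are added, at the cost of the extra load on each builtin call.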


alexcrichton requested a review from fitzgen January 28, 2022 20:29

github-actions bot added the wasmtime:ref-types Issues related to reference types and GC in Wasmtime label Jan 28, 2022

bjorn3 reviewed Jan 28, 2022

crates/runtime/src/externref.rs (outdated, resolved)

fitzgen approved these changes Jan 28, 2022

alexcrichton mentioned this pull request Jan 28, 2022

Don't copy VMBuiltinFunctionsArray into each VMContext #3741

Merged

alexcrichton force-pushed the less-alloc branch from a355c8f to 4563f51 Compare January 28, 2022 21:24

alexcrichton merged commit 2f49424 into bytecodealliance:main Jan 28, 2022

alexcrichton deleted the less-alloc branch January 28, 2022 22:10
