Lazily allocate the bump-alloc chunk in the externref table#3739
Merged
alexcrichton merged 1 commit intobytecodealliance:mainfrom Jan 28, 2022
Merged
Lazily allocate the bump-alloc chunk in the externref table#3739alexcrichton merged 1 commit intobytecodealliance:mainfrom
alexcrichton merged 1 commit intobytecodealliance:mainfrom
Conversation
Subscribe to Label Actioncc @fitzgen DetailsThis issue or pull request has been labeled: "wasmtime:ref-types"Thus the following users have been cc'd because of the following labels:
To subscribe or unsubscribe from this label, edit the |
bjorn3
reviewed
Jan 28, 2022
fitzgen
approved these changes
Jan 28, 2022
Member
fitzgen
left a comment
There was a problem hiding this comment.
Nice, I was a little worried that this would involve additional null checks in the inline barriers, and I'm pleased to see that is not the case :)
alexcrichton
added a commit
to alexcrichton/wasmtime
that referenced
this pull request
Jan 28, 2022
This is another PR along the lines of "let's squeeze all possible performance we can out of instantiation". Before this PR we would copy, by value, the contents of `VMBuiltinFunctionsArray` into each `VMContext` allocated. This array of function pointers is modestly-sized but growing over time as we add various intrinsics. Additionally it's the exact same for all `VMContext` allocations. This PR attempts to speed up instantiation slightly by instead storing an indirection to the function array. This means that calling a builtin intrinsic is a tad bit slower since it requires two loads instead of one (one to get the base pointer, another to get the actual address). Otherwise though `VMContext` initialization is now simply setting one pointer instead of doing a `memcpy` from one location to another. With some macro-magic this commit also replaces the previous implementation with one that's more `const`-friendly which also gets us compile-time type-checks of libcalls as well as compile-time verification that all libcalls are defined. Overall, as with bytecodealliance#3739, the win is very modest here. Locally I measured a speedup from 1.9us to 1.7us taken to instantiate an empty module with one function. While small at these scales it's still a 10% improvement!
This commit updates the allocation of a `VMExternRefActivationsTable` structure to perform zero malloc memory allocations. Previously it would allocate a page-size of `chunk` plus some space in hash sets for future insertions. The main trick here implemented is that after the first gc during the slow path the fast chunk allocation is allocated and configured. The motivation for this PR is that given our recent work to further refine and optimize the instantiation process this allocation started to show up in a nontrivial fashion. Most modules today never touch this table anyway as almost none of them use reference types, so the time spent allocation and deallocating the table per-store was largely wasted time. Concretely on a microbenchmark this PR speeds up instantiation of a module with one function by 30%, decreasing the instantiation cost from 1.8us to 1.2us. Overall a pretty minor win but when the instantiation times we're measuring start being in the single-digit microseconds this win ends up getting magnified!
a355c8f to
4563f51
Compare
alexcrichton
added a commit
that referenced
this pull request
Jan 28, 2022
* Don't copy `VMBuiltinFunctionsArray` into each `VMContext` This is another PR along the lines of "let's squeeze all possible performance we can out of instantiation". Before this PR we would copy, by value, the contents of `VMBuiltinFunctionsArray` into each `VMContext` allocated. This array of function pointers is modestly-sized but growing over time as we add various intrinsics. Additionally it's the exact same for all `VMContext` allocations. This PR attempts to speed up instantiation slightly by instead storing an indirection to the function array. This means that calling a builtin intrinsic is a tad bit slower since it requires two loads instead of one (one to get the base pointer, another to get the actual address). Otherwise though `VMContext` initialization is now simply setting one pointer instead of doing a `memcpy` from one location to another. With some macro-magic this commit also replaces the previous implementation with one that's more `const`-friendly which also gets us compile-time type-checks of libcalls as well as compile-time verification that all libcalls are defined. Overall, as with #3739, the win is very modest here. Locally I measured a speedup from 1.9us to 1.7us taken to instantiate an empty module with one function. While small at these scales it's still a 10% improvement! * Review comments
alexcrichton
added a commit
to alexcrichton/wasmtime
that referenced
this pull request
Feb 2, 2022
…alliance#3739) This commit updates the allocation of a `VMExternRefActivationsTable` structure to perform zero malloc memory allocations. Previously it would allocate a page-size of `chunk` plus some space in hash sets for future insertions. The main trick here implemented is that after the first gc during the slow path the fast chunk allocation is allocated and configured. The motivation for this PR is that given our recent work to further refine and optimize the instantiation process this allocation started to show up in a nontrivial fashion. Most modules today never touch this table anyway as almost none of them use reference types, so the time spent allocation and deallocating the table per-store was largely wasted time. Concretely on a microbenchmark this PR speeds up instantiation of a module with one function by 30%, decreasing the instantiation cost from 1.8us to 1.2us. Overall a pretty minor win but when the instantiation times we're measuring start being in the single-digit microseconds this win ends up getting magnified!
alexcrichton
added a commit
to alexcrichton/wasmtime
that referenced
this pull request
Feb 2, 2022
…lliance#3741) * Don't copy `VMBuiltinFunctionsArray` into each `VMContext` This is another PR along the lines of "let's squeeze all possible performance we can out of instantiation". Before this PR we would copy, by value, the contents of `VMBuiltinFunctionsArray` into each `VMContext` allocated. This array of function pointers is modestly-sized but growing over time as we add various intrinsics. Additionally it's the exact same for all `VMContext` allocations. This PR attempts to speed up instantiation slightly by instead storing an indirection to the function array. This means that calling a builtin intrinsic is a tad bit slower since it requires two loads instead of one (one to get the base pointer, another to get the actual address). Otherwise though `VMContext` initialization is now simply setting one pointer instead of doing a `memcpy` from one location to another. With some macro-magic this commit also replaces the previous implementation with one that's more `const`-friendly which also gets us compile-time type-checks of libcalls as well as compile-time verification that all libcalls are defined. Overall, as with bytecodealliance#3739, the win is very modest here. Locally I measured a speedup from 1.9us to 1.7us taken to instantiate an empty module with one function. While small at these scales it's still a 10% improvement! * Review comments
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This commit updates the allocation of a
VMExternRefActivationsTablestructure to perform zero malloc memory allocations. Previously it would
allocate a page-size of
chunkplus some space in hash sets for futureinsertions. The main trick here implemented is that after the first gc
during the slow path the fast chunk allocation is allocated and
configured.
The motivation for this PR is that given our recent work to further
refine and optimize the instantiation process this allocation started to
show up in a nontrivial fashion. Most modules today never touch this
table anyway as almost none of them use reference types, so the time
spent allocation and deallocating the table per-store was largely wasted
time.
Concretely on a microbenchmark this PR speeds up instantiation of a
module with one function by 30%, decreasing the instantiation cost from
1.8us to 1.2us. Overall a pretty minor win but when the instantiation
times we're measuring start being in the single-digit microseconds this
win ends up getting magnified!