This repository was archived by the owner on Nov 15, 2023. It is now read-only.

Ensure clean wasm instances #2931

Closed

cmichi wants to merge 8 commits into master from cmichi-ensure-clean-wasm-instances

Conversation

@cmichi
Contributor

@cmichi cmichi commented Jun 24, 2019

Addresses #2051.

The runtime cache which Substrate currently uses reuses one runtime instance for every call.
To enable this, the instance is cleaned up before every call. This leads to the problems detailed in
#2051.

Following up to private discussions with @pepyakin and @folsen, this PR tries out an alternative
approach: it replaces the runtime cache with a pool of pre-created runtime instances. For every call a "fresh" instance is taken out of the pool; a background thread monitors the pool and adds new instances.

In order to make this possible I had to implement threadsafe types for wasmi (wasmi-labs/wasmi#189, not yet merged).
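A minimal sketch of the pool idea described above (the `Instance` type and `InstancePool` are hypothetical stand-ins, not the PR's actual code):

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

// Hypothetical stand-in for a wasm runtime instance.
#[derive(Clone)]
struct Instance {
    id: u64,
}

struct InstancePool {
    template: Instance,
    ready: Arc<Mutex<Vec<Instance>>>,
}

impl InstancePool {
    fn new(template: Instance, min_size: usize) -> Self {
        let ready = Arc::new(Mutex::new(Vec::new()));
        let bg = Arc::clone(&ready);
        let tmpl = template.clone();
        // Background thread keeps the pool topped up with fresh clones.
        thread::spawn(move || loop {
            {
                let mut pool = bg.lock().unwrap();
                while pool.len() < min_size {
                    pool.push(tmpl.clone());
                }
            }
            thread::sleep(Duration::from_millis(1));
        });
        InstancePool { template, ready }
    }

    // Take a fresh instance; fall back to a synchronous clone if the
    // background thread has not replenished the pool yet.
    fn take(&self) -> Instance {
        self.ready
            .lock()
            .unwrap()
            .pop()
            .unwrap_or_else(|| self.template.clone())
    }
}
```

The consumer never waits on the producer here: if the pool happens to be empty, it clones synchronously rather than blocking.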

I did a first (very simple) benchmark by timing invocations of call_in_wasm_module and running cargo run --release -- --dev for 40 blocks:

AuraApi_authorities:
master:    169  µs (100%)
pool:      196  µs (115%)

AuraApi_slot_duration:
master:    2192 µs (100%)
pool:      96   µs (4%)

BlockBuilder_apply_extrinsic:
master:    1805 µs (100%)
pool:      2551 µs (141%)

BlockBuilder_finalize_block:
master:    6035 µs (100%)
pool:      4774 µs (79%)

BlockBuilder_inherent_extrinsics:
master:    996  µs (100%)
pool:      1416 µs (142%)

Core_execute_block:
master:    1335 µs (100%)
pool:      965  µs (72%)

Core_initialize_block:
master:    1731 µs (100%)
pool:      2143 µs (123%)

GrandpaApi_grandpa_authorities:
master:    2357 µs (100%)
pool:      150  µs (6%)

GrandpaApi_grandpa_forced_change:
master:    136  µs (100%)
pool:      155  µs (113%)

GrandpaApi_grandpa_pending_change:
master:    122  µs (100%)
pool:      121  µs (99%)

OffchainWorkerApi_offchain_worker:
master:    3    µs (100%)
pool:      11   µs (366%)

There is some room for optimization though (Jef had some comments on the wasmi PR).

At first we had intended for the runtime pool to do more complex actions when adding instances to the pool. Right now it just clones a runtime instance from a "template instance". I measured the clone operation in release mode on dev over 40 blocks and it takes on average about 1 µs on my machine. So it seems rather cheap and I'm wondering if we could maybe even do this synchronously without having the overhead of a separate thread?
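The synchronous alternative floated above could look roughly like this (hypothetical `Instance` type; the timing is illustrative only, not a reproduction of the measurement):

```rust
use std::time::Instant;

// Hypothetical stand-in for a wasm runtime instance; in the real
// runtime this would hold linear memory, globals, etc.
#[derive(Clone)]
struct Instance {
    memory: Vec<u8>,
}

// Clone a fresh instance on demand, per call, instead of maintaining
// a background-replenished pool. Prints how long the clone took.
fn call_with_fresh_instance(template: &Instance) -> Instance {
    let start = Instant::now();
    let fresh = template.clone();
    eprintln!("clone took {:?}", start.elapsed());
    fresh
}
```

If the clone really is on the order of 1 µs, doing it inline removes the extra thread, the locking, and the replenishment logic entirely.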

(The CI fails because I use wasmi = { path = "../../../wasmi", features = [ "threadsafe" ] } in the code, since that PR is not yet merged. That package is not (yet) available in the CI. All tests run locally though.)

cmichi added 4 commits June 24, 2019 10:22
Original is from @pepyakin in 3d7b27f.
I adapted it to work with the latest master.
...to have the threadsafe version available.
A background thread monitors the pool and adds
new instances.
@cmichi cmichi added the A3-in_progress Pull request is in progress. No review needed at this stage. label Jun 24, 2019
@cmichi cmichi requested a review from pepyakin June 24, 2019 08:30
@bkchr
Member

bkchr commented Jun 24, 2019

Wouldn't it be much cleaner to just restore the original data section?
This all looks like a big hack.

@pepyakin
Contributor

pepyakin commented Jun 24, 2019

Wouldn't it be much cleaner to just restore the original data section?

I assume you meant restoration of the original data snapshot as well as the other components of an instance (basically, cloning a wasm instance), otherwise it would have the same problems as #2051 to some degree.

I think we might have tried to restore/clean the whole memory instance, but it took considerably more time. I think that is why we ended up with the current optimization, which attempts to zero only the (heuristically detected) dirty areas of memory (I think @arkpar might have some info on this).

The proposed approach, at least conceptually (I haven't had a look at the code yet), is very simple, so I think I can repeat it here:

  1. Create a pool of instances.
  2. On one side of the pool are producers, which replenish the pool in background threads.
  3. On the other side of the pool are consumers - the wasm executor, which asks for an instance.

The general assumption behind this is that the typical workload is more or less synchronous, and thus the latency of wasm runtime execution matters more than the actual time spent preparing these instances. While threading and inter-thread communication definitely add some overhead, we are lowering the latency on the critical path to almost zero on average. Besides that, it doesn't use any feature specific to a wasm execution engine (wasmi at the moment), so this mechanism can easily be reused for another execution engine (i.e. #2634).
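The producer/consumer split in the three steps above can be sketched with a bounded channel (hypothetical types; a sketch of the shape, not the PR's actual code):

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

// Hypothetical stand-in for a wasm runtime instance.
#[derive(Clone)]
struct Instance {
    generation: u32,
}

// Take `n` fresh instances off the pool. The bounded channel *is* the
// pool: the producer blocks once it holds 8 ready instances, and it
// shuts down cleanly when the consumer side is dropped.
fn take_fresh(n: usize) -> Vec<Instance> {
    let (producer, consumer) = sync_channel::<Instance>(8);
    let template = Instance { generation: 0 };
    // Producer side: replenish the pool in a background thread.
    thread::spawn(move || {
        while producer.send(template.clone()).is_ok() {}
    });
    // Consumer side (the wasm executor): one fresh instance per call.
    (0..n).map(|_| consumer.recv().unwrap()).collect()
}
```

A bounded channel gives the back-pressure for free: the producer cannot run ahead of what the pool is allowed to hold.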

All of this, and the simplicity of the approach, indicated that it is a good path to explore.

It did uncover a problem with wasmi, though: wasmi in its current state is unusable in a multithreaded environment, and I am not happy about that either. That said, I think the general direction is still worth exploring.

Do you have anything in particular that you don't like in this solution (conceptually)?

cmichi and others added 2 commits June 24, 2019 13:36
Co-Authored-By: Sergei Pepyakin <s.pepyakin@gmail.com>
@pepyakin
Contributor

Updates:

  1. This approach uses a raw thread, which is a step backwards with respect to
    #2416 (Add a light client code that compiles in the browser).
  2. If the initial assumption that copying the whole memory takes too long is not true, then plain cloning of the runtime instance template is actually very tempting because of its simplicity.

@cmichi
Contributor Author

cmichi commented Jun 24, 2019

At first we had intended for the runtime pool to do more complex actions when adding instances to the pool. Right now it just clones a runtime instance from a "template instance". I measured the clone operation in release mode on dev over 40 blocks and it takes on average about 1 µs on my machine. So it seems rather cheap and I'm wondering if we could maybe even do this synchronously without having the overhead of a separate thread?

If the initial assumption that copying the whole memory takes too long is not true, then plain cloning of the runtime instance template is actually very tempting because of its simplicity.

I'll try it out in a separate branch/PR. This would also save us from having to switch to thread-safe types for wasmi.

@bkchr
Member

bkchr commented Jun 24, 2019

My complaint was mostly about copying everything ^^ But if there is no better way, okay..

Conceptually, I did not look that much into the code. In general you are right, we should not spawn raw threads here. Then we need to take a look at how we evict old runtimes from the cache. Maybe it makes sense to remove them after some blocks and not immediately when we see a new runtime.
And if we already touch this, I would really like to see the global cache variable gone and the cache passed in from the outside. Having global variables was not a good solution from the beginning.

@arkpar
Member

arkpar commented Jun 24, 2019

The problem is with the way the rust-to-wasm toolchain organizes wasm memory. IIRC it allocates 1MB for the stack and the data segment follows it. So each new runtime instance requires a 1MB write to memory, which impacts performance when a lot of runtimes are created for simple calls into WASM.
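To make that cost concrete, here is a rough illustration of what each fresh instantiation implies (the 1MB stack size follows the description above; the function is hypothetical, not Substrate code):

```rust
// The rust-to-wasm toolchain places a stack region at the start of
// linear memory, with the data segment after it. 1MB is the size
// described above.
const STACK_SIZE: usize = 1024 * 1024;

// Each fresh runtime instance effectively requires writing this whole
// region anew - a 1MB memory write per instantiation.
fn reset_linear_memory(memory: &mut [u8]) {
    for byte in memory.iter_mut().take(STACK_SIZE) {
        *byte = 0;
    }
}
```

Done once per simple runtime call, a write of this size can dominate the call itself (compare the µs-level figures in the benchmark above).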

I'd really prefer this to be fixed fundamentally by avoiding that huge unnecessary work, rather than working around it. Pooling would waste CPU and would degrade heavily under load, e.g. when a lot of RPC requests come in in parallel, which is quite typical in various blockchain applications.

@pepyakin
Contributor

Pooling would waste CPU and would degrade heavily under load, e.g. when a lot of RPC requests come in in parallel, which is quite typical in various blockchain applications.

This is really good input. As I mentioned, the assumption was that the node has spare CPU time. If you think that is not the case then this would be an issue.

I have some ideas in mind for how to fix it properly. However, they would require a non-trivial amount of modification to the current execution engine - wasmi - and to the execution engines that would be introduced in the future (#2634). Opting into this more sophisticated approach would inhibit experimentation as well.

But most importantly, this issue is pressing due to an upcoming audit. This solution seemed pretty simple and shouldn't regress performance much.

That said, I really agree that we need to use a proper solution eventually and I see the proposed solution as a stop-gap.

@arkpar
Member

arkpar commented Jun 26, 2019

Would still love to see how it affects real block import performance, i.e. importing 10000 testnet blocks from a file. Also please compare CPU load.

@cmichi
Contributor Author

cmichi commented Jun 28, 2019

I'm closing this PR as we are currently favoring the approach from #2938 and have stopped working on this one. If the other one turns out to be a dead end we can always reopen this PR.

@cmichi cmichi closed this Jun 28, 2019
