
Conversation


@Shillaker Shillaker commented Apr 22, 2021

This PR has totally spiralled; it's exposed a few concurrency issues that have led to some whack-a-mole rewriting and refactoring. In general I've tried to remove complexity wherever possible, and to add assertions and checks to catch issues.

Summary of changes:

  • Update Faasm to the simpler executor model introduced in Faabric: Unifying threads, functions and thread pooling faabric#83
  • Make a better distinction between zygotes, cached modules and snapshots
  • Isolate WAVM module caching inside the wavm module (as this is very specific). Previously this was done through a separate module_cache module which made the logic hard to follow.
  • Merge the ir_cache module into the wavm module as it's also WAVM-specific (and maybe a candidate for complete removal?)
  • Remove simple_runner and replace it with func_runner; now that executor pooling has been removed, they are essentially equivalent.
  • Avoid use of destructors unless absolutely necessary to release resources; they had become used for general tidy-up and introduced some nasty errors when they went wrong.
  • Encapsulate the indexing of NetworkNamespace objects inside the class, rather than having wasm modules manage this
  • Avoid thread_local global variables where possible, as thread-locality isn't always a guarantee of separation between modules/functions. Where they do still exist (e.g. with get/setExecutingModule), I've added some relatively strict asserts (a sketch of the kind of check meant here follows this list).
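As a rough idea of the kind of strict assert meant here (purely illustrative, not the actual Faasm code; WasmModule is a stand-in for the real module type):

#include <cassert>

class WasmModule; // stand-in for the real Faasm wasm module class

static thread_local WasmModule* executingModule = nullptr;

void setExecutingModule(WasmModule* module)
{
    // Never silently swap one module for another on the same thread; the
    // slot must be cleared (set to nullptr) before a different module binds
    assert(module == nullptr || executingModule == nullptr ||
           executingModule == module);
    executingModule = module;
}

WasmModule* getExecutingModule()
{
    // Must not be called before a module has been set on this thread
    assert(executingModule != nullptr);
    return executingModule;
}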

There are some OpenMP-specific changes too:

  • To pass control of distributed threading to Faabric, the OpenMP thread context now needs to be serialised and sent with the batch execution requests (rather than managed by the wasm module).
  • The synchronisation operations, such as barriers, mutexes and critical sections, need to support look-up by the Level id, to allow distributed threads to find them without relying on them being available in memory. A rough sketch of both of these points follows this list.
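A rough sketch of both of these points, assuming a flattened level context and a registry keyed by level id (the names and layout here are illustrative, not the actual Faasm OpenMP code):

#include <cassert>
#include <cstdint>
#include <cstring>
#include <memory>
#include <mutex>
#include <unordered_map>
#include <vector>

// Hypothetical flattened OpenMP level context that can travel with a batch
// execution request
struct OmpLevelContext
{
    uint32_t levelId = 0;
    int32_t depth = 0;
    int32_t numThreads = 1;
    std::vector<uint32_t> sharedVarOffsets; // wasm offsets, hence uint32_t

    std::vector<uint8_t> serialise() const
    {
        // Fixed-size 12-byte header followed by the shared var offsets
        std::vector<uint8_t> buf(12 + sharedVarOffsets.size() * sizeof(uint32_t));
        std::memcpy(buf.data(), &levelId, sizeof(levelId));
        std::memcpy(buf.data() + 4, &depth, sizeof(depth));
        std::memcpy(buf.data() + 8, &numThreads, sizeof(numThreads));
        if (!sharedVarOffsets.empty()) {
            std::memcpy(buf.data() + 12,
                        sharedVarOffsets.data(),
                        sharedVarOffsets.size() * sizeof(uint32_t));
        }
        return buf;
    }

    static OmpLevelContext deserialise(const std::vector<uint8_t>& buf)
    {
        assert(buf.size() >= 12);
        OmpLevelContext lvl;
        std::memcpy(&lvl.levelId, buf.data(), sizeof(lvl.levelId));
        std::memcpy(&lvl.depth, buf.data() + 4, sizeof(lvl.depth));
        std::memcpy(&lvl.numThreads, buf.data() + 8, sizeof(lvl.numThreads));
        size_t nVars = (buf.size() - 12) / sizeof(uint32_t);
        lvl.sharedVarOffsets.resize(nVars);
        if (nVars > 0) {
            std::memcpy(lvl.sharedVarOffsets.data(),
                        buf.data() + 12,
                        nVars * sizeof(uint32_t));
        }
        return lvl;
    }
};

// Hypothetical registry letting a distributed thread find (or lazily create)
// the synchronisation state for its level by id, instead of relying on it
// already existing in local memory
class LevelSyncRegistry
{
  public:
    std::shared_ptr<std::mutex> getMutex(uint32_t levelId)
    {
        std::lock_guard<std::mutex> lock(registryMutex);
        std::shared_ptr<std::mutex>& entry = levelMutexes[levelId];
        if (entry == nullptr) {
            entry = std::make_shared<std::mutex>();
        }
        return entry;
    }

  private:
    std::mutex registryMutex;
    std::unordered_map<uint32_t, std::shared_ptr<std::mutex>> levelMutexes;
};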

privileged: yes
volumes:
- ./dev/faasm/build/:${FAASM_BUILD_MOUNT}
- ./dev/container/shared_store/:/usr/local/faasm/shared_store/
Collaborator Author

Revert this

@Shillaker Shillaker changed the title Batch execution improvements Move thread pooling into Faabric Apr 22, 2021
@Shillaker Shillaker marked this pull request as ready for review May 12, 2021 15:25
@@ -1,15 +1,17 @@
#include "faabric/util/config.h"
Collaborator Author
@Shillaker Shillaker May 12, 2021

I've tried to make this file more high-level and do less interrogation of the Faabric internals, hence there's a lot of red here (as with the associated utils.h header).


bool success = module.tearDown();
REQUIRE(success);
}
Collaborator Author

A failed WAVM GC will now trigger an assertion failure. This should never happen unless we change the code in a way that makes GC impossible, in which case we should fix it.

REQUIRE(success);

// Return code will be equal to the failure case of the function
REQUIRE(call.returnvalue() == 101);
Collaborator Author

This test was really checking the internals of the WAVM wasm module, and so was very brittle. The ability to execute functions from pointers is implicitly tested all over the place.

msg.set_ispython(true);
}

TEST_CASE("Test cloning empty modules doesn't break", "[wasm]")
Collaborator Author

Cloning empty modules will now break, as we should never be doing it.

TEST_CASE("Run distributed threading check", "[threads]")
{
runTestDistributed("threads_dist");
}
Collaborator Author

These two functions weren't really testing distributed execution; they were just doing a lot of fiddling with internals to fake up what we think distributed execution will do. Instead I'll leave this to another PR on doing "proper" distributed testing with multiple containers.

// Map of tid to message ID for chained calls
static thread_local std::unordered_map<I32, unsigned int> chainedThreads;
static thread_local std::unordered_map<I32, std::future<int32_t>>
  localThreadFutures;
Collaborator Author

These are really specific to the module and not any given thread.
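A rough sketch of what that change looks like (illustrative, matching the names in the snippet above rather than the exact diff):

#include <cstdint>
#include <future>
#include <unordered_map>

using I32 = int32_t;

// Trimmed-down stand-in for the real WAVMWasmModule: the maps become
// instance members owned by the module, rather than thread_local globals
class WAVMWasmModule
{
  public:
    // Map of wasm tid to message ID for chained calls
    std::unordered_map<I32, unsigned int> chainedThreads;

    // Futures for threads executed locally
    std::unordered_map<I32, std::future<int32_t>> localThreadFutures;
};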

Runtime::Memory* memoryPtr = parentModule->defaultMemory;
faabric::Message* parentCall = getExecutingCall();

// Set up number of threads for next level
Collaborator Author

Changes from here on are mostly to do with refactoring to the new Faabric set-up where we don't need to worry about spawning our own tasks etc.

// removed
[[maybe_unused]] bool compartmentCleared =
  Runtime::tryCollectCompartment(std::move(compartment));
assert(compartmentCleared);
Collaborator Author

This [[maybe_unused]] fixes an error in the release build, where the compiler complains that compartmentCleared is unused (because the assert is compiled out).
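A self-contained illustration of the mechanism (tryCollect is just a stand-in for Runtime::tryCollectCompartment):

#include <cassert>

bool tryCollect() // stand-in for Runtime::tryCollectCompartment
{
    return true;
}

int main()
{
    // In release builds NDEBUG is defined and assert() expands to nothing,
    // so without [[maybe_unused]] the compiler warns that the variable is
    // unused (and -Werror turns that warning into a build failure)
    [[maybe_unused]] bool cleared = tryCollect();
    assert(cleared);
    return 0;
}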

globalOffsetMemoryMap.clear();
missingGlobalOffsetEntries.clear();

dynamicPathToHandleMap.clear();
Collaborator Author

Resetting normal instance variables will be handled by the default destructor.

@@ -0,0 +1,61 @@
#include <wavm/WAVMWasmModule.h>
Collaborator Author

This whole WAVMModuleCache class used to be the WasmModuleCache and have a lot more unnecessary complexity.

std::string newOutput = moduleStdout + "\n" + msg.outputdata();
msg.set_outputdata(newOutput);

// Check if we can add a thread
Collaborator Author

All this thread pooling stuff is handled in Faabric now.

@Shillaker Shillaker requested a review from csegarragonz May 13, 2021 12:00
Collaborator
@csegarragonz csegarragonz left a comment

LGTM. Made a small point on SharedVars, and on maybe introducing assertm, a macro for assert + message.

// Note - avoid a zero default on the thread request type otherwise it can
// go unset without noticing
enum ThreadRequestType
{
Collaborator

I sometimes try and add a 0 default myself, just in case. Not sure if it is actually necessary.
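To make the trade-off concrete (the enumerator names and values below are hypothetical, not the actual Faasm enum):

// Option 1 (what the comment in the diff describes): leave 0 unused, so a
// value that was never set (and is therefore zero-initialised, or left at a
// serialised message's default) can't be mistaken for a real request type
enum ThreadRequestType
{
    PTHREAD_REQUEST = 1,
    OPENMP_REQUEST = 2,
};

// Option 2 (the "0 default" suggestion): reserve 0 as an explicit sentinel
// that call sites can check for and reject
enum ThreadRequestTypeWithDefault
{
    NO_THREAD_REQUEST = 0,
    PTHREAD_THREAD_REQUEST = 1,
    OPENMP_THREAD_REQUEST = 2,
};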


void Level::setSharedVars(uint32_t* ptr, int nVars)
{
sharedVars = new uint32_t[nVars];
Collaborator

I am wondering if it is worth creating an abstraction for a SharedVar that basically wraps around uint32_t. I am concerned by the number of times in this file we rely on shared vars being uint32_t (e.g. in serialize and deserialize).

Collaborator Author
@Shillaker Shillaker May 14, 2021

First off, sharedVars is probably not a great name as they're really wasm offsets, so sharedVarOffsets might be better. Because they're wasm offsets, they'll always be uint32_t.

I'm not sure what sort of abstraction would make things any safer or clearer here. It would have to be serialisable so would need to be composed of primitives itself, so it could be a struct that held the value and size, e.g.

struct SharedVarPtr {
    uint32_t val;
    size_t size = sizeof(uint32_t);
};

but I'm not sure that adds much clarity or extra safety (plus the size would always be constant). I could add a #define SHARED_VAR_OFFSET_SIZE sizeof(uint32_t) at the top to make that part explicit, but it's only used in two places so I'm not sure it's worth it.

Any mistakes in the serialisation and deserialisation should be picked up in the tests, so I'm happy to leave this as is.

depth);
}

assert(localThreadNum >= 0);
Collaborator

Checking, throwing an error, and asserting seems a bit counter-intuitive/verbose to me. What do you think about using something like:

#define assertm(exp, msg) assert(((void)msg, exp))
...
assertm(localThreadNum >= 0, fmt::format("Local thread num negative {} - {} @ {}", msg->appindex(), globalTidOffset, depth));

taken from cppreference.

Collaborator Author

Hmm yes, you're right; I'll just add an exception in this case.

WAVMWasmModule* thisModule = getExecutingWAVMModule();
unsigned int callId = thisModule->chainedThreads[pthreadPtr];
logger->debug("Awaiting pthread: {} ({})", pthreadPtr, callId);
faabric::scheduler::Scheduler& sch = faabric::scheduler::getScheduler();
Collaborator

I think we decided it was ok to use auto& for the scheduler's getter.

conf::FaasmConfig& conf = conf::getFaasmConfig();
int32_t defaultPoolSize = conf.moduleThreadPoolSize;
conf.moduleThreadPoolSize = 15;
faabric::util::SystemConfig& conf = faabric::util::getSystemConfig();
Collaborator

Should we use auto& here as well? (Feel free to ignore)

Collaborator Author
@Shillaker Shillaker May 14, 2021

We should use auto to reduce verbosity on very commonly used things, but not when it would impact the readability/understanding of the code. The logger is a good example: it's used in almost every file, its type name is long, and we only have one of them. Adding auto on the Scheduler is definitely ok in places where it's used a lot in the same file (e.g. tests), plus we only have one scheduler so you know what type it is. Given that we have two configs (faasm and faabric), I would say it's less clear whether to have a blanket ruling on conf, as it might harm readability, but it's probably not that big a deal either way.

throw std::runtime_error(
"Cannot execute function on module bound to another");
}
throw std::runtime_error("Module must be bound before executing");
Collaborator

Maybe another use case for assertm()?

Collaborator Author
@Shillaker Shillaker May 14, 2021

Hmm, it's a good question. I think assertm is probably never preferable to an if and an exception, other than the fact that it saves some lines. I do like that assertm can be done in one line, and hence doesn't clutter the code, but because it's removed in non-debug builds it risks missing errors in deployments. It would be nice to save some lines on checks like this though, so perhaps it should be replaced with an assertm-like macro that doesn't get removed.

I would say we should only use asserts in places where they will never fail unless there is something wrong with the logic contained wholly within that method, or there's a threading issue like a race condition, i.e. places where you wouldn't otherwise bother with an if/exception combo and that we can be pretty sure will be caught in tests. I see asserts as a mix of documentation and testing, i.e. they tell you as the reader something relatively obvious, e.g. "this variable in the loop should be zero by the end", or "this thing will never be null, so you can always make that assumption in code that uses this function".

There are lots of arguments against assert though, so if we had a macro like assertm that didn't actually use assert, we could just replace all error checking and assertions with that (a rough sketch is below).
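A minimal sketch of what such a macro could look like (this is a suggestion/assumption, not something that exists in Faasm or Faabric):

#include <sstream>
#include <stdexcept>
#include <string>

// Unlike assert, this is not compiled out in release builds: a failing check
// always throws, with the failing expression and source location in the
// message
#define CHECK_WITH_MSG(exp, msg)                                              \
    do {                                                                      \
        if (!(exp)) {                                                         \
            std::ostringstream oss;                                           \
            oss << "Check failed: " << #exp << " (" << (msg) << ") at "       \
                << __FILE__ << ":" << __LINE__;                               \
            throw std::runtime_error(oss.str());                              \
        }                                                                     \
    } while (0)

// Usage, replacing an if/exception combo with one line:
// CHECK_WITH_MSG(localThreadNum >= 0, "local thread num negative");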

if (t.second.joinable()) {
t.second.join();
}
if (returnValue != 0) {
Collaborator

assertm() maybe?

@Shillaker Shillaker merged commit 71f2e47 into master May 14, 2021
@Shillaker Shillaker deleted the batch-execution branch May 14, 2021 06:39