
Conversation


@Shillaker Shillaker commented Apr 22, 2021

This PR has totally spiralled; it's exposed a few concurrency issues that have led to some whack-a-mole rewriting and refactoring. In general I've tried to remove complexity wherever possible, and to add assertions and checks to catch issues.

Summary of changes:

  • Update Faasm to the simpler executor model introduced in Faabric: Unifying threads, functions and thread pooling faabric#83
  • Make a better distinction between zygotes, cached modules and snapshots
  • Isolate WAVM module caching inside the wavm module (as this is very specific). Previously this was done through a separate module_cache module which made the logic hard to follow.
  • Merge the ir_cache module into the wavm module as it's also WAVM-specific (and maybe a candidate for complete removal?)
  • Remove simple_runner and replace it with func_runner; now that executor pooling has been removed, they are essentially equivalent.
  • Avoid use of destructors unless absolutely necessary to release resources; they had become used for general tidy-up and introduced some nasty errors when they went wrong.
  • Encapsulate the indexing of NetworkNamespace objects inside the class, rather than having wasm modules manage this
  • Avoid thread_local global variables where possible, as thread-locality isn't always a guarantee of separation between modules/functions. Where they do still exist (e.g. with get/setExecutingModule), I've added some relatively strict asserts (a sketch of the kind of check meant here follows this list).
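As a rough idea of the kind of strict assert meant here (purely illustrative, not the actual Faasm code; WasmModule is a stand-in for the real module type):

#include <cassert>

class WasmModule; // stand-in for the real Faasm wasm module class

static thread_local WasmModule* executingModule = nullptr;

void setExecutingModule(WasmModule* module)
{
    // Never silently swap one module for another on the same thread; the
    // slot must be cleared (set to nullptr) before a different module binds
    assert(module == nullptr || executingModule == nullptr ||
           executingModule == module);
    executingModule = module;
}

WasmModule* getExecutingModule()
{
    // Must not be called before a module has been set on this thread
    assert(executingModule != nullptr);
    return executingModule;
}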

There are some OpenMP-specific changes too:

  • To pass control of distributed threading to Faabric, the OpenMP thread context now needs to be serialised and sent with the batch execution requests (rather than managed by the wasm module).
  • The synchronisation operations, such as barriers, mutexes and critical sections, need to support look-up by the Level id, to allow distributed threads to find them without relying on them being available in memory. A rough sketch of both of these points follows this list.
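A rough sketch of both of these points, assuming a flattened level context and a registry keyed by level id (the names and layout here are illustrative, not the actual Faasm OpenMP code):

#include <cassert>
#include <cstdint>
#include <cstring>
#include <memory>
#include <mutex>
#include <unordered_map>
#include <vector>

// Hypothetical flattened OpenMP level context that can travel with a batch
// execution request
struct OmpLevelContext
{
    uint32_t levelId = 0;
    int32_t depth = 0;
    int32_t numThreads = 1;
    std::vector<uint32_t> sharedVarOffsets; // wasm offsets, hence uint32_t

    std::vector<uint8_t> serialise() const
    {
        // Fixed-size 12-byte header followed by the shared var offsets
        std::vector<uint8_t> buf(12 + sharedVarOffsets.size() * sizeof(uint32_t));
        std::memcpy(buf.data(), &levelId, sizeof(levelId));
        std::memcpy(buf.data() + 4, &depth, sizeof(depth));
        std::memcpy(buf.data() + 8, &numThreads, sizeof(numThreads));
        if (!sharedVarOffsets.empty()) {
            std::memcpy(buf.data() + 12,
                        sharedVarOffsets.data(),
                        sharedVarOffsets.size() * sizeof(uint32_t));
        }
        return buf;
    }

    static OmpLevelContext deserialise(const std::vector<uint8_t>& buf)
    {
        assert(buf.size() >= 12);
        OmpLevelContext lvl;
        std::memcpy(&lvl.levelId, buf.data(), sizeof(lvl.levelId));
        std::memcpy(&lvl.depth, buf.data() + 4, sizeof(lvl.depth));
        std::memcpy(&lvl.numThreads, buf.data() + 8, sizeof(lvl.numThreads));
        size_t nVars = (buf.size() - 12) / sizeof(uint32_t);
        lvl.sharedVarOffsets.resize(nVars);
        if (nVars > 0) {
            std::memcpy(lvl.sharedVarOffsets.data(),
                        buf.data() + 12,
                        nVars * sizeof(uint32_t));
        }
        return lvl;
    }
};

// Hypothetical registry letting a distributed thread find (or lazily create)
// the synchronisation state for its level by id, instead of relying on it
// already existing in local memory
class LevelSyncRegistry
{
  public:
    std::shared_ptr<std::mutex> getMutex(uint32_t levelId)
    {
        std::lock_guard<std::mutex> lock(registryMutex);
        std::shared_ptr<std::mutex>& entry = levelMutexes[levelId];
        if (entry == nullptr) {
            entry = std::make_shared<std::mutex>();
        }
        return entry;
    }

  private:
    std::mutex registryMutex;
    std::unordered_map<uint32_t, std::shared_ptr<std::mutex>> levelMutexes;
};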

privileged: yes
volumes:
- ./dev/faasm/build/:${FAASM_BUILD_MOUNT}
- ./dev/container/shared_store/:/usr/local/faasm/shared_store/
Collaborator Author

Revert this

@Shillaker Shillaker changed the title Batch execution improvements Move thread pooling into Faabric Apr 22, 2021
@Shillaker Shillaker marked this pull request as ready for review May 12, 2021 15:25
@@ -1,15 +1,17 @@
#include "faabric/util/config.h"
Collaborator Author
@Shillaker Shillaker May 12, 2021

I've tried to make this file more high-level and do less interrogation of the Faabric internals, hence there's a lot of red here (as with the associated utils.h header).


bool success = module.tearDown();
REQUIRE(success);
}
Collaborator Author

A failed WAVM GC will now trigger an assertion failure. This should never happen unless we change the code in a way that makes GC impossible, in which case we should fix it.

REQUIRE(success);

// Return code will be equal to the failure case of the function
REQUIRE(call.returnvalue() == 101);
Collaborator Author

This test was really checking the internals of the WAVM wasm module, and so was very brittle. The ability to execute functions from pointers is implicitly tested all over the place.

msg.set_ispython(true);
}

TEST_CASE("Test cloning empty modules doesn't break", "[wasm]")
Collaborator Author

Cloning empty modules will now break, as we should never be doing it.

TEST_CASE("Run distributed threading check", "[threads]")
{
runTestDistributed("threads_dist");
}
Collaborator Author

These two functions weren't really testing distributed execution; they were just doing a lot of fiddling with internals to fake up what we think distributed execution will do. Instead I'll leave this to another PR on doing "proper" distributed testing with multiple containers.

// Map of tid to message ID for chained calls
static thread_local std::unordered_map<I32, unsigned int> chainedThreads;
static thread_local std::unordered_map<I32, std::future<int32_t>>
  localThreadFutures;
Collaborator Author

These are really specific to the module and not any given thread.
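A rough sketch of what that change looks like (illustrative, matching the names in the snippet above rather than the exact diff):

#include <cstdint>
#include <future>
#include <unordered_map>

using I32 = int32_t;

// Trimmed-down stand-in for the real WAVMWasmModule: the maps become
// instance members owned by the module, rather than thread_local globals
class WAVMWasmModule
{
  public:
    // Map of wasm tid to message ID for chained calls
    std::unordered_map<I32, unsigned int> chainedThreads;

    // Futures for threads executed locally
    std::unordered_map<I32, std::future<int32_t>> localThreadFutures;
};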

Runtime::Memory* memoryPtr = parentModule->defaultMemory;
faabric::Message* parentCall = getExecutingCall();

// Set up number of threads for next level
Collaborator Author

Changes from here on are mostly to do with refactoring to the new Faabric set-up where we don't need to worry about spawning our own tasks etc.

// removed
[[maybe_unused]] bool compartmentCleared =
  Runtime::tryCollectCompartment(std::move(compartment));
assert(compartmentCleared);
Collaborator Author

This [[maybe_unused]] fixes an error in the release build, where the compiler complains that compartmentCleared is unused (because the assert is compiled out).
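A self-contained illustration of the mechanism (tryCollect is just a stand-in for Runtime::tryCollectCompartment):

#include <cassert>

bool tryCollect() // stand-in for Runtime::tryCollectCompartment
{
    return true;
}

int main()
{
    // In release builds NDEBUG is defined and assert() expands to nothing,
    // so without [[maybe_unused]] the compiler warns that the variable is
    // unused (and -Werror turns that warning into a build failure)
    [[maybe_unused]] bool cleared = tryCollect();
    assert(cleared);
    return 0;
}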

globalOffsetMemoryMap.clear();
missingGlobalOffsetEntries.clear();

dynamicPathToHandleMap.clear();
Collaborator Author

Resetting normal instance variables will be handled by the default destructor.

@@ -0,0 +1,61 @@
#include <wavm/WAVMWasmModule.h>
Collaborator Author

This whole WAVMModuleCache class used to be the WasmModuleCache and have a lot more unnecessary complexity.

std::string newOutput = moduleStdout + "\n" + msg.outputdata();
msg.set_outputdata(newOutput);

// Check if we can add a thread
Collaborator Author

All this thread pooling stuff is handled in Faabric now.

@Shillaker Shillaker requested a review from csegarragonz May 13, 2021 12:00
Collaborator
@csegarragonz csegarragonz left a comment

LGTM. Made a small point on SharedVars, and on maybe introducing assertm, a macro for assert + message.

// Note - avoid a zero default on the thread request type otherwise it can
// go unset without noticing
enum ThreadRequestType
{
Collaborator

I sometimes try and add a 0 default myself, just in case. Not sure if it is actually necessary.
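To make the trade-off concrete (the enumerator names and values below are hypothetical, not the actual Faasm enum):

// Option 1 (what the comment in the diff describes): leave 0 unused, so a
// value that was never set (and is therefore zero-initialised, or left at a
// serialised message's default) can't be mistaken for a real request type
enum ThreadRequestType
{
    PTHREAD_REQUEST = 1,
    OPENMP_REQUEST = 2,
};

// Option 2 (the "0 default" suggestion): reserve 0 as an explicit sentinel
// that call sites can check for and reject
enum ThreadRequestTypeWithDefault
{
    NO_THREAD_REQUEST = 0,
    PTHREAD_THREAD_REQUEST = 1,
    OPENMP_THREAD_REQUEST = 2,
};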


void Level::setSharedVars(uint32_t* ptr, int nVars)
{
sharedVars = new uint32_t[nVars];
Collaborator

I am wondering if it is worth creating an abstraction for a SharedVar that basically wraps around uint32_t. I am concerned by the number of times in this file we rely on shared vars being uint32_t (e.g. in serialize and deserialize).

Collaborator Author
@Shillaker Shillaker May 14, 2021

First off, sharedVars is probably not a great name as they're really wasm offsets, so sharedVarOffsets might be better. Because they're wasm offsets, they'll always be uint32_t.

I'm not sure what sort of abstraction would make things any safer or clearer here. It would have to be serialisable so would need to be composed of primitives itself, so it could be a struct that held the value and size, e.g.

struct SharedVarPtr {
    uint32_t val;
    size_t size = sizeof(uint32_t);
};

but I'm not sure that adds much clarity or extra safety (plus the size would always be constant). I could add a #define SHARED_VAR_OFFSET_SIZE sizeof(uint32_t) at the top to make that part explicit, but it's only used in two places so I'm not sure it's worth it.

Any mistakes in the serialisation and deserialisation should be picked up in the tests, so I'm happy to leave this as is.

depth);
}

assert(localThreadNum >= 0);
Collaborator

Checking, throwing an error, and asserting seems a bit counter-intuitive/verbose to me. What do you think about using something like:

#define assertm(exp, msg) assert(((void)msg, exp))
...
assertm(localThreadNum >= 0, fmt::format("Local thread num negative {} - {} @ {}", msg->appindex(), globalTidOffset, depth));

taken from cppreference.

Collaborator Author

Hmm yes, you're right; I'll just add an exception in this case.

WAVMWasmModule* thisModule = getExecutingWAVMModule();
unsigned int callId = thisModule->chainedThreads[pthreadPtr];
logger->debug("Awaiting pthread: {} ({})", pthreadPtr, callId);
faabric::scheduler::Scheduler& sch = faabric::scheduler::getScheduler();
Collaborator

I think we decided it was ok to use auto& for the scheduler's getter.

conf::FaasmConfig& conf = conf::getFaasmConfig();
int32_t defaultPoolSize = conf.moduleThreadPoolSize;
conf.moduleThreadPoolSize = 15;
faabric::util::SystemConfig& conf = faabric::util::getSystemConfig();
Collaborator

Should we use auto& here as well? (Feel free to ignore)

Collaborator Author
@Shillaker Shillaker May 14, 2021

We should use auto to reduce verbosity on very commonly used things, but not when it would impact the readability/understanding of the code. The logger is a good example: it's used in almost every file, its type name is long, and we only have one of them. Adding auto on the Scheduler is definitely ok in places where it's used a lot in the same file (e.g. tests), plus we only have one scheduler so you know what type it is. Given that we have two configs (faasm and faabric), I would say it's less clear whether to have a blanket ruling on conf, as it might harm readability, but it's probably not that big a deal either way.

throw std::runtime_error(
"Cannot execute function on module bound to another");
}
throw std::runtime_error("Module must be bound before executing");
Collaborator

Maybe another use case for assertm()?

Collaborator Author
@Shillaker Shillaker May 14, 2021

Hmm, it's a good question. I think assertm is probably never preferable to an if and an exception, other than the fact that it saves some lines. I do like that assertm can be done in one line, and hence doesn't clutter the code, but because it's removed in non-debug builds it risks missing errors in deployments. It would be nice to save some lines on checks like this though, so perhaps it should be replaced with an assertm-like macro that doesn't get removed.

I would say we should only use asserts in places where they will never fail unless there is something wrong with the logic contained wholly within that method, or there's a threading issue like a race condition, i.e. places where you wouldn't otherwise bother with an if/exception combo and that we can be pretty sure will be caught in tests. I see asserts as a mix of documentation and testing, i.e. they tell you as the reader something relatively obvious, e.g. "this variable in the loop should be zero by the end", or "this thing will never be null, so you can always make that assumption in code that uses this function".

There are lots of arguments against assert though, so if we had a macro like assertm that didn't actually use assert, we could just replace all error checking and assertions with that (a rough sketch is below).
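A minimal sketch of what such a macro could look like (this is a suggestion/assumption, not something that exists in Faasm or Faabric):

#include <sstream>
#include <stdexcept>
#include <string>

// Unlike assert, this is not compiled out in release builds: a failing check
// always throws, with the failing expression and source location in the
// message
#define CHECK_WITH_MSG(exp, msg)                                              \
    do {                                                                      \
        if (!(exp)) {                                                         \
            std::ostringstream oss;                                           \
            oss << "Check failed: " << #exp << " (" << (msg) << ") at "       \
                << __FILE__ << ":" << __LINE__;                               \
            throw std::runtime_error(oss.str());                              \
        }                                                                     \
    } while (0)

// Usage, replacing an if/exception combo with one line:
// CHECK_WITH_MSG(localThreadNum >= 0, "local thread num negative");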

if (t.second.joinable()) {
t.second.join();
}
if (returnValue != 0) {
Collaborator

assertm() maybe?

@Shillaker Shillaker merged commit 71f2e47 into master May 14, 2021
@Shillaker Shillaker deleted the batch-execution branch May 14, 2021 06:39