-
Notifications
You must be signed in to change notification settings - Fork 70
Update to Faabric batch scheduling #411
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
docker-compose.yml
Outdated
| privileged: yes | ||
| volumes: | ||
| - ./dev/faasm/build/:${FAASM_BUILD_MOUNT} | ||
| - ./dev/container/shared_store/:/usr/local/faasm/shared_store/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Revert this
9d83d31 to
a15871f
Compare
tests/utils/worker_utils.cpp
Outdated
| @@ -1,15 +1,17 @@ | |||
| #include "faabric/util/config.h" | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've tried to make this file more high-level and do less interrogation of the Faabric internals, hence there's a lot of red (as with the associated utils.h header)
|
|
||
| bool success = module.tearDown(); | ||
| REQUIRE(success); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WAVM GC will now fail an assertion if it fails. This should never happen unless we change the code in a way that makes it not possible to GC, in which case we should fix it.
| REQUIRE(success); | ||
|
|
||
| // Return code will be equal to the failure case of the function | ||
| REQUIRE(call.returnvalue() == 101); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is really checking the internals of the WAVM wasm module, and so was very brittle. The ability to execute functions from pointers is implicitly tested all over the place.
| msg.set_ispython(true); | ||
| } | ||
|
|
||
| TEST_CASE("Test cloning empty modules doesn't break", "[wasm]") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will now break as we should never be doing it.
| TEST_CASE("Run distributed threading check", "[threads]") | ||
| { | ||
| runTestDistributed("threads_dist"); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two functions weren't really testing distributed execution, they were just doing a lot of fiddling with internals to fake up what we think distributed execution will do. Instead I'll leave this to another PR on doing "proper" distributed testing with multiple containers.
| // Map of tid to message ID for chained calls | ||
| static thread_local std::unordered_map<I32, unsigned int> chainedThreads; | ||
| static thread_local std::unordered_map<I32, std::future<int32_t>> | ||
| localThreadFutures; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are really specific to the module and not any given thread.
| Runtime::Memory* memoryPtr = parentModule->defaultMemory; | ||
| faabric::Message* parentCall = getExecutingCall(); | ||
|
|
||
| // Set up number of threads for next level |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes from here on are mostly to do with refactoring to the new Faabric set-up where we don't need to worry about spawning our own tasks etc.
| // removed | ||
| [[maybe_unused]] bool compartmentCleared = | ||
| Runtime::tryCollectCompartment(std::move(compartment)); | ||
| assert(compartmentCleared); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This [[maybe_unused]] is to fix an error when doing the release build, where it complains that compartmentCleared is unused (because the assert is no-oped)
| globalOffsetMemoryMap.clear(); | ||
| missingGlobalOffsetEntries.clear(); | ||
|
|
||
| dynamicPathToHandleMap.clear(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Resetting normal instance variables will be handled by the default destructor.
| @@ -0,0 +1,61 @@ | |||
| #include <wavm/WAVMWasmModule.h> | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This whole WAVMModuleCache class used to be the WasmModuleCache and have a lot more unnecessary complexity.
| std::string newOutput = moduleStdout + "\n" + msg.outputdata(); | ||
| msg.set_outputdata(newOutput); | ||
|
|
||
| // Check if we can add a thread |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All this thread pooling stuff is handled in Faabric now
csegarragonz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Made a small point on SharedVars and maybe introducing assertm a macro for assert + message
| // Note - avoid a zero default on the thread request type otherwise it can | ||
| // go unset without noticing | ||
| enum ThreadRequestType | ||
| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I sometimes try and add a 0 default myself, just in case. Not sure if it is actually necessary.
src/threads/ThreadState.cpp
Outdated
|
|
||
| void Level::setSharedVars(uint32_t* ptr, int nVars) | ||
| { | ||
| sharedVars = new uint32_t[nVars]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am wondering if it is worth creating an abstraction for a SharedVar that basically wraps around uint32_t. I am concerned with the amount of times in this file we rely on shared vars being uint32_t (like serialize, deserialize).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First off sharedVars is probably not a great name as they're really wasm offsets, so sharedVarOffsets might be better. Because they're wasm offsets, they'll always be uint32_t.
I'm not sure what sort of abstraction would make things any safer or clearer here. It would have to be serialisable so would need to be composed of primitives itself, so it could be a struct that held the value and size, e.g.
struct SharedVarPtr {
uint32_t val;
size_t size = sizeof(uint32_t);
}
but I'm not sure that adds much clarity or extra safety (plus the size would always be constant). I could add a #define SHARED_VAR_OFFSET_SIZE sizeof(uint32_t) at the top to make that part explicit, but it's only used in two places so I'm not sure it's worth it.
Any mistakes in the serialisation and deserialisation should be picked up in the tests, so I'm happy to leave this as is.
src/threads/ThreadState.cpp
Outdated
| depth); | ||
| } | ||
|
|
||
| assert(localThreadNum >= 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Checking, throwing an error, and asserting seems a bit counter-intuitive/verbose to me. What do you think about using something like:
#define assertm(exp, msg) assert(((void)msg, exp))
...
assertm(localThreadNum >= 0, fmt::format("Local thread num negative {} - {} @ {}", msg->appindex(), globalTidOffset, depth));taken from cppreference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm yes you're right, I'll just add an exception in this case.
src/wavm/threads.cpp
Outdated
| WAVMWasmModule* thisModule = getExecutingWAVMModule(); | ||
| unsigned int callId = thisModule->chainedThreads[pthreadPtr]; | ||
| logger->debug("Awaiting pthread: {} ({})", pthreadPtr, callId); | ||
| faabric::scheduler::Scheduler& sch = faabric::scheduler::getScheduler(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we decided it was ok to use auto& for the scheduler's getter
| conf::FaasmConfig& conf = conf::getFaasmConfig(); | ||
| int32_t defaultPoolSize = conf.moduleThreadPoolSize; | ||
| conf.moduleThreadPoolSize = 15; | ||
| faabric::util::SystemConfig& conf = faabric::util::getSystemConfig(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we use auto& here as well? (Feel free to ignore)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should use auto to reduce verbosity on very commonly used things, but not when it would impact the readability/ understanding of the code. The logger is a good example of where it's used in almost every file, its type name is long, and we only have one of them. Adding auto on the Scheduler is definitely ok in places where it's used a lot in the same file (e.g. tests), plus we only have one scheduler so you know what type it is. Given that we have two configs (faasm and faabric) I would say it's less clear whether to have a blanket ruling on conf as it might harm the readability, but it's probably not that big a deal either way.
| throw std::runtime_error( | ||
| "Cannot execute function on module bound to another"); | ||
| } | ||
| throw std::runtime_error("Module must be bound before executing"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe another use case for assertm()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm it's a good question. I think assertm is probably never preferable to an if and an exception, other than the fact that it saves some lines. I do like the fact that assertm can be done in one line, hence doesn't clutter the code, but because it's removed in non-debug builds it risks missing errors in deployments. It would be nice to save some lines on checks like this though, so perhaps it should be replaced with an assertm-like macro that doesn't get removed.
I would say we should only use assert in places where they will never fail unless there is something wrong with the logic contained wholly within that method, or a threading issue like a race condition, i.e. places where you wouldn't otherwise bother with an if/ exception combo and that we can be pretty sure will be caught in tests. I see asserts as a mix of documentation and testing, i.e. they tell you as the reader something relatively obvious e.g. "this variable in the loop should be zero by the end", or "this thing will never be null, so you can always make that assumption in code that uses this function".
There's lots of arguments against assert though, so if we had a that macro like assertm that didn't actually use assert, we could just replace all error checking and assertions with that.
| if (t.second.joinable()) { | ||
| t.second.join(); | ||
| } | ||
| if (returnValue != 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assertm() maybe?
This PR has totally spiralled, it's exposed a few concurrency issues that have led to some whack-a-mole rewriting and refactoring. In general I've tried to remove complexity wherever possible, and add assertions and checks to catch issues.
Summary of changes:
wavmmodule (as this is very specific). Previously this was done through a separatemodule_cachemodule which made the logic hard to follow.ir_cachemodule into thewavmmodule as it's also WAVM-specific (and maybe a candidate for complete removal?)simple_runnerand replace withfunc_runneras now that the executor pooling has been removed, they are essentially equivalent.NetworkNamespaceobjects inside the class, rather than having wasm modules manage thisthread_localglobal variables where possible, as thread-locality isn't always a guarantee of separation between modules/ functions. Where they do still exist (e.g. withget/setExecutingModule), I've added some relatively strictasserts.There are some OpenMP-specific changes too:
Levelid, to allow distributed threads to find them (without relying on them being available in-memory)