
Conversation

@Shillaker
Collaborator

@Shillaker Shillaker commented Apr 22, 2021

The purpose of this change is to unify the treatment of threads and functions under a single, simple executor interface. This is necessary to support remote threading and scheduling of distributed threads, making them first-class citizens alongside functions (in terms of scheduling decisions).

The key changes are:

  • The scheduler now controls the creation and destruction of executors (rather than having a queue between the scheduler and a pool of executors)
  • Rather than running each executor in its own thread, executors will lazily create their own worker threads when needed. For a simple function, this means the executor will create a single worker thread to execute that function. If that function then spawns more threads, the executor's thread pool will grow.
  • To simplify the logic, there is no longer the concept of cold and warm executors. Executors only exist bound to a function or don't exist. This removes an extra layer of messaging and queueing.
  • The interface required from Executor subclasses is much simpler, with each only needing to implement executeTask, which takes a bulk request and a set of indexes saying which messages from that bulk request it needs to execute. For a simple function this request will hold one message and the list of indexes will be {0}.
  • Protobuf messages involved in function execution are not copied unless absolutely necessary (when receiving in the function call server).
  • Faabric tests now include more testing of the Executor subclasses, including error handling, how they execute functions and threads etc.
  • I've removed the use of the word "faaslet" wherever I found it, as it's Faasm-specific; wherever it appeared is now "executor".
  • Closes #65 ("Potential scheduler bug? nextMsgIdx not updated when scheduling to other hosts") by simplifying the scheduler logic

@Shillaker Shillaker self-assigned this Apr 22, 2021
@Shillaker Shillaker changed the title Updates to batch execution Move thread pooling into Faabric Apr 22, 2021
int32 id = 1;
int32 appId = 2;
int32 appIndex = 3;
string masterHost = 4;
Collaborator Author

@Shillaker Shillaker Apr 26, 2021

I introduced these fields, but rearranged and renumbered the rest as they didn't make sense. According to the protobuf docs it's also most efficient to use numbers 1-15 for regularly used fields, so I made sure that's the case.
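A minimal sketch of the resulting layout, using only the field names visible in the snippet above (the remaining fields and their exact numbering are not shown in this diff, so they are left out). Per the protobuf encoding documentation, field numbers 1–15 encode the tag and wire type in a single byte, so the most frequently set fields should use them:

```proto
// Sketch only: fields beyond these four are omitted here.
message Message {
    int32 id = 1;
    int32 appId = 2;
    int32 appIndex = 3;
    string masterHost = 4;
    // Less frequently used fields continue from 5 upwards.
}
```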

@Shillaker Shillaker changed the title Move thread pooling into Faabric Remote thread execution with thread pooling Apr 26, 2021
@Shillaker Shillaker changed the title Remote thread execution with thread pooling Remote thread execution and thread pooling Apr 27, 2021
@@ -0,0 +1,22 @@
#pragma once
Collaborator Author

DummyExecutor and the associated factory are needed to set a default in tests


int main(int argc, char** argv)
{
auto logger = faabric::util::getLogger();
Collaborator Author

This chunk of code was duplicated across all the native MPI examples, so I factored it out.

@Shillaker Shillaker marked this pull request as ready for review May 3, 2021 15:16
@Shillaker Shillaker changed the title Remote thread execution and thread pooling Unifying threads, functions and thread pooling May 3, 2021
@Shillaker Shillaker requested a review from csegarragonz May 3, 2021 15:41
Collaborator

@csegarragonz csegarragonz left a comment

LGTM. We can go over the details offline.

FaabricMain.cpp
)

faabric_lib(runner "${LIB_FILES}")
Collaborator

Do you think we could remove the use of LIB_FILES? I wouldn't normally point it out, but given that the file is completely new, we can already start applying faasm/faasm#413

Collaborator Author

@Shillaker Shillaker May 4, 2021

I'm not 100% sure whether it works with the faabric_lib function; if it were add_library it would be fine. I'll see if it's a straight swap; if not, we can update all uses of faabric_lib as part of that issue.

executeWithTestExecutor(req, false);

auto& sch = faabric::scheduler::getScheduler();
faabric::Message result = sch.getFunctionResult(msgId, 1000);
Collaborator

Could we use conf.boundTimeout instead of these numbers? (I've seen it elsewhere as well.)

Collaborator Author

@Shillaker Shillaker May 4, 2021

Yes, having magic numbers like this isn't great. They're there to make sure the tests time out if something goes wrong, while still giving the system time to perform the action we're testing in the background. The bound timeout is usually longer than we'd want a test to hang before failing, but perhaps I can find a nicer way to set these values. (In this case boundTimeout has already been set back to its default value, which is too long.)

bindQueue = std::make_shared<InMemoryMessageQueue>();

// Set up the initial resources
int cores = faabric::util::getUsableCores();
Collaborator

Do we want to keep using the word cores? It feels weird to initialise an int cores variable and then call set_slots with it.

Collaborator Author

@Shillaker Shillaker May 4, 2021

Not sure I agree: this number is the number of cores, so calling it cores makes sense. We're then setting the number of slots equal to the number of cores, but the two are different things (i.e. we could have more slots than cores if we found that overloading didn't make much of a performance difference); they just happen to be 1:1 at the moment.

throw std::runtime_error("Message without master host");
std::string funcStr = faabric::util::funcToString(msg, false);

// Remove from warm executors
Collaborator

Maybe this operation could be simplified and made more efficient with std::remove_if. Namely, we can do something like:

warmExecutors[funcStr].erase(
  std::remove_if(warmExecutors[funcStr].begin(),
                 warmExecutors[funcStr].end(),
                 [&exec](const auto& execPtr) {
                     return execPtr->id == exec->id;
                 }),
  warmExecutors[funcStr].end());

It's also worth having this other bad boy in mind: std::remove_copy_if, which copies the elements that do not match the predicate into another container, leaving the source unchanged.

return scheduler;
}

std::vector<std::string> Scheduler::callFunctions(
Collaborator

Edited to point to the right line number.

if (offset < nMessages) {
// At this point we know we need to enlist unregistered hosts
std::unordered_set<std::string> allHosts = getAvailableHosts();
Collaborator

If we were to use an ordered container like std::set instead of std::unordered_set we could use std::set_difference, which feels a bit like what we are doing.

Collaborator Author

@Shillaker Shillaker May 4, 2021

Yes, that might make sense. In general I avoid the ordered collections unless absolutely necessary, as I've seen the insert performance of std::map be noticeably worse than std::unordered_map (enough to actually care about in something like the scheduler). The verbosity of std::set_difference also means that in this case it would actually be more lines of code than what's there now, and we'd still need the loop to iterate through the resulting diff, so I'm not sure it's an easy decision, but I'll have a go.

The bigger worry in this bit of code is that we're querying Redis to get the list of available hosts, which will dwarf any performance benefit from using set_difference, so I'll actually have a look and see if I can remove that too...


void Scheduler::callFunction(faabric::Message& msg, bool forceLocal)
{
// TODO - avoid this copy
Collaborator

AFAICT we are not copying anymore? Could we remove the TODO?

Collaborator Author

I'm pretty sure *req->add_messages() = msg; is doing a copy. If this function weren't so widely used I'd get rid of it altogether; all scheduling of functions should be done through callFunctions. Fortunately it's mostly just used in tests, and I'll port the important bits of Faasm (like the function chaining calls) to use callFunctions.

@Shillaker Shillaker merged commit 7d1badf into master May 4, 2021
@Shillaker Shillaker deleted the batch-execution branch May 4, 2021 09:46