Description
When creating a DAG, e.g., with `dask.delayed`, it is possible to declare a task as "impure" (`pure=False`), meaning that a new unique key is generated even when the input arguments are identical.
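For readers less familiar with this, the existing graph-level behaviour can be seen by comparing keys. This is only a small illustration using `dask.delayed` with `operator.add` as a stand-in function; nothing is computed, only the generated keys are compared.

```python
from operator import add

from dask import delayed

# pure=True: the key is derived from the function and its arguments,
# so two identical calls share one key (and would be one task in a graph).
pure_a = delayed(add, pure=True)(1, 2)
pure_b = delayed(add, pure=True)(1, 2)

# pure=False: every call gets a fresh unique key, even though the
# function and arguments are identical.
impure_a = delayed(add, pure=False)(1, 2)
impure_b = delayed(add, pure=False)(1, 2)

same_pure_key = pure_a.key == pure_b.key
same_impure_key = impure_a.key == impure_b.key
print(same_pure_key, same_impure_key)
```

The impure flag only affects key generation at graph-construction time; the scheduler itself treats both kinds of task the same way afterwards, which is the gap this issue is asking about.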
I am wondering if there is any appetite for a similar concept on the scheduler, where a task key is annotated as side-effect-only, with no useful return value. Use cases might include IO on some external storage, where we want to ensure that an operation happened, but if the worker that executed it goes down, there is no need to repeat it. In other words, where the scheduler state for the task would normally transition to in-memory, it could instead become just "completed" (or released), and any task that depends on it would be allowed to run without having to fetch any results. A set of CSV write tasks with a finalize task depending on all of them would be a good example of this (and the barrier doesn't actually need to execute anything in this case; it is only a meta-task that collects dependencies).
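To make the proposed semantics concrete, here is a toy model in plain Python. It is a sketch of the idea only, not of how the distributed scheduler is implemented: `Task`, `ToyScheduler`, the `side_effect_only` flag, the `"completed"` state name and `lose_worker` are all hypothetical names invented for this illustration, and everything runs in a single process.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Task:
    key: str
    func: Optional[Callable[[], object]] = None  # None marks a barrier meta-task
    deps: tuple = ()
    side_effect_only: bool = False  # hypothetical annotation proposed above
    state: str = "waiting"


class ToyScheduler:
    """Single-process stand-in for a scheduler (illustration only)."""

    def __init__(self, tasks):
        self.tasks = {t.key: t for t in tasks}
        self.results = {}     # values held "on a worker" for memory tasks
        self.run_counts = {}  # how often each task body actually executed

    def _ready(self, task):
        done = ("memory", "completed")
        return all(self.tasks[d].state in done for d in task.deps)

    def run(self):
        progressed = True
        while progressed:
            progressed = False
            for task in self.tasks.values():
                if task.state != "waiting" or not self._ready(task):
                    continue
                value = None
                if task.func is not None:
                    value = task.func()
                    self.run_counts[task.key] = self.run_counts.get(task.key, 0) + 1
                if task.side_effect_only or task.func is None:
                    # Nothing worth keeping: no result is stored anywhere.
                    task.state = "completed"
                else:
                    self.results[task.key] = value
                    task.state = "memory"
                progressed = True

    def lose_worker(self):
        # Results held in memory are lost and must be recomputed;
        # "completed" tasks have nothing to lose, so they are left alone.
        for task in self.tasks.values():
            if task.state == "memory":
                del self.results[task.key]
                task.state = "waiting"


written = []  # stands in for files written to external storage
tasks = [
    Task(f"write-{i}", func=lambda i=i: written.append(f"part-{i}.csv"),
         side_effect_only=True)
    for i in range(3)
]
tasks.append(Task("load", func=lambda: 42))  # ordinary task with a result
tasks.append(Task("finalize", deps=tuple(f"write-{i}" for i in range(3))))

sched = ToyScheduler(tasks)
sched.run()
sched.lose_worker()  # simulate the worker going down
sched.run()          # only "load" is recomputed; the writes are not repeated
```

In this model the ordinary `load` task runs twice because its result was lost with the worker, while each write runs once and the `finalize` barrier never executes anything. The sketch says nothing about concurrent execution on two workers.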
This pattern would weakly move towards tasks that are executed exactly once, where the side effect is mutation of some resource. It would not guard against a task being run simultaneously on two workers, which is the opposite of what speculative execution wants. It's probably not feasible to make a strict exactly-once guarantee without a lot of work.
Feel free to say that this is unnecessary complexity. I am thinking of it in terms of shared mutable memory between processes on a single node, but the big IO case is also interesting.