[Discussion] Serialize objects within tasks #4673

@gjoseph92

When Blockwise materializes its graph on the scheduler, we'd sometimes like to be able to insert bits of data from the client that aren't msgpack-serializable into that graph. Ideally, we could wrap these things in to_serialize on the client, access them as opaque Serialized blobs on the scheduler to insert them into tasks, and when the worker receives the task, the Serialized blobs within would automatically be deserialized.

However, most tasks are pre-serialized by dumps_task, which calls pickle directly. So if a task contains a Serialize object, the normal serialization machinery on the comm will never know, since it's just bytes by the time it reaches dumps. Same thing in the other direction. The reason for this special case is to cache serialized functions—particularly for numba functions, where we want to reuse the same function object instead of re-jitting every time.
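To make the problem concrete, here is a minimal sketch using only the standard library. `Marker` and `find_markers` are hypothetical stand-ins (for a wrapper like `Serialize` and for whatever traversal the comm layer does), not the real distributed code. It only illustrates that once a task has been pickled to bytes, a traversal of the outer message cannot see wrappers nested inside it.

```python
import pickle


class Marker:
    """Hypothetical stand-in for a wrapper such as Serialize."""

    def __init__(self, data):
        self.data = data


def find_markers(obj):
    """Walk lists, tuples, and dicts, collecting any Marker instances."""
    if isinstance(obj, Marker):
        return [obj]
    if isinstance(obj, dict):
        obj = list(obj.values())
    if isinstance(obj, (list, tuple)):
        found = []
        for item in obj:
            found.extend(find_markers(item))
        return found
    # Anything else (including bytes) is treated as an opaque leaf.
    return []


task = (sum, [1, 2, Marker({"not": "msgpack-friendly"})])

# Message carrying the task as a live structure: the marker is visible.
visible = find_markers({"op": "compute", "task": task})

# Message carrying the task pre-pickled: the walker only sees bytes.
hidden = find_markers({"op": "compute", "task": pickle.dumps(task)})

print(len(visible), len(hidden))
```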

I think our requirements are:

  1. If tasks contain Serialize or Serialized objects, those objects are automatically deserialized on the worker, but not on the scheduler.
  2. Serialized functions are cached, like they currently are with loads_function and dumps_function. That is, deserialize(serialize(func)) is deserialize(serialize(func)) (unless serialize(func) is huge). Okay if this doesn't apply to serialize in general and is specialized to tasks.
  3. Users still see the warn_dumps warning when submitting large tasks.
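As a rough illustration of requirement 2, the identity property could look something like the sketch below. `loads_cached`, `_CACHE`, and `_MAX_CACHED_BYTES` are made-up names and the threshold is an arbitrary assumption; this is not the existing `loads_function` implementation. A `functools.partial` is used as the callable because plain pickle builds a new object each time it is loaded, which makes the effect of the cache visible.

```python
import functools
import operator
import pickle

_CACHE = {}  # hypothetical cache: pickled bytes -> deserialized callable
_MAX_CACHED_BYTES = 100_000  # assumed threshold, not a real setting


def loads_cached(payload: bytes):
    """Deserialize a callable, reusing the same object for identical bytes."""
    if len(payload) > _MAX_CACHED_BYTES:
        # Huge payloads skip the cache so they do not pin memory.
        return pickle.loads(payload)
    try:
        return _CACHE[payload]
    except KeyError:
        func = _CACHE[payload] = pickle.loads(payload)
        return func


payload = pickle.dumps(functools.partial(operator.add, 1))

# Plain pickle builds a fresh object on every load.
uncached_same = pickle.loads(payload) is pickle.loads(payload)

# The cached loader hands back the same object both times.
cached_same = loads_cached(payload) is loads_cached(payload)

print(uncached_same, cached_same)
```

The dict here is unbounded for brevity; anything real would presumably want a bounded cache.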

Nice-to-haves:

  1. __dask_distributed_*pack__ doesn't use dumps_msgpack/loads_msgpack directly; it just produces/consumes the marshalled objects (which may contain Serialize/Serialized) and lets comm serialization handle the rest. I think this separation of concerns would be nice for longer-term maintenance.
  2. make_blockwise_graph doesn't need a deserializing argument (doesn't call dumps_function/warn_dumps, or handle stringify_collection_keys / formatting tasks into a dict)—again, would be nice to separate concerns. I'm wondering if make_blockwise_graph could switch to a generator: dict(make_blockwise_graph(...)) would give you a plain dask, but Scheduler.update_graph_hlg would provide the logic to handle the key-stringification, reformatting to a task dict, and possibly even doing something to mark the function as needing to be cached. A generator could help separate concerns without adding the overhead of a full dict traversal.
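The generator idea in the second item might look roughly like this. `iter_blockwise_tasks` and `to_scheduler_format` are hypothetical names, and the output task format is invented for illustration; the real `make_blockwise_graph` and `Scheduler.update_graph_hlg` are far more involved. The point is only that one producer can feed both a plain `dict(...)` and a scheduler-side consumer that reformats in the same pass.

```python
def iter_blockwise_tasks(func, out_name, in_name, nblocks):
    """Hypothetical generator: yield (key, task) pairs one at a time."""
    for i in range(nblocks):
        yield (out_name, i), (func, (in_name, i))


def to_scheduler_format(pairs):
    """Stringify keys and reshape tasks in the same pass that consumes them."""
    out = {}
    for key, (func, *args) in pairs:
        out[str(key)] = {
            "function": func,
            # In this toy graph every argument is a key, so stringify them all.
            "args": tuple(str(a) for a in args),
        }
    return out


# Client-side use: a plain dask-style graph.
plain = dict(iter_blockwise_tasks(abs, "y", "x", 2))

# Scheduler-side use: reformat while iterating, with no extra traversal.
sched = to_scheduler_format(iter_blockwise_tasks(abs, "y", "x", 2))

print(plain)
print(sched)
```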

Ideas:
I think the general idea is to not have dumps_task serialize to bytes anymore, but just return Serialize objects? Maybe even get rid of dumps_task entirely? This would delegate all serialization-to-bytes to the comm level.

  1. Move function-caching logic into core serialize/deserialize. For deserialize, this will lead to a lot of cache misses. Bad idea.
  2. Register a dask_serialize method for FunctionType that caches, then just calls pickle. Of course, this won't cache other callables besides functions. Mediocre idea.
  3. Register a dummy serialization family for "task", which caches, then just calls back into deserialize. By marking things as "task" when serializing, we can reduce cache misses. Kinda hacky, could work?

Alternatively, the quickest solution would be to leave everything as it is, and have worker._deserialize just call deserialize on the args/kwargs, to unpack any Serialize/Serialized objects they may contain. But I imagine that is undesirable, especially when we already have this nice machinery for handling nested serialization with to_serialize and Serialized.
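For reference, the quick fallback amounts to something like the following sketch. `Blob` and `unpack` are hypothetical stand-ins for `Serialized` and for the `deserialize` call on args/kwargs, and pickle stands in for whatever serializer produced the blob. Note that it walks every container in every task, whether or not anything is wrapped.

```python
import pickle


class Blob:
    """Hypothetical stand-in for a Serialized object: opaque pickled bytes."""

    def __init__(self, payload: bytes):
        self.payload = payload


def unpack(obj):
    """Recursively replace every Blob in args/kwargs with its decoded value."""
    if isinstance(obj, Blob):
        return pickle.loads(obj.payload)
    if isinstance(obj, (list, tuple)):
        return type(obj)(unpack(x) for x in obj)
    if isinstance(obj, dict):
        return {k: unpack(v) for k, v in obj.items()}
    return obj


args = (1, [Blob(pickle.dumps({"a": 2})), 3])
kwargs = {"scale": Blob(pickle.dumps(2.5))}

print(unpack(args))
print(unpack(kwargs))
```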

Pitfalls:

  1. If tasks are being deserialized at the comm level, what will errors look like to users when unpickling a task fails (such as due to a missing import)? What will happen to the task from the scheduler's perspective?
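To pin down the failure mode in question, here is a small standard-library sketch. The hand-written payload refers to a module that does not exist, and `try_load` is a hypothetical helper showing one way the error could be captured and reported rather than raised from inside the comm. It says nothing about what distributed does today.

```python
import pickle

# Hand-written protocol-0 pickle that refers to `missing_module_for_demo.func`.
payload = b"cmissing_module_for_demo\nfunc\n."


def try_load(payload: bytes):
    """Hypothetical helper: turn a deserialization failure into a report."""
    try:
        return {"status": "OK", "value": pickle.loads(payload)}
    except Exception as exc:
        return {
            "status": "error",
            "exception": type(exc).__name__,
            "text": str(exc),
        }


result = try_load(payload)
print(result)
```

Wherever the load happens, something has to decide whether a report like this gets attached to the task (so the user sees it on the future) or surfaces as a broken connection.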

cc @mrocklin @jrbourbeau @ian-r-rose @jakirkham @madsbk @rjzamora @quasiben: how do these requirements sound? What should be added or removed?
