Distributed request tracing

I frequently find myself in a position where I need to answer questions like "Who triggered this coroutine, and why?", "Was this triggered by a remote call via a stream handler or by some internal mechanics? If it was remote, who was the caller?", or "Is this the result of a cascade of tasks, or was this an isolated request?"

Most, if not all, of these questions can be answered with ordinary debugging techniques: print statements, logging with some context information, or a debugger. With the heavy usage of `asyncio`, these techniques have their limits and are also not straightforward to set up.

We started adding keywords like `cause` or `reason` to some functions, e.g. [Worker.release_key](https://github.com/dask/distributed/blob/d5fc324bdef22f19d77c3d36e63a3778ceaac0b0/distributed/worker.py#L2410), as a free-text field that can optionally be used to attach some context information. This hasn't been used consistently throughout the code base (as you can see, we even have two different kwargs for the same thing in this method; one of them is not used, guess which?).

In a microservice world, this problem has been _solved_ by distributed tracing. Oversimplified, this works by attaching a unique ID to every request and including said ID in all subsequently spawned requests, etc. This allows one to correlate a cascade of events with its initial trigger and helps one to understand why and how certain things are happening.
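To make the idea concrete, here is a minimal sketch of how such an ID could be carried through `asyncio` code with the standard-library `contextvars` module. All names (`trace_id_var`, `record`, `handle_message`, `follow_up`, the `trace_id` message field) are hypothetical and not part of distributed; the `events` list merely stands in for whatever logging or tracing backend would be used.

```python
import asyncio
import contextvars
import uuid

# Hypothetical names throughout; none of this is part of distributed's API.
trace_id_var = contextvars.ContextVar("trace_id", default=None)

events = []  # (trace_id, event) records, standing in for structured logs


def record(event):
    """Log an event together with the trace ID of the current context."""
    events.append((trace_id_var.get(), event))


async def follow_up(i):
    """Work spawned by a handler; it sees the handler's trace ID."""
    record(f"follow-up {i}")
    # An outgoing request would carry the ID along to the next server.
    return {"op": f"follow-up {i}", "trace_id": trace_id_var.get()}


async def handle_message(msg):
    """Receiving side: adopt the sender's trace ID, or start a new trace."""
    trace_id_var.set(msg.get("trace_id") or uuid.uuid4().hex)
    record(f"handle {msg['op']}")
    # Tasks copy the current context when they are created,
    # so the spawned coroutines inherit the trace ID set above.
    tasks = [asyncio.create_task(follow_up(i)) for i in range(2)]
    return await asyncio.gather(*tasks)


async def main():
    # Two independent requests; each runs in its own task and context,
    # so setting the ID in one handler does not leak into the other.
    await asyncio.gather(
        asyncio.create_task(handle_message({"op": "release_key", "trace_id": "abc"})),
        asyncio.create_task(handle_message({"op": "gather"})),
    )
    return events
```

Running `asyncio.run(main())` yields six records: three tagged with the incoming ID `"abc"` and three tagged with a freshly generated ID. Filtering on an ID then recovers the whole cascade behind a single trigger, without threading a `cause` or `reason` keyword through every function signature.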

I'm wondering if we can benefit from something like this in our distributed context as well, in particular for the administrative work that the scheduler, worker, nanny, etc. perform.

If this is considered useful, I'm wondering if we should come up with our own definition for such metadata or adhere to some available tooling. If there is tooling, what would be appropriate?

Any thoughts or previous experience on this topic?

Distributed request tracing #4718
