sled agent could better expose information about what it's currently doing #1881

@davepacheco

I realize this is pretty vague, and may turn into a project or just a bunch of more targeted tickets.

Under #1880, when we hit #1879, we spent quite a while just trying to figure out what was going on. It would have been pretty helpful if we'd been able to query the Sled Agent process and see:

  • RSS's state: it had found a pre-existing plan and was attempting to resume executing it
  • that RSS was waiting on a specific bootstrap agent request
  • that the bootstrap agent request was in a retry loop

You can figure a bunch of this out from the log, the code, and a lot of reasoning. It's just time-consuming and error-prone.

I've had some success with past systems having an endpoint for fetching state specifically for debugging. In this case we could report state like:

"subsystems": {
    "rss": {
        "plan": "found", /* alternatively: "uninitialized", "created" */
        "steps": [ {
            "name": "do_the_things",
            "started": $timestamp,
            "done": $timestamp,
            "result": "success"
        }, ..., {
            "name": "initialize_agents",
            "started": $timestamp,
            "attempts": 27,
            "last_attempt": $timestamp,
            "next_attempt": $timestamp,
            "last_attempt_result": { "error": "Sled Agent already initialized" }
        } ]
    }
}
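
To make the shape above a bit more concrete, here is a minimal std-only sketch of types that could back such a report. Everything in it is an assumption for illustration: the names `PlanState`, `StepStatus`, `Step`, and `RssDebugState` are hypothetical (not existing Sled Agent types), timestamps are plain `u64` seconds, and the JSON is rendered by hand only to keep the example dependency-free. A real implementation would more likely derive serialization with a library.

```rust
use std::fmt::Write;

/// Hypothetical plan states, mirroring the "plan" field in the report.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PlanState {
    Uninitialized,
    Created,
    Found,
}

impl PlanState {
    pub fn as_str(self) -> &'static str {
        match self {
            PlanState::Uninitialized => "uninitialized",
            PlanState::Created => "created",
            PlanState::Found => "found",
        }
    }
}

/// Hypothetical per-step status: either finished, or still being retried.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum StepStatus {
    Done { done: u64, result: String },
    Retrying { attempts: u32, last_attempt: u64, next_attempt: u64, last_error: String },
}

/// One entry of the "steps" array. Timestamps are assumed to be epoch seconds.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Step {
    pub name: String,
    pub started: u64,
    pub status: StepStatus,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RssDebugState {
    pub plan: PlanState,
    pub steps: Vec<Step>,
}

/// Quote and escape a string for JSON output.
pub fn json_string(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    out.push('"');
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            c if (c as u32) < 0x20 => {
                // Writing to a String cannot fail, so the result is ignored.
                let _ = write!(out, "\\u{:04x}", c as u32);
            }
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

impl Step {
    pub fn to_json(&self) -> String {
        let head = format!("{{\"name\":{},\"started\":{}", json_string(&self.name), self.started);
        match &self.status {
            StepStatus::Done { done, result } => {
                format!("{},\"done\":{},\"result\":{}}}", head, done, json_string(result))
            }
            StepStatus::Retrying { attempts, last_attempt, next_attempt, last_error } => format!(
                "{},\"attempts\":{},\"last_attempt\":{},\"next_attempt\":{},\"last_attempt_result\":{{\"error\":{}}}}}",
                head, attempts, last_attempt, next_attempt, json_string(last_error)
            ),
        }
    }
}

impl RssDebugState {
    /// Render the whole report in the layout sketched above.
    pub fn to_json(&self) -> String {
        let steps: Vec<String> = self.steps.iter().map(Step::to_json).collect();
        format!(
            "{{\"subsystems\":{{\"rss\":{{\"plan\":{},\"steps\":[{}]}}}}}}",
            json_string(self.plan.as_str()),
            steps.join(",")
        )
    }
}
```

A debugging endpoint could then return `state.to_json()` for whatever snapshot of `RssDebugState` the RSS task last published. Modeling "done" and "retrying" as separate enum variants keeps a step from reporting both a `done` time and a `next_attempt` at once.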

At Joyent we had composable libraries that would emit the pieces of this. For example vasync was a control flow library that emitted something like the "steps" field above. An analog for us might be to have a function that takes a FuturesOrdered and produces the "steps" output above or something like that? The idea was to make this available in a common way and queryable via a single tool like kang, although we didn't wind up doing much with kang. (We did wind up putting all this into some dashboards that were very useful.)
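
As a rough illustration of the "library emits the steps" idea, here is a small sketch of a helper that runs a named step with retries and records its progress in a shared log that a debug endpoint could snapshot. It is deliberately synchronous and std-only so it stands alone; an async analog would wrap futures (for example, ones pushed onto a `FuturesOrdered`) in the same way. The names `StepLog`, `StepRecord`, `Outcome`, and `run_step` are hypothetical, and the retry loop omits backoff for brevity.

```rust
use std::fmt::Display;
use std::sync::{Arc, Mutex};
use std::time::SystemTime;

/// Hypothetical outcome of a step as seen by someone debugging the process.
#[derive(Debug, Clone, PartialEq)]
pub enum Outcome {
    Running,
    /// The last attempt failed and another attempt will be made.
    Retrying(String),
    Success,
    /// The step gave up after exhausting its attempts.
    Failed(String),
}

#[derive(Debug, Clone)]
pub struct StepRecord {
    pub name: String,
    pub started: SystemTime,
    pub attempts: u32,
    pub last_attempt: Option<SystemTime>,
    pub outcome: Outcome,
}

/// Shared, cloneable log of steps. One clone goes to the worker and
/// another to whatever serves the debugging endpoint.
#[derive(Clone, Default)]
pub struct StepLog(Arc<Mutex<Vec<StepRecord>>>);

impl StepLog {
    /// Copy out the current state without blocking the worker for long.
    pub fn snapshot(&self) -> Vec<StepRecord> {
        self.0.lock().unwrap().clone()
    }

    /// Run `op` until it succeeds or `max_attempts` is reached, recording
    /// each attempt. `op` always runs at least once.
    pub fn run_step<T, E: Display>(
        &self,
        name: &str,
        max_attempts: u32,
        mut op: impl FnMut() -> Result<T, E>,
    ) -> Result<T, E> {
        let idx = {
            let mut log = self.0.lock().unwrap();
            log.push(StepRecord {
                name: name.to_string(),
                started: SystemTime::now(),
                attempts: 0,
                last_attempt: None,
                outcome: Outcome::Running,
            });
            log.len() - 1
        };
        loop {
            // The lock is not held while the operation runs, so a
            // concurrent snapshot() can observe the step mid-flight.
            let result = op();
            let mut log = self.0.lock().unwrap();
            let rec = &mut log[idx];
            rec.attempts += 1;
            rec.last_attempt = Some(SystemTime::now());
            match result {
                Ok(value) => {
                    rec.outcome = Outcome::Success;
                    return Ok(value);
                }
                Err(err) => {
                    if rec.attempts >= max_attempts {
                        rec.outcome = Outcome::Failed(err.to_string());
                        return Err(err);
                    }
                    rec.outcome = Outcome::Retrying(err.to_string());
                }
            }
        }
    }
}
```

With something like this, the case described above would show up directly in a snapshot: a step named `initialize_agents` with a growing `attempts` count and `Outcome::Retrying("Sled Agent already initialized")`, rather than something to be inferred from logs.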

Thoughts?

Labels

  • Debugging: For when you want better data in debugging an issue (log messages, post mortem debugging, and more)
  • Sled Agent: Related to the Per-Sled Configuration and Management
