I realize this is pretty vague, and may turn into a project or just a bunch of more targeted tickets.
Under #1880, when we hit #1879, we spent a while just trying to figure out what was going on. It would have been pretty helpful if we'd been able to query the Sled Agent process and see:
- RSS's state: it had found a pre-existing plan and was attempting to resume executing it
- that RSS was waiting on a specific bootstrap agent request
- that the bootstrap agent request was in a retry loop
You can figure a bunch of this out from the log, the code, and a lot of reasoning. It's just time-consuming and error-prone.
I've had some success with past systems having an endpoint for fetching state specifically for debugging. In this case we could report state like:
    "subsystems": {
        "rss": {
            "plan": "found", /* alternatively: "uninitialized", "created" */
            "steps": [ {
                "name": "do_the_things",
                "started": $timestamp,
                "done": $timestamp,
                "result": "success",
            }, ..., {
                "name": "initialize_agents",
                "started": $timestamp,
                "attempts": 27,
                "last_attempt": $timestamp,
                "next_attempt": $timestamp,
                "last_attempt_result": { "error": "Sled Agent already initialized" }
            } ]
        }
    }
At Joyent we had composable libraries that would emit the pieces of this. For example vasync was a control flow library that emitted something like the "steps" field above. An analog for us might be to have a function that takes a FuturesOrdered and produces the "steps" output above or something like that? The idea was to make this available in a common way and queryable via a single tool like kang, although we didn't wind up doing much with kang. (We did wind up putting all this into some dashboards that were very useful.)
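To make the "steps" idea a bit more concrete, here's a rough sketch of what a tracker like that could look like. This is not existing Sled Agent or RSS code: `StepTracker`, `StepReport`, `run`, `to_json`, and `demo` are all made-up names, and it is synchronous and std-only to keep it short. A real version would presumably wrap futures (e.g. the contents of a `FuturesOrdered`) and serialize with something like serde rather than by hand.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Debug record for one step. Hypothetical type, for illustration only.
#[derive(Debug, Clone)]
struct StepReport {
    name: String,
    started: u64,               // seconds since the Unix epoch
    done: Option<u64>,          // set only once the step succeeds
    attempts: u32,              // how many times the step has been tried
    last_error: Option<String>, // error from the most recent failed attempt
}

/// Runs named steps in order and remembers what happened to each one.
struct StepTracker {
    steps: Vec<StepReport>,
}

fn now_secs() -> u64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_secs())
        .unwrap_or(0)
}

/// Minimal JSON string escaping; real code would likely use a JSON library.
fn escape(s: &str) -> String {
    s.replace('\\', "\\\\").replace('"', "\\\"").replace('\n', "\\n")
}

impl StepTracker {
    fn new() -> Self {
        StepTracker { steps: Vec::new() }
    }

    /// Runs `step` up to `max_attempts` times, recording every attempt.
    fn run<F>(&mut self, name: &str, max_attempts: u32, mut step: F) -> Result<(), String>
    where
        F: FnMut() -> Result<(), String>,
    {
        self.steps.push(StepReport {
            name: name.to_string(),
            started: now_secs(),
            done: None,
            attempts: 0,
            last_error: None,
        });
        let report = self.steps.last_mut().expect("just pushed");
        while report.attempts < max_attempts {
            report.attempts += 1;
            match step() {
                Ok(()) => {
                    report.done = Some(now_secs());
                    report.last_error = None;
                    return Ok(());
                }
                Err(e) => report.last_error = Some(e),
            }
        }
        Err(report
            .last_error
            .clone()
            .unwrap_or_else(|| "no attempts made".to_string()))
    }

    /// Renders the recorded steps in roughly the shape of the "steps" field.
    fn to_json(&self) -> String {
        let steps: Vec<String> = self
            .steps
            .iter()
            .map(|s| {
                let mut fields = vec![
                    format!("\"name\": \"{}\"", escape(&s.name)),
                    format!("\"started\": {}", s.started),
                    format!("\"attempts\": {}", s.attempts),
                ];
                if let Some(done) = s.done {
                    fields.push(format!("\"done\": {}", done));
                    fields.push("\"result\": \"success\"".to_string());
                }
                if let Some(err) = &s.last_error {
                    fields.push(format!(
                        "\"last_attempt_result\": {{ \"error\": \"{}\" }}",
                        escape(err)
                    ));
                }
                format!("{{ {} }}", fields.join(", "))
            })
            .collect();
        format!("{{ \"steps\": [ {} ] }}", steps.join(", "))
    }
}

/// Example: one step that succeeds, then one that keeps failing.
fn demo() -> StepTracker {
    let mut tracker = StepTracker::new();
    tracker.run("do_the_things", 1, || Ok(())).unwrap();
    let _ = tracker.run("initialize_agents", 3, || {
        Err("Sled Agent already initialized".to_string())
    });
    tracker
}
```

Calling `demo().to_json()` gives one finished step and one step stuck at three attempts with its last error attached, which is the kind of thing we'd have wanted to see while debugging #1879. The point of keeping the tracker separate from the steps themselves is that a debug endpoint could just hold a reference to it and dump it on request.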
Thoughts?