Add one-pager document for Increasing Visibility into Helix Queues #9536

melotic · 2022-06-01T21:13:50Z

Epic: dotnet/dnceng#2657

To double check:

The right tests are in and and the right validation has happened. Guidance: https://github.com/dotnet/arcade/tree/main/Documentation/Validation

Documentation/TeamProcess/One-Pagers/IncreaseVisibilityHelixQueues/design-mockup.md

Documentation/TeamProcess/One-Pagers/increase-visibility-helix-queues-arcade8824.md

Documentation/TeamProcess/One-Pagers/IncreaseVisibilityHelixQueues/design-mockup.md

ChadNedzlek · 2022-06-01T22:11:43Z

Sorry I missed the meeting. Hopefully my comments weren't already addressed there. If so, just respond with what the consensus was and resolve them.

Documentation/TeamProcess/One-Pagers/IncreaseVisibilityHelixQueues/design-mockup.md

Documentation/TeamProcess/One-Pagers/increase-visibility-helix-queues-arcade8824.md

ChadNedzlek · 2022-06-01T22:28:55Z

Documentation/TeamProcess/One-Pagers/increase-visibility-helix-queues-arcade8824.md

+
+*In progress.* (This will be updated to a link to the PR once it's created)
+
+### Risk


I'd include a risk about the accuracy of the data. Queue behavior fluctuates so wildly between outages and evil, bad jobs, that it's really difficult to tell in advance what the behavior of any particular work item will be. I think some sort of system that tracks the "accuracy" would be interesting. Like when a job actually completes, cross reference the "real" time with our estimation, and record that somewhere. That way when, inevitably, customers give us feedback that it's not accurate (we almost always get feedback about stuff not being accurate/true), we have the metrics to actually judge that. Metrics are an important part of a long-term viable project, so you can keep track in the future if assumptions change that your existing services continue to function adequately.

Great point. I'm not sure if this is something I can tackle for the scope of this project, but I've made a note of it.

I should add somewhere that this is maybe a feature worth implementing.

This seems like an important piece to me. Even if we initially tackle showing less information initially, this seems like it is an important requirement to know whether we are giving meaningful numbers to customers. Providing misleading information is usually worse than not providing information.

👍 x 1000. I'm really worried about the accuracy of this information. To the point that I'd rather have ONLY the metrics, and not the UI than the other way around. It's easy enough to add UI to it later, once we decide the information is good, but it's hard to change it without the data necessary to know if the information is correct.

I've been working on quantifying when we'll display that a queue is "overloaded" or experiencing "high volume". My current thought is we can compute 24hr/12hr/6hr moving averages (over the 95th percentile) and compare the values of the moving averages. For instance, we could define overloaded as the 6hr work item wait time moving average being twice the 24hr work item wait time moving average. I've created a Grafana dashboard that will show these computations:

https://dotnet-eng-grafana.westus2.cloudapp.azure.com/d/aOlx_y9nk/helix-insights-ma-crossovers?orgId=1&from=now-7d&to=now&var-QueueName=windows.10.amd64&viewPanel=2

See e9386ce for the definitions of the queue statuses

"moving averages" of... what? Queue depth? Wait times? I'm... not sure what I'm seeing in that graph. But if we compare 6hr to 24hr, wouldn't that just means for 6 hours every morning we called all the queues overloaded? So basically the entire business day is just "overloaded" by default.

Documentation/TeamProcess/One-Pagers/IncreaseVisibilityHelixQueues/design-mockup.md

riarenas

Some initial comments. I think the implementation details are good to have and will be useful, but we should make sure that the problems we're trying to solve and your proposed solution can be understood without having to go into those implementation details.

Documentation/TeamProcess/One-Pagers/increase-visibility-helix-queues-arcade8824.md

Documentation/TeamProcess/One-Pagers/IncreaseVisibilityHelixQueues/design-mockup.md

melotic · 2022-06-03T20:06:30Z

For now, Chad, Ricardo and I decided its best to draw a line in the sand and scrap the Queue Insights section for now. There are a lot of challenges that prevent us from being able to accurately present insights of queues, namely:

Past queue activity doesn't predict future activity
When the PR is submitted, it can be any where from 1-3 hours to when your tests run, and by then the status of the queues could have changed. We could solve this by creating & populating the check when the tests are about to run, but the effort in linking helix runs to a specific PR is a challenging problem. We also would not have anything to show the user untill the build is complete.. It is still on radar to complete after an initial POC of the features I want to implement is done.

See the revised mockup for the scope of features I'll be implementing.

markwilkie · 2022-06-14T18:27:19Z

I really like the mini-graphs.

Am I understanding correctly that we're only showing queue status as a GH check? (no separate dashboard?)

melotic · 2022-06-14T18:34:29Z

I really like the mini-graphs.

Am I understanding correctly that we're only showing queue status as a GH check? (no separate dashboard?)

Yes. The GH check has links to the Grafana dashboard for that specified queue. (ubuntu.1804.amd64.open --> open's link to that respective Grafana dashboard)

markwilkie · 2022-06-14T20:13:18Z

Yes. The GH check has links to the Grafana dashboard for that specified queue. (ubuntu.1804.amd64.open --> open's link to that respective Grafana dashboard)

Cool, that makes sense. This means the current plan doesn't include the "red, yellow, green" state? This is probably fine, but I don't want to promise or imply that we're doing it if we're not.

riarenas · 2022-06-14T20:41:43Z

Cool, that makes sense. This means the current plan doesn't include the "red, yellow, green" state? This is probably fine, but I don't want to promise or imply that we're doing it if we're not.

It currently does not. We determined that the goal so far is to help individual developers understand what's going on with Helix in the context of their PRs.

A dashboard or a general status page are currently out of scope for this.

markwilkie · 2022-06-14T22:14:12Z

A dashboard or a general status page are currently out of scope for this.

ACK - makes sense. Thanks

Documentation/TeamProcess/One-Pagers/increase-visibility-helix-queues-arcade8824.md

riarenas · 2022-06-15T15:00:07Z

Documentation/TeamProcess/One-Pagers/increase-visibility-helix-queues-arcade8824.md

+
+3. Query the work item wait time and queue size for that pipeline's list of queues.
+   1. Currently Grafana has this data, with Kusto queries that we can pull and use.
+   2. We will simply pull the queries that Grafana uses them to present the data.


Suggested change

2. We will simply pull the queries that Grafana uses them to present the data.

2. We will simply pull the queries that Grafana uses to present the data.

this very minor typo is still here

Documentation/TeamProcess/One-Pagers/increase-visibility-helix-queues-arcade8824.md

riarenas · 2022-06-15T15:09:16Z

Took me a while to make a full pass again. I think this is almost ready to go. Needs to be cleaned up a bit after the latest changes because there are some contradictions in some places.

I'd also like to see what your plan for rolling this out to customers would be. This isn't something we want to just enable everywhere, but should rather enable in a few repositories first so we can get initial feedback.

melotic · 2022-06-23T19:44:53Z

@riarenas ready for review. I've yanked the insight stuff for now. Perhaps this can be merged into the ML one-pager later.

Documentation/TeamProcess/One-Pagers/IncreaseVisibilityHelixQueues/design-mockup-justin-impl.md

riarenas · 2022-06-23T20:16:52Z

LGTM, I think we just need to remove the section about average times for build machines etc from the "commited" mockup.

Add initial one pager

3c4fa96

melotic requested review from a team, ChadNedzlek, Chrisboh, MattGal, garath, ilyas1974, missymessa, riarenas and ulisesh June 1, 2022 21:21