Skip to content

Conversation

@melotic
Copy link
Member

@melotic melotic commented Jun 1, 2022

Epic: dotnet/dnceng#2657

To double check:

@ChadNedzlek
Copy link
Contributor

Sorry I missed the meeting. Hopefully my comments weren't already addressed there. If so, just respond with what the consensus was and resolve them.


*In progress.* (This will be updated to a link to the PR once it's created)

### Risk
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd include a risk about the accuracy of the data. Queue behavior fluctuates so wildly between outages and evil, bad jobs, that it's really difficult to tell in advance what the behavior of any particular work item will be. I think some sort of system that tracks the "accuracy" would be interesting. Like when a job actually completes, cross reference the "real" time with our estimation, and record that somewhere. That way when, inevitably, customers give us feedback that it's not accurate (we almost always get feedback about stuff not being accurate/true), we have the metrics to actually judge that. Metrics are an important part of a long-term viable project, so you can keep track in the future if assumptions change that your existing services continue to function adequately.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great point. I'm not sure if this is something I can tackle for the scope of this project, but I've made a note of it.

I should add somewhere that this is maybe a feature worth implementing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like an important piece to me. Even if we initially tackle showing less information initially, this seems like it is an important requirement to know whether we are giving meaningful numbers to customers. Providing misleading information is usually worse than not providing information.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 x 1000. I'm really worried about the accuracy of this information. To the point that I'd rather have ONLY the metrics, and not the UI than the other way around. It's easy enough to add UI to it later, once we decide the information is good, but it's hard to change it without the data necessary to know if the information is correct.

Copy link
Member Author

@melotic melotic Jun 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been working on quantifying when we'll display that a queue is "overloaded" or experiencing "high volume". My current thought is we can compute 24hr/12hr/6hr moving averages (over the 95th percentile) and compare the values of the moving averages. For instance, we could define overloaded as the 6hr work item wait time moving average being twice the 24hr work item wait time moving average. I've created a Grafana dashboard that will show these computations:

https://dotnet-eng-grafana.westus2.cloudapp.azure.com/d/aOlx_y9nk/helix-insights-ma-crossovers?orgId=1&from=now-7d&to=now&var-QueueName=windows.10.amd64&viewPanel=2

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See e9386ce for the definitions of the queue statuses

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"moving averages" of... what? Queue depth? Wait times? I'm... not sure what I'm seeing in that graph. But if we compare 6hr to 24hr, wouldn't that just means for 6 hours every morning we called all the queues overloaded? So basically the entire business day is just "overloaded" by default.

Copy link
Contributor

@riarenas riarenas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some initial comments. I think the implementation details are good to have and will be useful, but we should make sure that the problems we're trying to solve and your proposed solution can be understood without having to go into those implementation details.

@melotic
Copy link
Member Author

melotic commented Jun 3, 2022

For now, Chad, Ricardo and I decided its best to draw a line in the sand and scrap the Queue Insights section for now. There are a lot of challenges that prevent us from being able to accurately present insights of queues, namely:

  • Past queue activity doesn't predict future activity
  • When the PR is submitted, it can be any where from 1-3 hours to when your tests run, and by then the status of the queues could have changed. We could solve this by creating & populating the check when the tests are about to run, but the effort in linking helix runs to a specific PR is a challenging problem. We also would not have anything to show the user untill the build is complete.. It is still on radar to complete after an initial POC of the features I want to implement is done.

See the revised mockup for the scope of features I'll be implementing.

@markwilkie
Copy link
Member

I really like the mini-graphs.

Am I understanding correctly that we're only showing queue status as a GH check? (no separate dashboard?)

@melotic
Copy link
Member Author

melotic commented Jun 14, 2022

I really like the mini-graphs.

Am I understanding correctly that we're only showing queue status as a GH check? (no separate dashboard?)

Yes. The GH check has links to the Grafana dashboard for that specified queue. (ubuntu.1804.amd64.open --> open's link to that respective Grafana dashboard)

@markwilkie
Copy link
Member

Yes. The GH check has links to the Grafana dashboard for that specified queue. (ubuntu.1804.amd64.open --> open's link to that respective Grafana dashboard)

Cool, that makes sense. This means the current plan doesn't include the "red, yellow, green" state? This is probably fine, but I don't want to promise or imply that we're doing it if we're not.

@riarenas
Copy link
Contributor

Cool, that makes sense. This means the current plan doesn't include the "red, yellow, green" state? This is probably fine, but I don't want to promise or imply that we're doing it if we're not.

It currently does not. We determined that the goal so far is to help individual developers understand what's going on with Helix in the context of their PRs.

A dashboard or a general status page are currently out of scope for this.

@markwilkie
Copy link
Member

A dashboard or a general status page are currently out of scope for this.

ACK - makes sense. Thanks


3. Query the work item wait time and queue size for that pipeline's list of queues.
1. Currently Grafana has this data, with Kusto queries that we can pull and use.
2. We will simply pull the queries that Grafana uses them to present the data.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
2. We will simply pull the queries that Grafana uses them to present the data.
2. We will simply pull the queries that Grafana uses to present the data.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this very minor typo is still here

@riarenas
Copy link
Contributor

Took me a while to make a full pass again. I think this is almost ready to go. Needs to be cleaned up a bit after the latest changes because there are some contradictions in some places.

I'd also like to see what your plan for rolling this out to customers would be. This isn't something we want to just enable everywhere, but should rather enable in a few repositories first so we can get initial feedback.

@melotic
Copy link
Member Author

melotic commented Jun 23, 2022

@riarenas ready for review. I've yanked the insight stuff for now. Perhaps this can be merged into the ML one-pager later.

riarenas
riarenas previously approved these changes Jun 23, 2022
@riarenas
Copy link
Contributor

LGTM, I think we just need to remove the section about average times for build machines etc from the "commited" mockup.

@melotic melotic enabled auto-merge (squash) June 23, 2022 20:29
@melotic melotic merged commit 9dbd4ef into dotnet:main Jun 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants