Add one-pager document for Increasing Visibility into Helix Queues #9536
Sorry I missed the meeting. Hopefully my comments weren't already addressed there. If so, just respond with what the consensus was and resolve them.
Documentation/TeamProcess/One-Pagers/increase-visibility-helix-queues-arcade8824.md
*In progress.* (This will be updated to a link to the PR once it's created)

### Risk
I'd include a risk about the accuracy of the data. Queue behavior fluctuates so wildly between outages and evil, bad jobs, that it's really difficult to tell in advance what the behavior of any particular work item will be. I think some sort of system that tracks the "accuracy" would be interesting. Like when a job actually completes, cross reference the "real" time with our estimation, and record that somewhere. That way when, inevitably, customers give us feedback that it's not accurate (we almost always get feedback about stuff not being accurate/true), we have the metrics to actually judge that. Metrics are an important part of a long-term viable project, so you can keep track in the future if assumptions change that your existing services continue to function adequately.
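The accuracy tracking suggested here could look roughly like the sketch below: when a work item completes, store the estimate next to the measured wait, then summarize the error per queue. This is only an illustration. The `WaitRecord` type, its field names, and `accuracy_summary` are hypothetical and not part of the one-pager or of Helix.

```python
from dataclasses import dataclass
from statistics import mean, median


@dataclass(frozen=True)
class WaitRecord:
    """One completed work item: the estimate we showed vs. what happened."""
    queue: str
    estimated_wait_s: float  # estimate shown to the customer (hypothetical field)
    actual_wait_s: float     # wait measured once the work item actually started


def accuracy_summary(records):
    """Summarize the absolute estimation error, in seconds, per queue."""
    errors_by_queue = {}
    for record in records:
        error = abs(record.actual_wait_s - record.estimated_wait_s)
        errors_by_queue.setdefault(record.queue, []).append(error)
    return {
        queue: {
            "count": len(errors),
            "mean_abs_error_s": mean(errors),
            "median_abs_error_s": median(errors),
        }
        for queue, errors in errors_by_queue.items()
    }
```

With numbers like these recorded over time, feedback that the estimates are "not accurate" can be checked against data instead of argued from memory.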
Great point. I'm not sure if this is something I can tackle for the scope of this project, but I've made a note of it.
I should note somewhere that this may be a feature worth implementing.
This seems like an important piece to me. Even if we initially show less information, this is an important requirement for knowing whether we are giving meaningful numbers to customers. Providing misleading information is usually worse than not providing information.
👍 x 1000. I'm really worried about the accuracy of this information. To the point that I'd rather have ONLY the metrics, and not the UI than the other way around. It's easy enough to add UI to it later, once we decide the information is good, but it's hard to change it without the data necessary to know if the information is correct.
I've been working on quantifying when we'll display that a queue is "overloaded" or experiencing "high volume". My current thought is we can compute 24hr/12hr/6hr moving averages (over the 95th percentile) and compare the values of the moving averages. For instance, we could define overloaded as the 6hr work item wait time moving average being twice the 24hr work item wait time moving average. I've created a Grafana dashboard that will show these computations:
See e9386ce for the definitions of the queue statuses
"moving averages" of... what? Queue depth? Wait times? I'm... not sure what I'm seeing in that graph. But if we compare 6hr to 24hr, wouldn't that just mean that for 6 hours every morning we'd call all the queues overloaded? So basically the entire business day is just "overloaded" by default.
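The rule proposed in this thread, and the concern raised about it, can be checked with a small sketch. It assumes the input is a list of hourly 95th-percentile work item wait times in seconds, oldest first, and it reads "twice" as "at least twice". The function names and the sample day are hypothetical.

```python
def trailing_mean(samples, hours):
    """Mean of the last `hours` hourly samples (fewer if history is shorter)."""
    if not samples:
        raise ValueError("need at least one sample")
    window = samples[-hours:]
    return sum(window) / len(window)


def is_overloaded(hourly_p95_wait_s, short_h=6, long_h=24, ratio=2.0):
    """Flag a queue when the short-window average is >= ratio * long-window average."""
    short_avg = trailing_mean(hourly_p95_wait_s, short_h)
    long_avg = trailing_mean(hourly_p95_wait_s, long_h)
    return short_avg >= ratio * long_avg


# Hypothetical ordinary day: 18 quiet hours at 60 s, then 6 busy hours at 600 s.
ordinary_day = [60] * 18 + [600] * 6
# Hypothetical flat day: the same 300 s wait every hour.
flat_day = [300] * 24

print(is_overloaded(ordinary_day))  # 6h avg = 600, 24h avg = 195
print(is_overloaded(flat_day))      # 6h avg = 300, 24h avg = 300
```

Under these assumptions the ordinary day is flagged as overloaded and the flat day is not. That illustrates the objection: a normal daily ramp-up can trip the threshold on its own, with no outage involved.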
riarenas
left a comment
Some initial comments. I think the implementation details are good to have and will be useful, but we should make sure that the problems we're trying to solve and your proposed solution can be understood without having to go into those implementation details.
Chad, Ricardo, and I decided it's best to draw a line in the sand and scrap the Queue Insights section for now. Several challenges prevent us from presenting accurate queue insights, namely:
See the revised mockup for the scope of features I'll be implementing.
I really like the mini-graphs. Am I understanding correctly that we're only showing queue status as a GH check? (no separate dashboard?)
Yes. The GH check links to the Grafana dashboard for each listed queue (for example, ubuntu.1804.amd64.open links to that queue's Grafana dashboard).
Cool, that makes sense. This means the current plan doesn't include the "red, yellow, green" state? This is probably fine, but I don't want to promise or imply that we're doing it if we're not.
It currently does not. We determined that the goal so far is to help individual developers understand what's going on with Helix in the context of their PRs. A dashboard or general status page is currently out of scope for this.
ACK - makes sense. Thanks
Documentation/TeamProcess/One-Pagers/increase-visibility-helix-queues-arcade8824.md
3. Query the work item wait time and queue size for that pipeline's list of queues.
    1. Currently Grafana has this data, with Kusto queries that we can pull and use.
    2. We will simply pull the queries that Grafana uses them to present the data.
- 2. We will simply pull the queries that Grafana uses them to present the data.
+ 2. We will simply pull the queries that Grafana uses to present the data.
this very minor typo is still here
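For step 3 of the quoted plan, the sketch below builds Kusto query text for a pipeline's list of queues. It is a stand-in, not the query Grafana actually uses: the table name `WorkItems` and the columns `Queued`, `Started`, and `QueueName` are assumptions, as are the helper names. It covers wait time only, not queue size, and `lookback` is assumed to be a trusted timespan literal.

```python
def kql_string(value):
    """Quote a value as a KQL string literal, escaping backslashes and double quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def build_queue_query(queues, lookback="1h"):
    """Build query text for the p95 wait time per queue over the lookback window."""
    if not queues:
        raise ValueError("at least one queue name is required")
    queue_list = ", ".join(kql_string(q) for q in queues)
    return "\n".join([
        "WorkItems",                             # hypothetical table name
        f"| where Queued > ago({lookback})",     # hypothetical column: Queued
        f"| where QueueName in ({queue_list})",  # hypothetical column: QueueName
        "| extend WaitSeconds = datetime_diff('second', Started, Queued)",
        "| summarize P95WaitSeconds = percentile(WaitSeconds, 95) by QueueName",
    ])


print(build_queue_query(["ubuntu.1804.amd64.open"]))
```

Keeping the query text in one function makes it easier to swap in the real Grafana query later without touching the code that calls it.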
Took me a while to make a full pass again. I think this is almost ready to go. Needs to be cleaned up a bit after the latest changes because there are some contradictions in some places. I'd also like to see what your plan for rolling this out to customers would be. This isn't something we want to just enable everywhere, but should rather enable in a few repositories first so we can get initial feedback.
@riarenas ready for review. I've yanked the insight stuff for now. Perhaps this can be merged into the ML one-pager later.
LGTM, I think we just need to remove the section about average times for build machines etc. from the "committed" mockup.
Epic: dotnet/dnceng#2657
To double check: