Coordinator throttling RFC by prashantgolash · Pull Request #42 · prestodb/rfcs

prashantgolash · 2025-08-07T04:56:50Z

RFC for coordinator throttling.

…sed on worker load (prestodb#25689)

Summary: RFC PR: prestodb/rfcs#42. Admission control scheduling policy.

**Logic**
- Gather worker overload data from the endpoint added in D76357677.
- Based on the configured policies (count or percentage of overloaded workers) and cluster overload, queue the queries.

**Background**
- Design doc and rationale for the change: https://docs.google.com/document/d/16pEkXPzsP09ZpZ8RxqJ-n-c5kx3TFel0b8Ubx7l6v-I/edit?tab=t.0#heading=h.bdc9ugryon9z
- Follow-up questions from the design review and future directions after this change: https://docs.google.com/document/d/16pEkXPzsP09ZpZ8RxqJ-n-c5kx3TFel0b8Ubx7l6v-I/edit?tab=t.0#heading=h.afichcgpu3fe

**ODS metrics on queuing due to this feature:** Right now, the Queued Queries ODS metric can be correlated with worker overload to see whether this feature is being activated. We also log warnings when the cluster is overloaded. I will add specific metrics for this feature before rollout to make debugging easier.

**Feature flag:** The feature is currently disabled. Coordinator configs can be used to enable it and set thresholds.

Differential Revision: D79470181
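As a rough illustration of the logic described in this commit message, the overload check could look like the sketch below. The function names and the two thresholds (`max_overloaded_count`, `max_overloaded_pct`) are assumptions made for illustration; they are not the actual Presto configuration properties or implementation.

```python
from typing import Mapping, Optional


def is_cluster_overloaded(
    worker_overloaded: Mapping[str, bool],
    max_overloaded_count: Optional[int] = None,
    max_overloaded_pct: Optional[float] = None,
) -> bool:
    """Return True when overloaded workers exceed a configured threshold.

    worker_overloaded maps a worker id to the overload flag it reported.
    Both thresholds are hypothetical; a threshold left as None is ignored.
    """
    total = len(worker_overloaded)
    if total == 0:
        return False
    overloaded = sum(1 for flag in worker_overloaded.values() if flag)
    # Count-based policy: too many overloaded workers in absolute terms.
    if max_overloaded_count is not None and overloaded > max_overloaded_count:
        return True
    # Percentage-based policy: too large a share of the cluster is overloaded.
    if max_overloaded_pct is not None and 100.0 * overloaded / total > max_overloaded_pct:
        return True
    return False


def admit_or_queue(cluster_overloaded: bool) -> str:
    # New queries wait in the queue while the cluster is overloaded.
    return "QUEUE" if cluster_overloaded else "ADMIT"
```

With four workers of which two report overload, a count threshold of 1 or a percentage threshold of 40 would queue new queries, while a percentage threshold of 50 would still admit them because the comparison is strict.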

tdcmeehan

It seems like one flaw in this design is it's fundamentally unfair, meaning, one user can overwhelm a cluster and cause it to be unavailable for other multitenant users. Have you thought about how to make the design more fair? Resource groups are designed to provide some degree of fairness, have you considered a solution that groups load by resource group in order to throttle the problematic users while allowing less problematic users to continue to use the cluster?

prashantgolash · 2025-08-11T23:18:15Z

> It seems like one flaw in this design is it's fundamentally unfair, meaning, one user can overwhelm a cluster and cause it to be unavailable for other multitenant users. Have you thought about how to make the design more fair? Resource groups are designed to provide some degree of fairness, have you considered a solution that groups load by resource group in order to throttle the problematic users while allowing less problematic users to continue to use the cluster?

Initially I also explored enhancing the RG-level throttling mechanism, but I chose this approach for the following reasons.

1. Accuracy of Resource Group (RG) Resource Accounting:
The softMemoryLimits setting does not account for "unaccounted memory" that may be shared across multiple queries. As a result, accurately attributing actual worker resource usage to individual queries, and therefore to corresponding Resource Groups, can be imprecise. On the CPU side, I am also unsure whether throttling is implemented correctly (see reference). In the future, we plan to introduce additional worker load metrics (e.g., queued drivers, I/O throttling) that may not align directly with the current RG configuration.

2. Granularity of RG Accounting:
At present, RG accounting operates at the stage->query level across the entire cluster. For our approach, we require more granular, worker-level resource utilization metrics. While it is technically possible to enhance the logic to map worker resource usage back to RGs, the limitations described in point (1) mean that this alone may not provide sufficient accuracy.

3. Lack of Tenant Isolation at the Worker Level:
Another key concern is tenant isolation. Admitting queries from users that are not causing issues may still result in those queries being routed to the same overloaded node. Since there is currently no tenant isolation at the worker level, simply admitting "non-problematic" queries is unlikely to resolve potential resource contention.

tdcmeehan · 2025-08-12T15:24:02Z

> The softMemoryLimits setting does not account for "unaccounted memory" that may be shared across multiple queries.

To be clear, I'm not disagreeing with you that the existing metrics might not be sufficient to prevent overadmission. I'm simply pointing out that the current queueing mechanism lacks fairness, which existing resource groups do provide.

> As a result, accurately attributing actual worker resource usage to individual queries, and therefore to corresponding Resource Groups, can be imprecise.

What I proposed earlier would be a per-query limit on worker-reported total memory usage. I just think that the decision to queue should be configured at a per-query level in the coordinator if we can't reliably aggregate these metrics from the task level.

> On the CPU side, I am also unsure whether throttling is implemented correctly (see reference). In the future, we plan to introduce additional worker load metrics (e.g., queued drivers, I/O throttling) that may not align directly with the current RG configuration.

Queued drivers and IO all sound like they could be aggregated from the task level, which would make them eligible to be configured as new metrics in the resource group. This design leaves them as worker-determined binary flags. I would instead report the raw metrics, which could then be aggregated into resource groups and be configured far more flexibly.
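An illustrative sketch of this suggestion: workers report raw per-task values, and the coordinator sums them per resource group and compares the totals against per-group limits. The `TaskMetrics` fields, the function names, and the limit structure are all hypothetical and only show the shape of the aggregation, not any existing Presto API.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class TaskMetrics(NamedTuple):
    resource_group: str  # resource group of the query that owns the task
    queued_drivers: int  # raw value reported by the worker (assumed metric)
    memory_bytes: int    # raw value reported by the worker (assumed metric)


def aggregate_by_resource_group(tasks: Iterable[TaskMetrics]) -> Dict[str, Dict[str, int]]:
    """Sum raw task-level metrics into one total per resource group."""
    totals: Dict[str, Dict[str, int]] = defaultdict(
        lambda: {"queued_drivers": 0, "memory_bytes": 0}
    )
    for task in tasks:
        totals[task.resource_group]["queued_drivers"] += task.queued_drivers
        totals[task.resource_group]["memory_bytes"] += task.memory_bytes
    return dict(totals)


def groups_over_limit(
    totals: Dict[str, Dict[str, int]], queued_driver_limits: Dict[str, int]
) -> List[str]:
    """Return the groups whose queued-driver total exceeds their limit.

    A group without a configured limit is never throttled in this sketch.
    """
    return sorted(
        group
        for group, metrics in totals.items()
        if metrics["queued_drivers"] > queued_driver_limits.get(group, float("inf"))
    )
```

Only the groups returned by `groups_over_limit` would have new queries queued, so a group that stays under its limit keeps being admitted.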

> At present, RG accounting operates at the stage->query level across the entire cluster. For our approach, we require more granular, worker-level resource utilization metrics. While it is technically possible to enhance the logic to map worker resource usage back to RGs, the limitations described in point (1) mean that this alone may not provide sufficient accuracy.

For metrics that can be aggregated, they should be added as metrics for queueing consideration in the resource group. For metrics which can't be aggregated, I would recommend this decision be made on a per-query basis and configurable (for example, splitting DDL statements from execution, potentially classifying queries which only access connectors which use a single connection like JDBC, and splitting them from resource-heavy queries like Hive and Iceberg).

> Another key concern is tenant isolation. Admitting queries from users that are not causing issues may still result in those queries being routed to the same overloaded node. Since there is currently no tenant isolation at the worker level, simply admitting "non-problematic" queries is unlikely to resolve potential resource contention.

I would recommend that that also be addressed in this design. The Presto scheduler already uses heuristics to decide which nodes are eligible to be scheduled to. If the worker now is reporting reliable statistics to aid in this decision, then the scheduler should use this information to improve resource utilization and prevent this scenario. See SimpleNodeSelector.
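A minimal sketch of the idea, assuming the coordinator already holds the latest overload flag for each worker: overloaded workers are dropped from the candidate list before tasks are placed. The function name and the fallback to the full candidate list when every worker is overloaded are assumptions for illustration and do not describe how SimpleNodeSelector actually behaves.

```python
from typing import Dict, List


def eligible_workers(candidates: List[str], overloaded: Dict[str, bool]) -> List[str]:
    """Prefer workers that have not reported overload.

    Workers missing from the overload map are treated as healthy.
    If every candidate is overloaded, fall back to all candidates so that
    scheduling can still make progress (an assumed policy choice).
    """
    healthy = [w for w in candidates if not overloaded.get(w, False)]
    return healthy if healthy else list(candidates)
```

For example, with candidates `w1`, `w2`, `w3` and only `w2` reporting overload, new tasks would be placed on `w1` and `w3`.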

prashantgolash · 2025-08-13T06:08:58Z

> What I proposed earlier would be a per-query limit on worker-reported total memory usage. I just think that the decision to queue should be configured at a per-query level in the coordinator if we can't reliably aggregate these metrics from the task level.

To ensure I understand your suggestion correctly, could you please clarify the following points:
Are you proposing that, if we detect a worker is trending towards high utilization, we should prevent the admission of heavy queries?
What would the configuration look like in practice?
Would this configuration be specific to a particular resource group (RG), or would it apply more broadly?

> For metrics that can be aggregated, they should be added as metrics for queueing consideration in the resource group. For metrics which can't be aggregated, I would recommend this decision be made on a per-query basis and configurable (for example, splitting DDL statements from execution, potentially classifying queries which only access connectors which use a single connection like JDBC, and splitting them from resource-heavy queries like Hive and Iceberg).

If some queries from an RG are lightweight and others in the same RG are causing worker overload, should we still admit them under this policy? You mentioned fairness; my understanding is that this would violate it, but let me know if I am missing something here. In one of your earlier comments, you also mentioned that non-aggregated metrics should be defined in conjunction with the resource group metrics. Are the query-level config and the non-aggregated RG metrics one and the same thing? It would be great if you could provide an example.

> I would recommend that that also be addressed in this design. The Presto scheduler already uses heuristics to decide which nodes are eligible to be scheduled to. If the worker now is reporting reliable statistics to aid in this decision, then the scheduler should use this information to improve resource utilization and prevent this scenario. See SimpleNodeSelector.

I mentioned this in the Granular task scheduling section. In our case, most of the workers become overloaded within a few minutes. As such, the main goal of this RFC was to be reactive rather than proactive, but I can definitely see that this can be improved. This is something I was planning to do in V2 (this also applies to intermediate task scheduling, which is also one of the reasons workers become overloaded). cc @spershin as well to share thoughts and more insights on the cluster overload pattern.

spershin · 2025-08-14T19:44:54Z

@prashantgolash , @tdcmeehan

Yes, I concur that phase 1 of this feature is 'reactive' rather than proactive.
We can improve on it further and move it to the 'proactive' stage, allowing lightweight queries while holding the heavy ones (subject to starvation).

I also want to point out that in our clusters we see that when the cluster gets overloaded, the majority of the workers get into this state, not just a few stragglers. That means the coordinator is doing quite a good job of distributing work properly, and it looks like we don't need to handle that part. That leaves us focusing on not sending more workload, to avoid making things worse.

That's why I believe we likely don't really need a complex framework entwined into RGs - if we are overloaded/near overloaded then we stop query admission, when we are out of the woods we restart query admission and it just goes along the RG lines. That makes it fair, as fair as it is now.

If we want to ensure that some bad players (say, RGs with heavy queries) are admitted less often, we should do it separately, using the current RG framework and making its metrics better (decide that partial memory is ok, export the number of threads, or anything else that could help us understand how heavy a query is). The idea is that it does not need to be part of the overload pushback, IMHO, at least in phase 1. Running it on real use cases might give us more insight, and we can revisit this opinion.

prashantgolash · 2025-08-19T20:53:42Z

> If a worker has self-identified as overloaded, would its corresponding NodeState be updated as INACTIVE/SHUTTING_DOWN?

Since the node is still running queries, I am planning to keep it in the ACTIVE state.

aditi-pandit

@spershin : Had a high level question.

spershin

Looks good!
Thank you!

…sed on worker load (prestodb#25689)

Summary: Pull Request resolved: prestodb#25689. Admission control scheduling policy.

**Logic**
- Gather worker overload data from the endpoint added in PR prestodb#25687.
- Based on the configured policies (count or percentage of overloaded workers) and cluster overload, queue the queries.

**Background**
RFC PR: prestodb/rfcs#42

**ODS metrics on queuing due to this feature:** Added the following ODS metrics:
- ClusterOverloadDuration
- ClusterOverloadCount

**Feature flag:** The feature is currently disabled. Coordinator configs can be used to enable it and set thresholds.

Differential Revision: D79470181

…sed on worker load (prestodb#25689)

Summary: Pull Request resolved: prestodb#25689. Admission control scheduling policy.

**Logic**
- Gather worker overload data from the endpoint added in PR prestodb#25687.
- Based on the configured policies (count or percentage of overloaded workers) and cluster overload, queue the queries.

**Background**
RFC PR: prestodb/rfcs#42

**Metrics on queuing due to this feature:** Added the following JMX metrics:
- ClusterOverloadDuration
- ClusterOverloadCount

**Feature flag:** The feature is currently disabled. Coordinator configs can be used to enable it and set thresholds.

Differential Revision: D79470181

tdcmeehan · 2025-09-16T19:49:59Z

@prashantgolash can you please fix up the numbering to reflect the latest number in this repo? Thanks

prashantgolash · 2025-09-16T21:20:27Z

> @prashantgolash can you please fix up the numbering to reflect the latest number in this repo? Thanks

Done

Coordinator throttling RFC

8184916

prestodb-ci added the from:Meta label Aug 7, 2025

prashantgolash assigned tdcmeehan, rschlussel, amitkdutta, aditi-pandit and spershin Aug 7, 2025

prashantgolash mentioned this pull request Aug 7, 2025

[Coordinator throttling] Scheduling Policies for Admission Control based on worker load prestodb/presto#25689

Merged

rschlussel reviewed Aug 7, 2025

tdcmeehan requested changes Aug 7, 2025

prashantgolash added 3 commits August 7, 2025 22:12

Review comments

80b2b64

Review comments

329f5d6

Review comments

58d781f

rschlussel approved these changes Aug 8, 2025

tdcmeehan requested changes Aug 11, 2025

Review comments

df46ccc

tdcmeehan reviewed Aug 12, 2025

prashantgolash requested a review from pgupta2 August 12, 2025 21:11

prashantgolash added 2 commits August 18, 2025 23:57

Code review comments

58ece3a

Code review comments

61be7e9

Code review comments

14f2458

prashantgolash added 2 commits August 19, 2025 15:13

Code review comments

9c7491d

Code review comments

739d04c

aditi-pandit reviewed Aug 19, 2025

prashantgolash added 2 commits August 21, 2025 09:56

Code review comments

47db9d8

Code review comments

9fe01a4

tdcmeehan reviewed Aug 26, 2025

Code review comments

c21a9b8

tdcmeehan approved these changes Aug 27, 2025

spershin approved these changes Aug 27, 2025

tdcmeehan approved these changes Sep 16, 2025

Fix naming

9c1fa03

tdcmeehan approved these changes Sep 16, 2025

tdcmeehan merged commit 168f99d into prestodb:main Sep 16, 2025
1 check passed
