Coordinator throttling RFC#42
Conversation
…sed on worker load (prestodb#25689) Summary: RFC PR: prestodb/rfcs#42 Admission control scheduling policy **Logic** Gather worker overload data from the added end point in D76357677 Based on configured policies (cnt of overloaded workers or pct of overloaded workers) and cluster overload, queue the queries **Background** Design doc and rational for the change - https://docs.google.com/document/d/16pEkXPzsP09ZpZ8RxqJ-n-c5kx3TFel0b8Ubx7l6v-I/edit?tab=t.0#heading=h.bdc9ugryon9z Also some follow up questions on the design review / future directions after this change - https://docs.google.com/document/d/16pEkXPzsP09ZpZ8RxqJ-n-c5kx3TFel0b8Ubx7l6v-I/edit?tab=t.0#heading=h.afichcgpu3fe **ODS Metrics on queuing due to this feature:** Right now Queued Queries ODS could be correlated with worker overload to see if this feature is getting activated. Also we log the warn logs when cluster is overloaded. I will add specific metrics for this feature before rollout as well to make debugging easy. **Feature flag:** Right now feature is disabled. We can use coordinator configs to enable / add thresholds Differential Revision: D79470181
…sed on worker load (prestodb#25689) Summary: RFC PR: prestodb/rfcs#42 Admission control scheduling policy **Logic** Gather worker overload data from the added end point in D76357677 Based on configured policies (cnt of overloaded workers or pct of overloaded workers) and cluster overload, queue the queries **Background** Design doc and rational for the change - https://docs.google.com/document/d/16pEkXPzsP09ZpZ8RxqJ-n-c5kx3TFel0b8Ubx7l6v-I/edit?tab=t.0#heading=h.bdc9ugryon9z Also some follow up questions on the design review / future directions after this change - https://docs.google.com/document/d/16pEkXPzsP09ZpZ8RxqJ-n-c5kx3TFel0b8Ubx7l6v-I/edit?tab=t.0#heading=h.afichcgpu3fe **ODS Metrics on queuing due to this feature:** Right now Queued Queries ODS could be correlated with worker overload to see if this feature is getting activated. Also we log the warn logs when cluster is overloaded. I will add specific metrics for this feature before rollout as well to make debugging easy. **Feature flag:** Right now feature is disabled. We can use coordinator configs to enable / add thresholds Differential Revision: D79470181
…sed on worker load (prestodb#25689) Summary: RFC PR: prestodb/rfcs#42 Admission control scheduling policy **Logic** Gather worker overload data from the added end point in D76357677 Based on configured policies (cnt of overloaded workers or pct of overloaded workers) and cluster overload, queue the queries **Background** Design doc and rational for the change - https://docs.google.com/document/d/16pEkXPzsP09ZpZ8RxqJ-n-c5kx3TFel0b8Ubx7l6v-I/edit?tab=t.0#heading=h.bdc9ugryon9z Also some follow up questions on the design review / future directions after this change - https://docs.google.com/document/d/16pEkXPzsP09ZpZ8RxqJ-n-c5kx3TFel0b8Ubx7l6v-I/edit?tab=t.0#heading=h.afichcgpu3fe **ODS Metrics on queuing due to this feature:** Right now Queued Queries ODS could be correlated with worker overload to see if this feature is getting activated. Also we log the warn logs when cluster is overloaded. I will add specific metrics for this feature before rollout as well to make debugging easy. **Feature flag:** Right now feature is disabled. We can use coordinator configs to enable / add thresholds Differential Revision: D79470181
…sed on worker load (prestodb#25689) Summary: RFC PR: prestodb/rfcs#42 Admission control scheduling policy **Logic** Gather worker overload data from the added end point in D76357677 Based on configured policies (cnt of overloaded workers or pct of overloaded workers) and cluster overload, queue the queries **Background** Design doc and rational for the change - https://docs.google.com/document/d/16pEkXPzsP09ZpZ8RxqJ-n-c5kx3TFel0b8Ubx7l6v-I/edit?tab=t.0#heading=h.bdc9ugryon9z Also some follow up questions on the design review / future directions after this change - https://docs.google.com/document/d/16pEkXPzsP09ZpZ8RxqJ-n-c5kx3TFel0b8Ubx7l6v-I/edit?tab=t.0#heading=h.afichcgpu3fe **ODS Metrics on queuing due to this feature:** Right now Queued Queries ODS could be correlated with worker overload to see if this feature is getting activated. Also we log the warn logs when cluster is overloaded. I will add specific metrics for this feature before rollout as well to make debugging easy. **Feature flag:** Right now feature is disabled. We can use coordinator configs to enable / add thresholds Differential Revision: D79470181
…sed on worker load (prestodb#25689) Summary: RFC PR: prestodb/rfcs#42 Admission control scheduling policy **Logic** Gather worker overload data from the added end point in D76357677 Based on configured policies (cnt of overloaded workers or pct of overloaded workers) and cluster overload, queue the queries **Background** Design doc and rational for the change - https://docs.google.com/document/d/16pEkXPzsP09ZpZ8RxqJ-n-c5kx3TFel0b8Ubx7l6v-I/edit?tab=t.0#heading=h.bdc9ugryon9z Also some follow up questions on the design review / future directions after this change - https://docs.google.com/document/d/16pEkXPzsP09ZpZ8RxqJ-n-c5kx3TFel0b8Ubx7l6v-I/edit?tab=t.0#heading=h.afichcgpu3fe **ODS Metrics on queuing due to this feature:** Right now Queued Queries ODS could be correlated with worker overload to see if this feature is getting activated. Also we log the warn logs when cluster is overloaded. I will add specific metrics for this feature before rollout as well to make debugging easy. **Feature flag:** Right now feature is disabled. We can use coordinator configs to enable / add thresholds Differential Revision: D79470181
tdcmeehan
left a comment
There was a problem hiding this comment.
It seems like one flaw in this design is it's fundamentally unfair, meaning, one user can overwhelm a cluster and cause it to be unavailable for other multitenant users. Have you thought about how to make the design more fair? Resource groups are designed to provide some degree of fairness, have you considered a solution that groups load by resource group in order to throttle the problematic users while allowing less problematic users to continue to use the cluster?
…sed on worker load (prestodb#25689) Summary: Pull Request resolved: prestodb#25689 RFC PR: prestodb/rfcs#42 Admission control scheduling policy **Logic** Gather worker overload data from the added end point in D76357677 Based on configured policies (cnt of overloaded workers or pct of overloaded workers) and cluster overload, queue the queries **Background** Design doc and rational for the change - https://docs.google.com/document/d/16pEkXPzsP09ZpZ8RxqJ-n-c5kx3TFel0b8Ubx7l6v-I/edit?tab=t.0#heading=h.bdc9ugryon9z Also some follow up questions on the design review / future directions after this change - https://docs.google.com/document/d/16pEkXPzsP09ZpZ8RxqJ-n-c5kx3TFel0b8Ubx7l6v-I/edit?tab=t.0#heading=h.afichcgpu3fe **ODS Metrics on queuing due to this feature:** Right now Queued Queries ODS could be correlated with worker overload to see if this feature is getting activated. Also we log the warn logs when cluster is overloaded. I will add specific metrics for this feature before rollout as well to make debugging easy. **Feature flag:** Right now feature is disabled. We can use coordinator configs to enable / add thresholds Differential Revision: D79470181
Initially I also explored enhancing RG level throttling mechanism, but there are some reasons I choose this approach. Accuracy of Resource Group (RG) Resource Accounting: Granularity of RG Accounting: Lack of Tenant Isolation at the Worker Level: |
To be clear, I'm not disagreeing with you that the existing metrics might not be sufficient to prevent overadmission, I'm simply pointing out that the current mechanism of queueing lacks fairness, which is not true for existing resource groups.
What I proposed earlier would be a per-query limit on worker-reported total memory usage. I just think that the decision to queue should be configured at a per-query level in the coordinator if we can't reliably aggregate these metrics from the task level.
Queued drivers and IO all sound like they could be aggregated from the task level, which would make them eligible to be configured as new metrics in the resource group. This design leaves them as worker-determined binary flags. I would instead report the raw metrics, which could then be aggregated into resource groups and be configured far more flexibly.
For metrics that can be aggregated, they should be added as metrics for queueing consideration in the resource group. For metrics which can't be aggregated, I would recommend this decision be made on a per-query basis and configurable (for example, splitting DDL statements from execution, potentially classifying queries which only access connectors which use a single connection like JDBC, and splitting them from resource-heavy queries like Hive and Iceberg).
I would recommend that that also be addressed in this design. The Presto scheduler already uses heuristics to decide which nodes are eligible to be scheduled to. If the worker now is reporting reliable statistics to aid in this decision, then the scheduler should use this information to improve resource utilization and prevent this scenario. See |
To ensure I understand your suggestion correctly, could you please clarify the following points:
If some queries from an RG are light weight and others in the same RG are making worker overload, should we still admit them as per this policy. I think you mentioned about fairness. My understanding is this will violate it or let me know if I missing sth here. In one of the earlier comment, you also mentioned about non-aggregated metrics be defined in conjunction with the resource group metrics. Is query level config and non-aggregated RG metrics are one and same thing? It would be great, if you can provide an example.
I mentioned about this in Granular task scheduling section. In our case, most of the workers become overload in matter of few mins. As such the main goal of this RFC was to be |
|
Yes, I concur that the phase 1 of this feature is 'reactive', rather than proactive. I also want to point out that in our clusters we are seeing that if the cluster gets overloaded, then majority of the workers get into this stage, not just a few stragglers. That meant that coordinator is doing quite a good job in distributing work properly and it looks like we don't need to handle that part. That leaves us focusing on not sending more workload to avoid making things worse. That's why I believe we likely don't really need a complex framework entwined into RGs - if we are overloaded/near overloaded then we stop query admission, when we are out of the woods we restart query admission and it just goes along the RG lines. That makes it fair, as fair as it is now. If we want to ensure that some bad players (say RGs with heavy queries) are getting submitted less, we should do it separately, using the current RG framework, making metrics better (decide that partial memory is ok, export number of threads or anything that could help us understand how heavy the query is). The idea is that it does not need to be the part of the overload pushback, IMHO, at least in the phase 1. Running it in the real use cases might give us more insight and we can change opinion. |
Since Node is still running queries, planning to keep it ACTIVE state. |
aditi-pandit
left a comment
There was a problem hiding this comment.
@spershin : Had a high level question.
…sed on worker load (prestodb#25689) Summary: Pull Request resolved: prestodb#25689 Admission control scheduling policy **Logic** Gather worker overload data from the added end point in PR - prestodb#25687 Based on configured policies (cnt of overloaded workers or pct of overloaded workers) and cluster overload, queue the queries **Background** RFC PR: prestodb/rfcs#42 **ODS Metrics on queuing due to this feature:** Added following ODS metrics - ClusterOverloadDuration - ClusterOverloadCount **Feature flag:** Right now feature is disabled. We can use coordinator configs to enable / add thresholds Differential Revision: D79470181
…sed on worker load (prestodb#25689) Summary: Pull Request resolved: prestodb#25689 Admission control scheduling policy **Logic** Gather worker overload data from the added end point in PR - prestodb#25687 Based on configured policies (cnt of overloaded workers or pct of overloaded workers) and cluster overload, queue the queries **Background** RFC PR: prestodb/rfcs#42 **ODS Metrics on queuing due to this feature:** Added following ODS metrics - ClusterOverloadDuration - ClusterOverloadCount **Feature flag:** Right now feature is disabled. We can use coordinator configs to enable / add thresholds Differential Revision: D79470181
…sed on worker load (prestodb#25689) Summary: Admission control scheduling policy **Logic** Gather worker overload data from the added end point in PR - prestodb#25687 Based on configured policies (cnt of overloaded workers or pct of overloaded workers) and cluster overload, queue the queries **Background** RFC PR: prestodb/rfcs#42 **ODS Metrics on queuing due to this feature:** Added following ODS metrics - ClusterOverloadDuration - ClusterOverloadCount **Feature flag:** Right now feature is disabled. We can use coordinator configs to enable / add thresholds Differential Revision: D79470181
…sed on worker load (prestodb#25689) Summary: Admission control scheduling policy **Logic** Gather worker overload data from the added end point in PR - prestodb#25687 Based on configured policies (cnt of overloaded workers or pct of overloaded workers) and cluster overload, queue the queries **Background** RFC PR: prestodb/rfcs#42 **ODS Metrics on queuing due to this feature:** Added following ODS metrics - ClusterOverloadDuration - ClusterOverloadCount **Feature flag:** Right now feature is disabled. We can use coordinator configs to enable / add thresholds Differential Revision: D79470181
…sed on worker load (prestodb#25689) Summary: Admission control scheduling policy **Logic** Gather worker overload data from the added end point in PR - prestodb#25687 Based on configured policies (cnt of overloaded workers or pct of overloaded workers) and cluster overload, queue the queries **Background** RFC PR: prestodb/rfcs#42 **Metrics on queuing due to this feature:** Added following ODS metrics - ClusterOverloadDuration - ClusterOverloadCount **Feature flag:** Right now feature is disabled. We can use coordinator configs to enable / add thresholds Differential Revision: D79470181
…sed on worker load (prestodb#25689) Summary: Admission control scheduling policy **Logic** Gather worker overload data from the added end point in PR - prestodb#25687 Based on configured policies (cnt of overloaded workers or pct of overloaded workers) and cluster overload, queue the queries **Background** RFC PR: prestodb/rfcs#42 **Metrics on queuing due to this feature:** Added following ODS metrics - ClusterOverloadDuration - ClusterOverloadCount **Feature flag:** Right now feature is disabled. We can use coordinator configs to enable / add thresholds Differential Revision: D79470181
…sed on worker load (prestodb#25689) Summary: Pull Request resolved: prestodb#25689 Admission control scheduling policy **Logic** Gather worker overload data from the added end point in PR - prestodb#25687 Based on configured policies (cnt of overloaded workers or pct of overloaded workers) and cluster overload, queue the queries **Background** RFC PR: prestodb/rfcs#42 **Metrics on queuing due to this feature:** Added following ODS metrics - ClusterOverloadDuration - ClusterOverloadCount **Feature flag:** Right now feature is disabled. We can use coordinator configs to enable / add thresholds Differential Revision: D79470181
…sed on worker load (prestodb#25689) Summary: Pull Request resolved: prestodb#25689 Admission control scheduling policy **Logic** Gather worker overload data from the added end point in PR - prestodb#25687 Based on configured policies (cnt of overloaded workers or pct of overloaded workers) and cluster overload, queue the queries **Background** RFC PR: prestodb/rfcs#42 **Metrics on queuing due to this feature:** Added following JMX metrics - ClusterOverloadDuration - ClusterOverloadCount **Feature flag:** Right now feature is disabled. We can use coordinator configs to enable / add thresholds Differential Revision: D79470181
…sed on worker load (prestodb#25689) Summary: Pull Request resolved: prestodb#25689 Admission control scheduling policy **Logic** Gather worker overload data from the added end point in PR - prestodb#25687 Based on configured policies (cnt of overloaded workers or pct of overloaded workers) and cluster overload, queue the queries **Background** RFC PR: prestodb/rfcs#42 **Metrics on queuing due to this feature:** Added following JMX metrics - ClusterOverloadDuration - ClusterOverloadCount **Feature flag:** Right now feature is disabled. We can use coordinator configs to enable / add thresholds Differential Revision: D79470181
…sed on worker load (prestodb#25689) Summary: Pull Request resolved: prestodb#25689 Admission control scheduling policy **Logic** Gather worker overload data from the added end point in PR - prestodb#25687 Based on configured policies (cnt of overloaded workers or pct of overloaded workers) and cluster overload, queue the queries **Background** RFC PR: prestodb/rfcs#42 **Metrics on queuing due to this feature:** Added following JMX metrics - ClusterOverloadDuration - ClusterOverloadCount **Feature flag:** Right now feature is disabled. We can use coordinator configs to enable / add thresholds Differential Revision: D79470181
…sed on worker load (prestodb#25689) Summary: Pull Request resolved: prestodb#25689 Admission control scheduling policy **Logic** Gather worker overload data from the added end point in PR - prestodb#25687 Based on configured policies (cnt of overloaded workers or pct of overloaded workers) and cluster overload, queue the queries **Background** RFC PR: prestodb/rfcs#42 **Metrics on queuing due to this feature:** Added following JMX metrics - ClusterOverloadDuration - ClusterOverloadCount **Feature flag:** Right now feature is disabled. We can use coordinator configs to enable / add thresholds Differential Revision: D79470181
…sed on worker load (prestodb#25689) Summary: Admission control scheduling policy **Logic** Gather worker overload data from the added end point in PR - prestodb#25687 Based on configured policies (cnt of overloaded workers or pct of overloaded workers) and cluster overload, queue the queries **Background** RFC PR: prestodb/rfcs#42 **Metrics on queuing due to this feature:** Added following JMX metrics - ClusterOverloadDuration - ClusterOverloadCount **Feature flag:** Right now feature is disabled. We can use coordinator configs to enable / add thresholds Differential Revision: D79470181
…sed on worker load (prestodb#25689) Summary: Admission control scheduling policy **Logic** Gather worker overload data from the added end point in PR - prestodb#25687 Based on configured policies (cnt of overloaded workers or pct of overloaded workers) and cluster overload, queue the queries **Background** RFC PR: prestodb/rfcs#42 **Metrics on queuing due to this feature:** Added following JMX metrics - ClusterOverloadDuration - ClusterOverloadCount **Feature flag:** Right now feature is disabled. We can use coordinator configs to enable / add thresholds Differential Revision: D79470181
|
@prashantgolash can you please fix up the numbering to reflect the latest number in this repo? Thanks |
Done |
RFC for coordinator throttling.