feat: add fallback support for value metric type #6655
wozniakjan merged 3 commits into kedacore:main
Conversation
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.
@JorTurFer Is this something you guys would be interested in? If not, since it's stale now, I'll close it.

This is interesting indeed; the limitation of the metric type is something to fix. Sorry for not reviewing the PR, it was missed in my notifications. I'm going to review it.
/run-e2e fallback |
JorTurFer left a comment:

In general, the code looks nice, and it's quite an interesting solution! I'd need to check it myself to be 100% sure that it works as expected, so let's see how the e2e test goes. I've left one small comment inline.
@wozniakjan PTAL too.
@JorTurFer Can we expect this in the next release? The version update was causing many issues since the average value is enforced; this would let me update to the latest version without any schema change to all the existing ScaledObjects.

I see there are also merge conflicts in this PR.

@y-rabie Hey, can you please take a look and resolve the merge conflicts? This is a much-needed feature.
@SanthanCH @rickbrouwer Conflicts resolved.

/run-e2e fallback

@rickbrouwer Can you review now? I can see all the checks passed.

I think it's good to add this PR to the Release 2.18 list. That gives me some time to take a closer look at the PR, but overall, everything looks good.

Hey @rickbrouwer, is this feature going to be in the next release? And when is the next release, is there any ETA?
@JorTurFer Yep, I think the timeout was introduced because we now test rollouts and deployments. I've split them into two files.
Hey @JorTurFer, @zroubalik, @rickbrouwer

/run-e2e fallback*

@SanthanCH Unfortunately e2e tests still fail, PTAL.

Hey @y-rabie, could you PTAL?
/run-e2e fallback*

Hey @JorTurFer, @zroubalik, @rickbrouwer
* feat: add fallback support for value metric type
* Alternative implementation for getReadyReplicasCount in fallback
* fix: restructure fallback tests to avoid timeouts

Signed-off-by: y-rabie <youssef.rabie@procore.com>
Still getting this error in the KEDA admission webhook pod. This ScaledObject has only a CPU trigger and the metric type is Utilization.
Fallback doesn't work for the CPU and memory scalers because those metrics aren't managed by KEDA. KEDA generates the HPA for the k8s metrics server on your behalf, but nothing else. To use fallback after 2.18, you need at least one scaler other than memory or cpu.
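To illustrate the point above, here is a sketch of a ScaledObject where fallback is valid (all names, addresses, and values are made up): the cpu trigger is served by the k8s metrics server and is never subject to fallback, while the prometheus trigger is KEDA-managed, so the fallback section can apply to it.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: example-scaledobject      # hypothetical name
spec:
  scaleTargetRef:
    name: example-deployment      # hypothetical target
  fallback:
    failureThreshold: 3
    replicas: 5
  triggers:
    - type: cpu                   # handled by the k8s metrics server; no fallback
      metricType: Utilization
      metadata:
        value: "60"
    - type: prometheus            # KEDA-managed scaler; fallback applies here
      metadata:
        serverAddress: http://prometheus.svc:9090   # hypothetical address
        query: sum(rate(http_requests_total[2m]))   # hypothetical query
        threshold: "100"
```

With only the cpu trigger present, the admission webhook rejects the fallback section, which matches the error reported above.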
We also ran into this, and did not see it in the documentation of the CPU scaler. Maybe it is stated somewhere else. Would you accept a PR for the docs about this?
It is not in the CPU and memory scaler docs, but it is here: https://keda.sh/docs/2.18/reference/scaledobject-spec/#fallback

@inrdkec 🙏 Thanks for pointing out the right place in the documentation.
Since the constraint of having fallback only for `AverageValue` seems to me kinda unwarranted, it's relaxed a bit here, and can be removed altogether if we opt for another implementation.

The constraint now becomes that the `scaleTargetRef` object has a field `.status.readyReplicas` that its controller updates with the number of ready replicas, so that we can use it directly. This is de facto the case with Deployments/StatefulSets/ReplicaSets/Argo Rollouts.

We can then generically fetch the object as `unstructured` and access the value of that field to divide by it. A brief math illustration, starting with the HPA's equation:

    desiredReplicas = ceil[currentReplicas * (currentMetricValue / desiredMetricValue)]

By passing

    currentMetricValue = desiredMetricValue * fallbackReplicas / currentReplicas

we end up with

    desiredReplicas = ceil[currentReplicas * ((desiredMetricValue * fallbackReplicas / currentReplicas) / desiredMetricValue)]
                    = ceil[currentReplicas * (fallbackReplicas / currentReplicas)]
                    = ceil[fallbackReplicas]
                    = fallbackReplicas

Emphasis: `currentReplicas` in the HPA's equation is the number of ready replicas.
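The derivation above can be checked numerically. A minimal Go sketch (helper names are hypothetical, not KEDA's actual code):

```go
package main

import (
	"fmt"
	"math"
)

// fallbackMetricValue is the metric value reported during fallback so that the
// HPA's formula resolves to exactly fallbackReplicas, per the derivation above.
func fallbackMetricValue(desiredMetricValue float64, fallbackReplicas, readyReplicas int64) float64 {
	return desiredMetricValue * float64(fallbackReplicas) / float64(readyReplicas)
}

// hpaDesiredReplicas mirrors the HPA scaling equation:
// desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)
func hpaDesiredReplicas(currentReplicas int64, currentMetricValue, desiredMetricValue float64) int64 {
	return int64(math.Ceil(float64(currentReplicas) * currentMetricValue / desiredMetricValue))
}

func main() {
	desired := 100.0             // target for a Value-type metric
	fallbackReplicas := int64(5) // spec.fallback.replicas
	readyReplicas := int64(2)    // as read from .status.readyReplicas

	reported := fallbackMetricValue(desired, fallbackReplicas, readyReplicas) // 250
	fmt.Println(hpaDesiredReplicas(readyReplicas, reported, desired))         // 5
}
```

Note that the ceil in the HPA formula lands the result on exactly `fallbackReplicas` even when `fallbackReplicas / currentReplicas` is not a whole number.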
I preferred this approach to the alternative (which would remove the `.status.readyReplicas` field constraint) of manually counting the number of ready pods (similar to what the HPA itself does), since it'd be quite involved. If that seems like a better approach to you, I can implement it.

For full clarity, one problematic nit with this: we're dependent on the object's controller updating `.status.readyReplicas` in a timely manner. If there is a lag, then the `currentReplicas` the HPA multiplies by (which it gets by manually counting pods) would deviate from the `currentReplicas` value we divide by. If ours is less, we'd scale higher than `fallbackReplicas` for a brief while; if ours is more, we'd scale lower than `fallbackReplicas`. Either way, we should eventually stabilize at exactly `fallbackReplicas`.

A final, unrelated small change for correctness's sake (which I think should be fixed regardless of this PR): the line

    Value: *resource.NewMilliQuantity(normalisationValue*1000*replicas, resource.DecimalSI),

has been changed to

    Value: *resource.NewMilliQuantity(int64(normalisationValue*1000*replicas), resource.DecimalSI),

with `normalisationValue` now being a `float64` instead of an int. This prevents casting to int early, which would discard fractions in `normalisationValue` that the multiplication by 1000 would otherwise preserve.

Imagine `normalisationValue = 2.5` and `fallbackReplicas = 10`. Previously, `Value = 2*1000*10 = 20000`; now, `Value = int64(2.5*1000*10) = int64(25000.0) = 25000`. Obviously the former would cause the HPA not to scale to `fallbackReplicas` exactly.
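The truncation difference can be reproduced in a few lines. A standalone sketch, not the actual KEDA code:

```go
package main

import "fmt"

func main() {
	normalisationValue := 2.5 // fallback value per replica, as a float64
	replicas := int64(10)

	// Casting to int64 before multiplying discards the 0.5 fraction.
	early := int64(normalisationValue) * 1000 * replicas

	// Casting after multiplying by 1000 preserves it in milli-units.
	late := int64(normalisationValue * 1000 * float64(replicas))

	fmt.Println(early, late) // 20000 25000
}
```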
For the unchecked list item, if this looks good, I'll open a PR to the docs repo.
Checklist
Fixes #4205