What happened?
Problem
Deleting a PCSG with 5000 PodCliques takes 20 minutes due to low concurrency settings.
Current bottleneck:
- concurrentSyncs: 3 (only 3 PodCliques reconcile at once)
- Manual deletion + finalizer cleanup for each PodClique
- API rate limiting (QPS: 100, Burst: 150)
Scale impact:
- 5000 PodCliques: 20+ minutes
Tested with this YAML:
# Scale Test: ~5000 pods (625 prefill + 624 decode ScalingGroups + 3 frontend)
# Large scale test for validating extreme-scale behavior on KWOK cluster
---
apiVersion: grove.io/v1alpha1
kind: PodCliqueSet
metadata:
  name: scale-test-5000
  labels:
    app: scale-test-5000
spec:
  replicas: 1
  template:
    podCliqueScalingGroups:
      - name: prefill
        replicas: 625
        minAvailable: 625
        cliqueNames:
          - pleader
          - pworker
      - name: decode
        replicas: 625
        minAvailable: 625
        cliqueNames:
          - dleader
          - dworker
    cliques:
      - name: frontend
        labels:
          kai.scheduler/queue: default
        spec:
          roleName: frontend
          replicas: 3
          minAvailable: 3
          podSpec:
            nodeSelector:
              type: kwok
            tolerations:
              - key: fake-node
                operator: Equal
                value: "true"
                effect: NoSchedule
            containers:
              - name: frontend
                image: registry:5001/nginx:alpine-slim
                resources:
                  requests:
                    memory: 30Mi
                    cpu: 100m
      - name: pleader
        labels:
          kai.scheduler/queue: default
        spec:
          roleName: pleader
          replicas: 1
          minAvailable: 1
          podSpec:
            nodeSelector:
              type: kwok
            tolerations:
              - key: fake-node
                operator: Equal
                value: "true"
                effect: NoSchedule
            containers:
              - name: pleader
                image: registry:5001/nginx:alpine-slim
                resources:
                  requests:
                    memory: 30Mi
                    cpu: 100m
      - name: pworker
        labels:
          kai.scheduler/queue: default
        spec:
          roleName: pworker
          replicas: 3
          minAvailable: 3
          podSpec:
            nodeSelector:
              type: kwok
            tolerations:
              - key: fake-node
                operator: Equal
                value: "true"
                effect: NoSchedule
            containers:
              - name: pworker
                image: registry:5001/nginx:alpine-slim
                resources:
                  requests:
                    memory: 30Mi
                    cpu: 100m
      - name: dleader
        labels:
          kai.scheduler/queue: default
        spec:
          roleName: dleader
          replicas: 1
          minAvailable: 1
          podSpec:
            nodeSelector:
              type: kwok
            tolerations:
              - key: fake-node
                operator: Equal
                value: "true"
                effect: NoSchedule
            containers:
              - name: dleader
                image: registry:5001/nginx:alpine-slim
                resources:
                  requests:
                    memory: 30Mi
                    cpu: 100m
      - name: dworker
        labels:
          kai.scheduler/queue: default
        spec:
          roleName: dworker
          replicas: 3
          minAvailable: 3
          podSpec:
            nodeSelector:
              type: kwok
            tolerations:
              - key: fake-node
                operator: Equal
                value: "true"
                effect: NoSchedule
            containers:
              - name: dworker
                image: registry:5001/nginx:alpine-slim
                resources:
                  requests:
                    memory: 30Mi
                    cpu: 100m
Solutions
Option 1: Increase Concurrency
controllers:
  podClique:
    concurrentSyncs: 50 # 16x increase
runtimeClientConnection:
  qps: 500
  burst: 1000
Option 2: GC-Based Deletion
Remove the manual deletion and let the Kubernetes garbage collector cascade the deletionTimestamp to the child PodCliques automatically.
Changes needed:
- Remove the deletePodCliqueScalingGroupResources() step
- PCSG finalizer removes itself immediately (don't wait for children)
- Trust PodClique finalizers for cleanup
Result: Instant cascade, scales to any size
Tradeoff: Requires finalizer logic changes
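The intended flow for Option 2 can be illustrated with a small in-memory model. This is a toy sketch, not Grove or Kubernetes code: `object`, `store`, `markDeleted`, `clearFinalizers`, `gcPass` and the finalizer strings are all hypothetical names. It assumes each PodClique carries an owner reference to its PCSG and that deletion uses background cascading, where dependents are collected once the owner is gone.

```go
package main

import "fmt"

// object is a toy stand-in for a Kubernetes resource (not a real API type).
type object struct {
	owner      string   // name of the owning object, "" if none
	finalizers []string // pending finalizers
	deleting   bool     // models a set deletionTimestamp
}

// store is a toy in-memory API server keyed by object name.
type store map[string]*object

// markDeleted models a delete request: an object with finalizers is only
// marked as deleting, an object without finalizers is removed at once.
func (s store) markDeleted(name string) {
	o, ok := s[name]
	if !ok {
		return
	}
	if len(o.finalizers) == 0 {
		delete(s, name)
		return
	}
	o.deleting = true
}

// clearFinalizers models a controller finishing its cleanup; an object that
// is already deleting disappears once its finalizers are gone.
func (s store) clearFinalizers(name string) {
	o, ok := s[name]
	if !ok {
		return
	}
	o.finalizers = nil
	if o.deleting {
		delete(s, name)
	}
}

// gcPass models one garbage-collector sweep: every object whose owner no
// longer exists receives a delete request. It returns how many were targeted.
func (s store) gcPass() int {
	n := 0
	for name, o := range s {
		if o.owner == "" || o.deleting {
			continue
		}
		if _, alive := s[o.owner]; !alive {
			s.markDeleted(name)
			n++
		}
	}
	return n
}

func main() {
	s := store{"pcsg": {finalizers: []string{"pcsg-finalizer"}}}
	for i := 0; i < 5000; i++ {
		s[fmt.Sprintf("pclq-%d", i)] = &object{owner: "pcsg", finalizers: []string{"pclq-finalizer"}}
	}
	s.markDeleted("pcsg")     // user deletes the PCSG
	s.clearFinalizers("pcsg") // Option 2: do not wait for children
	fmt.Println("PodCliques marked for deletion:", s.gcPass())
}
```

In this model the PCSG reconciler does a constant amount of work regardless of child count, and each PodClique's own finalizer remains responsible for its cleanup. How quickly the real garbage collector marks the dependents is not modeled here.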
Impact
Blocks scale testing and production use of large workloads.
Related to: #405
What did you expect to happen?
Deletion should complete in seconds, not 20 minutes.
Environment
- Kubernetes version: 1.32
- Grove version: main
- Scheduler details: default
- Cloud provider or hardware configuration:
- Tools that you are using Grove together with: KWOK nodes
- Anything else that is relevant: