
Slow Deletion at scale #423

@Ronkahn21

What happened?

Problem

Deleting a PodCliqueScalingGroup (PCSG) with 5000 PodCliques takes 20+ minutes, largely because of low concurrency settings.

Current bottlenecks:

  • concurrentSyncs: 3 - only 3 PodCliques reconcile at once
  • Manual deletion + finalizer cleanup for each PodClique
  • API rate limiting (QPS: 100, Burst: 150)

Scale impact:

  • 5000 PodCliques: 20+ minutes
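
A back-of-the-envelope check of the figures above, assuming the work is spread evenly across the workers and all of them stay busy. Only the three input numbers come from this report; the variable names are illustrative.

```python
# Inputs taken from this report.
pod_cliques = 5000        # objects to delete
total_seconds = 20 * 60   # observed wall-clock time (20 minutes)
workers = 3               # concurrentSyncs

# Observed aggregate throughput.
per_second = pod_cliques / total_seconds

# Implied time one worker spends per PodClique, assuming all workers stay busy.
seconds_per_item = workers * total_seconds / pod_cliques

print(f"{per_second:.2f} PodCliques/s, {seconds_per_item:.2f} s per PodClique per worker")
# prints "4.17 PodCliques/s, 0.72 s per PodClique per worker"
```

At roughly 4 deletions per second, the client limit of 100 QPS would only be reached if each PodClique deletion issued about 24 API calls, so under these assumptions the worker count looks like the tighter constraint.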

To reproduce, test with this YAML:

# Scale Test: ~5000 pods (625 prefill + 625 decode ScalingGroups + 3 frontend)
# Large scale test for validating extreme-scale behavior on KWOK cluster
---
apiVersion: grove.io/v1alpha1
kind: PodCliqueSet
metadata:
  name: scale-test-5000
  labels:
    app: scale-test-5000
spec:
  replicas: 1
  template:
    podCliqueScalingGroups:
      - name: prefill
        replicas: 625
        minAvailable: 625
        cliqueNames:
          - pleader
          - pworker
      - name: decode
        replicas: 625
        minAvailable: 625
        cliqueNames:
          - dleader
          - dworker
    cliques:
      - name: frontend
        labels:
          kai.scheduler/queue: default
        spec:
          roleName: frontend
          replicas: 3
          minAvailable: 3
          podSpec:
            nodeSelector:
              type: kwok
            tolerations:
              - key: fake-node
                operator: Equal
                value: "true"
                effect: NoSchedule
            containers:
              - name: frontend
                image: registry:5001/nginx:alpine-slim
                resources:
                  requests:
                    memory: 30Mi
                    cpu: 100m
      - name: pleader
        labels:
          kai.scheduler/queue: default
        spec:
          roleName: pleader
          replicas: 1
          minAvailable: 1
          podSpec:
            nodeSelector:
              type: kwok
            tolerations:
              - key: fake-node
                operator: Equal
                value: "true"
                effect: NoSchedule
            containers:
              - name: pleader
                image: registry:5001/nginx:alpine-slim
                resources:
                  requests:
                    memory: 30Mi
                    cpu: 100m
      - name: pworker
        labels:
          kai.scheduler/queue: default
        spec:
          roleName: pworker
          replicas: 3
          minAvailable: 3
          podSpec:
            nodeSelector:
              type: kwok
            tolerations:
              - key: fake-node
                operator: Equal
                value: "true"
                effect: NoSchedule
            containers:
              - name: pworker
                image: registry:5001/nginx:alpine-slim
                resources:
                  requests:
                    memory: 30Mi
                    cpu: 100m
      - name: dleader
        labels:
          kai.scheduler/queue: default
        spec:
          roleName: dleader
          replicas: 1
          minAvailable: 1
          podSpec:
            nodeSelector:
              type: kwok
            tolerations:
              - key: fake-node
                operator: Equal
                value: "true"
                effect: NoSchedule
            containers:
              - name: dleader
                image: registry:5001/nginx:alpine-slim
                resources:
                  requests:
                    memory: 30Mi
                    cpu: 100m
      - name: dworker
        labels:
          kai.scheduler/queue: default
        spec:
          roleName: dworker
          replicas: 3
          minAvailable: 3
          podSpec:
            nodeSelector:
              type: kwok
            tolerations:
              - key: fake-node
                operator: Equal
                value: "true"
                effect: NoSchedule
            containers:
              - name: dworker
                image: registry:5001/nginx:alpine-slim
                resources:
                  requests:
                    memory: 30Mi
                    cpu: 100m
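
A quick sanity check of how many objects this manifest should produce. The replica counts are copied from the YAML above; the counting rule (each scaling-group replica creates one PodClique per entry in cliqueNames, and the standalone frontend clique is a single PodClique) is an assumption about Grove's behavior, not something confirmed in this report.

```python
# Replica counts copied from the manifest above (PodCliqueSet replicas: 1).
scaling_groups = {
    "prefill": {"replicas": 625, "cliques": {"pleader": 1, "pworker": 3}},
    "decode": {"replicas": 625, "cliques": {"dleader": 1, "dworker": 3}},
}
standalone_cliques = {"frontend": 3}

# Assumption: each scaling-group replica creates one PodClique per cliqueName.
pcsg_replicas = sum(g["replicas"] for g in scaling_groups.values())
pod_cliques = (
    sum(g["replicas"] * len(g["cliques"]) for g in scaling_groups.values())
    + len(standalone_cliques)
)
pods = (
    sum(g["replicas"] * sum(g["cliques"].values()) for g in scaling_groups.values())
    + sum(standalone_cliques.values())
)

print(pcsg_replicas, pod_cliques, pods)
# prints "1250 2501 5003"
```

If that counting rule holds, the manifest yields about 5000 pods spread over roughly 2500 PodCliques, so the 5000 figure quoted in this issue may refer to pods.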

Solutions

Option 1: Increase Concurrency

controllers:
  podClique:
    concurrentSyncs: 50  # ~17x increase over the current 3
runtimeClientConnection:
  qps: 500
  burst: 1000
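
A rough estimate of what these settings could buy, as a sketch only. It assumes perfectly linear scaling with worker count and a constant per-item cost, which a real API server will not guarantee. The function `estimated_seconds` and the value `calls_per_item=4` are hypothetical placeholders, not measurements.

```python
def estimated_seconds(items, workers, seconds_per_item, qps, calls_per_item):
    """Lower bound on wall-clock time: the slower of the worker-bound
    estimate and the client rate-limit-bound estimate."""
    worker_bound = items * seconds_per_item / workers
    qps_bound = items * calls_per_item / qps
    return max(worker_bound, qps_bound)

# 0.72 s per item is what 5000 items in 20 minutes on 3 workers implies.
# calls_per_item=4 is a placeholder, not a measured value.
before = estimated_seconds(5000, workers=3, seconds_per_item=0.72, qps=100, calls_per_item=4)
after = estimated_seconds(5000, workers=50, seconds_per_item=0.72, qps=500, calls_per_item=4)

print(round(before), round(after))
# prints "1200 72"
```

Under these assumptions the deletion would drop from about 20 minutes to a little over a minute, which is faster but still grows linearly with the number of PodCliques.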

Option 2: GC-Based Deletion

Remove the manual deletion step and let the Kubernetes garbage collector cascade the deletion (setting deletionTimestamp on the children) automatically.

Changes needed:

  • Remove deletePodCliqueScalingGroupResources() step
  • PCSG finalizer removes itself immediately (don't wait for children)
  • Trust PodClique finalizers for cleanup
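
The intended flow can be illustrated with a toy in-memory model of background cascading deletion. This is not Grove code or the Kubernetes API: the `Store` and `Obj` classes, the object names, and the finalizer names are all made up for illustration, and the model assumes the PodCliques carry an owner reference to the PCSG.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Obj:
    name: str
    owner: Optional[str] = None
    finalizers: set = field(default_factory=set)
    deleting: bool = False  # stands in for a set deletionTimestamp


class Store:
    def __init__(self):
        self.objs = {}

    def add(self, obj):
        self.objs[obj.name] = obj

    def delete(self, name):
        """Mark for deletion; the object is removed once no finalizers remain."""
        obj = self.objs.get(name)
        if obj is None:
            return
        obj.deleting = True
        self._collect(name)

    def remove_finalizer(self, name, finalizer):
        self.objs[name].finalizers.discard(finalizer)
        self._collect(name)

    def _collect(self, name):
        obj = self.objs.get(name)
        if obj is None or not obj.deleting or obj.finalizers:
            return
        del self.objs[name]
        # Background cascade: once the owner is gone, delete its dependents.
        for child in [o.name for o in self.objs.values() if o.owner == name]:
            self.delete(child)


store = Store()
store.add(Obj("pcsg", finalizers={"pcsg-finalizer"}))
for i in range(3):
    store.add(Obj(f"pclq-{i}", owner="pcsg", finalizers={"pclq-finalizer"}))

store.delete("pcsg")
# The owner drops its finalizer without waiting for its children.
store.remove_finalizer("pcsg", "pcsg-finalizer")

marked = sorted(o.name for o in store.objs.values() if o.deleting)
print(marked)
# prints "['pclq-0', 'pclq-1', 'pclq-2']"
```

In this model the owner disappears as soon as its finalizer is removed, every child is marked for deletion in one pass, and each child is then cleaned up independently when its own finalizer is removed.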

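Expected result: the cascade starts immediately, and the controller no longer has to delete each child itself, so deletion should scale to much larger PCSGs.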

Tradeoff: Requires finalizer logic changes

Impact

Blocks scale testing and production use of large workloads.

Related to: #405

What did you expect to happen?

Deletion should complete in seconds, not 20 minutes.

Environment

  • Kubernetes version: 1.32
  • Grove version: main
  • Scheduler details: default
  • Cloud provider or hardware configuration
  • Tools that you are using Grove together with: KWOK nodes
  • Anything else that is relevant
