A new direction: my recent focus has been on inference. On one side I have been learning vLLM (mostly through the blog and Office Hours so far; I have not read much of the code yet) and also looking at Ray and PyTorch. The other major topic is AI orchestration: I wrote a short summary of how to choose an inference orchestration solution, AIBrix, Kthena, or Dynamo? There is no standard here yet, but this layer is actually quite thin; the real work is integrating with schedulers and the rest of the ecosystem rather than standardizing it.
At the end of this year I went back over the various focus areas I had noted before, and https://github.com/pacoxu/AI-Infra is where I keep studying AI Infra. Many of the articles on this public account come from that accumulated AI Infra learning. I have also found that this definition of "AI Infra" is not accepted by everyone, especially by people who started out on the AI side; the concept here is closer to "Cloud Native AI Infra".
Finally, let me turn the advice I give others onto myself: for many things, once you have collected enough information you can make the decision. The time spent collecting information can be stretched out, but the time spent deciding needs to be bounded, so that there is time left to actually get it done. Last but not least, no regrets once the move is made. Be patient and stick with your choices (the cost of reversing a decision is usually not low, though it is always an option, so persist to a certain point before talking about giving up). Agonizing over a choice only proves that you have options.
AI anxiety comes partly from how fast AI is developing: new technical articles and projects spring up like mushrooms. Learning speed falls far behind the pace of AI itself, and the gap feels like it keeps widening. Of course, AI can be used to speed up learning. In my own public account practice, I collect the material and screenshots (or generate some illustrations), hand the AI the key links and content, and let it organize the wording, which makes publishing an article very fast.
On the other hand, the AI giants, especially on the model side and in the public cloud, see many of their optimizations amplified by scale, and new frontier models are essentially monopolized by them. This goes beyond models: in other AI niches, small players will be increasingly disadvantaged. Especially after toolchains and workflows such as MCP and A2A arrived, Agent capabilities have been greatly extended, and higher "walls" will appear in more and more domains. Along the way, more and more jobs look replaceable: AI coding squeezes the space for junior programmers on one hand, while giving product managers and non-programmers much more room for imagination on the other, so perhaps this market of "generalized programmers" will actually expand. I hope AI creates more professions and new momentum rather than destroying more.
How do we adapt to the AI era? One rough idea is to become an efficient Agent ourselves: an efficient Agent is, in effect, an efficient human intelligent agent.
My grandparents' generation farmed the land (anxious about having enough to eat), my parents' generation mostly worked in factories and schools (anxious about instability), my own generation is mostly white collar (anxious about competition), and in the future perhaps every kind of work will be shaped by AI (enormous uncertainty). In just a few decades, our thinking and our rules have struggled to keep up with technological change. The planet, nations, and large enterprises all feel like enormous machines, much like the Kubernetes community: after dozens of iterations or more, just as the core has stabilized, this AI wave brings major changes that still have to be absorbed. Every stage lasts three to five years; the focus of our anxiety shifts each time, but the anxiety itself seems hard to avoid.
Agones brings dedicated game server hosting to Kubernetes, enabling multiplayer gaming infrastructure with cloud-native scalability and management. This blog explores Agones as the project applies to join the CNCF Sandbox.
Introduction
As the gaming industry grows rapidly, the demand for scalable, reliable dedicated game server infrastructure has become critical. Agones is an open-source platform built on Kubernetes that addresses this need by providing a specialized solution for hosting, running, and scaling dedicated game servers.
Agones, derived from the Greek word “agōn” meaning “contest” or “competition at games”, transforms Kubernetes into a powerful platform for managing game server workloads with the same cloud-native principles used for traditional applications.
Project Status: Agones has applied to join the CNCF Sandbox (github.com/cncf/sandbox/issues/440), marking an important step in bringing gaming workloads into the cloud-native ecosystem.
What is Agones?
Agones is a library for hosting, running, and scaling dedicated game servers on Kubernetes. It replaces bespoke or proprietary cluster management solutions with Kubernetes-native APIs and controllers.
Core Concept: Dedicated game servers are stateful, ephemeral workloads that differ significantly from typical web applications. Each game session requires its own isolated server process, must maintain consistent network identity, and needs specialized lifecycle management. Agones extends Kubernetes to handle these unique requirements through Custom Resource Definitions (CRDs) and controllers.
Key Features
GameServer CRD: Define individual game servers declaratively using YAML or the Kubernetes API, complete with health checking and connection information
Fleet Management: Manage large groups of game servers as Fleets, similar to Kubernetes Deployments but optimized for game server workloads
Autoscaling: Native integration with Kubernetes cluster autoscaling, allowing Fleets to scale based on game server demand
Client SDKs: SDKs for multiple languages (Go, C#, C++, Rust, Node.js, REST) enabling game servers to communicate with the Agones control plane
Lifecycle Management: Automatic health checks, graceful shutdown handling, and state management for game server processes
Metrics and Observability: Game server-specific metrics exports and dashboards for operations teams
Architecture and Design
Agones extends Kubernetes with custom controllers and resources specifically designed for game server workloads:
Custom Resources
GameServer: Represents a single dedicated game server instance with health status, network ports, and connection information
Fleet: Manages groups of GameServers, providing replica management, rolling updates, and scaling capabilities
FleetAutoscaler: Automates Fleet scaling based on buffer policies, webhook policies, or counter/list-based policies
GameServerAllocation: Enables matchmakers to atomically allocate Ready GameServers from a Fleet for player connections
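For illustration, a minimal GameServer manifest might look like the sketch below, based on the Agones quickstart; the image tag and port values are illustrative and should be checked against the project's current examples:

apiVersion: agones.dev/v1
kind: GameServer
metadata:
  name: simple-game-server
spec:
  ports:
  - name: default
    portPolicy: Dynamic      # Agones assigns a host port from its configured range
    containerPort: 7654      # port the game server process listens on
  health:
    initialDelaySeconds: 30  # grace period before health checks start
  template:
    spec:
      containers:
      - name: simple-game-server
        image: us-docker.pkg.dev/agones-images/examples/simple-game-server:0.27  # example image from the Agones docs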
How It Works
Deployment: Operators define GameServers or Fleets using Kubernetes manifests
Lifecycle Management: Agones controllers create pods and manage their lifecycle based on game server state
Ready State: Game servers use the Agones SDK to mark themselves Ready when accepting connections
Allocation: Matchmaking systems request GameServer allocation via the Kubernetes API
Session Management: Game servers notify Agones when sessions end, triggering cleanup
Autoscaling: FleetAutoscalers monitor Fleet status and adjust replicas to maintain desired buffer or respond to custom policies
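As a sketch of the allocation step, a matchmaker can submit a GameServerAllocation that picks a Ready server from a Fleet. The Fleet name here is hypothetical, and the selector field name should be verified against the Agones version in use:

apiVersion: allocation.agones.dev/v1
kind: GameServerAllocation
spec:
  selectors:
  - matchLabels:
      agones.dev/fleet: simple-fleet   # allocate a Ready GameServer from this Fleet (hypothetical name)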
Use Cases and Production Adoption
Agones is designed for multiplayer gaming scenarios requiring dedicated game servers:
Session-based multiplayer games: FPS, MOBA, battle royale games where each match runs on a dedicated server
Persistent game worlds: MMO game zones or shards that require long-lived server processes
Match-based esports: Competitive gaming infrastructure requiring consistent server performance
Cross-platform gaming: Unified infrastructure for console, PC, and mobile multiplayer experiences
The project is already used in production by major gaming companies and has proven its reliability at scale. The CNCF sandbox application notes that “this project is already used in production by many” organizations.
Why CNCF?
According to the CNCF Sandbox application:
Since Agones is tightly coupled to Kubernetes, CNCF is the logical home for the project. Agones being in the CNCF allows for a broader community contributor ecosystem.
Agones brings a new gaming offering to the CNCF landscape, representing a specific but important use case for Kubernetes. As cloud-native technologies expand into specialized domains, gaming infrastructure represents a significant workload category with unique requirements.
Cloud-Native Integration
Agones integrates directly with core CNCF projects:
Kubernetes: Built as a Kubernetes controller with CRDs
Prometheus: Metrics exports for monitoring game server health and performance
Helm: Installation and configuration via Helm charts
Container runtimes: Works with any Kubernetes-compatible container runtime
Project Governance and Community
Agones operates as a vendor-neutral open-source project:
License: Apache 2.0
Code of Conduct: Contributor Covenant
Governance: Clear contribution guidelines and ownership model
Community Channels: Active Slack workspace, mailing list, regular community meetings
Maintained by: Originally created by Google Cloud, now community-driven with multiple maintainers
The project has comprehensive documentation, quickstart guides, and example implementations for developers getting started with game server hosting on Kubernetes.
Similar Projects and Ecosystem
Within the Kubernetes gaming ecosystem, OpenKruise’s kruise-game (github.com/openkruise/kruise-game) provides similar capabilities. Both projects demonstrate growing interest in gaming workloads on Kubernetes.
Agones’ application to CNCF Sandbox represents an opportunity to establish standards and best practices for game server orchestration across the cloud-native community.
Vision and Roadmap
Agones continues active development with regular releases following a documented release process. The project roadmap focuses on:
Enhancing autoscaling capabilities with more sophisticated policies
Improving observability and debugging tools for game server operations
Expanding SDK support for additional programming languages and engines
Performance optimizations for larger-scale deployments
Better integration with matchmaking and lobby systems
The project aims to make dedicated game server hosting as straightforward and reliable as deploying stateless web applications, while respecting the unique requirements of real-time gaming workloads.
Getting Started
For developers interested in exploring Agones:
Documentation: Comprehensive guides at agones.dev/site/docs/
Quick Start: Install Agones on a Kubernetes cluster and deploy a simple game server
Examples: Multiple example game server implementations in the repository
Community: Join the Agones Slack and mailing list for support and discussion
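A typical quick start uses the project's Helm chart; the repository URL and chart name below follow the Agones documentation at the time of writing and should be verified against agones.dev before use:

# Add the Agones chart repository and install into its own namespace
helm repo add agones https://agones.dev/chart/stable
helm repo update
helm install agones agones/agones --namespace agones-system --create-namespace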
Agones represents the maturation of gaming infrastructure into the cloud-native era, bringing the operational benefits of Kubernetes to one of the most demanding real-time workload types.
Conclusion
Agones transforms Kubernetes into a powerful platform for dedicated game server hosting, addressing the unique challenges of multiplayer gaming infrastructure. As it applies to join the CNCF Sandbox, the project demonstrates how cloud-native technologies can adapt to specialized workload requirements while maintaining Kubernetes-native principles.
For gaming companies building multiplayer experiences and infrastructure teams managing game servers, Agones provides a proven, production-ready solution that leverages the full ecosystem of cloud-native tools and practices.
Note: The content in this article is based on currently available public information and is intended for technical reference only. The effectiveness of each solution depends heavily on your specific workload, infrastructure, and ecosystem integration. The architectural affiliations and early design choices mentioned here do not determine their future direction. In practice, community activity, openness, and long-term evolution are often more important factors. Please evaluate and choose based on your own scenario.
Introduction
The landscape of open-source inference orchestration for Large Language Models (LLMs) has evolved rapidly in 2025. Multiple projects have emerged to address the challenges of deploying and scaling LLM inference workloads on Kubernetes, each with its own approach to workload management, resource orchestration, and performance optimization.
This blog post provides an overview of the current inference orchestration solutions, examines the convergence trends in the ecosystem, and raises important questions about when Prefill-Decode (PD) disaggregation truly provides value.
The Current Landscape
Rapid Development, Gradual Convergence
The inference orchestration space is characterized by:
Many implementations: Multiple projects solving similar problems
Different architectural choices: Varying approaches to workload management
Shared goals: All aim to optimize LLM inference at scale
Emerging patterns: Common solutions beginning to emerge
Despite the diversity, we’re seeing convergence around key patterns: LeaderWorkerSet (LWS)-based architectures, intelligent routing, and disaggregated serving models.
Workload Orchestration Solutions
1. Dual LWS Architecture
llm-d implements a dual LeaderWorkerSet architecture for Prefill-Decode disaggregation:
Two LWS instances: Separate LWS for prefill and decode workers
KServe integration: Deep integration with KServe for model serving
LMCache support: Efficient KV cache management across workers
Routing sidecar: Intelligent request routing and cache optimization
Why dual LWS? This architecture enables independent scaling and resource optimization for each phase while maintaining coordination through the leader-worker pattern.
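As a rough sketch of what one half of the dual-LWS pattern looks like (not llm-d's actual manifests; the names, image, and sizes are illustrative), a decode LeaderWorkerSet could be declared as follows, with a second, separate LWS created for prefill:

apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: decode              # a separate "prefill" LWS would be defined alongside this one
spec:
  replicas: 2               # number of decode groups
  leaderWorkerTemplate:
    size: 4                 # 1 leader + 3 workers per group
    workerTemplate:
      spec:
        containers:
        - name: vllm-decode
          image: vllm/vllm-openai:latest   # illustrative serving image
          resources:
            limits:
              nvidia.com/gpu: 1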
2. Serving Group: Volcano Kthena
Kthena takes a different approach with its Serving Group concept:
No dual LWS: Kthena intentionally avoids the dual LWS pattern
Gang scheduling integration: Leverages Volcano’s gang scheduling capabilities
Reduced layering: Eliminates the StatefulSet/Pod layer complexity
Direct integration: Native integration with Volcano scheduler
Why not LWS? The Kthena team found that integrating with Volcano’s gang scheduling required a different architecture. The dual LWS, StatefulSet, and Pod layering added complexity without clear benefits for their use case.
This design choice reflects a key insight: the best orchestration solution depends on your existing infrastructure and scheduling requirements.
3. StormService: AIBrix
AIBrix StormService provides specialized container lifecycle management for P/D disaggregation:
P/D lifecycle management: Fine-grained control over prefill and decode containers
Multi-mode support: TP, PP, single GPU, and P/D disaggregation
StormService and RoleSet CRDs: Custom resources for P/D orchestration
Enterprise features: Multi-tenancy, routing, and observability
LWS-inspired: Incorporates proven patterns from LeaderWorkerSet
Resource-aware scheduling: Optimizes batch scheduling based on resources
Batch optimization: Intelligent batching strategies for throughput
P/D support: Enables disaggregated prefill and decode workloads
Convergence Trends
Common Patterns Emerging
Despite different implementations, several patterns are converging:
| Pattern | llm-d | Kthena | AIBrix | Dynamo | RBG |
|---|---|---|---|---|---|
| LWS-based | ✓ (dual) | ✗ | ✗ | ✓ (option) | ✓ (inspired) |
| P/D disaggregation | ✓ | ✓ | ✓ | ✓ | ✓ |
| Intelligent routing | ✓ | ✓ | ✓ | ✓ | ✓ |
| KV cache management | LMCache | Native | Distributed | Native | Native |
Why So Many Implementations?
The diversity reflects different optimization goals:
Scheduling integration: Kthena needs Volcano gang scheduling directly
Enterprise features: AIBrix focuses on multi-tenancy and observability
Performance focus: Dynamo optimizes for NVIDIA hardware
Simplicity: RBG provides a lightweight LWS-inspired approach
Production-readiness: llm-d demonstrates a complete reference implementation
The PD Disaggregation Question
At KCD Hangzhou 2025, Wen Yuan Yu’s keynote “Kubernetes Is Born for Service Resource Orchestration—MaaS Changes Everything” raised an important question about PD-separation:
“Achieving strong production gains from PD-separation is very difficult. While stress testing can show great results, in real dynamic environments it becomes much harder. Over-provisioning Decode introduces significant challenges.”
This observation directly challenges the assumption that PD-separation is always beneficial.
Does PD Disaggregation Always Provide Value?
One slide from the same keynote put the question bluntly:
"PD-Disaggregate Role Scheduling • Not So Sure? (Our answer is Data Plane!)"
When PD Disaggregation Helps
PD disaggregation provides clear benefits when:
Long prefill, short decode: Input prompts are much longer than outputs
High concurrency: Many simultaneous requests need serving
Heterogeneous hardware: Different GPU types for different phases
SLA-driven scheduling: Different latency requirements (TTFT vs TPOT)
Low KV cache transfer overhead: the network cost of moving KV cache between prefill and decode stays below the computation savings
The Data Plane Perspective
The “Data Plane” answer suggests that the value of PD disaggregation depends on where bottlenecks actually exist. Before implementing complex orchestration:
Profile your workload: Understand where time is spent
Measure KV cache transfer costs: Network overhead matters
Consider simpler alternatives: TP/DP without disaggregation
Evaluate operational complexity: More components = more failure modes
Configuration Optimization: AIConfigurator
Choosing the right P/D configuration is complex. NVIDIA’s AIConfigurator helps optimize disaggregated deployment configurations:
What AIConfigurator Does
Configuration space search: Evaluates thousands of P/D combinations
Predictive optimization: Estimates performance before deployment
Resource efficiency: Maximizes GPU utilization with SLA guarantees
Recommendations
For New Deployments
Start simple: Begin with monolithic serving (no P/D disaggregation)
Profile first: Understand your workload characteristics
Use AIConfigurator: Let data guide configuration decisions
Add complexity gradually: Introduce P/D only when benefits are clear
For Existing Infrastructure
| If you use… | Consider… |
|---|---|
| Volcano | Kthena (native integration) |
| KServe | llm-d (deep integration) |
| vLLM | AIBrix (vLLM ecosystem) |
| NVIDIA GPUs | Dynamo (NVIDIA optimization) |
| SGLang | RBG (LWS-inspired, lightweight) |
Key Questions Before Adopting PD Disaggregation
Is your prefill time >> decode time? If not, disaggregation may not help.
Can your network handle KV cache transfer? Network overhead can eliminate gains.
Do you need independent scaling? If P and D scale together, keep them together.
Is operational complexity acceptable? More components = more failure modes.
Conclusion
The inference orchestration landscape is diverse but converging. Key takeaways:
Multiple solutions exist because different infrastructure has different needs
LWS-based patterns are popular but not universal (Kthena’s Serving Group shows alternatives)
PD disaggregation is not always valuable – profile your workload first
Tools like AIConfigurator help navigate the complex configuration space
Start simple, add complexity when needed based on actual measurements
The future will likely see further consolidation around proven patterns, but the current diversity reflects healthy experimentation in a rapidly evolving field.
Agent Sandbox provides a secure, isolated, and efficient execution environment for AI agents. This blog explores the project, its integration with gVisor and Kata Containers, and future trends.
Introduction
As AI agents become more prevalent in enterprise applications, the need for secure execution environments has become critical. Agent Sandbox is a new Kubernetes project under SIG Apps that addresses this challenge by providing a standardized, declarative API for managing isolated, stateful, singleton workloads—ideal for AI agent runtimes.
Key Features:
Kubernetes Primitive Sandbox CRD and Controller: A native Kubernetes abstraction for managing sandboxed workloads
Ready to Scale: Support for thousands of concurrent sandboxes while achieving sub-second latency
Developer-Focused SDK: Easy integration into agent frameworks and tools
Project Overview
Core: Sandbox CRD
The Sandbox Custom Resource Definition (CRD) is the heart of agent-sandbox. It provides a declarative API for managing a single, stateful pod with:
Stable Identity: Each Sandbox has a stable hostname and network identity
Persistent Storage: Sandboxes can be configured with persistent storage that survives restarts
Lifecycle Management: The controller manages pod lifecycle including creation, scheduled deletion, pausing, and resuming
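The sketch below only illustrates the declarative shape described above; the API group, version, and field names are assumptions for illustration, so check the agent-sandbox repository for the actual schema:

apiVersion: agents.x-k8s.io/v1alpha1    # assumed API group/version; see the project repo for the real one
kind: Sandbox
metadata:
  name: my-agent-sandbox
spec:
  podTemplate:                          # assumed field: pod template for the single sandboxed pod
    spec:
      runtimeClassName: gvisor          # run the sandbox under an isolation runtime
      containers:
      - name: agent-runtime
        image: python:3.12-slim         # illustrative agent runtime image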
Extensions
The project provides additional CRDs for advanced use cases:
SandboxTemplate: Reusable templates for creating Sandboxes
SandboxClaim: Allows users to create Sandboxes from templates
SandboxWarmPool: Manages a pool of pre-warmed Sandbox Pods for fast allocation (achieving sub-second startup latency)
Agent Sandbox is designed to be vendor-neutral, supporting various runtimes to provide enhanced security and isolation. The two primary implementations are gVisor and Kata Containers.
gVisor Integration (GKE)
Image source: KubeCon North America keynote by Jago Macleod (Google)
gVisor is an application kernel that provides an additional layer of isolation between container applications and the host kernel. It intercepts application system calls and implements them in user space.
GKE Integration Status:
Production Ready: gVisor is available as a runtime option in Google Kubernetes Engine (GKE) via the gvisor RuntimeClass
Snapshot and Resume: GKE supports snapshotting and resuming sandboxes, enabling infrastructure efficiency and sophisticated parallel executions
Performance Optimized: The gVisor team at Google has optimized the runtime for AI agent workloads with minimal overhead
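On a standard cluster this is expressed through the RuntimeClass API; on GKE Sandbox node pools the gvisor RuntimeClass is provisioned for you, so workloads only need to reference it. A minimal sketch:

# RuntimeClass mapping to the gVisor handler (pre-installed on GKE Sandbox node pools)
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: runsc
---
# A pod opting into the gVisor sandbox
apiVersion: v1
kind: Pod
metadata:
  name: sandboxed-agent
spec:
  runtimeClassName: gvisor
  containers:
  - name: agent
    image: python:3.12-slim   # illustrative image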
Kata Containers Integration
Kata Containers provides lightweight virtual machines that behave like containers but offer the security isolation of VMs. Each container runs in its own lightweight VM with a dedicated kernel.
Integration Status:
Active Development: The Kata Containers community is actively working on Agent Sandbox integration
VM-Level Isolation: Provides strong isolation through hardware virtualization
GPU Support: Kata supports GPU passthrough for AI/ML workloads
Agent Sandbox represents an important step forward in providing secure, efficient execution environments for AI agents on Kubernetes. With support for multiple isolation runtimes (gVisor and Kata Containers), standardized APIs, and a focus on developer experience, it addresses the growing need for sandboxed AI workloads in enterprise environments.
The project is actively developing under SIG Apps, and contributions from the community are welcome. Whether you’re building AI agents, development environments, or any workload requiring isolated execution, Agent Sandbox provides a Kubernetes-native solution.
Co-Evolving: When Kubernetes Features Empower the Ecosystem
In the rapidly evolving AI infrastructure landscape, a beautiful synergy is emerging: the Kubernetes community develops foundational capabilities, and downstream projects like JobSet, Ray, and LeaderWorkerSet (LWS) adopt these features to dramatically improve their efficiency. We call this Co-Evolving (协同演进) — the entire ecosystem advancing together.
Kubernetes has been introducing more AI-related capabilities recently, but realizing their full potential in AI workloads requires adaptation by other projects. Today, we’ll explore a prime example: JobSet leveraging Kubernetes In-Place Container Restart to achieve 92% faster restart times.
The Problem: Slow JobSet Restart
When a distributed training job running on JobSet needs to restart (due to transient failures, configuration updates, or checkpoint recovery), the traditional approach involves:
Delete all pods in the JobSet
Wait for pod termination to complete
Reschedule all pods through the Kubernetes scheduler
Wait for pod startup (including image pulls, init containers, etc.)
In a large-scale cluster with 5000 nodes, this process takes approximately 2 minutes and 10 seconds. For AI/ML workloads where fast recovery is critical, this overhead is significant.
The Solution: In-Place Container Restart
Kubernetes has introduced capabilities that allow containers to restart in place without recreating the pod: KEP-5307 (Container Restart Policy, v1.34+) and KEP-5532 (restarting all containers in a pod when one container exits, v1.35+). By adopting these, JobSet can restart a failed worker's containers in place instead of deleting and rescheduling every pod.
This JobSet improvement exemplifies the Co-Evolving pattern in cloud-native AI:
| Kubernetes Capability | Project Adoption | Benefit |
|---|---|---|
| In-Place Restart | JobSet | 92% faster recovery |
| Gang Scheduling (1.35) | Kueue, LWS | All-or-nothing placement |
| DRA (1.34 GA) | NVIDIA GPU Operator | Flexible device allocation |
| Workload API (1.35) | Volcano, YuniKorn | Native workload support |
As Kubernetes continues to add AI-friendly features, we expect more projects to adopt them, creating a virtuous cycle of improvement.
Getting Started
Prerequisites
Kubernetes 1.34+ (for KEP-5307)
Kubernetes 1.35+ (for KEP-5532 pod-level restart)
JobSet with in-place restart support (check latest releases)
Enable Feature Gates
# On kubelet for KEP-5307 (Container Restart Policy, 1.34+)
--feature-gates=ContainerRestartPolicy=true
# On kubelet for KEP-5532 (Restart All Containers, 1.35+)
--feature-gates=RestartAllContainersOnContainerExits=true
The JobSet in-place restart optimization demonstrates the power of Co-Evolving in the Kubernetes ecosystem. By adopting upstream Kubernetes capabilities, projects can achieve dramatic performance improvements:
92% faster restart (2m10s → 10s)
No scheduling overhead
Preserved pod identity and network
Reduced API server load
This is just one example of how the Kubernetes community and downstream projects work together to improve AI workload efficiency. As more AI-related features land in Kubernetes, we can expect even more optimizations from projects like JobSet, Ray, LWS, and others.
The future of AI infrastructure is Co-Evolving — and it’s happening now.
At KubeCon NA 2025, one theme dominated conversations in the AI/ML space: topology. Everyone is talking about topology-aware scheduling because it’s critical for optimizing AI workload performance.
Modern AI workloads, especially distributed training and high-performance inference, are extremely sensitive to hardware topology. When GPUs, NICs, CPUs, and memory are not properly aligned within the same NUMA node, PCIe root, or network fabric, performance can degrade by 30-50% or more.
Background: Current Topology Scheduling Support
Device Plugin: The Traditional Approach
Kubernetes Device Plugins have been the standard mechanism for managing hardware resources like GPUs. The Device Plugin API provides basic device discovery, advertisement, and allocation, but only coarse topology hints (NUMA affinity consumed by the Topology Manager).
DRA: The New Approach
Dynamic Resource Allocation (DRA) represents a fundamental shift in how Kubernetes handles device topology. DRA provides structured parameters that enable rich topology expression and constraint specification.
How DRA Handles Topology-Aware Scheduling
DRA uses attributes and constraints with CEL (Common Expression Language) to express topology requirements. The key mechanisms include:
Device Attributes: Each device publishes topology information
pcieRoot: PCIe hierarchy identifier
numaNode: NUMA node association
nvlinkDomain: NVLink fabric identifier
rdmaDevice: Associated RDMA NIC
Constraints: CEL expressions that enforce topology rules
Same PCIe root for GPU and NIC
Same NUMA node for CPU and memory
NVLink connectivity between GPUs
SharedID: Devices on the same topology domain get a shared identifier
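For example, a DeviceClass can use a CEL selector over published device attributes. The driver name below is illustrative; real DRA drivers publish their own names and attribute keys:

apiVersion: resource.k8s.io/v1beta1
kind: DeviceClass
metadata:
  name: nvidia-gpu
spec:
  selectors:
  - cel:
      # match devices published by the (illustrative) NVIDIA DRA driver
      expression: device.driver == "gpu.nvidia.com"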
GPU + NIC Topology Coordination
The most powerful use case for DRA topology is coordinating GPU and NIC allocation on the same PCIe root. This is critical for RDMA-based distributed training where GPU-Direct is used.
ResourceClaimTemplate with PCIe Topology Constraint Example:
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: gpu-nic-topology
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: nvidia-gpu
        count: 1
      - name: rdma-nic
        deviceClassName: rdma-nic
        count: 1
      constraints:
      # GPU and NIC must be on the same PCIe root
      - requests: ["gpu", "rdma-nic"]
        matchAttribute: pcieRoot
How this works:
The DRA scheduler evaluates available GPUs and NICs
For each candidate GPU, it finds NICs on the same PCIe root
Only allocations satisfying the constraint are considered
The matchAttribute: pcieRoot ensures both devices share the same PCIe topology
DRANET: Network Device DRA
DRANET is Google’s DRA implementation for network devices. It integrates with Kueue’s topology-aware scheduling using node labels:
# DRANET uses these labels for topology awareness
cloud.google.com/gce-topology-block
cloud.google.com/gce-topology-subblock
cloud.google.com/gce-topology-host
kubernetes.io/hostname
DRANET + NVIDIA GPU DRA can coordinate:
RDMA NICs allocated with GPUs on same PCIe fabric
Multi-NIC configurations for distributed training
Network isolation using SR-IOV VFs
CPU Micro-Topology Support
The dra-driver-cpu project is adding CPU micro-topology support including:
NUMA-aware CPU allocation
CPU pinning with topology alignment
Coordination with GPU NUMA placement
DRAConsumableCapacity: New in Kubernetes 1.34
A major advancement in DRA is the DRAConsumableCapacity feature, which allows a single device's capacity to be shared across multiple ResourceClaims, with each allocation consuming a declared slice of that capacity (useful for sharing devices such as NICs with bandwidth limits).
Topology-aware scheduling has evolved from a nice-to-have feature to a critical requirement for AI workloads. The transition from Device Plugin to DRA represents a fundamental shift in how Kubernetes manages hardware topology:
Device Plugin: Simple, established, but limited topology support
DRA: Structured attributes, CEL constraints, and cross-device coordination, at the cost of newer, still-maturing APIs
As AI workloads continue to grow in complexity, the need for sophisticated topology-aware scheduling will only increase. Whether you’re using Kueue, Volcano, or native Kubernetes scheduling, understanding topology and planning for DRA adoption is essential for optimizing your AI infrastructure.
Scheduling large workloads in Kubernetes has always been challenging. When you need to run distributed training jobs, batch processing tasks, or other multi-pod applications, the traditional pod-by-pod scheduling approach can lead to resource wastage, deadlocks, and inefficiencies. Today, we’re excited to share insights about the Workload Aware Scheduling initiative that’s transforming how Kubernetes handles multi-pod workloads.
The Problem with Traditional Pod Scheduling
In traditional Kubernetes scheduling, each pod is scheduled independently. This becomes a problem for distributed workloads such as:
Distributed ML training (e.g., PyTorch, TensorFlow multi-worker jobs)
Batch processing (e.g., Apache Spark, Ray clusters)
These applications need all (or a minimum number) of their pods running together; scheduling them one pod at a time can leave a job holding expensive resources while it waits for peers that never fit.
Gang Scheduling (Alpha)
Gang scheduling implements the all-or-nothing placement strategy:
How it works:
Waiting Phase: When pods arrive, the scheduler blocks them until minCount pods are pending
Evaluation Phase: The scheduler attempts to find suitable nodes for all pods in the gang
Decision Phase:
✅ Success: If all pods can be placed, they’re bound to nodes together
❌ Failure: If any pod can’t be placed within timeout (5 minutes), ALL pods are rejected and requeued
This prevents resource waste and ensures your distributed workload either runs completely or waits for sufficient resources.
Key benefits:
Eliminates partial scheduling deadlocks
Improves cluster utilization by freeing resources for runnable workloads
Provides predictable behavior for distributed applications
Works seamlessly with pod preemption and autoscaling
3. Opportunistic Batching (Beta)
Opportunistic Batching is a performance optimization that speeds up scheduling of identical pods without requiring any configuration changes.
How it works:
When the scheduler processes pods with identical scheduling requirements (same resources, images, affinities, etc.), it can reuse feasibility calculations and scoring results for subsequent pods in the queue.
Performance impact:
Dramatically reduces scheduling latency for large homogeneous workloads
Can improve scheduling throughput by 5-10x for batch workloads
Works transparently – no user configuration needed
Enabled by default in Kubernetes v1.35 (Beta)
Current restrictions:
Disabled for pods using topology spread constraints
Disabled for pods using Dynamic Resource Allocation (DRA)
All scheduling-relevant pod fields must be identical
Real-World Use Cases
Distributed ML Training
apiVersion: scheduling.k8s.io/v1alpha1
kind: Workload
metadata:
  name: pytorch-training
spec:
  podGroups:
  - name: workers
    policy:
      gang:
        minCount: 8  # Need 8 GPUs for distributed training
Your PyTorch distributed training job only starts when all 8 workers can be scheduled, preventing wasted GPU resources.
Spark Batch Processing
Spark jobs with gang scheduling avoid the common problem where the driver starts but executors can’t be scheduled.
Ray Clusters
Ray applications benefit from gang scheduling by ensuring the head node and worker nodes start together, enabling immediate distributed computation.
The Roadmap: What’s Coming in 1.36 and Beyond
The Workload Aware Scheduling effort has an ambitious roadmap for Kubernetes 1.36:
Planned for v1.36
Expanding Workload API: Enhanced capabilities and refinements based on alpha feedback
Auto-workload for Job, StatefulSet, JobSet: Automatic workload creation for common Kubernetes resources
Topology Aware Scheduling: Consider network and hardware topology when placing gang members
Single-cycle workload scheduling: Schedule entire gangs in a single scheduling cycle for better performance
Tree-based workload scheduling algorithm: More efficient gang placement decisions
Improved binding process: Better handling of kubelet races using nominations
Delayed preemption: Introduce nominating victims before actual eviction
Workload-level preemption: Preempt entire gangs rather than individual pods
Long-term Vision
The ultimate goal is to make Kubernetes natively understand and optimize for workload-level operations, including:
Deep integration with cluster autoscaling
Workload-aware resource quotas and limits
Better support for mixed workload types (batch + serving)
Enhanced observability for multi-pod applications
Upcoming Official Blog Post
The Kubernetes community is preparing an official blog post about Workload Aware Scheduling that will be published soon on the Kubernetes blog. Watch for kubernetes/website#53012 to be merged for the official announcement.
Getting Started
Prerequisites
Kubernetes v1.35 or later
Feature gates configured on kube-apiserver and kube-scheduler
Enable Workload API and Gang Scheduling
# On kube-apiserver
--feature-gates=GenericWorkload=true
--runtime-config=scheduling.k8s.io/v1alpha1=true
# On kube-scheduler
--feature-gates=GenericWorkload=true,GangScheduling=true
Enable Opportunistic Batching
Opportunistic Batching is enabled by default in v1.35 as a Beta feature. To disable it:
# On kube-scheduler
--feature-gates=OpportunisticBatching=false
Testing Gang Scheduling
Create a Workload resource
Create pods with workloadRef pointing to the Workload
Observe scheduling behavior in kube-scheduler logs
Monitor metrics for gang scheduling success/failure rates
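As a sketch of step 2 above, a worker pod might reference the Workload like this; the workloadRef subfields are assumptions based on the description of the alpha API, so consult the KEP and API reference for the exact schema:

apiVersion: v1
kind: Pod
metadata:
  name: pytorch-worker-0
spec:
  workloadRef:                 # alpha field linking the pod to its Workload
    name: pytorch-training     # the Workload created in step 1
    podGroup: workers          # assumed subfield naming the pod group; verify against the alpha API
  containers:
  - name: trainer
    image: pytorch/pytorch:2.4.0-cuda12.1-cudnn9-runtime   # illustrative image
    resources:
      limits:
        nvidia.com/gpu: 1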
Best Practices
Set appropriate minCount: Consider your application’s minimum viable size
Use resource requests accurately: Gang scheduling depends on accurate resource requirements
Monitor scheduling metrics: Track gang scheduling success rates and timeout events
Test with cluster autoscaling: Ensure your autoscaler can provision nodes for gangs
Plan for failure scenarios: Understand timeout behavior and retry logic
Comparison with Existing Solutions
Before native gang scheduling, users relied on:
Volcano: CNCF incubating project with gang scheduling
Kueue: Kubernetes SIG project for queue and quota management
YuniKorn: Apache project with gang scheduling support
Custom schedulers: In-house solutions for specific use cases
Why use native gang scheduling?
Maintained by Kubernetes SIG Scheduling
Integrated with core scheduler features (preemption, autoscaling)
No additional components to deploy and maintain
Part of the Kubernetes conformance suite (eventually)
When to use external schedulers?
Need production-ready gang scheduling today (use Volcano or Kueue)
Require features beyond current Kubernetes roadmap
Gang Scheduling and Workload Aware Scheduling represent a major step forward for Kubernetes in supporting AI/ML, HPC, and batch processing workloads. The v1.35 alpha release provides a foundation for native multi-pod scheduling, with an exciting roadmap for v1.36 and beyond.
We encourage the community to:
Test these features in development environments
Provide feedback through GitHub issues
Share use cases and requirements
Contribute to the ongoing development
The future of Kubernetes scheduling is workload-aware, and the journey has just begun!
As v1.35 will announce the cgroup v1 deprecation, the kubelet will fail to start on cgroup v1 hosts with the default configuration: FailCgroupV1 will be set to true by default. See the upcoming blog https://github.com/kubernetes/website/pull/52814 for more. Below is what I wrote after cgroup v1 was announced to enter maintenance mode. Because it links out a lot and I could not get it into a complete enough shape, I stopped updating https://github.com/kubernetes/website/pull/47342 and am publishing it here instead, for users who want to know more about why we should shift from cgroup v1 to v2 and what the differences are.
cgroups (control groups) are a Linux kernel feature used for managing system resources. Kubernetes uses cgroups to allocate resources like CPU and memory to containers, ensuring that applications run smoothly without interfering with each other. With the release of Kubernetes v1.31, cgroup v1 support has been moved into [maintenance mode](/blog/2024/08/14/kubernetes-1-31-moving-cgroup-v1-support-maintenance-mode/). cgroup v2 support, by contrast, graduated to stable in v1.25, two years earlier.
The top FAQs are: why should we migrate, what are the benefits and the trade-offs, and what should be watched out for when using cgroups v2.
cgroups v1 problem, and solutions in cgroups v2
The official documentation for cgroups v1 and cgroups v2 can be found in the Linux kernel documentation.
Memory usage throttling (Memory QoS)
In cgroups v1, there is no native solution for throttling memory allocation before a container reaches its limit. Workarounds are setting a larger memory limit for Pods, or using external projects to drop caches or throttle memory allocation once usage goes beyond a threshold.
In cgroups v2, we can use memory.high to throttle memory allocation as usage approaches the limit.
Support for Memory QoS was initially added in Kubernetes v1.22, and later some limitations around the formula for calculating memory.high were identified. These limitations are addressed in Kubernetes v1.27.
However, as of v1.31 the feature gate is still alpha, due to another known issue: an application pod may hang forever under heavy memory reclaim.
Container aware OOM killer and better OOM handling strategies
In cgroups v1, a single process of a multi-process Pod could be killed by the OOM killer while the rest kept running; in that case, the Pod has to use runit or supervisord to manage the multi-process lifecycle.
cgroups v2 adds the cgroup.kill file. Writing “1” to this file causes the cgroup and all descendant cgroups to be killed, meaning every process in the affected cgroup tree receives SIGKILL. A Pod may run multiple processes, and all of them can be killed simultaneously.
As mentioned above, cgroups v2 memory.high can throttle new memory allocation, so cgroups become aware of memory pressure earlier. Besides, PSI can also help to understand memory load; oomd is a good example of using PSI to implement a userspace out-of-memory killer.
Rootless support
In cgroups v1, delegating cgroups v1 controllers to less privileged containers may be dangerous.
Unlike cgroups v1, cgroups v2 officially supports delegation. Most Rootless Containers implementations rely on systemd for delegating v2 controllers to non-root users.
According to KEP-127, the minimal kernel version for user namespaces is 6.5.
What’s more?
eBPF stories:
In cgroups v1, device access control is defined via a static configuration.
cgroups v2 device controller has no interface files and is implemented on top of cgroup BPF.
Cilium by default automatically mounts the cgroups v2 filesystem, required to attach BPF cgroup programs, at the path /run/cilium/cgroupv2.
PSI (Pressure Stall Information) support is planned in a future release (KEP-4205), but was pending on the delayed runc 1.2.0 release.
Monitoring tool support, such as cAdvisor: cgroups v2 features are not yet fully supported.
Adopting cgroup version 2
Requirements
Here’s what you need to use cgroup v2 with Kubernetes. First up, you need to be using a version of Kubernetes with support for v2 cgroup management; that’s been stable since Kubernetes v1.25 and all supported Kubernetes releases include this support.
OS distribution enables cgroups v2
Linux Kernel version is 5.8 or later
Container runtime supports cgroups v2. For example:
containerd v1.4 or later (at the time of writing, containerd releases v1.6 and later are within that project’s support period)
CRI-O v1.20 or later
The kubelet and the container runtime are configured to use the systemd cgroup driver
kernel updates around cgroups v2
cgroups v2 first appeared in Linux Kernel 4.5 in 2016.
Linux 4.5 supported cgroup v2 management of the io, memory, and pids controllers.
Linux 4.15 added support for cgroups v2 cpu management
The Container runtimes page explains that the systemd driver is recommended for kubeadm based setups instead of the kubelet’s default cgroupfs driver, because kubeadm manages the kubelet as a systemd service.
A minimal example of configuring the field explicitly:
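In the kubelet configuration file, this looks like the following:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cgroupDriver: systemd   # use the systemd cgroup driver instead of the default cgroupfs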
In v1.31, KEP-4033 is beta: it extends the CRI API so the kubelet can discover the cgroup driver from the container runtime. This helps installers and the kubelet auto-detect the driver instead of requiring explicit configuration.
Tools and commands for troubleshooting
Tools and commands that you should know about cgroups:
stat -fc %T /sys/fs/cgroup/: Check if cgroups v2 is enabled which will return cgroup2fs
systemctl list-units kube* --type=slice or --type=scope: List kube related units that systemd currently has in memory.
bpftool cgroup list /sys/fs/cgroup/*: List all BPF programs attached to the given cgroups.
systemd-cgls /sys/fs/cgroup/*: Recursively show control group contents.
systemd-cgtop: Show top control groups by their resource usage.
tree -L 2 -d /sys/fs/cgroup/kubepods.slice: Show Pods’ related cgroups directories.
How to check if a Pod CPU or memory limit is successfully applied to the cgroup file?
Kubernetes Pod Spec: check limits spec.containers[*].resources.limits.{cpu,memory} and requests spec.containers[*].resources.requests.{cpu,memory}
CRI: cpu_period, cpu_quota, cpu_shares for CPU and memory_limit_in_bytes for the memory limit
cgroup v2 files: cpu.max (quota and period), cpu.weight (converted from CPU shares), and memory.max for the memory limit
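For example, on a cgroup v2 node using the systemd cgroup driver, you can locate a pod's cgroup directory and read the limit files directly. The pod name mypod is hypothetical, and the slice layout varies with the cgroup driver and the pod's QoS class:

# Locate the pod's cgroup directory (systemd driver layout; dashes in the UID become underscores)
POD_UID=$(kubectl get pod mypod -o jsonpath='{.metadata.uid}' | tr '-' '_')
CG_DIR=$(find /sys/fs/cgroup/kubepods.slice -type d -name "*${POD_UID}*" | head -n1)

# Inspect the cgroup v2 limit files in that directory
cat "${CG_DIR}/memory.max"   # memory limit in bytes, or "max" if no limit is set
cat "${CG_DIR}/cpu.max"      # "<quota> <period>", e.g. "100000 100000" equals 1 CPU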
Note: this blog only covers the basic requirements and configuration of Kubernetes components. It does not cover how to enable the cgroup v2 filesystem in OS distributions; for that, refer to the cgroup v2 migration guides.
InftyAI’s llmaz is an advanced inference platform designed to streamline the deployment and management of large language models (LLMs) on Kubernetes. By integrating state-of-the-art inference backends, llmaz brings cutting-edge research to the cloud, offering a production-ready solution for LLMs.
Key Features of llmaz:
Kubernetes Integration, easy to use: deploy and manage LLMs within Kubernetes clusters, leveraging Kubernetes’ robust orchestration capabilities.
Advanced Inference Backends: Utilize state-of-the-art inference backends to ensure efficient and scalable model serving.
Production-Ready: Designed for production environments, llmaz offers reliability and performance for enterprise applications.
The deployment of a model is quite simple in llmaz.
Here’s a toy example for deploying deepseek-ai/DeepSeek-R1, all you need to do is to apply a Model and a Playground.
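A sketch of what that looks like is below; the field names follow the llmaz examples at the time of writing, so double-check against the project README:

apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: deepseek-r1
spec:
  familyName: deepseek
  source:
    modelHub:
      modelID: deepseek-ai/DeepSeek-R1   # pulled from the model hub (HuggingFace by default)
---
apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: deepseek-r1
spec:
  replicas: 1
  modelClaim:
    modelName: deepseek-r1               # references the OpenModel above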
The latest release, v0.1.3, was published on April 23rd, 2025. It includes several enhancements and bug fixes to improve the platform’s stability and performance. For detailed information on the changes introduced in this release, please refer to the release notes.
Integrations
Broad Backends Support: llmaz supports a wide range of advanced inference backends for different scenarios, like vLLM, Text-Generation-Inference, SGLang, llama.cpp. Find the full list of supported backends here.
llmaz supports a wide range of model providers, such as HuggingFace, ModelScope, ObjectStores.
AI Gateway Support: capabilities like token-based rate limiting and model routing through integration with Envoy AI Gateway.
Built-in ChatUI: out-of-the-box chatbot support through integration with Open WebUI, offering capabilities like function calling, RAG, web search and more; see configurations here.
llmaz, serving as an easy to use and advanced inference platform, uses LeaderWorkerSet as the underlying workload to support both single-host and multi-host inference scenarios.
llmaz supports horizontal scaling with HPA by default and will integrate with autoscaling components like Cluster-Autoscaler or Karpenter for smart scaling across different clouds.
About the Founder: Kante Yin
Kante Yin is a prominent figure in the Kubernetes community, serving as a SIG Scheduling Approver and a top committer of LWS and Kueue. His contributions to Kubernetes scheduling and workload management have been instrumental in advancing cloud-native technologies. Kante’s expertise and leadership continue to drive innovation in the Kubernetes ecosystem.
Compared to other inference platforms, llmaz stands out with its extensible cloud-native design, making it incredibly lightweight and efficient. Its architecture is optimized for scalability and resource efficiency, enabling seamless integration into modern cloud environments while maintaining high performance.
OSPP 2025 (Open Source Promotion Plan)
The Open Source Promotion Plan is a summer program organized since 2020 by the Open Source Software Supply Chain Promotion Plan of the Institute of Software, Chinese Academy of Sciences. It aims to encourage university students to actively participate in the development and maintenance of open source software, cultivate and discover more outstanding developers, promote the vigorous development of excellent open source software communities, and assist in the construction of open source software supply chains.
llmaz has two projects in OSPP 2025. Student registration and application runs from May 9 to June 9. Welcome to join our community.