2025 Year-End Review

Work (Community)

  • Community: Kubernetes and CNCF
    • Kubernetes 1.33 – 1.35 (focusing on AI)
      • Container/Pod restart policies, Gang Scheduling, DRA, Pod-level Resource Management, SLA-based scheduling (taints/tolerations)
      • Node Readiness Controller
    • Re-elected to the Steering Committee; thanks everyone for the support (see "Announcing the 2025 Steering Committee Election Results")
    • "Why CNCF TAGs are the core of cloud native innovation (and where to find them at KubeCon Atlanta)": I joined the new CNCF TAG Workloads Foundation. Its scope is actually quite broad, but there is no major output yet; most of the current discussion is around Batch and Agentic topics, and the earlier work was mainly the scheduling whitepaper.
    • On the Kubernetes community side, my KCD talks mainly covered this area: the several new AI-related working groups, plus new or AI-related projects such as Kueue, JobSet, GAIE, Agent Sandbox, and Kube Agentic Gateway.
    • New attempts: my recent focus is inference. On one hand I am learning vLLM (mostly blogs and office hours so far; I have barely read the code) and have also looked at Ray and PyTorch. The other key topic is AI orchestration, which I summarized briefly in "How to choose the inference orchestration solution? AIBrix or Kthena or Dynamo?". There is no standard here yet, but this layer is actually quite thin; the real question is how it integrates with schedulers and the rest of the ecosystem rather than standardization.
    • I also tried mentoring for OSPP (Open Source Promotion Plan). We received three proposals early on, which made me happy; the proposals were clearly written with care. The execution phase was a bit disappointing 😓: the process itself was completed, but apart from submitting the proposal and the code there was almost zero interaction in between. I understand everyone is busy, but it still felt odd; I hope the program keeps getting better.

I quite agree with this chart: support for AI/ML workloads has indeed given Kubernetes a chance to set sail again. Judging by the current trend, though, the Kubernetes community mostly defines APIs and standard CRDs and writes up simple best practices (currently Gateway API, Gateway API Inference Extension, and the Gang/Workload API). Overall the room for exploration inside the Kubernetes community itself is still limited; most of the work needs to co-evolve with upstream and downstream projects, much like how the cgroups v2 kernel changes went hand in hand with the corresponding kubelet features.

  • WeChat public account:
    • The posts roughly fall into a few series:
      • Huge clusters & multi-cluster solutions: continuing the KubeCon EU theme of "a huge cluster or multi clusters", I translated the GKE 130k article (and last year's 65k one) and an AWS article, and summarized ByteDance's complete KubeWharf stack (the open-source version) as well as Ant Group's cluster design and optimizations. Multi-cluster itself got little coverage. Judging by page views, this series is still the biggest draw.
      • Isolation: mainly the Agent Sandbox project, plus a quick look at how Kata, gVisor, and vArmor work.
      • AI: mostly vLLM and the related inference orchestration work, plus the Ray community's Co-Evolving summary; more and more projects are co-evolving, and the PARK (PyTorch + AI + Ray + Kubernetes) framing is quite interesting.
      • After roughly a month of publishing, the backlog is basically cleared; a biweekly cadence would already be good going forward. Gaining 700 followers far exceeded expectations, and 1,000+ shares was a real surprise.
  • Repo
    • At the end of the year I went back over the various focus areas collected in https://github.com/pacoxu/AI-Infra, where I keep studying AI Infra topics. Many of the public account articles above grew out of that learning process. I also found that this definition of "AI Infra" is not widely accepted, especially by people who already work in AI; the concept here is closer to "Cloud Native AI Infra".

Life

I did not travel much this year: Guangzhou, Qingdao, Suzhou, and Hangzhou (for KCD), plus Hong Kong and London for KubeCon.

I really enjoyed Guangzhou's yum cha culture; it left me feeling refreshed. Qingdao's beaches are great, and Suzhou and Hangzhou are nice weekend destinations with a toddler.

Starting Go lessons at age two is probably a bit early 😂. Parenting has become the main theme, but staying patient and gentle with a toddler whose answer to everything is "no, no" is really hard, especially when we are in a hurry or something unexpected happens.

I feel like I read no books at all this year, yet I started dreaming about reading the Harry Potter books with my daughter someday.

The AI-generated picture book never got started; my attempts in the first half of the year did not turn out well, but the latest Gemini should be able to do what I have in mind.

Finally, advice I give others that I should take myself: for many things, once you have gathered enough information you can make the decision. Information gathering can take as long as it needs, but decision-making time has to be bounded so there is time left to actually execute. Last but not least, no regrets once the stone is played: be patient and stick with your choice (backing out is rarely cheap, though it is always possible, so persist to a reasonable point before talking about giving up). Agonizing over a choice only proves that you have one.

Looking back over many past experiences, this chart feels more and more true. Many things got done simply because I pushed a little harder to make them happen; many other opportunities were missed because I did not take the initiative at the time.

Hobbies

  • Football:
    • I played 28 times this year, far more than last year, and my overall form has recovered a bit. My weight has held at around 80 kg, so no continued weight gain, but my health is still nothing to brag about: rhinitis and colds are still frequent. This also seems to have been my worst year of sleep 😓; I never used to have trouble falling asleep, but this year I often wake up in the middle of the night and cannot get back to sleep. On the plus side, since someone started filming our matches, I have been playing more seriously 😂.
    • During KubeCon week in London I watched four Premier League matches, which counts as a small splurge of madness; the most memorable was Fulham 3:2 Liverpool. I also watched more live football than in previous years: Premier League kickoff times are genuinely friendlier, and the football is simply better to watch. La Liga kickoff times were brutal, and Valencia fell apart again this year after selling the center-back they had just developed. Liverpool have been painful to watch, though you can slowly see some changes; with Isak now injured, this season feels like one long run of bad luck. Forever missing number 20, Jota. Salah's place in history is actually very high, and I did not expect this turn of events; I hope a tactical change brings a rebirth, since it is a World Cup year after all. His form disappeared soon after Ramadan early last year and still has not come back; hopefully he can regain some confidence with the national team, and the club can use this stretch to grind the tactics up to a passing grade.
  • I have quit PUBG; the Chinese teams played far too poorly this year 😓 and it is getting harder and harder to see any hope.
  • Go probably still needs a lot of training and patient reading; at my current pace of life it is hard to calm down for a game, and of course my level is simply too low.
  • Podcasts: driving leaves room for little else, so podcasts fill the time, with this Liverpool fan listening to Arsenal fans chat 😓.

State of Mind

AI anxiety comes partly from how fast AI is moving: new articles and projects spring up like mushrooms after rain, my learning speed falls far behind, and the gap feels like it keeps widening. AI itself can of course speed up the learning: in my public account workflow I collect the material and screenshots (or generate some illustrations), hand the AI the key links and content, and let it organize the language, which makes publishing an article very fast.

The other part comes from seeing that for the AI giants, especially on the model side and in the public clouds, many optimizations are amplified by scale. New models are essentially monopolized by the giants, and not only models: in other AI niches, small players will be increasingly disadvantaged. Especially now that toolchains and workflows such as MCP and A2A have greatly extended agent capabilities, higher and higher "walls" will appear in more and more fields. Along the way, more and more jobs look replaceable: AI coding squeezes the space for junior programmers while giving product managers and non-programmers more room to imagine, so perhaps the market for "generalist programmers" will actually grow, who knows. I hope AI creates more professions and new momentum rather than destroying more of them.

How to adapt to the AI era? One rough idea is to be an efficient agent yourself, that is, an efficient human agent:

  • Routing: recognize and decompose problems, quickly hand things off to the right people, split tasks and prioritize them, and "outsource" part of the work.
    • Avoid frequent context switching
    • Produce output at regular intervals so you can quantify yourself better
  • A2A: work with other cost-effective agents (how to screen and label them is its own question). Also, the agent world should not depend on fixed workflows and designated contacts; everyone is a boss, and you "pay" other agents to get things done. Agents that cannot find work are naturally eliminated or simply cool down. The "profitable" agents get published so that loss-making agents can learn from them. Perhaps money is not even the right measure; efficiency, output quality, and energy consumption might reflect it better.
  • MCP: learn to use all kinds of AI tools
  • Reasoning: when interacting with other agents, do not just take the result; try to understand their reasoning and how they got there
  • Build a knowledge graph; collect and organize high-quality material
  • Exercise and cultivate interests: activate more of the brain's "experts"

Outlook for 2026

Events: KCD Beijing + KCD Hangzhou (application in progress) + KubeCon China 2026 Shanghai

KubeCon Japan 2025 was a big success; I would still quite like to make a trip there.

My AI Infra learning is still fairly shallow; in particular I have no hands-on experience with actual model knowledge, inference engines, or DRA, so that will probably be the focus for 2026.

The first three years of my career felt like pure exploration, the next five were on the product side, and the most recent five have mostly been Kubernetes open source. With this AI wave hitting, some kind of transition seems necessary, and positioning myself amid the AI anxiety keeps getting harder: model training, inference serving and orchestration, agent workflows, routing, KV cache, P/D disaggregation, gang scheduling, GPU management (utilization), sandbox pre-warming, very large scale, cost optimization, multi-tenant isolation, observability. Everything looks like a possible foothold, yet much of it has only been dabbled in. 😐 Even so, I still plan to look at more new directions next year, and may need to participate more patiently in vLLM, SGLang, or the AAIF direction.

In 2026 the kid starts daycare and then kindergarten, probably another new stage. Parenting really comes down to patience and time, nothing else. I do not want to go down the intensive-parenting route; I would rather emphasize physical exercise and more time outdoors. My wife also got worn down several times this year while preparing open classes and other tasks; we both need to take exercise seriously and build up our resistance.

One last thought: perhaps everyone needs a full retrospective every 3-5 years

Whether it is work or life, we all need to do some deep reviews periodically.

My grandparents farmed the land (anxious about having enough to eat), my parents' generation mostly worked in factories and schools (anxious about instability), my own generation is mostly white-collar (anxious about competition), and in the future the jobs may all be shaped by AI (huge uncertainty). In just a few short decades, our thinking and our rules cannot keep up with the technology. The planet, nations, and big companies are all like one enormous machine, and also like the Kubernetes community: after dozens of iterations or more the core finally stabilized, only to find that this AI wave brings quite a few big changes to adapt to. Each stage lasts 3-5 years, the anxiety shifts, but it seems hard to avoid.

Merry Christmas 🎄🧑‍🎄


Agones: Kubernetes-Native Game Server Hosting

Agones brings dedicated game server hosting to Kubernetes, enabling multiplayer gaming infrastructure with cloud-native scalability and management. This blog explores Agones as it applies to join the CNCF Sandbox.

Introduction

As the gaming industry grows rapidly, the demand for scalable, reliable dedicated game server infrastructure has become critical. Agones is an open-source platform built on Kubernetes that addresses this need by providing a specialized solution for hosting, running, and scaling dedicated game servers.

Agones, derived from the Greek word “agōn” meaning “contest” or “competition at games”, transforms Kubernetes into a powerful platform for managing game server workloads with the same cloud-native principles used for traditional applications.

Project Status: Agones has applied to join the CNCF Sandbox (github.com/cncf/sandbox/issues/440), marking an important step in bringing gaming workloads into the cloud-native ecosystem.

What is Agones?

Agones is a library for hosting, running, and scaling dedicated game servers on Kubernetes. It replaces bespoke or proprietary cluster management solutions with Kubernetes-native APIs and controllers.

Core Concept: Dedicated game servers are stateful, ephemeral workloads that differ significantly from typical web applications. Each game session requires its own isolated server process, must maintain consistent network identity, and needs specialized lifecycle management. Agones extends Kubernetes to handle these unique requirements through Custom Resource Definitions (CRDs) and controllers.

Key Features

  • GameServer CRD: Define individual game servers declaratively using YAML or the Kubernetes API, complete with health checking and connection information
  • Fleet Management: Manage large groups of game servers as Fleets, similar to Kubernetes Deployments but optimized for game server workloads
  • Autoscaling: Native integration with Kubernetes cluster autoscaling, allowing Fleets to scale based on game server demand
  • Client SDKs: SDKs for multiple languages (Go, C#, C++, Rust, Node.js, REST) enabling game servers to communicate with the Agones control plane
  • Lifecycle Management: Automatic health checks, graceful shutdown handling, and state management for game server processes
  • Metrics and Observability: Game server-specific metrics exports and dashboards for operations teams
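
To make the GameServer CRD above concrete, a minimal manifest might look like the following sketch (modeled on the typical simple-game-server examples; the image, port, and health values are placeholders):

apiVersion: agones.dev/v1
kind: GameServer
metadata:
  name: simple-game-server
spec:
  ports:
  - name: default
    portPolicy: Dynamic      # Agones assigns a free host port at creation time
    containerPort: 7654      # port the game server process listens on
  health:
    initialDelaySeconds: 5   # grace period before health checks start
    periodSeconds: 10
  template:
    spec:
      containers:
      - name: simple-game-server
        image: your-game-server:latest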

Architecture and Design

Agones extends Kubernetes with custom controllers and resources specifically designed for game server workloads:

Custom Resources

  • GameServer: Represents a single dedicated game server instance with health status, network ports, and connection information
  • Fleet: Manages groups of GameServers, providing replica management, rolling updates, and scaling capabilities
  • FleetAutoscaler: Automates Fleet scaling based on buffer policies, webhook policies, or counter/list-based policies
  • GameServerAllocation: Enables matchmakers to atomically allocate Ready GameServers from a Fleet for player connections
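
As a rough illustration of how a Fleet and a FleetAutoscaler fit together (values are illustrative; the Buffer policy keeps a number of spare Ready servers ahead of demand):

apiVersion: agones.dev/v1
kind: Fleet
metadata:
  name: game-fleet
spec:
  replicas: 3                  # desired number of GameServers
  template:
    spec:
      ports:
      - name: default
        portPolicy: Dynamic
        containerPort: 7654
      template:
        spec:
          containers:
          - name: simple-game-server
            image: your-game-server:latest
---
apiVersion: autoscaling.agones.dev/v1
kind: FleetAutoscaler
metadata:
  name: game-fleet-autoscaler
spec:
  fleetName: game-fleet
  policy:
    type: Buffer
    buffer:
      bufferSize: 2            # keep 2 Ready servers available for allocation
      minReplicas: 3
      maxReplicas: 20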

How It Works

  1. Deployment: Operators define GameServers or Fleets using Kubernetes manifests
  2. Lifecycle Management: Agones controllers create pods and manage their lifecycle based on game server state
  3. Ready State: Game servers use the Agones SDK to mark themselves Ready when accepting connections
  4. Allocation: Matchmaking systems request GameServer allocation via the Kubernetes API
  5. Session Management: Game servers notify Agones when sessions end, triggering cleanup
  6. Autoscaling: FleetAutoscalers monitor Fleet status and adjust replicas to maintain desired buffer or respond to custom policies
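
The allocation step (4) is typically driven by a matchmaker creating a GameServerAllocation; a minimal sketch (the fleet label value is a placeholder):

apiVersion: allocation.agones.dev/v1
kind: GameServerAllocation
spec:
  selectors:
  - matchLabels:
      agones.dev/fleet: game-fleet   # allocate one Ready GameServer from this Fleet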

Use Cases and Production Adoption

Agones is designed for multiplayer gaming scenarios requiring dedicated game servers:

  • Session-based multiplayer games: FPS, MOBA, battle royale games where each match runs on a dedicated server
  • Persistent game worlds: MMO game zones or shards that require long-lived server processes
  • Match-based esports: Competitive gaming infrastructure requiring consistent server performance
  • Cross-platform gaming: Unified infrastructure for console, PC, and mobile multiplayer experiences

The project is already used in production by major gaming companies and has proven its reliability at scale. The CNCF sandbox application notes that “this project is already used in production by many” organizations.

Why CNCF?

According to the CNCF Sandbox application:

Since Agones is tightly coupled to Kubernetes, CNCF is the logical home for the project. Agones being in the CNCF allows for a broader community contributor ecosystem.

Agones brings a new gaming offering to the CNCF landscape, representing a specific but important use case for Kubernetes. As cloud-native technologies expand into specialized domains, gaming infrastructure represents a significant workload category with unique requirements.

Cloud-Native Integration

Agones integrates directly with core CNCF projects:

  • Kubernetes: Built as a Kubernetes controller with CRDs
  • Prometheus: Metrics exports for monitoring game server health and performance
  • Helm: Installation and configuration via Helm charts
  • Container runtimes: Works with any Kubernetes-compatible container runtime

Project Governance and Community

Agones operates as a vendor-neutral open-source project:

  • License: Apache 2.0
  • Code of Conduct: Contributor Covenant
  • Governance: Clear contribution guidelines and ownership model
  • Community Channels: Active Slack workspace, mailing list, regular community meetings
  • Maintained by: Originally created by Google Cloud, now community-driven with multiple maintainers

The project has comprehensive documentation, quickstart guides, and example implementations for developers getting started with game server hosting on Kubernetes.

Similar Projects and Ecosystem

Within the Kubernetes gaming ecosystem, OpenKruise’s kruise-game (github.com/openkruise/kruise-game) provides similar capabilities. Both projects demonstrate growing interest in gaming workloads on Kubernetes.

Agones’ application to CNCF Sandbox represents an opportunity to establish standards and best practices for game server orchestration across the cloud-native community.

Vision and Roadmap

Agones continues active development with regular releases following a documented release process. The project roadmap focuses on:

  • Enhancing autoscaling capabilities with more sophisticated policies
  • Improving observability and debugging tools for game server operations
  • Expanding SDK support for additional programming languages and engines
  • Performance optimizations for larger-scale deployments
  • Better integration with matchmaking and lobby systems

The project aims to make dedicated game server hosting as straightforward and reliable as deploying stateless web applications, while respecting the unique requirements of real-time gaming workloads.

Getting Started

For developers interested in exploring Agones:

  1. Documentation: Comprehensive guides at agones.dev/site/docs/
  2. Quick Start: Install Agones on a Kubernetes cluster and deploy a simple game server
  3. Examples: Multiple example game server implementations in the repository
  4. Community: Join the Agones Slack and mailing list for support and discussion

Agones represents the maturation of gaming infrastructure into the cloud-native era, bringing the operational benefits of Kubernetes to one of the most demanding real-time workload types.

Conclusion

Agones transforms Kubernetes into a powerful platform for dedicated game server hosting, addressing the unique challenges of multiplayer gaming infrastructure. As it applies to join the CNCF Sandbox, the project demonstrates how cloud-native technologies can adapt to specialized workload requirements while maintaining Kubernetes-native principles.

For gaming companies building multiplayer experiences and infrastructure teams managing game servers, Agones provides a proven, production-ready solution that leverages the full ecosystem of cloud-native tools and practices.


References:

  • Agones GitHub: github.com/googleforgames/agones
  • Official Website: agones.dev/site/
  • CNCF Sandbox Application: github.com/cncf/sandbox/issues/440
  • Announcement Blog: cloud.google.com/blog/products/containers-kubernetes/introducing-agones-open-source-multiplayer-dedicated-game-server-hosting-built-on-kubernetes

How to choose the inference orchestration solution? AIBrix or Kthena or Dynamo?

Note: The content in this article is based on currently available public information and is intended for technical reference only. The effectiveness of each solution depends heavily on your specific workload, infrastructure, and ecosystem integration. The architectural affiliations and early design choices mentioned here do not determine their future direction. In practice, community activity, openness, and long-term evolution are often more important factors. Please evaluate and choose based on your own scenario.

Introduction

The landscape of open-source inference orchestration for Large Language Models (LLMs) has evolved rapidly in 2025. Multiple projects have emerged to address the challenges of deploying and scaling LLM inference workloads on Kubernetes, each with its own approach to workload management, resource orchestration, and performance optimization.

This blog post provides an overview of the current inference orchestration solutions, examines the convergence trends in the ecosystem, and raises important questions about when Prefill-Decode (PD) disaggregation truly provides value.

The Current Landscape

Rapid Development, Gradual Convergence

The inference orchestration space is characterized by:

  • Many implementations: Multiple projects solving similar problems
  • Different architectural choices: Varying approaches to workload management
  • Shared goals: All aim to optimize LLM inference at scale
  • Emerging patterns: Common solutions beginning to emerge

Despite the diversity, we’re seeing convergence around key patterns: LeaderWorkerSet (LWS)-based architectures, intelligent routing, and disaggregated serving models.

Workload Orchestration Solutions

1. Dual LWS Architecture: llm-d

llm-d implements a dual LeaderWorkerSet architecture for Prefill-Decode disaggregation:

  • Two LWS instances: Separate LWS for prefill and decode workers
  • KServe integration: Deep integration with KServe for model serving
  • LMCache support: Efficient KV cache management across workers
  • Routing sidecar: Intelligent request routing and cache optimization
Client → Routing Sidecar → Prefill LWS → KV Cache → Decode LWS → Response

Why dual LWS? This architecture enables independent scaling and resource optimization for each phase while maintaining coordination through the leader-worker pattern.
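
For readers less familiar with LeaderWorkerSet, the dual-LWS idea amounts to running two LWS objects, one per phase, so each can scale on its own. A hedged sketch, not llm-d's actual manifests (names, group sizes, and images are illustrative):

apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: prefill
spec:
  replicas: 2                 # number of prefill groups
  leaderWorkerTemplate:
    size: 4                   # 1 leader + 3 workers per group
    workerTemplate:
      spec:
        containers:
        - name: prefill-worker
          image: vllm/vllm-openai:latest   # placeholder engine image
---
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: decode
spec:
  replicas: 4                 # decode typically scales independently of prefill
  leaderWorkerTemplate:
    size: 2
    workerTemplate:
      spec:
        containers:
        - name: decode-worker
          image: vllm/vllm-openai:latest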

2. Serving Group: Volcano Kthena

Kthena takes a different approach with its Serving Group concept:

  • No dual LWS: Kthena intentionally avoids the dual LWS pattern
  • Gang scheduling integration: Leverages Volcano’s gang scheduling capabilities
  • Reduced layering: Eliminates the StatefulSet/Pod layer complexity
  • Direct integration: Native integration with Volcano scheduler

Why not LWS? The Kthena team found that integrating with Volcano’s gang scheduling required a different architecture. The dual LWS, StatefulSet, and Pod layering added complexity without clear benefits for their use case.

This design choice reflects a key insight: the best orchestration solution depends on your existing infrastructure and scheduling requirements.

3. StormService: AIBrix

AIBrix StormService provides specialized container lifecycle management for P/D disaggregation:

  • P/D lifecycle management: Fine-grained control over prefill and decode containers
  • Multi-mode support: TP, PP, single GPU, and P/D disaggregation
  • StormService and RoleSet CRDs: Custom resources for P/D orchestration
  • Enterprise features: Multi-tenancy, routing, and observability

Architecture:

AIBrix Control Plane
    ├── StormService Controller
    │   ├── RoleSet (Prefill)
    │   └── RoleSet (Decode)
    ├── Gateway & Routing
    └── Autoscaler

4. NVIDIA Dynamo: Two Modes

Dynamo offers two distinct deployment modes:

Grove Mode: https://github.com/ai-dynamo/dynamo/blob/be67f67b1a8d0837291ac7033af6edbc146f6995/docs/kubernetes/grove.md

  • High-performance inference
  • NVIDIA-native deployment
  • Optimized for pure NVIDIA infrastructure
    • "GPU support depends on the engine: Dynamo uses backends vllm, sglang and trt-llm. Dynamo is the layer above that."

LWS Mode:

  • Kubernetes-native deployment using LeaderWorkerSet
  • Multi-node disaggregated serving
  • Integration with Kubernetes ecosystem

This dual-mode approach allows users to choose the right level of abstraction for their infrastructure.

5. SGLang RBG: LWS-Inspired

RBG (Resource-Aware Batch Scheduler) learned from and reused design patterns from LWS:

  • LWS-inspired: Incorporates proven patterns from LeaderWorkerSet
  • Resource-aware scheduling: Optimizes batch scheduling based on resources
  • Batch optimization: Intelligent batching strategies for throughput
  • P/D support: Enables disaggregated prefill and decode workloads

Convergence Trends

Common Patterns Emerging

Despite different implementations, several patterns are converging:

Pattern | llm-d | Kthena | AIBrix | Dynamo | RBG
LWS-based | ✓ (dual) | | | ✓ (option) | ✓ (inspired)
P/D disaggregation | ✓ | ✓ | ✓ | ✓ | ✓
Intelligent routing | ✓ | ✓ | ✓ | ✓ |
KV cache management | LMCache | Native | Distributed | Native | Native

Why So Many Implementations?

The diversity reflects different optimization goals:

  1. Scheduling integration: Kthena needs Volcano gang scheduling directly
  2. Enterprise features: AIBrix focuses on multi-tenancy and observability
  3. Performance focus: Dynamo optimizes for NVIDIA hardware
  4. Simplicity: RBG provides a lightweight LWS-inspired approach
  5. Production-readiness: llm-d demonstrates a complete reference implementation

The PD Disaggregation Question

At KCD Hangzhou 2025, Wen Yuan Yu’s keynote “Kubernetes Is Born for Service Resource Orchestration—MaaS Changes Everything” raised an important question about PD-separation:

“Achieving strong production gains from PD-separation is very difficult.
While stress testing can show great results, in real dynamic environments it becomes much harder.
Over-provisioning Decode introduces significant challenges.”

This observation directly challenges the assumption that PD-separation is always beneficial.

Does PD Disaggregation Always Provide Value?

The same keynote put the question even more pointedly:

"PD-Disaggregate Role Scheduling • Not So Sure? (Our answer is Data Plane!)"

So when does disaggregation actually pay off, and when is it not worth the complexity?

When PD Disaggregation Helps

PD disaggregation provides clear benefits when:

  • Long prefill, short decode: Input prompts are much longer than outputs
  • High concurrency: Many simultaneous requests need serving
  • Heterogeneous hardware: Different GPU types for different phases
  • SLA-driven scheduling: Different latency requirements (TTFT vs TPOT)

When PD Disaggregation May Not Help

Consider alternatives when:

  • Short contexts: Both prefill and decode are fast
  • Low concurrency: Few simultaneous requests
  • Homogeneous hardware: Same GPUs for all workloads
  • Complexity costs: Operational overhead outweighs benefits
  • KV cache transfer overhead: Network latency exceeds computation savings

The Data Plane Perspective

The “Data Plane” answer suggests that the value of PD disaggregation depends on where bottlenecks actually exist. Before implementing complex orchestration:

  1. Profile your workload: Understand where time is spent
  2. Measure KV cache transfer costs: Network overhead matters
  3. Consider simpler alternatives: TP/DP without disaggregation
  4. Evaluate operational complexity: More components = more failure modes

Configuration Optimization: AIConfigurator

Choosing the right P/D configuration is complex. NVIDIA’s AIConfigurator helps optimize disaggregated deployment configurations:

What AIConfigurator Does

  • Configuration space search: Evaluates thousands of P/D combinations
  • SLA-constrained optimization: Finds configurations meeting TTFT/TPOT targets
  • Hardware-specific tuning: Supports H100, H200, B200 with collected data
  • xPyD planning: Determines optimal prefill/decode worker ratios

Example Usage

# Find optimal configuration for Qwen3-32B on 32 H200 GPUs
# with SLA targets: TTFT ≤ 300ms, TPOT ≤ 10ms
aiconfigurator cli default \
  --model QWEN3_32B \
  --total_gpus 32 \
  --system h200_sxm \
  --isl 4000 \
  --osl 500 \
  --ttft 300 \
  --tpot 10

Why AIConfigurator Matters

Traditional autoscaling (HPA/KPA) doesn’t understand LLM-specific characteristics. AIConfigurator provides:

  • Informed decisions: Data-driven configuration choices
  • Predictive optimization: Estimate performance before deployment
  • Resource efficiency: Maximize GPU utilization with SLA guarantees

Recommendations

For New Deployments

  1. Start simple: Begin with monolithic serving (no P/D disaggregation)
  2. Profile first: Understand your workload characteristics
  3. Use AIConfigurator: Let data guide configuration decisions
  4. Add complexity gradually: Introduce P/D only when benefits are clear

For Existing Infrastructure

If you use… | Consider…
Volcano | Kthena (native integration)
KServe | llm-d (deep integration)
vLLM | AIBrix (vLLM ecosystem)
NVIDIA GPUs | Dynamo (NVIDIA optimization)
SGLang | RBG (LWS-inspired, lightweight)

Key Questions Before Adopting PD Disaggregation

  1. Is your prefill time >> decode time? If not, disaggregation may not help.
  2. Can your network handle KV cache transfer? Network overhead can eliminate gains.
  3. Do you need independent scaling? If P and D scale together, keep them together.
  4. Is operational complexity acceptable? More components = more failure modes.

Conclusion

The inference orchestration landscape is diverse but converging. Key takeaways:

  • Multiple solutions exist because different infrastructure has different needs
  • LWS-based patterns are popular but not universal (Kthena’s Serving Group shows alternatives)
  • PD disaggregation is not always valuable – profile your workload first
  • Tools like AIConfigurator help navigate the complex configuration space
  • Start simple, add complexity when needed based on actual measurements

The future will likely see further consolidation around proven patterns, but the current diversity reflects healthy experimentation in a rapidly evolving field.


References

Workload Orchestration Projects

  • llm-d – Dual LWS architecture for P/D
  • Kthena – Volcano-based Serving Group
  • AIBrix – StormService for P/D
  • Dynamo – NVIDIA inference platform
  • RBG – LWS-inspired batch scheduler


Agent Sandbox: Pre-Warming Pool Makes Secure Containers Cold-Start Lightning Fast

Agent Sandbox provides a secure, isolated, and efficient execution environment
for AI agents. This blog explores the project, its integration with gVisor and
Kata Containers, and future trends.

Introduction

As AI agents become more prevalent in enterprise applications, the need for
secure execution environments has become critical. Agent Sandbox is a new
Kubernetes project under SIG Apps
that addresses this challenge by providing a standardized, declarative API for
managing isolated, stateful, singleton workloads—ideal for AI agent runtimes.

Key Features:

  • Kubernetes Primitive Sandbox CRD and Controller: A native Kubernetes
    abstraction for managing sandboxed workloads
  • Ready to Scale: Support for thousands of concurrent sandboxes while
    achieving sub-second latency
  • Developer-Focused SDK: Easy integration into agent frameworks and tools

Project Overview

Core: Sandbox CRD

The Sandbox Custom Resource Definition (CRD) is the heart of agent-sandbox.
It provides a declarative API for managing a single, stateful pod with:

  • Stable Identity: Each Sandbox has a stable hostname and network identity
  • Persistent Storage: Sandboxes can be configured with persistent storage
    that survives restarts
  • Lifecycle Management: The controller manages pod lifecycle including
    creation, scheduled deletion, pausing, and resuming

Extensions

The project provides additional CRDs for advanced use cases:

  • SandboxTemplate: Reusable templates for creating Sandboxes
  • SandboxClaim: Allows users to create Sandboxes from templates
  • SandboxWarmPool: Manages a pool of pre-warmed Sandbox Pods for fast
    allocation (achieving sub-second startup latency)
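
The extension schemas are still evolving, so the following is a purely hypothetical sketch of how a warm pool and a claim might be wired together; every field name below is an assumption and should be checked against the agent-sandbox repository:

# Hypothetical sketch only: field names are assumptions, not the confirmed API.
apiVersion: agents.x-k8s.io/v1alpha1
kind: SandboxWarmPool
metadata:
  name: python-warm-pool
spec:
  replicas: 10                       # assumed: number of pre-warmed sandbox pods
  templateRef:
    name: python-sandbox-template    # assumed: reference to a SandboxTemplate
---
apiVersion: agents.x-k8s.io/v1alpha1
kind: SandboxClaim
metadata:
  name: my-agent-session
spec:
  templateRef:
    name: python-sandbox-template    # assumed: claim a sandbox built from this template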

Architecture

                              ┌─────────────────┐
                              │   K8s API       │
                              │   Server        │
                              └────────┬────────┘
                                       │
                              ┌────────▼────────┐     ┌─────────────┐
                              │  Agent Sandbox  │────▶│  Replenish  │
                              │   Controller    │     │    Pool     │
                              └────────┬────────┘     └─────────────┘
                                       │
                                       │ Allocate from Pool
                                       ▼
┌────────────────────────────────────────────────────────────────────┐
│                        Agent Sandbox                               │
│            Executing Isolated, Low Latency Tasks                   │
│ ┌──────────────────┐   ┌──────────┐   ┌──────────────────────────┐ │
│ │ Agent Orchestrator│──▶│ Executor │──▶│  Task Execution         │ │
│ │       Pod         │   │ (API/SDK)│   │  Agent Sandbox          │ │
│ │                   │   │          │   │ ┌──────────────────────┐│ │
│ │ Agent app/framework   │ iStream  │   │ │Execution Process     ││ │
│ │ requesting sandboxed  │          │   │ │  (gVisor/Kata)       ││ │
│ │ execution environment │          │   │ ├──────────────────────┤│ │
│ │                   │   │          │   │ │ Ephemeral Storage    ││ │
│ │                   │   │          │   │ ├──────────────────────┤│ │
│ │                   │   │          │   │ │ Network Policy       ││ │
│ └──────────────────┘   └──────────┘   │ └──────────────────────┘│ │
│                                        └──────────────────────────┘ │
└────────────────────────────────────────────────────────────────────┘

Runtime Integration: gVisor and Kata Containers

Agent Sandbox is designed to be vendor-neutral, supporting various runtimes
to provide enhanced security and isolation. The two primary implementations are
gVisor and Kata Containers.

gVisor Integration (GKE)

Image source: KubeCon NA keynote by Jago Macleod (Google)

gVisor is an application kernel that provides an
additional layer of isolation between container applications and the host
kernel. It intercepts application system calls and implements them in user
space.

GKE Integration Status:

  • Production Ready: gVisor is available as a runtime option in Google
    Kubernetes Engine (GKE) via the gvisor RuntimeClass
  • Snapshot and Resume: GKE supports snapshotting and resuming sandboxes,
    enabling infrastructure efficiency and sophisticated parallel executions
  • Performance Optimized: The gVisor team at Google has optimized the
    runtime for AI agent workloads with minimal overhead

Example Configuration:

apiVersion: agents.x-k8s.io/v1alpha1
kind: Sandbox
metadata:
  name: ai-agent-sandbox
spec:
  podTemplate:
    spec:
      runtimeClassName: gvisor
      containers:
      - name: agent-runtime
        image: my-ai-agent:latest

Kata Containers Integration

Source: KCD Hangzhou 2025

Kata Containers provides lightweight virtual
machines that behave like containers but offer the security isolation of VMs.
Each container runs in its own lightweight VM with a dedicated kernel.

Integration Status:

  • Active Development: The Kata Containers community is actively working on
    Agent Sandbox integration
  • VM-Level Isolation: Provides strong isolation through hardware
    virtualization
  • GPU Support: Kata supports GPU passthrough for AI/ML workloads

Example with Kata on GKE:

apiVersion: agents.x-k8s.io/v1alpha1
kind: Sandbox
metadata:
  name: kata-ai-sandbox
spec:
  podTemplate:
    spec:
      runtimeClassName: kata-qemu-nvidia-gpu
      containers:
      - name: agent-runtime
        image: my-ai-agent:latest


Comparison

Feature | gVisor | Kata Containers
Isolation | User-space kernel | Hardware virtualization
Startup Time | Faster (~100ms) | Slower (~1-2s)
Memory Overhead | Lower | Higher
Syscall Compatibility | ~95% | 100%
GPU Support | Limited | Full passthrough
Best For | Web workloads, untrusted code | GPU workloads, full isolation

Desired Characteristics

The Agent Sandbox project aims to achieve:

  • Strong Isolation: Support for gVisor and Kata Containers for kernel and
    network isolation
  • Deep Hibernation: Save state to persistent storage and archive Sandbox
    objects
  • Automatic Resume: Resume sandboxes on network connection
  • Efficient Persistence: Elastic and rapidly provisioned storage
  • Memory Sharing: Explore sharing memory across Sandboxes on the same host
  • Rich Identity & Connectivity: Dual user/sandbox identities and efficient
    traffic routing
  • Programmable: Applications and agents can programmatically consume the
    Sandbox API

Use Cases

Agent Sandbox is designed for:

  1. AI Agent Runtimes: Isolated environments for executing untrusted,
    LLM-generated code
  2. Development Environments: Persistent, network-accessible cloud
    environments for developers
  3. Notebooks and Research Tools: Persistent sessions for tools like Jupyter
    Notebooks
  4. Stateful Single-Pod Services: Hosting single-instance applications
    needing stable identity

Getting Started

Installation

# Replace "vX.Y.Z" with a specific version tag
export VERSION="v0.1.0"

# Install core components
kubectl apply -f https://github.com/kubernetes-sigs/agent-sandbox/releases/download/${VERSION}/manifest.yaml

# Install extensions (optional)
kubectl apply -f https://github.com/kubernetes-sigs/agent-sandbox/releases/download/${VERSION}/extensions.yaml

Create Your First Sandbox

apiVersion: agents.x-k8s.io/v1alpha1
kind: Sandbox
metadata:
  name: my-sandbox
spec:
  podTemplate:
    spec:
      containers:
      - name: my-container
        image: your-agent-image:latest

Trends and Future Directions

Industry Trends

  1. Growing AI Agent Adoption: As AI agents become more autonomous and
    capable, secure execution environments become essential
  2. Zero-Trust Security: Agent Sandbox aligns with zero-trust principles by
    providing isolated execution environments
  3. Cloud-Native AI Infrastructure: Integration with Kubernetes ecosystem
    tools (Kueue, Gateway API, etc.)

Future Development

The project roadmap includes:

  • Enhanced Runtime Support: Continued improvements for gVisor and Kata
    integration
  • Better Warm Pool Management: More sophisticated allocation strategies
  • Observability Integration: Native support for monitoring and tracing
  • Multi-Cluster Support: Managing sandboxes across clusters


Conclusion

Agent Sandbox represents an important step forward in providing secure,
efficient execution environments for AI agents on Kubernetes. With support for
multiple isolation runtimes (gVisor and Kata Containers), standardized APIs,
and a focus on developer experience, it addresses the growing need for
sandboxed AI workloads in enterprise environments.

The project is actively developing under SIG Apps, and contributions from the
community are welcome. Whether you’re building AI agents, development
environments, or any workload requiring isolated execution, Agent Sandbox
provides a Kubernetes-native solution.


Kubernetes x JobSet: How Co-Evolving Makes AI Jobs Restart 10× Faster

Co-Evolving: When Kubernetes Features Empower the Ecosystem

In the rapidly evolving AI infrastructure landscape, a beautiful synergy is
emerging: the Kubernetes community develops foundational capabilities, and
downstream projects like JobSet,
Ray, and
LeaderWorkerSet (LWS) adopt these
features to dramatically improve their efficiency. We call this Co-Evolving
(协同演进) — the entire ecosystem advancing together.

Kubernetes has been introducing more AI-related capabilities recently, but
realizing their full potential in AI workloads requires adaptation by other
projects. Today, we’ll explore a prime example: JobSet leveraging Kubernetes In-Place Container Restart to achieve 92% faster restart times.

The Problem: Slow JobSet Restart

When a distributed training job running on
JobSet needs to restart (due to
transient failures, configuration updates, or checkpoint recovery), the
traditional approach involves:

  1. Delete all pods in the JobSet
  2. Wait for pod termination to complete
  3. Reschedule all pods through the Kubernetes scheduler
  4. Wait for pod startup (including image pulls, init containers, etc.)

In a large-scale cluster with 5000 nodes, this process takes approximately
2 minutes and 10 seconds. For AI/ML workloads where fast recovery is
critical, this overhead is significant.

The Solution: In-Place Container Restart

Kubernetes has introduced capabilities that allow containers to restart without
pod recreation:

KEP-5307: Container Restart Policy (Kubernetes 1.34)

KEP-5307
introduces fine-grained control over individual container restart behavior
within pods. This allows:

  • Specifying restart policies per container (not just per pod)
  • Triggering container restarts without affecting the entire pod
  • Maintaining pod identity, IP, and volumes during container restarts
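
As a rough sketch of what container-level restart control could look like under KEP-5307 (the field names follow the KEP's alpha design and are subject to change while the feature matures):

# Sketch based on KEP-5307 (alpha); verify field names against your Kubernetes version.
apiVersion: v1
kind: Pod
metadata:
  name: trainer-pod
spec:
  restartPolicy: Never            # pod-level policy stays Never
  containers:
  - name: trainer
    image: my-trainer:latest      # placeholder image
    restartPolicy: Never          # container-level override (new with this KEP)
    restartPolicyRules:           # restart in place only on a retriable exit code
    - action: Restart
      exitCodes:
        operator: In
        values: [42]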

KEP-5532: Restart All Containers on Container Exits (Kubernetes 1.35)

KEP-5532
extends this capability to enable coordinated restarts:

  • Restart all containers in a pod when specific containers exit
  • Restart init containers and sidecars as part of the pod lifecycle
  • Enable pod-level restart coordination without pod recreation

Real-World Results: JobSet In-Place Restart

The JobSet team has developed an
in-place restart prototype
that demonstrates remarkable performance improvements:

Metric | Traditional Restart | In-Place Restart | Improvement
Restart Time | 2m10s | 10s | 92% faster
Test Scale | 5000 nodes | 5000 nodes |
Scheduling Overhead | High | None | Eliminated
Pod Recreation | Required | Not needed | Avoided

For detailed design information, see the
JobSet in-place restart design document.

Why This Matters for AI Workloads

1. Distributed Training Recovery

Large-scale distributed training jobs (PyTorch DDP, TensorFlow MultiWorkerMirroredStrategy)
are particularly sensitive to restart latency:

  • Checkpoint recovery: After a failure, all workers need to restart from
    the latest checkpoint. In-place restart gets workers back online 12x faster.
  • Gradient synchronization: All workers must be running for training to
    proceed. Faster restarts mean less wasted GPU time.
  • Cost savings: On expensive GPU clusters ($2-10/GPU-hour), 2 minutes saved
    per restart adds up significantly.

2. Job Dependencies

Many AI pipelines have complex job dependencies. When a job restarts:

  • Downstream jobs wait for upstream completion
  • Gang scheduling constraints require all workers to be present
  • Network connectivity must be maintained for collective operations

In-place restart preserves pod identity and network connectivity, minimizing
disruption to the overall pipeline.

3. Resource Efficiency

Traditional restart involves:

  • Scheduler load: Finding nodes for potentially thousands of pods
  • API server load: Creating/deleting pod objects
  • Node preparation: Image pulls, volume mounts, init containers

In-place restart eliminates all of this overhead, keeping resources available
for actual workloads.

How It Works

Before: Traditional Restart Flow

Job Restart Triggered
    ↓
Delete All Pods → Wait for Termination (30s+)
    ↓
Create New Pods → Wait for Scheduling (30s+)
    ↓
Pull Images (if needed) → Start Containers (60s+)
    ↓
Total: ~2m10s

After: In-Place Restart Flow

Job Restart Triggered
    ↓
Signal Container Exit → Container Restarts In-Place (10s)
    ↓
Total: ~10s

The key differences:

  1. No pod deletion: Pod objects remain, preserving identity
  2. No rescheduling: Pods stay on their current nodes
  3. No image pulls: Images are already cached on nodes
  4. Immediate restart: Container process simply restarts

Implementation Considerations

When to Use In-Place Restart

  • Transient failures: Container crashes, OOM kills, network timeouts
  • Configuration updates: Restart to pick up new environment variables
  • Checkpoint recovery: Resume training from saved state
  • Rolling updates: Graceful restart of workers in sequence

When Traditional Restart is Needed

  • Node failures: Pod must move to a healthy node
  • Resource changes: Pod needs more/less resources (consider VPA)
  • Image updates: New container image required
  • Topology changes: Pod needs different placement

Integration with JobSet

JobSet can leverage in-place restart through:

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: distributed-training
spec:
  replicatedJobs:
  - name: workers
    replicas: 8
    template:
      spec:
        template:
          spec:
            restartPolicy: Always  # Enables in-place restart
            containers:
            - name: trainer
              image: pytorch/pytorch:latest

The Broader Co-Evolving Pattern

This JobSet improvement exemplifies the Co-Evolving pattern in cloud-native AI:

Kubernetes Capability | Project Adoption | Benefit
In-Place Restart | JobSet | 92% faster recovery
Gang Scheduling (1.35) | Kueue, LWS | All-or-nothing placement
DRA (1.34 GA) | NVIDIA GPU Operator | Flexible device allocation
Workload API (1.35) | Volcano, YuniKorn | Native workload support

As Kubernetes continues to add AI-friendly features, we expect more projects
to adopt them, creating a virtuous cycle of improvement.

Getting Started

Prerequisites

  • Kubernetes 1.34+ (for KEP-5307)
  • Kubernetes 1.35+ (for KEP-5532 pod-level restart)
  • JobSet with in-place restart support (check latest releases)

Enable Feature Gates

# On kubelet for KEP-5307 (Container Restart Policy, 1.34+)
--feature-gates=ContainerRestartPolicy=true

# On kubelet for KEP-5532 (Restart All Containers, 1.35+)
--feature-gates=RestartAllContainersOnContainerExits=true

Test In-Place Restart

  1. Deploy a JobSet with restartPolicy: Always
  2. Trigger a container restart (e.g., kubectl exec ... -- kill -TERM 1)
  3. Observe the restart time compared to pod recreation

Future Roadmap

The in-place restart capability continues to evolve:

  • KEP-5307 graduation: Moving toward Beta/GA
  • KEP-5532 graduation: Enhanced pod-level restart control
  • JobSet integration: Native support for in-place restart policies
  • Monitoring: Better observability for restart events
  • Kueue integration: Workload-aware restart handling

Conclusion

The JobSet in-place restart optimization demonstrates the power of Co-Evolving
in the Kubernetes ecosystem. By adopting upstream Kubernetes capabilities,
projects can achieve dramatic performance improvements:

  • 92% faster restart (2m10s → 10s)
  • No scheduling overhead
  • Preserved pod identity and network
  • Reduced API server load

This is just one example of how the Kubernetes community and downstream
projects work together to improve AI workload efficiency. As more AI-related
features land in Kubernetes, we can expect even more optimizations from
projects like JobSet, Ray, LWS, and others.

The future of AI infrastructure is Co-Evolving — and it’s happening now.



Smarter Scheduling for AI Workloads: Topology-Aware Scheduling

Why Topology? Why Now?

At KubeCon NA 2025, one theme dominated conversations in the AI/ML space:
topology. Everyone is talking about topology-aware scheduling because it’s
critical for optimizing AI workload performance.

Why Topology? Why Now?

Source: Lightning Talk: Mind the Topology – Roman Baron, NVIDIA

Modern AI workloads, especially distributed training and high-performance
inference, are extremely sensitive to hardware topology. When GPUs, NICs, CPUs,
and memory are not properly aligned within the same NUMA node, PCIe root, or
network fabric, performance can degrade by 30-50% or more.

Background: Current Topology Scheduling Support

Device Plugin: The Traditional Approach

Kubernetes Device Plugins have been the standard mechanism for managing
hardware resources like GPUs. The Device Plugin API provides:

Device Management with Device Plugin

Source: KubeCon NA 2025: Device Management

Key Components:

  • GetDevicePluginOptions: Plugin configuration
  • ListAndWatch: Report available devices to kubelet
  • GetPreferredAllocation: Suggest optimal device allocation (topology hint)
  • Allocate: Perform device allocation for containers
  • PreStartContainer: Pre-container-start hooks

Device Plugin supports:

  • Basic GPU counting (e.g., nvidia.com/gpu: 8)
  • MIG (Multi-Instance GPU) partitioning
  • Time-slicing for GPU oversubscription
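
With Device Plugins, a workload only expresses how many devices it needs, not where they sit; a minimal count-based request looks like this (the image is a placeholder):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-training-pod
spec:
  containers:
  - name: trainer
    image: my-trainer:latest
    resources:
      limits:
        nvidia.com/gpu: 8      # count only; no NUMA, PCIe, or NVLink placement expressed

Nothing in this request constrains which GPUs are picked or how they relate to NUMA nodes, PCIe roots, or NICs, which is exactly the gap the approaches below try to close.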

Limitations of Device Plugin

However, Device Plugins have significant limitations for topology-aware
scheduling:

Limitations of Device Plugin Management

Source: KubeCon NA 2025: Device Management

  1. Static isolation config: MIG configurations must be pre-defined
  2. Static slicing config: Time-slicing ratios are fixed at deployment
  3. Only even sharing expected: Limited sharing granularity
  4. Requires secondary scheduler: Complex topologies need additional
    schedulers like Volcano or Kueue

Kueue: Topology-Aware Scheduling

Kueue provides topology-aware
scheduling through node labels. It uses hierarchical topology levels like:

# Node labels for rack/block topology
cloud.google.com/gce-topology-block: "block-1"
cloud.google.com/gce-topology-subblock: "subblock-1"
cloud.google.com/gce-topology-host: "host-1"
kubernetes.io/hostname: "node-1"

Kueue supports:

  • TopologyAwareScheduling: Place workload pods on nodes with matching
    topology
  • Cohort-based resource sharing: Share resources within topology groups
  • Gang scheduling with topology: Ensure all gang members are
    topology-aligned

Kueue Topology Configuration Example:

apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: gpu-topology
spec:
  nodeLabels:
    cloud.google.com/gce-topology-block: "block-1"
  nodeTaints:
  - effect: NoSchedule
    key: nvidia.com/gpu
    value: "present"
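
Kueue's topology-aware scheduling also expects a Topology object that describes the label hierarchy, which a ResourceFlavor references via spec.topologyName. A hedged sketch (API versions and field names follow the Kueue documentation at the time of writing and may evolve):

apiVersion: kueue.x-k8s.io/v1alpha1
kind: Topology
metadata:
  name: gce-topology
spec:
  levels:                     # ordered from widest to narrowest domain
  - nodeLabel: cloud.google.com/gce-topology-block
  - nodeLabel: cloud.google.com/gce-topology-subblock
  - nodeLabel: cloud.google.com/gce-topology-host
  - nodeLabel: kubernetes.io/hostname

The ResourceFlavor above would then add spec.topologyName: gce-topology, and a Job can request alignment through a pod-template annotation such as kueue.x-k8s.io/podset-required-topology: cloud.google.com/gce-topology-block (or the preferred variant for best-effort placement).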

Volcano: Gang Scheduling with Topology

Volcano provides advanced scheduling
features including:

  • Gang scheduling: All-or-nothing scheduling for distributed workloads
  • Topology plugin: Consider GPU topology in scheduling decisions
  • Network-aware scheduling: RDMA/InfiniBand fabric awareness
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: distributed-training
spec:
  minMember: 8
  minResources:
    nvidia.com/gpu: "8"
  queue: training-queue
  # Topology affinity for NVLink connectivity
  topologyPolicy: "best-effort"

DRA: The Next Generation of Topology Management

Dynamic Resource Allocation (DRA)
represents a fundamental shift in how Kubernetes handles device topology. DRA
provides structured parameters that enable rich topology expression and
constraint specification.

How DRA Handles Topology-Aware Scheduling

DRA uses attributes and constraints with CEL (Common Expression
Language) to express topology requirements. The key mechanisms include:

  1. Device Attributes: Each device publishes topology information
    • pcieRoot: PCIe hierarchy identifier
    • numaNode: NUMA node association
    • nvlinkDomain: NVLink fabric identifier
    • rdmaDevice: Associated RDMA NIC
  2. Constraints: CEL expressions that enforce topology rules
    • Same PCIe root for GPU and NIC
    • Same NUMA node for CPU and memory
    • NVLink connectivity between GPUs
  3. SharedID: Devices on the same topology domain get a shared identifier

GPU + NIC Topology Coordination

The most powerful use case for DRA topology is coordinating GPU and NIC
allocation on the same PCIe root. This is critical for RDMA-based distributed
training where GPU-Direct is used.

ResourceClaimTemplate with PCIe Topology Constraint Example:

apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: gpu-nic-topology
spec:
  spec:
    devices:
      requests:
      - name: gpu
        deviceClassName: nvidia-gpu
        count: 1
      - name: rdma-nic
        deviceClassName: rdma-nic
        count: 1
      constraints:
      # GPU and NIC must be on the same PCIe root
      - requests: ["gpu", "rdma-nic"]
        matchAttribute: pcieRoot

How this works:

  1. The DRA scheduler evaluates available GPUs and NICs
  2. For each candidate GPU, it finds NICs on the same PCIe root
  3. Only allocations satisfying the constraint are considered
  4. The matchAttribute: pcieRoot ensures both devices share the same
    PCIe topology

DRANET: Network Device DRA

DRANET is Google’s DRA implementation for
network devices. It integrates with Kueue’s topology-aware scheduling using
node labels:

# DRANET uses these labels for topology awareness
cloud.google.com/gce-topology-block
cloud.google.com/gce-topology-subblock
cloud.google.com/gce-topology-host
kubernetes.io/hostname

DRANET + NVIDIA GPU DRA can coordinate:

  • RDMA NICs allocated with GPUs on same PCIe fabric
  • Multi-NIC configurations for distributed training
  • Network isolation using SR-IOV VFs

CPU Micro-Topology Support

The dra-driver-cpu
project is adding CPU micro-topology support including:

  • NUMA-aware CPU allocation
  • CPU pinning with topology alignment
  • Coordination with GPU NUMA placement

DRAConsumableCapacity: New in Kubernetes 1.34

A major advancement in DRA is the DRAConsumableCapacity feature:

DRAConsumableCapacity

Source: KubeCon NA 2025: Device Management

Key Capabilities:

  • Alpha feature introduced in Kubernetes 1.34
  • Recommended to start using from Kubernetes 1.35 (still in Alpha)

Core abilities:

  • Allow multiple allocations over multiple resource requests
  • Consumable capacity: Guaranteed resource sharing

Potential use cases:

  • Virtual GPU Memory Partitioning
  • Virtual NIC (vNIC) Sharing
  • Bandwidth-limited Network Allocation
  • I/O Bandwidth Smart Storage Device Sharing
  • Native Resource Request (CPU)

This enables much more flexible resource sharing while maintaining topology
awareness.


Challenges: Device Plugin to DRA Migration

Many organizations have invested heavily in Device Plugin-based solutions.
Migrating to DRA presents several challenges:

1. Existing Device Plugin Investments

Organizations may have:

  • Custom Device Plugins with topology logic
  • Integration with monitoring and observability tools
  • Operator workflows depending on Device Plugin APIs

2. Coexistence Problems

Running Device Plugin and DRA together can cause:

  • Resource conflicts: Same device managed by both systems
  • Topology inconsistency: Different topology views between systems
  • Scheduling confusion: Scheduler doesn’t have unified view

3. Feature Gaps

Some Device Plugin features don’t have DRA equivalents yet:

  • Device health monitoring: Device Plugin has built-in health checks
  • Hot-plug support: Device Plugin supports dynamic device addition
  • Metrics integration: Prometheus metrics from Device Plugins

Solutions and Workarounds

DRA Extension Capabilities:

  • DRA drivers can implement compatibility layers
  • NVIDIA’s DRA driver supports Device Plugin migration path
  • NRI integration can bridge runtime-level gaps

Recommended Migration Path:

  1. Deploy DRA driver alongside existing Device Plugin
  2. Use node taints to partition workloads
  3. Gradually migrate workloads to DRA-based resource claims
  4. Phase out Device Plugin once all workloads migrated
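
Step 2 of this path can be as simple as tainting DRA-enabled nodes (for example, kubectl taint nodes <node> dra-only=true:NoSchedule) and letting only DRA-ready pods tolerate the taint. A rough sketch that reuses the gpu-nic-topology claim template from the DRA example above (the taint key and image are arbitrary choices):

apiVersion: v1
kind: Pod
metadata:
  name: dra-workload
spec:
  tolerations:
  - key: dra-only              # arbitrary taint key used to partition nodes
    operator: Equal
    value: "true"
    effect: NoSchedule
  resourceClaims:
  - name: gpu-claim
    resourceClaimTemplateName: gpu-nic-topology   # template defined in the DRA section
  containers:
  - name: app
    image: my-dra-app:latest   # placeholder image
    resources:
      claims:
      - name: gpu-claim        # devices come from DRA, not device-plugin counts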

Related KubeCon Talks

Several excellent talks from KubeCon NA 2025 cover these topics:

Lightning Talk: Mind the Topology

Mind the Topology: Smarter Scheduling for AI Workloads on Kubernetes
by Roman Baron, NVIDIA

Key topics:

  • Why topology matters for AI workloads
  • NVIDIA KAI Scheduler for topology-aware scheduling
  • NVIDIA KAI-Scheduler

Device Management Deep Dive

Deep dive into DRA and Device Plugin

Key topics:

  • Evolution from Device Plugin to DRA
  • DRAConsumableCapacity feature
  • Multi-device topology coordination

Best Practices for Topology-Aware Scheduling

  1. Understand your topology requirements
    • Profile workloads to identify topology sensitivity
    • Map hardware topology (PCIe, NUMA, NVLink, RDMA)
  2. Choose the right scheduling approach
    • Simple GPU workloads: Device Plugin + Topology Manager
    • Complex multi-device: DRA with constraints
    • Distributed training: Kueue or Volcano + DRA
  3. Label nodes with topology information
    • Use a consistent labeling scheme
    • Include rack, block, and host-level topology
  4. Test topology impact
    • Benchmark with and without topology alignment
    • Measure latency and throughput differences
  5. Plan for migration
    • Start with new workloads on DRA
    • Create compatibility tests
    • Document topology requirements

Conclusion

Topology-aware scheduling has evolved from a nice-to-have feature to a critical
requirement for AI workloads. The transition from Device Plugin to DRA
represents a fundamental shift in how Kubernetes manages hardware topology:

  • Device Plugin: Simple, established, but limited topology support
  • DRA: Rich topology expression, multi-device coordination, future of
    Kubernetes device management

As AI workloads continue to grow in complexity, the need for sophisticated
topology-aware scheduling will only increase. Whether you’re using Kueue,
Volcano, or native Kubernetes scheduling, understanding topology and planning
for DRA adoption is essential for optimizing your AI infrastructure.



Kubernetes Introduces Native Gang Scheduling Support to Better Serve AI/ML Workloads

Chinese version: https://mp.weixin.qq.com/s/EO0yfdVQMNgKI7nqkJ18Yw ("Kubernetes 支持原生 Gang Scheduling:适应 AI/ML 工作负载")

Introduction

Scheduling large workloads in Kubernetes has always been challenging. When you need to run distributed training jobs, batch processing tasks, or other multi-pod applications, the traditional pod-by-pod scheduling approach can lead to resource wastage, deadlocks, and inefficiencies. Today, we’re excited to share insights about the Workload Aware Scheduling initiative that’s transforming how Kubernetes handles multi-pod workloads.

The Problem with Traditional Pod Scheduling

In traditional Kubernetes scheduling, each pod is scheduled independently. For distributed workloads like:

  • Distributed ML training (e.g., PyTorch, TensorFlow multi-worker jobs)
  • Batch processing (e.g., Apache Spark, Ray clusters)
  • High-performance computing (e.g., MPI applications)

This independent scheduling creates several problems:

  1. Partial scheduling deadlocks: Some pods get scheduled while others wait indefinitely for resources
  2. Resource wastage: Scheduled pods consume resources but can’t start work until all peers are ready
  3. Poor cluster utilization: Resources are tied up by incomplete workloads
  4. Unpredictable job completion times: Jobs may wait hours or days in partially-scheduled states

Kubernetes v1.35: Workload Aware Scheduling

The Kubernetes community has introduced Workload Aware Scheduling in v1.35, featuring three major components:

1. Workload API (Alpha)

The new Workload API resource in scheduling.k8s.io/v1alpha1 provides a structured way to define scheduling requirements for multi-pod applications.

apiVersion: scheduling.k8s.io/v1alpha1
kind: Workload
metadata:
  name: training-job-workload
  namespace: ml-workloads
spec:
  podGroups:
  - name: workers
    policy:
      gang:
        # All-or-nothing: schedule only if 4 pods can run together
        minCount: 4

Link your pods to the workload:

apiVersion: v1
kind: Pod
metadata:
  name: worker-0
  namespace: ml-workloads
spec:
  workloadRef:
    name: training-job-workload
    podGroup: workers
  containers:
  - name: trainer
    image: my-ml-framework:latest
    resources:
      requests:
        nvidia.com/gpu: 1

2. Gang Scheduling (Alpha)

Gang scheduling implements the all-or-nothing placement strategy:

How it works:

  1. Waiting Phase: When pods arrive, the scheduler blocks them until minCount pods are pending
  2. Evaluation Phase: The scheduler attempts to find suitable nodes for all pods in the gang
  3. Decision Phase:
    • ✅ Success: If all pods can be placed, they’re bound to nodes together
    • ❌ Failure: If any pod can’t be placed within timeout (5 minutes), ALL pods are rejected and requeued

This prevents resource waste and ensures your distributed workload either runs completely or waits for sufficient resources.

Key benefits:

  • Eliminates partial scheduling deadlocks
  • Improves cluster utilization by freeing resources for runnable workloads
  • Provides predictable behavior for distributed applications
  • Works seamlessly with pod preemption and autoscaling

3. Opportunistic Batching (Beta)

Opportunistic Batching is a performance optimization that speeds up scheduling of identical pods without requiring any configuration changes.

How it works:

When the scheduler processes pods with identical scheduling requirements (same resources, images, affinities, etc.), it can reuse feasibility calculations and scoring results for subsequent pods in the queue.

Performance impact:

  • Dramatically reduces scheduling latency for large homogeneous workloads
  • Can improve scheduling throughput by 5-10x for batch workloads
  • Works transparently – no user configuration needed
  • Enabled by default in Kubernetes v1.35 (Beta)

Current restrictions:

  • Disabled for pods using topology spread constraints
  • Disabled for pods using Dynamic Resource Allocation (DRA)
  • All scheduling-relevant pod fields must be identical

Real-World Use Cases

Distributed ML Training

apiVersion: scheduling.k8s.io/v1alpha1
kind: Workload
metadata:
  name: pytorch-training
spec:
  podGroups:
  - name: workers
    policy:
      gang:
        minCount: 8  # Need 8 GPUs for distributed training

Your PyTorch distributed training job only starts when all 8 workers can be scheduled, preventing wasted GPU resources.

Apache Spark on Kubernetes

apiVersion: scheduling.k8s.io/v1alpha1
kind: Workload
metadata:
  name: spark-job
spec:
  podGroups:
  - name: executors
    policy:
      gang:
        minCount: 10  # 1 driver + 9 executors minimum

Spark jobs with gang scheduling avoid the common problem where the driver starts but executors can’t be scheduled.

Ray Clusters

Ray applications benefit from gang scheduling by ensuring the head node and worker nodes start together, enabling immediate distributed computation.

The Roadmap: What’s Coming in 1.36 and Beyond

The Workload Aware Scheduling effort has an ambitious roadmap for Kubernetes 1.36:

Planned for v1.36

  • Expanding Workload API: Enhanced capabilities and refinements based on alpha feedback
  • Auto-workload for Job, StatefulSet, JobSet: Automatic workload creation for common Kubernetes resources
  • Topology Aware Scheduling: Consider network and hardware topology when placing gang members
  • Single-cycle workload scheduling: Schedule entire gangs in a single scheduling cycle for better performance
  • Tree-based workload scheduling algorithm: More efficient gang placement decisions
  • Improved binding process: Better handling of kubelet races using nominations
  • Delayed preemption: Introduce nominating victims before actual eviction
  • Workload-level preemption: Preempt entire gangs rather than individual pods

Long-term Vision

The ultimate goal is to make Kubernetes natively understand and optimize for workload-level operations, including:

  • Deep integration with cluster autoscaling
  • Workload-aware resource quotas and limits
  • Better support for mixed workload types (batch + serving)
  • Enhanced observability for multi-pod applications

Upcoming Official Blog Post

The Kubernetes community is preparing an official blog post about Workload Aware Scheduling that will be published soon on the Kubernetes blog. Watch for kubernetes/website#53012 to be merged for the official announcement.

Getting Started

Prerequisites

  • Kubernetes v1.35 or later
  • Feature gates configured on kube-apiserver and kube-scheduler

Enable Workload API and Gang Scheduling

# On kube-apiserver
--feature-gates=GenericWorkload=true
--runtime-config=scheduling.k8s.io/v1alpha1=true

# On kube-scheduler
--feature-gates=GenericWorkload=true,GangScheduling=true

Enable Opportunistic Batching

Opportunistic Batching is enabled by default in v1.35 as a Beta feature. To disable it:

# On kube-scheduler
--feature-gates=OpportunisticBatching=false

Testing Gang Scheduling

  1. Create a Workload resource
  2. Create pods with workloadRef pointing to the Workload
  3. Observe scheduling behavior in kube-scheduler logs (see the example commands after this list)
  4. Monitor metrics for gang scheduling success/failure rates
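A minimal command sequence for those steps might look like this; the namespace, pod name, and scheduler label are illustrative and depend on how your control plane is deployed:

# Watch the gang pods: they should stay Pending until the whole gang fits
kubectl get pods -n ml-workloads -w

# Inspect scheduling events for one member of the gang
kubectl describe pod worker-0 -n ml-workloads

# Follow kube-scheduler logs (kubeadm-style static pod shown here)
kubectl logs -n kube-system -l component=kube-scheduler -f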

Best Practices

  1. Set appropriate minCount: Consider your application’s minimum viable size
  2. Use resource requests accurately: Gang scheduling depends on accurate resource requirements
  3. Monitor scheduling metrics: Track gang scheduling success rates and timeout events
  4. Test with cluster autoscaling: Ensure your autoscaler can provision nodes for gangs
  5. Plan for failure scenarios: Understand timeout behavior and retry logic

Comparison with Existing Solutions

Before native gang scheduling, users relied on:

  • Volcano: CNCF incubating project with gang scheduling
  • Kueue: Kubernetes SIG project for queue and quota management
  • YuniKorn: Apache project with gang scheduling support
  • Custom schedulers: In-house solutions for specific use cases

Why use native gang scheduling?

  • Maintained by Kubernetes SIG Scheduling
  • Integrated with core scheduler features (preemption, autoscaling)
  • No additional components to deploy and maintain
  • Part of the Kubernetes conformance suite (eventually)

When to use external schedulers?

  • Need production-ready gang scheduling today (use Volcano or Kueue)
  • Require features beyond current Kubernetes roadmap
  • Have existing investments in specific schedulers

Resources and References

KEPs and Documentation

Related Projects

Several projects currently support gang scheduling:

  • Volcano Scheduler – CNCF Incubating
    • Full gang scheduling support
    • Recently added LeaderWorkerSet (LWS) gang scheduling in v1.13 release
  • Koordinator – Alibaba Open Source
    • Basic gang scheduling capabilities
    • Workload orchestration and resource scheduling enhancements
  • Kueue – Kubernetes SIG Project
    • CoScheduling support (a lighter version of gang scheduling)
    • Focus on job queueing and quota management
  • YuniKorn – Apache Project
    • Gang scheduling and resource scheduling capabilities

Community

Conclusion

Gang Scheduling and Workload Aware Scheduling represent a major step forward for Kubernetes in supporting AI/ML, HPC, and batch processing workloads. The v1.35 alpha release provides a foundation for native multi-pod scheduling, with an exciting roadmap for v1.36 and beyond.

We encourage the community to:

  • Test these features in development environments
  • Provide feedback through GitHub issues
  • Share use cases and requirements
  • Contribute to the ongoing development

The future of Kubernetes scheduling is workload-aware, and the journey has just begun!

Kubernetes Introduces Native Gang Scheduling Support to Better Serve AI/ML Workloads

The Shift to cgroups v2 in Kubernetes: What You Need to Know

As v1.35 will announce the cgroup v1 deprecation, the kubelet will fail on cgroup v1 hosts with the default configuration: FailCgroupV1 will be set to true by default. See the upcoming blog https://github.com/kubernetes/website/pull/52814 for details. Below is what I wrote after cgroup v1 was announced to enter maintenance mode. Because it linked to a lot of material and I never got it into a complete enough shape, I stopped updating https://github.com/kubernetes/website/pull/47342 and am publishing it here instead, for readers who want to know why we should shift from cgroup v1 to v2 and what the differences are.

cgroups (control groups) are a Linux kernel feature used for managing system resources. Kubernetes uses cgroups to allocate resources like CPU and memory to containers, ensuring that applications run smoothly without interfering with each other. With the release of Kubernetes v1.31, cgroups v1 has been moved into [maintenance mode](/blog/2024/08/14/kubernetes-1-31-moving-cgroup-v1-support-maintenance-mode/). cgroups v2 support, in contrast, graduated to stable in v1.25, two years earlier.

The top FAQs are: why should we migrate, what do we gain and lose, and what should we watch out for when using cgroups v2.

cgroups v1 problems, and solutions in cgroups v2

The official documentation for cgroups v1 and cgroups v2 can be found in the kernel docs.

Let’s enumerate some known issues.

active_file memory is not considered as available memory

There is a known issue of page cache: #43916.

  • In cgroups v1, there is no native solution. Workarounds include setting a larger memory limit for Pods, or using external projects to drop the page cache or to throttle memory allocation once usage goes beyond a threshold.
  • In cgroups v2, we can use memory.high to throttle.

Support for Memory QoS was initially added in Kubernetes v1.22, and later some limitations around the formula for calculating memory.high were identified. These limitations were addressed in Kubernetes v1.27.

However, as of v1.31 the feature gate is still alpha, due to another known issue where an application pod may hang forever under heavy memory reclaim.
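If you still want to experiment with it, the feature is toggled on the kubelet; a minimal sketch, assuming the MemoryQoS feature gate and the memoryThrottlingFactor field available in recent kubelet versions:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  MemoryQoS: true            # alpha; only has an effect on cgroups v2 hosts
memoryThrottlingFactor: 0.9  # factor applied to the memory limit when deriving memory.high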

Container aware OOM killer and better OOM handling strategies

In cgroups v1, a single process of a multi-process Pod could be killed by the OOM killer while the rest keep running. In that case, the Pod has to use runit or supervisord to manage the lifecycle of its processes.

cgroups v2 provides the cgroup.kill file. Writing “1” to it causes the cgroup and all of its descendant cgroups to be killed: every process in the affected cgroup tree receives SIGKILL. A Pod may run multiple processes, and all of them can be killed simultaneously.

As mentioned above, cgroups v2 memory.high can throttle new memory allocations, so the cgroup becomes aware of memory pressure earlier. Besides, PSI can also help to observe the memory load; oomd is a good example of using PSI to implement a userspace out-of-memory killer.
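Both interfaces mentioned above are plain files under the unified hierarchy; a quick illustration, where the pod cgroup path is a placeholder you would look up yourself:

# Kill every process in a pod's cgroup subtree at once (cgroups v2 only)
echo 1 > /sys/fs/cgroup/kubepods.slice/<pod-cgroup>/cgroup.kill

# Read memory pressure (PSI) for the same cgroup
cat /sys/fs/cgroup/kubepods.slice/<pod-cgroup>/memory.pressure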

Rootless support

In cgroups v1, delegating cgroups v1 controllers to less privileged containers may be dangerous.

Unlike cgroups v1, cgroups v2 officially supports delegation. Most Rootless Containers implementations rely on systemd for delegating v2 controllers to non-root users.

According to KEP-127, the minimal kernel version for user namespaces is 6.5.
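For reference, user namespaces are opted into per Pod; a minimal sketch, assuming the UserNamespacesSupport feature gate is enabled on your cluster:

apiVersion: v1
kind: Pod
metadata:
  name: userns-demo        # illustrative name
spec:
  hostUsers: false         # run the pod in its own user namespace
  containers:
  - name: app
    image: registry.k8s.io/pause:3.9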

What’s more?

  1. eBPF stories:
    • In cgroups v1, device access control is defined in a static configuration.
    • The cgroups v2 device controller has no interface files and is implemented on top of cgroup BPF.
    • Cilium automatically mounts the cgroups v2 filesystem required to attach BPF cgroup programs, by default at /run/cilium/cgroupv2.
  2. PSI support is planned in a future release (KEP-4205), but is pending due to the delayed runc 1.2.0 release.
  3. Monitoring tools support, like cAdvisor: currently, cgroups v2 features are not yet fully supported.

Adopting cgroup version 2

Requirements

Here’s what you need to use cgroup v2 with Kubernetes. First up, you need to be using a version of Kubernetes with support for v2 cgroup management; that’s been stable since Kubernetes v1.25 and all supported Kubernetes releases include this support.

  • OS distribution enables cgroups v2
  • Linux Kernel version is 5.8 or later
  • Container runtime supports cgroups v2. For example:
    • containerd v1.4 or later (at the time of writing, containerd releases v1.6 and later are within that project’s support period)
    • CRI-O v1.20 or later
  • The kubelet and the container runtime are configured to use the systemd cgroup driver

kernel updates around cgroups v2

cgroups v2 first appeared in Linux Kernel 4.5 in 2016.

  • In Linux 4.5, cgroups v2 supported io, memory, and pid management.
  • Linux 4.15 added support for cgroups v2 cpu management
  • Pressure Stall Information (PSI) support began with Linux 4.20.
  • The Kubernetes project does not recommend using cgroups v2 with a Linux kernel older than 5.2 due to lack of cgroup-level task freezer support.
  • In Kubernetes, 5.8 is the minimal kernel version for cgroups v2, because the root cpu.stat file on cgroup v2 was only added in kernel 5.8.
  • memory.peak is added in 5.19.

Use systemd as cgroup driver

Configure the kubelet’s cgroup driver to match the container runtime cgroup driver.

The Container runtimes page explains that the systemd driver is recommended for kubeadm based setups instead of the kubelet’s default cgroupfs driver, because kubeadm manages the kubelet as a systemd service.

A minimal example of configuring the field explicitly:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cgroupDriver: systemd

In v1.31, KEP-4033 is beta: it extends the CRI API so that the kubelet can discover the cgroup driver from the container runtime. This helps installers and the kubelet autodetect the correct driver.

Tools and commands for troubleshooting

Tools and commands that you should know about cgroups:

  • stat -fc %T /sys/fs/cgroup/: Check whether cgroups v2 is enabled; it returns cgroup2fs if so.
  • systemctl list-units kube* --type=slice or --type=scope: List the kube-related units that systemd currently has in memory.
  • bpftool cgroup list /sys/fs/cgroup/*: List all BPF programs attached to the given cgroups.
  • systemd-cgls /sys/fs/cgroup/*: Recursively show control group contents.
  • systemd-cgtop: Show top control groups by their resource usage.
  • tree -L 2 -d /sys/fs/cgroup/kubepods.slice: Show Pods’ related cgroups directories.

How to check if a Pod CPU or memory limit is successfully applied to the cgroup files?

  • Kubernetes Pod spec: limits spec.containers[*].resources.limits.{cpu,memory} and requests spec.containers[*].resources.requests.{cpu,memory}
  • CRI: cpu_period, cpu_quota, cpu_shares for CPU and memory_limit_in_bytes for the memory limit
  • OCI Spec: memory.limit, cpu.shares, cpu.quota, cpu.period
  • Systemd scope unit: CPUWeight, CPUQuotaPerSecUSec, CPUQuotaPeriodUSec, MemoryMax
  • cgroupfs values: /sys/fs/cgroup/../cpu.weight, /sys/fs/cgroup/../cpu.max, /sys/fs/cgroup/../memory.max (see the example below for reading these on a node)
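A rough way to read those cgroupfs values on a node; the pod UID and the resolved directory are placeholders you would substitute yourself:

# Locate the pod's cgroup directory (the directory name encodes the pod UID)
find /sys/fs/cgroup/kubepods.slice -maxdepth 2 -type d -name "*<pod-uid>*"

# Then read the cgroups v2 files inside that directory
cat <pod-cgroup-dir>/cpu.weight    # derived from the CPU request
cat <pod-cgroup-dir>/cpu.max       # "<quota> <period>", derived from the CPU limit
cat <pod-cgroup-dir>/memory.max    # derived from the memory limit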

Further reading

The Shift to cgroups v2 in Kubernetes: What You Need to Know

Introducing llmaz: Easy, advanced inference platform for large language models on Kubernetes

InftyAI’s llmaz is an advanced inference platform designed to streamline the deployment and management of large language models (LLMs) on Kubernetes. By integrating state-of-the-art inference backends, llmaz brings cutting-edge research to the cloud, offering a production-ready solution for LLMs.

Key Features of llmaz:

  • Easy-to-use Kubernetes integration: deploy and manage LLMs within Kubernetes clusters, leveraging Kubernetes’ robust orchestration capabilities.
  • Advanced Inference Backends: Utilize state-of-the-art inference backends to ensure efficient and scalable model serving.
  • Production-Ready: Designed for production environments, llmaz offers reliability and performance for enterprise applications.

The deployment of a model is quite simple in llmaz.

Here’s a toy example for deploying deepseek-ai/DeepSeek-R1: all you need to do is apply an OpenModel and a Playground.

apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: deepseek-r1
spec:
  familyName: deepseek
  source:
    modelHub:
      modelID: deepseek-ai/DeepSeek-R1
  inferenceConfig:
    flavors:
      - name: default # Configure GPU type
        requests:
          nvidia.com/gpu: 1

apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: deepseek-r1
spec:
  replicas: 1
  modelClaim:
    modelName: deepseek-r1
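Assuming the two manifests above are saved locally, deploying and checking the result is just a couple of commands (file names are illustrative):

kubectl apply -f model.yaml -f playground.yaml
kubectl get openmodels,playgrounds
kubectl get pods    # the inference pods appear once the model backend is up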

Latest Release: v0.1.3

The latest release, v0.1.3, was published on April 23rd, 2025. It includes several enhancements and bug fixes that improve the platform’s stability and performance. For detailed information on the changes, please refer to the release notes.

Integrations

Broad Backends Support:  llmaz supports a wide range of advanced inference backends for different scenarios, like vLLM, Text-Generation-Inference, SGLang, llama.cpp. Find the full list of supported backends here.

llmaz supports a wide range of model providers, such as HuggingFace, ModelScope, ObjectStores.

AI Gateway support: capabilities such as token-based rate limiting and model routing via the integration with Envoy AI Gateway.
Built-in ChatUI: out-of-the-box chatbot support via the integration with Open WebUI, offering capabilities like function calling, RAG, web search and more; see the configurations here.

llmaz, serving as an easy-to-use and advanced inference platform, uses LeaderWorkerSet as the underlying workload to support both single-host and multi-host inference scenarios.

llmaz supports horizontal scaling with HPA by default and will integrate with autoscaling components like Cluster-Autoscaler or Karpenter for smart scaling across different clouds.

About the Founder: Kante Yin

Kante Yin is a prominent figure in the Kubernetes community, serving as a SIG Scheduling Approver and a top committer of LWS and Kueue. His contributions to Kubernetes scheduling and workload management have been instrumental in advancing cloud-native technologies. Kante’s expertise and leadership continue to drive innovation in the Kubernetes ecosystem.

Compared to other inference platforms, llmaz stands out with its extensible, cloud-native design, making it lightweight and efficient. Its architecture is optimized for scalability and resource efficiency, enabling seamless integration into modern cloud environments while maintaining high performance.

OSPP 2025 (Open Source Promotion Plan)

The Open Source Promotion Plan is a summer program launched in 2020 and organized by the Open Source Software Supply Chain Promotion Plan of the Institute of Software, Chinese Academy of Sciences. It aims to encourage university students to actively participate in the development and maintenance of open source software, cultivate and discover more outstanding developers, promote the vigorous development of excellent open source communities, and help build the open source software supply chain.

llmaz has two projects in OSPP 2025. Student registration and application: May 9 – June 9. You are welcome to join our community.

  1. KEDA-based Serverless Elastic Scaling for llmaz
  2. Enabling Efficient Model and Container Image Distribution in LLMaz with Dragonfly

For more information about llmaz and its features, visit the GitHub repository.

Introducing llmaz: Easy, advanced inference platform for large language models on Kubernetes

KubeCon On-site Notes: From HeadLamp to the MCP Wave

Beyond the impressive showcases of various technical projects, this conference also featured quite a few lightweight tools and community projects that drew broad attention. Below are my notes and impressions on some of the highlights.

HeadLamp

As a Kubernetes community project, it is making a very competitive debut.

  • As a replacement: HeadLamp’s feature set is already rich enough to replace the traditional Kubernetes Dashboard and part of KubeSphere’s functionality.
  • A Microsoft-style desktop experience: in the keynote demo, Microsoft made clear its intent to turn it into a must-have desktop app for every Kubernetes user. Compared with competitors such as Lens, its lightweight, convenient, self-hostable nature left a strong impression: users only need to add each cluster’s token or certificate to start managing their clusters.

ETCD Operator

Attention vs. reality: the project attracted plenty of interest when it launched, but there are currently very few active contributors, and it is essentially in a “help wanted” stage.

Cross-domain collaboration and challenges

Defining autoscaling metrics: how to define scaling metrics for LLM workloads remains a hard problem.

The current support for S3-based model storage is rather frustrating; all of this leaves plenty of room for future design improvements in the community.

Inference: comparing the vLLM production stack, KServe, AIBrix, and llmaz, my current impression is that KServe carries a lot of historical baggage; constrained by existing users and product requirements, it can hardly afford a disruptive refactor, which is somewhat worrying. AIBrix and llmaz are both just getting started: AIBrix has ByteDance behind it, while llmaz aims to stay more lightweight.

The MCP wave: a surge of new projects and integrations

This week brought yet another wave of MCP enthusiasm:

In addition, clusterpedia also needs a solution in this space, and the Kubernetes MCP projects launched by manusa and silenceper are actively exploring it.

Project diversity: many newly created, popular projects are joining the ranks of MCP support, and some are adding MCP support directly to existing projects.

Examples

Dagger’s integration with MCP

The MCP discussion in k8sgpt

Steering annual reports and SIG updates

Based on the annual reports from each SIG/WG, the Steering committee summarized the areas that need more contributors. Projects that need help: in their annual reports, SIG maintainers all listed a number of unresolved items and features, a clear signal that the community is asking for help as it keeps iterating and improving.

Highlights from the booths and the Demo Theater

The exhibition hall also attracted a lot of attention:

  • Wiz booth: showcased a rich security tooling UI; the demo conveyed a solid foundation in the security space.
  • Demo Theater: many sponsor demos shone; for example, Google’s on-site 65k-node demo left a deep impression on attendees.
  • Popular booth topics: the most popular areas right now are observability, security, and AI + Gateway; the Kubeflow track was another highlight.

The venue and personal impressions

Among the many on-site experiences, a few things are worth mentioning:

  • A small itinerary hiccup: I arrived early on the first day in poor shape, and at the Maintainer Summit I mainly joined the Steering AMA. The guesthouse I had booked also fell short of expectations; a reminder to avoid booking rooms through Booking.com if you can.
  • Top End User award: this is the first time a KubeCon outside China has given the award to a Chinese company, Ant Group; JD.com and DiDi had won it before, but that was at KubeCon China. The judging criteria focus mostly on community contributions.
  • International perspective: at the Japan roundtable, both the number of maintainers and the range of topics have clearly grown.
  • End users: the end-user talks at KubeCon Europe covered fields all the way from industry to agriculture.
  • Venue layout: the room layout this time was a bit odd; Rooms A-H were well attended, while some rooms on the third floor or in hidden corners (such as Room IJ) were rather quiet.
  • Talks recap
    • Hot topics: AI-related sessions (LLM, Ollama, benchmarks, DRA, k8sgpt) drew the most attention; Argo, Cilium, OTel, and Platform Engineering also had sizable audiences.
    • Project showcases: the Project Lightning sessions were very popular; elsewhere, the honeycomb visualization in the keynote worked well, and Karpenter, Cluster API, and vCluster (my topic) all sparked lively post-session discussions.
    • Quieter areas: some maintainer-oriented or low-level topics, such as storage, saw relatively low attendance and interest this time.
  • Catering: compared with previous editions, the on-site food seems to have slipped back to its old “hard to eat” state (Paris, as I recall, was excellent); the main issue was that everything was cold.

More than 12,500 people attended this event, a record high; the cloud native wave does not seem to be cooling down. Keep it up 👍

KubeCon On-site Notes: From HeadLamp to the MCP Wave