What's New

@kingeasternsun

What's Changed

Bug fixes

cherrypick 4829 to release 1.13: keep terminating pod in job by @kingeasternsun in #4860
[release-1.13] fix potential panic on numa resources info updating in snapshot by @qi-min in #4897
[release-1.13] Fix gpu resource error by @sailorvii in #4916
[release-1.13] Update metrics_client_prometheus.go by @nitindhiman314e in #4931
[release-1.13] Fix shared mutable objects in scheduler snapshot clones by @zhifei92 in #5093

Full Changelog: v1.13.1...v1.13.2

@JesseStutler

What's Changed

Bug fixes

[release-1.14] Fixed issue where jobs with subgroups but not hard networkTopology.mode could not be scheduled. by @JesseStutler in #5041
[release-1.14] fix: The AllocatedHyperNode recovery for SubJobs during scheduler restart may not be the lowest tier. by @ouyangshengjia in #5012

Full Changelog: v1.14.0...v1.14.1

@ssfffss

Summary

Volcano v1.14.0 establishes Volcano as a unified scheduling platform for diverse workloads at scale. This release introduces a scalable multi-scheduler architecture with dynamic node scheduling shard, enabling multiple schedulers to coordinate efficiently across large clusters. A new Agent Scheduler provides fast scheduling for latency-sensitive AI Agent workloads while working seamlessly with the Volcano batch scheduler. Network topology aware scheduling gains significant enhancements including HyperNode-level binpacking, SubGroup policies, and multi-level gang scheduling across Job and SubGroup scopes. Volcano Global integration advances with HyperJob for multi-cluster training and data-aware scheduling. Colocation now support generic operating systems with CPU Throttling, Memory QoS, and Cgroup V2. Additionally, integrated Ascend vNPU scheduling enables efficient sharing of Ascend AI accelerators.

What's New

Key Features Overview

Scalable Multi-Scheduler with Dynamic Node Scheduling Shard (Alpha): Dynamically compute candidate node pools for schedulers with extensible strategies
Fast Scheduling for AI Agent Workloads (Alpha): A new Agent Scheduler for latency-sensitive AI Agent workloads is introduced, working in coordination with Volcano batch scheduler to establish a unified scheduling platform
Network Topology Aware Scheduling Enhancements: Support hyperNode-level binpacking, SubGroup level network topology aware scheduling, and multi-level gang scheduling across Job and SubGroup scopes for distributed workloads
Volcano Global Enhancements: HyperJob for multi-cluster training and data-aware scheduling for federated environments
Colocation for Generic OS: CPU Throttling, Memory QoS, CPU Burst with Cgroup V2 support on Ubuntu, CentOS, and other generic operating systems
Ascend vNPU Scheduling: Integrated support for Ascend 310P/910 series vNPU scheduling with MindCluster and HAMi modes

Key Feature Details

Scalable Multi-Scheduler with Dynamic Node Scheduling Shard (Alpha)

Background and Motivation:

As Volcano evolves to support diverse scheduling workloads at massive scale, the single scheduler architecture faces significant challenges. Different workload types (batch training, AI agents, microservices) have distinct scheduling requirements and resource utilization patterns. A single scheduler becomes a bottleneck, and static resource allocation leads to inefficient cluster utilization.

The Sharding Controller introduces a scalable multi-scheduler architecture that dynamically computes candidate node pools for each scheduler. Unlike strict partitioning, the Sharding Controller calculates dynamic candidate node pools rather than enforcing hard isolation between schedulers. This flexible approach enables Volcano to serve as a unified scheduling platform for diverse workloads while maintaining high throughput and low latency.

Alpha Feature Notice: This feature is currently in alpha stage. The NodeShard CRD (Node Scheduling Shard) API structure and the underlying scheduling shard concepts are actively evolving.

Key Capabilities:

Dynamic Node Scheduling Shard Strategies: Compute dynamic candidate node pools based on various policies. Currently supports scheduling shard by CPU utilization, with an extensible design to support more policies in the future.
NodeShard CRD: Manages dynamic candidate node pools for specific schedulers.
Large-scale Cluster Support: Architecture designed to support large-scale clusters by distributing load across multiple schedulers
Scheduler Coordination: Enable seamless coordination among various scheduler combinations (e.g., multiple Batch Schedulers, or a mix of Agent and Batch Schedulers), establishing Volcano as a unified scheduling platform

Configuration:

# Sharding Controller startup flags
--scheduler-configs="volcano:volcano:0.0:0.6:false:2:100,agent-scheduler:agent:0.7:1.0:true:2:100"
--shard-sync-period=60s
--enable-node-event-trigger=true

# Config format: name:type:min_util:max_util:prefer_warmup:min_nodes:max_nodes

PR: #4777
Design Doc: Sharding Controller Design
Contributors: @ssfffss, @Haoran, @qi-min

Fast Scheduling for AI Agent Workloads (Alpha)

Background and Motivation:

AI Agent workloads are latency-sensitive with frequent task creation, requiring ultra-fast scheduling with high throughput. The Volcano batch scheduler is optimized for batch workloads and processes pods at fixed intervals, which cannot guarantee low latency for Agent workloads. To establish Volcano as a unified scheduling platform for both batch and latency-sensitive workloads, we introduce a dedicated Agent Scheduler.

The Agent Scheduler works in coordination with the Volcano batch scheduler through the Sharding Controller (which is introduced in "Scalable Multi-Scheduler with Dynamic Node Scheduling Shard" feature). This architecture positions Volcano as a unified scheduling platform capable of handling diverse workload types.

Alpha Feature Notice: This feature is currently in alpha stage and under active development. The Agent Scheduler related APIs, configuration options, and scheduling algorithms may be refined in future releases.

Key Capabilities:

Fast-Path Scheduling: Independent scheduler optimized for latency-sensitive workloads such as AI Agent workloads
Multi-Worker Parallel Scheduling: Multiple workers process pods concurrently from the scheduling queue, increasing throughput
Optimistic Concurrency Control: Conflict-Aware Binder resolves scheduling conflicts before executing real binding
Optimized Scheduling Queue: Enhanced queue mechanism with urgent retry support
Unified Platform Integration: Seamless coordination with Volcano batch scheduler via Sharding Controller

Issue: #4722
PRs: #4804, #4801, #4805
Design Doc: Agent Scheduler Design
Contributors: @qi-min, @JesseStutler, @handan-yxh

Network Topology Aware Scheduling Enhancements

Background and Motivation:

Volcano v1.14.0 brings significant enhancements to network topology aware scheduling, addressing the growing demands of distributed workloads including LLM training, HPC, and other network-intensive applications.

Key Enhancements:

SubGroup Level Topology Awareness: Support fine-grained network topology constraints at the SubGroup/Partition level.
Flexible Network Tier Configuration: Support highestTierName for specifying maximum network tier constraints by name.
Multi-Level Gang Scheduling: Improved gang scheduling to support both Job-level and SubGroup-level consistency.
Volcano Job Partitioning: Enable partitioning of Volcano Jobs for better resource management and fault isolation.
HyperNode-Level Binpacking: Optimization for resource utilization across network topology boundaries.

Configuration Example - Volcano Job:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: llm-training-job
spec:
  # ...other fields
  networkTopology:
    mode: hard
    highestTierAllowed: 2  # Job can cross up to Tier 2 HyperNodes
  tasks:
  - name: trainer
    replicas: 8
    partitionPolicy:
      totalPartitions: 2    # Split into 2 partitions
      partitionSize: 4      # 4 pods per partition
      minPartitions: 2      # Minimum 2 partitions required
      networkTopology:
        mode: hard
        highestTierAllowed: 1  # Each partition must stay within Tier 1
    template:
      spec:
        containers:
        - name: trainer
          image: training-image:v1
          resources:
            requests:
              nvidia.com/gpu: 8

Configuration Example - PodGroup SubGroupPolicy:

apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: llm-training-pg
spec:
  minMember: 4
  networkTopology:
    mode: hard
    highestTierAllowed: 2
  subGroupPolicy:
  - name: "trainer"
    subGroupSize: 4
    labelSelector:
      matchLabels:
        volcano.sh/task-spec: trainer
    matchLabelKeys:
    - volcano.sh/partition-id
    networkTopology:
      mode: hard
      highestTierAllowed: 1

Issues: #4188, #4368, #4869
PRs: #4721, #4810, #4795, #4785, #4889
Design Doc: Network Topology Aware Scheduling
Contributors: @ouyangshengjia, @3sunny, @zhaoqi, @wangyang0616, @MondayCha, @Tau721

Colocation for Generic OS

This release brings comprehensive improvements to Volcano's colocation capabilities, with a major milestone: support for generic operating systems (Ubuntu, CentOS, etc.) in addition to OpenEuler. This enables broader adoption of Volcano Agent for resource sharing between online and offline workloads.

New Features in v1.14.0:

CPU Throttling (CPU Suppression)

The CPU usage of online pod...

@wangdongyang1

What's Changed

Bug fixes

[Cherry-pick v1.12] add hcclrank job plugin by @wangdongyang1 in #4555
Automated cherry pick of #4347: When some scalar resources are 0 in deserved, hierarychical queues validation can not pass by @wuxiaobao in #4586
Automated cherry pick of #4590: add permissions for managing namespaces in admission rules by @suyiiyii in #4594
[Cherry-pick v1.12] fix mpi job plugin panic when mpi job only has master task by @wangdongyang1 in #4619
[Cherry-pick v1.12]Sync kube-scheduler:Improve CSILimits plugin accuracy by using VolumeAttachments by @guoqinwill in #4627
Automated cherry pick of #4599: fix: report all scalar metrics for each queue by @hajnalmt in #4651
[Cherry-pick 1.12] fix: Initialize realCapability field in newQueueAttr by @dafu-wu in #4695
[cherry-pick 1.12]Scheduling main loop blocked and timeout due to un-released PreBind lock in Volcano by @guoqinwill in #4699
[release-1.12] Cherry-pick #4786 and #4792: fix replicaset KubeGroupNameAnnotation handling and replicaSet podgroup update synchronization by @hajnalmt in #4843
Automated cherry pick of #4829: keep terminating pod in job by @wangdongyang1 in #4861
[release-1.12] fix potential panic on numa resources info updating in snapshot by @qi-min in #4898
[release-1.12] Fix gpu resource error by @ChenW66 in #4915
[release-1.12] Fix: Changes to task members in a PodGroup caused task validity checks to fail during scheduling by @ouyangshengjia in #4920
[release-1.12] Fix scheduler panic when metrics are disabled by @Copilot in #4921
[release-1.12] Update metrics_client_prometheus.go by @nitindhiman314e in #4932

Maintenance

[release-1.12] Add Free Disk Space step to E2E workflows by @Copilot in #4851

Full Changelog: v1.12.2...v1.12.3

@Wonki4

What's Changed

Bug fixes

Automated cherry pick of #4670: fix: ci err caused bt ray e2e default image by @Wonki4 in #4681
[Cherry-pick 1.13] fix: Initialize realCapability field in newQueueAttr by @dafu-wu in #4694
[cherry-pick 1.13]Scheduling main loop blocked and timeout due to un-released PreBind lock in Volcano by @guoqinwill in #4700
[release-1.13] Fix scheduler panic when metrics are disabled by @Copilot in #4770
Cherry-pick PR #4786 to release-1.13: Fix replicaSet podgroup update synchronization by @jiahuat in #4799
[release-1.13] fix: replicaset KubeGroupNameAnnotation handling by @hajnalmt in #4826
[release-1.13] fix: constant cache warnings by @hajnalmt in #4831
[release-1.13] fix: capacity plugin's preemptivefn logic by @hajnalmt in #4830
[release-1.13] Fix: Changes to task members in a PodGroup caused task validity checks to fail during scheduling by @ouyangshengjia in #4852

Maintenance

[release-1.13] Add Free Disk Space step to E2E workflows by @Copilot in #4763

Full Changelog: v1.13.0...v1.13.1

What's New

Welcome to the v1.13.0 release of Volcano! 🚀 🎉 📣
In this release, we have brought a series of significant enhancements that have been long-awaited by community users:

AI Training and Inference Enhancements
Resource Management and Scheduling Enhancements
- Introduce ResourceStrategyFit Plugin
  - Independent Scoring Strategy by Resource Type
  - Scarce Resource Avoidance (SRA)
- Enhance NodeGroup Functionality
Colocation Enhancements
- Decouple Colocation from OS
- Support Custom OverSubscription Resource Names

Support LeaderWorkerSet for Large Model Inference Scenarios

LeaderWorkerSet (LWS) is an API for deploying a group of Pods on Kubernetes. It is primarily used to address multi-host inference in AI/ML inference workloads, especially scenarios that require sharding large language models (LLMs) and running them across multiple devices on multiple nodes.

Since its open-source release, Volcano has actively integrated with upstream and downstream ecosystems, building a comprehensive community ecosystem for batch computing such as AI and big data. In the v0.7 release of LWS, it natively integrated Volcano's AI scheduling capabilities. When used with the new version of Volcano, LWS automatically creates PodGroups, which are then scheduled and managed by Volcano, thereby implementing advanced capabilities like Gang scheduling for large model inference scenarios.

Looking ahead, Volcano will continue to expand its ecosystem integration capabilities, providing robust scheduling and resource management support for more projects dedicated to enabling distributed inference on Kubernetes.

Usage documentation: LeaderWorkerSet With Gang.

Related PRs: kubernetes-sigs/lws#496, kubernetes-sigs/lws#498, @JesseStutler

Introduce Cron VolcanoJob

This release introduces support for Cron Volcano Jobs. Users can now periodically create and run Volcano Jobs based on a predefined schedule, similar to native Kubernetes CronJobs, to achieve periodic execution of batch computing tasks like AI and big data. Detailed features are as follows:

Scheduled Execution: Define the execution cycle of jobs using standard Cron expressions (spec.schedule).
Timezone Support: Set the timezone in spec.timeZone to ensure jobs execute at the expected local time.
Concurrency Policy: Control concurrent behavior via spec.concurrencyPolicy:
- AllowConcurrent: Allows concurrent execution of multiple jobs (default).
- ForbidConcurrent: Skips the current scheduled execution if the previous job has not completed.
- ReplaceConcurrent: Terminates the previous job if it is still running and starts a new one.
History Management: Configure the number of successful (successfulJobsHistoryLimit) and failed (failedJobsHistoryLimit) job history records to retain; old jobs are automatically cleaned up.
Missed Schedule Handling: The startingDeadlineSeconds field allows tolerating scheduling delays within a certain timeframe; timeouts are considered missed executions.
Status Tracking: The CronJob status (status) tracks currently active jobs, the last scheduled time, and the last successful completion time for easier monitoring and management.

Related PRs: volcano-sh/apis#192, #4560, @GoingCharlie, @hwdef, @Monokaix

Usage example: Cron Volcano Job Example.

Support Label-based HyperNode Auto Discovery

Volcano officially launched network topology-aware scheduling capability in v1.12 and pioneered the UFM auto-discovery mechanism based on InfiniBand (IB) networks. However, for hardware clusters that do not support IB networks or use other network architectures (such as Ethernet), manually maintaining the network topology remains cumbersome.

To address this issue, the new version introduces a Label-based HyperNode auto-discovery mechanism. This feature provides users with a universal and flexible way to describe network topology, transforming complex topology management tasks into simple node label management.

This mechanism allows users to define the correspondence between topology levels and node labels in the volcano-controller-configmap. The Volcano controller periodically scans all nodes in the cluster and automatically performs the following tasks based on their labels:

Automatic Topology Construction: Automatically builds multi-layer HyperNode topology structures from top to bottom (e.g., rack -> switch -> node) based on a set of labels on the nodes.
Dynamic Maintenance: When node labels change, or nodes are added or removed, the controller automatically updates the members and structure of the HyperNodes, ensuring the topology information remains consistent with the cluster state.
Support for Multiple Topology Types: Allows users to define multiple independent network topologies simultaneously to adapt to different hardware clusters (e.g., GPU clusters, NPU clusters) or different network partitions.

Configuration example:

# volcano-controller-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: volcano-controller-configmap
  namespace: volcano-system
data:
  volcano-controller.conf: |
    networkTopologyDiscovery:
      - source: label
        enabled: true
        interval: 10m # Discovery interval
        config:
          networkTopologyTypes:
            # Define a topology type named topology-A
            topology-A:
              # Define topology levels, ordered from top to bottom
              - nodeLabel: "volcano.sh/hypercluster" # Top-level HyperNode
              - nodeLabel: "volcano.sh/hypernode"   # Middle-level HyperNode
              - nodeLabel: "kubernetes.io/hostname" # Bottom-level physical node

This feature is enabled by adding the label source to the Volcano controller's ConfigMap. The above configuration defines a three-layer topology structure named topology-A:

Top Level (Tier 2): Defined by the volcano.sh/hypercluster label.
Middle Level (Tier 1): Defined by the volcano.sh/hypernode label.
Bottom Level: Physical nodes, identified by the Kubernetes built-in kubernetes.io/hostname label.

When a node is labeled as follows, it will be automatically recognized and classified into the topology path cluster-s4 -> node-group-s0:

# Labels for node node-0
labels:
  kubernetes.io/hostname: node-0
  volcano.sh/hypernode: node-group-s0
  volcano.sh/hypercluster: cluster-s4

The label-based network topology auto-discovery feature offers excellent generality and flexibility. It is not dependent on specific network hardware (like IB), making it suitable for various heterogeneous clusters, and allows users to flexibly define hierarchical structures of any depth through labels. It automates complex topology maintenance tasks into simple node label management, significantly reducing operational costs and the risk of errors. Furthermore, this mechanism dynamically adapts to changes in cluster nodes and labels, maintaining the accuracy of topology information in real-time without manual intervention.

Related PR: #4629, @zhaoqi612

Usage documentation: HyperNode Auto Discovery.

Add Native Ray Framework Support

Ray is an open-source unified distributed computing framework whose core goal is to simplify parallel computing from single machines to large-scale clusters, especially suitable for scaling Python and AI applications. To manage and run Ray on Kubernetes, the community provides KubeRay—an operator specifically designed for Kubernetes. It acts as a bridge between Kubernetes and the Ray framework, greatly simplifying the deployment and management of Ray clusters and jobs.

Historically, running Ray workloads on Kubernetes primarily relied on the KubeRay Operator. KubeRay integrated Volcano in its v0.4.0 release (released in 2022) for scheduling and resource management of Ray Clusters, addressing issues like resource deadlocks in distributed training scenarios. With this new version of Volcano, users can now directly create and manage Ray clusters and submit computational tasks through native Volcano Jobs. This provides Ray users with an alternative usage scheme, allowing them to more directly utilize Volcano's capabilities such as Gang Scheduling, queue management and fair scheduling, and job lifecycle management for runni...

@JesseStutler

What's Changed

Automated cherry pick of #4422: Move kube-scheduler related metrics initilization to server.go to avoid panic by @JesseStutler in #4461
Automated cherry pick of #4473: fix node count reconcile by @Monokaix in #4488
[cherry-pick for 1.12]Fix incorrect definition of ReleaseNameEnvKey by @ouyangshengjia in #4490
[cherry-pick for 1.12]Fix the issue where SelectBestNode returns nil when plugin scores are negative by @guoqinwill in #4472
Automated cherry pick of #4487: Add missing capacity metrics in hierarchical queues by @JesseStutler in #4494
[Cherry-pick] Add bump version script; Make version release more automated by @JesseStutler in #4521
[Cherry-pick] fix: update podGroup when statefulSet update by @Poor12 in #4522
Automated: Bump version to v1.12.2 by @JesseStutler in #4518

Full Changelog: v1.12.1...v1.12.2

@Monokaix

What's Changed

Fix queue update conflicts when upgrading to new version by @Monokaix in #4336
Bump image to v1.12.1 by @Monokaix in #4337

Full Changelog: v1.12.0...v1.12.1

What's New

Welcome to the v1.12.0 release of Volcano! 🚀 🎉 📣
In this release, we have brought a bunch of significant enhancements that have long-awaited by community users.

Network Topology Aware Scheduling: Alpha Release

Volcano's network topology-aware scheduling, initially introduced as a preview in v1.11, has now reached its Alpha release in v1.12. This feature aims to optimize the deployment of AI tasks in large-scale training and inference scenarios, such as model parallel training and Leader-Worker inference. It achieves this by scheduling tasks within the same network topology performance domain, which reduces cross-switch communication and significantly enhances task efficiency. Volcano leverages the HyperNode CRD to abstract and represent heterogeneous hardware network topologies, supporting a hierarchical structure for simplified management.

Key features integrated in v1.12 include:

HyperNode Auto-Discovery: Volcano now offers automatic discovery of cluster network topologies. Users can configure the discovery type, and the system will automatically create and maintain hierarchical HyperNodes that reflect the actual cluster network topology. Currently, this supports InfiniBand (IB) networks by acquiring topology information via the UFM (Unified Fabric Manager) interface and automatically updating HyperNodes. Future plans include support for more network protocols like RoCE.
Prioritized HyperNode Selection:

This release introduces a scoring strategy based on both node-level and HyperNode-level evaluations, which are accumulated to determine the final HyperNode score.
- Node-level: It is recommended to configure the BinPack plugin to prioritize filling HyperNodes, thereby reducing resource fragmentation.
- HyperNode-level: Lower-level HyperNodes are preferred for better performance due to fewer cross-switch communications. For HyperNodes at the same level, those containing more tasks receive higher scores to reduce HyperNode-level resource fragmentation.
Support for Label Selector Node Matching:

HyperNode leaf nodes are associated with physical nodes in the cluster, supporting three matching strategies:
- Exact Match: Direct matching of node names.
- Regex Match: Matching node names using regular expressions.
- Label Match: Matching nodes via standard Label Selectors.

Related Documentation:

Related PRs: (#3874, #3894, #3969, #3971, #4068, #4213, #3897, #3887, @ecosysbin, @weapons97, @Xu-Wentao,@penggu @JesseStutler, @Monokaix)

Dynamic MIG Slicing for GPU Virtualization

Volcano's GPU virtualization feature now supports requesting partial GPU resources by memory and compute capacity. This, combined with Device Plugin integration, achieves hardware isolation and improves GPU utilization.

Traditional GPU virtualization restricts GPU usage by intercepting CUDA APIs (based on HAMI-Core software solutions). NVIDIA Ampere architecture introduced MIG (Multi-Instance GPU) technology, allowing a single physical GPU to be partitioned into multiple independent instances. However, general MIG solutions often pre-fix instance sizes, leading to resource waste and insufficient flexibility.

Volcano v1.12 provides dynamic MIG slicing and scheduling capabilities. It can select appropriate MIG instance sizes in real-time based on the user's requested GPU usage and employs a Best-Fit algorithm to minimize resource waste. It also supports GPU scoring strategies like BinPack and Spread to reduce resource fragmentation and enhance GPU utilization. Users can request resources using the unified volcano.sh/vgpu-number, volcano.sh/vgpu-cores, and volcano.sh/vgpu-memory APIs without needing to concern themselves with the underlying implementation.

Related Documentation:

Related PRs: (#4290, #3953, @sailorvii, @archlitchi)

Dynamic Resource Allocation (DRA) Support

Kubernetes DRA (Dynamic Resource Allocation) is a built-in Kubernetes feature designed to provide a more flexible and powerful way to manage heterogeneous hardware resources in a cluster, such as GPUs, FPGAs, and high-performance network cards. It addresses the limitations of traditional Device Plugins in certain advanced scenarios, enabling device vendors and platform administrators to better declare, allocate, and share these hardware resources with Pods and containers.

Volcano v1.12 adds support for DRA. This feature allows the cluster to dynamically allocate and manage external resources, enhancing Volcano's integration with the Kubernetes ecosystem and its resource management flexibility.

Related Documentation:
Unified Scheduling with DRA

Related PR: (#3799, @JesseStutler)

Volcano Global Supports Queue Capacity Management

Queues are a fundamental concept in Volcano. To enable tenant quota management in multi-cluster and multi-tenant environments, Volcano v1.12 introduces enhanced global queue capacity management. Users can now centrally limit tenant resource usage across multiple clusters. The configuration remains consistent with single-cluster setups: tenant quotas are defined by setting the capability field within the queue configuration.

Related PR: volcano-sh/volcano-global#16 (@tanberBro)

Security Enhancements

The Volcano community consistently focuses on security. In v1.12, beyond fine-grained control over sensitive permissions like ClusterRole, we've addressed and fixed the following potential security risks:

HTTP Server Timeout Settings: Metric and Healthz endpoints for all Volcano components have been configured with server-side ReadHeader, Read, and Write timeouts. This prevents prolonged resource occupation.
- PR: #4208
Warning Logs for Skipping SSL Certificate Verification: When client requests set insecureSkipVerify to true, a warning log is now added. We strongly advise enabling SSL certificate verification in production environments.
- PR: #4211
Volcano Scheduler pprof Endpoint Disabled by Default: To prevent the disclosure of sensitive program information, the Profiling data port (used for troubleshooting) is now disabled by default.
- PR: #4173
Removal of Unnecessary File Permissions: Unnecessary execution permissions have been removed from Go source files to maintain minimal file permissions.
- PR: #4171
Security Context and Non-Root Execution for Containers: All Volcano components now run with non-root privileges. We've added seccompProfile, SELinuxOptions, and set allowPrivilegeEscalation to false to prevent container privilege escalation. Additionally, only necessary Linux Capabilities are retained, comprehensively limiting container permissions.
- PR: #4207
HTTP Request Response Body Size Limit: For HTTP requests sent by the Extender Plugin and Elastic Search Service, their response body size is now limited. This prevents excessive resource consumption that could lead to OOM (Out Of Memory) issues.
- Disclosure: GHSA-hg79-fw4p-25p8

Performance Improvements in Large-Scale Scenarios

Volcano continuously optimizes performance. The new version, without affecting functionality, has by default removed and disabled some unnecessary Webhooks, improving performance in large-scale batch creation scenarios:

PodGroup Mutating Webhook Disabled by Default: When creating a PodGroup without specifying a queue, the system can now read from the Namespace to populate it. Since this scenario is uncommon, this Webhook is disabled by default. Users can enable it as needed.
Queue Status Validation Moved from Pod to PodGroup: When a queue is closed, task submission is disallowed. The original validation logic was performed during Pod creation. As Volcano's basic scheduling unit is PodGroup, migrating the validation to PodGroup creation is more logical. Since the number of PodGroups is less than Pods, this reduces Webhook calls, improving perfo...

@kevin-wangzefeng

Important:
This release addresses multiple critical security vulnerabilities. We strongly advise all users to upgrade to immediately to protect your systems and data.

Security Fixes

[Cherry-pick 1.11] Add http response body size limit (#4252 @kevin-wangzefeng )
[Cherry-pick 1.11] Add security context configuration (#4245 @JesseStutler)
Remove the execute permission for some files, chmod to 644 (#4171 @JesseStutler)
add a switch to control whether enable pprof in scheduler (#4173 @JesseStutler)
Add warning msg when TLS verification disabled(#4211 @Monokaix)
Add http server timeout(#4208 @Monokaix)

Other Improvements

Bump image to v1.11.2 (#4232 @JesseStutler)
Fix: remove controller-manager metrics that should not be introduced (#4202 @dongjiang1989)
Filter useless logs in binpack (#4240 @XbaoWu)

Important Notes Before Upgrading

Change: Volcano Scheduler pprof Endpoint Disabled by Default
For security enhancement, the pprof endpoint for the Volcano Scheduler is now disabled by default in this release. If you require this endpoint for debugging or monitoring, you will need to explicitly enable it post-upgrade. This can be achieved by:

If you are using helm, specifying custom.scheduler_pprof_enable=true during Helm installation or upgrade.
OR, manually setting the command-line argument --enable-pprof=true when starting the Volcano Scheduler.

Please be aware of the security implications before enabling this endpoint in production environments.

Releases: volcano-sh/volcano

v1.13.2

What's Changed

Bug fixes

Contributors

Uh oh!

v1.14.1

What's Changed

Bug fixes

Contributors

Uh oh!

v1.14.0

Summary

What's New

Key Features Overview

Key Feature Details

Scalable Multi-Scheduler with Dynamic Node Scheduling Shard (Alpha)

Fast Scheduling for AI Agent Workloads (Alpha)

Network Topology Aware Scheduling Enhancements

Colocation for Generic OS

Contributors

Uh oh!

v1.12.3

What's Changed

Bug fixes

Maintenance

Contributors

Uh oh!

v1.13.1

What's Changed

Bug fixes

Maintenance

Contributors

Uh oh!

v1.13.0

What's New

Support LeaderWorkerSet for Large Model Inference Scenarios

Introduce Cron VolcanoJob

Support Label-based HyperNode Auto Discovery

Add Native Ray Framework Support

Contributors

Uh oh!

v1.12.2

What's Changed

Contributors

Uh oh!

v1.12.1

What's Changed

Contributors

Uh oh!

v1.12.0

What's New

Network Topology Aware Scheduling: Alpha Release

Dynamic MIG Slicing for GPU Virtualization

Dynamic Resource Allocation (DRA) Support

Volcano Global Supports Queue Capacity Management

Security Enhancements

Performance Improvements in Large-Scale Scenarios

Contributors

Uh oh!

v1.11.2

Security Fixes

Other Improvements

Important Notes Before Upgrading

Contributors

Uh oh!