support ttl metrics for volcano-scheduler.#4760

Closed
fengruotj wants to merge 1 commit into volcano-sh:master from fengruotj:fix-scheduler-metrics

Conversation

@fengruotj
Contributor

What type of PR is this?

/kind feature

This PR resolves the GitHub issue referenced below. Please review.

What this PR does / why we need it:

Added a TTL mechanism to the scheduler’s Metrics.

volcano-scheduler stores Kubernetes objects in a local cache during scheduling. When scheduling begins, it takes a snapshot of this cache and executes scheduling logic based on configured plugins, during which metrics are recorded. If the snapshot and the local cache become inconsistent, metrics can leak. The TTL feature prevents this.

Which issue(s) this PR fixes:

Fixes volcano-scheduler memory leak problem

Special notes for your reviewer:

NONE

Does this PR introduce a user-facing change?

NONE


@volcano-sh-bot volcano-sh-bot added kind/feature Categorizes issue or PR as related to a new feature. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Nov 28, 2025
@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign k82cn for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Nov 28, 2025
@gemini-code-assist

Summary of Changes

Hello @fengruotj, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request implements a Time-To-Live (TTL) mechanism for metrics within the volcano-scheduler. This feature is crucial for preventing memory leaks that can occur when the scheduler's internal cache and its snapshot become out of sync, leading to stale metrics accumulating indefinitely. By introducing TTL, metrics related to job scheduling and queue resources will automatically expire and be cleaned up, ensuring more efficient memory usage and accurate reporting, thereby enhancing the scheduler's stability and performance.

Highlights

  • TTL Mechanism for Metrics: Introduced a Time-To-Live (TTL) mechanism for scheduler metrics to prevent memory leaks caused by stale metrics.
  • Memory Leak Resolution: This change directly addresses and resolves the reported volcano-scheduler memory leak problem by ensuring metrics are automatically cleaned up.
  • New Utility for TTL Metrics: Added a new utility, TTLCollectorWrapper, which extends Prometheus collectors with TTL functionality, allowing metrics to expire after a configurable duration.
  • Metrics Initialization and Usage Update: Modified the initialization and update logic for various job and queue-related metrics to utilize the new TTL-enabled wrappers, ensuring proper lifecycle management.
  • Extensive Test Coverage: Updated numerous test files across scheduler actions, cache, and plugins to correctly initialize and interact with the new TTL metrics, along with dedicated unit tests for the TTLCollectorWrapper.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a TTL mechanism for scheduler metrics to prevent memory leaks from stale metric data, which is a great improvement. The core logic is encapsulated in a TTLCollectorWrapper that handles metric expiration. The implementation is generally solid, but I've found a potential bug in the cache cleanup logic and some areas for improvement regarding configuration and test setup consistency. My review includes a fix for the bug and suggestions to make the TTL values configurable and to standardize test initializations.

Comment on lines +142 to +154
func (w *CollectorWrapper) matchByLabelOrder(keyLabels []string, matchLabels prometheus.Labels) bool {
	if len(w.labelNames) == 0 || len(keyLabels) != len(w.labelNames) {
		return false
	}

	for i, labelName := range w.labelNames {
		matchVal, ok := matchLabels[labelName]
		if ok && matchVal != keyLabels[i] {
			return false
		}
	}
	return true
}

high

The matchByLabelOrder function has incorrect logic for performing a partial match. The current implementation will return true even if some labels in matchLabels are not present in the metric's labels, as long as the present ones match. This can lead to incorrect metric cache cleanup in DeletePartialMatch. The logic should ensure that all labels provided in matchLabels are present and match the metric's labels.

func (w *CollectorWrapper) matchByLabelOrder(keyLabels []string, matchLabels prometheus.Labels) bool {
	if len(w.labelNames) == 0 || len(keyLabels) != len(w.labelNames) {
		return false
	}

	labelsFound := 0
	for i, labelName := range w.labelNames {
		if matchVal, ok := matchLabels[labelName]; ok {
			if matchVal != keyLabels[i] {
				return false // Value does not match.
			}
			labelsFound++
		}
	}

	return labelsFound == len(matchLabels)
}

Comment on lines +154 to +155
metrics.InitTTLQueueMetrics(context.Background())
metrics.InitTTLJobMetrics(context.Background())

medium

The TTL metrics initialization should not be done within a test function. Instead, please move this logic to a TestMain function for the cache package to ensure it's executed only once for all tests in this package.

Example TestMain function:

import (
	"context"
	"os"
	"testing"
	"volcano.sh/volcano/pkg/scheduler/metrics"
)

func TestMain(m *testing.M) {
	metrics.InitTTLQueueMetrics(context.Background())
	metrics.InitTTLJobMetrics(context.Background())
	os.Exit(m.Run())
}

Comment on lines +134 to +135
expirationTime := time.Hour * 2
checkIntervalTime := time.Hour

medium

The expirationTime and checkIntervalTime for TTL job metrics are hardcoded. This reduces flexibility. It would be better to make these values configurable via the scheduler's options, allowing administrators to tune them based on their specific environment and requirements.

Comment on lines +56 to +57
expirationTime := time.Hour * 2
checkIntervalTime := time.Hour

medium

The expirationTime and checkIntervalTime for TTL queue metrics are hardcoded. This reduces flexibility. It would be better to make these values configurable via the scheduler's options, allowing administrators to tune them based on their specific environment and requirements.

Comment on lines 38 to 42
func init() {
options.Default()
metrics.InitTTLQueueMetrics(context.Background())
metrics.InitTTLJobMetrics(context.Background())
}

medium

Using an init() function for test setup is not idiomatic in Go. It's better to use a TestMain function, which provides more control over the test execution. Please refactor this init() function into a TestMain function. You will also need to import the os package.

Suggested change
func init() {
options.Default()
metrics.InitTTLQueueMetrics(context.Background())
metrics.InitTTLJobMetrics(context.Background())
}
func TestMain(m *testing.M) {
options.Default()
metrics.InitTTLQueueMetrics(context.Background())
metrics.InitTTLJobMetrics(context.Background())
os.Exit(m.Run())
}

@hzxuzhonghu
Member

Hope I can find some bandwidth to review today.

@hzxuzhonghu
Member

I just see that DeleteJobMetrics is called after each job is deleted. Can you explain why there is still a memory leak?


// DeleteJobMetrics delete all metrics related to the job
func DeleteJobMetrics(jobName, queue, namespace string) {
	e2eJobSchedulingDuration.DeleteLabelValues(jobName, queue, namespace)
	e2eJobSchedulingStartTime.DeleteLabelValues(jobName, queue, namespace)
	e2eJobSchedulingLastTime.DeleteLabelValues(jobName, queue, namespace)
	unscheduleTaskCount.DeleteLabelValues(jobName)
	jobShare.DeleteLabelValues(namespace, jobName)
	jobRetryCount.DeleteLabelValues(jobName)
}

@fengruotj fengruotj force-pushed the fix-scheduler-metrics branch 2 times, most recently from 34efa2f to dad7134 Compare December 1, 2025 09:48
@fengruotj
Contributor Author

fengruotj commented Dec 1, 2025

I just see that DeleteJobMetrics is called after each job is deleted. Can you explain why there is still a memory leak?


// DeleteJobMetrics delete all metrics related to the job
func DeleteJobMetrics(jobName, queue, namespace string) {
	e2eJobSchedulingDuration.DeleteLabelValues(jobName, queue, namespace)
	e2eJobSchedulingStartTime.DeleteLabelValues(jobName, queue, namespace)
	e2eJobSchedulingLastTime.DeleteLabelValues(jobName, queue, namespace)
	unscheduleTaskCount.DeleteLabelValues(jobName)
	jobShare.DeleteLabelValues(namespace, jobName)
	jobRetryCount.DeleteLabelValues(jobName)
}

In fact, the internal Jobs data structure Volcano uses for session scheduling is a snapshot of the current scheduler cache. If a job happens to be deleted during the scheduling process, the corresponding job_name metric label set remains in the Prometheus metrics.

@fengruotj fengruotj force-pushed the fix-scheduler-metrics branch from cc0b253 to 7121b5b Compare December 1, 2025 10:56
Author:    tanjie.master <tanjiemaster@gmail.com>
Date:      Fri Nov 28 14:21:37 2025 +0800
Signed-off-by: tanjie.master <tanjiemaster@gmail.com>
@fengruotj fengruotj force-pushed the fix-scheduler-metrics branch from 7121b5b to 86323b9 Compare December 1, 2025 11:22
@hzxuzhonghu
Member

In fact, the internal Jobs data structure used by Volcano for session scheduling is a snapshot of the current scheduler cache.

Can you elaborate a little bit more? How is this related to metrics?

@fengruotj fengruotj changed the title [WIP] support ttl metrics for volcano-scheduler. support ttl metrics for volcano-scheduler. Dec 1, 2025
@volcano-sh-bot volcano-sh-bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 1, 2025
@fengruotj
Contributor Author

In fact, the internal Jobs data structure used by Volcano for session scheduling is a snapshot of the current scheduler cache.

Can you elaborate a little bit more? How is this related to metrics?

OK, using queue metrics as an example: when the scheduler executes the openSession function in pkg/scheduler/framework/session.go:163, it creates a snapshot with cache.Snapshot(), so the ssn data structure works on snapshot data; its Jobs and Queues are initialized from the cache snapshot. The session then runs the scheduling process, during which the capacity plugin exposes queue metrics at pkg/scheduler/plugins/capacity/capacity.go:325. If, before that point, the queue has been closed and deleted in pkg/scheduler/cache/event_handlers.go:1000, its metrics have already been cleaned up. The capacity plugin then re-exposes the queue metrics from the stale snapshot, and they are never deleted again. The same applies to job metrics.

@hzxuzhonghu
Member

@fengruotj Thanks for your explanation. I think I get it. The handler at pkg/scheduler/cache/event_handlers.go:1000 deletes the queue's metrics immediately, but the scheduling procedure in pkg/scheduler/plugins/capacity/capacity.go uses a snapshot of the cache (and records metrics from it). So there can be a delay and a race here.

But your solution seems too tricky. I am thinking we could introduce delayed deletion instead; that way is very simple and avoids the metrics wrapper introduced here. We can add a delaying queue (NewTypedDelayingQueueWithConfig): whenever a queue is deleted, we add it with AddAfter, and start a separate consumer that reads the queue and deletes the metrics.

@fengruotj
Contributor Author

@fengruotj Thanks for your explanation. I think I get it. The handler at pkg/scheduler/cache/event_handlers.go:1000 deletes the queue's metrics immediately, but the scheduling procedure in pkg/scheduler/plugins/capacity/capacity.go uses a snapshot of the cache (and records metrics from it). So there can be a delay and a race here.

But your solution seems too tricky. I am thinking we could introduce delayed deletion instead; that way is very simple and avoids the metrics wrapper introduced here. We can add a delaying queue (NewTypedDelayingQueueWithConfig): whenever a queue is deleted, we add it with AddAfter, and start a separate consumer that reads the queue and deletes the metrics.

Got it, I see. Your solution sounds like a good idea.

@hajnalmt
Member

/close

Thank you for the contribution @fengruotj !
This seems too tricky in the end, let's close this in favor of:
#4822

Maybe we should create a delay deletion feature for the metrics, but it's not that important now I think if the memory leak problems disappear.

@volcano-sh-bot
Contributor

@hajnalmt: Closed this PR.

Details

In response to this:

/close

Thank you for the contribution @fengruotj !
This seems too tricky in the end, let's close this in favor of:
#4822

Maybe we should create a delay deletion feature for the metrics, but it's not that important now I think if the memory leak problems disappear.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
