support ttl metrics for volcano-scheduler.#4760

Closed
fengruotj wants to merge 1 commit into volcano-sh:master from fengruotj:fix-scheduler-metrics

Conversation

@fengruotj
Contributor

What type of PR is this?

/kind feature

This PR resolves the GitHub issue referenced below. Please review.

What this PR does / why we need it:

Added a TTL mechanism to the scheduler’s Metrics.

volcano-scheduler stores Kubernetes objects in a local cache during scheduling. When scheduling begins, it takes a snapshot of this cache and executes scheduling logic based on configured plugins, during which metrics are recorded. If the snapshot and the local cache become inconsistent, metrics can leak. The TTL feature prevents this.

Which issue(s) this PR fixes:

Fixes volcano-scheduler memory leak problem

Special notes for your reviewer:

NONE

Does this PR introduce a user-facing change?

NONE


@volcano-sh-bot volcano-sh-bot added kind/feature Categorizes issue or PR as related to a new feature. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Nov 28, 2025
@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign k82cn for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Nov 28, 2025
@gemini-code-assist

Summary of Changes

Hello @fengruotj, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request implements a Time-To-Live (TTL) mechanism for metrics within the volcano-scheduler. This feature is crucial for preventing memory leaks that can occur when the scheduler's internal cache and its snapshot become out of sync, leading to stale metrics accumulating indefinitely. By introducing TTL, metrics related to job scheduling and queue resources will automatically expire and be cleaned up, ensuring more efficient memory usage and accurate reporting, thereby enhancing the scheduler's stability and performance.

Highlights

  • TTL Mechanism for Metrics: Introduced a Time-To-Live (TTL) mechanism for scheduler metrics to prevent memory leaks caused by stale metrics.
  • Memory Leak Resolution: This change directly addresses and resolves the reported volcano-scheduler memory leak problem by ensuring metrics are automatically cleaned up.
  • New Utility for TTL Metrics: Added a new utility, TTLCollectorWrapper, which extends Prometheus collectors with TTL functionality, allowing metrics to expire after a configurable duration.
  • Metrics Initialization and Usage Update: Modified the initialization and update logic for various job and queue-related metrics to utilize the new TTL-enabled wrappers, ensuring proper lifecycle management.
  • Extensive Test Coverage: Updated numerous test files across scheduler actions, cache, and plugins to correctly initialize and interact with the new TTL metrics, along with dedicated unit tests for the TTLCollectorWrapper.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a TTL mechanism for scheduler metrics to prevent memory leaks from stale metric data, which is a great improvement. The core logic is encapsulated in a TTLCollectorWrapper that handles metric expiration. The implementation is generally solid, but I've found a potential bug in the cache cleanup logic and some areas for improvement regarding configuration and test setup consistency. My review includes a fix for the bug and suggestions to make the TTL values configurable and to standardize test initializations.

Comment on lines +142 to +154
func (w *CollectorWrapper) matchByLabelOrder(keyLabels []string, matchLabels prometheus.Labels) bool {
	if len(w.labelNames) == 0 || len(keyLabels) != len(w.labelNames) {
		return false
	}

	for i, labelName := range w.labelNames {
		matchVal, ok := matchLabels[labelName]
		if ok && matchVal != keyLabels[i] {
			return false
		}
	}
	return true
}

high

The matchByLabelOrder function has incorrect logic for performing a partial match. The current implementation will return true even if some labels in matchLabels are not present in the metric's labels, as long as the present ones match. This can lead to incorrect metric cache cleanup in DeletePartialMatch. The logic should ensure that all labels provided in matchLabels are present and match the metric's labels.

func (w *CollectorWrapper) matchByLabelOrder(keyLabels []string, matchLabels prometheus.Labels) bool {
	if len(w.labelNames) == 0 || len(keyLabels) != len(w.labelNames) {
		return false
	}

	labelsFound := 0
	for i, labelName := range w.labelNames {
		if matchVal, ok := matchLabels[labelName]; ok {
			if matchVal != keyLabels[i] {
				return false // Value does not match.
			}
			labelsFound++
		}
	}

	return labelsFound == len(matchLabels)
}

Comment on lines +154 to +155
metrics.InitTTLQueueMetrics(context.Background())
metrics.InitTTLJobMetrics(context.Background())

medium

The TTL metrics initialization should not be done within a test function. Instead, please move this logic to a TestMain function for the cache package to ensure it's executed only once for all tests in this package.

Example TestMain function:

import (
	"context"
	"os"
	"testing"
	"volcano.sh/volcano/pkg/scheduler/metrics"
)

func TestMain(m *testing.M) {
	metrics.InitTTLQueueMetrics(context.Background())
	metrics.InitTTLJobMetrics(context.Background())
	os.Exit(m.Run())
}

Comment on lines +134 to +135
expirationTime := time.Hour * 2
checkIntervalTime := time.Hour

medium

The expirationTime and checkIntervalTime for TTL job metrics are hardcoded. This reduces flexibility. It would be better to make these values configurable via the scheduler's options, allowing administrators to tune them based on their specific environment and requirements.

Comment on lines +56 to +57
expirationTime := time.Hour * 2
checkIntervalTime := time.Hour

medium

The expirationTime and checkIntervalTime for TTL queue metrics are hardcoded. This reduces flexibility. It would be better to make these values configurable via the scheduler's options, allowing administrators to tune them based on their specific environment and requirements.

Comment on lines 38 to 42
func init() {
options.Default()
metrics.InitTTLQueueMetrics(context.Background())
metrics.InitTTLJobMetrics(context.Background())
}

medium

Using an init() function for test setup is not idiomatic in Go. It's better to use a TestMain function, which provides more control over the test execution. Please refactor this init() function into a TestMain function. You will also need to import the os package.

Suggested change
func init() {
options.Default()
metrics.InitTTLQueueMetrics(context.Background())
metrics.InitTTLJobMetrics(context.Background())
}
func TestMain(m *testing.M) {
options.Default()
metrics.InitTTLQueueMetrics(context.Background())
metrics.InitTTLJobMetrics(context.Background())
os.Exit(m.Run())
}

@hzxuzhonghu
Member

Hope I can find some bandwidth to review today.

@hzxuzhonghu
Member

I just see that DeleteJobMetrics is called after each job is deleted. Can you explain why there is still a memory leak?


// DeleteJobMetrics delete all metrics related to the job
func DeleteJobMetrics(jobName, queue, namespace string) {
	e2eJobSchedulingDuration.DeleteLabelValues(jobName, queue, namespace)
	e2eJobSchedulingStartTime.DeleteLabelValues(jobName, queue, namespace)
	e2eJobSchedulingLastTime.DeleteLabelValues(jobName, queue, namespace)
	unscheduleTaskCount.DeleteLabelValues(jobName)
	jobShare.DeleteLabelValues(namespace, jobName)
	jobRetryCount.DeleteLabelValues(jobName)
}

@fengruotj fengruotj force-pushed the fix-scheduler-metrics branch 2 times, most recently from 34efa2f to dad7134 Compare December 1, 2025 09:48
@fengruotj
Contributor Author

fengruotj commented Dec 1, 2025

I just see that DeleteJobMetrics is called after each job is deleted. Can you explain why there is still a memory leak?


// DeleteJobMetrics delete all metrics related to the job
func DeleteJobMetrics(jobName, queue, namespace string) {
	e2eJobSchedulingDuration.DeleteLabelValues(jobName, queue, namespace)
	e2eJobSchedulingStartTime.DeleteLabelValues(jobName, queue, namespace)
	e2eJobSchedulingLastTime.DeleteLabelValues(jobName, queue, namespace)
	unscheduleTaskCount.DeleteLabelValues(jobName)
	jobShare.DeleteLabelValues(namespace, jobName)
	jobRetryCount.DeleteLabelValues(jobName)
}

In fact, the internal Jobs data structure Volcano uses for session scheduling is a snapshot of the current scheduler cache. If a job happens to be deleted during the scheduling process, the corresponding job_name metric label set remains in the Prometheus metrics.

@fengruotj fengruotj force-pushed the fix-scheduler-metrics branch from cc0b253 to 7121b5b Compare December 1, 2025 10:56
Author:    tanjie.master <tanjiemaster@gmail.com>
Date:      Fri Nov 28 14:21:37 2025 +0800
Signed-off-by: tanjie.master <tanjiemaster@gmail.com>
@fengruotj fengruotj force-pushed the fix-scheduler-metrics branch from 7121b5b to 86323b9 Compare December 1, 2025 11:22
@hzxuzhonghu
Member

In fact, the internal Jobs data structure used by Volcano for session scheduling is a snapshot of the current scheduler cache.

Can you elaborate a little bit more? How is this related to metrics?

@fengruotj fengruotj changed the title [WIP] support ttl metrics for volcano-scheduler. support ttl metrics for volcano-scheduler. Dec 1, 2025
@volcano-sh-bot volcano-sh-bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 1, 2025
@fengruotj
Contributor Author

In fact, the internal Jobs data structure used by Volcano for session scheduling is a snapshot of the current scheduler cache.

Can you elaborate a little bit more? How is this related to metrics?

OK, using queue metrics as an example: when the scheduler executes the openSession function in pkg/scheduler/framework/session.go:163, it creates a snapshot with cache.Snapshot(), so the ssn data structure works on snapshot data; its Jobs and Queues are initialized from the cache snapshot. The session then runs the scheduling process, during which the capacity plugin exposes queue metrics at pkg/scheduler/plugins/capacity/capacity.go:325. If, before that point, the queue has been closed and deleted in pkg/scheduler/cache/event_handlers.go:1000, its metrics have already been cleaned up. The capacity plugin then re-exposes the queue metrics from the stale snapshot, and they are never deleted again. The same applies to job metrics.

@hzxuzhonghu
Member

@fengruotj Thanks for your explanation. I think I get it. The handler at pkg/scheduler/cache/event_handlers.go:1000 deletes the queue's metrics immediately, but the scheduling procedure in pkg/scheduler/plugins/capacity/capacity.go uses a snapshot of the cache (and records metrics from it). So there can be a delay and a race here.

But your solution seems too tricky. I am thinking we could introduce delayed deletion instead; that way is very simple and avoids the metrics wrapper introduced here. We can add a delaying queue (NewTypedDelayingQueueWithConfig): whenever a queue is deleted, we add it with AddAfter, and start a separate consumer that reads the queue and deletes the metrics.

@fengruotj
Contributor Author

@fengruotj Thanks for your explanation. I think I get it. The handler at pkg/scheduler/cache/event_handlers.go:1000 deletes the queue's metrics immediately, but the scheduling procedure in pkg/scheduler/plugins/capacity/capacity.go uses a snapshot of the cache (and records metrics from it). So there can be a delay and a race here.

But your solution seems too tricky. I am thinking we could introduce delayed deletion instead; that way is very simple and avoids the metrics wrapper introduced here. We can add a delaying queue (NewTypedDelayingQueueWithConfig): whenever a queue is deleted, we add it with AddAfter, and start a separate consumer that reads the queue and deletes the metrics.

Got it, I see. Your solution sounds like a good idea.

@hajnalmt
Member

/close

Thank you for the contribution @fengruotj !
This seems too tricky in the end, let's close this in favor of:
#4822

Maybe we should create a delay deletion feature for the metrics, but it's not that important now I think if the memory leak problems disappear.

@volcano-sh-bot
Contributor

@hajnalmt: Closed this PR.

Details

In response to this:

/close

Thank you for the contribution @fengruotj !
This seems too tricky in the end, let's close this in favor of:
#4822

Maybe we should create a delay deletion feature for the metrics, but it's not that important now I think if the memory leak problems disappear.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
