support ttl metrics for volcano-scheduler. #4760
fengruotj wants to merge 1 commit into volcano-sh:master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. Needs approval from an approver in each of these files.
Summary of Changes

Hello @fengruotj, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers get up to speed quickly. This pull request implements a Time-To-Live (TTL) mechanism for metrics within the volcano-scheduler. The feature prevents the memory leaks that occur when the scheduler's internal cache and its snapshot become out of sync, leaving stale metrics to accumulate indefinitely. With TTL, metrics related to job scheduling and queue resources automatically expire and are cleaned up, ensuring more efficient memory usage and accurate reporting, and thereby improving the scheduler's stability and performance.
Code Review
This pull request introduces a TTL mechanism for scheduler metrics to prevent memory leaks from stale metric data, which is a great improvement. The core logic is encapsulated in a TTLCollectorWrapper that handles metric expiration. The implementation is generally solid, but I've found a potential bug in the cache cleanup logic and some areas for improvement regarding configuration and test setup consistency. My review includes a fix for the bug and suggestions to make the TTL values configurable and to standardize test initializations.
```go
func (w *CollectorWrapper) matchByLabelOrder(keyLabels []string, matchLabels prometheus.Labels) bool {
	if len(w.labelNames) == 0 || len(keyLabels) != len(w.labelNames) {
		return false
	}

	for i, labelName := range w.labelNames {
		matchVal, ok := matchLabels[labelName]
		if ok && matchVal != keyLabels[i] {
			return false
		}
	}
	return true
}
```
The matchByLabelOrder function has incorrect logic for performing a partial match. The current implementation will return true even if some labels in matchLabels are not present in the metric's labels, as long as the present ones match. This can lead to incorrect metric cache cleanup in DeletePartialMatch. The logic should ensure that all labels provided in matchLabels are present and match the metric's labels.
```go
func (w *CollectorWrapper) matchByLabelOrder(keyLabels []string, matchLabels prometheus.Labels) bool {
	if len(w.labelNames) == 0 || len(keyLabels) != len(w.labelNames) {
		return false
	}
	labelsFound := 0
	for i, labelName := range w.labelNames {
		if matchVal, ok := matchLabels[labelName]; ok {
			if matchVal != keyLabels[i] {
				return false // Value does not match.
			}
			labelsFound++
		}
	}
	return labelsFound == len(matchLabels)
}
```

```go
metrics.InitTTLQueueMetrics(context.Background())
metrics.InitTTLJobMetrics(context.Background())
```
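To illustrate the corrected partial-match semantics, the predicate can be exercised standalone. This is a sketch for illustration only: `prometheus.Labels` is replaced by a plain map, and the label names and values are made up.

```go
package main

import "fmt"

// matchByLabelOrder, reproduced standalone: keyLabels holds the metric's
// label values in the order of labelNames, and matchLabels is the partial
// filter. Every filter label must exist among labelNames and agree with
// its positional value; unknown filter labels now cause a non-match.
func matchByLabelOrder(labelNames, keyLabels []string, matchLabels map[string]string) bool {
	if len(labelNames) == 0 || len(keyLabels) != len(labelNames) {
		return false
	}
	labelsFound := 0
	for i, name := range labelNames {
		if v, ok := matchLabels[name]; ok {
			if v != keyLabels[i] {
				return false // Value does not match.
			}
			labelsFound++
		}
	}
	return labelsFound == len(matchLabels)
}

func main() {
	names := []string{"queue", "job"}
	key := []string{"default", "job-1"}
	// Matches: the filter label exists and agrees with the key.
	fmt.Println(matchByLabelOrder(names, key, map[string]string{"queue": "default"})) // true
	// "owner" is not one of the metric's labels, so the fixed version
	// rejects the filter instead of silently ignoring it.
	fmt.Println(matchByLabelOrder(names, key, map[string]string{"owner": "alice"})) // false
}
```

The pre-fix version would have returned `true` in the second case, which is exactly the behavior that corrupted `DeletePartialMatch` cleanup.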
The TTL metrics initialization should not be done within a test function. Instead, please move this logic to a TestMain function for the cache package to ensure it's executed only once for all tests in this package.
Example TestMain function:
```go
import (
	"context"
	"os"
	"testing"

	"volcano.sh/volcano/pkg/scheduler/metrics"
)

func TestMain(m *testing.M) {
	metrics.InitTTLQueueMetrics(context.Background())
	metrics.InitTTLJobMetrics(context.Background())
	os.Exit(m.Run())
}
```

```go
expirationTime := time.Hour * 2
checkIntervalTime := time.Hour
```
```go
func init() {
	options.Default()
	metrics.InitTTLQueueMetrics(context.Background())
	metrics.InitTTLJobMetrics(context.Background())
}
```
Using an init() function for test setup is not idiomatic in Go. It's better to use a TestMain function, which provides more control over the test execution. Please refactor this init() function into a TestMain function. You will also need to import the os package.
```diff
-func init() {
-	options.Default()
-	metrics.InitTTLQueueMetrics(context.Background())
-	metrics.InitTTLJobMetrics(context.Background())
-}
+func TestMain(m *testing.M) {
+	options.Default()
+	metrics.InitTTLQueueMetrics(context.Background())
+	metrics.InitTTLJobMetrics(context.Background())
+	os.Exit(m.Run())
+}
```
Hope I can have some bandwidth to review today.
I just see that DeleteJobMetrics is called after each job is deleted. Can you explain why there is still a memory leak?
Force-pushed from 34efa2f to dad7134.
In fact, the internal Jobs data structure Volcano uses for session scheduling is a snapshot of the current scheduler cache. If a job happens to be deleted during the scheduling process, the corresponding job_name label will remain in the Prometheus metrics.
Force-pushed from cc0b253 to 7121b5b.
Author: tanjie.master <tanjiemaster@gmail.com>
Date: Fri Nov 28 14:21:37 2025 +0800
Signed-off-by: tanjie.master <tanjiemaster@gmail.com>
Force-pushed from 7121b5b to 86323b9.
Can you elaborate on it a little bit more? How is this related to metrics?
OK, just using queue metrics as an example: when the scheduler executes the openSession function in
@fengruotj Thanks for your explanation, I think I get it. The handler at pkg/scheduler/cache/event_handlers.go:1000 handles the queue deletion immediately, but the scheduling procedure still works on the snapshot. That said, your solution seems too tricky to me. I am thinking we could introduce a delayed deletion instead; that way is very simple and avoids changing the metrics wrapper introduced here. We can add a delay queue: whenever a queue is deleted, we add it to the delay queue and only delete its metrics after the delay elapses.
Got it, I see. Your solution sounds like a good idea.
/close

Thank you for the contribution @fengruotj! Maybe we should create a delayed-deletion feature for the metrics, but it's not that important now, I think, as long as the memory leak problems disappear.
@hajnalmt: Closed this PR.
What type of PR is this?
/kind feature
This PR resolves the GitHub issue below; please review.
What this PR does / why we need it:
Added a TTL mechanism to the scheduler's metrics.
volcano-scheduler stores Kubernetes objects in a local cache during scheduling. When scheduling begins, it takes a snapshot of this cache and executes scheduling logic based on configured plugins, during which metrics are recorded. If the snapshot and the local cache become inconsistent, metrics can leak. The TTL feature prevents this.
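The leak described above can be illustrated with a toy simulation, where plain maps stand in for the scheduler cache and the Prometheus series; all names here (`recordFromSnapshot`, the job names) are hypothetical.

```go
package main

import "fmt"

// recordFromSnapshot simulates one scheduling session: it snapshots the
// cache, then the given job is deleted from the live cache (and its
// metric series removed), yet scheduling continues over the snapshot
// and re-creates the series.
func recordFromSnapshot(cache map[string]bool, deletedMidSession string) map[string]float64 {
	// Session start: snapshot the live cache.
	snapshot := make(map[string]bool, len(cache))
	for job := range cache {
		snapshot[job] = true
	}

	// Mid-session deletion; metric cleanup (DeleteJobMetrics) runs here.
	delete(cache, deletedMidSession)
	metrics := map[string]float64{}
	delete(metrics, deletedMidSession)

	// Plugins keep scheduling against the snapshot and record metrics,
	// resurrecting the deleted job's series.
	for job := range snapshot {
		metrics[job] = 1 // e.g. a per-job latency observation
	}
	return metrics
}

func main() {
	cache := map[string]bool{"job-a": true, "job-b": true}
	m := recordFromSnapshot(cache, "job-b")
	_, stale := m["job-b"]
	fmt.Println(stale) // true: the deleted job's series leaked back
}
```

A TTL then reaps such resurrected series once no session touches them anymore.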
Which issue(s) this PR fixes:
Fixes the volcano-scheduler memory leak problem.
Special notes for your reviewer:
NONE
Does this PR introduce a user-facing change?
NONE