Fixed memory leak caused by metrics remaining due to race conditions #4822
volcano-sh-bot merged 1 commit into volcano-sh:master
Conversation
Summary of Changes
Hello @halcyon-r, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request resolves a bug where scheduler metrics, specifically `unschedule_task_count`, `job_share`, and `job_retry_counts`, could be left behind for jobs that had already terminated.
Code Review
This pull request aims to fix a memory leak by preventing metric updates for terminated jobs. The approach of adding a check before updating metrics is correct and has been applied to the drf and gang plugins. However, I've identified a critical data race in the new IsJobTerminated function that needs to be addressed to ensure thread safety.
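For context, a minimal sketch of the synchronized lookup; this is not the PR's exact code, and the `Jobs` map field is an assumption (only the `sc.Mutex.Lock()` call is visible in the diff below). The point is that the map read must happen under the cache lock, or it races with concurrent cache updates:

```go
// Sketch only: a job that is no longer present in the cache is
// treated as terminated. Holding the mutex for the map read keeps
// the lookup safe against concurrent cache mutation.
func (sc *SchedulerCache) IsJobTerminated(jobID api.JobID) bool {
	sc.Mutex.Lock()
	defer sc.Mutex.Unlock()
	_, found := sc.Jobs[jobID]
	return !found
}
```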
Branch updated from d695610 to 9c29153.
hajnalmt
left a comment
Looks good to me, I had one question only.
Checked the occurrences too and the commit is fine (we don't update these metrics elsewhere, which is surprising).
pkg/scheduler/cache/interface.go (Outdated)

```go
	SharedDRAManager() framework.SharedDRAManager

	// IsJobTerminated returns if the job was terminated
	IsJobTerminated(jobId string) bool
```
Why haven't you used (jobId api.JobID) here too?
It would have spared the cast in cache.go.
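In other words (hypothetical fragments, assuming the cache keys its job map by `api.JobID`):

```go
// With IsJobTerminated(jobId string), cache.go has to convert first:
_, found := sc.Jobs[api.JobID(jobId)]

// With IsJobTerminated(jobId api.JobID), the key is used directly:
_, found = sc.Jobs[jobId]
```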
Yes, you are right. I'll optimize this part.
pkg/scheduler/cache/cache.go

```go
}

func (sc *SchedulerCache) IsJobTerminated(jobId string) bool {
	sc.Mutex.Lock()
```
I have created an enhancement request for this, as this should be a read lock in the long run.
See: #4824
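A minimal sketch of what #4824 suggests, assuming `SchedulerCache.Mutex` were changed to a `sync.RWMutex` (in the snippet above it takes a plain write lock):

```go
// Hypothetical read-lock variant: the existence check is read-only,
// so concurrent lookups would no longer serialize on a write lock.
func (sc *SchedulerCache) IsJobTerminated(jobID api.JobID) bool {
	sc.Mutex.RLock()
	defer sc.Mutex.RUnlock()
	_, found := sc.Jobs[jobID]
	return !found
}
```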
Verify that the job still exists before setting the job's metrics. Signed-off-by: hairuiyang <hairuiyang@deeproute.ai>
hajnalmt
left a comment
/lgtm
Thank you for the contribution and the optimization!
hzxuzhonghu
left a comment
Good catch.
@JesseStutler Should we backport this to previous releases?
[APPROVALNOTIFIER] This PR is APPROVED

This pull request has been approved by: hajnalmt, hzxuzhonghu

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
@halcyon-r So it's the same issue here but with a different solution? #4760 I prefer checking whether the job still exists over the TTL approach, but let's also have @fengruotj take a look.
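For comparison, a rough sketch of the existence-check approach on the plugin side; the `metrics.UpdateJobShare` call and the `cache` accessor are illustrative assumptions, not the exact volcano API:

```go
// Skip the metric write once the job has left the cache; otherwise the
// series would be re-created after termination and never cleaned up.
if cache.IsJobTerminated(job.UID) {
	return
}
metrics.UpdateJobShare(job.Namespace, job.Name, share) // illustrative call
```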
pkg/scheduler/cache/interface.go

```go
	SharedDRAManager() framework.SharedDRAManager

	// IsJobTerminated returns if the job was terminated
	IsJobTerminated(jobId api.JobID) bool
```
I feel we are being too arbitrary in adding a method to the cache interface here; this is not the kind of design a cache interface should have.
Verify that the job still exists before setting the job's metrics.
What type of PR is this?
/kind bug
What this PR does / why we need it:
Before writing the `unschedule_task_count`, `job_share`, and `job_retry_counts` metrics, check if the job exists to prevent leftover metrics.
Which issue(s) this PR fixes:
Fixes #4821
Special notes for your reviewer:
Does this PR introduce a user-facing change?