Skip to content

fix: reduce scheduler retry and notification scanning#2274

Merged
yohamta0 merged 5 commits into
mainfrom
fix-issue-546-scheduler-memory
Jun 8, 2026
Merged

fix: reduce scheduler retry and notification scanning#2274
yohamta0 merged 5 commits into
mainfrom
fix-issue-546-scheduler-memory

Conversation

@yohamta0

@yohamta0 yohamta0 commented Jun 8, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • Skip DAG-run notification event reads when no notification destinations are configured by advancing the monitor cursor to the current event head.
  • Add file-backed retry-candidate sidecars so the scheduler retry scanner no longer performs broad failed-status history scans in normal operation.
  • Keep the retry-candidate sidecar self-healing from durable status files, including missing/dirty sidecars and stale candidates after run deletion.

Root Cause

The scheduler-related growth came from two independent history-reading paths: the notification monitor read DAG-run events even when there were no destinations to notify, and the retry scanner used generic failed-status listing across current-day run history. Both paths scaled with accumulated history rather than actionable work.

Testing

  • go test ./internal/persis/file/dagrun -run 'TestStoreListRetryCandidates(RebuildsDirtyCandidateDirectory|RemovesCandidateWhenRunIsGone|IgnoresChildAttemptStatusFiles)' -count=1
  • go test ./internal/persis/file/dagrun -count=1
  • go test ./internal/persis/file/dagrun ./internal/persis/file/eventstore ./internal/service/chatbridge ./internal/service/scheduler ./internal/service/healthcheck -count=1
  • go test ./internal/core/exec ./internal/runtime/... -count=1
  • git diff --check

Summary by cubic

Reduce memory and I/O by removing broad history scans in scheduler retries and notification monitoring. Skip event reads when no destinations are set and use file-backed retry-candidate sidecars for targeted retry scans.

  • Bug Fixes
    • Notifications: NotificationMonitor advances the source cursor to head when NotificationDestinations() is empty, avoiding DAG-run event reads and only delivering future events once destinations are set.
    • Scheduler retries: introduce .dagrun.retry-candidates sidecars updated on status writes; RetryScanner uses ListRetryCandidates when available (falls back to ListStatuses otherwise) to avoid wide failed-status scans.
    • Self-healing and hardening: confine retry-candidate file reads to the sidecar directory; rebuild sidecars when missing, dirty, or corrupted; remove stale candidates when runs are deleted and tolerate cleanup races; ignore child attempt status files.

Written for commit aacbf2e. Summary will update on new commits.

Review in cubic

Summary by CodeRabbit

  • New Features
    • Added persistent tracking of failed DAG run candidates to improve retry reliability and recovery operations.
    • Optimized notification delivery to skip event processing when notification destinations are not configured, reducing unnecessary operations.

Skip DAG-run notification event reads while no destinations are configured by advancing the monitor cursor to the current head.

Use file-backed retry candidate sidecars for scheduler retry scans, with rebuild and stale-candidate cleanup paths so status files remain the source of truth.

Add regression coverage for no-destination notifications and retry candidate listing.
@coderabbitai

coderabbitai Bot commented Jun 8, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

This PR adds file-based retry candidate persistence and integrates it into the retry scanner to optimize failed DAG run discovery. Concurrently, it optimizes the notification monitor to skip event processing when no destinations are configured. The changes span storage persistence, scheduler logic, and monitor behavior.

Changes

Retry candidate tracking and scheduler integration

Layer / File(s) Summary
Retry candidate persistence
internal/persis/file/dagrun/retry_candidates.go, internal/persis/file/dagrun/attempt.go, internal/persis/file/dagrun/retry_candidates_external_test.go
Adds Store.ListRetryCandidates public API and file-based storage under per-day directories with dirty-marker rebuild. Attempt.Write now updates candidate files after status write. Tests validate persistence, rebuild semantics, staleness filtering, and child-attempt exclusion.
Retry scanner integration
internal/service/scheduler/retry_scanner.go, internal/service/scheduler/export_test.go, internal/service/scheduler/retry_scanner_external_test.go
Introduces retryCandidateLister interface; listFailedRuns checks whether dagRunStore implements it and delegates to ListRetryCandidates when available, falling back to ListStatuses otherwise. Tests validate both code paths with fixed clock.

Monitor notification optimization

Layer / File(s) Summary
Destination-aware event skipping
internal/service/chatbridge/monitor.go, internal/service/chatbridge/monitor_external_test.go
pollSource checks NotificationDestinations() before reading events; when empty, advances cursor to DAG-run head and returns early. New advanceSourceCursorToHead helper queries event service head and updates persisted cursor. Tests verify monitor skips event reads but advances cursor when no destinations, and delivers only future events after destinations are added.

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'fix: reduce scheduler retry and notification scanning' accurately describes the main changes: optimization work that reduces unnecessary scanning operations in the scheduler retry scanner and notification monitor.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description check ✅ Passed The PR description covers all required template sections: clear summary of changes, detailed bullet points of what was implemented, related root cause analysis, comprehensive testing instructions, and checklist items.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix-issue-546-scheduler-memory

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/persis/file/dagrun/retry_candidates.go`:
- Around line 91-94: The code currently swallows read/unmarshal errors from
readRetryCandidateFile(candidatePath) and simply continues, which lets corrupted
retry-candidate sidecars silently block retries; update the error path so that
on failure you remove or quarantine the corrupted sidecar (e.g., delete
candidatePath), log the corruption, and mark the corresponding dagrun/day as
dirty or enqueue a rebuild (call the existing markDayDirty/markDagRunDirty or
rebuild trigger helper used elsewhere) so retries are rescheduled; ensure you
still continue processing after cleanup so the loop remains robust.

In `@internal/service/chatbridge/monitor_external_test.go`:
- Around line 193-200: The Never assertion's short 50*time.Millisecond window in
the require.Never call makes the "old-run" absence check flaky; increase the
timeout (the first duration argument) to a larger value (e.g.,
200-500*time.Millisecond) to allow async delivery paths to settle while keeping
the poll interval (second argument) unchanged, targeting the call that iterates
transport.deliveredNames() and looks for "old-run".
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 10c68639-cf5c-468a-9700-ed148f7be43e

📥 Commits

Reviewing files that changed from the base of the PR and between 109b4f2 and d6ed0d9.

📒 Files selected for processing (8)
  • internal/persis/file/dagrun/attempt.go
  • internal/persis/file/dagrun/retry_candidates.go
  • internal/persis/file/dagrun/retry_candidates_external_test.go
  • internal/service/chatbridge/monitor.go
  • internal/service/chatbridge/monitor_external_test.go
  • internal/service/scheduler/export_test.go
  • internal/service/scheduler/retry_scanner.go
  • internal/service/scheduler/retry_scanner_external_test.go

Comment on lines +91 to +94
candidate, err := readRetryCandidateFile(candidatePath)
if err != nil {
continue
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Corrupted retry-candidate sidecars are silently dropped without self-healing.

At Line 91-Line 94, read/unmarshal failures are ignored with continue. A bad sidecar can then suppress retries for that run indefinitely because this path does not mark the day dirty or trigger rebuild.

Suggested fix
 		candidatePath := filepath.Join(candidateDir, entry.Name())
 		candidate, err := readRetryCandidateFile(candidatePath)
 		if err != nil {
-			continue
+			// self-heal broken sidecars
+			_ = fileutil.Remove(candidatePath)
+			_ = fileutil.WriteFileAtomic(
+				filepath.Join(dayPath, retryCandidateDirtyFileName),
+				[]byte("dirty\n"),
+				0600,
+			)
+			continue
 		}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/persis/file/dagrun/retry_candidates.go` around lines 91 - 94, The
code currently swallows read/unmarshal errors from
readRetryCandidateFile(candidatePath) and simply continues, which lets corrupted
retry-candidate sidecars silently block retries; update the error path so that
on failure you remove or quarantine the corrupted sidecar (e.g., delete
candidatePath), log the corruption, and mark the corresponding dagrun/day as
dirty or enqueue a rebuild (call the existing markDayDirty/markDagRunDirty or
rebuild trigger helper used elsewhere) so retries are rescheduled; ensure you
still continue processing after cleanup so the loop remains robust.

Comment on lines +193 to +200
require.Never(t, func() bool {
for _, name := range transport.deliveredNames() {
if name == "old-run" {
return true
}
}
return false
}, 50*time.Millisecond, 5*time.Millisecond)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Increase the require.Never window to reduce CI timing flakes.

Line 193’s 50*time.Millisecond window is tight for asynchronous polling/delivery paths and can pass/fail nondeterministically on slower runners. Use a longer window so the “old-run is never delivered” assertion is stable.

Suggested change
-	require.Never(t, func() bool {
+	require.Never(t, func() bool {
 		for _, name := range transport.deliveredNames() {
 			if name == "old-run" {
 				return true
 			}
 		}
 		return false
-	}, 50*time.Millisecond, 5*time.Millisecond)
+	}, 300*time.Millisecond, 10*time.Millisecond)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
require.Never(t, func() bool {
for _, name := range transport.deliveredNames() {
if name == "old-run" {
return true
}
}
return false
}, 50*time.Millisecond, 5*time.Millisecond)
require.Never(t, func() bool {
for _, name := range transport.deliveredNames() {
if name == "old-run" {
return true
}
}
return false
}, 300*time.Millisecond, 10*time.Millisecond)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/service/chatbridge/monitor_external_test.go` around lines 193 - 200,
The Never assertion's short 50*time.Millisecond window in the require.Never call
makes the "old-run" absence check flaky; increase the timeout (the first
duration argument) to a larger value (e.g., 200-500*time.Millisecond) to allow
async delivery paths to settle while keeping the poll interval (second argument)
unchanged, targeting the call that iterates transport.deliveredNames() and looks
for "old-run".

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 3 files (changes from recent commits).

Tip: Review your code locally with the cubic CLI to iterate faster.

Re-trigger cubic

Comment thread internal/persis/file/dagrun/retry_candidates.go

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues found across 8 files

Re-trigger cubic

@yohamta0 yohamta0 merged commit 3c5dc6d into main Jun 8, 2026
11 checks passed
@yohamta0 yohamta0 deleted the fix-issue-546-scheduler-memory branch June 8, 2026 10:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant