Skip to content

fix: bound scheduler latest status lookup#2278

Merged
yohamta0 merged 7 commits into
mainfrom
fix-issue-546-scheduler-history-scan
Jun 10, 2026
Merged

fix: bound scheduler latest status lookup#2278
yohamta0 merged 7 commits into
mainfrom
fix-issue-546-scheduler-history-scan

Conversation

@yohamta0

@yohamta0 yohamta0 commented Jun 10, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • add a validated latest-attempt pointer for file-backed DAG run history so scheduler status checks avoid rescanning growing run directories
  • reuse evaluated empty DAG env data during subprocess dispatch to avoid unnecessary YAML reloads
  • add regression coverage and a benchmark for latest-attempt lookup with a large run directory

Root cause

The scheduler asks for latest DAG status while planning scheduled runs. On the file backend, latest-attempt lookup could fall back to walking and sorting the current day run directory, so frequent schedules accumulated file traversal, I/O, and cache pressure as data/dag-runs grew.

Validation

  • core, file persistence, runtime, and scheduler Go tests passed locally
  • latest-attempt benchmark passed locally
  • binary build passed locally

Fixes #546


Summary by cubic

Bound the scheduler’s latest-status lookup on the file backend by persisting a validated pointer to the latest attempt, avoiding rescans of large run directories. Also stabilized shell command substitution and scheduler timing tests, skipped no-op status rewrites, reused evaluated empty DAG envs, and hardened notification monitor assertions. Fixes #546.

  • Bug Fixes
    • Persist and validate a .dagrun.latest pointer in dagrun; update it on attempt writes; respect context cancellation.
    • Store reads the pointer first; falls back to scan and refreshes the pointer; honors the “today” cutoff.
    • Stabilized eval substitution: parse configured shell args, preserve resolved shell paths (incl. spaces), add PowerShell -NoProfile/-NonInteractive, avoid duplicate -Command.
    • Reuse evaluated empty source envs in ResolveEnvWithWarnings; minor env slice preallocation.
    • Skip no-op status compaction to avoid replacing unchanged status files.
    • Stabilize chatbridge notification monitor tests by asserting on head cursor offset to avoid flaky reads.
    • Stabilize scheduler catchup watermark integration timing to avoid minute-boundary flakiness.

Written for commit 10c9254. Summary will update on new commits.

Review in cubic

Summary by CodeRabbit

  • New Features

    • Persistent latest-attempt pointer for DAG runs to speed and stabilize latest-run resolution.
  • Bug Fixes

    • Smarter environment reuse to avoid re-evaluating already-evaluated environments.
    • More robust DAG-run status writes and compaction with safer pointer updates and fallbacks.
  • Tests

    • Expanded tests and benchmarks covering latest-attempt pointer behavior, environment reuse, shell invocation handling, and run-status persistence.

@coderabbitai

coderabbitai Bot commented Jun 10, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 379935e7-f0d6-4d40-8241-16c0f0822027

📥 Commits

Reviewing files that changed from the base of the PR and between 0ce603e and 10c9254.

📒 Files selected for processing (16)
  • internal/cmn/eval/export_test.go
  • internal/cmn/eval/options_test.go
  • internal/cmn/eval/substitute.go
  • internal/cmn/eval/substitute_external_test.go
  • internal/cmn/eval/substitute_unix_test.go
  • internal/core/spec/dag.go
  • internal/core/spec/runtime_env.go
  • internal/core/spec/runtime_env_external_test.go
  • internal/intg/queue/queue_test.go
  • internal/persis/file/dagrun/attempt.go
  • internal/persis/file/dagrun/attempt_external_test.go
  • internal/persis/file/dagrun/export_test.go
  • internal/persis/file/dagrun/latest_attempt.go
  • internal/persis/file/dagrun/latest_attempt_external_test.go
  • internal/persis/file/dagrun/store.go
  • internal/service/chatbridge/monitor_external_test.go

📝 Walkthrough

Walkthrough

Adds a durable .dagrun.latest pointer with update/retrieval/locking and store integration, refactors shell command construction and exposes test helpers, preallocates env slices and relaxes reuse checks for evaluated empty envs, and updates related tests and benchmarks.

Changes

Environment Evaluation Optimization

Layer / File(s) Summary
Environment caching and reuse logic
internal/core/spec/dag.go, internal/core/spec/runtime_env.go, internal/core/spec/runtime_env_external_test.go
buildEnvs preallocates slice capacity by variable count; canReuseCurrentEnv refines reuse conditions to permit already-evaluated empty environments; test verifies reuse of evaluated empty source env.

Shell-command construction refactor and tests

Layer / File(s) Summary
Shell command implementation
internal/cmn/eval/substitute.go
Split configured shell into executable+args and use helpers to append appropriate flags for PowerShell (-NoProfile, -NonInteractive, -Command placement), cmd, and other shells.
Test helpers and external tests
internal/cmn/eval/export_test.go, internal/cmn/eval/substitute_external_test.go, internal/cmn/eval/substitute_unix_test.go, internal/cmn/eval/options_test.go
Expose BuildShellCommandForTest for external tests; add tests validating PowerShell stable invocation, configured-arg parsing, de-duplication of flags, handling of paths-with-spaces, and Windows DefaultShell test path.

Latest Attempt Pointer Persistence

Layer / File(s) Summary
Pointer persistence infrastructure
internal/persis/file/dagrun/latest_attempt.go
Introduce .dagrun.latest JSON pointer file and parsing/validation of referenced status-file paths; implement pointer JSON shape and path-segment validation.
Pointer update and retrieval logic
internal/persis/file/dagrun/latest_attempt.go
Implement updateLatestAttemptPointer with directory locking, ordering comparison to avoid regressing the pointer, and atomic write; implement DataRoot.latestAttemptFromPointer to load and validate pointer targets and return attempts or a sentinel invalid-pointer error.
Integration with attempts and compaction
internal/persis/file/dagrun/attempt.go, internal/persis/file/dagrun/export_test.go
Call updateLatestAttemptPointer from Attempt.Write (log warning on pointer-update failure); refactor compaction to use statusForCompactionLocked which scans for invalid/duplicate lines and decides whether to rewrite. Test-export helpers expose pointer update/path helpers for tests.
Store lookup preferring pointer
internal/persis/file/dagrun/store.go
LatestAttempt now prefers pointer-based lookup (today-scoped and full), falls back to scanning on pointer failure, and refreshes the pointer after successful fallback reads.
Tests and benchmarks
internal/persis/file/dagrun/latest_attempt_external_test.go, internal/persis/file/dagrun/attempt_external_test.go
Tests verifying LatestAttempt resolves via persisted pointer even with later statusless run dirs, pointer update respects canceled contexts, benchmark exercising repeated lookups with thousands of statusless dirs, and attempt close behavior test ensuring single status file retention.

Scheduler catch-up test tweak

Layer / File(s) Summary
Catch-up time and fixture
internal/intg/queue/queue_test.go
Replace stableCurrentMinute helper with time.Now().UTC().Truncate(time.Minute) and increase the catchup_window fixture from "2m" to "10m".

Chatbridge monitor tests

Layer / File(s) Summary
Monitor store offset tracking and assertions
internal/service/chatbridge/monitor_external_test.go
Add lastHeadOffset to in-memory monitorEventStore, record it in NotificationHeadCursor, expose lastHead(); update tests to assert cursor advancement via store.lastHead().

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • dagucloud/dagu#2269: Modifies canReuseCurrentEnv and env-reuse logic in runtime_env.go directly related to this PR's env optimization changes.
  • dagucloud/dagu#2274: Touches Attempt.Write at the same post-status checkpoint; changes are related to pointer/update and retry-candidate handling.
  • dagucloud/dagu#2233: Related to env resolution and ResolveEnvWithWarnings behavior used by the new env reuse test.
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 17.24% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely summarizes the main change: bounding the scheduler's latest status lookup, which is the core objective of this PR.
Description check ✅ Passed The description includes a comprehensive summary, detailed changes, root cause analysis, validation details, and linked issue reference, substantially exceeding template requirements.
Linked Issues check ✅ Passed The PR directly addresses issue #546 by persisting a latest-attempt pointer to avoid rescanning large run directories, reusing evaluated empty DAG envs, and including regression tests, all targeting the reported memory leak caused by frequent scheduler runs.
Out of Scope Changes check ✅ Passed All changes are clearly scoped to address issue #546: pointer persistence/validation, env reuse, shell command stabilization for subprocess dispatch, test improvements, and benchmark additions—no extraneous modifications detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix-issue-546-scheduler-history-scan

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/persis/file/dagrun/latest_attempt.go`:
- Around line 36-78: updateLatestAttemptPointer ignores its ctx; add
cancellation checks so the function stops and releases the lock promptly if the
context is canceled. After successfully acquiring lock (after lock.TryLock
returns nil and before calling latestAttemptInfoFromPointer) check
ctx.Err()/select on ctx.Done and if canceled unlock (use lock.Unlock) and return
ctx.Err(); also check context again immediately before performing the file write
(before json.Marshal or at least before fileutil.WriteFileAtomic) and if
canceled unlock and return ctx.Err(); ensure the existing unlock boolean
semantics (unlock variable and deferred unlock) are preserved so the lock is
always released when returning on context cancellation; touch symbols:
updateLatestAttemptPointer, lock.TryLock, latestAttemptInfoFromPointer,
fileutil.WriteFileAtomic, latestAttemptPointerPath, and candidate/ptr to locate
changes.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 426c071d-cab0-45f0-be34-2c6aaa559033

📥 Commits

Reviewing files that changed from the base of the PR and between c8b0a7e and 09dab00.

📒 Files selected for processing (7)
  • internal/core/spec/dag.go
  • internal/core/spec/runtime_env.go
  • internal/core/spec/runtime_env_external_test.go
  • internal/persis/file/dagrun/attempt.go
  • internal/persis/file/dagrun/latest_attempt.go
  • internal/persis/file/dagrun/latest_attempt_external_test.go
  • internal/persis/file/dagrun/store.go

Comment thread internal/persis/file/dagrun/latest_attempt.go Outdated

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues found across 7 files

Re-trigger cubic

@yohamta0

Copy link
Copy Markdown
Collaborator Author

@codex full review

@yohamta0

Copy link
Copy Markdown
Collaborator Author

@coderabbitai full review

@yohamta0 yohamta0 closed this Jun 10, 2026
@yohamta0 yohamta0 reopened this Jun 10, 2026
@coderabbitai

coderabbitai Bot commented Jun 10, 2026

Copy link
Copy Markdown
✅ Action performed

Full review finished.

@yohamta0 yohamta0 merged commit 87d860b into main Jun 10, 2026
20 checks passed
@yohamta0 yohamta0 deleted the fix-issue-546-scheduler-history-scan branch June 10, 2026 12:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

memory leak - high memory usage

1 participant