fix: preserve step working dir on retry#2264
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughRemove logic copying a persisted node WorkingDir into a step's Dir during retry plan creation; tests were added/refactored to ensure DAG step Dir (including variable expressions) is authoritative, stale runtime WorkingDir is cleared, and PlanEnv.WorkingDir resolves correctly. ChangesWorking Directory Handling in Retry Plans
Sequence Diagram(s)sequenceDiagram
participant Client
participant CreateRetryPlan
participant retryStepForNode
participant StepsMap
participant NodeStateStore
Client->>CreateRetryPlan: request retry plan
CreateRetryPlan->>retryStepForNode: get step for node
retryStepForNode->>StepsMap: lookup step by name
StepsMap-->>retryStepForNode: return authoritative Step
retryStepForNode-->>CreateRetryPlan: return Step (no WorkingDir copy)
CreateRetryPlan->>NodeStateStore: clear target.State().WorkingDir, set NodeNotStarted
NodeStateStore-->>CreateRetryPlan: persisted state updated
CreateRetryPlan-->>Client: return retry plan with PlanEnv.WorkingDir determined (new run or restored source)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
internal/runtime/plan_test.go (1)
522-527: ⚡ Quick winStrengthen the test case to validate precedence.
The second test case sets
dagStepDirandstateWorkDirto the same value ("/remote/app"). This doesn't effectively test that the configured DAG stepDirtakes precedence over the persisted nodeWorkingDirwhen they differ.💡 Proposed test case improvement
{ name: "configured step working dir remains configured", dagStepDir: "/remote/app", - stateWorkDir: "/remote/app", + stateWorkDir: "/var/lib/dagu/data/dag-runs/example/work", wantStepDir: "/remote/app", },This change would clearly demonstrate that the DAG-configured
Dirwins even when the node's persistedWorkingDirdiffers.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/runtime/plan_test.go` around lines 522 - 527, The test case named "configured step working dir remains configured" currently sets dagStepDir and stateWorkDir to the same value, so it doesn't prove precedence; change the test fixture so dagStepDir is "/remote/app" and stateWorkDir is a different path (e.g. "/remote/old") and ensure the expectation wantStepDir remains "/remote/app" to verify that the DAG-configured Dir takes precedence over the persisted node WorkingDir (refer to dagStepDir, stateWorkDir, wantStepDir in the test).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@internal/runtime/plan_test.go`:
- Around line 522-527: The test case named "configured step working dir remains
configured" currently sets dagStepDir and stateWorkDir to the same value, so it
doesn't prove precedence; change the test fixture so dagStepDir is "/remote/app"
and stateWorkDir is a different path (e.g. "/remote/old") and ensure the
expectation wantStepDir remains "/remote/app" to verify that the DAG-configured
Dir takes precedence over the persisted node WorkingDir (refer to dagStepDir,
stateWorkDir, wantStepDir in the test).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: c3c0303e-f085-43b0-a85b-3916eb77d5ed
📒 Files selected for processing (2)
internal/runtime/plan.gointernal/runtime/plan_test.go
💤 Files with no reviewable changes (1)
- internal/runtime/plan.go
|
@coderabbitai review |
✅ Action performedReview finished.
|
There was a problem hiding this comment.
1 issue found across 11 files (changes from recent commits).
Tip: Review your code locally with the cubic CLI to iterate faster.
Re-trigger cubic
|
@coderabbitai review |
✅ Action performedReview finished.
|
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
internal/runtime/plan_test.go (1)
589-590: ⚡ Quick winMake the persisted step differ from the restored source step.
Right now Line 590 seeds
node.Step.Dirwith the same${STEP_WORK_DIR}value as the DAG, so this test still passes if retry planning reuses the persisted step snapshot instead of rebinding from the restored DAG definition. Seed the node with the expanded concrete path (or any other mismatched value) so the assertion on Line 603 actually proves the source DAG won.💡 Suggested fix
node := runtime.NodeWithData(runtime.NodeData{ - Step: core.Step{Name: "target", Dir: "${STEP_WORK_DIR}"}, + Step: core.Step{Name: "target", Dir: restoredStepWorkDir}, State: runtime.NodeState{ Status: core.NodeFailed, WorkingDir: "/stale/effective/work/dir", }, })Also applies to: 601-606
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/runtime/plan_test.go` around lines 589 - 590, The persisted node's Step.Dir is being seeded with the same template value ("${STEP_WORK_DIR}") as the DAG, so the test doesn't prove rebinding occurred; change the initialization of the test node created via runtime.NodeWithData(runtime.NodeData{ Step: core.Step{Name: "target", Dir: ... }}) to use a concrete/expanded path (e.g., an absolute or different string) instead of the template so node.Step.Dir differs from the DAG's template; update the other similar seedings in the same test (the block around the node created at the 601-606 area) so the assertion that the restored DAG's value overwrote the persisted snapshot actually verifies rebinding.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@internal/runtime/plan_test.go`:
- Around line 525-529: The tests are brittle on Windows because
TestCreateRetryPlan_PreservesExplicitStepWorkingDirOnly uses a hardcoded
"/remote/app" which filepath.IsAbs treats as non-absolute on Windows; update the
test to use an OS-agnostic absolute path (e.g., build via filepath.Join or
filepath.VolumeName + separator) or compute an absolute tmp dir and set
dagStepDir/env working dir via resolveExpandedDir so env.WorkingDir matches
wantEnvDir across platforms; also strengthen
TestCreateRetryPlan_ExplicitVariableStepDirUsesRestoredSourceDefinition by
changing the persisted node’s Step.Dir to a value different from
"${STEP_WORK_DIR}" so the assertion target.Step().Dir == "${STEP_WORK_DIR}"
actually validates that rebindRetryNodesToSteps replaces it (adjust persisted
node creation used by rebindRetryNodesToSteps accordingly).
---
Nitpick comments:
In `@internal/runtime/plan_test.go`:
- Around line 589-590: The persisted node's Step.Dir is being seeded with the
same template value ("${STEP_WORK_DIR}") as the DAG, so the test doesn't prove
rebinding occurred; change the initialization of the test node created via
runtime.NodeWithData(runtime.NodeData{ Step: core.Step{Name: "target", Dir: ...
}}) to use a concrete/expanded path (e.g., an absolute or different string)
instead of the template so node.Step.Dir differs from the DAG's template; update
the other similar seedings in the same test (the block around the node created
at the 601-606 area) so the assertion that the restored DAG's value overwrote
the persisted snapshot actually verifies rebinding.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 8e1abf3d-5a0c-4035-a88e-d4d591a1ec6f
📒 Files selected for processing (2)
internal/runtime/plan_test.gointernal/service/frontend/api/v1/dagruns_test.go
|
@coderabbitai review |
✅ Action performedReview finished.
|
|
@coderabbitai review |
✅ Action performedReview finished.
|
There was a problem hiding this comment.
♻️ Duplicate comments (2)
internal/runtime/plan_test.go (2)
662-668:⚠️ Potential issue | 🟠 Major | ⚡ Quick winStrengthen the test by using a different persisted node Step.Dir value.
The persisted node's
Step.Diris currently"${STEP_WORK_DIR}", which matches the DAG step's Dir. The assertiontarget.Step().Dir == "${STEP_WORK_DIR}"would pass even ifrebindRetryNodesToStepsfailed to replace the persisted value. Change the persisted node'sStep.Dirto a different value (e.g.,"/stale/dir") so the assertion actually validates that retry planning replaces it with the DAG source definition.🔧 Suggested fix
node := runtime.NodeWithData(runtime.NodeData{ - Step: core.Step{Name: "target", Dir: "${STEP_WORK_DIR}"}, + Step: core.Step{Name: "target", Dir: "/stale/dir"}, State: runtime.NodeState{ Status: core.NodeFailed, WorkingDir: "/stale/effective/work/dir", }, })🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/runtime/plan_test.go` around lines 662 - 668, The test constructs a persisted node using runtime.NodeWithData(runtime.NodeData{...}) where the persisted node Step.Dir is currently set to "${STEP_WORK_DIR}", which masks a failure of rebindRetryNodesToSteps; update the persisted node's Step.Dir (the runtime.NodeData.Step.Dir field) to a different value such as "/stale/dir" so that after calling rebindRetryNodesToSteps the assertion on target.Step().Dir verifies the DAG's Dir was actually applied; change the literal in the test setup and keep the assertion same to validate rebindRetryNodesToSteps behavior.
598-603:⚠️ Potential issue | 🟠 Major | ⚡ Quick winFix Windows portability in the "configured step working dir" test case.
The hardcoded
/remote/apppath breaks on Windows becausefilepath.IsAbs("/remote/app")returns false (no drive letter or UNC prefix), causing the path to be treated as relative. This makesenv.WorkingDirOS-dependent and contradicts the test'swantEnvDirexpectation.🔧 Suggested fix
func TestCreateRetryPlan_PreservesExplicitStepWorkingDirOnly(t *testing.T) { t.Parallel() + configuredStepDir := t.TempDir() + tests := []struct { name string dagStepDir string stateWorkDir string wantStepDir string wantEnvDir string }{ { name: "implicit step working dir ignores persisted effective work dir", stateWorkDir: "/var/lib/dagu/data/dag-runs/example/dag-runs/2026/06/04/dag-run_20260604_134911Z_run/work", wantStepDir: "", wantEnvDir: "fresh-run-work-dir", }, { name: "configured step working dir keeps restored source dir", - dagStepDir: "/remote/app", + dagStepDir: configuredStepDir, stateWorkDir: "/var/lib/dagu/data/dag-runs/example/work", - wantStepDir: "/remote/app", - wantEnvDir: "/remote/app", + wantStepDir: configuredStepDir, + wantEnvDir: configuredStepDir, }, }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/runtime/plan_test.go` around lines 598 - 603, The test uses a hardcoded "/remote/app" which is not considered an absolute path on Windows; update the test case fields dagStepDir, wantStepDir and wantEnvDir to build an OS-absolute path instead of the literal string. Replace the literal "/remote/app" with a platform-safe construction (e.g. create the path via filepath.Join(string(os.PathSeparator), "remote", "app") or use filepath.FromSlash and ensure it is rooted) so filepath.IsAbs returns true on Windows; adjust the values referenced in the test case (dagStepDir, wantStepDir, wantEnvDir) accordingly in internal/runtime/plan_test.go.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@internal/runtime/plan_test.go`:
- Around line 662-668: The test constructs a persisted node using
runtime.NodeWithData(runtime.NodeData{...}) where the persisted node Step.Dir is
currently set to "${STEP_WORK_DIR}", which masks a failure of
rebindRetryNodesToSteps; update the persisted node's Step.Dir (the
runtime.NodeData.Step.Dir field) to a different value such as "/stale/dir" so
that after calling rebindRetryNodesToSteps the assertion on target.Step().Dir
verifies the DAG's Dir was actually applied; change the literal in the test
setup and keep the assertion same to validate rebindRetryNodesToSteps behavior.
- Around line 598-603: The test uses a hardcoded "/remote/app" which is not
considered an absolute path on Windows; update the test case fields dagStepDir,
wantStepDir and wantEnvDir to build an OS-absolute path instead of the literal
string. Replace the literal "/remote/app" with a platform-safe construction
(e.g. create the path via filepath.Join(string(os.PathSeparator), "remote",
"app") or use filepath.FromSlash and ensure it is rooted) so filepath.IsAbs
returns true on Windows; adjust the values referenced in the test case
(dagStepDir, wantStepDir, wantEnvDir) accordingly in
internal/runtime/plan_test.go.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 4d11e064-18c1-4d40-b94f-ed411703cb7c
📒 Files selected for processing (1)
internal/runtime/plan_test.go
Summary
Node.Stepsnapshotsworking_dir, commands, scripts, output paths, executor config, retry policy, and repeat policy come from the restored DAGworking_diras source configuration, while implicit steps use the fresh retry run work directoryRoot cause
Retry planning previously treated persisted runtime node data as authoritative step configuration. The direct symptom was copying every
NodeState.WorkingDirintostep.Dir; for steps without explicitworking_dir, that value is Dagu runtime state for a specific run, not user-authored step config.During retry plan creation the stale runtime path became
step.Dir; thenresolveWorkingDirgavestep.Dirprecedence over the fresh retry run work directory. A retry could therefore try to run in a previous DAG run's temporary work directory.The broader invariant is that retry execution must rebuild each step from the restored DAG source. Persisted node data supplies runtime status, logs, retry counters, and historical execution state only. It must not overwrite source step fields.
This is not SSH-only. The issue report matches the command/default shell path, where
cdfailed against a stale Dagu run work directory. SSH steps can show the same planner bug as a remotecdto a local Dagu path when no stepworking_dirwas configured, but SSH is a symptom path, not the root cause.Fixes #2263
Testing
make lintgo test ./internal/runtime ./internal/cmd ./internal/service/frontend/api/v1go test ./internal/runtime -run 'TestRetryPlan_ReplacesPersistedRuntimeMutatedStepSnapshot|TestRetryPlan_RebindsStepsFromRestoredDAG|TestStepRetryPlan_RebindsStepsFromRestoredDAG|TestCreateRetryPlan_PreservesExplicitStepWorkingDirOnly|TestCreateRetryPlan_ExplicitVariableStepDirUsesRestoredSourceDefinition|TestCreateRetryPlan_CommandStepUsesNewRunWorkDirAfterRetry' -count=1go test ./internal/runtime/agent -run 'TestRetryNodesUseRestoredDAGStepDefinition|TestRetryNodesRejectMissingRestoredSourceStep' -count=1go test ./internal/runtime/transform -count=1go test -race ./internal/intg/distr -run 'TestParallel_MixedLocalAndDistributed' -count=1go test -race ./internal/intg/distr -run 'TestParallel_MixedLocalAndDistributed' -count=3go test -race ./internal/intg/distr -count=140e2cab546c6f82e6597bf5432dcdbffede0b387: passed (run 27012347392)