Skip to content

fix: preserve step working dir on retry#2264

Merged
yohamta0 merged 10 commits into
mainfrom
fix/retry-ssh-working-dir
Jun 5, 2026
Merged

fix: preserve step working dir on retry#2264
yohamta0 merged 10 commits into
mainfrom
fix/retry-ssh-working-dir

Conversation

@yohamta0

@yohamta0 yohamta0 commented Jun 5, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • rebuild retry nodes from restored DAG step definitions instead of trusting persisted Node.Step snapshots
  • keep persisted node data limited to runtime state and history, while source step fields such as working_dir, commands, scripts, output paths, executor config, retry policy, and repeat policy come from the restored DAG
  • keep explicitly configured step working_dir as source configuration, while implicit steps use the fresh retry run work directory
  • remove the persisted working-directory provenance snapshot approach to avoid growing DAG run status data
  • add regression coverage for the command/default retry path reported in bug: DAG run fails on retry. #2263, stale persisted retry state, and explicit step working dirs

Root cause

Retry planning previously treated persisted runtime node data as authoritative step configuration. The direct symptom was copying every NodeState.WorkingDir into step.Dir; for steps without explicit working_dir, that value is Dagu runtime state for a specific run, not user-authored step config.

During retry plan creation the stale runtime path became step.Dir; then resolveWorkingDir gave step.Dir precedence over the fresh retry run work directory. A retry could therefore try to run in a previous DAG run's temporary work directory.

The broader invariant is that retry execution must rebuild each step from the restored DAG source. Persisted node data supplies runtime status, logs, retry counters, and historical execution state only. It must not overwrite source step fields.

This is not SSH-only. The issue report matches the command/default shell path, where cd failed against a stale Dagu run work directory. SSH steps can show the same planner bug as a remote cd to a local Dagu path when no step working_dir was configured, but SSH is a symptom path, not the root cause.

Fixes #2263

Testing

  • make lint
  • go test ./internal/runtime ./internal/cmd ./internal/service/frontend/api/v1
  • go test ./internal/runtime -run 'TestRetryPlan_ReplacesPersistedRuntimeMutatedStepSnapshot|TestRetryPlan_RebindsStepsFromRestoredDAG|TestStepRetryPlan_RebindsStepsFromRestoredDAG|TestCreateRetryPlan_PreservesExplicitStepWorkingDirOnly|TestCreateRetryPlan_ExplicitVariableStepDirUsesRestoredSourceDefinition|TestCreateRetryPlan_CommandStepUsesNewRunWorkDirAfterRetry' -count=1
  • go test ./internal/runtime/agent -run 'TestRetryNodesUseRestoredDAGStepDefinition|TestRetryNodesRejectMissingRestoredSourceStep' -count=1
  • go test ./internal/runtime/transform -count=1
  • go test -race ./internal/intg/distr -run 'TestParallel_MixedLocalAndDistributed' -count=1
  • go test -race ./internal/intg/distr -run 'TestParallel_MixedLocalAndDistributed' -count=3
  • go test -race ./internal/intg/distr -count=1
  • GitHub Actions CI on head 40e2cab546c6f82e6597bf5432dcdbffede0b387: passed (run 27012347392)

@coderabbitai

coderabbitai Bot commented Jun 5, 2026

Copy link
Copy Markdown

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Remove logic copying a persisted node WorkingDir into a step's Dir during retry plan creation; tests were added/refactored to ensure DAG step Dir (including variable expressions) is authoritative, stale runtime WorkingDir is cleared, and PlanEnv.WorkingDir resolves correctly.

Changes

Working Directory Handling in Retry Plans

Layer / File(s) Summary
Step directory handling in retry creation
internal/runtime/plan.go
retryStepForNode no longer copies node persisted WorkingDir to step.Dir; it returns the authoritative step definition unchanged and still errors when the step is missing.
Frontend dagruns retry test updates
internal/service/frontend/api/v1/dagruns_test.go
TestRetryDAGRunStartsLocalRetrySubprocess updated to use retry.marker in DAG_RUN_WORK_DIR, validate subprocess CWD, record original node WorkingDir, overwrite persisted attempt with a stale work dir, trigger retry, and assert the retried node's WorkingDir is restored to the original.
Runtime retry-plan tests
internal/runtime/plan_test.go
Added TestRetryPlan_ReplacesPersistedRuntimeMutatedStepSnapshot and table-driven tests: TestCreateRetryPlan_PreservesExplicitStepWorkingDirOnly, TestCreateRetryPlan_ExplicitVariableStepDirUsesRestoredSourceDefinition, and TestCreateRetryPlan_CommandStepUsesNewRunWorkDirAfterRetry to validate explicit Dir preservation, ${...} expression handling, clearing of stale WorkingDir, and fresh-run work dir behavior.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant CreateRetryPlan
  participant retryStepForNode
  participant StepsMap
  participant NodeStateStore
  Client->>CreateRetryPlan: request retry plan
  CreateRetryPlan->>retryStepForNode: get step for node
  retryStepForNode->>StepsMap: lookup step by name
  StepsMap-->>retryStepForNode: return authoritative Step
  retryStepForNode-->>CreateRetryPlan: return Step (no WorkingDir copy)
  CreateRetryPlan->>NodeStateStore: clear target.State().WorkingDir, set NodeNotStarted
  NodeStateStore-->>CreateRetryPlan: persisted state updated
  CreateRetryPlan-->>Client: return retry plan with PlanEnv.WorkingDir determined (new run or restored source)
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • dagucloud/dagu#2111: Also adjusts retry working-directory handling; that PR added a WorkingDir-driven override while this PR removes the override so only configured DAG step Dir is preserved.
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'fix: preserve step working dir on retry' accurately summarizes the main change: modifying retry planning to preserve explicitly configured step working directories instead of copying persisted runtime paths.
Linked Issues check ✅ Passed The PR successfully addresses issue #2263 by preventing persisted per-run working directories from being incorrectly promoted into step configuration during retry planning, ensuring retries use fresh work directories instead of stale paths.
Out of Scope Changes check ✅ Passed All changes are directly related to fixing retry working directory handling: modifications to plan.go for retry logic, comprehensive test coverage additions in plan_test.go and dagruns_test.go, with no unrelated changes present.
Description check ✅ Passed The pull request description is comprehensive and well-structured, covering all required template sections with detailed information about the changes and their rationale.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/retry-ssh-working-dir

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
internal/runtime/plan_test.go (1)

522-527: ⚡ Quick win

Strengthen the test case to validate precedence.

The second test case sets dagStepDir and stateWorkDir to the same value ("/remote/app"). This doesn't effectively test that the configured DAG step Dir takes precedence over the persisted node WorkingDir when they differ.

💡 Proposed test case improvement
 		{
 			name:         "configured step working dir remains configured",
 			dagStepDir:   "/remote/app",
-			stateWorkDir: "/remote/app",
+			stateWorkDir: "/var/lib/dagu/data/dag-runs/example/work",
 			wantStepDir:  "/remote/app",
 		},

This change would clearly demonstrate that the DAG-configured Dir wins even when the node's persisted WorkingDir differs.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/runtime/plan_test.go` around lines 522 - 527, The test case named
"configured step working dir remains configured" currently sets dagStepDir and
stateWorkDir to the same value, so it doesn't prove precedence; change the test
fixture so dagStepDir is "/remote/app" and stateWorkDir is a different path
(e.g. "/remote/old") and ensure the expectation wantStepDir remains
"/remote/app" to verify that the DAG-configured Dir takes precedence over the
persisted node WorkingDir (refer to dagStepDir, stateWorkDir, wantStepDir in the
test).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@internal/runtime/plan_test.go`:
- Around line 522-527: The test case named "configured step working dir remains
configured" currently sets dagStepDir and stateWorkDir to the same value, so it
doesn't prove precedence; change the test fixture so dagStepDir is "/remote/app"
and stateWorkDir is a different path (e.g. "/remote/old") and ensure the
expectation wantStepDir remains "/remote/app" to verify that the DAG-configured
Dir takes precedence over the persisted node WorkingDir (refer to dagStepDir,
stateWorkDir, wantStepDir in the test).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c3c0303e-f085-43b0-a85b-3916eb77d5ed

📥 Commits

Reviewing files that changed from the base of the PR and between 24d79c8 and 8c4d260.

📒 Files selected for processing (2)
  • internal/runtime/plan.go
  • internal/runtime/plan_test.go
💤 Files with no reviewable changes (1)
  • internal/runtime/plan.go

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues found across 2 files

Re-trigger cubic

yohamta0 commented Jun 5, 2026

Copy link
Copy Markdown
Collaborator Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 5, 2026

Copy link
Copy Markdown
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@cubic-dev-ai cubic-dev-ai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 11 files (changes from recent commits).

Tip: Review your code locally with the cubic CLI to iterate faster.

Re-trigger cubic

Comment thread internal/runtime/plan.go Outdated
@yohamta0

yohamta0 commented Jun 5, 2026

Copy link
Copy Markdown
Collaborator Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 5, 2026

Copy link
Copy Markdown
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
internal/runtime/plan_test.go (1)

589-590: ⚡ Quick win

Make the persisted step differ from the restored source step.

Right now Line 590 seeds node.Step.Dir with the same ${STEP_WORK_DIR} value as the DAG, so this test still passes if retry planning reuses the persisted step snapshot instead of rebinding from the restored DAG definition. Seed the node with the expanded concrete path (or any other mismatched value) so the assertion on Line 603 actually proves the source DAG won.

💡 Suggested fix
 	node := runtime.NodeWithData(runtime.NodeData{
-		Step: core.Step{Name: "target", Dir: "${STEP_WORK_DIR}"},
+		Step: core.Step{Name: "target", Dir: restoredStepWorkDir},
 		State: runtime.NodeState{
 			Status:     core.NodeFailed,
 			WorkingDir: "/stale/effective/work/dir",
 		},
 	})

Also applies to: 601-606

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/runtime/plan_test.go` around lines 589 - 590, The persisted node's
Step.Dir is being seeded with the same template value ("${STEP_WORK_DIR}") as
the DAG, so the test doesn't prove rebinding occurred; change the initialization
of the test node created via runtime.NodeWithData(runtime.NodeData{ Step:
core.Step{Name: "target", Dir: ... }}) to use a concrete/expanded path (e.g., an
absolute or different string) instead of the template so node.Step.Dir differs
from the DAG's template; update the other similar seedings in the same test (the
block around the node created at the 601-606 area) so the assertion that the
restored DAG's value overwrote the persisted snapshot actually verifies
rebinding.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@internal/runtime/plan_test.go`:
- Around line 525-529: The tests are brittle on Windows because
TestCreateRetryPlan_PreservesExplicitStepWorkingDirOnly uses a hardcoded
"/remote/app" which filepath.IsAbs treats as non-absolute on Windows; update the
test to use an OS-agnostic absolute path (e.g., build via filepath.Join or
filepath.VolumeName + separator) or compute an absolute tmp dir and set
dagStepDir/env working dir via resolveExpandedDir so env.WorkingDir matches
wantEnvDir across platforms; also strengthen
TestCreateRetryPlan_ExplicitVariableStepDirUsesRestoredSourceDefinition by
changing the persisted node’s Step.Dir to a value different from
"${STEP_WORK_DIR}" so the assertion target.Step().Dir == "${STEP_WORK_DIR}"
actually validates that rebindRetryNodesToSteps replaces it (adjust persisted
node creation used by rebindRetryNodesToSteps accordingly).

---

Nitpick comments:
In `@internal/runtime/plan_test.go`:
- Around line 589-590: The persisted node's Step.Dir is being seeded with the
same template value ("${STEP_WORK_DIR}") as the DAG, so the test doesn't prove
rebinding occurred; change the initialization of the test node created via
runtime.NodeWithData(runtime.NodeData{ Step: core.Step{Name: "target", Dir: ...
}}) to use a concrete/expanded path (e.g., an absolute or different string)
instead of the template so node.Step.Dir differs from the DAG's template; update
the other similar seedings in the same test (the block around the node created
at the 601-606 area) so the assertion that the restored DAG's value overwrote
the persisted snapshot actually verifies rebinding.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 8e1abf3d-5a0c-4035-a88e-d4d591a1ec6f

📥 Commits

Reviewing files that changed from the base of the PR and between 30f1478 and 9da7fad.

📒 Files selected for processing (2)
  • internal/runtime/plan_test.go
  • internal/service/frontend/api/v1/dagruns_test.go

Comment thread internal/runtime/plan_test.go Outdated
@yohamta0

yohamta0 commented Jun 5, 2026

Copy link
Copy Markdown
Collaborator Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 5, 2026

Copy link
Copy Markdown
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@yohamta0

yohamta0 commented Jun 5, 2026

Copy link
Copy Markdown
Collaborator Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 5, 2026

Copy link
Copy Markdown
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (2)
internal/runtime/plan_test.go (2)

662-668: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Strengthen the test by using a different persisted node Step.Dir value.

The persisted node's Step.Dir is currently "${STEP_WORK_DIR}", which matches the DAG step's Dir. The assertion target.Step().Dir == "${STEP_WORK_DIR}" would pass even if rebindRetryNodesToSteps failed to replace the persisted value. Change the persisted node's Step.Dir to a different value (e.g., "/stale/dir") so the assertion actually validates that retry planning replaces it with the DAG source definition.

🔧 Suggested fix
 	node := runtime.NodeWithData(runtime.NodeData{
-		Step: core.Step{Name: "target", Dir: "${STEP_WORK_DIR}"},
+		Step: core.Step{Name: "target", Dir: "/stale/dir"},
 		State: runtime.NodeState{
 			Status:     core.NodeFailed,
 			WorkingDir: "/stale/effective/work/dir",
 		},
 	})
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/runtime/plan_test.go` around lines 662 - 668, The test constructs a
persisted node using runtime.NodeWithData(runtime.NodeData{...}) where the
persisted node Step.Dir is currently set to "${STEP_WORK_DIR}", which masks a
failure of rebindRetryNodesToSteps; update the persisted node's Step.Dir (the
runtime.NodeData.Step.Dir field) to a different value such as "/stale/dir" so
that after calling rebindRetryNodesToSteps the assertion on target.Step().Dir
verifies the DAG's Dir was actually applied; change the literal in the test
setup and keep the assertion same to validate rebindRetryNodesToSteps behavior.

598-603: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fix Windows portability in the "configured step working dir" test case.

The hardcoded /remote/app path breaks on Windows because filepath.IsAbs("/remote/app") returns false (no drive letter or UNC prefix), causing the path to be treated as relative. This makes env.WorkingDir OS-dependent and contradicts the test's wantEnvDir expectation.

🔧 Suggested fix
 func TestCreateRetryPlan_PreservesExplicitStepWorkingDirOnly(t *testing.T) {
 	t.Parallel()
 
+	configuredStepDir := t.TempDir()
+
 	tests := []struct {
 		name         string
 		dagStepDir   string
 		stateWorkDir string
 		wantStepDir  string
 		wantEnvDir   string
 	}{
 		{
 			name:         "implicit step working dir ignores persisted effective work dir",
 			stateWorkDir: "/var/lib/dagu/data/dag-runs/example/dag-runs/2026/06/04/dag-run_20260604_134911Z_run/work",
 			wantStepDir:  "",
 			wantEnvDir:   "fresh-run-work-dir",
 		},
 		{
 			name:         "configured step working dir keeps restored source dir",
-			dagStepDir:   "/remote/app",
+			dagStepDir:   configuredStepDir,
 			stateWorkDir: "/var/lib/dagu/data/dag-runs/example/work",
-			wantStepDir:  "/remote/app",
-			wantEnvDir:   "/remote/app",
+			wantStepDir:  configuredStepDir,
+			wantEnvDir:   configuredStepDir,
 		},
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/runtime/plan_test.go` around lines 598 - 603, The test uses a
hardcoded "/remote/app" which is not considered an absolute path on Windows;
update the test case fields dagStepDir, wantStepDir and wantEnvDir to build an
OS-absolute path instead of the literal string. Replace the literal
"/remote/app" with a platform-safe construction (e.g. create the path via
filepath.Join(string(os.PathSeparator), "remote", "app") or use
filepath.FromSlash and ensure it is rooted) so filepath.IsAbs returns true on
Windows; adjust the values referenced in the test case (dagStepDir, wantStepDir,
wantEnvDir) accordingly in internal/runtime/plan_test.go.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@internal/runtime/plan_test.go`:
- Around line 662-668: The test constructs a persisted node using
runtime.NodeWithData(runtime.NodeData{...}) where the persisted node Step.Dir is
currently set to "${STEP_WORK_DIR}", which masks a failure of
rebindRetryNodesToSteps; update the persisted node's Step.Dir (the
runtime.NodeData.Step.Dir field) to a different value such as "/stale/dir" so
that after calling rebindRetryNodesToSteps the assertion on target.Step().Dir
verifies the DAG's Dir was actually applied; change the literal in the test
setup and keep the assertion same to validate rebindRetryNodesToSteps behavior.
- Around line 598-603: The test uses a hardcoded "/remote/app" which is not
considered an absolute path on Windows; update the test case fields dagStepDir,
wantStepDir and wantEnvDir to build an OS-absolute path instead of the literal
string. Replace the literal "/remote/app" with a platform-safe construction
(e.g. create the path via filepath.Join(string(os.PathSeparator), "remote",
"app") or use filepath.FromSlash and ensure it is rooted) so filepath.IsAbs
returns true on Windows; adjust the values referenced in the test case
(dagStepDir, wantStepDir, wantEnvDir) accordingly in
internal/runtime/plan_test.go.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 4d11e064-18c1-4d40-b94f-ed411703cb7c

📥 Commits

Reviewing files that changed from the base of the PR and between 9da7fad and 0967829.

📒 Files selected for processing (1)
  • internal/runtime/plan_test.go

@yohamta0 yohamta0 merged commit 3bd56d5 into main Jun 5, 2026
11 checks passed
@yohamta0 yohamta0 deleted the fix/retry-ssh-working-dir branch June 5, 2026 11:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: DAG run fails on retry.

1 participant