Skip to content

fix(workflows): preserve completed node state across DAG multi-resume cycles#1530

Merged
Wirasm merged 2 commits into
coleam00:devfrom
kagura-agent:fix/dag-resume-multi-resume-state-loss
May 18, 2026
Merged

fix(workflows): preserve completed node state across DAG multi-resume cycles#1530
Wirasm merged 2 commits into
coleam00:devfrom
kagura-agent:fix/dag-resume-multi-resume-state-loss

Conversation

@kagura-agent

@kagura-agent kagura-agent commented May 2, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Problem: DAG workflows lose completed node state on the second resume, causing previously successful nodes to be re-executed
  • Why it matters: Multi-resume workflows waste resources re-running completed work and may produce inconsistent results
  • What changed: Query now includes node_skipped_prior_success events; skip events now preserve original output for $nodeId.output substitution; failure messages name specific failed nodes
  • What did not change (scope boundary): DB schema unchanged, write path to events table unchanged, existing node_completed event format unchanged

UX Journey

Before

  User                     DAG Executor                    Event Store
  ────                     ────────────                    ───────────
  starts workflow ───────▶ executes nodes A, B, C
                           A completes ──────────────────▶ node_completed(A)
                           B fails
  resumes (1st) ─────────▶ queries node_completed
                           finds A ◀─────────────────────  node_completed(A)
                           skips A ──────────────────────▶ node_skipped(A) {reason only}
                           re-runs B, C
  resumes (2nd) ─────────▶ queries node_completed
                           ❌ misses A ◀────────────────── node_skipped not queried
                           re-executes A (WRONG)

After

  User                     DAG Executor                    Event Store
  ────                     ────────────                    ───────────
  starts workflow ───────▶ executes nodes A, B, C
                           A completes ──────────────────▶ node_completed(A)
                           B fails
  resumes (1st) ─────────▶ queries completed + skipped
                           finds A ◀─────────────────────  node_completed(A)
                           skips A ──────────────────────▶ node_skipped(A) {output preserved}
                           re-runs B, C
  resumes (2nd) ─────────▶ queries completed + skipped
                           ✅ finds A ◀─────────────────── node_skipped(A) included
                           skips A correctly

Architecture Diagram

Before

  dag-executor.ts                    workflow-events.ts (DB)
  ┌──────────────────┐               ┌─────────────────────────┐
  │ executeDagWorkflow│               │ getCompletedDagNodeOuts │
  │  skip event:     │──────────────▶│ WHERE type =            │
  │  {reason only}   │               │   node_completed        │
  └──────────────────┘               └─────────────────────────┘

After

  dag-executor.ts [~]                workflow-events.ts (DB) [~]
  ┌──────────────────┐               ┌─────────────────────────┐
  │ executeDagWorkflow│               │ getCompletedDagNodeOuts │
  │  skip event:     │──────────────▶│ WHERE type IN (         │
  │  {reason+output} │               │   node_completed,       │
  │  better fail msgs│               │   node_skipped_prior)   │
  └──────────────────┘               └─────────────────────────┘

Connection inventory:

From To Status Notes
dag-executor.ts workflow-events.ts modified Query now includes skip events; skip event data now includes output
dag-executor.ts platform.sendMessage modified Failure messages now name specific failed nodes

Label Snapshot

  • Risk: risk: low
  • Size: size: S
  • Scope: core, workflows
  • Module: core:workflow-events, workflows:dag-executor

Change Metadata

  • Change type: bug
  • Primary scope: multi

Linked Issue

Validation Evidence (required)

# All 18 workflow-events tests pass (including new multi-resume test)
bun run test -- packages/core/src/db/workflow-events.test.ts  # ✅ 18 pass
# All 201 dag-executor tests pass
bun run test -- packages/workflows/src/dag-executor.test.ts   # ✅ 201 pass
bun run type-check  # ✅ clean across all packages
# lint-staged (eslint + prettier) passed on commit
  • Evidence provided: test results, type-check clean
  • If any command is intentionally skipped, explain why: Full bun run validate not run locally; CI covers remaining checks

Security Impact (required)

  • New permissions/capabilities? No
  • New external network calls? No
  • Secrets/tokens handling changed? No
  • File system access scope changed? No

Compatibility / Migration

  • Backward compatible? Yes — additive query change (reads more event types); new node_output field in skip event data is backward-compatible since existing consumers only check reason
  • Config/env changes? No
  • Database migration needed? No

Human Verification (required)

  • Verified scenarios: Multi-resume cycle preserves completed node outputs; $nodeId.output substitution works across resumes; failure messages name specific failed nodes
  • Edge cases checked: All nodes fail, single node fails with downstream skips, prior skip events without node_output field (backward compat)
  • What was not verified: Production multi-resume with complex DAG topologies

Side Effects / Blast Radius (required)

  • Affected subsystems/workflows: getCompletedDagNodeOutputs() returns more results (includes skip events); node_skipped_prior_success data payload gains node_output field; error message format changes for !anyCompleted path
  • Potential unintended effects: Dashboards or monitors parsing the old "no successful nodes" string would need updating
  • Guardrails/monitoring for early detection: Existing 201-test dag-executor suite validates behavior

Rollback Plan (required)

  • Fast rollback command/path: Revert the commit. Database schema is unchanged — only query filter and event data payload differ.
  • Feature flags or config toggles: None
  • Observable failure symptoms: Previously completed nodes being re-executed on second resume; generic failure messages instead of specific node names

Risks and Mitigations

  • Risk: Old node_skipped_prior_success events (without node_output) return undefined output
    • Mitigation: Code uses priorCompletedNodes.get(node.id) ?? "" with fallback to empty string; existing consumers only check reason field

Summary by CodeRabbit

  • Bug Fixes

    • Improved workflow failure messaging to display specific failing node IDs instead of generic error text.
    • Enhanced skipped-node handling to preserve output values from previously completed workflow runs.
  • Tests

    • Added comprehensive test coverage for multi-resume scenarios, node skipping edge cases, and failure message accuracy.

Review Change Stack

… cycles (coleam00#1520)

Bug: getCompletedDagNodeOutputs only queried 'node_completed' events, but
resumed runs emit 'node_skipped_prior_success' for already-completed nodes.
On a second resume, previously completed nodes were re-executed because
their skip events were invisible to the query.

Fix:
- Query both 'node_completed' and 'node_skipped_prior_success' event types
  in getCompletedDagNodeOutputs so multi-resume correctly identifies all
  previously completed nodes
- Store original node_output in node_skipped_prior_success event data so
  the output is available for $nodeId.output substitution on next resume
- Improve error messaging when all nodes fail: name the specific failed
  nodes and count skipped downstream nodes instead of the generic
  'no successful nodes' message

Closes coleam00#1520
@coderabbitai

coderabbitai Bot commented May 2, 2026

Copy link
Copy Markdown

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7186b360-2d3d-4333-900d-296de5f2f0f3

📥 Commits

Reviewing files that changed from the base of the PR and between 4996847 and b1c94a1.

📒 Files selected for processing (2)
  • packages/core/src/db/workflow-events.test.ts
  • packages/workflows/src/dag-executor.test.ts
✅ Files skipped from review due to trivial changes (1)
  • packages/workflows/src/dag-executor.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • packages/core/src/db/workflow-events.test.ts

📝 Walkthrough

Walkthrough

This PR fixes the multi-resume DAG workflow bug by enabling getCompletedDagNodeOutputs to query both node_completed and node_skipped_prior_success events. The executor now stores node_output in skip-prior-success events and improves failure messages to include specific failed node IDs instead of generic "no successful nodes" text.

Changes

Multi-Resume State Preservation

Layer / File(s) Summary
Event Data Shape
packages/workflows/src/dag-executor.ts
When skipping nodes due to prior success, node_skipped_prior_success events now include node_output field (using prior run's output) alongside the reason.
Database Query
packages/core/src/db/workflow-events.ts
getCompletedDagNodeOutputs extends its SQL filter to query both node_completed and node_skipped_prior_success events, enabling output recovery across multiple resume cycles.
Failure Messaging
packages/workflows/src/dag-executor.ts
When no nodes complete, the failure message now lists failed node IDs (e.g., failed: node(s) ... failed.) instead of generic "no successful nodes" text.
Tests & Validation
packages/core/src/db/workflow-events.test.ts, packages/workflows/src/dag-executor.test.ts
New tests for node_skipped_prior_success output collection and updated executor tests to expect node-specific failure messages and empty node_output when prior map had undefined.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Poem

🐰
Hops skip, keeps the prize inside,
Outputs saved where memories hide,
Resume once more, the map stays true,
No lost nodes hopping back anew,
A carrot-coded win for you!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed Title clearly summarizes the main change: preserving completed node state across DAG multi-resume cycles, which is the primary objective of this PR.
Description check ✅ Passed PR description comprehensively covers all required template sections including problem, changes, UX journey, architecture diagrams, validation evidence, security impact, compatibility, human verification, side effects, and rollback plan.
Linked Issues check ✅ Passed Code changes fully address all coding requirements from issue #1520: query now includes node_skipped_prior_success events [#1520], skip events preserve output [#1520], failure messages name specific failed nodes [#1520], and backward-compatible implementation confirmed [#1520].
Out of Scope Changes check ✅ Passed All changes are in-scope: only modify query logic and event data payload to fix multi-resume state loss and improve error messages as specified in #1520; no schema changes or unrelated refactoring introduced.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@Wirasm

Wirasm commented May 4, 2026

Copy link
Copy Markdown
Collaborator

@kagura-agent related to #1520 — DAG multi-resume state preservation.

@Wirasm

Wirasm commented May 4, 2026

Copy link
Copy Markdown
Collaborator

@kagura-agent related to #1391 — DAG resume state concerns overlap with #1520 fix.

@Wirasm

Wirasm commented May 4, 2026

Copy link
Copy Markdown
Collaborator

Hi @kagura-agent — thanks for opening this PR.

This repository uses a PR template at .github/pull_request_template.md with several required
sections. A few of them appear to be empty or placeholder here:

  • UX Journey
  • Architecture Diagram
  • Label Snapshot
  • Change Metadata
  • Linked Issue
  • Validation Evidence
  • Security Impact
  • Compatibility / Migration
  • Human Verification
  • Risks and Mitigations

Could you fill those out (even briefly)? The template
helps reviewers understand scope, risk, and rollback — it
speeds up review significantly.

If a section genuinely doesn't apply, just write "N/A" in
it rather than leaving it blank.

@kagura-agent

Copy link
Copy Markdown
Contributor Author

Done — filled out all the template sections (UX Journey, Architecture Diagram, Label Snapshot, Change Metadata, etc.). Thanks for pointing that out, @Wirasm!

@kagura-agent

Copy link
Copy Markdown
Contributor Author

Hey @Wirasm 👋 same as #1532 — template is filled out. Anything I should adjust here? Happy to iterate!

@Wirasm

Wirasm commented May 11, 2026

Copy link
Copy Markdown
Collaborator

Review Summary

Verdict: minor-fixes-needed

This PR fixes a real bug in DAG workflow multi-resume by making getCompletedDagNodeOutputs query both node_completed and node_skipped_prior_success events — previously-completed nodes now survive a second resume. The code quality and error handling are solid, but one new behavior is missing test coverage.

Suggested fixes

  • packages/workflows/src/dag-executor.ts:2564–2567: The node_output field added to node_skipped_prior_success events has no test coverage. Extend emits node_skipped_prior_success event for resumed nodes to assert the field value:

    expect(skippedEvent[0].data.node_output).toBe('prior output');

    Also test the empty-string fallback when priorCompletedNodes.get(node.id) is undefined.

  • packages/workflows/src/dag-executor.ts:3065–3072: Add a dedicated test for the updated failure message (showing failed node names). Assert that a failing node produces nodes fail-bash-xxx failed and does not contain "no successful nodes".

  • packages/workflows/src/dag-executor.test.ts:1134, 6173, 6267: Three comments are stale — update them from "no successful nodes" to describe the new failure message.

Minor / nice-to-have

  • packages/workflows/src/dag-executor.ts:3064: The plural-suffixed grammar (node${s}, node${s were}) is redundant inside if (!anyCompleted) — move the failedNodes logic outside the guard for clarity.

  • packages/core/src/db/workflow-events.test.ts:208: Consider adding a test with only node_skipped_prior_success rows (no node_completed) to verify the query handles that edge case.

Compliments

  • Well-scoped fix with clear scope boundaries documented in the PR body.
  • New test returns outputs from node_skipped_prior_success events (multi-resume) covers the happy path cleanly.
  • Improved failure messaging surfaces specific failed node names instead of generic "no successful nodes" — much better for debugging.

Reviewed via maintainer-review-pr workflow (Pi/Minimax). Aspects run: code-review, error-handling, test-coverage.

…age format

- Assert node_output field in node_skipped_prior_success event test
- Add test for empty-string fallback when node ID has undefined output
- Add dedicated test verifying failure message names failing nodes
- Update stale comment from 'no successful nodes' to new behavior
- Add edge-case test for only node_skipped_prior_success rows (no node_completed)
@kagura-agent

Copy link
Copy Markdown
Contributor Author

Thanks for the thorough review @Wirasm! Pushed b1c94a1a addressing your feedback:

Required fixes — all done:

  1. ✅ Added expect(skippedEvent[0].data.node_output).toBe('prior output') to the existing node_skipped_prior_success test + a new test for the empty-string fallback when the map value is undefined
  2. ✅ Added dedicated test 'failure message names the failing node instead of generic summary' — asserts the message contains the node name and does NOT contain "no successful nodes"
  3. ✅ Updated stale comment at line 4667 from "no successful nodes" summarynames the failing node(s)

Nice-to-have — done:

  • ✅ Added edge-case test in workflow-events.test.ts with only node_skipped_prior_success rows (no node_completed)
  • Skipped the plural-logic refactor — it's cosmetic and could introduce risk

All tests pass (203 dag-executor, 19 workflow-events) + type-check clean.

@Wirasm

Wirasm commented May 13, 2026

Copy link
Copy Markdown
Collaborator

Review Summary

Verdict: ready-to-merge

This PR fixes a silent data-loss bug in the DAG workflow executor where outputs from previously-skipped nodes were dropped on multi-resume, and improves failure messages to name the specific failing node instead of a generic summary. All review aspects pass — code is clean, error handling is consistent, and tests are adequate. One optional test improvement noted below.

Minor / nice-to-have

  • packages/core/src/db/workflow-events.ts:127 — No test yet for the mixed scenario where both node_completed and node_skipped_prior_success exist for the same step_name. The query returns both rows with no de-duplication; the Map keeps whichever row arrives last. This would require an unusual re-resume cycle to trigger, so it's low risk. (from test-coverage)

  • packages/workflows/src/dag-executor.ts:2566 — The ?? '' fallback is covered, but the test uses undefined as the Map value rather than omitting the key entirely. Consider a test that omits the step from priorCompletedNodes to hit the exact code path. (from test-coverage)

Compliments

The PR is tightly scoped with clear causal connections between the three sub-changes (query filter, output preservation, failure message). The ?? '' nullish-coalescing approach is correct — it handles the edge case of an entry with undefined value while treating null/undefined node_output as a signal to skip the field. Adding a dedicated regression test for the failure message is exactly the right call. Contributor addressed prior review feedback promptly.


Reviewed via maintainer-review-pr workflow (Pi/Minimax). Aspects run: code-review, error-handling, test-coverage, comment-quality.

@kagura-agent

Copy link
Copy Markdown
Contributor Author

Hi @coleam00 👋 friendly ping — this PR has been open for ~12 days and received a positive review from @Wirasm. Would you be able to take a look when you get a chance? Happy to make any adjustments if needed.

@Wirasm

Wirasm commented May 14, 2026

Copy link
Copy Markdown
Collaborator

Review Summary

Verdict: ready-to-merge

Solid PR — the node_output capture in node_skipped_prior_success events is the key fix enabling $nodeId.output references to work correctly for skipped nodes during resume, and the node-specific failure messages are a meaningful UX improvement. Error handling follows project patterns throughout; tests cover the critical edge cases.

Blocking issues

None.

Suggested fixes

  • packages/workflows/src/dag-executor.ts:2565: Add an assertion in the "skips nodes that appear in priorCompletedNodes" test verifying the emitted node_skipped_prior_success event's data.node_output field — e.g. expect(skippedEvent[0].data.node_output).toBe('prior step1 output'). The existing tests confirm the map is populated, but don't lock in that the emitted event field reflects it.

Minor / nice-to-have

  • dag-executor.test.ts: The updated message-match assertions (e.g. messages.find((m) => m.includes('failed') && m.includes('fail'))) are slightly loose — they match any message containing "failed" and "fail" anywhere. A tighter pattern like m.includes('failed:') or m.includes('failed') && m.includes('node') would add a bit more confidence, though the primary test already covers the key behavior.

Compliments

  • The ?? '' fallback on priorCompletedNodes.get(node.id) is consistent with the pattern used for completed state events — good normalization.
  • The failWorkflowRun .catch() pattern (log at error level, no re-throw) correctly reflects that a DB write failure after an already-failing workflow should not cause a double failure.

Reviewed via maintainer-review-pr workflow (Pi/Minimax). Aspects run: code-review, error-handling, test-coverage.

@kagura-agent

Copy link
Copy Markdown
Contributor Author

Hi @coleam00 👋 friendly ping — @Wirasm reviewed this as ready-to-merge on May 14. Is there anything else needed from my side, or could this be merged? Happy to make any adjustments.

@Wirasm Wirasm merged commit aaf7f9a into coleam00:dev May 18, 2026
4 checks passed
cropse pushed a commit to cropse/Archon that referenced this pull request May 19, 2026
… cycles (coleam00#1530)

* fix(workflows): preserve completed node state across DAG multi-resume cycles (coleam00#1520)

Bug: getCompletedDagNodeOutputs only queried 'node_completed' events, but
resumed runs emit 'node_skipped_prior_success' for already-completed nodes.
On a second resume, previously completed nodes were re-executed because
their skip events were invisible to the query.

Fix:
- Query both 'node_completed' and 'node_skipped_prior_success' event types
  in getCompletedDagNodeOutputs so multi-resume correctly identifies all
  previously completed nodes
- Store original node_output in node_skipped_prior_success event data so
  the output is available for $nodeId.output substitution on next resume
- Improve error messaging when all nodes fail: name the specific failed
  nodes and count skipped downstream nodes instead of the generic
  'no successful nodes' message

Closes coleam00#1520

* test: address review feedback — assert node_output, test failure message format

- Assert node_output field in node_skipped_prior_success event test
- Add test for empty-string fallback when node ID has undefined output
- Add dedicated test verifying failure message names failing nodes
- Update stale comment from 'no successful nodes' to new behavior
- Add edge-case test for only node_skipped_prior_success rows (no node_completed)
@Wirasm Wirasm mentioned this pull request May 28, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

DAG workflow loses completed node state on second resume (multi-resume bug)

2 participants