
fix(delegate-task): harden child-session first-prompt fallback recovery#3825

Merged
code-yeongyu merged 5 commits into code-yeongyu:dev from tw-yshuang:fix/delegated-child-session-early-failure-fallback
May 15, 2026

@tw-yshuang tw-yshuang commented May 7, 2026

Contributor

Summary

  • fix delegated child-session fallback recovery when the first prompt fails before session history is persisted
  • introduce a shared bootstrap context so background and sync delegated flows register retry payloads and fallback metadata before first prompt dispatch
  • add cleanup and regression coverage for exhaustion, non-retryable failures, isolation, and terminal-path cleanup

Background

In the delegated child-session flow, a subagent session can fail on the very first promptAsync call because of a provider or model error.

Before this change, that path had two related problems:

  1. background child-session fallback context registration could race behind first-prompt dispatch
  2. runtime fallback depended on session.messages to recover the last user prompt, so empty-history first-prompt failures could not retry

That left delegated child sessions stuck in early failure without correctly advancing to fallback models.

First-prompt early-failure flow after this change

  1. A delegated child session is created from the parent session
  2. Before the first prompt is sent, the system registers delegated bootstrap context containing:
    • retry prompt parts
    • fallback chain
    • category metadata
  3. The first prompt is dispatched
  4. If the first prompt fails with a retryable error:
    • background and sync delegated paths enter fallback handling
    • runtime fallback first tries persisted session.messages
    • if history is still empty and the session is a delegated child session, it falls back to the bootstrap retry parts
  5. Fallback models are attempted in order with existing exhaustion semantics preserved
  6. On retry handoff, completion, cancellation, crash/error, session deletion, shutdown, or launch-time terminal interrupt, the session-scoped bootstrap and fallback metadata are cleared
flowchart TD
    A[Parent session delegates child task] --> B[Create child session]
    B --> C[Register delegated bootstrap context before first prompt]
    C --> D[Send first prompt]

    D -->|Success| E[Continue normal background or sync flow]
    D -->|Failure| F{Retryable error?}

    F -->|No| G[Enter terminal path]
    F -->|Yes| H{session.messages has user parts?}

    H -->|Yes| I[Use persisted user parts for retry]
    H -->|No| J{Delegated bootstrap exists?}

    J -->|No| G
    J -->|Yes| K[Use bootstrap retry parts]

    I --> L[Pick next fallback model]
    K --> L
    L --> M{Fallback chain exhausted?}

    M -->|No| N[Retry with next fallback model]
    M -->|Yes| G

    N --> O{Retry succeeds?}
    O -->|Yes| P[Continue task execution]
    O -->|No| F

    E --> Q[Terminal cleanup]
    P --> Q
    G --> Q

    Q --> R[Clear bootstrap prompt context]
    Q --> S[Clear fallback-chain registration]
    Q --> T[Clear category registration]
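
The decision flow above can be sketched as a small retry loop. This is only an illustration under stated assumptions: `runFallback`, `SendPrompt`, `FallbackOutcome`, and the `isRetryable` callback are hypothetical names, not the project's actual API, and prompt parts are reduced to plain strings.

```typescript
// Hypothetical signature: sends the retry parts to one model and rejects on failure.
type SendPrompt = (model: string, parts: string[]) => Promise<void>;

type FallbackOutcome =
  | { status: "recovered"; model: string }
  | { status: "terminal"; reason: "non-retryable" | "no-retry-parts" | "exhausted" };

async function runFallback(
  firstError: unknown,
  retryParts: string[] | undefined, // already resolved from history or bootstrap
  fallbackChain: string[],
  send: SendPrompt,
  isRetryable: (error: unknown) => boolean,
): Promise<FallbackOutcome> {
  let lastError: unknown = firstError;
  let nextIndex = 0;
  while (true) {
    // F: only retryable errors may advance the chain.
    if (!isRetryable(lastError)) return { status: "terminal", reason: "non-retryable" };
    // H/J: without a payload there is nothing to resend.
    if (!retryParts || retryParts.length === 0) {
      return { status: "terminal", reason: "no-retry-parts" };
    }
    // L/M: pick the next model, or stop when the chain is exhausted.
    const model = fallbackChain[nextIndex];
    if (model === undefined) return { status: "terminal", reason: "exhausted" };
    nextIndex += 1;
    try {
      await send(model, retryParts); // N: retry with the next fallback model
      return { status: "recovered", model };
    } catch (error) {
      lastError = error; // O -> F: re-evaluate the new failure
    }
  }
}
```

Every branch ends in either `recovered` or `terminal`, which matches the diagram's property that all paths eventually reach terminal cleanup.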

What changed

Shared delegated bootstrap contract

  • add a shared bootstrap state to preserve delegated child-session retry parts before first prompt dispatch
  • register fallback chain and category context before the first prompt in both background and sync delegated paths
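
A minimal sketch of what such a shared bootstrap state could look like, assuming a module-level map keyed by child session ID. The type and function names here (`DelegatedBootstrap`, `registerBootstrap`, `getBootstrap`, `clearBootstrap`) are hypothetical and are not taken from `src/shared/delegated-child-session-bootstrap.ts`.

```typescript
// Hypothetical shapes; the PR text does not show the real types.
type PromptPart = { type: "text"; text: string };

interface DelegatedBootstrap {
  retryParts: PromptPart[]; // payload to resend if the first prompt fails
  fallbackChain: string[];  // model IDs to try, in order
  category?: string;        // delegated category metadata
}

// Session-scoped state: one entry per child session ID.
const bootstrapBySession = new Map<string, DelegatedBootstrap>();

function registerBootstrap(sessionID: string, ctx: DelegatedBootstrap): void {
  // Copy the arrays so later mutation by the caller cannot change the stored payload.
  bootstrapBySession.set(sessionID, {
    ...ctx,
    retryParts: [...ctx.retryParts],
    fallbackChain: [...ctx.fallbackChain],
  });
}

function getBootstrap(sessionID: string): DelegatedBootstrap | undefined {
  return bootstrapBySession.get(sessionID);
}

function clearBootstrap(sessionID: string): void {
  bootstrapBySession.delete(sessionID);
}
```

Keying everything by session ID is what keeps concurrent delegated sessions isolated: clearing one child never touches another child's entry.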

Background path atomic registration

  • move background fallback-context registration into the launch path itself
  • remove the dependency on delayed polling or backfill to make critical fallback metadata available
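
The ordering guarantee can be illustrated with a short sketch. It assumes hypothetical names (`launchChildSession`, `LaunchDeps`) and reduces the launch path to two injected callbacks; the real manager does considerably more.

```typescript
// Hypothetical dependencies; this only illustrates the ordering.
interface LaunchDeps {
  register: (sessionID: string) => void;             // stores retry parts, chain, category
  promptAsync: (sessionID: string) => Promise<void>; // sends the first prompt
}

async function launchChildSession(sessionID: string, deps: LaunchDeps): Promise<void> {
  // Registration is synchronous and happens first, so a failure inside
  // promptAsync can never observe a session without fallback context.
  deps.register(sessionID);
  await deps.promptAsync(sessionID);
}
```

Because registration no longer waits on polling or backfill, the fallback handler can rely on the metadata being present whenever the first prompt rejects.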

Runtime fallback support for empty-history retry

  • preserve the existing history-first retry path
  • fall back to delegated bootstrap retry parts only when the session is a delegated child session and persisted history is still empty
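
A sketch of the lookup order described above, assuming simplified message and part shapes. `resolveRetryParts`, `RetryPart`, and `StoredMessage` are hypothetical names; treating "history is still empty" as `messages.length === 0` is also an assumption.

```typescript
type RetryPart = { type: "text"; text: string };
type StoredMessage = { role: "user" | "assistant"; parts: RetryPart[] };

function resolveRetryParts(
  messages: StoredMessage[],
  bootstrapParts: RetryPart[] | undefined, // undefined when the session is not a delegated child
): RetryPart[] | undefined {
  // 1. History-first: reuse the most recent user message that has parts.
  for (let i = messages.length - 1; i >= 0; i--) {
    const message = messages[i];
    if (message && message.role === "user" && message.parts.length > 0) {
      return message.parts;
    }
  }
  // 2. Bootstrap only when persisted history is still empty (assumed to mean no messages).
  if (messages.length === 0 && bootstrapParts && bootstrapParts.length > 0) {
    return bootstrapParts;
  }
  // 3. Nothing to resend: do not invent a payload.
  return undefined;
}
```

Returning `undefined` in the last case keeps non-delegated sessions on their existing behavior.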

Sync path alignment and cleanup

  • align sync delegated execution with the shared bootstrap contract
  • ensure retry-created sessions and terminal paths clear bootstrap and fallback context so session-scoped state does not leak
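
One way to picture the cleanup contract is an idempotent clear function that every terminal path calls, plus a `finally` wrapper for paths that run to completion or throw. The registries and function names below are hypothetical stand-ins, not the project's modules.

```typescript
// Hypothetical session-scoped registries; names are illustrative.
const bootstrapParts = new Map<string, string[]>();
const fallbackChains = new Map<string, string[]>();
const categories = new Map<string, string>();

// Idempotent: safe to call from several terminal paths for the same session.
function clearDelegatedSessionState(sessionID: string): void {
  bootstrapParts.delete(sessionID);
  fallbackChains.delete(sessionID);
  categories.delete(sessionID);
}

async function runWithCleanup<T>(sessionID: string, run: () => Promise<T>): Promise<T> {
  try {
    return await run();
  } finally {
    // Runs on completion and on thrown errors alike.
    clearDelegatedSessionState(sessionID);
  }
}
```

Paths that do not flow through a single awaited call, such as session deletion or shutdown, would call `clearDelegatedSessionState` directly, and idempotence makes a double call harmless.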

Regression coverage

  • background first-prompt retryable failure
  • sync first-prompt fallback ordering
  • delegated empty-history runtime retry
  • exhaustion behavior
  • non-retryable failures do not retry
  • concurrent delegated session isolation
  • background and sync terminal cleanup

Affected areas

  • src/features/background-agent/manager.ts
  • src/tools/delegate-task/background-task.ts
  • src/tools/delegate-task/sync-task.ts
  • src/tools/delegate-task/sync-prompt-sender.ts
  • src/hooks/runtime-fallback/auto-retry.ts
  • src/hooks/runtime-fallback/last-user-retry-parts.ts
  • src/shared/delegated-child-session-bootstrap.ts

Verification

bun test src/features/background-agent/manager.test.ts src/features/background-agent/fallback-retry-handler.test.ts src/features/background-agent/spawner.test.ts src/hooks/runtime-fallback/index.test.ts src/tools/delegate-task/sync-task.test.ts
bun run typecheck
  • 256 pass, 0 fail
  • typecheck passed


Summary by cubic

Fixes delegated child-session fallback when the first prompt fails before any history exists. Captures the first-prompt retry payload at launch and cleans up on every terminal path to keep state consistent and isolated.

  • Bug Fixes
    • Introduced shared/delegated-child-session-bootstrap to register retry parts, fallback chain, and category before the first prompt in both background and sync flows.
    • Background manager registers the bootstrap at launch and centralizes cleanup (clears bootstrap, session fallback chain, and category) on completion, cancellation, retry handoff, deletion, shutdown, and terminal launch failures; preserves late wiring when the parent aborts early.
    • Runtime auto-retry now passes sessionID to last-user lookup and uses bootstrap retry parts when history is empty; does not invent payloads if no bootstrap exists.
    • Sync path aligns with the bootstrap contract, injects the built prompt text into sendSyncPrompt, and clears per-session bootstrap/fallback context on retries and finish; concurrent sessions remain isolated.
    • Merged latest dev and resolved manager/runtime-fallback conflicts while keeping delegated bootstrap wiring and upstream poll-recovery; expanded tests for empty-history retries, terminal launch cleanup, scheduled-retry cleanup, cancellation/completion cleanup, isolation, exhaustion, and non-retryable errors.

Written for commit 12f5233. Summary will update on new commits.

Capture delegated child-session retry context before the first prompt so fallback recovery still works when session history is empty. Align background and sync launch paths around the same bootstrap contract, clear session-scoped fallback state on every terminal path, and lock the behavior with regression coverage for first-prompt retries, exhaustion, isolation, and cleanup.

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>

@cubic-dev-ai cubic-dev-ai Bot left a comment


No issues found across 10 files

Confidence score: 5/5

  • Automated review surfaced no issues in the provided summaries.
  • No files require special attention.

Requires human review: Cannot guarantee 100% no regressions; risk of edge-case behavioral changes despite passing tests and AI review.

Keep the new manager-side bootstrap registration as the primary path, but restore a compatibility fallback when the parent call aborts before the child session id resolves. This preserves late delegated session wiring for already-launched background tasks without reverting the new bootstrap architecture.

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
@tw-yshuang

Contributor Author

Follow-up fix pushed in 291eeed8.

This keeps the new manager-side bootstrap registration as the primary path, but restores a compatibility fallback for the background delegated path when the parent call aborts before the child session id resolves.

Why this follow-up was needed:

  • the new architecture intentionally moved delegated bootstrap/fallback registration into the earlier manager-side launch path
  • however, CI still covered an older but still valid contract: once a background delegated task has been successfully launched, a parent-side abort should not prevent the child session from finishing its late wiring if the session id resolves shortly afterward

What this follow-up does:

  • preserves the new pre-dispatch bootstrap architecture
  • restores late session wiring only as a compatibility fallback in background-task.ts
  • late-resolved background child sessions still receive:
    • delegated bootstrap context
    • fallback-chain registration
    • category registration

This avoids regressing the stronger new design while preserving the older tool-boundary behavior that existing CI still expects.
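
A rough sketch of how a compatibility fallback like this could coexist with the primary path, assuming the late wiring only registers context when the manager-side path has not already done so. `wireLateResolvedSession`, `LateWiringContext`, and the "skip if already registered" rule are assumptions for illustration, not the actual code in background-task.ts.

```typescript
interface LateWiringContext {
  retryParts: string[];
  fallbackChain: string[];
  category?: string;
}

// Hypothetical stand-in for the shared session-scoped registry.
const wiredSessions = new Map<string, LateWiringContext>();

// Returns true when this call performed the wiring, false when the
// primary (manager-side) path had already registered the session.
function wireLateResolvedSession(sessionID: string, ctx: LateWiringContext): boolean {
  if (wiredSessions.has(sessionID)) return false;
  wiredSessions.set(sessionID, ctx);
  return true;
}
```

Under that assumption the late path can never overwrite context captured before dispatch; it only fills the gap when the session ID resolves after a parent-side abort.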

Re-verified locally with:

bun test src/tools/delegate-task/background-task.test.ts
bun test src/features/background-agent/manager.test.ts src/hooks/runtime-fallback/index.test.ts src/tools/delegate-task/sync-task.test.ts
bun run typecheck

All passed locally.

@cubic-dev-ai cubic-dev-ai Bot left a comment


0 issues found across 1 file (changes from recent commits).

Requires human review: While the PR provides comprehensive regression tests and logic for a valid fix, the scale of changes and introduction of new global state for session management warrant human verification to guarantee 100% regression safety.

Reconcile the latest dev branch changes with the delegated child-session fallback work. Preserve the upstream background-agent updates while keeping the delegated bootstrap cleanup and compatibility wiring fixes intact, then re-verify the affected regression suites and typecheck.

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
@tw-yshuang

Contributor Author

Quick update after the last round of conflict resolution:

  • I resolved the previous merge conflicts in src/features/background-agent/manager.ts and src/features/background-agent/manager.test.ts
  • I re-ran the affected local verification after that merge resolution:
bun test src/features/background-agent/manager.test.ts src/features/background-agent/fallback-retry-handler.test.ts src/features/background-agent/spawner.test.ts src/hooks/runtime-fallback/index.test.ts src/tools/delegate-task/background-task.test.ts src/tools/delegate-task/sync-task.test.ts
bun run typecheck

Those checks passed locally (275 pass, 0 fail, plus clean typecheck), and the latest PR CI for the current head is green as well.

One more thing changed afterward: dev moved again after I resolved the earlier conflicts. I re-checked mergeability against the latest origin/dev, and the previous manager.ts / manager.test.ts conflicts are gone, but there is now a fresh merge conflict in:

  • src/tools/delegate-task/sync-task.test.ts

So at this point:

  • current PR head: CI is green
  • current mergeability against the latest dev: needs one new conflict resolution in sync-task.test.ts

tw-yshuang and others added 2 commits May 12, 2026 00:37
Sync the PR branch with the latest dev branch and resolve the remaining conflict in sync-task.test.ts while preserving both the new upstream poll-recovery coverage and this branch's delegated bootstrap cleanup and isolation coverage. Re-verified the affected delegated fallback suites and typecheck after the merge resolution.

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
Sync the PR branch with the newest dev branch and resolve the new import-level conflicts in background-agent manager and runtime-fallback tests. Preserve both the delegated bootstrap coverage from this branch and the newer upstream test utilities and runtime wiring changes, then re-verify the affected delegated fallback suites and typecheck.

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
@code-yeongyu code-yeongyu merged commit cd33f3a into code-yeongyu:dev May 15, 2026
8 checks passed
code-yeongyu added a commit that referenced this pull request May 15, 2026
…strap

Revert "Merge pull request #3825 from tw-yshuang/fix/delegated-child-session-early-failure-fallback"
@code-yeongyu

Owner

Reverted on dev via #4044 — full reasoning there.

Short version: after merge, the PR's own test, "delegated child-session empty-history fallback retries with captured bootstrap prompt", fails on a clean root bun test --timeout 30000 run (6828 pass / 1 fail; the revert restores 6823 pass / 0 fail). The race this PR targets (a delegated child session whose first promptAsync call fails while persisted history is still empty) is narrow and partly handled by the existing persisted-history fallback path, so the cleaner path is to keep dev green and reopen once the test is stabilized.

To re-land:

  1. Rebase onto current dev (which now includes the shared promptAsync gate, ralph-loop display-name fix, session-recovery persistent dedupe, tool-pair-validator subagent skip, and thinking_block_modified recovery).
  2. Drive the failing test red/green explicitly against that base and confirm root bun test --timeout 30000 is 0 fail.
  3. Reopen as a fresh PR.

wsycarlos pushed a commit to wsycarlos/oh-my-openagent that referenced this pull request May 15, 2026
…gated-child-session-early-failure-fallback"

This reverts commit cd33f3a, reversing
changes made to 521c99c.
islee23520 pushed a commit to islee23520/oh-my-openagent that referenced this pull request May 16, 2026
…erral

PR code-yeongyu#3825 introduced a delegated child-session bootstrap to capture first-prompt retry payloads before history is persisted, addressing the empty-history fallback gap. After merge the PR's own regression test failed on clean root bun test (6828 pass / 1 fail), so PR code-yeongyu#4044 reverted it. Ship v4.2.0 with the bug documented and a workaround so users have an explicit story for the unfixed delegated child-session early-failure path. Reland will target v4.2.1.

Closes BLOCKER-4 (Path B - reland deferred to v4.2.1)
islee23520 pushed a commit to islee23520/oh-my-openagent that referenced this pull request May 16, 2026
…OCKER-4)

PR code-yeongyu#3825's fac90d6 introduced a shared bootstrap context to fix delegated child-session fallback when the first prompt fails before any session history is persisted. PR code-yeongyu#4044 reverted that fix because its own regression test failed on a clean root suite (6828 pass / 1 fail). The bug remains unaddressed in v4.2.0; reland is deferred.

This commit documents the symptom, history, workaround, and tracking issue so users have visibility.

Closes BLOCKER-4 via Path B (documentation).

Refs PR code-yeongyu#3825, PR code-yeongyu#4044, issue code-yeongyu#4059.
islee23520 pushed a commit to islee23520/oh-my-openagent that referenced this pull request May 16, 2026
Update the v4.2.0 known issue with the filed follow-up issue and exact PR code-yeongyu#3825/code-yeongyu#4044 commit details.

Refs code-yeongyu#4059.
islee23520 pushed a commit to islee23520/oh-my-openagent that referenced this pull request May 16, 2026
…t deferral

PR code-yeongyu#3825 added a shared bootstrap context to capture delegated
child-session retry payloads before the first prompt dispatch, so
empty-history failures could still retry through the fallback chain.
The PR's own regression test failed on clean root bun test after merge
(6828 pass / 1 fail). PR code-yeongyu#4044 reverted the merge to keep dev green.

Ship v4.2.0 with the bug documented and a workaround so users have an
explicit story for the unfixed delegated child-session early-failure
path. Reland targets v4.2.1 once the regression test is stabilized.

Closes BLOCKER-4 (Path B - documentation, reland deferred to v4.2.1)
islee23520 pushed a commit to islee23520/oh-my-openagent that referenced this pull request May 16, 2026
Document all 7+ BLOCKER+HIGH fixes, breaking-change-free additions
(public exports), known issue for delegated child-session fallback
(PR code-yeongyu#3825 deferred to v4.2.1), and internal-only changes.

Closes L14
islee23520 pushed a commit to islee23520/oh-my-openagent that referenced this pull request May 16, 2026
…erral

PR code-yeongyu#3825 introduced a delegated child-session bootstrap to capture first-prompt retry payloads before history is persisted, addressing the empty-history fallback gap. After merge the PR's own regression test failed on clean root bun test (6828 pass / 1 fail), so PR code-yeongyu#4044 reverted it. Ship v4.2.0 with the bug documented and a workaround so users have an explicit story for the unfixed delegated child-session early-failure path. Reland will target v4.2.1.

Closes BLOCKER-4 (Path B - reland deferred to v4.2.1)
@tw-yshuang

tw-yshuang commented May 17, 2026

Contributor Author

Follow-up note: this PR was already reverted on dev via #4044, and the underlying delegated child-session first-prompt / empty-history fallback problem has since been addressed more completely by #4074.

The relevant upstream fix is in #4074, especially commit ba64868: fix(runtime-fallback): carry delegated system and tools through bootstrap retry. That implementation reuses the existing delegated child-session bootstrap state instead of adding a separate runtime-fallback cache, and it preserves the full delegated retry context:

  • retry parts
  • system prompt
  • tool gates
  • delegated fallback/category context

Because #4074 covers the same failure mode with a cleaner and broader upstream implementation, I am not going to re-land the older #3825 approach. Any further validation should be based on current dev after #4074 rather than on this reverted branch.
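
To make the difference in scope concrete, here is a sketch of a bootstrap entry that carries the full retry context listed above. The field and function names are assumptions for illustration only; they are not taken from #4074 or commit ba64868.

```typescript
// Hypothetical shape of a "full" delegated retry context.
interface DelegatedRetryContext {
  retryParts: string[];           // first-prompt payload
  system: string;                 // delegated system prompt
  tools: Record<string, boolean>; // tool gates: tool name -> enabled
  fallbackChain: string[];        // delegated fallback context
  category: string;               // delegated category context
}

interface RetryRequest {
  model: string;
  parts: string[];
  system: string;
  tools: Record<string, boolean>;
}

// Builds the retry request for one fallback model from the stored context,
// so the retried prompt keeps the same system prompt and tool gates.
function buildRetryRequest(ctx: DelegatedRetryContext, model: string): RetryRequest {
  return { model, parts: [...ctx.retryParts], system: ctx.system, tools: { ...ctx.tools } };
}
```

The point of carrying `system` and `tools` alongside the parts is that a retried first prompt should behave like the original delegated prompt, not like a bare resend of its text.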
