feat: make Daytona workflow runs resilient to workspace, tool-policy, and model timeout failures (#331) #339

Merged: marccampbell merged 2 commits into main from feat/daytona-resilience-331 on Jun 3, 2026

@elasticclaw-factory

Contributor

This PR addresses issue #331 by making Daytona-backed workflow runs fail fast or recover cleanly instead of leaving stuck agent sessions.

Changes

1. Workspace readiness gate (bootstrapDaytona)

  • Before setting bootstrap_ok=1 or starting the bridge, verify every configured repository is present at the expected path with a .git directory
  • If repos are missing, mark the claw as failed with a sanitized, actionable bootstrap error stored in bootstrap_diagnostic
  • Uses new setBootstrapStatusWithDiagnostic helper to persist both status and diagnostic for UI display
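
The readiness check described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the helper names `missingRepos` and `readinessDiagnostic`, the `root/<name>` checkout layout, and the message wording are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// missingRepos returns the names of repositories that are not checked out
// under root. A repository counts as present only if root/<name>/.git exists
// and is a directory. Hypothetical helper: the root/<name> layout is an
// assumption, not taken from the PR.
func missingRepos(root string, repos []string) []string {
	var missing []string
	for _, name := range repos {
		info, err := os.Stat(filepath.Join(root, name, ".git"))
		if err != nil || !info.IsDir() {
			missing = append(missing, name)
		}
	}
	return missing
}

// readinessDiagnostic turns the missing list into a short, actionable
// message. An empty string means the workspace is ready.
func readinessDiagnostic(missing []string) string {
	if len(missing) == 0 {
		return ""
	}
	return fmt.Sprintf("workspace not ready: %d repository(ies) missing a .git directory: %s",
		len(missing), strings.Join(missing, ", "))
}

func main() {
	root, err := os.MkdirTemp("", "workspace")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(root)

	// Simulate a workspace where only "api" was cloned.
	if err := os.MkdirAll(filepath.Join(root, "api", ".git"), 0o755); err != nil {
		panic(err)
	}

	fmt.Println(readinessDiagnostic(missingRepos(root, []string{"api", "web"})))
}
```

In a gate like this, a non-empty diagnostic would be the signal to mark the claw as failed rather than setting bootstrap_ok=1 or starting the bridge.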

2. Tool-loop detection (detectToolLoop)

  • Extended to catch repeated exec/elevated/tool-policy failures:
    • exec failed:
    • elevated is not available
    • tool-policy
  • Injects a corrective hub message after 3+ occurrences, similar to existing edit/write/read loop detection
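
One way to picture the extended detection is a substring counter over recent tool outputs. The sketch below is an assumption-laden illustration: the marker strings and the threshold of 3 come from the description above, but the function name `detectFailureLoop`, its signature, and the choice to count each marker separately are hypothetical. The real detectToolLoop may window or combine failures differently.

```go
package main

import (
	"fmt"
	"strings"
)

// failureMarkers are the substrings listed in the PR description.
var failureMarkers = []string{
	"exec failed:",
	"elevated is not available",
	"tool-policy",
}

// loopThreshold mirrors the "3+ occurrences" rule from the description.
const loopThreshold = 3

// detectFailureLoop scans recent tool outputs and returns the first marker
// that appears in at least loopThreshold of them. Counting per marker is an
// assumption made for this sketch.
func detectFailureLoop(outputs []string) (string, bool) {
	counts := make(map[string]int)
	for _, out := range outputs {
		for _, m := range failureMarkers {
			if strings.Contains(out, m) {
				counts[m]++
			}
		}
	}
	// Iterate the slice, not the map, so the result is deterministic.
	for _, m := range failureMarkers {
		if counts[m] >= loopThreshold {
			return m, true
		}
	}
	return "", false
}

func main() {
	outputs := []string{
		"elevated is not available",
		"ok",
		"elevated is not available on this host",
		"elevated is not available",
	}
	if marker, looping := detectFailureLoop(outputs); looping {
		fmt.Printf("loop detected on %q: inject corrective hub message\n", marker)
	}
}
```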

3. Model timeout classification (heartbeat handler)

  • When gateway is unhealthy for 8+ consecutive checks (~4 minutes) while streaming, classify as 'model timeout'
  • Persists diagnostic in bootstrap_diagnostic column and broadcasts it to the UI as a retryable failure
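
The classification rule amounts to a consecutive-failure counter. The following is a small sketch under stated assumptions: the threshold of 8 comes from the description, the roughly 30-second check interval is only inferred from "8 checks, ~4 minutes", and the type name `gatewayMonitor` and the choice to reset the counter when the agent is not streaming are hypothetical.

```go
package main

import "fmt"

// unhealthyThreshold is the "8+ consecutive checks" rule from the PR
// description. At an inferred interval of about 30s this is about 4 minutes.
const unhealthyThreshold = 8

// gatewayMonitor tracks consecutive unhealthy heartbeats (hypothetical type).
type gatewayMonitor struct {
	consecutiveUnhealthy int
}

// observe records one heartbeat and reports whether the run should now be
// classified as a model timeout. Resetting when the agent is not streaming
// is an assumption of this sketch.
func (g *gatewayMonitor) observe(healthy, streaming bool) bool {
	if healthy || !streaming {
		g.consecutiveUnhealthy = 0
		return false
	}
	g.consecutiveUnhealthy++
	return g.consecutiveUnhealthy >= unhealthyThreshold
}

func main() {
	var g gatewayMonitor
	for check := 1; check <= unhealthyThreshold; check++ {
		if g.observe(false, true) {
			fmt.Printf("check %d: classify as model timeout (retryable)\n", check)
		}
	}
}
```

Requiring consecutive unhealthy checks means that a single healthy heartbeat clears the count, so a brief gateway blip does not end the run.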

4. Bootstrap diagnostic persistence

  • Added bootstrap_diagnostic column to claws table (schema + migration)
  • Added BootstrapDiagnostic field to types.Claw
  • stopAgentWithReason now persists sanitized diagnostic to DB and includes it in the broadcast payload
  • setBootstrapStatusWithDiagnostic helper added for gating failures
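
The description does not say what "sanitized" means here, so the sketch below only illustrates the general idea of cleaning raw bootstrap output before it is stored and shown in the UI. Every rule in it (dropping non-printable characters, collapsing whitespace, the 200-rune cap) and the name `sanitizeDiagnostic` are assumptions for the example, not the behavior of the PR's code.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// maxDiagnosticRunes is a hypothetical cap on the stored diagnostic length.
const maxDiagnosticRunes = 200

// sanitizeDiagnostic (hypothetical) makes raw output safe to persist and
// display on one line: whitespace becomes a single space, non-printable
// characters are dropped, and long messages are truncated.
func sanitizeDiagnostic(raw string) string {
	cleaned := strings.Map(func(r rune) rune {
		if unicode.IsSpace(r) {
			return ' '
		}
		if !unicode.IsPrint(r) {
			return -1 // drop the character
		}
		return r
	}, raw)

	// Collapse runs of spaces and trim both ends.
	cleaned = strings.Join(strings.Fields(cleaned), " ")

	runes := []rune(cleaned)
	if len(runes) > maxDiagnosticRunes {
		cleaned = string(runes[:maxDiagnosticRunes]) + "..."
	}
	return cleaned
}

func main() {
	fmt.Println(sanitizeDiagnostic("fatal:\n\trepo  missing\x00"))
}
```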

5. Tests

  • Added daytona_resilience_test.go with tests for:
    • elevated failure loop detection
    • tool-policy failure loop detection
    • single failure not triggering loop
    • mixed failure patterns
    • bootstrap output sanitization for workspace diagnostics

Verification

  • go build ./... passes
  • go test ./pkg/hub/... passes (all 100+ tests)

Closes #331

@greptile-apps

greptile-apps Bot commented Jun 3, 2026

Contributor

Comments Outside Diff (1)

  1. pkg/hub/server.go, line 2103-2114 (link)

    P1 The bootstrapDiagnostic field is fetched from the DB (the query and scan were both updated correctly), but it is not included in the initial WebSocket payload emitted to reconnecting clients. Any user who opens the dashboard after an agent has already failed will receive bootstrap_status but silently lose the diagnostic that explains why it failed — defeating the purpose of this PR.
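
    The gap the reviewer describes can be illustrated with a small sketch of an initial payload that carries both fields. The struct name `clawSnapshot`, the function `initialPayload`, and the JSON keys are assumptions for the example. Only the names bootstrap_status and bootstrap_diagnostic come from this PR.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// clawSnapshot is a hypothetical shape for the state sent to a client when
// it (re)connects. The JSON keys are assumptions.
type clawSnapshot struct {
	ID                  string `json:"id"`
	BootstrapStatus     string `json:"bootstrap_status"`
	BootstrapDiagnostic string `json:"bootstrap_diagnostic,omitempty"`
}

// initialPayload serializes the snapshot, including the diagnostic, so a
// dashboard opened after a failure still sees why the run failed.
func initialPayload(c clawSnapshot) ([]byte, error) {
	return json.Marshal(c)
}

func main() {
	b, err := initialPayload(clawSnapshot{
		ID:                  "c1",
		BootstrapStatus:     "failed",
		BootstrapDiagnostic: "workspace not ready",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}
```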

Reviews (1): Last reviewed commit: "feat: make Daytona workflow runs resilie..."

Comment thread pkg/hub/server.go Outdated
@greptile-apps

greptile-apps Bot commented Jun 3, 2026

Contributor

Reviews (2): Last reviewed commit: "Address Daytona resilience review feedba..."

@marccampbell marccampbell merged commit 7f0b35e into main Jun 3, 2026
11 checks passed

Development

Successfully merging this pull request may close these issues.

Make Daytona workflow runs resilient to workspace, tool-policy, and model timeout failures