Make Daytona workflow runs resilient to workspace, tool-policy, and model timeout failures #331

@marccampbell

Problem

Daytona-backed workflow runs can become brittle when the sandbox, OpenClaw runtime, or selected model provider fails after the claw has already been created. A run can appear active while the agent is blocked on missing workspace files, unavailable tool capabilities, or repeated model-provider idle timeouts.

This issue is intentionally sanitized. It does not include any task/story-specific details from the workflow that exposed the problem.

Observed failure classes

  • The agent attempted to read files under an expected cloned repository path, but the path was missing in the Daytona workspace.
  • The agent attempted an elevated shell command even though OpenClaw reported elevated execution was unavailable for the current runtime/provider policy.
  • The selected model provider timed out after the OpenClaw idle timeout and no fallback profile was configured, leaving the session stalled until the remote connection dropped.

Code context

  • Daytona provisioning is launched asynchronously from pkg/hub/workflow_creator.go after the claw DB row is inserted and marked provisioning.
  • pkg/hub/server.go sets Daytona claws to starting, then runs bootstrapDaytona in the background.
  • bootstrapDaytona currently pins OpenClaw to 2026.5.20, installs optional Nix/Docker, configures the gateway, stages plugin dependencies, writes workspace files, configures GitHub credentials, optionally clones repos, then sets bootstrap_ok=1 and starts claw-bridge.
  • Repository clone verification exists in the GitHub App path, but workspace repository availability should be treated as a first-class readiness gate whenever a workflow declares required repositories.
  • detectToolLoop currently catches repeated edit/write/read errors, but not repeated exec/elevated/tool-policy failures.
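
To make the gap in the last bullet concrete, here is a minimal sketch of a detector for repeated tool-policy failures. It is not the existing detectToolLoop. The type policyLoopDetector, its fields, and the threshold value are hypothetical, and the marker string is taken from the error text quoted later in this issue.

```go
package main

import (
	"fmt"
	"strings"
)

// policyLoopDetector counts consecutive tool errors that contain a
// policy-denial marker. Hypothetical type, not part of the hub today.
type policyLoopDetector struct {
	marker    string // substring that identifies the policy failure
	threshold int    // consecutive matches required before reporting a loop
	streak    int    // current run of matching errors
}

// observe records one tool result. It returns true exactly once per streak,
// at the moment the streak reaches the threshold. Any non-matching result
// (including success, passed as "") resets the streak.
func (d *policyLoopDetector) observe(toolError string) bool {
	if toolError == "" || !strings.Contains(toolError, d.marker) {
		d.streak = 0
		return false
	}
	d.streak++
	return d.streak == d.threshold
}

func main() {
	d := &policyLoopDetector{marker: "elevated is not available", threshold: 3}
	results := []string{
		"exec failed: elevated is not available",
		"exec failed: elevated is not available",
		"exec failed: elevated is not available",
	}
	for i, r := range results {
		if d.observe(r) {
			fmt.Printf("inject corrective hub message after result %d\n", i+1)
		}
	}
}
```

Firing only when the streak first reaches the threshold avoids injecting the same corrective message on every subsequent failure.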

Desired behavior

A workflow run should fail fast or recover cleanly instead of leaving the operator with a stuck agent session.

  1. Workspace readiness should be explicit:

    • Before setting bootstrap_ok=1 or starting the bridge, verify every configured workspace repository is present at the expected path and has a .git directory.
    • If repositories cannot be cloned because credentials are missing/unavailable, mark the claw as failed with a sanitized reason instead of starting the agent against an incomplete workspace.
    • Persist a concise bootstrap diagnostic that the UI can show without requiring SSH/log spelunking.
  2. Tool policy should match the runtime:

    • If Daytona/OpenClaw direct runtime cannot use elevated execution, the agent should be told that up front in injected context or OpenClaw config.
    • Detect repeated "exec failed: elevated is not available" errors and inject a corrective hub message, similar to the existing tool-loop detection.
    • Consider exposing tool capability status in the claw details panel so operators know whether elevated, browser, file, and process tools are available.
  3. Model-provider timeouts should be recoverable:

    • Detect OpenClaw/provider idle timeout failures and mark the current turn as failed/retryable instead of leaving the session stuck.
    • Add optional fallback model/profile configuration for workflow runs, especially for providers/models known to have long first-token latency.
    • Surface a clear UI/status message, such as "model timeout: no fallback configured", together with a retry action.
    • Consider increasing or configuring idle timeout for slow models, but only with a visible status so true hangs are still detectable.
  4. Connection loss should not hide root cause:

    • If SSH/Daytona disconnects after runtime failure, preserve the last known classified failure in the claw record.
    • Status should distinguish offline, model_timeout, bootstrap_failed, and workspace_incomplete where possible.
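
As a rough illustration of the workspace readiness gate in item 1, the sketch below checks that each required repository has a .git directory before bootstrap is allowed to continue. The function missingRepos and the repository names are hypothetical. The check runs against the local filesystem for simplicity, whereas the real check would presumably have to run inside the Daytona sandbox.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// missingRepos returns the repositories under root that are absent or that
// have no .git directory. Hypothetical helper for illustration only.
func missingRepos(root string, repos []string) []string {
	var missing []string
	for _, repo := range repos {
		info, err := os.Stat(filepath.Join(root, repo, ".git"))
		if err != nil || !info.IsDir() {
			missing = append(missing, repo)
		}
	}
	return missing
}

func main() {
	// Build a throwaway workspace with one of two required repos present.
	root, err := os.MkdirTemp("", "workspace")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(root)
	if err := os.MkdirAll(filepath.Join(root, "repo-a", ".git"), 0o755); err != nil {
		panic(err)
	}

	required := []string{"repo-a", "repo-b"}
	if missing := missingRepos(root, required); len(missing) > 0 {
		// Do not set bootstrap_ok=1 or start the bridge in this case.
		fmt.Printf("workspace_incomplete: %d of %d required repositories missing\n",
			len(missing), len(required))
		return
	}
	fmt.Println("workspace ready")
}
```

The diagnostic reports only counts, not paths or clone output, which keeps it short enough for the UI and avoids leaking workflow-specific details.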

Proposed implementation plan

  • Add a Daytona bootstrap readiness checklist and store per-step results in bootstrap_status or a structured column/table.
  • Promote repository clone/verify into a reusable required-workspace step that runs independently of the credential mechanism.
  • Extend tool-loop detection to include process/elevated/tool-policy failures.
  • Add model timeout classification based on bridge/OpenClaw messages, then expose it in claw status and UI.
  • Add tests for:
    • missing required repository prevents bootstrap_ok=1 and bridge start;
    • repeated elevated exec failures cause a corrective hub message;
    • model timeout/fallback-unavailable messages are persisted as a classified failure.
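
The third test case could be exercised against something like the sketch below, which classifies a bridge message and keeps the classified failure after a disconnect. Everything here is an assumption: the type clawRecord, its fields, and the "idle timeout" substring are placeholders, since the actual bridge/OpenClaw message format is not specified in this issue. The class names reuse the statuses listed under Desired behavior.

```go
package main

import (
	"fmt"
	"strings"
)

type failureClass string

const (
	classNone         failureClass = ""
	classOffline      failureClass = "offline"
	classModelTimeout failureClass = "model_timeout"
)

// clawRecord models only the two fields this sketch needs (hypothetical).
type clawRecord struct {
	Status      failureClass // what the UI shows right now
	LastFailure failureClass // last classified root cause, kept across disconnects
}

// classifyBridgeMessage maps a message to a failure class. The matched
// substring is a placeholder, not a confirmed OpenClaw message.
func classifyBridgeMessage(msg string) failureClass {
	if strings.Contains(strings.ToLower(msg), "idle timeout") {
		return classModelTimeout
	}
	return classNone
}

// recordFailure stores a classified failure; unclassified messages are ignored.
func (c *clawRecord) recordFailure(class failureClass) {
	if class == classNone {
		return
	}
	c.Status = class
	c.LastFailure = class
}

// markDisconnected flips the status to offline but leaves LastFailure intact,
// so the root cause is still visible after the connection drops.
func (c *clawRecord) markDisconnected() {
	c.Status = classOffline
}

func main() {
	c := &clawRecord{}
	c.recordFailure(classifyBridgeMessage("provider idle timeout, no fallback configured"))
	c.markDisconnected()
	fmt.Printf("status=%s last_failure=%s\n", c.Status, c.LastFailure)
}
```

Keeping the current status and the last classified failure in separate fields is one way to let a later disconnect update the status without overwriting the root cause.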

Acceptance criteria

  • A Daytona workflow with missing required repos does not start the agent; it fails with a sanitized, actionable bootstrap error.
  • Repeated unavailable elevated execution does not loop silently; the hub injects guidance or blocks the tool path.
  • Model idle timeout with no fallback produces a visible failed/retryable status and preserved diagnostic.
  • No workflow issue/story content, secrets, tokens, or raw long logs are included in persisted failure comments or GitHub issue comments.
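
For the last criterion, a minimal sketch of sanitizing a diagnostic before it is persisted might look like the following. The function sanitizeDiagnostic, the length limit, and the replacement text are hypothetical. The pattern covers only GitHub-style token prefixes and is an illustrative subset, so a real implementation would need a broader redaction list.

```go
package main

import (
	"fmt"
	"regexp"
)

// tokenPattern matches GitHub-style tokens by their known prefixes.
// This is an illustrative subset, not a complete secret detector.
var tokenPattern = regexp.MustCompile(`(?:gh[pousr]_|github_pat_)[A-Za-z0-9_]{20,}`)

// maxDiagnosticRunes caps persisted diagnostics (hypothetical limit).
const maxDiagnosticRunes = 200

// sanitizeDiagnostic redacts token-like substrings, then truncates the result
// so raw long logs are never stored in full.
func sanitizeDiagnostic(raw string) string {
	redacted := tokenPattern.ReplaceAllString(raw, "[REDACTED]")
	runes := []rune(redacted)
	if len(runes) > maxDiagnosticRunes {
		return string(runes[:maxDiagnosticRunes]) + "..."
	}
	return redacted
}

func main() {
	raw := "clone failed using ghp_abcdefghijklmnopqrstuvwxyz0123456789"
	fmt.Println(sanitizeDiagnostic(raw))
}
```

Redacting before truncating matters here: truncating first could cut a token in half and leave a fragment that no longer matches the pattern.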

Labels: agent-needs-review (The agent is asking a human for review)
