Make Daytona workflow runs resilient to workspace, tool-policy, and model timeout failures

## Problem

Daytona-backed workflow runs can become brittle when the sandbox, OpenClaw runtime, or selected model provider fails after the claw has already been created. A run can appear active while the agent is blocked on missing workspace files, unavailable tool capabilities, or repeated model-provider idle timeouts.

This issue is intentionally sanitized. It does not include any task/story-specific details from the workflow that exposed the problem.

## Observed failure classes

- The agent attempted to read files under an expected cloned repository path, but the path was missing in the Daytona workspace.
- The agent attempted an elevated shell command even though OpenClaw reported elevated execution was unavailable for the current runtime/provider policy.
- The selected model provider timed out after the OpenClaw idle timeout and no fallback profile was configured, leaving the session stalled until the remote connection dropped.

## Code context

- Daytona provisioning is launched asynchronously from `pkg/hub/workflow_creator.go` after the claw DB row is inserted and marked `provisioning`.
- `pkg/hub/server.go` sets Daytona claws to `starting`, then runs `bootstrapDaytona` in the background.
- `bootstrapDaytona` currently pins OpenClaw to `2026.5.20`, installs optional Nix/Docker, configures the gateway, stages plugin dependencies, writes workspace files, configures GitHub credentials, optionally clones repos, then sets `bootstrap_ok=1` and starts `claw-bridge`.
- Repository clone verification exists in the GitHub App path, but workspace repository availability should be treated as a first-class readiness gate whenever a workflow declares required repositories.
- `detectToolLoop` currently catches repeated edit/write/read errors, but not repeated exec/elevated/tool-policy failures.

## Desired behavior

A workflow run should fail fast or recover cleanly instead of leaving the operator with a stuck agent session.

1. Workspace readiness should be explicit:
   - Before setting `bootstrap_ok=1` or starting the bridge, verify every configured workspace repository is present at the expected path and has a `.git` directory.
   - If repositories cannot be cloned because credentials are missing/unavailable, mark the claw as failed with a sanitized reason instead of starting the agent against an incomplete workspace.
   - Persist a concise bootstrap diagnostic that the UI can show without requiring SSH/log spelunking.

2. Tool policy should match the runtime:
   - If Daytona/OpenClaw direct runtime cannot use elevated execution, the agent should be told that up front in injected context or OpenClaw config.
   - Detect repeated `exec failed: elevated is not available` errors and inject a corrective hub message, similar to the existing tool-loop detection.
   - Consider exposing tool capability status in the claw details panel so operators know whether elevated, browser, file, and process tools are available.

3. Model-provider timeouts should be recoverable:
   - Detect OpenClaw/provider idle timeout failures and mark the current turn as failed/retryable instead of leaving the session stuck.
   - Add optional fallback model/profile configuration for workflow runs, especially for providers/models known to have long first-token latency.
   - Surface a clear UI/status message such as `model timeout: no fallback configured` with a retry action.
   - Consider increasing or configuring idle timeout for slow models, but only with a visible status so true hangs are still detectable.

4. Connection loss should not hide root cause:
   - If SSH/Daytona disconnects after runtime failure, preserve the last known classified failure in the claw record.
   - Status should distinguish `offline`, `model_timeout`, `bootstrap_failed`, and `workspace_incomplete` where possible.

## Proposed implementation plan

- Add a Daytona bootstrap readiness checklist and store per-step results in `bootstrap_status` or a structured column/table.
- Promote repository clone/verify into a reusable required-workspace step that runs independently of the credential mechanism.
- Extend tool-loop detection to include process/elevated/tool-policy failures.
- Add model timeout classification based on bridge/OpenClaw messages, then expose it in claw status and UI.
- Add tests for:
  - missing required repository prevents `bootstrap_ok=1` and bridge start;
  - repeated elevated exec failures cause a corrective hub message;
  - model timeout/fallback-unavailable messages are persisted as a classified failure.

## Acceptance criteria

- A Daytona workflow with missing required repos does not start the agent; it fails with a sanitized, actionable bootstrap error.
- Repeated unavailable elevated execution does not loop silently; the hub injects guidance or blocks the tool path.
- Model idle timeout with no fallback produces a visible failed/retryable status and preserved diagnostic.
- No workflow issue/story content, secrets, tokens, or raw long logs are included in persisted failure comments or GitHub issue comments.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make Daytona workflow runs resilient to workspace, tool-policy, and model timeout failures #331

Problem

Observed failure classes

Code context

Desired behavior

Proposed implementation plan

Acceptance criteria

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Make Daytona workflow runs resilient to workspace, tool-policy, and model timeout failures #331

Description

Problem

Observed failure classes

Code context

Desired behavior

Proposed implementation plan

Acceptance criteria

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions