Brokered Azure lease allocation can time out before a late lease becomes ready #215

@brokemac79

Summary

Brokered Azure Linux lease allocation can time out at the end of the CLI/coordinator wait window with "no lease was returned", even though a lease from the same Azure path can later appear as active/ready and be usable via crabbox run --id.

This makes OpenClaw's Azure-backed Crabbox default unreliable for proof runs: the user-facing command fails as unavailable, while the underlying Azure VM may still be provisioning or may become usable too late for the caller.

Environment

  • Date observed: 2026-06-05
  • Client OS/shell: Windows, PowerShell
  • Crabbox CLI: 0.26.0
  • Broker: https://crabbox.openclaw.ai
  • Auth: GitHub broker auth for org openclaw
  • Repository/worktree: C:\oc-work\oc-87735
  • Repo config: OpenClaw .crabbox.yaml
  • Default OpenClaw provider in that repo: azure
  • Azure config in repo: location: eastus2
  • Tested provider/type: azure, Standard_D4ads_v6, market=on-demand, target=linux
  • Local rsync: installed and runnable (C:\Users\marti\.local\bin\rsync.cmd, rsync 3.4.2)

crabbox doctor --provider azure reached the broker/provider and reported provider=azure coordinator_secrets=ready. The only local doctor failure was the existing Windows config permission warning:

failed  config   C:\Users\marti\AppData\Roaming\crabbox\config.yaml: permissions 0666 want 0600
ok      broker   auth=github owner=martin_cleary@yahoo.co.uk org=openclaw default_type=Standard_D32ads_v6
ok      provider provider=azure coordinator_secrets=ready

Reproduction

From C:\oc-work\oc-87735:

pnpm crabbox:run -- --type Standard_D4ads_v6 --market on-demand --idle-timeout 10m --ttl 20m --timing-json --no-sync --no-hydrate --stop-after always --shell -- "echo CRABBOX_AZURE_SMOKE_OK; uname -srm; whoami; pwd"

This is intentionally a tiny no-sync/no-hydrate command so the result isolates lease allocation/SSH readiness rather than repo sync or test setup.

Observed Behavior

The command waited for a coordinator lease for the full 10-minute acquire window, then failed:

[crabbox] bin=..\..\Users\marti\.local\bin\crabbox.exe version=0.26.0 provider=azure providers=...
recording run run_87a63bdd35c2
coordinator lease class=standard preferred_type=Standard_D4ads_v6 keep=false slug=amber-barnacle idle_timeout=10m0s ttl=20m0s
waiting for coordinator lease provider=azure slug=amber-barnacle elapsed=30s timeout=10m0s
...
waiting for coordinator lease provider=azure slug=amber-barnacle elapsed=9m30s timeout=10m0s
timed out waiting for coordinator lease after 10m0s provider=azure target=linux type=Standard_D4ads_v6 slug=amber-barnacle lease=cbx_ddea6cab6b52; no lease was returned; next_action=check coordinator/cloud logs and retry, then run `crabbox stop --provider azure --target linux --id cbx_ddea6cab6b52` if a late lease appears
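The timeout line above already carries the hinted lease id, so a wrapper script could capture it for later status or stop calls. The following is only a sketch: the function name `hinted_lease_id` is hypothetical, and the assumption that lease ids are `cbx_` followed by hex digits is inferred from the ids shown in this report, not from Crabbox documentation.

```python
import re
from typing import Optional

# Assumption: lease ids look like "cbx_" plus lowercase hex digits,
# as in the log lines quoted in this issue.
LEASE_RE = re.compile(r"\blease=(cbx_[0-9a-f]+)")


def hinted_lease_id(line: str) -> Optional[str]:
    """Return the lease id hinted in a coordinator timeout line, or None."""
    if "timed out waiting for coordinator lease" not in line:
        return None
    match = LEASE_RE.search(line)
    return match.group(1) if match else None


timeout_line = (
    "timed out waiting for coordinator lease after 10m0s provider=azure "
    "target=linux type=Standard_D4ads_v6 slug=amber-barnacle "
    "lease=cbx_ddea6cab6b52; no lease was returned"
)
print(hinted_lease_id(timeout_line))
```

The captured id can then be passed to the `crabbox status` and `crabbox stop` commands shown in this report.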

Immediately after the timeout, the hinted late lease id was not visible to the user:

crabbox status --provider azure --id cbx_ddea6cab6b52
coordinator GET /v1/leases/cbx_ddea6cab6b52: http 404: {"error":"not_found"}

In the same troubleshooting session, a separate Azure attempt from another chat showed the more worrying late-lease behavior directly: the command timed out after 10 minutes, but the reported late lease later appeared in the user-visible lease list as active/ready:

crabbox-harbor-crab-78988ccd active Standard_D4ads_v6 20.101.44.161 lease=cbx_2513f241d618 slug=harbor-crab keep=false target=linux

Status/inspect showed it was ready:

cbx_2513f241d618 slug=harbor-crab provider=azure target=linux state=active type=Standard_D4ads_v6 host=20.101.44.161 ready=true has_host=true idle_timeout=1h30m0s

A no-sync attach command against that late/ready Azure lease succeeded:

pnpm crabbox:run -- --provider azure --id cbx_2513f241d618 --no-sync --no-hydrate --timing-json --stop-after never --shell -- "echo CRABBOX_AZURE_REUSE_OK; uname -srm; whoami; pwd"

Output:

CRABBOX_AZURE_REUSE_OK
Linux 7.0.0-1004-azure x86_64
crabbox
/work/crabbox/cbx_2513f241d618/oc-87735

Timing summary:

{"provider":"azure","leaseId":"cbx_2513f241d618","slug":"harbor-crab","syncMs":0,"syncSkipped":true,"commandMs":1845,"totalMs":2353,"exitCode":0,"runId":"run_c85ec4db1c50","machineType":"Standard_D4ads_v6"}

Additional Cleanup Evidence

After filing this issue, the portal showed both late Azure leases as active:

  • cbx_2513f241d618 / harbor-crab
  • cbx_ddea6cab6b52 / amber-barnacle

harbor-crab released successfully:

crabbox stop --provider azure --target linux --id cbx_2513f241d618
released lease=cbx_2513f241d618 server=crabbox-harbor-crab-78988ccd

amber-barnacle is more concerning. It is visible as active/ready even though keep=false, idle_timeout=10m0s, and expiresAt is already in the past:

cbx_ddea6cab6b52 slug=amber-barnacle provider=azure target=linux state=active type=Standard_D4ads_v6 host=52.157.75.123 ready=true has_host=true idle_for=28m2s idle_timeout=10m0s expires=2026-06-05T18:37:13.214Z
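To make the inconsistency in that status line concrete, here is a minimal sketch that parses the `key=value` fields and flags a lease that is still active despite exceeding its idle timeout or expiry. The helper names are hypothetical, the duration parser only handles the `h`/`m`/`s` forms seen in this report, and the `now` value is an assumed observation time chosen for illustration.

```python
import re
from datetime import datetime, timedelta, timezone


def parse_go_duration(text: str) -> timedelta:
    """Parse durations such as '28m2s' or '1h30m0s' (hours, minutes, seconds only)."""
    match = re.fullmatch(r"(?:(\d+)h)?(?:(\d+)m)?(?:(\d+(?:\.\d+)?)s)?", text)
    if not text or not match:
        raise ValueError(f"unsupported duration: {text!r}")
    hours, minutes, seconds = (float(g) if g else 0.0 for g in match.groups())
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def parse_status_line(line: str) -> dict:
    """Split a status line into fields; the first token is taken as the lease id."""
    tokens = line.split()
    fields = {"id": tokens[0]}
    for token in tokens[1:]:
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def staleness_reasons(line: str, now: datetime) -> list:
    """List the reasons an active lease looks overdue for cleanup."""
    fields = parse_status_line(line)
    reasons = []
    if fields.get("state") != "active":
        return reasons
    if "idle_for" in fields and "idle_timeout" in fields:
        if parse_go_duration(fields["idle_for"]) > parse_go_duration(fields["idle_timeout"]):
            reasons.append("idle_for exceeds idle_timeout")
    if "expires" in fields:
        # Replace the trailing "Z" so older Python versions can parse the timestamp.
        expires = datetime.fromisoformat(fields["expires"].replace("Z", "+00:00"))
        if expires < now:
            reasons.append("expires is in the past")
    return reasons


status = (
    "cbx_ddea6cab6b52 slug=amber-barnacle provider=azure target=linux state=active "
    "type=Standard_D4ads_v6 host=52.157.75.123 ready=true has_host=true "
    "idle_for=28m2s idle_timeout=10m0s expires=2026-06-05T18:37:13.214Z"
)
# Assumed observation time, for illustration only.
now = datetime(2026, 6, 5, 19, 0, tzinfo=timezone.utc)
print(staleness_reasons(status, now))
```

With the assumed `now`, both checks fire for amber-barnacle, which matches the observation that the lease outlived its idle timeout and expiry while still reporting active.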

A manual release attempt with a long local timeout failed at the broker release endpoint:

crabbox stop --provider azure --target linux --id cbx_ddea6cab6b52
Post "https://crabbox.openclaw.ai/v1/leases/cbx_ddea6cab6b52/release": context deadline exceeded

A follow-up list/status/inspect still showed it active/ready. So this issue covers both late lease visibility and a cleanup/release timeout for at least one late Azure lease.

Expected Behavior

One of these should happen:

  1. Azure allocation returns the lease once the VM becomes SSH-ready, within the configured wait window for normal OpenClaw proof runs.
  2. If Azure provisioning is legitimately slow, the CLI/coordinator reports a precise capacity/provisioning-delay state instead of a generic acquire timeout.
  3. If a lease is still provisioning after the caller times out, late lease cleanup/status is reliable: the hinted lease id should be visible, inspectable, and stoppable once it exists.
  4. The CLI should not leave the operator in a state where the proof run fails but a paid Azure lease later becomes active outside the failed run's control path.
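Items 3 and 4 could be approached along these lines. This is a rough sketch under stated assumptions, not how Crabbox or the coordinator actually works: `poll` and `release` are hypothetical callables standing in for a lease status check and a release call, and the outcome labels are invented for illustration. The idea is that a caller who hits the acquire timeout keeps watching for a grace period and releases any lease that shows up late, so no paid lease is left outside the failed run's control.

```python
import time
from typing import Callable, Optional, Tuple


def acquire_or_cleanup(
    poll: Callable[[], Optional[str]],   # hypothetical: returns a lease id once ready, else None
    release: Callable[[str], None],      # hypothetical: releases a lease by id
    timeout_s: float,
    grace_s: float,
    interval_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[str, Optional[str]]:
    """Wait for a lease; release it if it only becomes ready after the timeout."""
    start = clock()
    while True:
        lease_id = poll()
        elapsed = clock() - start
        if lease_id is not None:
            if elapsed <= timeout_s:
                return ("ready", lease_id)
            # The lease arrived too late for the caller: clean it up.
            release(lease_id)
            return ("late_released", lease_id)
        if elapsed >= timeout_s + grace_s:
            return ("never_appeared", None)
        sleep(interval_s)
```

A real CLI would more likely report the timeout to the user immediately and run the grace-period watch in the background or on the coordinator; the single loop here only keeps the sketch short.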

Why This Looks Reportable

This does not appear to be a local user-auth problem:

  • Broker auth is configured and works for the openclaw org.
  • crabbox doctor --provider azure reaches the broker/provider and reports Azure coordinator secrets ready.
  • An already-ready Azure lease can be attached to and used successfully.
  • AWS brokered leases are usable from the same machine/session.

This also does not appear to be repo sync/test setup, because the failing repro uses --no-sync --no-hydrate and only tries to run echo, uname, whoami, and pwd.

Related PR history suggests small Azure Linux brokered warmups have previously completed well inside 10 minutes:

The current symptom is therefore either a real Azure capacity/provisioning latency issue that needs better surfacing, or a coordinator/CLI late-lease lifecycle bug.

Image

The screenshot above is from the UI, where I could see the boxes still available afterwards.

Acceptance Criteria

  • A fresh brokered Azure no-sync smoke either succeeds or returns a clear, actionable capacity/provisioning status.
  • Late Azure leases created by timed-out attempts are consistently visible to status/inspect/list once they exist.
  • Timed-out attempts do not leave untracked active Azure leases, or the CLI provides a reliable cleanup command that works after late provisioning completes.
  • If the right fix is a longer/default Azure acquire timeout for Standard_D4ads_v6/managed OS disk paths, document that expectation in the provider docs and OpenClaw .crabbox.yaml guidance.

Metadata

Labels

  • P2: Normal priority bug or improvement with limited blast radius.
  • clawsweeper:needs-live-repro: ClawSweeper needs live local, crabbox, or manual validation to confirm this issue.
  • clawsweeper:needs-maintainer-review: ClawSweeper marked this issue as needing maintainer review before automation.
  • clawsweeper:needs-product-decision: ClawSweeper marked this issue as needing a product or behavior decision.
  • clawsweeper:no-new-fix-pr: ClawSweeper does not recommend queueing a new automated fix PR for this issue.
  • impact:other: This issue has meaningful maintainer-visible impact outside the owned taxonomy.
  • issue-rating: 🐚 platinum hermit: Good issue quality with a plausible reproduction path needing some confirmation.
