
bug(onboard): Onboard does not detect or recover from TLS cert errors during sandbox create #1933

@jyaunches


Summary

When openshell sandbox create fails with a TLS certificate error (invalid peer certificate: BadSignature), NemoClaw's onboard flow exits with code 1 and prints only a generic recovery suggestion. The classifySandboxCreateFailure() function in bin/lib/onboard.js has no case for TLS/certificate errors, so the failure is classified as "unknown".

This means:

  1. The user gets no specific guidance on how to fix the cert mismatch
  2. nemoclaw onboard --resume retries the same failing sandbox create without fixing the underlying TLS trust
  3. The 5+ minutes spent building and uploading the sandbox image is wasted on each retry

Environment

  • Hardware: NVIDIA DGX Spark (Founders Edition), GB10, 128 GB unified memory
  • Architecture: aarch64
  • OS: Ubuntu 24.04.4 LTS
  • NemoClaw version: 0.0.16 (from source, main branch as of 2026-04-15)
  • OpenShell version: 0.0.26

Reproduction Steps

  1. Complete a NemoClaw onboard with a sandbox running
  2. The gateway is destroyed and recreated (e.g., during troubleshooting or due to containerd issues; see OpenShell docs: restructure spark-install.md for clearer onboarding flow #857)
  3. Run nemoclaw onboard to create a new sandbox
  4. Image builds successfully (all 45 Dockerfile steps pass)
  5. Image uploads to gateway successfully
  6. openshell sandbox create fails with:
    Error: × status: Unavailable, message: "invalid peer certificate: BadSignature",
    │ details: [], metadata: MetadataMap { headers: {} }
    ├─▶ transport error
    ├─▶ invalid peer certificate: BadSignature
    ╰─▶ invalid peer certificate: BadSignature
    
  7. Onboard prints generic recovery:
    Recovery: nemoclaw onboard --resume
    Or:      nemoclaw onboard
    

Root Cause

In bin/lib/onboard.js, classifySandboxCreateFailure() (line ~1072) checks for:

  • image_transfer_timeout (regex: failed to read image export stream|Timeout error)
  • image_transfer_reset (regex: Connection reset by peer)
  • sandbox_create_incomplete (regex: Created sandbox:)

But there is no case for TLS/certificate errors like BadSignature, invalid peer certificate, handshake verification failed, or transport error. These all fall through to the "unknown" catch-all.
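
The fall-through can be reproduced with a small model of the classifier. This is a sketch built only from the three patterns listed above: the function name, the parameter list, and the way uploadedToGateway is passed in are assumptions, and the real code in bin/lib/onboard.js may be structured differently.

```javascript
// Hypothetical model of the current classifier. Only the three regexes and
// the "unknown" catch-all come from the issue; everything else is assumed.
function classifySandboxCreateFailureSketch(text, uploadedToGateway = false) {
  if (/failed to read image export stream|Timeout error/.test(text)) {
    return { kind: "image_transfer_timeout", uploadedToGateway };
  }
  if (/Connection reset by peer/.test(text)) {
    return { kind: "image_transfer_reset", uploadedToGateway };
  }
  if (/Created sandbox:/.test(text)) {
    return { kind: "sandbox_create_incomplete", uploadedToGateway };
  }
  // No TLS/certificate case exists, so cert errors end up here.
  return { kind: "unknown", uploadedToGateway };
}

// Error text taken from the reproduction steps above.
const tlsFailure =
  'status: Unavailable, message: "invalid peer certificate: BadSignature"';

console.log(classifySandboxCreateFailureSketch(tlsFailure, true).kind);
// prints "unknown"
```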

Additionally, the onboard flow at step [2/8] (gateway startup) validates that the gateway is healthy but does not verify that the openshell CLI's TLS trust is still valid against the running gateway before proceeding to the image build.

Expected Behavior

  1. classifySandboxCreateFailure() should detect certificate/TLS errors and suggest:

    Hint: TLS certificate mismatch — the gateway's certificate does not match the CLI's cached trust.
    Fix:  openshell gateway trust -g nemoclaw
    Then: nemoclaw onboard --resume
    
  2. Onboard step [2/8] should verify TLS trust after confirming gateway health, before starting the image build. If trust is stale, auto-refresh it.

  3. Ideally, openshell sandbox create itself should detect the cert mismatch and auto-retry with a trust refresh (this would be an OpenShell fix — see OpenShell fix: revert "ci: remove redundant docs workflow" #856).
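
A rough sketch of the trust pre-check from item 2, written against injected callbacks so it does not assume any particular OpenShell command. Both probe (any cheap authenticated call to the gateway) and refreshTrust (for example, a wrapper around openshell gateway trust -g nemoclaw) are hypothetical names, as is ensureGatewayTrust itself.

```javascript
// Same patterns as the regex proposed in this issue.
const TLS_ERROR =
  /invalid peer certificate|BadSignature|handshake verification failed|certificate verify failed/i;

// Hypothetical helper: probe() and refreshTrust() are supplied by the caller.
// Resolves to "trusted" if the first probe succeeds, or "refreshed" if a TLS
// error was seen and the probe succeeds after refreshing trust.
async function ensureGatewayTrust({ probe, refreshTrust }) {
  try {
    await probe();
    return "trusted";
  } catch (err) {
    const text = String((err && err.message) || err);
    // Anything that is not a TLS/certificate error is not ours to fix.
    if (!TLS_ERROR.test(text)) throw err;
  }
  await refreshTrust();
  await probe(); // rejects if trust is still stale after the refresh
  return "refreshed";
}

// Usage with fakes that simulate a stale certificate.
let fakeTrustIsFresh = false;
ensureGatewayTrust({
  probe: async () => {
    if (!fakeTrustIsFresh) {
      throw new Error("invalid peer certificate: BadSignature");
    }
  },
  refreshTrust: async () => {
    fakeTrustIsFresh = true;
  },
}).then((state) => console.log(state)); // prints "refreshed"
```

Running this check before the image build would surface a stale certificate in seconds rather than after the build and upload have finished.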

Suggested Code Change

In classifySandboxCreateFailure(), add before the final return:

if (/invalid peer certificate|BadSignature|handshake verification failed|certificate verify failed/i.test(text)) {
  return {
    kind: "tls_cert_mismatch",
    uploadedToGateway,
  };
}

And in printSandboxCreateRecoveryHints():

if (failure.kind === "tls_cert_mismatch") {
  console.error("  Hint: TLS certificate mismatch — the gateway certificate changed since the CLI last trusted it.");
  console.error("  Fix:  openshell gateway trust -g nemoclaw");
  console.error("  Then: nemoclaw onboard --resume");
  return;
}
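
As a quick sanity check, the proposed pattern can be run against the error text from the reproduction steps and against the strings the existing cases match. The sample strings are taken from this issue; the variable names are only illustrative.

```javascript
// Regex copied from the suggested change above.
const TLS_CERT_ERROR =
  /invalid peer certificate|BadSignature|handshake verification failed|certificate verify failed/i;

const samples = {
  tls: 'status: Unavailable, message: "invalid peer certificate: BadSignature"',
  transportOnly: "transport error",
  reset: "Connection reset by peer",
  timeout: "failed to read image export stream",
};

// Collect which samples the TLS pattern matches.
const matches = Object.fromEntries(
  Object.entries(samples).map(([name, text]) => [name, TLS_CERT_ERROR.test(text)])
);

console.log(matches);
// { tls: true, transportOnly: false, reset: false, timeout: false }
```

Note that a bare "transport error" is not matched, even though Root Cause lists it among the unhandled messages. That may be deliberate, since a transport error can have causes other than TLS; if it should be covered, it probably needs its own failure kind rather than the cert-mismatch hint.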

Labels

area: cli (command line interface, flags, terminal UX, or output)
area: sandbox (OpenShell sandbox lifecycle, runtime, config, or recovery)
platform: ubuntu (affects Ubuntu Linux environments)
