feat: proactive Slack token health check and gateway crash prevention #119

Merged
Siri-Ray merged 4 commits into main from feat/slack-token-health-check
Mar 6, 2026

lefarcen (Collaborator) commented Mar 6, 2026

Summary

  • Add POST /api/internal/pools/{poolId}/check-slack-tokens endpoint that validates Slack bot tokens via auth.test, marks invalid
    ones as error, and triggers config regeneration
  • Gateway sidecar calls this at bootstrap (before OpenClaw starts) and every 5 minutes via runSlackTokenHealthLoop
  • Handle tokens_revoked and app_uninstalled Slack events reactively in slack-events.ts
  • Add tokens_revoked and app_uninstalled to Slack App manifest bot events
  • Clear stale gateway lock files at bootstrap and before each restart to prevent GatewayLockError CrashLoop
  • Add timeout (120 attempts / ~2 min) to waitGatewayReady to prevent infinite spin when OpenClaw fails to start
  • Add startup crash analysis doc covering all discovered OpenClaw crash scenarios
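
The token validation in the first bullet can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from this PR: `classifyAuthTest`, `checkSlackToken`, the `HttpPost` type, and the list of error codes treated as fatal are hypothetical. Only the `auth.test` method and the `ok` / `error` fields of its response come from the Slack Web API.

```typescript
// Hypothetical sketch: names and the fatal-error list are assumptions, not taken from the PR.
type TokenStatus = "valid" | "invalid" | "unknown";

// Slack error codes assumed to mean the token itself is dead.
const FATAL_AUTH_ERRORS = new Set([
  "invalid_auth",
  "token_revoked",
  "token_expired",
  "account_inactive",
]);

// Classify the JSON body returned by Slack's auth.test method.
function classifyAuthTest(body: unknown): TokenStatus {
  if (typeof body !== "object" || body === null) return "unknown";
  const { ok, error } = body as { ok?: unknown; error?: unknown };
  if (ok === true) return "valid";
  if (typeof error === "string" && FATAL_AUTH_ERRORS.has(error)) return "invalid";
  // Rate limits, outages, etc. should not mark a healthy account as dead.
  return "unknown";
}

// Minimal HTTP shape so the sketch does not depend on a particular fetch typing.
type HttpPost = (
  url: string,
  init: { method: string; headers: Record<string, string> },
) => Promise<{ json(): Promise<unknown> }>;

async function checkSlackToken(token: string, post: HttpPost): Promise<TokenStatus> {
  try {
    const res = await post("https://slack.com/api/auth.test", {
      method: "POST",
      headers: { Authorization: `Bearer ${token}` },
    });
    return classifyAuthTest(await res.json());
  } catch {
    // Network failure: leave the account untouched and retry on the next cycle.
    return "unknown";
  }
}
```

Keeping an "unknown" outcome separate from "invalid" is a deliberate choice in this sketch: only a definitive answer from Slack would justify marking an account as error and regenerating the config.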

Context

When a Slack bot token is revoked, OpenClaw crashes on auth.test() during startup. The sidecar restarts it, but the config still
contains the dead account, causing a persistent CrashLoop. A separate issue: stale gateway lock files from unclean exits also cause
GatewayLockError → CrashLoop.

This PR adds three layers of defense:

  1. Proactive: periodic token validation removes dead accounts before they crash OpenClaw
  2. Reactive: tokens_revoked / app_uninstalled events trigger immediate cleanup
  3. Resilience: lock file cleanup and waitGatewayReady timeout prevent two other CrashLoop patterns
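
As an illustration of the reactive layer, the sketch below decides whether an incoming Events API payload should trigger cleanup. It is not the contents of slack-events.ts: `planCleanup`, `CleanupAction`, and the rule that only a revoked bot token matters are assumptions. The `team_id`, `event.type`, and `event.tokens` fields follow the documented Slack event payloads.

```typescript
// Hypothetical sketch: the CleanupAction shape and function name are assumptions.
interface SlackEventEnvelope {
  team_id?: string;
  event?: {
    type?: string;
    tokens?: { oauth?: string[]; bot?: string[] };
  };
}

interface CleanupAction {
  teamId: string;
  reason: "tokens_revoked" | "app_uninstalled";
}

// Decide whether an incoming Events API payload should trigger account cleanup.
function planCleanup(envelope: SlackEventEnvelope): CleanupAction | null {
  const teamId = envelope.team_id;
  const type = envelope.event?.type;
  if (!teamId || !type) return null;

  if (type === "app_uninstalled") {
    return { teamId, reason: "app_uninstalled" };
  }
  if (type === "tokens_revoked") {
    // Assumption: only a revoked bot token invalidates the stored bot account.
    const botTokens = envelope.event?.tokens?.bot ?? [];
    return botTokens.length > 0 ? { teamId, reason: "tokens_revoked" } : null;
  }
  return null;
}
```

Returning a plain action object keeps the decision separate from its side effects, so a caller could mark the account as error and trigger config regeneration in one place.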

Summary by CodeRabbit

  • UX Improvements

    • Simplified the Slack OAuth manual setup workflow with streamlined instructions and reduced step count.
  • Reliability Improvements

    • Enhanced gateway startup robustness with readiness probe timeout safeguards to prevent infinite retries.
    • Improved cleanup of stale gateway lock files during startup and process restart.
  • Documentation

    • Added comprehensive design documentation covering gateway startup crash scenarios and implemented mitigations.

coderabbitai (Bot) commented Mar 6, 2026

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Free

Run ID: 6c684a18-8231-4b01-ab6e-658b7dabe590

📥 Commits

Reviewing files that changed from the base of the PR and between 2a0cb9b and 9074799.

📒 Files selected for processing (1)
  • apps/web/src/components/channel-setup/slack-oauth-view.tsx
🚧 Files skipped from review as they are similar to previous changes (1)
  • apps/web/src/components/channel-setup/slack-oauth-view.tsx

Walkthrough

The changes improve startup resilience for the OpenClaw gateway: Slack tokens are validated before bootstrap, stale lock files are cleaned up at bootstrap and on process restart, the readiness probe is capped at a maximum number of attempts to prevent infinite retries, and the Slack OAuth setup instructions are updated. A design document cataloging startup crash scenarios and their mitigations is also added.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Startup Resilience: apps/gateway/src/bootstrap.ts, apps/gateway/src/gateway-health.ts, apps/gateway/src/openclaw-process.ts | Added Slack token health check during bootstrap startup, implemented stale gateway lock file cleanup in multiple startup phases (bootstrap and process restart), and introduced MAX_READY_ATTEMPTS timeout cap (120) to prevent infinite readiness probe retries. |
| UI Updates: apps/web/src/components/channel-setup/slack-oauth-view.tsx | Removed "Open Slack App Dashboard" UI block from Step 2 of manual Slack OAuth setup flow and adjusted step numbering accordingly. |
| Documentation: docs/design-docs/index.md, docs/design-docs/openclaw-startup-crash-analysis.md | Added reference to new design document in index and created comprehensive design doc detailing 12 fatal startup crash scenarios (F1–F12), channel-specific failures, sidecar-level failures (X1–X4), and mapped mitigations including Slack token checks, lock cleanup, and timeout adjustments. |
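
The readiness cap summarized above can be sketched as a bounded polling loop. This is a hedged illustration rather than the PR's waitGatewayReady: `waitUntilReady`, the injected `probe` and `sleep` functions, and the 1-second interval are assumptions. Only the cap of 120 attempts (about 2 minutes) comes from the PR description.

```typescript
// Hypothetical sketch: only MAX_READY_ATTEMPTS = 120 comes from the PR summary.
const MAX_READY_ATTEMPTS = 120;

type Probe = () => Promise<boolean>;
type Sleep = (ms: number) => Promise<void>;

// Poll until the probe reports ready, giving up after maxAttempts tries.
// Returns the attempt number that succeeded; throws on timeout.
async function waitUntilReady(
  probe: Probe,
  sleep: Sleep,
  maxAttempts: number = MAX_READY_ATTEMPTS,
  intervalMs: number = 1000, // assumed: 120 attempts x 1 s is roughly 2 minutes
): Promise<number> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    // A probe that rejects (e.g. connection refused) counts as "not ready yet".
    const ready = await probe().catch(() => false);
    if (ready) return attempt;
    if (attempt < maxAttempts) await sleep(intervalMs);
  }
  throw new Error(`gateway not ready after ${maxAttempts} attempts`);
}
```

Throwing on timeout hands the decision back to the caller, which can then clean up and restart instead of spinning forever when the child process never comes up.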

Sequence Diagram(s)

sequenceDiagram
    participant Gateway as Gateway Bootstrap
    participant SlackAPI as Slack Token<br/>Validator
    participant FS as File System
    participant SessionMgr as Session Lock<br/>Cleanup
    participant GatewayLocks as Gateway Lock<br/>Cleanup
    
    Gateway->>SlackAPI: checkSlackTokens()
    alt Token Check Passes
        SlackAPI-->>Gateway: success
        Gateway->>Gateway: log token health OK
    else Token Check Fails
        SlackAPI-->>Gateway: error reason
        Gateway->>Gateway: log warning, continue
    end
    
    Gateway->>SessionMgr: clearStaleSessionLocks()
    SessionMgr->>FS: cleanup session.* files
    FS-->>SessionMgr: complete
    SessionMgr-->>Gateway: done
    
    Gateway->>GatewayLocks: clearStaleGatewayLocks()
    GatewayLocks->>FS: readdir tmpdir/openclaw-{uid}
    FS-->>GatewayLocks: list files
    GatewayLocks->>FS: remove gateway.*.lock files
    FS-->>GatewayLocks: cleanup complete
    GatewayLocks-->>Gateway: log cleanup count
    
    Gateway->>Gateway: fetchInitialConfigWithRetry()
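
The gateway lock cleanup step in the diagram above could look roughly like the sketch below. It is not the PR's clearStaleGatewayLocks: `FsLike`, `selectGatewayLocks`, `clearGatewayLocks`, and the exact filename pattern are assumptions inferred from the `gateway.*.lock` wording. The file system is injected so the logic can be exercised without touching disk.

```typescript
// Hypothetical sketch: the FsLike interface and function names are assumptions.
interface FsLike {
  readdir(dir: string): Promise<string[]>;
  rm(path: string): Promise<void>;
}

// Matches names of the form gateway.<anything>.lock (assumed pattern).
const GATEWAY_LOCK_PATTERN = /^gateway\..+\.lock$/;

function selectGatewayLocks(names: string[]): string[] {
  return names.filter((name) => GATEWAY_LOCK_PATTERN.test(name));
}

// Remove gateway lock files from lockDir; returns how many were removed.
async function clearGatewayLocks(fs: FsLike, lockDir: string): Promise<number> {
  let names: string[];
  try {
    names = await fs.readdir(lockDir);
  } catch {
    return 0; // directory missing or unreadable: nothing to clean
  }
  let removed = 0;
  for (const name of selectGatewayLocks(names)) {
    try {
      await fs.rm(`${lockDir}/${name}`);
      removed++;
    } catch {
      // Best effort: a file that cannot be removed must not block startup.
    }
  }
  return removed;
}
```

In a Node.js process, the `FsLike` argument could be a thin wrapper around `readdir` and `rm` from `fs/promises`, with `lockDir` pointing at the `openclaw-{uid}` directory under the temp dir shown in the diagram.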

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 Lock files tumble, tokens ring so true,
Timeouts tick and startup's born anew,
Slack dashboards fade, but steps remain,
Gateway hops through startup's domain—
Crash docs chronicle the way,
Resilience blooms today!



Siri-Ray merged commit 6875bbd into main on Mar 6, 2026
2 checks passed
lefarcen added a commit that referenced this pull request Apr 1, 2026
Add a dismiss-once promo banner on Home (both idle and active states)
and a two-step modal: GitHub Star → join Feishu group to apply for a
Seedance 2.0 experience Key. Visual style aligned with the prototype
in refly-ai/agent-digital-cowork PR #119 and #120.
alchemistklk pushed a commit that referenced this pull request Apr 1, 2026
* feat(web): add Seedance 2.0 promo banner and modal flow

Add a dismiss-once promo banner on Home (both idle and active states)
and a two-step modal: GitHub Star → join Feishu group to apply for a
Seedance 2.0 experience Key. Visual style aligned with the prototype
in refly-ai/agent-digital-cowork PR #119 and #120.

* fix(web): update Seedance tutorial URL to docs.nexu.io

* fix(web): resolve biome a11y lint errors in seedance promo