feat: wire SQLite message store into active gateway for SIGTERM resilience #65650

@rijhsinghani

Problem

When the gateway receives SIGTERM mid-processing (e.g., config reload, launchd restart), in-flight messages are permanently lost. Socket Mode has already ACK'd the event, so Slack never re-delivers it. The agent goes silent, which from the user's side is indistinguishable from "still thinking."

Real incident: An agent received a thread message, started processing (10 tool calls, 68 seconds of work), then SIGTERM hit at the exact moment a tool call was in progress. The run was killed, no reply was ever sent, and the message was permanently lost from the queue.

Root cause: The gateway config change at 20:12:55 triggered a SIGTERM at 20:16:43 via a reload cascade. Related: #43406 (SIGTERM on config reload).

Proposed Fix

The SQLite message store (src/message-store.ts) already exists as Phase 26 prep but is not wired into the active gateway code path (src/index.ts + src/bolt-app.ts). The fix is pure wiring — no new infrastructure:

1. message-store.ts — Add requeueProcessing()

export function requeueProcessing(): number {
  const db = getDb();
  const result = db.prepare(
    `UPDATE message_queue SET status = 'queued', picked_up_by = NULL WHERE status = 'processing'`
  ).run();
  return result.changes;
}
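
To make the effect of that UPDATE concrete, here is a minimal in-memory sketch of the same transition. The `QueueRow` shape and the `done` and `failed` status names are assumptions for illustration; only `queued`, `processing`, and `picked_up_by` appear in the statement above.

```typescript
// In-memory stand-in for the message_queue table; the real store uses SQLite.
type RowStatus = "queued" | "processing" | "done" | "failed";

interface QueueRow {
  id: number;
  status: RowStatus;
  picked_up_by: string | null;
}

// Mirrors: UPDATE message_queue SET status = 'queued', picked_up_by = NULL
//          WHERE status = 'processing'
// Returns the number of rows touched, like `result.changes`.
function requeueProcessingInMemory(rows: QueueRow[]): number {
  let changes = 0;
  for (const row of rows) {
    if (row.status === "processing") {
      row.status = "queued";
      row.picked_up_by = null;
      changes += 1;
    }
  }
  return changes;
}

const queueRows: QueueRow[] = [
  { id: 1, status: "processing", picked_up_by: "bolt-handler" },
  { id: 2, status: "done", picked_up_by: "bolt-handler" },
  { id: 3, status: "queued", picked_up_by: null },
];

const requeuedCount = requeueProcessingInMemory(queueRows);
console.log(requeuedCount); // 1
```

Only the row that was mid-processing is touched; completed and already-queued rows are left alone.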

2. enqueueMessage() — Add optional status + picked_up_by params

Allows a single INSERT with status='processing', instead of an INSERT as 'queued' followed by an UPDATE, which halves hot-path SQLite writes (WAL mode).
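
A rough sketch of why the optional parameters save a write, using an in-memory array in place of SQLite. `SketchStore`, its `enqueue` and `markProcessing` methods, and the write counter are hypothetical names for illustration, not the actual message-store.ts API; the defaults shown are assumed.

```typescript
type EnqueueStatus = "queued" | "processing";

interface EnqueueInput {
  channel: string;
  thread_ts: string;
  user_id: string;
  text: string;
  status?: EnqueueStatus;       // assumed default: "queued"
  picked_up_by?: string | null; // assumed default: null
}

interface StoredMessage extends Required<EnqueueInput> {
  id: number;
}

class SketchStore {
  readonly rows: StoredMessage[] = [];
  writes = 0; // each increment stands in for one SQLite write

  // Stand-in for INSERT: status and picked_up_by can be set up front.
  enqueue(input: EnqueueInput): number {
    const id = this.rows.length + 1;
    this.rows.push({
      id,
      channel: input.channel,
      thread_ts: input.thread_ts,
      user_id: input.user_id,
      text: input.text,
      status: input.status ?? "queued",
      picked_up_by: input.picked_up_by ?? null,
    });
    this.writes += 1;
    return id;
  }

  // Stand-in for the follow-up UPDATE that the one-step path avoids.
  markProcessing(id: number, pickedUpBy: string): void {
    const row = this.rows.find((r) => r.id === id);
    if (!row) throw new Error(`no row with id ${id}`);
    row.status = "processing";
    row.picked_up_by = pickedUpBy;
    this.writes += 1;
  }
}

const baseMsg = { channel: "C123", thread_ts: "1700000000.000100", user_id: "U123", text: "hello" };

// Two-step path: INSERT as queued, then UPDATE to processing.
const twoStep = new SketchStore();
twoStep.markProcessing(twoStep.enqueue(baseMsg), "bolt-handler");

// One-step path: INSERT directly as processing.
const oneStep = new SketchStore();
oneStep.enqueue({ ...baseMsg, status: "processing", picked_up_by: "bolt-handler" });

console.log(twoStep.writes, oneStep.writes); // 2 1
```

Both paths end with the same row contents; only the number of writes differs.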

3. bolt-app.ts — Persist before routeMessage()

const mqId = enqueueMessage({
  channel, thread_ts, user_id, text, files,
  status: "processing",
  picked_up_by: "bolt-handler",
});
try {
  await routeMessage({...});
  markDone(mqId);
} catch (err) {
  markFailed(mqId, err instanceof Error ? err.message : String(err));
}
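
The same persist-then-route pattern can be pulled into a standalone function so it can be exercised without Slack or SQLite. This is a sketch under assumptions: `StoreOps`, `handleWithPersistence`, and `makeRecordingStore` are hypothetical names, and the real `enqueueMessage`, `markDone`, and `markFailed` signatures may differ.

```typescript
interface InboundMessage {
  channel: string;
  thread_ts: string;
  user_id: string;
  text: string;
}

interface StoreOps {
  enqueueMessage(msg: InboundMessage & { status: "processing"; picked_up_by: string }): number;
  markDone(id: number): void;
  markFailed(id: number, reason: string): void;
}

async function handleWithPersistence(
  msg: InboundMessage,
  store: StoreOps,
  routeMessage: (m: InboundMessage) => Promise<void>
): Promise<void> {
  // Persist first, so a SIGTERM during routing leaves a 'processing' row behind.
  const mqId = store.enqueueMessage({ ...msg, status: "processing", picked_up_by: "bolt-handler" });
  try {
    await routeMessage(msg);
    store.markDone(mqId);
  } catch (err) {
    store.markFailed(mqId, err instanceof Error ? err.message : String(err));
  }
}

// Test double that records calls instead of touching a database.
function makeRecordingStore(): { store: StoreOps; events: string[] } {
  const events: string[] = [];
  let nextId = 1;
  const store: StoreOps = {
    enqueueMessage: (m) => {
      events.push(`enqueue:${m.status}`);
      return nextId++;
    },
    markDone: (id) => {
      events.push(`done:${id}`);
    },
    markFailed: (id, reason) => {
      events.push(`failed:${id}:${reason}`);
    },
  };
  return { store, events };
}

const sampleMsg: InboundMessage = {
  channel: "C123",
  thread_ts: "1700000000.000100",
  user_id: "U123",
  text: "hi",
};
const demo = makeRecordingStore();

void handleWithPersistence(sampleMsg, demo.store, async () => {
  throw new Error("boom");
}).then(() => console.log(demo.events.join(" | ")));
// prints: enqueue:processing | failed:1:boom
```

If the process is killed between the enqueue and either `markDone` or `markFailed`, the row stays in `processing`, which is the state the shutdown and startup wiring below is meant to recover.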

4. index.ts — Startup + shutdown wiring

  • Startup: initMessageDb() before Socket Mode connects
  • After connect: Re-dispatch orphaned messages via peekMessages(20) + Promise.allSettled
  • Shutdown: requeueProcessing() + closeMessageDb() before markCleanShutdown()
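
One way the startup and shutdown steps above could fit together, written against injected functions so the ordering is visible in isolation. `GatewayDeps`, `redispatchOrphans`, `shutdownSketch`, and `makeFakeDeps` are hypothetical; `dispatch` stands in for whatever re-routes a stored message, and the return shape of `peekMessages` is assumed.

```typescript
interface OrphanedMessage {
  id: number;
  channel: string;
  text: string;
}

interface GatewayDeps {
  peekMessages(limit: number): OrphanedMessage[];
  dispatch(msg: OrphanedMessage): Promise<void>;
  requeueProcessing(): number;
  closeMessageDb(): void;
  markCleanShutdown(): void;
}

// After Socket Mode connects: replay up to `limit` orphaned messages.
// Promise.allSettled keeps one failed replay from aborting the others.
async function redispatchOrphans(
  deps: GatewayDeps,
  limit = 20
): Promise<{ replayed: number; failed: number }> {
  const orphans = deps.peekMessages(limit);
  const results = await Promise.allSettled(orphans.map(async (m) => deps.dispatch(m)));
  const failed = results.filter((r) => r.status === "rejected").length;
  return { replayed: results.length - failed, failed };
}

// On SIGTERM: requeue in-flight rows, close the DB, then record the clean shutdown.
function shutdownSketch(deps: GatewayDeps): number {
  const requeued = deps.requeueProcessing();
  deps.closeMessageDb();
  deps.markCleanShutdown();
  return requeued;
}

// Fakes for demonstration: two orphans, the second of which fails to replay.
function makeFakeDeps(): { deps: GatewayDeps; calls: string[] } {
  const calls: string[] = [];
  const stored: OrphanedMessage[] = [
    { id: 1, channel: "C123", text: "first" },
    { id: 2, channel: "C123", text: "second" },
  ];
  const deps: GatewayDeps = {
    peekMessages: (limit) => stored.slice(0, limit),
    dispatch: async (m) => {
      if (m.id === 2) throw new Error("replay failed");
    },
    requeueProcessing: () => {
      calls.push("requeue");
      return 1;
    },
    closeMessageDb: () => {
      calls.push("close");
    },
    markCleanShutdown: () => {
      calls.push("clean");
    },
  };
  return { deps, calls };
}

const fake = makeFakeDeps();
void redispatchOrphans(fake.deps).then((summary) => {
  console.log(summary.replayed, summary.failed); // 1 1
  shutdownSketch(fake.deps);
  console.log(fake.calls.join(" -> ")); // requeue -> close -> clean
});
```

In the gateway itself, `shutdownSketch` would presumably run from the existing SIGTERM handler, and `redispatchOrphans` once the Socket Mode connection is up.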

Impact

  • No message loss on SIGTERM/restart: in-flight messages are requeued at shutdown and re-dispatched after reconnect
  • Hot-path cost: 1 synchronous SQLite write per message (WAL mode, sub-ms)
  • Orphan re-dispatch capped at 20 messages with Promise.allSettled
  • Backward compatible — all existing behavior preserved

Related Issues

Reference Implementation

We have a working implementation at https://github.com/rijhsinghani/claude-slack-bridge/pull/67 (merged). Happy to contribute a PR if this approach aligns with the Phase 26 roadmap.

Metadata

    Labels

    P1: High-priority user-facing bug, regression, or broken workflow.
    impact:message-loss: Channel message delivery can be lost, duplicated, or misrouted.
