[Ubuntu 24.04][Agent&Skills] NemoHermes Slack socket-mode idle reconnect silently drops inbound @mention messages #3582

@hulynn

Description

After Hermes + Slack onboarding, the bot answers inbound @mention messages correctly. After ~5 min of socket idleness, the slack_bolt AsyncApp socket_mode_listener silently reconnects to Slack, and the FIRST @mention arriving during or just after that reconnect window is not delivered to Hermes. /sandbox/.hermes/logs/agent.log shows no "inbound message" entry for the dropped event and no error, and `nemoclaw  doctor` still reports the bot healthy. Reproduced 3x in one session.
Environment
Device:        Ubuntu 24.04 server (local-mercl@10.176.187.156)
OS:            Ubuntu 24.04.4 LTS
Architecture:  x86_64
Node.js:       v22.22.3
npm:           N/A (env shim issue, not bug-relevant)
Docker:        Docker version 29.4.3, build 055a478
OpenShell CLI: openshell 0.0.39
NemoClaw:      v0.0.43
Hermes Agent:  v2026.4.23
LLM Provider:  NVIDIA Endpoints, model nvidia/nemotron-3-super-120b-a12b
Steps to Reproduce
1. nemoclaw onboard --agent hermes --no-gpu
   - provider: NVIDIA Endpoints
   - model: nvidia/nemotron-3-super-120b-a12b
   - messaging: Slack (provide xoxb-, xapp- tokens)
   - allowlist: a real Slack workspace member ID (e.g. U0AR85ATALW)
2. Wait for sandbox build + Hermes gateway up. Verify:
   docker exec  grep "Socket Mode connected" /sandbox/.hermes/logs/agent.log
3. /invite @ into a target public channel in the workspace
4. From the allowlisted Slack user, @mention the bot once:
   "@ what is 5+2?"
   -> bot replies in ~10s (round-trip logged: "inbound message: platform=slack chat=Cxxxxx ... -> response ready ... -> Sending response")
5. Wait > 5 min idle (slack_bolt's socket idle window).
6. From the same user, @mention the bot again with any prompt.
7. Observe sandbox logs:
   docker exec  tail -50 /sandbox/.hermes/logs/agent.log
Expected Result
Every @mention reaching Slack should be delivered to Hermes and either replied to or logged as filtered (allowlist mismatch, etc.). No silent drop. After idle reconnect, slack_bolt should fetch missed events from Slack's events_api buffer.
Actual Result
Step 7: /sandbox/.hermes/logs/agent.log has no "inbound message" entry corresponding to the step-6 @mention. The bot does not reply in the Slack thread. nemoclaw  doctor reports the bot healthy.

Observed 3 times in one test session (T5948596 manual run, 2026-05-15):
  17:34 CT  @mention dropped — during socket reconnect at 09:34:25 UTC
  17:55 CT  @mention dropped — during socket reconnect at 09:55:17 UTC
  18:54 CT  @mention "5+2 等于多少" ("what is 5+2?") worked (reply "7" in 9.3s). Two follow-up @mentions in the same thread, sent after 11:00 UTC, were dropped. agent.log shows an agent-cache idle-TTL evict at 11:01:41, then no further inbound events.
Working baselines (proves the pipeline works when socket is warm)
- DM "hi bot, can you hear me in DM?" at 09:59:33 UTC
  -> reply 12.1s, 135 chars
- @mention "can you hear me now?" at 10:07:42 UTC (immediately after the DM warmed the socket)
  -> reply 5m48s with HTTP 429 fallback, 106 chars
- @mention "5+2 等于多少" ("what is 5+2?") at 10:59:50 UTC
  -> reply "7" in 9.3s

Drops vs. successes correlate with socket idle/reconnect timing, not with allowlist, channel membership, or app event-subscription state (verified: bot is in channel; "[Slack] Authenticated as @" log entry present; allowlist contains the sender's real member ID).
Workaround
- Keep the socket warm by sending dummy traffic at intervals shorter than 5 min.
- Or destroy + re-onboard before each test session.
- Or accept that the FIRST @mention after idle may be lost; resend.
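The first workaround above could be scripted roughly as follows. This is only a sketch: `send_ping` is a hypothetical caller-supplied coroutine that produces whatever "dummy traffic" actually reaches the bot (it is not part of slack_bolt, NemoClaw, or Hermes), and the 60 s safety margin is an assumption, not a measured value.

```python
import asyncio

IDLE_WINDOW_S = 300    # ~5 min idle window reported in this issue
SAFETY_MARGIN_S = 60   # assumption: ping well before the window elapses


async def keep_warm(send_ping, interval_s=IDLE_WINDOW_S - SAFETY_MARGIN_S,
                    max_pings=None, sleep=asyncio.sleep):
    """Call the hypothetical `send_ping` coroutine every `interval_s` seconds.

    Runs until `max_pings` pings were sent (forever when None) and
    returns the number of pings sent. `sleep` is injectable for testing.
    """
    sent = 0
    while max_pings is None or sent < max_pings:
        await send_ping()
        sent += 1
        await sleep(interval_s)
    return sent
```

Whether periodic traffic really prevents the reconnect is itself unverified; the only evidence so far is the DM at 09:59:33 UTC that appeared to warm the socket.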
Impact
- Manual test case T5948596 (Slack message inbound to Hermes) is flaky.
- Production users who don't interact frequently will lose their first message after idle, with no error and no visible bot state change. The bot appears offline even though it is actually healthy.
Root-cause hypothesis
The idle reconnect in slack_bolt AsyncApp's socket_mode_listener does not appear to re-fetch events that arrived while the previous socket was being torn down. Slack presumably delivers the event over the socket that is closing, and the freshly reconnected socket never receives it (no replay/ack mechanism observed).
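To make the hypothesis concrete, here is a toy model of a single event delivery around a reconnect. It is not slack_bolt code and does not claim to describe how Slack actually routes events; `Session`, `deliver`, and `replay_unacked` are illustrative names made up for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    sid: str
    listening: bool = True  # is the client still reading from this socket?
    handled: list = field(default_factory=list)


def deliver(event, routed_to, current, replay_unacked):
    """Toy model: one event routed to `routed_to` while `current` is the live session.

    Returns "handled", "replayed", or "dropped".
    """
    if routed_to.listening:
        routed_to.handled.append(event)
        return "handled"
    # The event went to a socket the client has already abandoned,
    # so nothing on the client side ever sees or acknowledges it.
    if replay_unacked:
        # Hypothetical recovery path: the unacknowledged event is
        # re-sent on the session that is currently live.
        current.handled.append(event)
        return "replayed"
    return "dropped"
```

With `replay_unacked=False` the model reproduces the symptom reported here: the event is lost and nothing is logged. With `replay_unacked=True` the same event reaches the new session, which is the behaviour described under Expected Result.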
Logs
2026-05-15 09:50:36 [Slack] Authenticated as @nemoclawtest in workspace mercuriusSpace
2026-05-15 09:50:36 [Slack] Socket Mode connected (1 workspace(s))
2026-05-15 09:50:37 slack_bolt.AsyncApp: A new session (s_8150689680683) has been established
2026-05-15 09:55:17 slack_bolt.AsyncApp: The session (s_8150689680683) seems to be already closed. Reconnecting...
2026-05-15 09:55:17 slack_bolt.AsyncApp: The old session (s_8150689680683) has been abandoned
2026-05-15 09:55:17 slack_bolt.AsyncApp: A new session (s_8150689794933) has been established
(Lynn's 09:55 @mention should have arrived here; never logged)

2026-05-15 09:59:33 gateway.run: inbound message: platform=slack user=Lynn Hu chat=D0AQUPCS6BH msg='hi bot...'
2026-05-15 09:59:45 gateway.run: response ready: time=12.1s api_calls=1 response=135 chars
2026-05-15 09:59:45 [Slack] Sending response to D0AQUPCS6BH
2026-05-15 10:59:50 gateway.run: inbound message: platform=slack user=Lynn Hu chat=C0ARFANAQCW msg='lynn 测试下连接,你告诉我下5+2等于多少'
2026-05-15 11:00:00 gateway.run: response ready: time=9.3s api_calls=1 response=1 chars
2026-05-15 11:01:41 gateway.run: Agent cache idle-TTL evict: session=...:D0AQUPCS6BH:1778839163 (idle=3716s)
(Lynn's two follow-up @mentions after 11:00 — never logged)
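The correlation between drops and reconnects can be checked mechanically rather than by eye. The sketch below assumes the two log line shapes shown in the excerpt above ("A new session (...) has been established" and "gateway.run: inbound message: platform=slack") and will miss any other format. `check_mentions` and its `tolerance` parameter are hypothetical helpers written for this issue, and the send times have to be supplied by hand from the Slack UI, in UTC.

```python
import re
from datetime import datetime, timedelta

TS = r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})"
# Line shapes copied from the agent.log excerpt in this issue.
NEW_SESSION = re.compile(
    TS + r" slack_bolt\.AsyncApp: A new session \((\S+)\) has been established")
INBOUND = re.compile(TS + r" gateway\.run: inbound message: platform=slack")


def parse_ts(text):
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")


def check_mentions(lines, sent_times, tolerance=timedelta(seconds=30)):
    """For each message send time, report whether agent.log shows a
    matching inbound entry and how long ago the last session was established."""
    sessions, inbound = [], []
    for line in lines:
        line = line.strip()
        m = NEW_SESSION.match(line)
        if m:
            sessions.append(parse_ts(m.group(1)))
            continue
        m = INBOUND.match(line)
        if m:
            inbound.append(parse_ts(m.group(1)))

    report = []
    for sent in sent_times:
        # "Delivered" means an inbound entry was logged within `tolerance` of the send time.
        delivered = any(abs(t - sent) <= tolerance for t in inbound)
        prior = [ts for ts in sessions if ts <= sent]
        since = sent - max(prior) if prior else None
        report.append({"sent": sent, "delivered": delivered,
                       "since_reconnect": since})
    return report
```

Feeding it the lines of /sandbox/.hermes/logs/agent.log plus the send times of each test message gives one row per message. Undelivered rows with a small `since_reconnect` would support the reconnect-window hypothesis, while undelivered rows far from any reconnect would point elsewhere.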

Bug Details

Priority:     Unprioritized
Action:       Dev - Open - To fix
Disposition:  Open issue
Module:       Machine Learning - NemoClaw
Keyword:      NemoClaw, NemoClaw_Agent&Skills, NEMOCLAW_GH_SYNC_APPROVAL, NemoClaw_Policy&Network, NemoClaw-SWQA-Test-Blocker

[NVB#6180485]

Labels
  NV QA (Bugs found by the NVIDIA QA Team)
  integration: hermes (Hermes integration behavior)
  integration: slack (Slack integration or channel behavior)
  platform: ubuntu (Affects Ubuntu Linux environments)