[Bug][DingTalk] Inbound messages stop arriving after multiple rapid gateway restarts #7562

@ihainan

Summary

After multiple rapid gateway restarts (especially involving kill -9 or quick SIGTERM/restart cycles), DingTalk stops routing inbound messages to the gateway entirely. The documented 30–60s recovery window does not apply here: routing stays broken for an indeterminate period (observed: 40+ minutes with no messages received across 6+ restart cycles).

Symptoms

  • Gateway logs show successful WebSocket connection (ticket registered, ✓ dingtalk connected)
  • TCP connection is ESTAB (confirmed via ss -tp)
  • No inbound message log entries despite user sending messages
  • DingTalk app shows messages are sent (no error on sender side)
  • Reactions and outbound sends still work (session_webhooks from earlier in the session remain valid)
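
Because the socket looks healthy at every layer we can observe (ticket registered, TCP ESTAB), the only visible signal is the absence of inbound traffic. A minimal sketch of a silence tracker that the adapter could consult is shown here. The class name `InboundSilenceWatchdog`, its methods, and the threshold are hypothetical and not part of the current adapter or of any DingTalk SDK. A quiet chat also produces silence, so this can only flag a connection as suspect, not prove that routing is stuck.

```python
import time
from typing import Callable


class InboundSilenceWatchdog:
    """Tracks how long a connection has gone without any inbound frame.

    Hypothetical helper: the adapter would call mark_inbound() from its
    message callback and poll is_suspect() from a periodic task.
    """

    def __init__(
        self,
        max_silence_s: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max_silence_s = max_silence_s
        self._clock = clock
        # Treat connection setup as the last known activity.
        self._last_inbound = clock()

    def mark_inbound(self) -> None:
        """Record that an inbound frame (message or server ping) arrived."""
        self._last_inbound = self._clock()

    def silence_s(self) -> float:
        """Seconds elapsed since the last recorded inbound frame."""
        return self._clock() - self._last_inbound

    def is_suspect(self) -> bool:
        """True once the silence exceeds the configured threshold."""
        return self.silence_s() > self._max_silence_s
```

A log line emitted when `is_suspect()` first turns true would at least make the "connected but deaf" state visible in gateway.log instead of requiring a manual test message.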

Reproduction Steps

  1. Start gateway, confirm messages arrive
  2. Rapidly restart the gateway 5–10 times within a few minutes (simulating active development)
  3. Send a message from DingTalk and observe that nothing appears in ~/.hermes/logs/gateway.log

Root Cause Hypothesis

DingTalk's stream routing appears to track a "preferred" connection per app credential. After many quick disconnect/reconnect cycles, the server may apply backpressure or enter a confused state where it doesn't route to any of the new connections. Unlike a single restart (which recovers in ~30–60s via keepalive timeout), multiple rapid restarts may require a longer cool-down or a DingTalk console action to reset.

The ghost-connection fix (monkey-patching open_connection to raise KeyboardInterrupt during shutdown) prevents duplicate ticket registration but does not help when routing is already stuck.

Investigation Needed

  • Check DingTalk developer console for connection/quota state
  • Determine if there is a per-credential connection registration rate limit
  • Test whether waiting 5+ minutes after a clean restart recovers routing
  • Add WebSocket ping/heartbeat monitoring to detect silent dead connections
  • Consider exponential backoff for reconnect attempts to avoid triggering rate limits
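
For the backoff item above, a minimal sketch of a delay calculation using exponential backoff with "equal jitter" (half of the ceiling fixed, half random) could look like this. The function name `reconnect_delay` and the default base and cap values are assumptions chosen for illustration. Whether DingTalk enforces a registration rate limit at all is still an open question in this issue.

```python
import random
from typing import Callable


def reconnect_delay(
    attempt: int,
    base_s: float = 1.0,   # assumed starting delay, not a DingTalk-documented value
    cap_s: float = 300.0,  # assumed upper bound, not a DingTalk-documented value
    rng: Callable[[], float] = random.random,
) -> float:
    """Return the delay in seconds before reconnect attempt number `attempt`.

    The ceiling doubles with each attempt up to cap_s. Half of the ceiling
    is always waited; the other half is scaled by rng() in [0, 1) so that
    several gateways do not reconnect in lockstep.
    """
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    # Clamp the exponent so very large attempt counts cannot overflow a float.
    ceiling = min(cap_s, base_s * (2 ** min(attempt, 32)))
    return ceiling / 2 + (ceiling / 2) * rng()
```

Note that an in-process counter resets on every kill -9, so covering the rapid-restart scenario described here would probably require persisting the attempt count or last-connect timestamp across process restarts, which this sketch does not do.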

Note

This is a DingTalk platform behavior issue, not a bug in our adapter logic per se. The open_connection ghost-fix is correct and prevents the immediate ghost-connection problem on a single restart. The issue is the cumulative effect of many rapid restarts.
