fix: DingTalk platform adapter has multiple bugs preventing message processing#5038

Closed
cloorc wants to merge 1 commit into
NousResearch:mainfrom
cloorc:fix/dingtalk-adapter-bugs

@cloorc cloorc commented Apr 4, 2026


Description

The DingTalk platform adapter (gateway/platforms/dingtalk.py) has several critical bugs that prevent it from receiving and responding to messages. I found them while debugging with the DingTalk mobile client on HarmonyOS 4.

Bugs Found

1. DingTalkStreamClient.start wrapped incorrectly with asyncio.to_thread

Line 135 (original):

await asyncio.to_thread(self._stream_client.start)

DingTalkStreamClient.start() is an async coroutine function. asyncio.to_thread() is designed for synchronous (blocking) functions: given a coroutine function, it only calls it in a worker thread, which returns a coroutine object that is never awaited, so the client never actually starts. This causes repeated reconnection loops with no useful error.

Fix: Changed to await self._stream_client.start()
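A minimal sketch of the difference, using a hypothetical FakeStreamClient as a stand-in for the real SDK class (only start() is modeled, and the started flag is an illustration, not an SDK attribute):

```python
import asyncio


class FakeStreamClient:
    """Hypothetical stand-in for DingTalkStreamClient; not the real SDK class."""

    def __init__(self):
        self.started = False

    async def start(self):
        self.started = True


async def demo():
    client = FakeStreamClient()

    # to_thread() calls client.start() in a worker thread. Calling a coroutine
    # function only builds a coroutine object, so the body of start() never runs.
    leftover = await asyncio.to_thread(client.start)
    is_coroutine = asyncio.iscoroutine(leftover)
    started_via_thread = client.started
    leftover.close()  # avoid the "coroutine was never awaited" RuntimeWarning

    # Awaiting the coroutine directly runs start() on the event loop.
    await client.start()
    return is_coroutine, started_via_thread, client.started


result = asyncio.run(demo())
print(result)
```

The first two values show that the to_thread() call handed back an unawaited coroutine and left the client unstarted; the last shows the direct await working.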

2. process() method signature mismatch with SDK base class

dingtalk_stream.ChatbotHandler.process() is defined as async def process(self, message: CallbackMessage) in the SDK. The adapter overrides it as a regular def process(self, message), so when the SDK runs await handler.process(message), it gets a plain tuple (STATUS_OK, "OK") and fails with:

ERROR dingtalk_stream.client: error processing message: object tuple can't be used in 'await' expression

Fix: Changed to async def process(self, message) and replaced blocking future.result(timeout=60) with fire-and-forget future.add_done_callback() for error logging.
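The mismatch can be reproduced without the SDK. In this sketch, sdk_dispatch() is a hypothetical stand-in for the SDK code that awaits the handler, and STATUS_OK is a placeholder value rather than the SDK constant:

```python
import asyncio

STATUS_OK = 200  # placeholder for illustration, not imported from the SDK


class SyncHandler:
    def process(self, message):
        # Plain function: returns a tuple, which is not awaitable.
        return STATUS_OK, "OK"


class AsyncHandler:
    async def process(self, message):
        # Coroutine function: calling it returns an awaitable.
        return STATUS_OK, "OK"


async def sdk_dispatch(handler, message):
    # Hypothetical stand-in for the SDK awaiting the handler.
    return await handler.process(message)


async def demo():
    try:
        await sdk_dispatch(SyncHandler(), {})
        sync_error = None
    except TypeError as exc:
        sync_error = str(exc)
    async_result = await sdk_dispatch(AsyncHandler(), {})
    return sync_error, async_result


sync_error, async_result = asyncio.run(demo())
print(sync_error)
print(async_result)
```

The exact wording of the TypeError depends on the Python version, so the sketch only captures it rather than matching on the text.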

3. TimeoutError from blocking future.result(timeout=60)

The process() method blocked the dingtalk-stream SDK thread for up to 60 seconds while waiting for agent processing to complete. Agent responses often take longer than that (tool calls, LLM inference), which raised a TimeoutError and kept the ACK from being returned to the SDK promptly.

ERROR gateway.platforms.dingtalk: [DingTalk] Error processing incoming message
TimeoutError

Fix: Return ACK immediately, dispatch agent work as fire-and-forget background task.
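One way to sketch the ACK-first pattern, assuming the agent work can be scheduled on the running event loop. handle_message(), the logger name, and STATUS_OK are hypothetical; the adapter's actual dispatch code is not reproduced here:

```python
import asyncio
import logging

logger = logging.getLogger("dingtalk_sketch")  # hypothetical logger name
STATUS_OK = 200  # placeholder for illustration

_background_tasks = set()  # strong references so tasks are not garbage collected
handled = []


async def handle_message(message):
    """Hypothetical slow agent call."""
    await asyncio.sleep(0.05)
    handled.append(message)


def _on_done(task):
    # Runs when the background task finishes; only logs, never re-raises.
    _background_tasks.discard(task)
    if not task.cancelled() and task.exception() is not None:
        logger.error("Error processing incoming message", exc_info=task.exception())


async def process(message):
    # Schedule the slow work and return the ACK without waiting for it.
    task = asyncio.create_task(handle_message(message))
    _background_tasks.add(task)
    task.add_done_callback(_on_done)
    return STATUS_OK, "OK"


async def demo():
    ack = await process("hello")
    acked_before_done = handled == []
    # Wait for background work only so the demo can show the final state.
    await asyncio.gather(*list(_background_tasks), return_exceptions=True)
    return ack, acked_before_done, list(handled)


ack, acked_before_done, final = asyncio.run(demo())
print(ack, acked_before_done, final)
```

Keeping the task in a set until its done callback fires follows the asyncio documentation's advice to hold a reference to fire-and-forget tasks.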

4. _extract_text() fails on CallbackMessage — messages silently dropped

The SDK sends messages as CallbackMessage with all fields inside message.data dict (e.g., message.data['text'] = {'content': 'hello'}). The adapter uses getattr(message, "text", None) which returns None since CallbackMessage has no text attribute — only data, headers, spec_version, type, extensions.

Result: text extraction returns empty string, message is silently skipped with DEBUG log "Empty message, skipping".

Fix: Added fallback to message.data['text']['content'] for CallbackMessage format.
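A rough sketch of such a fallback. extract_text() here is a hypothetical standalone function, not the adapter's actual _extract_text(), and SimpleNamespace objects stand in for SDK message types:

```python
from types import SimpleNamespace


def extract_text(message):
    """Return message text from either an attribute or the data dict."""
    text = getattr(message, "text", None)
    if text is None:
        # CallbackMessage-style payload: fields live in message.data.
        data = getattr(message, "data", None)
        if isinstance(data, dict):
            text = data.get("text")

    if isinstance(text, dict):
        content = text.get("content", "")
    elif text is not None and hasattr(text, "content"):
        content = text.content
    else:
        content = text or ""
    return str(content).strip()


# Stand-ins for the two message shapes (assumed for illustration).
callback_style = SimpleNamespace(data={"text": {"content": " hello "}})
attribute_style = SimpleNamespace(text=SimpleNamespace(content="hi"))
empty = SimpleNamespace(data={})

print(extract_text(callback_style))
print(extract_text(attribute_style))
print(repr(extract_text(empty)))
```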

5. _on_message() fails to extract any fields from CallbackMessage

Same issue as bug 4, but for ALL fields: conversation_id, sender_id, sender_nick, sender_staff_id, session_webhook, create_at, conversation_title. All use getattr(message, "...", default), which returns the default because CallbackMessage stores everything in the data dict under camelCase keys (conversationId, senderId, sessionWebhook, etc.).

Result: session_webhook is never captured, replies cannot be sent. User IDs are empty, authorization fails.

Fix: Added _get_field() helper with _DATA_KEY_MAP (snake_case to camelCase) that falls back to message.data[key].
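A sketch of what such a helper could look like. The function body is an assumption, not the PR's code; conversationId, senderId, senderStaffId, and sessionWebhook are named in the description, while the remaining camelCase keys are inferred from the snake_case names:

```python
from types import SimpleNamespace

# snake_case attribute name -> camelCase key in message.data.
# senderNick, createAt and conversationTitle are inferred, not confirmed.
_DATA_KEY_MAP = {
    "conversation_id": "conversationId",
    "sender_id": "senderId",
    "sender_nick": "senderNick",
    "sender_staff_id": "senderStaffId",
    "session_webhook": "sessionWebhook",
    "create_at": "createAt",
    "conversation_title": "conversationTitle",
}


def get_field(message, name, default=None):
    """Read a field from an attribute first, then from message.data."""
    value = getattr(message, name, None)
    if value is not None:
        return value
    data = getattr(message, "data", None)
    if isinstance(data, dict):
        value = data.get(_DATA_KEY_MAP.get(name, name))
        if value is not None:
            return value
    return default


# Stand-in for a CallbackMessage; the webhook URL is a made-up example.
msg = SimpleNamespace(
    data={"conversationId": "cid123", "sessionWebhook": "https://example.invalid/hook"}
)

print(get_field(msg, "conversation_id", ""))
print(get_field(msg, "session_webhook"))
print(repr(get_field(msg, "sender_nick", "")))
```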

6. Authorization uses unreadable encrypted senderId instead of senderStaffId

DingTalk provides two user identifiers:

  • senderId: encrypted open ID like $:LWCP_v1:$qoM1+WxS0Q5F5iqeTKOz7Hge06B2HTXW (unreadable, varies per app)
  • senderStaffId: numeric corp employee ID like 22514138787330 (human-readable, stable)

The adapter used senderId as user_id for authorization, making it impractical to add users to allowlists.

Fix: Use senderStaffId as primary user_id, keep senderId as user_id_alt.
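A small sketch of the ID selection. resolve_user_ids() and is_allowed() are hypothetical names, and falling back to senderId when no staff ID is present is an assumption not stated in the description:

```python
def resolve_user_ids(data):
    """Pick the primary and alternate user IDs from a message data dict."""
    staff_id = data.get("senderStaffId") or ""
    open_id = data.get("senderId") or ""
    # Assumption: fall back to the encrypted ID if no staff ID is present.
    user_id = staff_id or open_id
    user_id_alt = open_id
    return user_id, user_id_alt


def is_allowed(data, allowlist):
    """Hypothetical allowlist check that accepts either identifier."""
    user_id, user_id_alt = resolve_user_ids(data)
    return user_id in allowlist or user_id_alt in allowlist


payload = {
    "senderId": "$:LWCP_v1:$qoM1+WxS0Q5F5iqeTKOz7Hge06B2HTXW",
    "senderStaffId": "22514138787330",
}

print(resolve_user_ids(payload))
print(is_allowed(payload, {"22514138787330"}))
```

With this shape an allowlist can hold the readable staff ID, while entries that already use the encrypted ID keep working.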

Environment

  • OS: HarmonyOS 4 (DingTalk mobile client sending messages)
  • Hermes: Current cli branch
  • Python: 3.11.15
  • dingtalk-stream SDK: Latest from pip
  • Platform: DingTalk Stream Mode (WebSocket)

Reproduction

  1. Configure DingTalk platform in config.yaml
  2. Start gateway: hermes gateway run
  3. Send message from DingTalk mobile client
  4. Observe: messages arrive but are silently dropped or time out

Fixes Applied

All six bugs have been fixed locally. The changes ensure:

  • Stream client starts correctly with await
  • SDK thread is not blocked by long-running agent processing
  • CallbackMessage fields are properly extracted via data dict
  • Human-readable staff IDs are used for authorization

Happy to submit a PR if the fixes look good.

…rocessing

The DingTalk adapter (gateway/platforms/dingtalk.py) has six bugs that
prevent it from receiving and responding to messages:

1. DingTalkStreamClient.start() is async but was wrapped in
   asyncio.to_thread(), which is for sync functions. The coroutine was
   never awaited, causing infinite reconnection loops.

2. ChatbotHandler.process() is async in the dingtalk-stream SDK, but
   the adapter defined it as a regular function. The SDK tried to await
   the returned tuple, causing "object tuple can't be used in 'await'
   expression".

3. process() blocked the SDK thread with future.result(timeout=60).
   Agent responses often take longer than 60s (LLM inference, tool calls),
   causing TimeoutError and preventing timely ACK.

4. _extract_text() used getattr(message, "text") which returns None on
   CallbackMessage. The SDK stores text in message.data["text"]["content"].
   Messages were silently dropped as empty.

5. _on_message() used getattr() for all fields (conversation_id,
   sender_id, session_webhook, etc.) but CallbackMessage stores
   everything in message.data with camelCase keys. All fields resolved
   to defaults — session_webhook was never captured, replies impossible.

6. Authorization used encrypted senderId ($:LWCP_v1:$...) as user_id
   instead of readable senderStaffId (numeric corp employee ID), making
   allowlists impractical.

Fixes:
- Direct await on stream_client.start()
- async def process() with fire-and-forget task dispatch
- _extract_text() falls back to message.data["text"]["content"]
- _get_field() helper with camelCase key mapping for CallbackMessage
- Use senderStaffId as primary user_id for authorization

Closes NousResearch#5037

Signed-off-by: cloorc <wittcnezh@foxmail.com>
@teknium1

Contributor

Closing as superseded by #11471, which salvaged @kevinskysunny's minimal fix (#11257) and added a follow-up for the broken _extract_text() path found during E2E testing.

Thanks for the fix — a lot of contributors hit this SDK break at the same time. Your investigation helped confirm the root cause.

@teknium1 teknium1 closed this Apr 17, 2026
@cloorc cloorc deleted the fix/dingtalk-adapter-bugs branch May 7, 2026 00:24