Gateway unresponsive during OpenAI batch embedding polling #9751

@jeefy

Summary

Gateway became unresponsive (stopped sending messages to Discord) while the memory embedding subsystem was polling OpenAI's batch API.

Environment

  • OpenClaw version: latest (npm)
  • Node: v25.5.0
  • OS: Linux 6.8.0-94-generic (x64)
  • Channel: Discord

Reproduction

  1. Enable memory search with OpenAI batch embeddings (default)
  2. OpenAI batch API returns 503 / ECONNREFUSED
  3. System enters retry loop polling every 2s
  4. Log fills with: openai batch batch_* in_progress; waiting 2000ms
  5. Gateway stops processing outbound messages (Discord sends fail silently)

Logs

[2026-02-05T10:30:xx.xxxZ] openai batch batch_67a3b7d5dd788190ae31c9e1bb92bf87 in_progress; waiting 2000ms
[2026-02-05T10:30:xx.xxxZ] openai batch batch_67a3b7d5dd788190ae31c9e1bb92bf87 in_progress; waiting 2000ms
... (repeated for minutes)

Impact

  • Messages queued for Discord never sent
  • Gateway appeared healthy (no crash) but was functionally stalled
  • Only recovered after OpenClaw fell back to non-batch mode

Suggested Fix

  1. Add timeout/circuit breaker to batch polling loop
  2. Don't busy-poll (2s interval may starve event loop under load)
  3. Consider exponential backoff on 503s
  4. Add health check that detects "polling too long" state
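
A rough sketch of what items 1 and 3 could look like, in case it helps whoever picks this up. None of these names come from the OpenClaw codebase: `pollBatch`, `PollOptions`, `BatchPollTimeoutError`, and the `fetchStatus` callback are all hypothetical, and the three status values are a simplification for illustration. The idea is a total time budget on the polling loop plus a capped exponential backoff, with transport errors (503 / ECONNREFUSED) treated as retryable until the budget runs out.

```typescript
// Hypothetical sketch; names and shapes are assumptions, not OpenClaw APIs.
type BatchStatus = "in_progress" | "completed" | "failed";

interface PollOptions {
  baseDelayMs: number; // first wait between polls
  maxDelayMs: number; // upper bound for the backoff
  deadlineMs: number; // total time budget before giving up
  sleep?: (ms: number) => Promise<void>; // injectable for tests
  now?: () => number; // injectable clock for tests
}

class BatchPollTimeoutError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "BatchPollTimeoutError";
  }
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function pollBatch(
  fetchStatus: () => Promise<BatchStatus>,
  opts: PollOptions,
): Promise<BatchStatus> {
  const sleep = opts.sleep ?? defaultSleep;
  const now = opts.now ?? Date.now;
  const startedAt = now();
  let delay = opts.baseDelayMs;

  for (;;) {
    let status: BatchStatus | undefined;
    try {
      status = await fetchStatus();
    } catch {
      // Assumption: any thrown error (e.g. 503, ECONNREFUSED) is retryable.
      status = undefined;
    }
    if (status === "completed" || status === "failed") return status;

    // Give up if the next wait would exceed the total budget.
    const elapsed = now() - startedAt;
    if (elapsed + delay > opts.deadlineMs) {
      throw new BatchPollTimeoutError(`batch still pending after ${elapsed}ms`);
    }

    await sleep(delay);
    delay = Math.min(delay * 2, opts.maxDelayMs); // capped exponential backoff
  }
}
```

On `BatchPollTimeoutError` the caller could fall back to non-batch mode right away instead of waiting for that to happen on its own. The same elapsed-time value could also feed the health check in item 4.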

Workaround

Disable batch embeddings in config:

agents:
  defaults:
    memorySearch:
      remote:
        batch:
          enabled: false

Related

The user also observed a long-running CRON sub-agent around the same time; it is unclear whether the two are related.
