Skip to content

[Bug]: OpenAI Responses SSE sanitizer appears to starve embedded streams; bypass restores normal output #76305

@andhai

Description

@andhai

Summary

This looks like a more specific root-cause report for the broader openai/* embedded run hangs until timeout symptom reported in #76174 (and possibly related to #75824).

On OpenClaw 2026.4.29 (a448042), embedded OpenAI Responses streaming can stall because the SSE sanitization layer appears to starve or corrupt the stream before the OpenAI SDK parser sees the downstream events.

A clean repro on macOS succeeds immediately when sanitizeOpenAISdkSseResponse() is bypassed for one run, which strongly isolates the timeout to that sanitizer path rather than general OpenAI connectivity, auth, or the embedded runner itself.

Environment

  • OpenClaw 2026.4.29 (a448042)
  • macOS arm64
  • Node 24.13.1
  • Provider: OpenAI direct API key path
  • Models observed failing in embedded runs before the bypass: openai/gpt-5.4, openai/gpt-5.3-codex, openai/gpt-4o-mini

What we observed before the bypass

For embedded runs using the OpenAI Responses path:

  • request is dispatched successfully
  • OpenAI returns 200 OK
  • content type is text/event-stream; charset=utf-8
  • response body is present
  • the embedded runner reaches stream-ready
  • a first event of type start is observed
  • then no usable downstream reply events arrive and the run eventually hits the idle watchdog / embedded timeout

In other words, this is not a simple fetch never returned failure.

Critical isolation result

I patched the installed bundle temporarily so sanitizeOpenAISdkSseResponse() can be bypassed with an env var for a single run:

OPENCLAW_BYPASS_OPENAI_SDK_SSE_SANITIZE=1

Then I ran a fresh unique-session repro:

OPENCLAW_DISABLE_BONJOUR=1 \
OPENCLAW_BYPASS_OPENAI_SDK_SSE_SANITIZE=1 \
openclaw agent --local --agent main \
  --session-id diag-bypass-sse-1777763671 \
  --model openai/gpt-4o-mini \
  --message "Reply with exactly: hello" \
  --json --timeout 90

That run completed successfully and returned:

hello

Total duration was ~30s, no fallback used.

Why this points at the sanitizer

With the bypass enabled, the downstream SSE/event flow looks normal again. The run emits/observes events like:

  • response.created
  • response.in_progress
  • response.output_item.added
  • response.content_part.added
  • response.output_text.delta
  • response.output_text.done
  • response.completed

Before the bypass, the stream reached the initial startup boundary but stalled before those downstream messages materialized.

Relevant diagnostic evidence

Before bypass:

[diag:fetch-sse] guarded-fetch-response provider=openai model=gpt-4o-mini status=200 contentType=text/event-stream; charset=utf-8 bodyPresent=true url=https://api.openai.com/v1/responses
[diag:fetch-sse] sanitize-start status=200 contentType=text/event-stream; charset=utf-8 url=https://api.openai.com/v1/responses
[diag:fetch-sse] sanitize-pull index=1 done=false bytes=492
[diag:openai-sdk] parse-response-stream status=200 contentType=text/event-stream; charset=utf-8 bodyPresent=true
[diag:embedded-stream] first-event provider=openai model=gpt-4o-mini api=openai-responses elapsedMs=1642 type=start
[diag:fetch-sse] sanitize-pull index=2 done=false bytes=1369
...
embedded run timeout

After bypass:

[diag:fetch-sse] sanitize-bypassed status=200 contentType=text/event-stream; charset=utf-8 url=https://api.openai.com/v1/responses
[diag:openai-sdk] sse-message index=5 event=response.output_text.delta dataPrefix={"type":"response.output_text.delta",..."delta":"hello"...}
[openai-transport] [diag:openai-responses] text-delta provider=openai model=gpt-4o-mini api=openai-responses contentIndex=0 deltaLen=5 totalLen=5
[openai-transport] [diag:openai-responses] response-completed provider=openai model=gpt-4o-mini api=openai-responses status=completed stopReason=stop contentBlocks=1

Suspect area

The problematic layer appears to be the SSE sanitation/filtering path around the function:

  • sanitizeOpenAISdkSseResponse(response)

That layer sits between the raw text/event-stream HTTP response and the OpenAI SDK SSE parser.

Expected behavior

OpenClaw should preserve valid OpenAI Responses SSE frames so embedded OpenAI runs can stream normally without requiring a bypass.

Actual behavior

The sanitizer path appears to interfere with or starve the SSE stream, causing embedded runs to hang until timeout.

Workaround

A temporary bundle-local env-gated bypass of the sanitizer restored correct behavior immediately:

OPENCLAW_BYPASS_OPENAI_SDK_SSE_SANITIZE=1

Request

Could maintainers take a look at the SSE sanitizer path for the OpenAI Responses stream on 2026.4.29?

This seems like a concrete root-cause lead for the broader openai/* embedded runs hang until timeout reports such as #76174.

Metadata

Metadata

Assignees

No one assigned

    Labels

    P1High-priority user-facing bug, regression, or broken workflow.impact:auth-providerAuth, provider routing, model choice, or SecretRef resolution may break.impact:crash-loopCrash, hang, restart loop, or process-level availability failure.

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions