fix(core): isolate OpenAI SDK abort listener leak with per-request child controllers#4810
Conversation
…ild controllers
OpenAI SDK v5.11.0's `fetchWithTimeout` (client.mjs:324) adds an abort
listener on the caller's signal without `{once: true}` or cleanup on
request completion. PR #4366 removed the `raiseAbortListenerCap` band-aid
(which had silenced the warning by setting maxListeners to Infinity) under
the assumption that `createChildAbortController` in agent-core covered it.
However, the SDK's internal listener leak was never addressed — it
accumulates on whichever signal reaches the SDK across retries and rounds.
Fix: wrap the signal passed to `client.chat.completions.create()` in a
per-request `createChildAbortController`. The SDK's leaked listener stays
on the short-lived child signal. The `finally` block aborts the child,
triggering reverse-cleanup that removes the parent listener. For streaming,
the cleanup runs when the async generator is fully consumed or abandoned.
Closes #4423
📋 Review SummaryThis PR addresses a memory leak in the OpenAI SDK's abort signal handling by wrapping the caller's 🔍 General Feedback
🎯 Specific Feedback🟡 High
🟢 Medium
🔵 Low
✅ Highlights
|
Code Coverage Summary
CLI Package - Full Text ReportCore Package - Full Text ReportFor detailed HTML reports, please see the 'coverage-reports-22.x-ubuntu-latest' artifact from the main CI run. |
Tests that don't pass an abortSignal expect `signal: undefined` to reach the SDK. Only create the per-request child when a parent signal exists.
TS does not propagate narrowing into the nested `drainThenCleanup` generator, so referencing `perRequestAc` inside the closure failed with TS18048. Capture the narrowed value in a local const after the guard so the finally block sees a non-nullable controller.
|
@qwen-code /triage |
|
Thanks for the PR! Template looks good ✓ — all required sections present, linked issue (#4423) matches. On direction: this is a straightforward bug fix for a real, user-reported issue. The On approach: minimal and correct — wraps Moving on to code review. 🔍 中文说明感谢贡献! 模板完整 ✓ — 所有必填章节齐全,关联 issue (#4423) 匹配。 方向:这是一个针对真实用户报告 bug 的直接修复。OpenAI SDK 的 方案:最小且正确 — 在两个 SDK 调用点(非流式和流式)用 进入代码审查 🔍 — Qwen Code · qwen3.7-max |
… add cleanup tests 1. Wrap the streaming `client.chat.completions.create()` call in try/catch so a network/DNS/proxy error still aborts the per-request child controller (same pattern as the non-streaming path). 2. Add three cleanup-invariant tests: - child signal aborted after full stream consumption - child signal aborted after consumer break - child signal aborted when SDK create() throws
wenshao
left a comment
There was a problem hiding this comment.
No issues found. LGTM! ✅
The per-request child AbortController pattern correctly isolates the OpenAI SDK listener leak on both streaming and non-streaming paths. The try/finally (non-streaming) and try/catch + drainThenCleanup generator (streaming) ensure cleanup in all control-flow paths: normal return, thrown exceptions, early break, and parent abort propagation. The R1 concerns (missing try/finally for create() errors, test coverage gaps) are properly addressed with 4 new tests.
CI: 8/8 pass. 55/55 pipeline tests pass. tsc and eslint clean.
— qwen3.7-max via Qwen Code /review
There was a problem hiding this comment.
Pull request overview
This PR mitigates AbortSignal listener accumulation triggered by the OpenAI SDK by wrapping the caller-provided abort signal in a short-lived per-request child AbortController, and ensuring the child is aborted on request completion (and after streaming consumption) so the parent listener is released.
Changes:
- Pass a per-request child
AbortSignalintoclient.chat.completions.create()for both non-streaming and streaming calls. - Ensure the per-request child controller is aborted in
finally(non-streaming) and via a wrapper async generator on stream completion/early-break (streaming). - Update and expand unit tests to assert child-signal usage and post-consumption cleanup behavior.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| packages/core/src/core/openaiContentGenerator/pipeline.ts | Wrap OpenAI SDK calls with a per-request child abort controller and ensure cleanup for both sync and streaming paths. |
| packages/core/src/core/openaiContentGenerator/pipeline.test.ts | Adjust existing assertions for child-signal behavior and add tests covering abort propagation and cleanup after streaming. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Code ReviewThe implementation is clean and matches what I would have done independently. Two call sites, each with the right cleanup strategy for its context:
The Tests are thorough: 6 new test cases covering child signal creation, parent→child propagation, post-consumption cleanup, early-break cleanup, and throw-time cleanup. Existing tests updated to expect the child signal (not the parent). No blockers. No AGENTS.md violations. Test ResultsListener Accumulation Repro (#4366 script)Tmux Smoke Test (with PR applied)No 中文说明代码审查实现干净,与我的独立方案一致。两个调用点各自采用正确的清理策略:
#4366 引入的 测试全面:6 个新测试用例覆盖 child signal 创建、父→子传播、消费后清理、提前退出清理、异常时清理。已有测试更新为期望 child signal。 无阻塞问题,无 AGENTS.md 违规。 测试结果
— Qwen Code · qwen3.7-max |
|
This is a textbook surgical fix: real user pain (#4423), correct root cause (SDK My independent proposal before reading the diff was identical — wrap the signal at the SDK call boundary using The before/after is clear: the listener accumulation repro proves the mechanism (OLD: 2000 listeners, NEW: 0), pipeline tests verify correctness, and the tmux smoke test confirms no regressions. Approving. ✅ 中文说明这是一个教科书级的精准修复:真实用户痛点(#4423)、正确的根因定位(SDK 我在看 diff 之前的独立方案完全一致 — 在 SDK 调用边界用 前后对比清晰:listener 积累 repro 证明机制有效(旧:2000 listeners,新:0),pipeline 测试验证正确性,tmux 冒烟测试确认无回归。 批准 ✅ — Qwen Code · qwen3.7-max |
qwen-code-ci-bot
left a comment
There was a problem hiding this comment.
LGTM, looks ready to ship. ✅
- Update error-handling test comment: the error does propagate to the consumer via the async generator, not handled only internally - Remove trailing whitespace on blank line in early-break test
|
Qwen Code review did not complete successfully: Qwen review timed out after 55 minutes. See workflow logs. |
✅ Local verification report — real OpenAI SDK + built pipeline, in tmux (Linux)Built the PR and ran a real end-to-end A/B in Verdict: the fix works exactly as described, on both code paths (non-streaming + streaming, incl. early abandonment), with abort propagation preserved and no regressions. Safe to merge. The PR description matches observed reality. Environment
Real A/B (built pipeline + real SDK, 40 calls on ONE parent
|
| Path (40 calls, real SDK) | parent abort listeners | MaxListenersExceededWarning |
|---|---|---|
BEFORE — unfixed call site (= main) |
40 | fired |
AFTER — real pipeline.execute |
0 | none |
AFTER — real pipeline.executeStream (full consume) |
0 | none |
AFTER — real pipeline.executeStream (early break) |
0 | none |
- The early-break case is the subtlest: a consumer that abandons the stream before
[DONE]. It stays at 0, confirmingdrainThenCleanup'sfinally { ac.abort() }fires via the async-generator.return()path, releasing the parent listener. - Abort propagation preserved:
pipeline.execute()with a pre-aborted parent rejects (the child is aborted with the parent's reason), so cancellation semantics are unchanged.
Unit tests & reverse-audit
packages/core/src/core/openaiContentGenerator/pipeline.test.ts→ 55 passed ✓ (matches PR claim)- Full
openaiContentGeneratorsuite → 461 passed / 15 files — no regressions ✓ - Reverse-audit of the test diff (passing ≠ correct): no assertions were weakened.
- The two removed
expect(create).toHaveBeenCalledWith(…, { signal: abortController.signal })assertions had to change — the SDK now correctly receives a child signal; they're replaced by stronger checks (sdkSignal instanceof AbortSignal && sdkSignal !== parent). - The stream-error assertion was strengthened: moved out of the
catchblock (where a never-thrown error would have been a silent false-pass) toexpect(caughtError).toBe(testError)after the loop. The round-3 commit also corrected the now-inaccurate comment.
- The two removed
Code review
The fix wraps each SDK call in createChildAbortController(parentSignal) (an existing, well-documented util with {once:true} parent-listener + reverse-cleanup on child-abort + WeakRef parent) and aborts the child when the request settles:
- non-streaming:
try { … } finally { perRequestAc?.abort() } - streaming:
catch(create throws) +drainThenCleanup'sfinally(stream consumed or abandoned) - the no-parent-signal case is a clean no-op (
parentSignal ? … : undefined). One extraAbortControllerper call — negligible, as the PR notes.
Recommendation
LGTM to merge. The real SDK leak (40 dead listeners + warning over 40 calls) is fully eliminated by the actual shipped code on every path I exercised, cancellation is preserved, tests are accurate and green, and no regressions surfaced.
中文小结
用 tmux + 真实 openai@5.11.0 SDK + 真实构建出的 ContentGenerationPipeline(打到本地 mock OpenAI 服务器)做了端到端 A/B,在同一个长生命周期 parent signal 上连续 40 次调用,统计 parent 上的 abort listener 数量。这比 PR 自带的 repro(自己重写 controller 并模拟泄漏)更强——它跑的是真实 SDK 泄漏路径和真实修复代码。Linux 环境,补齐了 🐧 Linux 空缺。
- 先确认泄漏真实存在:
node_modules/openai/client.mjs:324的signal.addEventListener('abort', …)没有{once:true}、不 remove。 - BEFORE(未修复,等同 main 直接把 signal 传给 SDK):listener 10→20→30→40,第 16 个触发
MaxListenersExceededWarning(正是 issue 出现(node:2666) MaxListenersExceededWarning: Possible EventTarget memory leak detected. 1596 abort listeners added to [AbortSignal]. MaxListeners is 1500. Use events.setMaxListeners() to increase limit (node:2666) MaxListenersExceededWarning: Possible EventT #4423 的现象)。 - AFTER(真实
pipeline.execute/executeStream):40 次后 listener 始终为 0,无警告。 - 流提前中断(early break) 这个最隐蔽的路径也是 0——证明
drainThenCleanup的finally经由 async generator.return()正确触发清理。 - 取消语义保留:pre-aborted parent 调用
execute会 reject(child 带着 parent 的 reason abort)。 - 单测:
pipeline.test.ts55 全过;整个 openaiContentGenerator 套件 461 全过,无回归;eslint 通过。 - 逆向审计(测试通过 ≠ 测试正确):没有削弱断言——被删的
signal: parent断言是因为 SDK 现在正确收到 child signal,必须改;且流错误那条断言被加强了(从 catch 里挪出来,修掉了一个潜在的假通过)。
结论:真实 SDK 泄漏(40 次 = 40 个死 listener + 警告)被实际发布的代码在所有路径上彻底消除,取消语义不变,测试准确且绿,无回归,可安全合并。PR 描述与实测一致。
qwen-code-ci-bot
left a comment
There was a problem hiding this comment.
No review findings. Downgraded from Approve to Comment: CI still running. — qwen3.7-max via Qwen Code /review
tanzhenxin
left a comment
There was a problem hiding this comment.
Read the pipeline.ts diff — child-controller isolation is correct on both paths, and drainThenCleanup's try/finally covers normal completion, mid-stream errors, and early abandonment. Together with wenshao's measured A/B against the real SDK, this looks solid.
What this PR does
Wraps the
AbortSignalpassed to OpenAI SDKclient.chat.completions.create()in a per-requestcreateChildAbortController. This isolates the SDK's internal listener leak (no{once: true}, noremoveEventListenerinfetchWithTimeout) on a short-lived child signal that is aborted and cleaned up when the request completes.Why it's needed
Users in long sessions (especially CI-monitoring cron scenarios) hit
MaxListenersExceededWarning: Possible EventTarget memory leak detected. 3400+ abort listeners added to [AbortSignal]. PR #4366 fixed the agent-runtime parent→child listener chain but removed theraiseAbortListenerCapband-aid inpipeline.tswithout addressing the SDK-internal leak path. The OpenAI SDK v5.11.0fetchWithTimeout(client.mjs:324) doessignal.addEventListener('abort', () => controller.abort())without cleanup — every API call accumulates a dead listener on the caller's signal.Root Cause Analysis
The fix wraps each SDK call in a short-lived child signal:
Reviewer Test Plan
How to verify
Option 1: Unit tests (fast)
55 tests cover: child signal identity, parent abort propagation, child cleanup after full consumption, early break, and SDK
create()throw.Option 2: Reproduction script (demonstrates the mechanism)
Option 3: Full E2E (long session)
gh pr checksevery 60s)MaxListenersExceededWarningfloods stderr after ~30 roundsEvidence (Before & After)
Before (without fix) — SDK leaks directly on the long-lived parent signal:
After (with fix) — per-request child signal isolates the leak, reverse-cleanup keeps parent clean:
Unit tests:
Tested on
Environment (optional)
Unit tests via vitest. Reproduction script via Node.js v22.
Risk & Scope
AbortControllerallocation per SDK call (negligible cost vs. preventing unbounded listener growth).createChildAbortController.Linked Issues
Closes #4423
中文说明
这个 PR 做了什么
将传给 OpenAI SDK
client.chat.completions.create()的AbortSignal包装在一个 per-request 的createChildAbortController中。这将 SDK 内部的 listener 泄漏(fetchWithTimeout中没有{once: true},也没有removeEventListener)隔离在一个短生命周期的 child signal 上,请求完成后在finally中 abort 并清理。为什么需要
长时间会话(特别是 CI 监控 cron 场景)中用户会遇到
MaxListenersExceededWarning: Possible EventTarget memory leak detected. 3400+ abort listeners added to [AbortSignal]。PR #4366 修了 agent runtime 的 parent→child listener 链,但删除了pipeline.ts中的raiseAbortListenerCap补丁却没有处理 SDK 内部的泄漏路径。OpenAI SDK v5.11.0 的fetchWithTimeout(client.mjs:324)在每次 API 调用时对信号做addEventListener但不清理——每次调用都在 caller 的 signal 上累积一个死 listener。根因分析
pipeline.ts把长生命周期的 round signal 直接传给 SDK → SDKfetchWithTimeout每次 call 做addEventListener('abort', ...)不清理 → N 次调用 = N 个死 listenercreateChildAbortController包装一层短生命周期 child signal → SDK 的泄漏隔离在 child 上 → 请求完成后childAc.abort()触发反向清理 → parent signal listener 数始终为 0风险与范围
AbortController分配(对比防止无界 listener 增长,代价微不足道)