fix: fatal hard-crash on loop detection via unhandled AbortError#20108
fix: fatal hard-crash on loop detection via unhandled AbortError#20108scidomino merged 3 commits intogoogle-gemini:mainfrom
Conversation
Summary of ChangesHello @hsm207, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses a critical hard-crash issue within the GeminiClient that occurred when the loop detection service attempted to stop a streaming response. The crash stemmed from a race condition involving AbortError and the @google/genai SDK's async iterator, leading to an unhandled exception. The changes implement a more robust abort signal handling strategy and refine the loop detection termination process, ensuring that the application gracefully exits without crashing when repetitive content is identified. Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request addresses a critical crash caused by a race condition in the Gemini SDK's signal handling by transitioning to manual signal linking and a more controlled exit from the streaming loop. However, the manual signal linking in GeminiClient.processTurn has several critical flaws. It fails to properly handle already-aborted signals, leading to resource and memory leaks, and the fix for the fatal crash is incomplete, as parent signal aborts can still trigger the crash during stream suspensions. Additionally, there's a risk of leaking event listeners if errors occur during turn initialization. The review comments provide detailed feedback on ensuring the internal controller is aborted for already-aborted signals, improving listener cleanup robustness, and fully mitigating the crash.
| const controller = new AbortController(); | ||
| const linkedSignal = AbortSignal.any([signal, controller.signal]); | ||
| const onAbort = () => { | ||
| try { | ||
| controller.abort(); | ||
| } catch (e) { | ||
| debugLogger.debug( | ||
| 'Error aborting internal controller from parent signal', | ||
| e, | ||
| ); | ||
| } | ||
| }; | ||
| signal.addEventListener('abort', onAbort, { once: true }); |
There was a problem hiding this comment.
The manual linking of AbortSignal fails to check if the parent signal is already aborted, causing the internal AbortController to never abort. This leads to unnecessary LLM requests (turn.run), resource consumption, and potential session hangs. Crucially, the fix for the fatal crash is incomplete; if the parent signal is aborted during a yield suspension, controller.abort() can still trigger the uncatchable fatal crash this PR aims to resolve. Additionally, there is a significant risk of leaking the onAbort listener on the parent signal object if an error occurs during the initialization phase (e.g., in this.loopDetector.turnStarted, router.route, or this.setTools) before the try block at line 706 is reached.
const controller = new AbortController();
const onAbort = () => {
try {
controller.abort();
} catch (e) {
debugLogger.debug(
'Error aborting internal controller from parent signal',
e,
);
}
};
if (signal.aborted) {
onAbort();
} else {
signal.addEventListener('abort', onAbort, { once: true });
}References
- Asynchronous operations that can be cancelled by the user should accept and propagate an
AbortSignalto ensure cancellability and prevent dangling processes or network requests. The current implementation's failure to properly link and handleAbortSignalleads to resource consumption and potential session hangs, directly contradicting the goal of proper cancellability.
There was a problem hiding this comment.
Thanks for the feedback! I've refined the implementation to ensure signal linking happens within the safety of the try-catch block and added explicit pre-abort checks to prevent any potential leaks or unhandled exceptions from the underlying SDK's custom iterator.
47454ed to
c76e6b7
Compare
48bf197 to
4d4e86a
Compare
|
@bdmorgan can i get some human feedback about this pr, please? |
23e8820 to
300d1a1
Compare
bac0406 to
fa68c03
Compare
| private readonly connection: acp.AgentSideConnection, | ||
| private readonly sessionId: string, | ||
| private readonly capabilities: acp.FileSystemCapabilities, | ||
| private readonly capabilities: acp.FileSystemCapability, |
There was a problem hiding this comment.
Is this intended? how is this related to the fix?
| @@ -0,0 +1,43 @@ | |||
| /** | |||
There was a problem hiding this comment.
I'm not a fan of adding a single new test file. Can this not fit into an existing test file somewhere?
| isInvalidStreamRetry: boolean, | ||
| displayContent?: PartListUnion, | ||
| controllerToAbort?: AbortController, | ||
| _controllerToAbort?: AbortController, |
There was a problem hiding this comment.
If the controller is no longer needed, it should not be an argument at all.
|
@scidomino Thanks for the review. The upstream has evolved significantly since the original PR was opened, rendering many of the previous implementation details (and related comments) obsolete. I have redone the fix against the latest Current Behavior:
I've cleaned up the code and removed the unused parameters as suggested. |
c44a95a to
8dde7e7
Compare
8dde7e7 to
35c6578
Compare
|
ok. you only have two issues:
Make sure |
35c6578 to
c24dad4
Compare
|
Done! addressed both points:
Verified that |
7ae5207 to
38fc039
Compare
|
You had changes to package-lock that I reverted. I will merge now. Please don't update the code as that will require rerunning the tests. |
ff28d55
…gle-gemini#20108) Co-authored-by: Tommaso Sciortino <sciortino@gmail.com>
Summary
This PR fixes a fatal hard-crash in the
GeminiClientthat occurs when the loop detection service identifies repetitive content and attempts to stop the stream. It resolves a synchronousAbortErrorrace condition that can crash the Node.js process by bypassing high-level error handling.Details
The crash is caused by a race condition within the
@google/genaiSDK's async iterator when an abort is triggered during yield suspensions. To resolve this, theGeminiClientnow immediatelybreaks the streaming loop upon detection of a loop or reaching a turn limit. This ensures the system exits the unmonitored SDK state safely before any abort-related cleanup occurs.Key Changes:
processTurnto immediatelybreakthefor awaitloop when a loop is detected.AbortControllerandlinkedSignaldeclarations to simplify the logic.packages/core/src/services/loopDetectionService.test.ts.Related Issues
Fixes #20106.
Related to:
How to Validate
npm installnpm run buildnpm start -- -p "Repeat the following sequence exactly 50 times: la li lu le lo"Expected Result: The loop is detected, the generation stops gracefully, and the application exits with Code 0 instead of crashing with a stack trace.
Pre-Merge Checklist
loopDetectionService.test.ts)