Skip to content
This repository was archived by the owner on May 24, 2026. It is now read-only.

workaround: avoid SDK cancellation serialization bug#320

Merged
PureWeen merged 3 commits into
mainfrom
fix/sdk-cancellation-serialization-workaround
Mar 9, 2026
Merged

workaround: avoid SDK cancellation serialization bug#320
PureWeen merged 3 commits into
mainfrom
fix/sdk-cancellation-serialization-workaround

Conversation

@PureWeen

@PureWeen PureWeen commented Mar 9, 2026

Copy link
Copy Markdown
Owner

Summary

Pass \CancellationToken.None\ to SDK \SendAsync\ calls to prevent \StreamJsonRpc's \StandardCancellationStrategy\ from attempting to serialize \RequestId, which is not included in the SDK's AOT-safe JSON context.

Fixes #319

Root Cause (SDK Bug)

When a \CancellationToken\ fires during an SDK call:

  1. \StreamJsonRpc.StandardCancellationStrategy.CancelOutboundRequest()\ tries to send a $/cancelRequest\ notification
  2. This requires serializing \StreamJsonRpc.RequestId\
  3. The SDK's \SystemTextJsonFormatter\ doesn't include \RequestId\ in its \TypeInfoResolverChain\
  4. \NotSupportedException\ is thrown, becomes unobserved, kills the JSON-RPC reader loop silently
  5. Session gets stuck — no events arrive, watchdog eventually fires

Why This Workaround Works

PolyPilot already handles cancellation at the TCS level:

  • \AbortSessionAsync\ calls \ResponseCompletion.TrySetCanceled()\
  • Watchdog timeout clears \IsProcessing\ state
  • All abort/error paths properly clean up state

The \CancellationToken\ was only used by SDK to send server-side cancellation notifications — which is the broken path.

Trade-off

The server won't receive $/cancelRequest\ notifications, so it may continue processing cancelled requests. However:

  • Local state is correctly cleaned up (good UX)
  • Watchdog provides safety net
  • Server resources are wasted but operations complete gracefully

Investigation Summary

4 parallel sub-agents analyzed:

Agent Focus Finding
SDK changelogs Breaking changes None documented for cancellation
SDK Client.cs JsonRpc config Confirms \StandardCancellationStrategy\ used, \RequestId\ missing from JSON context
PolyPilot audit CancellationToken usage Our usage is correct
SDK samples Recommended patterns Don't use CancellationToken at all

All agents agree: This is an SDK bug, not a PolyPilot defect.

Changes

  • CopilotService.cs: 2 lines changed
    • Primary \SendAsync\ call (line 2351)
    • Reconnect retry \SendAsync\ call (line 2499)

Testing

  • Build passes
  • 139 watchdog tests pass
  • Manual testing: orchestrator sessions no longer get stuck on cancellation

SDK Versions Affected

  • Broken: 0.1.30+ (after AOT/trimming changes)
  • Works: 0.1.26

Co-authored-by: Copilot 223556219+Copilot@users.noreply.github.com

Pass CancellationToken.None to SDK SendAsync calls to prevent
StreamJsonRpc's StandardCancellationStrategy from attempting to
serialize RequestId, which is not included in the SDK's AOT-safe
JSON context.

## Root Cause (SDK Bug)

When a CancellationToken fires during an SDK call:
1. StreamJsonRpc's StandardCancellationStrategy.CancelOutboundRequest()
   tries to send a \$/cancelRequest notification
2. This requires serializing StreamJsonRpc.RequestId
3. The SDK's SystemTextJsonFormatter doesn't include RequestId in its
   TypeInfoResolverChain (ClientJsonContext, TypesJsonContext, etc.)
4. NotSupportedException is thrown, becomes unobserved, kills the
   JSON-RPC reader loop silently
5. Session gets stuck — no events arrive, watchdog eventually fires

## Why This Workaround Works

PolyPilot already handles cancellation at the TCS level:
- AbortSessionAsync calls ResponseCompletion.TrySetCanceled()
- Watchdog timeout clears IsProcessing state
- All abort/error paths properly clean up state

The CancellationToken was only used by SDK to send server-side
cancellation notifications — which is the broken path.

## Trade-off

The server won't receive \$/cancelRequest notifications, so it may
continue processing cancelled requests. However:
- Local state is correctly cleaned up (good UX)
- Watchdog provides safety net
- Server resources are wasted but operations complete gracefully

## Investigation

4 sub-agents analyzed: SDK changelogs (no breaking changes), SDK
Client.cs (confirms missing RequestId), PolyPilot usage (correct),
SDK samples (don't use CancellationToken). All confirmed SDK bug.

Fixes #319

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen PureWeen force-pushed the fix/sdk-cancellation-serialization-workaround branch from fe42b32 to 366fbe3 Compare March 9, 2026 03:37
PureWeen and others added 2 commits March 8, 2026 23:50
- CRITICAL: Fix missed call site in PolyPilot.Console/Models/AgentSession.cs
  The Console app was also vulnerable to the RequestId serialization crash.

- MODERATE: Add comment about watchdog limitation when SendAsync blocks
  If the transport write itself blocks (half-open TCP), watchdog completes
  the TCS but execution never reaches it. Edge-case on mobile/codespace.

- MINOR: Add explanatory comment to soft-steer path (line 2702)
  Functionally safe (implicit CancellationToken.None) but prevents
  future maintainers from accidentally reintroducing the bug.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tion

If SendAsync throws (transport error, connection dropped, etc.),
execution exits without resetting IsProcessing. Unlike CopilotService,
AgentSession has no watchdog — only event handlers reset the flag.
If SendAsync throws before sending, no events arrive, and IsProcessing
stays true forever → all subsequent calls throw 'already processing'.

Fix: Wrap SendAsync and TCS await in try/catch that resets IsProcessing
on any exception before re-throwing.

Note: The TrySetCanceled() path self-heals because the server continues
processing (CancellationToken.None) and eventually fires SessionIdleEvent.
The SendAsync exception path does not self-heal without this fix.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@PureWeen PureWeen merged commit 3bc8ea7 into main Mar 9, 2026
@PureWeen PureWeen deleted the fix/sdk-cancellation-serialization-workaround branch March 9, 2026 05:02
arisng pushed a commit to arisng/PolyPilot that referenced this pull request Apr 4, 2026
## Summary

Pass \CancellationToken.None\ to SDK \SendAsync\ calls to prevent
\StreamJsonRpc\'s \StandardCancellationStrategy\ from attempting to
serialize \RequestId\, which is not included in the SDK's AOT-safe JSON
context.

Fixes PureWeen#319

## Root Cause (SDK Bug)

When a \CancellationToken\ fires during an SDK call:
1. \StreamJsonRpc.StandardCancellationStrategy.CancelOutboundRequest()\
tries to send a \$/cancelRequest\ notification
2. This requires serializing \StreamJsonRpc.RequestId\
3. The SDK's \SystemTextJsonFormatter\ doesn't include \RequestId\ in
its \TypeInfoResolverChain\
4. \NotSupportedException\ is thrown, becomes unobserved, kills the
JSON-RPC reader loop silently
5. Session gets stuck — no events arrive, watchdog eventually fires

## Why This Workaround Works

PolyPilot already handles cancellation at the TCS level:
- \AbortSessionAsync\ calls \ResponseCompletion.TrySetCanceled()\
- Watchdog timeout clears \IsProcessing\ state
- All abort/error paths properly clean up state

The \CancellationToken\ was only used by SDK to send server-side
cancellation notifications — which is the broken path.

## Trade-off

The server won't receive \$/cancelRequest\ notifications, so it may
continue processing cancelled requests. However:
- Local state is correctly cleaned up (good UX)
- Watchdog provides safety net
- Server resources are wasted but operations complete gracefully

## Investigation Summary

4 parallel sub-agents analyzed:

| Agent | Focus | Finding |
|-------|-------|---------|
| SDK changelogs | Breaking changes | None documented for cancellation |
| SDK Client.cs | JsonRpc config | Confirms
\StandardCancellationStrategy\ used, \RequestId\ missing from JSON
context |
| PolyPilot audit | CancellationToken usage | Our usage is correct |
| SDK samples | Recommended patterns | Don't use CancellationToken at
all |

**All agents agree**: This is an SDK bug, not a PolyPilot defect.

## Changes

- **CopilotService.cs**: 2 lines changed
  - Primary \SendAsync\ call (line 2351)
  - Reconnect retry \SendAsync\ call (line 2499)

## Testing

- Build passes
- 139 watchdog tests pass
- Manual testing: orchestrator sessions no longer get stuck on
cancellation

## SDK Versions Affected

- Broken: 0.1.30+ (after AOT/trimming changes)
- Works: 0.1.26

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Investigate Copilot SDK cancellation serialization bug

1 participant