Fix Session Leak in Transparent Proxy #4079
Conversation
Large PR Detected
This PR exceeds 1000 lines of changes and requires justification before it can be reviewed.
How to unblock this PR:
Add a section to your PR description with the following format:
## Large PR Justification
[Explain why this PR must be large, such as:]
- Generated code that cannot be split
- Large refactoring that must be atomic
- Multiple related changes that would break if separated
- Migration or data transformation

Alternative:
Consider splitting this PR into smaller, focused changes (< 1000 lines each) for easier review and reduced risk.
See our Contributing Guidelines for more details.
This review will be automatically dismissed once you add the justification section.
The transparent proxy creates sessions in its own session manager when it sees Mcp-Session-Id headers but never deletes them on DELETE requests. Sessions accumulated until TTL cleanup (1 hour), causing monotonic memory growth that OOMKilled the pod at 128Mi under sustained session churn.

Add DELETE session cleanup in RoundTrip so sessions are removed immediately on successful DELETE. Include unit tests and an investigation document with pprof methodology and before/after comparison.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
a3e0a0d to 67d4c17
✅ PR size has been reduced below the XL threshold. The size review has been dismissed and this PR can now proceed with normal review. Thank you for splitting this up!
Signed-off-by: Chris Burns <29541485+ChrisJBurns@users.noreply.github.com>
Codecov Report

❌ Patch coverage is
Additional details and impacted files

@@ Coverage Diff @@
## main #4079 +/- ##
========================================
Coverage 68.68% 68.68%
========================================
Files 454 458 +4
Lines 46027 46235 +208
========================================
+ Hits 31612 31755 +143
- Misses 11977 12035 +58
- Partials 2438 2445 +7

☔ View full report in Codecov by Sentry.
amirejaz left a comment
This is a nice improvement. It would be great if we also deleted the session when receiving a 404 on a request that included an `Mcp-Session-Id` header, because the upstream is telling us the session is gone.
When the upstream returns 404 for a DELETE, the session is already gone on the server side. Keeping a local reference would only waste memory until TTL cleanup. Expand the cleanup condition to cover both 2xx and 404 responses, and add a 500 test case to verify transient errors still preserve the session. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
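The expanded condition described above can be sketched as a small predicate. The helper name `shouldRemoveSession` is hypothetical; the real check lives inline in the transparent proxy's `RoundTrip`:

```go
package main

import (
	"fmt"
	"net/http"
)

// shouldRemoveSession sketches the expanded cleanup rule: remove the
// proxy's local session when a DELETE carrying an Mcp-Session-Id gets a
// 2xx (upstream confirmed deletion) or a 404 (upstream says the session
// is already gone). Transient errors such as 500 keep the session so the
// client can retry the DELETE.
func shouldRemoveSession(method, sessionID string, status int) bool {
	if method != http.MethodDelete || sessionID == "" {
		return false
	}
	success := status >= 200 && status < 300
	return success || status == http.StatusNotFound
}

func main() {
	fmt.Println(shouldRemoveSession("DELETE", "s1", 204)) // true
	fmt.Println(shouldRemoveSession("DELETE", "s1", 404)) // true
	fmt.Println(shouldRemoveSession("DELETE", "s1", 500)) // false
}
```

The 500 case is exactly what the added test verifies: a transient upstream failure must not drop local state the client may still need.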
Summary
Add DELETE session cleanup in the transparent proxy's `RoundTrip` method so sessions are removed immediately on successful DELETE, keeping memory flat regardless of session throughput.

Fixes #4062
Special notes for reviewers
- The transparent proxy creates sessions in its own session manager when it sees `Mcp-Session-Id` headers but had no code to delete them on DELETE. The streamable proxy's session manager was not affected.
- During investigation we also tested replacing `sync.Map` with `map + sync.RWMutex` in `LocalStorage`. While valid, the data showed it is not required — the DELETE fix alone keeps memory flat. The `sync.Map` replacement may be considered as a separate defense-in-depth measure.

In-Depth Breakdown
Issue: #4062
Date: March 2026
Component: `thv-proxyrunner` (streamable-http transport)

Problem
The proxyrunner container was being OOMKilled (128Mi limit) under sustained session churn. Even with clean session lifecycles (initialize → notify initialized → tools/list → echo x2 →
DELETE), memory grew monotonically and was never reclaimed, eventually crashing the pod.
Root Cause
The transparent proxy (`pkg/transport/proxy/transparent/transparent_proxy.go`) creates a session in its own `session.Manager` whenever it sees an `Mcp-Session-Id` header in a response. However, it had no code to delete sessions when a DELETE request was forwarded.

Sessions were only cleaned up by the TTL cleanup routine, which runs every `TTL/2` (1 hour with the default 2-hour TTL). During sustained load, tens of thousands of session objects accumulated in the transparent proxy's session manager, growing memory monotonically until the pod was OOMKilled.

In pprof output this manifested as growing allocations from:
- `session.NewProxySession` — the session objects held in the map
- `internal/sync.newEntryNode` — `sync.Map` internal hash table entry nodes
- `internal/sync.newIndirectNode` — `sync.Map` internal hash table indirect nodes

These were initially suspected to be a `sync.Map` data structure issue, but testing proved the `sync.Map` growth was a symptom of sessions never being deleted — not a problem with `sync.Map` itself. When sessions are deleted promptly, the `sync.Map` only holds a small number of concurrent entries and its bucket structure stays bounded.

Debugging with pprof
We used Go's built-in heap profiler to inspect live allocations on the running proxyrunner pod. The pprof endpoint was exposed at
`:6060` via a debug server added to the proxyrunner.

Enabling pprof in the proxyrunner
Apply this diff to `cmd/thv-proxyrunner/main.go`, then port-forward to the proxyrunner pod to access the endpoint locally:
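The diff itself was not preserved in this excerpt; a minimal sketch of such a debug server, assuming the standard `net/http/pprof` side-effect import (function names here are illustrative):

```go
package main

import (
	"fmt"
	"net/http"
	_ "net/http/pprof" // side effect: registers /debug/pprof/* on http.DefaultServeMux
	"time"
)

// startDebugServer serves the default mux, on which the blank import
// above has registered the /debug/pprof handlers. Matching the write-up,
// it listens on :6060.
func startDebugServer(addr string) {
	go func() {
		// The real code should log this error instead of discarding it.
		_ = http.ListenAndServe(addr, nil)
	}()
}

// fetchHeapStatus polls the heap endpoint until the server is up and
// returns the HTTP status code (0 if it never came up).
func fetchHeapStatus(base string) int {
	for i := 0; i < 50; i++ {
		resp, err := http.Get(base + "/debug/pprof/heap?debug=1")
		if err == nil {
			resp.Body.Close()
			return resp.StatusCode
		}
		time.Sleep(20 * time.Millisecond)
	}
	return 0
}

func main() {
	startDebugServer("localhost:6060")
	fmt.Println(fetchHeapStatus("http://localhost:6060")) // 200 once the server is listening
}
```

`?debug=1` returns the heap profile as plain text, which is handy for a quick sanity check before pulling binary profiles.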
Commands
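The command block here was stripped; these are illustrative equivalents for reaching the pod's pprof endpoint (the pod name is a placeholder):

```shell
# Forward the pod's pprof port to localhost. Pod name is illustrative.
kubectl port-forward pod/<proxyrunner-pod> 6060:6060

# In another terminal, confirm the endpoint answers.
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:6060/debug/pprof/
```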
Saving profiles and generating graphs
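The saved-profile commands were also stripped; a hedged reconstruction using the standard `go tool pprof` flags referenced below (`-http`, `-base`), with hypothetical snapshot filenames:

```shell
# Capture two heap snapshots some time apart.
curl -s http://localhost:6060/debug/pprof/heap > heap-before.pprof
curl -s http://localhost:6060/debug/pprof/heap > heap-after.pprof

# Interactive web UI at http://localhost:8081 (graph, top, flame graph views).
go tool pprof -http=:8081 heap-after.pprof

# Diff mode: show only what grew relative to the base snapshot.
go tool pprof -http=:8081 -base heap-before.pprof heap-after.pprof
```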
The `-http=:8081` flag launches an interactive web UI at `http://localhost:8081` with multiple views, including a top view (matching the `-top` CLI output).

When comparing profiles with `-base`, the graph highlights only the difference between the two snapshots — useful for isolating what grew between two points in time.

Key pprof columns
- `flat` — memory allocated by the function itself
- `flat%` — `flat` as a percentage of the total
- `sum%` — running total of `flat%` down the list
- `cum` — memory allocated by the function plus everything it calls
- `cum%` — `cum` as a percentage of the total

A note on GC and pprof snapshots
Some allocations (particularly `ReverseProxy.copyBuffer`, `bufio.NewReaderSize`, `bufio.NewWriterSize`, and `net/textproto.readMIMEHeader`) appeared and disappeared between pprof snapshots. These are transient in-flight request buffers that get cleaned up when GC runs. For example, `ReverseProxy.copyBuffer` showed 2.6 MB in one snapshot and dropped to 528 KB in the next — this is normal GC behavior reclaiming short-lived allocations between cycles.

The critical distinction: these transient allocations are bounded by the number of concurrent in-flight requests and fluctuate around a stable level. The session-related allocations (`NewProxySession`, `newEntryNode`, `newIndirectNode`) were monotonically increasing — they never dropped between snapshots regardless of GC activity, because sessions were never deleted from the transparent proxy's session manager.

Test methodology
All measurements were taken using the `memory_test.go` e2e test (`test/e2e/thv-operator/virtualmcp/memory_test.go`) running clean session lifecycles: initialize → DELETE. The test used 50 concurrent workers with HTTP keep-alives enabled, targeting a proxyrunner pod with a 128Mi memory limit.

Testing Code
The test was run against two configurations to isolate the impact of the fix:
Before fix
Heap growth
Session-related allocations are the combined flat allocations from:
- `session.NewProxySession` — session objects held in the transparent proxy's session manager
- `internal/sync.newEntryNode` — `sync.Map` internal hash table entry nodes
- `internal/sync.newIndirectNode` — `sync.Map` internal hash table indirect nodes

pprof output at ~5k sessions
pprof output at ~25k sessions
`NewProxySession` grew from 1.5 MB to 4 MB between snapshots — these are session objects accumulating in the transparent proxy's session manager because DELETE never removes them.

(Notice `session.NewProxySession`, `sync.newEntryNode` and `sync.newIndirectNode`.)

After fix
Heap growth
Comparison
pprof output at ~5k sessions
No `NewProxySession` or `newEntryNode` — only a single 512 KB `newIndirectNode` that did not grow further.

pprof output at ~25k sessions
Zero session-related allocations in the top 20 at 25k sessions. All entries are init-time or transient in-flight request buffers.
Fix
File: `pkg/transport/proxy/transparent/transparent_proxy.go`

Added session cleanup in the `RoundTrip` method, before the existing response processing. When a DELETE request with an `Mcp-Session-Id` header receives a 2xx response from the backend, the transparent proxy now immediately removes its session — instead of waiting up to 2 hours for TTL cleanup.
sync.Mapvs map + RWMutex: During investigation we also tested replacingsync.Mapwithmap + sync.RWMutexinLocalStorage. While this is a valid improvement (regular maps release memory more eagerly on delete), the data showed it is not required to fix the OOM — the transparent proxy DELETE fix alone keeps memory flat. Thesync.Mapreplacement may be considered as a defense-in-depth measure.OOMKillederrors very early on in the testing (~2000 sessions).Generated with Claude Code