Reduce per-connection memory by ~40% #1

Merged
dolonet merged 3 commits into master from optimize-per-connection-memory
Mar 27, 2026
Conversation


dolonet (Owner) commented Mar 27, 2026

Summary

  • Pool relay buffers via sync.Pool and reduce from 16 KB to 4 KB — stack-allocated arrays forced goroutine stacks to 32 KB (never shrink); pooled heap buffers keep stacks at 2–4 KB
  • Pool doppel record buffer and inline Clock timer into start() — eliminates one goroutine + one channel per connection
  • Replace ctx.Done() goroutines with context.AfterFunc in relay and proxy — saves ~4 KB per connection (two fewer goroutines)

Per-connection impact

Metric                        Before     After
Goroutines                    4–5        2–3
Goroutine stacks (total)      ~100 KB    ~12 KB
Estimated user-space memory   ~120 KB    ~50 KB

Production measurement (Amsterdam)

Old binary: 27 228 KB RSS @ ~45 connections → ~160 KB/conn
New binary: 25 664 KB RSS @ 61 connections → ~93 KB/conn

Observed reduction: ~42% per connection.

Safety

  • 4 KB relay buffer is safe: the TLS layer reassembles records internally, so smaller relay chunks add no extra syscalls
  • context.AfterFunc is semantically equivalent to the replaced goroutines
  • Clock inlining preserves backpressure: timer only resets after processing completes
  • All existing tests pass, including -race

Test plan

  • go vet ./...
  • go test ./... — all packages pass
  • go test -race ./mtglib/internal/relay/ ./mtglib/internal/doppel/ ./mtglib/ — no races
  • Cross-compile GOOS=linux GOARCH=amd64
  • Deploy to production on test port — API responds, accepts connections
  • Replace production binary — 60+ connections, 24 MB traffic, no errors after 5+ minutes

dolonet added 3 commits March 27, 2026 12:37
Replace stack-allocated 16 KB buffers in pump() with pooled 4 KB
slices. Stack-allocated arrays force goroutine stacks to grow to
32 KB and never shrink. Pooled heap buffers keep stacks at 2-4 KB.

4 KB is safe because the TLS layer handles record reassembly
internally — smaller relay chunks do not increase syscalls.

Replace the ctx.Done() cleanup goroutine with context.AfterFunc,
which avoids a dedicated goroutine during the relay lifetime.

Replace stack-allocated 16 KB buffer in start() with a pooled
slice, reducing the goroutine stack from 32 KB to 2-4 KB.

Merge Clock's timer goroutine directly into the start() loop,
eliminating one goroutine and one channel per connection. The
semantics are preserved: timer fires, data is processed, timer
resets. Backpressure works identically since the timer is not
reset until the current iteration completes.

Remove clock.go and clock_test.go — Clock behavior is covered
by the existing conn_test.go integration tests.

Replace the ctx.Done() goroutine in ServeConn with
context.AfterFunc. This eliminates a goroutine that was alive
for the entire connection duration, saving ~2 KB of stack per
connection. The AfterFunc callback only spawns a goroutine when
cancellation actually occurs.
dolonet force-pushed the optimize-per-connection-memory branch from 0aebb86 to 718dec0 on March 27, 2026 09:38
dolonet merged commit 85bbe17 into master on Mar 27, 2026
10 checks passed
dolonet deleted the optimize-per-connection-memory branch on April 1, 2026 12:17