Skip to content

fix(appsec): plug GLS + dataBroadcaster listener leak in ServiceEntrySpanOperation#4766

Merged
gh-worker-dd-mergequeue-cf854d[bot] merged 6 commits into
DataDog:mainfrom
aqaurius6666:main
May 19, 2026
Merged

fix(appsec): plug GLS + dataBroadcaster listener leak in ServiceEntrySpanOperation#4766
gh-worker-dd-mergequeue-cf854d[bot] merged 6 commits into
DataDog:mainfrom
aqaurius6666:main

Conversation

@aqaurius6666

@aqaurius6666 aqaurius6666 commented May 15, 2026

Copy link
Copy Markdown
Contributor

What does this PR do?

Fixes a memory leak in AppSec when running under orchestrion, plus the related listener-cleanup asymmetry that compounds it. Two small, independent changes in two files:

  1. ServiceEntrySpanOperation.Finish() now calls dyngo.FinishOperation.
    The op was pushed onto orchestrion's per-goroutine GLS stack by StartAndRegisterOperation, but its Finish() only flushed JSON tags and never invoked dyngo.FinishOperation — and FinishOperation is the only caller of orchestrion.GLSPopValue. On reused goroutines (HTTP keep-alive workers, gRPC handlers, worker pools), the parent op accumulated on the GLS stack one entry per request, retaining the span and the data-listener closures registered by AppsecSpanTransport.OnServiceEntryStart.

    The new call is defer-ed, not appended, so the GLS pop and disable() still run if a TagSetter.SetTag implementation panics during the tag-flush loop — otherwise the panic path would silently re-introduce the same leak.

  2. operation.disable() now also clears dataBroadcaster.listeners.
    disable() cleared eventRegister.listeners but left its sibling dataBroadcaster.listeners populated — an asymmetry between two structurally identical listener maps, both populated by the same addXListener family. On ops that are properly finished, this only delays reclamation; combined with the GLS retention above, it amplifies the leak by keeping data-listener closures alive for every retained parent op.

Motivation

Fixes #4763.

That issue covers fix (1) — the ServiceEntrySpanOperation GLS leak — and includes the same MockGLS()-based reproduction approach used in this PR's verification.

This PR adds fix (2) on top: the dataBroadcaster cleanup asymmetry was found while tracing the same retention path and is responsible for amplifying the heap impact of (1). The two are independent — (1) removes the leak on the orchestrion path; (2) removes the asymmetry that would amplify any future regression of the same shape.

Root cause walkthrough

Start path — both ops are registered:

waf.StartContextOperation
 ├── trace.StartServiceEntrySpanOperation
 │     └── dyngo.StartAndRegisterOperation(entrySpanOp)   → GLS push #1 (parent)
 └── dyngo.StartAndRegisterOperation(contextOp)            → GLS push #2 (child)

Finish path (before this PR) — only the child is unwound:

// internal/appsec/emitter/waf/context.go
func (op *ContextOperation) Finish() {
    dyngo.FinishOperation(op, ContextRes{})   // pops child from GLS
    op.ServiceEntrySpanOperation.Finish()      // flushes tags only — no FinishOperation
}
// instrumentation/appsec/trace/service_entry_span.go (before)
func (op *ServiceEntrySpanOperation) Finish() {
    // ranges over op.jsonTags and calls span.SetTag — does NOT call dyngo.FinishOperation
}

Per request on the same goroutine, the GLS stack grows by one. waf.ContextOperation.Finish() is unchanged by this PR — its existing call sequence (pop child, then call parent's Finish) now correctly drives both lifecycles.

Reproduction

Two reproduction tests are not bundled in the PR (kept the diff minimal at 2 files / +16 / -1), but both pass at HEAD and fail when the corresponding fix is reverted.

Test 1 — GLS stack returns to baseline.

A note for reviewers running this locally: orchestrion.Enabled() is flipped by the orchestrion source-transform tool at build time, not by -tags orchestrion. Under plain go test, CtxWithValue therefore skips the GLS push branch and GLSStackDepth() always returns 0 — a test relying on the build tag alone would pass vacuously. The repo's orchestrion.MockGLS() (internal/orchestrion/mock_gls.go) installs a real contextStack and flips enabled=true for the duration of the test, which is what #4763's reproduction also uses.

package waf_test

import (
    "context"
    "testing"

    "github.com/DataDog/dd-trace-go/v2/instrumentation/appsec/trace"
    "github.com/DataDog/dd-trace-go/v2/internal/appsec/emitter/waf"
    "github.com/DataDog/dd-trace-go/v2/internal/orchestrion"
)

func TestGLSStackReturnsToBaseline(t *testing.T) {
    defer orchestrion.MockGLS()()

    ctx := context.Background()
    baseline := orchestrion.GLSStackDepth()

    const iterations = 100
    for i := 0; i < iterations; i++ {
        op, _ := waf.StartContextOperation(ctx, trace.NoopTagSetter{})
        op.Finish()
    }

    if got := orchestrion.GLSStackDepth(); got != baseline {
        t.Errorf("GLS leaked: started at %d, now %d after %d cycles", baseline, got, iterations)
    }
}

Observed when commit 1 of this PR is reverted: GLS leaked: started at 0, now 100 after 100 cycles — exactly +1 entry per request, matching the predicted leak rate. Stable at baseline with the fix applied.

Test 2 — disable() must clear dataBroadcaster.

package dyngo

import "testing"

type repEvt struct{ val string }

func TestDataBroadcasterNotClearedOnDisable(t *testing.T) {
    op := &operation{}
    addDataListener(&op.dataBroadcaster, DataListener[repEvt](func(e repEvt) {}))
    addDataListener(&op.dataBroadcaster, DataListener[repEvt](func(e repEvt) {}))
    addDataListener(&op.dataBroadcaster, DataListener[repEvt](func(e repEvt) {}))

    op.disable()

    op.dataBroadcaster.mu.RLock()
    defer op.dataBroadcaster.mu.RUnlock()
    total := 0
    for _, ls := range op.dataBroadcaster.listeners {
        total += len(ls)
    }
    if total != 0 {
        t.Errorf("dataBroadcaster has %d listeners after disable(), want 0", total)
    }
}

Observed when commit 2 is reverted: dataBroadcaster has 3 listeners after disable(), want 0. Passes at HEAD.

Happy to include either or both tests in the PR if reviewers want — they were left out only to keep the diff narrowly scoped to the fix.

Memory evidence

On a long-lived process with AppSec + orchestrion enabled and steady HTTP traffic:

  • AppSec on, before fix: monotonic heap growth that scales with request count.
  • AppSec off: flat heap.
  • AppSec on, with both fixes: flat heap.

Heap-live-size profile retention path matches the GLS analysis:

dyngo.addDataListener
  ← dyngo.OnData
    ← AppsecSpanTransport.OnServiceEntryStart
      ← dyngo.StartAndRegisterOperation               (called once per HTTP request)

The path terminates in StartAndRegisterOperation with no matching FinishOperation — consistent with the GLS-pin hypothesis.

Files touched

  • instrumentation/appsec/trace/service_entry_span.go — +9 / -1
  • instrumentation/appsec/dyngo/operation.go — +7 / -0

Total: 2 files, +16 / -1.

Reviewer's Checklist

  • Changed code has unit tests for its functionality at or near 100% coverage.
  • System-Tests covering this feature have been added and enabled with the va.b.c-dev version tag.
  • There is a benchmark for any new code, or changes to existing code.
  • If this interacts with the agent in a new way, a system test has been added.
  • New code is free of linting errors. You can check this by running make lint locally.
  • New code doesn't break existing tests. You can check this by running make test locally.
  • Add an appropriate team label so this PR gets put in the right place for the release notes.
  • All generated files are up to date. You can check this by running make generate locally.
  • Non-trivial go.mod changes, e.g. adding new modules, are reviewed by @DataDog/dd-trace-go-guild. Make sure all nested modules are up to date by running make fix-modules locally.

aqaurius6666 and others added 2 commits May 14, 2026 09:30
operation.disable() cleared eventRegister.listeners but never its sibling
dataBroadcaster.listeners — an asymmetry that left data listener closures
(and any per-request state they captured) reachable until whole-op GC. On
ops that are properly finished this just delays reclamation, but combined
with the GLS retention on ServiceEntrySpanOperation it amplifies the
appsec+orchestrion heap leak.

Add the missing dataBroadcaster.clear() sibling method and call it from
disable() so both listener maps are released on the same path.
ServiceEntrySpanOperation is registered into orchestrion's GLS by
StartAndRegisterOperation but its Finish() only flushed JSON tags — it
never called dyngo.FinishOperation. On reused goroutines (HTTP worker
pools built with the orchestrion tag), the GLS stack permanently
retained one *ServiceEntrySpanOperation per request along with the 3
data listeners registered by AppsecSpanTransport.OnServiceEntryStart,
producing monotonic heap growth.

Encapsulate the dyngo lifecycle inside ServiceEntrySpanOperation.Finish()
via a deferred dyngo.FinishOperation call. The defer also guards against
a misbehaving TagSetter.SetTag panicking during the tag-flush loop:
without it, the GLS pop and disable() would be skipped on the panic path,
re-introducing the leak.

Add ServiceEntrySpanRes result type to satisfy dyngo's type constraint.
No changes required in waf.ContextOperation.Finish — its existing call
sequence is preserved and now correctly drives the parent's lifecycle.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@aqaurius6666 aqaurius6666 requested a review from a team as a code owner May 15, 2026 02:36
@aqaurius6666 aqaurius6666 changed the title tmp fix(appsec): plug GLS + dataBroadcaster listener leak in ServiceEntrySpanOperation May 15, 2026
@darccio darccio added the AI Assisted AI/LLM assistance used in this PR (partially or fully) label May 15, 2026
@e-n-0 e-n-0 requested a review from RomainMuller May 18, 2026 15:38
@codecov

codecov Bot commented May 18, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 85.71429% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 62.52%. Comparing base (ca307a0) to head (70a7d49).
⚠️ Report is 4 commits behind head on main.

Files with missing lines Patch % Lines
instrumentation/appsec/trace/service_entry_span.go 50.00% 1 Missing ⚠️
Additional details and impacted files
Files with missing lines Coverage Δ
instrumentation/appsec/dyngo/operation.go 83.46% <100.00%> (-5.88%) ⬇️
instrumentation/appsec/trace/service_entry_span.go 18.75% <50.00%> (+18.75%) ⬆️

... and 282 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@datadog-datadog-prod-us1-2

datadog-datadog-prod-us1-2 Bot commented May 19, 2026

Copy link
Copy Markdown

Pipelines

Fix all issues with BitsAI

⚠️ Warnings

🚦 3 Pipeline jobs failed

Code Freeze Check | code_freeze_check   View in Datadog   GitHub Actions

🛟 This job is unlikely to succeed on retry. Please review your pipeline configuration. Missing required workflow permissions. Ensure 'id-token: write' is set.

Static Checks | lint   View in Datadog   GitHub Actions

🛟 This job is unlikely to succeed on retry. Please review your pipeline configuration. Missing required workflow permissions. Ensure 'id-token: write' is set.

System Tests | System Tests   View in Datadog   GitHub Actions

ℹ️ Info

🎯 Code Coverage (details)
Patch Coverage: 83.33%
Overall Coverage: 61.86% (-0.01%)

Useful? React with 👍 / 👎

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 70a7d49 | Docs | Datadog PR Page | Give us feedback!

@darccio darccio left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@gh-worker-dd-mergequeue-cf854d gh-worker-dd-mergequeue-cf854d Bot merged commit fff4209 into DataDog:main May 19, 2026
289 of 297 checks passed
hannahkm pushed a commit that referenced this pull request May 22, 2026
…SpanOperation (#4766)

### What does this PR do?

Fixes a memory leak in AppSec when running under orchestrion, plus the related listener-cleanup asymmetry that compounds it. Two small, independent changes in two files:

1. **`ServiceEntrySpanOperation.Finish()` now calls `dyngo.FinishOperation`.**
   The op was pushed onto orchestrion's per-goroutine GLS stack by `StartAndRegisterOperation`, but its `Finish()` only flushed JSON tags and never invoked `dyngo.FinishOperation` — and `FinishOperation` is the **only** caller of `orchestrion.GLSPopValue`. On reused goroutines (HTTP keep-alive workers, gRPC handlers, worker pools), the parent op accumulated on the GLS stack one entry per request, retaining the span and the data-listener closures registered by `AppsecSpanTransport.OnServiceEntryStart`.

   The new call is **`defer`-ed**, not appended, so the GLS pop and `disable()` still run if a `TagSetter.SetTag` implementation panics during the tag-flush loop — otherwise the panic path would silently re-introduce the same leak.

2. **`operation.disable()` now also clears `dataBroadcaster.listeners`.**
   `disable()` cleared `eventRegister.listeners` but left its sibling `dataBroadcaster.listeners` populated — an asymmetry between two structurally identical listener maps, both populated by the same `addXListener` family. On ops that are properly finished, this only delays reclamation; combined with the GLS retention above, it amplifies the leak by keeping data-listener closures alive for every retained parent op.

### Motivation

Fixes #4763.

That issue covers fix (1) — the `ServiceEntrySpanOperation` GLS leak — and includes the same `MockGLS()`-based reproduction approach used in this PR's verification.

This PR adds fix (2) on top: the `dataBroadcaster` cleanup asymmetry was found while tracing the same retention path and is responsible for amplifying the heap impact of (1). The two are independent — (1) removes the leak on the orchestrion path; (2) removes the asymmetry that would amplify any future regression of the same shape.

### Root cause walkthrough

**Start path — both ops are registered:**

```
waf.StartContextOperation
 ├── trace.StartServiceEntrySpanOperation
 │     └── dyngo.StartAndRegisterOperation(entrySpanOp)   → GLS push #1 (parent)
 └── dyngo.StartAndRegisterOperation(contextOp)            → GLS push #2 (child)
```

**Finish path (before this PR) — only the child is unwound:**

```go
// internal/appsec/emitter/waf/context.go
func (op *ContextOperation) Finish() {
    dyngo.FinishOperation(op, ContextRes{})   // pops child from GLS
    op.ServiceEntrySpanOperation.Finish()      // flushes tags only — no FinishOperation
}
```

```go
// instrumentation/appsec/trace/service_entry_span.go (before)
func (op *ServiceEntrySpanOperation) Finish() {
    // ranges over op.jsonTags and calls span.SetTag — does NOT call dyngo.FinishOperation
}
```

Per request on the same goroutine, the GLS stack grows by one. `waf.ContextOperation.Finish()` is **unchanged** by this PR — its existing call sequence (pop child, then call parent's `Finish`) now correctly drives both lifecycles.

### Reproduction

Two reproduction tests are not bundled in the PR (kept the diff minimal at 2 files / +16 / -1), but both pass at HEAD and fail when the corresponding fix is reverted.

**Test 1 — GLS stack returns to baseline.**

A note for reviewers running this locally: `orchestrion.Enabled()` is flipped by the orchestrion source-transform tool at build time, not by `-tags orchestrion`. Under plain `go test`, `CtxWithValue` therefore skips the GLS push branch and `GLSStackDepth()` always returns 0 — a test relying on the build tag alone would pass vacuously. The repo's `orchestrion.MockGLS()` (`internal/orchestrion/mock_gls.go`) installs a real `contextStack` and flips `enabled=true` for the duration of the test, which is what #4763's reproduction also uses.

```go
package waf_test

import (
    "context"
    "testing"

    "github.com/DataDog/dd-trace-go/v2/instrumentation/appsec/trace"
    "github.com/DataDog/dd-trace-go/v2/internal/appsec/emitter/waf"
    "github.com/DataDog/dd-trace-go/v2/internal/orchestrion"
)

func TestGLSStackReturnsToBaseline(t *testing.T) {
    defer orchestrion.MockGLS()()

    ctx := context.Background()
    baseline := orchestrion.GLSStackDepth()

    const iterations = 100
    for i := 0; i < iterations; i++ {
        op, _ := waf.StartContextOperation(ctx, trace.NoopTagSetter{})
        op.Finish()
    }

    if got := orchestrion.GLSStackDepth(); got != baseline {
        t.Errorf("GLS leaked: started at %d, now %d after %d cycles", baseline, got, iterations)
    }
}
```

Observed when commit 1 of this PR is reverted: `GLS leaked: started at 0, now 100 after 100 cycles` — exactly +1 entry per request, matching the predicted leak rate. Stable at baseline with the fix applied.

**Test 2 — `disable()` must clear `dataBroadcaster`.**

```go
package dyngo

import "testing"

type repEvt struct{ val string }

func TestDataBroadcasterNotClearedOnDisable(t *testing.T) {
    op := &operation{}
    addDataListener(&op.dataBroadcaster, DataListener[repEvt](func(e repEvt) {}))
    addDataListener(&op.dataBroadcaster, DataListener[repEvt](func(e repEvt) {}))
    addDataListener(&op.dataBroadcaster, DataListener[repEvt](func(e repEvt) {}))

    op.disable()

    op.dataBroadcaster.mu.RLock()
    defer op.dataBroadcaster.mu.RUnlock()
    total := 0
    for _, ls := range op.dataBroadcaster.listeners {
        total += len(ls)
    }
    if total != 0 {
        t.Errorf("dataBroadcaster has %d listeners after disable(), want 0", total)
    }
}
```

Observed when commit 2 is reverted: `dataBroadcaster has 3 listeners after disable(), want 0`. Passes at HEAD.

Happy to include either or both tests in the PR if reviewers want — they were left out only to keep the diff narrowly scoped to the fix.

### Memory evidence

On a long-lived process with AppSec + orchestrion enabled and steady HTTP traffic:

- AppSec on, before fix: monotonic heap growth that scales with request count.
- AppSec off: flat heap.
- AppSec on, with both fixes: flat heap.

Heap-live-size profile retention path matches the GLS analysis:

```
dyngo.addDataListener
  ← dyngo.OnData
    ← AppsecSpanTransport.OnServiceEntryStart
      ← dyngo.StartAndRegisterOperation               (called once per HTTP request)
```

The path terminates in `StartAndRegisterOperation` with no matching `FinishOperation` — consistent with the GLS-pin hypothesis.

### Files touched

- `instrumentation/appsec/trace/service_entry_span.go` — +9 / -1
- `instrumentation/appsec/dyngo/operation.go` — +7 / -0

Total: 2 files, +16 / -1.

### Reviewer's Checklist

- [ ] Changed code has unit tests for its functionality at or near 100% coverage.
- [ ] [System-Tests](https://github.com/DataDog/system-tests/) covering this feature have been added and enabled with the va.b.c-dev version tag.
- [ ] There is a benchmark for any new code, or changes to existing code.
- [ ] If this interacts with the agent in a new way, a system test has been added.
- [x] New code is free of linting errors. You can check this by running `make lint` locally.
- [x] New code doesn't break existing tests. You can check this by running `make test` locally.
- [ ] Add an appropriate team label so this PR gets put in the right place for the release notes.
- [x] All generated files are up to date. You can check this by running `make generate` locally.
- [ ] Non-trivial go.mod changes, e.g. adding new modules, are reviewed by @DataDog/dd-trace-go-guild. Make sure all nested modules are up to date by running `make fix-modules` locally.


Co-authored-by: e-n-0 <flavien.darche@datadoghq.com>
Co-authored-by: dario.castane <dario.castane@datadoghq.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

AI Assisted AI/LLM assistance used in this PR (partially or fully) mergequeue-status: done

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG]: Memory leak; ServiceEntrySpanOperation.Finish() never calls dyngo.FinishOperation, GLS entry per request

5 participants