[core] Implementation of the tracing package#2
Conversation
…; providing a DefaultTracer client, while missing the HTTP implementation
| if !s.IsFinished() { | ||
| s.Duration = Now() - s.Start | ||
|
|
||
| go func() { |
There was a problem hiding this comment.
two issues with this:
- spawning a goroutine per finished span tons of overhead.
- as you commented, this could block forever, which is completely unacceptable for client side code. a channel is not the right tool for this because if a reader crashes a customer app will lock up forever.
i'd much prefer if we took this bit of code from dd-go/dogtrace. this is already running on high traffic production systems, so let's just use it instead of inventing something new. (except for the onFinish stuff, that's completely unnecessary)
There was a problem hiding this comment.
OK it was a proposal. I can update this code with the delivery process that we're using in the dogtrace
There was a problem hiding this comment.
Reading some stuff around, it seems I can easily solve the point 2 (blocked go-routine) with a timeout so that if the reader crashes (the reader is the one that sends all spans to the locale/remote agent) the span is lost and the go-routine is killed. I think it makes sense because if the go-routine that sends data is dead, it's OK to lose the span because no one is capable to send it somewhere.
On the other hand, if we're using the current approach in dogtrace (without a go-routine), our code will block as you describe because it should obtain a lock and it will surely add some delay (maybe it should wait for another Span that has the lock?). Furthermore if the sender dies, the shared data structures (the children []*Span in the Span or a common global outgoingSpans []*Span) will grow without control causing a memory leak.
About the go-routine overhead, I think I can profile it because it's a built-in functionality and reading to golang.org blog posts, they were created for similar cases.
By the way I have a free weekend so I want to play a little bit with this stuff 😄
If you agree, I prefer to revert this part at the end so that I can spend some time with benchmarks and experiments.
|
i think we should only create spans from a tracer. and never from other spans. we've had APIs like this before and it's more confusing (when do i call i think something like this is simpler: tracer := datadog.NewTracer()
// create a new root span with some info
span := tracer.NewSpan("whatever")
defer span.Finish()
span.Service = "foo"
span.Resource = "whatever"
// create a span from another
child := tracer.NewChildSpan("another", span)
defer child.Finish()
// or from a special `datadog_trace` variable in context
child2 := tracer.NewSpanFromContext("yahoo", context)
child2.Finish() |
|
Yeah I agree. Having an API where developers must always use the |
| import ( | ||
| "sync" | ||
|
|
||
| log "github.com/cihub/seelog" |
There was a problem hiding this comment.
just a remark, but we definitely don't want to pull this dep in the final package. Also, seelog is madness, it often shows up in a top 30 cpu profile in our apps 😱
There was a problem hiding this comment.
ah OK.... I've seen that in a lot of our projects. There is something else I can use?
There was a problem hiding this comment.
I wonder if we should use a logging lib at all, or just use the primitive "log" package in Go. Do we need the user to pull another dependency for logging if he already has one?
Also we should try to keep our logging at a minimum (except if the user toggles debug mode).
There was a problem hiding this comment.
Yes, totally agree about limit our logging to warnings/errors. About the logging lib I don't know... I mean if the primitive log package provides all that we need, I think we can replace everything with that.
ab40cc4 to
c975ff8
Compare
| "math/rand" | ||
| "time" | ||
|
|
||
| log "github.com/cihub/seelog" |
There was a problem hiding this comment.
@LeoCavaille what do you think about that library (https://github.com/op/go-logging)? Have you ever used it?
| Start int64 `json:"start"` // span start time expressed in nanoseconds since epoch | ||
| Duration int64 `json:"duration"` // duration of the span expressed in nanoseconds | ||
| Error int32 `json:"error"` // error status of the span; 0 means no errors | ||
| Meta map[string]string `json:"meta"` // arbitrary map of metadata |
There was a problem hiding this comment.
@LeoCavaille may I add the omitempty for Meta and Metrics? the agent is expecting the meta field in any case even if it's nil?
There was a problem hiding this comment.
if it's not in the payload, it will just set it to nil when decoding, so yea.
|
Side note: as a last step I'll implement a kind of |
| // it's better to be defensive and to produce a wrongly configured span | ||
| // that is not sent to the trace agent. | ||
| if parent == nil { | ||
| return newSpan(name, "", "", spanID, spanID, spanID, nil) |
There was a problem hiding this comment.
Not so happy about using nil in the tracer field... we could forget that when the client handles more things. At the same time I don't know if makes sense allowing users to have more than one tracer instances.
- Pros: I can have multiple
tracers for different sections so that I can enable/disable them individually - Cons: have to send
*tracer"around the code"
There was a problem hiding this comment.
Tracer could be an interfacer. CanonicalTracer could be the implementation you have now that is used by default. DevNullTracer could be an implementation that simply discards any spans given to it, and you could pass that in for the nil case. I think this should solve your problem, but at the expense of a layer of indirection in the code
There was a problem hiding this comment.
Yeah I was thinking the same.... I'll leave that as is for now but I want to implement your suggestion. I'm feeling safe if nil pointers don't travel around the code 😄
talwai
left a comment
There was a problem hiding this comment.
I have reviewed this PR and contributed my humble feedback
| // the child is finished but it's not recorded in | ||
| // the tracer buffer | ||
| assert.True(child.Duration > 0) | ||
| assert.Equal(len(tracer.finishedSpans), 0) |
| def go_benchmark(path) | ||
| sh "go test -run=NONE -bench=. -memprofile=mem.out #{path}" | ||
| sh "go test -run=NONE -bench=. -cpuprofile=cpu.out #{path}" | ||
| sh "go test -run=NONE -bench=. -blockprofile=block.out #{path}" |
There was a problem hiding this comment.
or run them all together ?
go test ... -memprofile=mem.out -cpuprofile=cpu.out -blockprofile=block.out ...
There was a problem hiding this comment.
Ah, I don't have a lot of experience on that. I split them because I've found in The Go programming language book such sentence:
Gathering a profile for code under test is as easy as enabling one fo the flags below [-cpuprofile, etc...]. Be careful when using more than one flag at a time, however: the machinery for gathering one kind of profile may skew the results of others.
Do you have more info on that? It could be possible that the info is outdated or that this kind of benchmark may be executed just once.
| // initialize the Tracer | ||
| t := &Tracer{ | ||
| transport: NewHTTPTransport(defaultDeliveryURL), | ||
| flushTicker: time.NewTicker(flushInterval), |
There was a problem hiding this comment.
you just use the ticker in one method, not sure it's necessary to make it an attribute of the Tracer? Just instantiate and use it in that method?
for range time.Tick()
There was a problem hiding this comment.
Yeah, totally correct and thanks for catching that. Totally forget to remove that because it was related to the previous approach (multiple channels and go routines that can be stopped). Furthermore our goroutine never stops so there isn't any kind of "leak" for loosing the reference.
c0b1074 to
78989e1
Compare
| @@ -0,0 +1,121 @@ | |||
| package tracer | |||
|
|
||
| // Tracer is the common struct we use to collect, buffer | ||
| type Tracer struct { | ||
| enabled int32 // acts as bool to define if the Tracer is enabled or not |
| import: | ||
| - package: github.com/stretchr/testify | ||
| version: ^1.1.3 | ||
| - package: github.com/cihub/seelog |
| // Mock Transport | ||
| type DummyTransport struct{} | ||
|
|
||
| func (t *DummyTransport) Send(spans []*Span) error { return nil } |
There was a problem hiding this comment.
the dummy transport should encode the spans to make sure the test data in sane
| parent := tracer.NewSpan("pylons.request", "pylons", "/") | ||
| child := tracer.NewChildSpan("redis.command", parent) | ||
| assert.Equal(child.ParentID, parent.SpanID) | ||
| assert.Equal(child.TraceID, parent.TraceID) |
There was a problem hiding this comment.
please test the right thigs are inherited
| } | ||
|
|
||
| // child that is correctly configured | ||
| return newSpan(name, parent.Service, parent.Resource, spanID, parent.TraceID, parent.SpanID, parent.tracer) |
There was a problem hiding this comment.
it shouldn't be inheriting the resource. only the service & ids (see python client)
|
|
||
| type datadogContextKey struct{} | ||
|
|
||
| var datadogActiveSpanKey = datadogContextKey{} |
There was a problem hiding this comment.
using an empty struct as the key is kinda crazy. is this idiomatic? why not "datadog_trace_span"
There was a problem hiding this comment.
Found that in the opentracing-go client. I'm not an expert here but yes.... think a string is enough?
| // Encode returns a byte array related to the marshalling | ||
| // of a list of spans. | ||
| func (e *JSONEncoder) Encode(spans []*Span) ([]byte, error) { | ||
| return json.Marshal(spans) |
There was a problem hiding this comment.
this is really slow let's use the encoder pool from dogtrace. been down this road :)
| return err | ||
| } | ||
|
|
||
| defer response.Body.Close() |
There was a problem hiding this comment.
no need to defer at the end of a function. needless perf cost. you're already at the end
This PR updates the behavior of `WithMaxQuerySize` when `max=0` to avoid attaching the query tag entirely. This is more intuitive ("max query size of zero") and gives folks a way to disable serializing the command entirely.
…SpanOperation (#4766) ### What does this PR do? Fixes a memory leak in AppSec when running under orchestrion, plus the related listener-cleanup asymmetry that compounds it. Two small, independent changes in two files: 1. **`ServiceEntrySpanOperation.Finish()` now calls `dyngo.FinishOperation`.** The op was pushed onto orchestrion's per-goroutine GLS stack by `StartAndRegisterOperation`, but its `Finish()` only flushed JSON tags and never invoked `dyngo.FinishOperation` — and `FinishOperation` is the **only** caller of `orchestrion.GLSPopValue`. On reused goroutines (HTTP keep-alive workers, gRPC handlers, worker pools), the parent op accumulated on the GLS stack one entry per request, retaining the span and the data-listener closures registered by `AppsecSpanTransport.OnServiceEntryStart`. The new call is **`defer`-ed**, not appended, so the GLS pop and `disable()` still run if a `TagSetter.SetTag` implementation panics during the tag-flush loop — otherwise the panic path would silently re-introduce the same leak. 2. **`operation.disable()` now also clears `dataBroadcaster.listeners`.** `disable()` cleared `eventRegister.listeners` but left its sibling `dataBroadcaster.listeners` populated — an asymmetry between two structurally identical listener maps, both populated by the same `addXListener` family. On ops that are properly finished, this only delays reclamation; combined with the GLS retention above, it amplifies the leak by keeping data-listener closures alive for every retained parent op. ### Motivation Fixes #4763. That issue covers fix (1) — the `ServiceEntrySpanOperation` GLS leak — and includes the same `MockGLS()`-based reproduction approach used in this PR's verification. This PR adds fix (2) on top: the `dataBroadcaster` cleanup asymmetry was found while tracing the same retention path and is responsible for amplifying the heap impact of (1). The two are independent — (1) removes the leak on the orchestrion path; (2) removes the asymmetry that would amplify any future regression of the same shape. ### Root cause walkthrough **Start path — both ops are registered:** ``` waf.StartContextOperation ├── trace.StartServiceEntrySpanOperation │ └── dyngo.StartAndRegisterOperation(entrySpanOp) → GLS push #1 (parent) └── dyngo.StartAndRegisterOperation(contextOp) → GLS push #2 (child) ``` **Finish path (before this PR) — only the child is unwound:** ```go // internal/appsec/emitter/waf/context.go func (op *ContextOperation) Finish() { dyngo.FinishOperation(op, ContextRes{}) // pops child from GLS op.ServiceEntrySpanOperation.Finish() // flushes tags only — no FinishOperation } ``` ```go // instrumentation/appsec/trace/service_entry_span.go (before) func (op *ServiceEntrySpanOperation) Finish() { // ranges over op.jsonTags and calls span.SetTag — does NOT call dyngo.FinishOperation } ``` Per request on the same goroutine, the GLS stack grows by one. `waf.ContextOperation.Finish()` is **unchanged** by this PR — its existing call sequence (pop child, then call parent's `Finish`) now correctly drives both lifecycles. ### Reproduction Two reproduction tests are not bundled in the PR (kept the diff minimal at 2 files / +16 / -1), but both pass at HEAD and fail when the corresponding fix is reverted. **Test 1 — GLS stack returns to baseline.** A note for reviewers running this locally: `orchestrion.Enabled()` is flipped by the orchestrion source-transform tool at build time, not by `-tags orchestrion`. Under plain `go test`, `CtxWithValue` therefore skips the GLS push branch and `GLSStackDepth()` always returns 0 — a test relying on the build tag alone would pass vacuously. The repo's `orchestrion.MockGLS()` (`internal/orchestrion/mock_gls.go`) installs a real `contextStack` and flips `enabled=true` for the duration of the test, which is what #4763's reproduction also uses. ```go package waf_test import ( "context" "testing" "github.com/DataDog/dd-trace-go/v2/instrumentation/appsec/trace" "github.com/DataDog/dd-trace-go/v2/internal/appsec/emitter/waf" "github.com/DataDog/dd-trace-go/v2/internal/orchestrion" ) func TestGLSStackReturnsToBaseline(t *testing.T) { defer orchestrion.MockGLS()() ctx := context.Background() baseline := orchestrion.GLSStackDepth() const iterations = 100 for i := 0; i < iterations; i++ { op, _ := waf.StartContextOperation(ctx, trace.NoopTagSetter{}) op.Finish() } if got := orchestrion.GLSStackDepth(); got != baseline { t.Errorf("GLS leaked: started at %d, now %d after %d cycles", baseline, got, iterations) } } ``` Observed when commit 1 of this PR is reverted: `GLS leaked: started at 0, now 100 after 100 cycles` — exactly +1 entry per request, matching the predicted leak rate. Stable at baseline with the fix applied. **Test 2 — `disable()` must clear `dataBroadcaster`.** ```go package dyngo import "testing" type repEvt struct{ val string } func TestDataBroadcasterNotClearedOnDisable(t *testing.T) { op := &operation{} addDataListener(&op.dataBroadcaster, DataListener[repEvt](func(e repEvt) {})) addDataListener(&op.dataBroadcaster, DataListener[repEvt](func(e repEvt) {})) addDataListener(&op.dataBroadcaster, DataListener[repEvt](func(e repEvt) {})) op.disable() op.dataBroadcaster.mu.RLock() defer op.dataBroadcaster.mu.RUnlock() total := 0 for _, ls := range op.dataBroadcaster.listeners { total += len(ls) } if total != 0 { t.Errorf("dataBroadcaster has %d listeners after disable(), want 0", total) } } ``` Observed when commit 2 is reverted: `dataBroadcaster has 3 listeners after disable(), want 0`. Passes at HEAD. Happy to include either or both tests in the PR if reviewers want — they were left out only to keep the diff narrowly scoped to the fix. ### Memory evidence On a long-lived process with AppSec + orchestrion enabled and steady HTTP traffic: - AppSec on, before fix: monotonic heap growth that scales with request count. - AppSec off: flat heap. - AppSec on, with both fixes: flat heap. Heap-live-size profile retention path matches the GLS analysis: ``` dyngo.addDataListener ← dyngo.OnData ← AppsecSpanTransport.OnServiceEntryStart ← dyngo.StartAndRegisterOperation (called once per HTTP request) ``` The path terminates in `StartAndRegisterOperation` with no matching `FinishOperation` — consistent with the GLS-pin hypothesis. ### Files touched - `instrumentation/appsec/trace/service_entry_span.go` — +9 / -1 - `instrumentation/appsec/dyngo/operation.go` — +7 / -0 Total: 2 files, +16 / -1. ### Reviewer's Checklist - [ ] Changed code has unit tests for its functionality at or near 100% coverage. - [ ] [System-Tests](https://github.com/DataDog/system-tests/) covering this feature have been added and enabled with the va.b.c-dev version tag. - [ ] There is a benchmark for any new code, or changes to existing code. - [ ] If this interacts with the agent in a new way, a system test has been added. - [x] New code is free of linting errors. You can check this by running `make lint` locally. - [x] New code doesn't break existing tests. You can check this by running `make test` locally. - [ ] Add an appropriate team label so this PR gets put in the right place for the release notes. - [x] All generated files are up to date. You can check this by running `make generate` locally. - [ ] Non-trivial go.mod changes, e.g. adding new modules, are reviewed by @DataDog/dd-trace-go-guild. Make sure all nested modules are up to date by running `make fix-modules` locally. Co-authored-by: e-n-0 <flavien.darche@datadoghq.com> Co-authored-by: dario.castane <dario.castane@datadoghq.com>
…SpanOperation (#4766) ### What does this PR do? Fixes a memory leak in AppSec when running under orchestrion, plus the related listener-cleanup asymmetry that compounds it. Two small, independent changes in two files: 1. **`ServiceEntrySpanOperation.Finish()` now calls `dyngo.FinishOperation`.** The op was pushed onto orchestrion's per-goroutine GLS stack by `StartAndRegisterOperation`, but its `Finish()` only flushed JSON tags and never invoked `dyngo.FinishOperation` — and `FinishOperation` is the **only** caller of `orchestrion.GLSPopValue`. On reused goroutines (HTTP keep-alive workers, gRPC handlers, worker pools), the parent op accumulated on the GLS stack one entry per request, retaining the span and the data-listener closures registered by `AppsecSpanTransport.OnServiceEntryStart`. The new call is **`defer`-ed**, not appended, so the GLS pop and `disable()` still run if a `TagSetter.SetTag` implementation panics during the tag-flush loop — otherwise the panic path would silently re-introduce the same leak. 2. **`operation.disable()` now also clears `dataBroadcaster.listeners`.** `disable()` cleared `eventRegister.listeners` but left its sibling `dataBroadcaster.listeners` populated — an asymmetry between two structurally identical listener maps, both populated by the same `addXListener` family. On ops that are properly finished, this only delays reclamation; combined with the GLS retention above, it amplifies the leak by keeping data-listener closures alive for every retained parent op. ### Motivation Fixes #4763. That issue covers fix (1) — the `ServiceEntrySpanOperation` GLS leak — and includes the same `MockGLS()`-based reproduction approach used in this PR's verification. This PR adds fix (2) on top: the `dataBroadcaster` cleanup asymmetry was found while tracing the same retention path and is responsible for amplifying the heap impact of (1). The two are independent — (1) removes the leak on the orchestrion path; (2) removes the asymmetry that would amplify any future regression of the same shape. ### Root cause walkthrough **Start path — both ops are registered:** ``` waf.StartContextOperation ├── trace.StartServiceEntrySpanOperation │ └── dyngo.StartAndRegisterOperation(entrySpanOp) → GLS push #1 (parent) └── dyngo.StartAndRegisterOperation(contextOp) → GLS push #2 (child) ``` **Finish path (before this PR) — only the child is unwound:** ```go // internal/appsec/emitter/waf/context.go func (op *ContextOperation) Finish() { dyngo.FinishOperation(op, ContextRes{}) // pops child from GLS op.ServiceEntrySpanOperation.Finish() // flushes tags only — no FinishOperation } ``` ```go // instrumentation/appsec/trace/service_entry_span.go (before) func (op *ServiceEntrySpanOperation) Finish() { // ranges over op.jsonTags and calls span.SetTag — does NOT call dyngo.FinishOperation } ``` Per request on the same goroutine, the GLS stack grows by one. `waf.ContextOperation.Finish()` is **unchanged** by this PR — its existing call sequence (pop child, then call parent's `Finish`) now correctly drives both lifecycles. ### Reproduction Two reproduction tests are not bundled in the PR (kept the diff minimal at 2 files / +16 / -1), but both pass at HEAD and fail when the corresponding fix is reverted. **Test 1 — GLS stack returns to baseline.** A note for reviewers running this locally: `orchestrion.Enabled()` is flipped by the orchestrion source-transform tool at build time, not by `-tags orchestrion`. Under plain `go test`, `CtxWithValue` therefore skips the GLS push branch and `GLSStackDepth()` always returns 0 — a test relying on the build tag alone would pass vacuously. The repo's `orchestrion.MockGLS()` (`internal/orchestrion/mock_gls.go`) installs a real `contextStack` and flips `enabled=true` for the duration of the test, which is what #4763's reproduction also uses. ```go package waf_test import ( "context" "testing" "github.com/DataDog/dd-trace-go/v2/instrumentation/appsec/trace" "github.com/DataDog/dd-trace-go/v2/internal/appsec/emitter/waf" "github.com/DataDog/dd-trace-go/v2/internal/orchestrion" ) func TestGLSStackReturnsToBaseline(t *testing.T) { defer orchestrion.MockGLS()() ctx := context.Background() baseline := orchestrion.GLSStackDepth() const iterations = 100 for i := 0; i < iterations; i++ { op, _ := waf.StartContextOperation(ctx, trace.NoopTagSetter{}) op.Finish() } if got := orchestrion.GLSStackDepth(); got != baseline { t.Errorf("GLS leaked: started at %d, now %d after %d cycles", baseline, got, iterations) } } ``` Observed when commit 1 of this PR is reverted: `GLS leaked: started at 0, now 100 after 100 cycles` — exactly +1 entry per request, matching the predicted leak rate. Stable at baseline with the fix applied. **Test 2 — `disable()` must clear `dataBroadcaster`.** ```go package dyngo import "testing" type repEvt struct{ val string } func TestDataBroadcasterNotClearedOnDisable(t *testing.T) { op := &operation{} addDataListener(&op.dataBroadcaster, DataListener[repEvt](func(e repEvt) {})) addDataListener(&op.dataBroadcaster, DataListener[repEvt](func(e repEvt) {})) addDataListener(&op.dataBroadcaster, DataListener[repEvt](func(e repEvt) {})) op.disable() op.dataBroadcaster.mu.RLock() defer op.dataBroadcaster.mu.RUnlock() total := 0 for _, ls := range op.dataBroadcaster.listeners { total += len(ls) } if total != 0 { t.Errorf("dataBroadcaster has %d listeners after disable(), want 0", total) } } ``` Observed when commit 2 is reverted: `dataBroadcaster has 3 listeners after disable(), want 0`. Passes at HEAD. Happy to include either or both tests in the PR if reviewers want — they were left out only to keep the diff narrowly scoped to the fix. ### Memory evidence On a long-lived process with AppSec + orchestrion enabled and steady HTTP traffic: - AppSec on, before fix: monotonic heap growth that scales with request count. - AppSec off: flat heap. - AppSec on, with both fixes: flat heap. Heap-live-size profile retention path matches the GLS analysis: ``` dyngo.addDataListener ← dyngo.OnData ← AppsecSpanTransport.OnServiceEntryStart ← dyngo.StartAndRegisterOperation (called once per HTTP request) ``` The path terminates in `StartAndRegisterOperation` with no matching `FinishOperation` — consistent with the GLS-pin hypothesis. ### Files touched - `instrumentation/appsec/trace/service_entry_span.go` — +9 / -1 - `instrumentation/appsec/dyngo/operation.go` — +7 / -0 Total: 2 files, +16 / -1. ### Reviewer's Checklist - [ ] Changed code has unit tests for its functionality at or near 100% coverage. - [ ] [System-Tests](https://github.com/DataDog/system-tests/) covering this feature have been added and enabled with the va.b.c-dev version tag. - [ ] There is a benchmark for any new code, or changes to existing code. - [ ] If this interacts with the agent in a new way, a system test has been added. - [x] New code is free of linting errors. You can check this by running `make lint` locally. - [x] New code doesn't break existing tests. You can check this by running `make test` locally. - [ ] Add an appropriate team label so this PR gets put in the right place for the release notes. - [x] All generated files are up to date. You can check this by running `make generate` locally. - [ ] Non-trivial go.mod changes, e.g. adding new modules, are reviewed by @DataDog/dd-trace-go-guild. Make sure all nested modules are up to date by running `make fix-modules` locally. Co-authored-by: e-n-0 <flavien.darche@datadoghq.com> Co-authored-by: dario.castane <dario.castane@datadoghq.com>
Core implementation to handle
spans and thetracerobjects. The following is a working example of our tracer package:What is missing?
tests are not pushed until this approach is acceptedthe dispatch implementationremove some parameters from theNewChildSpan()signature. We may just:handling thecontext.Contextobject so that thectxcan be passed as argument between functions. Actxmust be used to retrieve aspanif a parent is available.