chore(telemetry): extend process tracking to spawned processes#16842
gh-worker-dd-mergequeue-cf854d[bot] merged 18 commits into
✅ Tests: 🎉 All green! ❄️ No new flaky tests detected. 🔗 Commit SHA: a02f32e
Performance SLOs

Comparing candidate munir/send-app-start-closed-on-root (a02f32e) with baseline main (c076c6a)

🟡 Near SLO Breach (5 suites)

🟡 djangosimple - 30/30

| Scenario | Time | Time SLO (margin) | Time vs baseline | Memory | Memory SLO (margin) | Memory vs baseline |
|---|---|---|---|---|---|---|
| appsec | ✅ 21.036ms | <22.300ms (-5.7%) | -0.2% | ✅ 71.378MB | <73.500MB (-2.9%) | +5.3% |
| exception-replay-enabled | ✅ 1.369ms | <1.450ms (-5.6%) | ~same | ✅ 69.600MB | <71.500MB (-2.7%) | +5.2% |
| iast | ✅ 21.070ms | <22.250ms (-5.3%) | -0.3% | ✅ 71.335MB | <75.000MB (-4.9%) | +5.3% |
| profiler | ✅ 15.280ms | <16.550ms (-7.7%) | +0.9% | ✅ 60.155MB | <61.000MB (🟡 -1.4%) | +5.7% |
| resource-renaming | ✅ 20.746ms | <21.750ms (-4.6%) | -0.4% | ✅ 71.370MB | <73.500MB (-2.9%) | +5.2% |
| span-code-origin | ✅ 21.439ms | <28.200ms (📉 -24.0%) | +0.9% | ✅ 71.429MB | <75.000MB (-4.8%) | +5.3% |
| tracer | ✅ 21.020ms | <21.750ms (-3.4%) | -0.4% | ✅ 71.255MB | <75.000MB (-5.0%) | +5.1% |
| tracer-and-profiler | ✅ 21.124ms | <23.500ms (📉 -10.1%) | +0.5% | ✅ 73.305MB | <75.000MB (-2.3%) | +5.4% |
| tracer-dont-create-db-spans | ✅ 21.161ms | <21.500ms (🟡 -1.6%) | +0.5% | ✅ 71.357MB | <75.000MB (-4.9%) | +5.2% |
| tracer-minimal | ✅ 17.878ms | <18.500ms (-3.4%) | -0.5% | ✅ 71.283MB | <75.000MB (-5.0%) | +5.1% |
| tracer-native | ✅ 20.930ms | <21.750ms (-3.8%) | +0.1% | ✅ 71.298MB | <72.500MB (🟡 -1.7%) | +5.2% |
| tracer-no-caches | ✅ 18.909ms | <19.650ms (-3.8%) | +0.6% | ✅ 71.333MB | <75.000MB (-4.9%) | +5.2% |
| tracer-no-databases | ✅ 20.690ms | <21.100ms (🟡 -1.9%) | +0.2% | ✅ 71.333MB | <75.000MB (-4.9%) | +5.3% |
| tracer-no-middleware | ✅ 20.734ms | <21.500ms (-3.6%) | +0.1% | ✅ 71.296MB | <75.000MB (-4.9%) | +5.1% |
| tracer-no-templates | ✅ 20.918ms | <22.000ms (-4.9%) | +1.0% | ✅ 71.355MB | <73.500MB (-2.9%) | +5.3% |

🟡 otelspan - 22/22

| Scenario | Time | Time SLO (margin) | Time vs baseline | Memory | Memory SLO (margin) | Memory vs baseline |
|---|---|---|---|---|---|---|
| add-event | ✅ 40.764ms | <47.150ms (📉 -13.5%) | -0.8% | ✅ 41.284MB | <47.000MB (📉 -12.2%) | +5.5% |
| add-metrics | ✅ 236.380ms | <344.800ms (📉 -31.4%) | ~same | ✅ 45.689MB | <47.500MB (-3.8%) | +5.0% |
| add-tags | ✅ 278.378ms | <330.000ms (📉 -15.6%) | +2.2% | ✅ 45.798MB | <47.500MB (-3.6%) | +5.6% |
| get-context | ✅ 83.988ms | <92.350ms (-9.1%) | +0.2% | ✅ 41.406MB | <46.500MB (📉 -11.0%) | +5.2% |
| is-recording | ✅ 39.099ms | <44.500ms (📉 -12.1%) | -1.3% | ✅ 41.053MB | <47.500MB (📉 -13.6%) | +5.4% |
| record-exception | ✅ 61.030ms | <67.650ms (-9.8%) | ~same | ✅ 41.727MB | <47.000MB (📉 -11.2%) | +5.3% |
| set-status | ✅ 45.164ms | <50.400ms (📉 -10.4%) | ~same | ✅ 41.110MB | <47.000MB (📉 -12.5%) | +5.5% |
| start | ✅ 40.133ms | <44.500ms (-9.8%) | +3.7% | ✅ 41.206MB | <47.000MB (📉 -12.3%) | +5.7% |
| start-finish | ✅ 90.215ms | <91.000ms (🟡 -0.9%) | ~same | ✅ 38.732MB | <46.500MB (📉 -16.7%) | +5.2% |
| start-finish-telemetry | ✅ 91.586ms | <92.000ms (🟡 -0.4%) | -0.4% | ✅ 38.712MB | <46.500MB (📉 -16.7%) | +5.1% |
| update-name | ✅ 40.055ms | <45.150ms (📉 -11.3%) | -1.3% | ✅ 41.245MB | <47.000MB (📉 -12.2%) | +5.3% |

🟡 recursivecomputation - 8/8

| Scenario | Time | Time SLO (margin) | Time vs baseline | Memory | Memory SLO (margin) | Memory vs baseline |
|---|---|---|---|---|---|---|
| deep | ✅ 312.191ms | <320.950ms (-2.7%) | ~same | ✅ 37.336MB | <38.750MB (-3.6%) | +5.1% |
| deep-profiled | ✅ 329.372ms | <359.150ms (-8.3%) | -0.1% | ✅ 43.706MB | <46.000MB (-5.0%) | +5.6% |
| medium | ✅ 7.412ms | <7.450ms (🟡 -0.5%) | -0.3% | ✅ 36.235MB | <38.000MB (-4.6%) | +5.5% |
| shallow | ✅ 1.050ms | <1.050ms (🟡 ~same) | +1.5% | ✅ 36.215MB | <38.000MB (-4.7%) | +5.1% |

🟡 span - 26/26

| Scenario | Time | Time SLO (margin) | Time vs baseline | Memory | Memory SLO (margin) | Memory vs baseline |
|---|---|---|---|---|---|---|
| add-event | ✅ 19.585ms | <22.500ms (📉 -13.0%) | -1.4% | ✅ 38.456MB | <53.000MB (📉 -27.4%) | +5.6% |
| add-metrics | ✅ 89.420ms | <93.500ms (-4.4%) | +0.7% | ✅ 42.900MB | <53.000MB (📉 -19.1%) | +5.3% |
| add-tags | ✅ 148.016ms | <155.000ms (-4.5%) | +0.2% | ✅ 42.953MB | <53.000MB (📉 -19.0%) | +5.4% |
| get-context | ✅ 18.748ms | <20.500ms (-8.5%) | -1.0% | ✅ 38.341MB | <53.000MB (📉 -27.7%) | +5.5% |
| is-recording | ✅ 18.938ms | <20.500ms (-7.6%) | -1.8% | ✅ 38.303MB | <53.000MB (📉 -27.7%) | +5.6% |
| record-exception | ✅ 38.569ms | <41.000ms (-5.9%) | -0.5% | ✅ 38.787MB | <53.000MB (📉 -26.8%) | +5.3% |
| set-status | ✅ 20.646ms | <22.000ms (-6.2%) | -1.0% | ✅ 38.433MB | <53.000MB (📉 -27.5%) | +6.0% |
| start | ✅ 19.732ms | <20.500ms (-3.7%) | +3.8% | ✅ 38.249MB | <53.000MB (📉 -27.8%) | +5.2% |
| start-finish | ✅ 58.048ms | <58.500ms (🟡 -0.8%) | -0.4% | ✅ 36.156MB | <38.000MB (-4.9%) | +5.1% |
| start-finish-telemetry | ✅ 59.239ms | <60.000ms (🟡 -1.3%) | -0.6% | ✅ 36.235MB | <38.000MB (-4.6%) | +5.3% |
| start-finish-traceid128 | ✅ 60.602ms | <62.000ms (-2.3%) | -0.3% | ✅ 36.215MB | <38.000MB (-4.7%) | +5.5% |
| start-traceid128 | ✅ 18.686ms | <22.500ms (📉 -16.9%) | -2.4% | ✅ 38.319MB | <53.000MB (📉 -27.7%) | +5.5% |
| update-name | ✅ 19.309ms | <22.000ms (📉 -12.2%) | -1.4% | ✅ 38.329MB | <53.000MB (📉 -27.7%) | +5.1% |

🟡 tracer - 6/6

| Scenario | Time | Time SLO (margin) | Time vs baseline | Memory | Memory SLO (margin) | Memory vs baseline |
|---|---|---|---|---|---|---|
| large | ✅ 33.089ms | <32.950ms (+0.4%) | -0.4% | ✅ 37.827MB | <39.250MB (-3.6%) | +6.4% |
| medium | ✅ 3.336ms | <3.500ms (-4.7%) | ~same | ✅ 36.255MB | <38.750MB (-6.4%) | +5.6% |
| small | ✅ 385.864µs | <390.000µs (🟡 -1.1%) | +3.5% | ✅ 36.156MB | <38.750MB (-6.7%) | +5.3% |
Co-authored-by: Munir Abdinur <munir.abdinur@datadoghq.com>
…start-closed-on-root
## Summary

Implements the [Stable Service Instance Identifier RFC](https://docs.google.com/document/d/1ECKj9_NnwaKYtFqm3p3Rlpicx5d-OQcdj9kI2jvRqVU) for Go instrumentation telemetry.

- **`DD-Session-ID`**: always present on every telemetry request, set to the current `runtime_id`
- **`DD-Root-Session-ID`**: present only in child processes, inherited via the `_DD_ROOT_GO_SESSION_ID` env var. Omitted when equal to the session ID; the backend infers root = self when it is absent
- **Auto-propagation**: `globalconfig.init()` sets `_DD_ROOT_GO_SESSION_ID` in `os.Environ()` so child processes spawned via `os/exec` inherit it automatically without any user-side calls

## Changes

- `internal/globalconfig/globalconfig.go`: adds `rootSessionID` field, `init()` reads/sets `_DD_ROOT_GO_SESSION_ID` (internal env var, not in supported_configurations), `RootSessionID()` getter
- `internal/telemetry/internal/writer.go`: adds `DD-Session-ID` (always) and `DD-Root-Session-ID` (child processes only) to pre-baked telemetry headers
- Tests for both globalconfig (including cross-process propagation) and writer

## Related

- System-tests PR: DataDog/system-tests#6510
- Node.js PR: DataDog/dd-trace-js#7821
- dd-trace-py fork tracking: DataDog/dd-trace-py#16839
- dd-trace-py spawn tracking: DataDog/dd-trace-py#16842

Co-authored-by: ayan.khan <ayan.khan@datadoghq.com>
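The header rule in the Summary (session ID always sent, root session ID sent only when it differs) can be sketched as follows. The referenced change is written in Go; Python is used here purely for illustration, and `build_session_headers` is a hypothetical name, not a function from either tracer.

```python
from typing import Dict, Optional


def build_session_headers(runtime_id: str, root_session_id: Optional[str]) -> Dict[str, str]:
    """Return the session headers for one telemetry request (illustrative only)."""
    # DD-Session-ID is always present and carries the current runtime ID.
    headers = {"DD-Session-ID": runtime_id}
    # DD-Root-Session-ID is added only when an inherited root exists and
    # differs from this process's own ID; otherwise the backend is expected
    # to infer that this process is its own root.
    if root_session_id and root_session_id != runtime_id:
        headers["DD-Root-Session-ID"] = root_session_id
    return headers
```

A root process would call `build_session_headers("abc", None)` and send a single header, while a child that inherited root `"abc"` and has runtime ID `"def"` would send both.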
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b384eb7bb1
## Description
Extends process lineage tracking to exec-based child processes (`subprocess`, `multiprocessing` spawn). Fork support was added in a previous PR; this covers the remaining spawning mechanism.
`subprocess.Popen.__init__` is now patched unconditionally (independent of ASM) to inject `_DD_ROOT_PY_SESSION_ID` and `_DD_PARENT_PY_SESSION_ID` into the child's environment. The child reads these at module load
time to seed `get_ancestor_runtime_id()` and `get_parent_runtime_id()`. Disable via `DD_TRACE_SUBPROCESS_ENABLED=false`.
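The parent-side injection described above might look roughly like the sketch below. It is not the ddtrace implementation: `lineage_env` and `spawn_with_lineage` are hypothetical helpers, and only the two environment variable names come from this PR. The sketch assumes the root ID is the inherited ancestor ID when one exists and the current runtime ID otherwise.

```python
import os
import subprocess
from typing import Dict, Mapping, Optional

ROOT_VAR = "_DD_ROOT_PY_SESSION_ID"
PARENT_VAR = "_DD_PARENT_PY_SESSION_ID"


def lineage_env(
    runtime_id: str,
    ancestor_id: Optional[str] = None,
    base_env: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Copy an environment and add the two lineage variables (hypothetical helper)."""
    env = dict(os.environ if base_env is None else base_env)
    # The root stays the oldest known ancestor; a root process names itself.
    env[ROOT_VAR] = ancestor_id or runtime_id
    # The parent is always the process doing the spawning.
    env[PARENT_VAR] = runtime_id
    return env


def spawn_with_lineage(args, runtime_id: str, ancestor_id: Optional[str] = None, env=None, **kwargs):
    """Start a child whose environment carries the lineage variables."""
    return subprocess.Popen(args, env=lineage_env(runtime_id, ancestor_id, env), **kwargs)
```

Copying the environment rather than mutating `os.environ` keeps the parent's own environment unchanged, which is one possible design choice; the PR itself does not state which approach it takes.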
**Key changes:**
- `runtime/__init__.py` — adds env var name constants, seeds module state from them at import, exposes `get_session_env_vars()`
- `subprocess/patch.py` — moves `Popen.__init__`/`Popen.wait` wrapping before the ASM gate; injects lineage env vars unconditionally in `_traced_subprocess_init`
- `telemetry/writer.py` — replaces `forksafe.is_fork_child()` with `get_parent_runtime_id() is not None`, which now correctly identifies both forked and exec-spawned children
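On the child side, the seeding and the new child check listed above could be sketched like this. The function names `read_lineage` and `is_child_process` are assumptions made for illustration; only the environment variable names and the `is not None` check are taken from the description.

```python
import os
from typing import Mapping, Optional, Tuple

ROOT_VAR = "_DD_ROOT_PY_SESSION_ID"
PARENT_VAR = "_DD_PARENT_PY_SESSION_ID"


def read_lineage(environ: Optional[Mapping[str, str]] = None) -> Tuple[Optional[str], Optional[str]]:
    """Return (ancestor_id, parent_id) as seen by a freshly started process."""
    environ = os.environ if environ is None else environ
    # Empty strings are treated like missing values.
    ancestor = environ.get(ROOT_VAR) or None
    parent = environ.get(PARENT_VAR) or None
    return ancestor, parent


def is_child_process(parent_runtime_id: Optional[str]) -> bool:
    """A process with a known parent runtime ID is a child, however it was started."""
    return parent_runtime_id is not None
```

Because the check depends only on whether a parent ID is known, it gives the same answer for a forked child (ID set in memory) and an exec-spawned child (ID read from the environment).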
## Testing
`test_subprocess_session_lineage_env_vars` — parametrized over `DD_TRACE_SUBPROCESS_ENABLED=true/false/unset`, runs under `ddtrace-run`, spawns a `ddtrace-run` child, and verifies the child's
`get_parent_runtime_id()` / `get_ancestor_runtime_id()` match the parent's runtime ID when enabled and are `None` when disabled.
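The shape of that check can be approximated without ddtrace, as a rough sketch only. Here `run_child` is a hypothetical helper, the `enabled` flag stands in for `DD_TRACE_SUBPROCESS_ENABLED`, and the child simply echoes the raw environment variables instead of calling the runtime getters.

```python
import json
import os
import subprocess
import sys

ROOT_VAR = "_DD_ROOT_PY_SESSION_ID"
PARENT_VAR = "_DD_PARENT_PY_SESSION_ID"

# The child reports what it inherited as a JSON list: [root, parent].
CHILD_CODE = (
    "import json, os; "
    "print(json.dumps([os.environ.get('_DD_ROOT_PY_SESSION_ID'), "
    "os.environ.get('_DD_PARENT_PY_SESSION_ID')]))"
)


def run_child(parent_runtime_id: str, enabled: bool):
    """Spawn a Python child and return the lineage values it observed."""
    # Start from a clean slate so values inherited by this process do not leak in.
    env = {k: v for k, v in os.environ.items() if k not in (ROOT_VAR, PARENT_VAR)}
    if enabled:
        env[ROOT_VAR] = parent_runtime_id
        env[PARENT_VAR] = parent_runtime_id
    out = subprocess.check_output([sys.executable, "-c", CHILD_CODE], env=env, text=True)
    return json.loads(out)
```

With injection enabled the child should report the parent's ID for both values; with it disabled both should come back as `None`.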
## Risks
`Popen.__init__` is now patched whenever ddtrace is loaded, not only when ASM is enabled. Spawned processes will receive two extra `_DD_`-prefixed env vars. These are ignored by non-ddtrace processes so the blast
radius is minimal.
Co-authored-by: munir.abdinur <munir.abdinur@datadoghq.com>