Conversation
Only continue tracing in TaskManager if a parent APM trace context exists. Trace headers might be set by external transactions as well. If present, it caused APMTracer to report a transaction for every span when in fact being sampled out. Correctly report the duration of transactions if not recording by not discarding the root span immediately. These transactions might still get reported, but without spans. Discard transient trace start time in `newTraceContext` when a parent APM trace context already exists. If propagated, all spans would start at the same time. Relates to ES-13389
|
Pinging @elastic/es-core-infra (Team:Core/Infra) |
Turns out this might be unnecessary (for serverless at least), the APM agent behavior depends on the apm-server version. For recent apm-server versions such transactions are not reported of unsampled. |
jdconrad
left a comment
There was a problem hiding this comment.
LGTM! Just a couple of questions.
modules/apm/src/test/java/org/elasticsearch/telemetry/apm/internal/tracing/APMTracerTests.java
Outdated
Show resolved
Hide resolved
| final Object previousTraceContext = newTransientHeaders.remove(Task.APM_TRACE_CONTEXT); | ||
| if (previousTraceContext != null) { | ||
| newTransientHeaders.put(Task.PARENT_APM_TRACE_CONTEXT, previousTraceContext); | ||
| // Remove the trace start time override for a previous context if such a context already exists. |
There was a problem hiding this comment.
Naive question - it's guaranteed there will never be overlap in this case? (If we remove we are sure the prior trace is completed?)
There was a problem hiding this comment.
TRACE_START_TIME is a rather unfortunate hack. It should only be used for the root span (and previously it wasn't possible to set this again, but it could be set to late). Overlapping shouldn't ever be an issue. But I'll add an assert to make sure this is never set after a trace context already exists catching the too late case.
modules/apm/src/test/java/org/elasticsearch/telemetry/apm/internal/tracing/APMTracerTests.java
Outdated
Show resolved
Hide resolved
…rnal/tracing/APMTracerTests.java
| assert key != Task.TRACE_START_TIME || (hasApmTraceContext() || hasParentApmTraceContext()) == false | ||
| : "trace.starttime cannot be set after a trace context is present"; | ||
|
|
There was a problem hiding this comment.
@jdconrad this will prevent wrong usage of trace.starttime
- Only continue tracing in TaskManager if a parent APM trace context exists. Trace headers might be set by external transactions as well. If present, it caused APMTracer to report a transaction for every span when in fact being sampled out. - Correctly report the duration of transactions if not recording by not discarding the root span immediately. These transactions might still get reported, but without spans. - Discard transient trace start time in `newTraceContext` when a parent APM trace context already exists. If propagated, all spans would start at the same time. Relates to ES-13389
💔 Backport failed
You can use sqren/backport to manually backport by running |
- Only continue tracing in TaskManager if a parent APM trace context exists. Trace headers might be set by external transactions as well. If present, it caused APMTracer to report a transaction for every span when in fact being sampled out. - Correctly report the duration of transactions if not recording by not discarding the root span immediately. These transactions might still get reported, but without spans. - Discard transient trace start time in `newTraceContext` when a parent APM trace context already exists. If propagated, all spans would start at the same time. Relates to ES-13389
* main: (135 commits)
Mute org.elasticsearch.upgrades.IndexSortUpgradeIT testIndexSortForNumericTypes {upgradedNodes=1} elastic#138130
Mute org.elasticsearch.upgrades.IndexSortUpgradeIT testIndexSortForNumericTypes {upgradedNodes=2} elastic#138129
Mute org.elasticsearch.search.basic.SearchWithRandomDisconnectsIT testSearchWithRandomDisconnects elastic#138128
[DiskBBQ] avoid EsAcceptDocs bug by calling cost before building iterator (elastic#138127)
Log NOT_PREFERRED shard movements (elastic#138069)
Improve bulk loading of binary doc values (elastic#137995)
Add internal action for getting inference fields and inference results for those fields (elastic#137680)
Address issue with DateFieldMapper#isFieldWithinQuery(...) (elastic#138032)
WriteLoadConstraintDecider: Have separate rate limiting for canRemain and canAllocate decisions (elastic#138067)
Adding NodeContext to TransportBroadcastByNodeAction (elastic#138057)
Mute org.elasticsearch.simdvec.ESVectorUtilTests testSoarDistanceBulk elastic#138117
Mute org.elasticsearch.xpack.esql.qa.single_node.GenerativeIT test elastic#137909
Backport batched_response_might_include_reduction_failure version to 8.19 (elastic#138046)
Add summary metrics for tdigest fields (elastic#137982)
Add gp-llm-v2 model ID and inference endpoint (elastic#138045)
Various tracing fixes (elastic#137908)
[ML] Fixing KDE evaluate() to return correct ValueAndMagnitude object (elastic#128602)
Mute org.elasticsearch.xpack.shutdown.NodeShutdownIT testStalledShardMigrationProperlyDetected elastic#115697
[ML] Fix Flaky Audit Message Assertion in testWithDatastream for RegressionIT and ClassificationIT (elastic#138065)
[ML] Fix Non-Deterministic Training Set Selection in RegressionIT testTwoJobsWithSameRandomizeSeedUseSameTrainingSet (elastic#138063)
...
# Conflicts:
# rest-api-spec/src/yamlRestTest/resources/rest-api-spec/test/search.vectors/200_dense_vector_docvalue_fields.yml
| final Object previousTraceContext = newTransientHeaders.remove(Task.APM_TRACE_CONTEXT); | ||
| if (previousTraceContext != null) { | ||
| newTransientHeaders.put(Task.PARENT_APM_TRACE_CONTEXT, previousTraceContext); | ||
| // Remove the trace start time override for a previous context if such a context already exists. |
There was a problem hiding this comment.
I had seen this behavior as well, but I wasn't super sure if I my understanding was correct. Glad to see this here!
This PR fixes various issues causing an excessive volume of APM events and incorrectly reported transaction & span durations as described in further detail:
Only continue tracing in TaskManager if a parent APM trace context exists. Trace headers might
be set by external transactions as well. If present, it caused APMTracer to report a transaction
for every span when in fact being sampled out.
Correctly report the duration of transactions if not recording by not discarding the root span
immediately. These transactions might still get reported, but without spans.
Discard transient trace start time in
newTraceContextwhen a parent APM trace contextalready exists. If propagated, all spans would start at the same time.
Relates to ES-13389