Skip to content

Instrumentation test error visibility#9979

Closed
deejgregor wants to merge 5 commits into
DataDog:masterfrom
deejgregor:feature-instrumentation-test-error-visibility
Closed

Instrumentation test error visibility#9979
deejgregor wants to merge 5 commits into
DataDog:masterfrom
deejgregor:feature-instrumentation-test-error-visibility

Conversation

@deejgregor

@deejgregor deejgregor commented Nov 16, 2025

Copy link
Copy Markdown
Contributor

What Does This Do

This improves visibility of errors in instrumentation tests built on top of InstrumentationSpecification by failing the test with details of the errors, in particular these cases:

  1. Instrumentation blocked due to MuzzleCheck errors.
  2. Throwables caught from instrumentation code and reported to InstrumentationErrors by the Byte Buddy ExceptionHandlers class.
  3. Byte Buddy nstrumentation errors reported to InstrumentationSpecification's AgentBuilder.Listener.onError.

Also, ListWriterAssert.assertTraces will now fail fast when the errors from the first two cases above are reported.

Lastly, the code comment in ExceptionHandlers has been updated to reflect the current implementation and InstrumentationContext exceptions are more explicit about the likely causes of why the methods might not have been rewritten.

Motivation

These changes should reduce discovery and investigation time during development when there are instrumentation errors. Every single one of these issues are ones that I've run while developing instrumentations (most of them I've run into multiple times). ;-) These changes have helped me out enough (particularly for the first two use cases above) that I realized I really should see about contributing this back to help others--especially for those new to developing instrumentation, but also to assist those that have been doing it awhile when the inevitable error slips through.

Previously, the first error case above was not checked in tests using InstrumentationSpecification, and was only visible in log output, and the other two error cases just included error counts. In all of these cases, the developer would have to go digging in log output to find existence and/or details of errors.

When ListWriterAssert.assertTraces was used previously, if there were instrumentation errors that led to traces not being properly generated, tests will often fail due to failed assertions (missing spans, incorrect tags, etc.) and the developer might spend significant time chasing down the cause of the assertion failures to learn it was an instrumentation error. With these changes, if the cause was an underlying instrumentation error, ListWriterAssert.assertTraces will fail fast for the first two use cases above and clearly indicate there was an instrumentation error.

These changes will not only help our human developers by giving better context when there are failures, but also when using AI coding assistants.

Additional Notes

One possible improvement: The fail fast change could easily apply to all of the use cases above including the last use case if InstrumentationSpecification's AgentBuilder.Listener.onError recorded the error to InstrumentationErrors.

Best reviewed commit-by-commit:

Example output

InstrumentationErrors and InstrumentationContext exception

I commented-out the contextStore method on HikariPoolInstrumentation.

Condition not satisfied:

instrumentationErrorCount == 0
|                         |
1                         false

1 instrumentation errors were seen:
java.lang.RuntimeException: Calls to this method will be rewritten by Instrumentation Context Provider (e.g. FieldBackedProvider). If you get this exception, this method has not been rewritten. Ensure instrumentation class has a contextStore method and the call to InstrumentationContext.get happens directly in an instrumentation Advice class. See how_instrumentations_work.md for details.
	at datadog.trace.bootstrap.InstrumentationContext.get(InstrumentationContext.java:26)
	at com.zaxxer.hikari.pool.HikariPool.<init>(HikariPool.java:151)
	at com.zaxxer.hikari.HikariDataSource.<init>(HikariDataSource.java:73)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at org.codehaus.groovy.reflection.CachedConstructor.invoke(CachedConstructor.java:72)

Contributor Checklist

Jira ticket: [PROJ-IDENT]

@deejgregor deejgregor requested a review from a team as a code owner November 16, 2025 07:36
@deejgregor deejgregor requested a review from smola November 16, 2025 07:36

public class InstrumentationErrors {
private static final AtomicLong COUNTER = new AtomicLong();
private static final List<String> ERRORS = Collections.synchronizedList(new ArrayList<>());

@AlexeyKuznetsov-DD AlexeyKuznetsov-DD Nov 17, 2025

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❓ (For my own understanding): Why we need synchronized list here (and similar place in other class)?
Maybe we can use some concurrent data structure here instead?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Short answer (not a great answer): because I'm an old-school Java programmer and still learning some modern (Java 8, heh) features. ;-)

Longer answer: these few places all had an AtomicLong and when I replaced it with a List Collections.synchronizedList(new ArrayList<>()) was my muscle-memory replacement. It looks like CopyOnWriteArrayList is a decent alternative in this case, so that's what I'll be doing.

Updated:

  1. https://github.com/DataDog/dd-trace-java/pull/9979/files#diff-82dd6ac77ecfcdc8b0ff7590d7370d00c33f7ac78ca37f78cbf841c0ced625b3R10
  2. https://github.com/DataDog/dd-trace-java/pull/9979/files#diff-2328d587d0382c573151b6c14fa9bab635d7482f5075206e22f0bf7114910834R16
  3. https://github.com/DataDog/dd-trace-java/pull/9979/files#diff-39d1612c29708df2c2a24a9d0888e933768aaecd8461e0691f6f6e28edf0050eR169

Comment on lines +508 to +511
def instrumentationErrorCount = InstrumentationErrors.getErrors().size()
assert instrumentationErrorCount == 0, instrumentationErrorCount + " instrumentation errors were seen:\n" + InstrumentationErrors.getErrors().join("\n---\n")
def muzzleErrorCount = MuzzleCheck.getErrors().size()
assert muzzleErrorCount == 0, muzzleErrorCount + " muzzle errors were seen:\n" + MuzzleCheck.getErrors().join("\n---\n")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: I think Groovy string interpolation can be used here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -47,7 +49,21 @@ class ListWriterAssert {
@ClosureParams(value = SimpleType, options = ['datadog.trace.agent.test.asserts.ListWriterAssert'])
@DelegatesTo(value = ListWriterAssert, strategy = Closure.DELEGATE_FIRST) Closure spec) {
try {
writer.waitForTraces(expectedSize)
// Fail fast if we see any muzzle errors or instrumentation errors
for (int timeoutRemaining = 20; timeoutRemaining > 0; timeoutRemaining--) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: probably it make sense to declare named const for magic number 20?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@deejgregor

Copy link
Copy Markdown
Contributor Author

@AlexeyKuznetsov-DD thank you for the feedback. I'll work on the updates in the next day or two.

@mcculls mcculls added the tag: community Community contribution label Nov 18, 2025
This puts the details front and center in the test summary and is
much easier than hunting back through log messages.

Note: In InstrumentationErrors, errors are only recorded when
enableRecordingAndReset has been called. This is only done from the
InstrumentationSpecification test specification, so we don't need
to worry about the ArrayList growing without bounds in production.
I considered adding a limit to the ArrayList, but opted not to for
simplicity. I also wanted to avoid silently discarding some errors.
If there is a blocked instrumentation, the test will fail and the
failure message will include the same information that is logged
from MuzzleCheck.
@deejgregor deejgregor force-pushed the feature-instrumentation-test-error-visibility branch from 974b975 to 547e34d Compare November 19, 2025 03:53
@deejgregor

Copy link
Copy Markdown
Contributor Author

@AlexeyKuznetsov-DD I made the recommended changes, rebased on latest master and force pushed.

I made one additional update: I realized there could be a case where the assertions in ListWriterAssert would not be checked: if the expected number of traces were received within 1 second. I refactored the asserts out into a new method, assertNoErrors, and used that both inside the for loop and once after to make sure it runs at least once.

@AlexeyKuznetsov-DD AlexeyKuznetsov-DD left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR in general LGTM, but better to get one more approve from someone more familiar with codebase.

@AlexeyKuznetsov-DD AlexeyKuznetsov-DD left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but better to have one more approval from code owners.

@mcculls mcculls left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @deejgregor - thanks for the contribution, there are some CI failures that I need to look into - and some cleanup around ExceptionHandlers to reduce the impact in production (since we can rely on telemetry, and therefore don't need the additional argument stack allocations)

I'll add comments here in the next week or so

@github-actions

github-actions Bot commented May 1, 2026

Copy link
Copy Markdown
Contributor

This pull request has been marked as stale because it has not had activity over the past quarter. It will be closed in 7 days if no further activity occurs. Feel free to reopen the PR if you are still working on it.

@github-actions github-actions Bot added the tag: stale Stale pull requests label May 1, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

tag: community Community contribution tag: stale Stale pull requests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants