Skip to content

feat(otlp-sink): Add OTLP sink plugin for exporting spans to AWS X-Ray#5664

Merged
dlvenable merged 24 commits intoopensearch-project:mainfrom
huypham612:otlp-sink-plugin
Jun 3, 2025
Merged

feat(otlp-sink): Add OTLP sink plugin for exporting spans to AWS X-Ray#5664
dlvenable merged 24 commits intoopensearch-project:mainfrom
huypham612:otlp-sink-plugin

Conversation

@huypham612
Copy link
Copy Markdown
Contributor

@huypham612 huypham612 commented Apr 30, 2025

Description

This PR introduces a new otlp sink plugin for Data Prepper that supports exporting OpenTelemetry spans to AWS X-Ray over HTTP using OTLP.

Key features:

  • Sends ExportTraceServiceRequest messages over HTTP to X-Ray-compatible endpoints
  • Supports gzip compression and AWS SigV4 signing
  • Retries with exponential backoff and jitter on retryable HTTP errors
  • Metrics integration (delivery latency, payload size, retries, errors, etc.)
  • Batching support with max_events, max_batch_size, and flush_timeout thresholds
  • Fully unit-tested at 100% coverage
  • End-to-end and Performance tested with a local pipeline sending spans to a SigV4-signed OTLP endpoint (AWS X-Ray)

Issues Resolved

Resolves #5663

Check List

  • New functionality includes testing.
  • New functionality has a documentation issue. Please link to it in this PR.
    • New functionality has javadoc added
  • Commits are signed with a real name per the DCO

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

* feat(xray-otlp-sink): Add X-Ray OTLP Sink plugin skeleton
- Added test resources and support for Span records
- Added sample pipeline config and OTLP test span JSON under `src/test/resources`
- Verified local pipeline ingest and logging using `grpcurl`
- Added README with developer instructions for running and testing locally

These changes establish the foundation for local testing and future X-Ray e2e testing.

Signed-off-by: huy pham <huyp@amazon.com>
Signed-off-by: huy pham <huyp@amazon.com>
Copy link
Copy Markdown
Collaborator

@KarstenSchnitter KarstenSchnitter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's have a discussion in the minimal feature set. I am missing mTLS and gRPC support. I will do a more thorough review of this PR next week.

Copy link
Copy Markdown
Member

@dlvenable dlvenable left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @huypham612 for this contribution!

Copy link
Copy Markdown
Collaborator

@KarstenSchnitter KarstenSchnitter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I support @dlvenable suggestion to use another http client. I also suggest to be more distinctive on the error metrics, so we can distinguish message drops because of processing errors from those rejected by the target. Please also have a look at the retry-after header as described here: https://opentelemetry.io/docs/specs/otlp/#otlphttp-throttling

@huypham612
Copy link
Copy Markdown
Contributor Author

Thank you for your feedback @dlvenable and @KarstenSchnitter, I will go through them and revise the code accordingly.

@huypham612
Copy link
Copy Markdown
Contributor Author

I support @dlvenable suggestion to use another http client. I also suggest to be more distinctive on the error metrics, so we can distinguish message drops because of processing errors from those rejected by the target. Please also have a look at the retry-after header as described here: https://opentelemetry.io/docs/specs/otlp/#otlphttp-throttling

Thank you, I will address this comment and update the code to handle this header. Good catch.

@huypham612
Copy link
Copy Markdown
Contributor Author

huypham612 commented May 16, 2025

Hi @dlvenable, @KarstenSchnitter, and @kkondaka,

I've pushed the updated changes based on the feedback. Here’s a summary of the key updates:

  1. Migrated to Armeria HTTP client.
  2. Added failedSpansCount metric for improved observability.
  3. Introduced ExecutorService to manage the buffer queue and ensure resilience.
  4. Allowed unbounded max_events by supporting 0 as a valid value.
  5. Reused the shared AwsConfig class for consistency with other sinks.
  6. Added unit tests to cover 100%.
  7. Re-ran performance tests and confirmed the implementation works as expected.

Note: I evaluated using the Retry-After header for dynamic backoff, but decided against it due to the following trade-offs:

  1. Armeria’s retry rule API only supports boolean predicates or fixed Backoff.
  2. Supporting Retry-After would require a custom Backoff implementation, which adds complexity with minimal benefit for most OTLP endpoints.
  3. The existing exponential backoff with jitter strategy already handles typical retry scenarios effectively.

Please take another look when you have time — thank you!

@huypham612 huypham612 requested review from dlvenable and kkondaka May 22, 2025 04:42
@huypham612 huypham612 requested a review from dlvenable May 23, 2025 23:26
dlvenable
dlvenable previously approved these changes May 24, 2025
huypham612 and others added 12 commits May 23, 2025 18:49
Signed-off-by: huy pham <huyp@amazon.com>
- Introduce `ThresholdConfig` with `max_events`, `max_batch_size`, and `flush_timeout`
- Implement `OtlpSinkBuffer` using a bounded `LinkedBlockingQueue` driven by threshold settings
- Enhance `OtlpSinkMetrics` to register queue gauges and cover new payload/latency metrics
- Update README with configuration options and note on AWS region derivation
- Add comprehensive unit tests for config, buffer, metrics, and HTTP sender
- Removed unused integ test package and resources

Signed-off-by: huy pham <huyp@amazon.com>
- Changed logic to get the aws region value from the provided endpoint.

Signed-off-by: huy pham <huyp@amazon.com>
- Retries on 429, 502, 503, and 504 per OTEL spec. Logs other errors without retry.
- See: https://opentelemetry.io/docs/specs/otlp/exporter/#retrying-on-failure

Signed-off-by: huy pham <huyp@amazon.com>
- removed log.info statements
- Add description to exception

Signed-off-by: huy pham <huyp@amazon.com>
- Updated worker loop to retry on interrupt until queue is empty
- Avoid logging error on expected shutdown interrupt
- test: Added unit tests to verify final flush

Signed-off-by: huy pham <huyp@amazon.com>
- Removed unused metrics

Signed-off-by: huy pham <huyp@amazon.com>
Signed-off-by: huy pham <huyp@amazon.com>
…ments

Signed-off-by: huy pham <huyp@amazon.com>
…ialsSupplier.

Signed-off-by: huy pham <huyp@amazon.com>
- Updated README
- Fixed init bug
- Performed e2e and perf tests

Signed-off-by: huy pham <huyp@amazon.com>
Signed-off-by: huy pham <huyp@amazon.com>
}

// Defensive copy to avoid ConcurrentModificationException
final List<Pair<ResourceSpans, EventHandle>> immutableBatch = List.copyOf(batch);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually, we use synchronization (using locks) to avoid concurrent modification. Not sure if making a copy is the best way to handle this. What do you think @dlvenable ?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking because Armeria is async, the List.copyOf ensures immutability without locking overhead. Given the typical batch size is small (1MB, even up to 5MB), this is a practical tradeoff for correctness and performance in a trace sink.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a shallow copy. So we are just copying the list. The underlying ResourceSpans and EventHandle objects are not copied.

Signed-off-by: huy pham <huyp@amazon.com>
dlvenable
dlvenable previously approved these changes May 30, 2025
Signed-off-by: huy pham <huyp@amazon.com>
dlvenable
dlvenable previously approved these changes May 30, 2025
…entials

Signed-off-by: huy pham <huyp@amazon.com>
Copy link
Copy Markdown
Member

@dlvenable dlvenable left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @huypham612 for this great new feature!

@dlvenable dlvenable merged commit 35795aa into opensearch-project:main Jun 3, 2025
67 of 74 checks passed
jeffreyAaron pushed a commit to jeffreyAaron/data-prepper that referenced this pull request Jun 13, 2025
…ject#5664)

Add OTLP sink plugin for exporting spans to AWS X-Ray (opensearch-project#5664)

Feature/xray init

feat(xray-otlp-sink): Add X-Ray OTLP Sink plugin skeleton
- Added test resources and support for Span records
- Added sample pipeline config and OTLP test span JSON under `src/test/resources`
- Verified local pipeline ingest and logging using `grpcurl`
- Added README with developer instructions for running and testing locally

These changes establish the foundation for local testing and future X-Ray e2e testing.

Signed-off-by: huy pham <huyp@amazon.com>
Signed-off-by: Heli <desaiheli17@gmail.com>
Co-authored-by: Heli <desaiheli17@gmail.com>
Signed-off-by: Jeffrey Aaron Jeyasingh <jeffreyaaron06@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add otlp sink

5 participants