
Make CWL retry indefinitely for retryable errors when no DLQ configured #6355

Merged
kkondaka merged 2 commits into opensearch-project:main from kkondaka:cwl-dlq-fix
Dec 17, 2025

@kkondaka
Collaborator

Description

Make the CloudWatch Logs (CWL) sink retry indefinitely on retryable errors when no DLQ is configured.
Also adds a new metric that counts repeated failures (incremented every 5 failures).
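A minimal sketch of how a "multiple failures" counter such as cloudWatchLogsRequestMultipleFailures could be driven, assuming it is incremented each time a single request's failure count reaches a multiple of 5. MultiFailureTracker and recordFailure are hypothetical names, AtomicLong stands in for a real metrics counter, and having the other counter record every failed attempt is also an assumption; this is not the Data Prepper implementation.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch, not the Data Prepper code. AtomicLong stands in for a metrics counter.
final class MultiFailureTracker {
    // Interval taken from the PR description ("every 5 failures").
    static final int FAILURE_INTERVAL = 5;

    // Assumption: counts every failed attempt.
    private final AtomicLong requestsFailed = new AtomicLong();
    // Incremented once each time a request's failure count hits a multiple of FAILURE_INTERVAL.
    private final AtomicLong requestMultipleFailures = new AtomicLong();

    /** Records one failed attempt; failuresSoFar is this request's failure count including this one. */
    void recordFailure(long failuresSoFar) {
        requestsFailed.incrementAndGet();
        if (failuresSoFar > 0 && failuresSoFar % FAILURE_INTERVAL == 0) {
            requestMultipleFailures.incrementAndGet();
        }
    }

    long requestsFailed() {
        return requestsFailed.get();
    }

    long requestMultipleFailures() {
        return requestMultipleFailures.get();
    }
}
```

In this sketch a request that fails 12 times in a row bumps the first counter 12 times and the second counter twice (at failures 5 and 10), so an alarm on the second counter reacts only to sustained failures rather than to a single transient error.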

Issues Resolved

Resolves #6300

Check List

  • New functionality includes testing.
  • New functionality has a documentation issue. Please link to it in this PR.
    • New functionality has javadoc added
  • [X] Commits are signed with a real name per the DCO

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Krishna Kondaka <krishkdk@amazon.com>
public static final String CLOUDWATCH_LOGS_EVENTS_SUCCEEDED = "cloudWatchLogsEventsSucceeded";
public static final String CLOUDWATCH_LOGS_EVENTS_FAILED = "cloudWatchLogsEventsFailed";
public static final String CLOUDWATCH_LOGS_REQUESTS_FAILED = "cloudWatchLogsRequestsFailed";
public static final String CLOUDWATCH_LOGS_REQUEST_MULTI_FAILED = "cloudWatchLogsRequestMultipleFailures";
Member

What does this mean?

Member

Are you trying to track requests that are retried more than once? If so, would this name be clearer?

cloudWatchLogsRequestsWithMultipleRetries

Collaborator Author

This is a separate metric for repeated failures so that alarms can be set up on them. According to the CloudWatch Logs team, a single failure may be acceptable, but repeated failures are serious enough to warrant an alarm and immediate action.

Member

@graytaylor0 (Dec 17, 2025)

What are "multiple failures" exactly? Different error responses for the same request on retries?

.logGroup(cloudWatchLogsSinkConfig.getLogGroup())
.logStream(cloudWatchLogsSinkConfig.getLogStream())
-.retryCount(cloudWatchLogsSinkConfig.getMaxRetries())
+.retryCount(dlqPushHandler == null ? Integer.MAX_VALUE : cloudWatchLogsSinkConfig.getMaxRetries())
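The changed line chooses the retry budget based on whether a DLQ handler exists. A minimal sketch of that choice and of a bounded retry loop that consumes it; RetryBudgetSketch and sendWithRetries are hypothetical names, Object stands in for the real DLQ handler type, treating the budget as "retries after the first attempt" is an assumption, and backoff between attempts is omitted.

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch, not the Data Prepper code.
final class RetryBudgetSketch {
    /** Mirrors the diff: no DLQ handler means an effectively unbounded retry budget. */
    static int retryCount(Object dlqPushHandler, int configuredMaxRetries) {
        return dlqPushHandler == null ? Integer.MAX_VALUE : configuredMaxRetries;
    }

    /**
     * Calls attempt until it returns true or retryCount retries have been used.
     * Returns false when the budget runs out (the point where events would go to the DLQ).
     * No backoff between attempts is shown.
     */
    static boolean sendWithRetries(BooleanSupplier attempt, int retryCount) {
        long retriesUsed = 0; // long so the counter cannot overflow before the int budget is reached
        while (!attempt.getAsBoolean()) {
            if (retriesUsed >= retryCount) {
                return false;
            }
            retriesUsed++;
        }
        return true;
    }
}
```

In this sketch, with a DLQ and a configured maximum of 3, a request that keeps failing is attempted 4 times (the first try plus 3 retries) before being handed off; without a DLQ the budget is Integer.MAX_VALUE, so in practice the loop keeps going until the call succeeds.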
Member

Do we want to retry all errors indefinitely? Should some be non-retryable?

Collaborator Author

This applies only to retryable errors (CloudWatchLogsException or SdkClientException). Non-retryable errors (all other exceptions) are never retried.
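A minimal sketch of the classification described in this reply: only the two retryable exception families consume the retry budget, and anything else fails immediately. To keep the example self-contained, the AWS SDK types CloudWatchLogsException and SdkClientException are replaced by hypothetical local stand-in classes; RetryClassifier and shouldRetry are likewise illustrative names, not Data Prepper APIs.

```java
// Hypothetical stand-in for the AWS SDK's CloudWatchLogsException.
class CloudWatchLogsExceptionStandIn extends RuntimeException {
    CloudWatchLogsExceptionStandIn(String message) {
        super(message);
    }
}

// Hypothetical stand-in for the AWS SDK's SdkClientException.
class SdkClientExceptionStandIn extends RuntimeException {
    SdkClientExceptionStandIn(String message) {
        super(message);
    }
}

// Hypothetical sketch, not the Data Prepper code.
final class RetryClassifier {
    /** Only the two exception families named in the review are treated as retryable. */
    static boolean isRetryable(Exception e) {
        return e instanceof CloudWatchLogsExceptionStandIn || e instanceof SdkClientExceptionStandIn;
    }

    /** Retry only retryable errors, and only while the retry budget lasts. */
    static boolean shouldRetry(Exception e, long retriesUsed, int retryCount) {
        return isRetryable(e) && retriesUsed < retryCount;
    }
}
```

Under this reading, a non-retryable error stops immediately even when no DLQ is configured and the budget is Integer.MAX_VALUE, which is the distinction the reply draws.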

kkondaka merged commit 1f3c152 into opensearch-project:main on Dec 17, 2025
44 of 47 checks passed
wandna-amazon pushed a commit to wandna-amazon/data-prepper that referenced this pull request Jan 8, 2026
…ed (opensearch-project#6355)

* Make CWL retry indefinitely for retryable errors when no DLQ configured

Signed-off-by: Krishna Kondaka <krishkdk@amazon.com>

* Added tests

Signed-off-by: Krishna Kondaka <krishkdk@amazon.com>

---------

Signed-off-by: Krishna Kondaka <krishkdk@amazon.com>
Signed-off-by: Nathan Wand <wandna@amazon.com>
simonelbaz pushed a commit to simonelbaz/data-prepper that referenced this pull request Jan 31, 2026
…ed (opensearch-project#6355)

* Make CWL retry indefinitely for retryable errors when no DLQ configured

Signed-off-by: Krishna Kondaka <krishkdk@amazon.com>

* Added tests

Signed-off-by: Krishna Kondaka <krishkdk@amazon.com>

---------

Signed-off-by: Krishna Kondaka <krishkdk@amazon.com>
Signed-off-by: Simon ELBAZ <elbazsimon9@gmail.com>
simonelbaz pushed a commit to simonelbaz/data-prepper that referenced this pull request Jan 31, 2026
…ed (opensearch-project#6355)

* Make CWL retry indefinitely for retryable errors when no DLQ configured

Signed-off-by: Krishna Kondaka <krishkdk@amazon.com>

* Added tests

Signed-off-by: Krishna Kondaka <krishkdk@amazon.com>

---------

Signed-off-by: Krishna Kondaka <krishkdk@amazon.com>

Development

Successfully merging this pull request may close these issues.

Make CWL retry indefinitely for retryable errors when no DLQ configured

5 participants