Skip to content

fix(core): Add retries with exponential backoff for transient S3 read failures before any data is read.#2203

Merged
gibber9809 merged 3 commits into
y-scope:mainfrom
goynam:fix/retry-transient-s3-failures
Apr 20, 2026
Merged

fix(core): Add retries with exponential backoff for transient S3 read failures before any data is read.#2203
gibber9809 merged 3 commits into
y-scope:mainfrom
goynam:fix/retry-transient-s3-failures

Conversation

@goynam

@goynam goynam commented Apr 14, 2026

Copy link
Copy Markdown
Contributor

When reading log files from S3, transient network errors (HTTP errors, SSL connection resets, empty replies, timeouts) during reader creation or file type deduction cause the compression task to fail permanently. Since there is no retry mechanism, the affected log files are never ingested.

Changes

  • Add NetworkUtils::is_retryable_curl_error() in Utils.hpp/cpp to detect transient curl errors (CURLE_HTTP_RETURNED_ERROR, CURLE_SSL_CONNECT_ERROR, CURLE_GOT_NOTHING, CURLE_RECV_ERROR, CURLE_OPERATION_TIMEDOUT).
  • Add try_create_reader_and_deduce_type_with_retries() in InputConfig.hpp/cpp — a shared utility that wraps reader creation and file type deduction with retry logic (3 retries, exponential backoff starting at 1s).
  • Simplify JsonParser::ingest() and log_converter::convert_files() to use the new utility, replacing inline retry loops.

Scope

This PR only handles transient errors that occur before any data is ingested (i.e., during reader creation and input type deduction). Retrying after partial ingestion would require more extensive changes to the ingestion pipeline and is left for a follow-up.

@goynam goynam requested review from a team and gibber9809 as code owners April 14, 2026 17:18
@coderabbitai

coderabbitai Bot commented Apr 14, 2026

Copy link
Copy Markdown
Contributor
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main change: adding retries with exponential backoff for transient S3 read failures, which aligns with the primary objective of the changeset.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@components/core/src/clp_s/JsonParser.cpp`:
- Around line 649-650: The retry loop uses constexpr size_t cMaxRetries{3}
together with the condition attempt <= cMaxRetries, causing 4 attempts (attempts
0..3); either change the loop condition to stop at the intended total attempts
(e.g., use attempt < cMaxRetries for exactly 3 total attempts) or
rename/repurpose the constant to cMaxAttempts and document that it includes the
initial try (and keep the <= condition). Update all occurrences (including the
similar block referenced around lines 655–666) to use the chosen convention
consistently and adjust the constant name or loop condition accordingly so the
semantics match the identifier.
- Around line 668-672: try_create_reader(path, m_network_auth) can fail
transiently but the current code returns false immediately; change
JsonParser.cpp so that when try_create_reader returns nullptr you inspect the
underlying failure (use the same retry/ backoff policy used elsewhere in this
component) and retry reader creation a few times before giving up, only
returning false if the error is determined non-retryable; ensure you still call
m_archive_writer->close() on final failure and reference try_create_reader,
reader, path and m_network_auth when locating the logic to modify (or if you
intentionally decide not to retry, add a brief comment explaining why reader
creation is not retryable).

In `@components/core/src/clp_s/log_converter/log_converter.cpp`:
- Around line 23-24: Both JsonParser::ingest() and convert_files() duplicate
retry constants (cMaxRetries, cInitialBackoff) and loop structure; extract the
constants into a shared header (e.g., Utils.hpp) and factor the retry loop into
a reusable helper (e.g., retry_with_backoff or RetryPolicy) that takes a
callable, max retries and initial backoff so both JsonParser::ingest() and
convert_files() call the same helper; update references to use the shared
constants and helper to remove the duplicated loop/constant definitions and keep
behavior identical.
- Around line 62-68: The reader creation block using clp_s::try_create_reader
currently returns false immediately when reader == nullptr; implement retry
logic similar to JsonParser::ingest so transient network failures are retried:
call try_create_reader(path, command_line_arguments.get_network_auth()) in a
short-loop with a limited number of attempts and exponential/backoff delays,
logging each attempt (use SPDLOG_ERROR/INFO) and only return false after
exhausting retries; ensure the variable reader is rechecked after each attempt
and preserve existing error context in logs.
- Around line 80-91: The code currently calls
clp_s::NetworkUtils::check_and_log_curl_error twice when a retryable CURL error
occurs and attempt >= cMaxRetries; modify the control flow in the block around
clp_s::NetworkUtils::is_retryable_curl_error(reader.get()) so that
check_and_log_curl_error is only invoked once on the terminal failure path: if
is_retryable returns true and attempt < cMaxRetries, perform the retry
(continue) after optionally logging for retry; otherwise (final attempt or
non-retryable), call check_and_log_curl_error(path.path, reader.get()) exactly
once before emitting the SPDLOG_ERROR and returning false. Ensure you reference
the existing symbols attempt, cMaxRetries,
NetworkUtils::is_retryable_curl_error, and
NetworkUtils::check_and_log_curl_error when making the change.

In `@components/core/src/clp_s/Utils.cpp`:
- Around line 186-194: The current retry check treats CURLE_HTTP_RETURNED_ERROR
(from curl_error_info->code()) as always retryable but that includes 4xx client
errors; update the logic in Utils.cpp (the function that inspects
curl_error_info->code()) to distinguish HTTP 5xx from 4xx by calling
curl_easy_getinfo on the CURL handle to fetch CURLINFO_RESPONSE_CODE and only
treat CURLE_HTTP_RETURNED_ERROR as retryable when the response_code is in the
500–599 range; keep the existing checks for CURLE_SSL_CONNECT_ERROR,
CURLE_GOT_NOTHING, CURLE_RECV_ERROR, and CURLE_OPERATION_TIMEDOUT unchanged.
Ensure you access the CURL handle associated with the error (the same one used
to produce curl_error_info) to call curl_easy_getinfo and handle any failure to
get the response code conservatively (do not assume 5xx).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: cc080208-cb4a-4c30-b558-158d0a1fff4c

📥 Commits

Reviewing files that changed from the base of the PR and between 9ae078e and 42c09539afd8560d4d2f43e525e90cb5bb55f7e3.

📒 Files selected for processing (4)
  • components/core/src/clp_s/JsonParser.cpp
  • components/core/src/clp_s/Utils.cpp
  • components/core/src/clp_s/Utils.hpp
  • components/core/src/clp_s/log_converter/log_converter.cpp

Comment on lines +649 to +650
constexpr size_t cMaxRetries{3};
constexpr std::chrono::seconds cInitialBackoff{1};

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Retry loop executes 4 attempts, not 3 retries.

The loop condition attempt <= cMaxRetries with cMaxRetries{3} means the loop runs for attempt values 0, 1, 2, 3 — that's 4 total attempts (1 initial + 3 retries). If the intention is 3 retries after the initial attempt, this is correct. However, the constant name cMaxRetries suggests the maximum number of retries should be 3, making 4 total attempts.

If you want exactly 3 total attempts (1 initial + 2 retries), change to attempt < cMaxRetries. Otherwise, consider renaming to cMaxAttempts for clarity.

Also applies to: 655-666

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/core/src/clp_s/JsonParser.cpp` around lines 649 - 650, The retry
loop uses constexpr size_t cMaxRetries{3} together with the condition attempt <=
cMaxRetries, causing 4 attempts (attempts 0..3); either change the loop
condition to stop at the intended total attempts (e.g., use attempt <
cMaxRetries for exactly 3 total attempts) or rename/repurpose the constant to
cMaxAttempts and document that it includes the initial try (and keep the <=
condition). Update all occurrences (including the similar block referenced
around lines 655–666) to use the chosen convention consistently and adjust the
constant name or loop condition accordingly so the semantics match the
identifier.

Comment thread components/core/src/clp_s/JsonParser.cpp Outdated
Comment thread components/core/src/clp_s/log_converter/log_converter.cpp Outdated
Comment thread components/core/src/clp_s/log_converter/log_converter.cpp Outdated
Comment thread components/core/src/clp_s/log_converter/log_converter.cpp Outdated
Comment on lines +186 to +194
auto const code{curl_error_info->code()};
// CURLE_HTTP_RETURNED_ERROR (22): HTTP 5xx errors (503, 500, etc.)
// CURLE_SSL_CONNECT_ERROR (35): SSL connection reset
// CURLE_GOT_NOTHING (52): Empty reply from server
// CURLE_RECV_ERROR (56): Failure in receiving network data
// CURLE_OPERATION_TIMEDOUT (28): Operation timed out
return CURLE_HTTP_RETURNED_ERROR == code || CURLE_SSL_CONNECT_ERROR == code
|| CURLE_GOT_NOTHING == code || CURLE_RECV_ERROR == code
|| CURLE_OPERATION_TIMEDOUT == code;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

Does CURLE_HTTP_RETURNED_ERROR trigger for HTTP 4xx errors or only 5xx?

💡 Result:

CURLE_HTTP_RETURNED_ERROR triggers for HTTP error codes >= 400 (both 4xx and 5xx), but only if CURLOPT_FAILONERROR is set to TRUE. By default, libcurl considers HTTP 4xx/5xx responses as successful transfers and does not return this error.

Citations:


🏁 Script executed:

# First, let's examine the Utils.cpp file around the function in question
cat -n components/core/src/clp_s/Utils.cpp | sed -n '170,210p'

Repository: y-scope/clp

Length of output: 1713


🏁 Script executed:

# Search for CURLOPT_FAILONERROR usage in the codebase
rg "CURLOPT_FAILONERROR" --type cpp --type hpp -B 2 -A 2

Repository: y-scope/clp

Length of output: 82


🏁 Script executed:

# Search for where is_retryable_curl_error is being used
rg "is_retryable_curl_error" --type cpp --type hpp -B 2 -A 2

Repository: y-scope/clp

Length of output: 82


🏁 Script executed:

# Search for CURLOPT_FAILONERROR using correct flags
rg "CURLOPT_FAILONERROR" -t cpp -B 2 -A 2

Repository: y-scope/clp

Length of output: 728


🏁 Script executed:

# Search for where is_retryable_curl_error is used
rg "is_retryable_curl_error" -t cpp -B 2 -A 2

Repository: y-scope/clp

Length of output: 3067


🏁 Script executed:

# Find and examine NetworkReader implementation
fd -e cpp -e h | xargs grep -l "class NetworkReader" | head -5

Repository: y-scope/clp

Length of output: 37


CURLE_HTTP_RETURNED_ERROR includes 4xx client errors, not just 5xx.

With CURLOPT_FAILONERROR enabled (which is explicitly set in CurlDownloadHandler.cpp), CURLE_HTTP_RETURNED_ERROR is triggered for any HTTP error code ≥ 400, including both client errors (4xx) and server errors (5xx). The comment incorrectly states "HTTP 5xx errors (503, 500, etc.)" but the code treats all HTTP errors ≥ 400 as retryable.

This is problematic because client errors like 401 (Unauthorized), 403 (Forbidden), and 404 (Not Found) are non-transient and will not succeed on retry. These indicate problems with the request or credentials, not temporary server issues.

To properly handle only server errors, check the actual HTTP response code using curl_easy_getinfo() with CURLINFO_RESPONSE_CODE and verify it falls in the 500–599 range before treating it as retryable.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/core/src/clp_s/Utils.cpp` around lines 186 - 194, The current
retry check treats CURLE_HTTP_RETURNED_ERROR (from curl_error_info->code()) as
always retryable but that includes 4xx client errors; update the logic in
Utils.cpp (the function that inspects curl_error_info->code()) to distinguish
HTTP 5xx from 4xx by calling curl_easy_getinfo on the CURL handle to fetch
CURLINFO_RESPONSE_CODE and only treat CURLE_HTTP_RETURNED_ERROR as retryable
when the response_code is in the 500–599 range; keep the existing checks for
CURLE_SSL_CONNECT_ERROR, CURLE_GOT_NOTHING, CURLE_RECV_ERROR, and
CURLE_OPERATION_TIMEDOUT unchanged. Ensure you access the CURL handle associated
with the error (the same one used to produce curl_error_info) to call
curl_easy_getinfo and handle any failure to get the response code conservatively
(do not assume 5xx).

@gibber9809 gibber9809 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see this PR is aiming to handle two different failure scenarios for both clp-s and log-converter:

  1. We hit a retryable network error before trying to ingest any data from some input
  2. We hit a retryable network error after ingesting some data from some input

I think the first scenario is handled properly here for both clp-s and log-converter, but the second scenario will still have some issues. Specifically, in the log-converter path I think we'll end up with some logs being ingested more than once, and the clp-s path probably won't work since most of the code in JsonParser currently assumes that any read failure leads to overall compression failure, and the error handling paths currently leave JsonParser in a state where it can't continue ingestion.

Since handling this second scenario properly will require some more extensive changes, I think it might make sense to reduce our scope to just handling the first scenario in this PR. For that, we could probably create a new input utility like try_create_reader_and_deduce_input_type_with_retries in InputConfig that does the whole create reader -> deduce type -> check for network failure, back off, and retry loop. Then we could use that utility in both log_converter.cpp and JsonParser.cpp to keep the changes in those files minimal.

@goynam goynam force-pushed the fix/retry-transient-s3-failures branch from 42c0953 to 90f43fb Compare April 18, 2026 05:29

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@components/core/src/clp_s/InputConfig.cpp`:
- Around line 494-518: The retry helper currently skips logging curl errors when
file_type == FileType::Unknown on the final attempt or for non-retryable curl
errors; update the control flow around try_create_reader/try_deduce_reader_type
so that when result yields FileType::Unknown and
NetworkUtils::check_and_log_curl_error(reader.get()) indicates a curl error you
always call check_and_log_curl_error (so non-retryable and final-attempt errors
get logged), but only use attempt < max_retries to decide whether to continue
the loop (i.e., keep NetworkUtils::is_retryable_curl_error(reader.get()) &&
attempt < max_retries for the continue branch); locate this logic around the
try_create_reader/try_deduce_reader_type result handling and adjust the branches
that reference reader, result, nested_readers, file_type, attempt, and
max_retries accordingly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 18904f94-1802-44f0-9b86-6ba794b93496

📥 Commits

Reviewing files that changed from the base of the PR and between 42c09539afd8560d4d2f43e525e90cb5bb55f7e3 and 90f43fb.

📒 Files selected for processing (6)
  • components/core/src/clp_s/InputConfig.cpp
  • components/core/src/clp_s/InputConfig.hpp
  • components/core/src/clp_s/JsonParser.cpp
  • components/core/src/clp_s/Utils.cpp
  • components/core/src/clp_s/Utils.hpp
  • components/core/src/clp_s/log_converter/log_converter.cpp

Comment on lines +494 to +518
auto reader{try_create_reader(path, network_auth)};
if (nullptr == reader) {
if (attempt < max_retries) {
SPDLOG_WARN("Failed to create reader for {}, will retry.", path.path);
continue;
}
SPDLOG_ERROR("Failed to open input {} for reading.", path.path);
return {{}, FileType::Unknown};
}

auto result = try_deduce_reader_type(reader);
auto const& [nested_readers, file_type] = result;

if (FileType::Unknown == file_type
&& NetworkUtils::is_retryable_curl_error(reader.get())
&& attempt < max_retries)
{
NetworkUtils::check_and_log_curl_error(path.path, reader.get());
continue;
}

return result;
}

return {{}, FileType::Unknown};

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Retryable curl error on final attempt is silently dropped; non-retryable curl error on type deduction is never logged.

Two related observability gaps in this retry helper:

  1. When attempt == max_retries and the error is a retryable curl error, the if (... && attempt < max_retries) guard skips check_and_log_curl_error and falls through to return result;. Since result is {{}, FileType::Unknown}, downstream callers in JsonParser::ingest (lines 681–683) and convert_files (lines 59–64) gate their own logging on false == nested_readers.empty() and therefore also skip it. The last-attempt curl code/message is never logged, which defeats the diagnostic goal of this PR.

  2. For a non-retryable curl error on type deduction (e.g., CURLE_COULDNT_RESOLVE_HOST), the helper bypasses the retry branch entirely and returns without calling check_and_log_curl_error; again the callers cannot log because nested_readers is empty.

Consider unconditionally logging the curl error (when present) on any path that ends in FileType::Unknown, and reserve attempt < max_retries for the continue decision only.

🛠️ Suggested fix
-        if (FileType::Unknown == file_type
-            && NetworkUtils::is_retryable_curl_error(reader.get())
-            && attempt < max_retries)
-        {
-            NetworkUtils::check_and_log_curl_error(path.path, reader.get());
-            continue;
-        }
-
-        return result;
+        if (FileType::Unknown == file_type) {
+            auto const retryable{NetworkUtils::is_retryable_curl_error(reader.get())};
+            if (retryable && attempt < max_retries) {
+                NetworkUtils::check_and_log_curl_error(path.path, reader.get());
+                continue;
+            }
+            // Final attempt or non-retryable failure: log any curl error before giving up
+            // so the caller (which sees an empty nested_readers vector) isn't left in the dark.
+            NetworkUtils::check_and_log_curl_error(path.path, reader.get());
+        }
+
+        return result;
🧰 Tools
🪛 Cppcheck (2.20.0)

[style] 508-508: The function 'wildcard_query_matches_any_encoded_var' is never used.

(unusedFunction)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/core/src/clp_s/InputConfig.cpp` around lines 494 - 518, The retry
helper currently skips logging curl errors when file_type == FileType::Unknown
on the final attempt or for non-retryable curl errors; update the control flow
around try_create_reader/try_deduce_reader_type so that when result yields
FileType::Unknown and NetworkUtils::check_and_log_curl_error(reader.get())
indicates a curl error you always call check_and_log_curl_error (so
non-retryable and final-attempt errors get logged), but only use attempt <
max_retries to decide whether to continue the loop (i.e., keep
NetworkUtils::is_retryable_curl_error(reader.get()) && attempt < max_retries for
the continue branch); locate this logic around the
try_create_reader/try_deduce_reader_type result handling and adjust the branches
that reference reader, result, nested_readers, file_type, attempt, and
max_retries accordingly.

@gibber9809 gibber9809 changed the title fix(core): add retry with exponential backoff for transient S3 read failures fix(core): Add retry with exponential backoff for transient S3 read failures Apr 18, 2026
@gibber9809 gibber9809 self-requested a review April 18, 2026 15:21
gibber9809
gibber9809 previously approved these changes Apr 18, 2026

@gibber9809 gibber9809 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for the changes. Directly edited the PR title as well. Should be good to merge once the CI finishes.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
components/core/src/clp_s/log_converter/log_converter.cpp (1)

44-67: 🧹 Nitpick | 🔵 Trivial

LGTM — refactor consolidates retry/deduce and matches the JsonParser pattern.

Consolidation into try_create_reader_and_deduce_type_with_retries removes the duplicated retry loop and the prior double-logging issue, and the false == nested_readers.empty() guard before check_and_log_curl_error(..., nested_readers.front().get()) correctly mirrors the pattern in JsonParser::ingest() (guarding against the empty-vector return on reader-creation failure from the helper).

One small note: unlike JsonParser::ingest(), this path does not call close_nested_readers(nested_readers) on the error return at line 66 (nor on the successful break path before convert_file at line 70). Since convert_files returns immediately to main on error, it's effectively benign here, but relying on shared-pointer destruction for cleanup is inconsistent with the sibling callsite — worth considering for symmetry.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/core/src/clp_s/log_converter/log_converter.cpp` around lines 44 -
67, The switch handling the result of
try_create_reader_and_deduce_type_with_retries doesn't call
clp_s::close_nested_readers(nested_readers) for symmetry with
JsonParser::ingest: ensure you call clp_s::close_nested_readers(nested_readers)
in the error/default branch just before returning false, and also call
clp_s::close_nested_readers(nested_readers) on the LogText/EmptyFile path after
the switch (before invoking convert_file) so nested readers are explicitly
closed in both success and failure paths.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@components/core/src/clp_s/log_converter/log_converter.cpp`:
- Around line 44-67: The switch handling the result of
try_create_reader_and_deduce_type_with_retries doesn't call
clp_s::close_nested_readers(nested_readers) for symmetry with
JsonParser::ingest: ensure you call clp_s::close_nested_readers(nested_readers)
in the error/default branch just before returning false, and also call
clp_s::close_nested_readers(nested_readers) on the LogText/EmptyFile path after
the switch (before invoking convert_file) so nested readers are explicitly
closed in both success and failure paths.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 0cb0e2e9-68bd-4b90-abda-1fa443226e34

📥 Commits

Reviewing files that changed from the base of the PR and between 90f43fb and 1aa75f7.

📒 Files selected for processing (1)
  • components/core/src/clp_s/log_converter/log_converter.cpp

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@components/core/src/clp_s/InputConfig.cpp`:
- Around line 494-502: The current loop unconditionally retries when
try_create_reader(path, network_auth) returns nullptr; change this so
non-transient failures fail fast: after try_create_reader(...) returns nullptr,
detect whether the path is a network resource (e.g. check the path/url kind or
use the NetworkReader construction error indicator) and only apply the
exponential backoff/retry when the failure is a network/transient curl-type
error (use is_retryable_curl_error or the NetworkReader-specific error info);
for filesystem/missing-permission or other non-retryable reasons, log an
SPDLOG_ERROR with the path and return {{}, FileType::Unknown} immediately
instead of continuing through attempt/max_retries.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 7c9ec2e5-495a-4a66-b57b-61890a317ffe

📥 Commits

Reviewing files that changed from the base of the PR and between 1aa75f7 and 834f44d.

📒 Files selected for processing (4)
  • components/core/src/clp_s/InputConfig.cpp
  • components/core/src/clp_s/Utils.cpp
  • components/core/src/clp_s/Utils.hpp
  • components/core/src/clp_s/log_converter/log_converter.cpp

Comment on lines +494 to +502
auto reader{try_create_reader(path, network_auth)};
if (nullptr == reader) {
if (attempt < max_retries) {
SPDLOG_WARN("Failed to create reader for {}, will retry.", path.path);
continue;
}
SPDLOG_ERROR("Failed to open input {} for reading.", path.path);
return {{}, FileType::Unknown};
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Unconditional retry of try_create_reader adds wasted backoff for non-transient failures.

try_create_reader returns nullptr for reasons that are almost always non-transient: missing files / permission denied on the filesystem path, invalid URLs or missing/invalid AWS credentials on the network path (the clp::NetworkReader constructor does not itself perform the curl transfer — transient S3 errors surface later during read). Retrying these paths with exponential backoff delays the real error by ~7s (1+2+4) with no hope of succeeding.

Consider scoping retries to the network path and/or only when is_retryable_curl_error applies, and failing fast for filesystem creation failures.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@components/core/src/clp_s/InputConfig.cpp` around lines 494 - 502, The
current loop unconditionally retries when try_create_reader(path, network_auth)
returns nullptr; change this so non-transient failures fail fast: after
try_create_reader(...) returns nullptr, detect whether the path is a network
resource (e.g. check the path/url kind or use the NetworkReader construction
error indicator) and only apply the exponential backoff/retry when the failure
is a network/transient curl-type error (use is_retryable_curl_error or the
NetworkReader-specific error info); for filesystem/missing-permission or other
non-retryable reasons, log an SPDLOG_ERROR with the path and return {{},
FileType::Unknown} immediately instead of continuing through
attempt/max_retries.

@kirkrodrigues kirkrodrigues left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deferring to @gibber9809's review.

For the PR title, how about:

fix(core): Add retries with exponential backoff for transient S3 read failures before any data is read.

Can we also edit the PR description to more accurately reflect the changes in this PR?

@goynam goynam changed the title fix(core): Add retry with exponential backoff for transient S3 read failures fix(core): Add retries with exponential backoff for transient S3 read failures before any data is read. Apr 20, 2026
@gibber9809 gibber9809 merged commit ffa05a7 into y-scope:main Apr 20, 2026
30 checks passed
@junhaoliao junhaoliao added this to the Mid-April 2026 milestone Apr 24, 2026
junhaoliao pushed a commit to junhaoliao/clp that referenced this pull request May 17, 2026
… failures before any data is read. (y-scope#2203)

Co-authored-by: Naman Goyal <namangg@amazon.com>
Co-authored-by: Devin Gibson <gibber9809@users.noreply.github.com>
Co-authored-by: dgibson <devinbook1@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants