Skip to content

NetlinkRequest: retry on ErrDumpInterrupted in NetlinkRequest.Execute#1170

Merged
aboch merged 1 commit intovishvananda:mainfrom
ti-mo:main
Apr 3, 2026
Merged

NetlinkRequest: retry on ErrDumpInterrupted in NetlinkRequest.Execute#1170
aboch merged 1 commit intovishvananda:mainfrom
ti-mo:main

Conversation

@ti-mo
Copy link
Copy Markdown
Contributor

@ti-mo ti-mo commented Mar 3, 2026

This PR makes NetlinkRequest.Execute() retry up to 10 times before finally returning ErrDumpInterrupted. Given the lack of responses in #1163, I assume this isn't a controversial proposal.

Fixes: #1163.

Summary by CodeRabbit

  • Bug Fixes
    • Improved reliability of network dump operations by adding bounded automatic retries for interrupted transfers (up to 10 attempts), reducing transient failures and increasing success rates.
    • When retries are exhausted, the system now returns a single consolidated error instead of partial results, simplifying diagnosis and avoiding confusing partial outputs.

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Mar 3, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Replaces a single-call ExecuteIter in NetlinkRequest.Execute with a bounded retry loop (10 attempts) that retries on ErrDumpInterrupted, accumulating messages across attempts and returning collected messages on success or a wrapped error if retries are exhausted or a non-interrupt error occurs.

Changes

Cohort / File(s) Summary
Netlink dump retry mechanism
nl/nl_linux.go
Replaces a single ExecuteIter call with a bounded retry loop (10 attempts). Collects messages into a local result slice each iteration, retries when ErrDumpInterrupted is seen, returns consolidated messages on success, and returns a wrapped error if a non-interrupt error occurs or retries are exhausted.

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant NetlinkLib as NetlinkRequest.Execute
    participant Kernel as Netlink Socket / Kernel

    Caller->>NetlinkLib: Execute request
    NetlinkLib->>Kernel: ExecuteIter (read dump)
    Kernel-->>NetlinkLib: Messages (+ optional ErrDumpInterrupted)
    alt ErrDumpInterrupted
        NetlinkLib->>NetlinkLib: retry up to 10 times (collect messages)
        NetlinkLib->>Kernel: ExecuteIter (retry)
        Kernel-->>NetlinkLib: Messages (+/− ErrDumpInterrupted)
    end
    alt Success (no ErrDumpInterrupted)
        NetlinkLib-->>Caller: consolidated messages
    else Retries exhausted or non-interrupt error
        NetlinkLib-->>Caller: wrapped error ("execute netlink request: <err>")
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐇 I poked the netlink, soft and bright,
The kernel hiccuped, then gave me a sight.
Ten tiny hops, I gathered the tune,
Until the dump settled beneath the moon. 🥕

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Linked Issues check ✅ Passed The code implements the core requirement from issue #1163: a bounded retry loop (10 iterations) in NetlinkRequest.Execute to automatically retry on ErrDumpInterrupted before returning an error.
Out of Scope Changes check ✅ Passed The changes are limited to nl/nl_linux.go and focus solely on implementing the retry mechanism for ErrDumpInterrupted as specified in the linked issue, with no extraneous modifications.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: adding retry logic on ErrDumpInterrupted in NetlinkRequest.Execute.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nl/nl_linux.go`:
- Around line 544-550: Store the original callback reference before starting
retries and restore it on each retry: save the incoming callback parameter f
into a local variable (e.g., origF) where NetlinkSocket is created/used, and on
any retry path that reinitializes or replaces NetlinkSocket or its callback, set
the socket's callback back to origF (rather than a mutated or nil value) and
then proceed with the retry loop that uses maxRetries and i; ensure references
to NetlinkSocket and f/origF are updated so the original function is preserved
across all retry attempts.
- Around line 678-687: The callback `f` is overwritten with `dummyMsgIterFunc`
when the user callback returns false, but that mutated `f` is still used across
retries causing the original callback to be lost; to fix, capture and preserve
the original callback at the start of the function (e.g., save `origF := f`),
use `f = dummyMsgIterFunc` only for draining, and before jumping to the `retry`
label restore it (`f = origF`) so subsequent retry attempts invoke the user's
original callback; update references to `f`, `dummyMsgIterFunc`, and the `retry`
flow accordingly.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c7039a4 and 2d11699.

📒 Files selected for processing (1)
  • nl/nl_linux.go

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
nl/nl_linux.go (1)

616-623: ⚠️ Potential issue | 🔴 Critical

Retries currently leak attempt-local callback state and can emit duplicate data.

At Line 622, control jumps to retry after earlier messages in the same attempt may already have been sent to f, so a later successful attempt can re-emit data (notably affecting Execute() aggregation). Also, if f was swapped to dummyMsgIterFunc earlier, that mutated callback is reused on retry, suppressing the original callback. This contradicts the intent in Lines 618-619.

🐛 Proposed fix direction
 func (req *NetlinkRequest) ExecuteIter(sockType int, resType uint16, f func(msg []byte) bool) error {
 	const maxRetries = 10
+	origF := f
+	var staged [][]byte

 retry:
+	f = origF
+	staged = staged[:0]

 	if err := s.Send(req); err != nil {
 		return err
 	}

 ...
-			if cont := f(m.Data); !cont {
-				// Drain the rest of the messages from the kernel but don't
-				// pass them to the iterator func.
-				f = dummyMsgIterFunc
-			}
+			// Stage per-attempt data; only flush after a successful, non-interrupted dump.
+			staged = append(staged, append([]byte(nil), m.Data...))
 ...
 	}
+	for _, data := range staged {
+		if cont := origF(data); !cont {
+			break
+		}
+	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nl/nl_linux.go` around lines 616 - 623, The retry logic leaks attempt-local
callback state and can re-emit or suppress messages; to fix, capture the
original callback (save f to origF) and ensure each attempt uses a fresh local
iterator/callback variable that you reinitialize before jumping to the retry
label, and restore origF when deciding to retry so dummyMsgIterFunc or any
mutated f isn't reused; additionally ensure partial messages from a failed
attempt are not forwarded to origF (either buffer per-attempt messages and only
forward after a successful attempt, or suppress forwarding until success),
updating the code paths around the retry label, the i/maxRetries loop, and where
f is set so Execute() aggregation sees each message exactly once.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nl/nl_linux.go`:
- Around line 620-623: The retry loop using the label "retry" increments i and
checks "if i < maxRetries" which makes the loop allow only maxRetries-1 retries;
change the loop condition to "if i <= maxRetries" or adjust the
increment/initialization so the loop performs exactly maxRetries attempts, and
also update the reporting that prints i to reflect whether i counts attempts or
retries (e.g., report attempts as i or retries as i-1) so the message from the
code that logs "%d retries" matches the corrected counting; make changes around
the "retry" label, the variables i and maxRetries, and the final log/print that
emits the retry count.

---

Duplicate comments:
In `@nl/nl_linux.go`:
- Around line 616-623: The retry logic leaks attempt-local callback state and
can re-emit or suppress messages; to fix, capture the original callback (save f
to origF) and ensure each attempt uses a fresh local iterator/callback variable
that you reinitialize before jumping to the retry label, and restore origF when
deciding to retry so dummyMsgIterFunc or any mutated f isn't reused;
additionally ensure partial messages from a failed attempt are not forwarded to
origF (either buffer per-attempt messages and only forward after a successful
attempt, or suppress forwarding until success), updating the code paths around
the retry label, the i/maxRetries loop, and where f is set so Execute()
aggregation sees each message exactly once.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2d11699 and 98ca605.

📒 Files selected for processing (1)
  • nl/nl_linux.go

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (2)
nl/nl_linux.go (2)

614-620: ⚠️ Potential issue | 🔴 Critical

Critical: Retry path can emit corrupted iteration semantics (stale/duplicate data and lost callback).

At Line 619, jumping to retry from inside the receive loop retries before fully draining the current multipart response. Combined with callback invocation before interruption is known (see Line 674) and callback mutation at Line 677, this can (a) leak partial data into Execute() accumulations, (b) duplicate entries across attempts, and (c) keep f as dummyMsgIterFunc on subsequent retries.

🔧 Proposed fix sketch (drain first, preserve callback, then retry)
 func (req *NetlinkRequest) ExecuteIter(sockType int, resType uint16, f func(msg []byte) bool) error {
 	var (
 		s        *NetlinkSocket
 		err      error
-		attempts = 10
+		maxRetries = 10
+		attempts   = maxRetries
 	)
+	origF := f

 retry:
 	if err := s.Send(req); err != nil {
 		return err
 	}
@@
 			if m.Header.Flags&unix.NLM_F_DUMP_INTR != 0 {
-				if attempts--; attempts > 0 {
-					goto retry
-				}
 				dumpIntr = true
+				// Finish draining this multipart response first.
+				f = dummyMsgIterFunc
 			}
@@
 	if dumpIntr {
-		return fmt.Errorf("dump interrupted after multiple attempts: %w", ErrDumpInterrupted)
+		if attempts--; attempts > 0 {
+			f = origF
+			goto retry
+		}
+		return fmt.Errorf("dump interrupted after %d retries: %w", maxRetries, ErrDumpInterrupted)
 	}

Also applies to: 674-678, 581-581

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nl/nl_linux.go` around lines 614 - 620, The retry jump from inside the
receive loop (the attempts-- then goto retry path) can leave the current
multipart Netlink response undrained and can corrupt iteration semantics
(leaking partial data, duplicating entries and clobbering the callback f to
dummyMsgIterFunc); fix by draining the rest of the multipart response before
performing the retry, preserve the original callback by using a local saved
callback variable instead of overwriting f (avoid assigning dummyMsgIterFunc
until after the entire exchange is known to be finished), and ensure no samples
are emitted to Execute() accumulations until the request fully succeeds; locate
the receive loop and retry label (retry), the attempts decrement, the callback
variable f, dummyMsgIterFunc, and Execute() accumulation code to implement these
changes.

547-547: ⚠️ Potential issue | 🟠 Major

Retry budget is off-by-one vs “retry up to 10 times”, and exhaustion context is vague.

With attempts = 10 and if attempts--; attempts > 0 at Line 618, the code performs at most 9 retries after the first interrupted attempt. Also, Line 685 should include the concrete retry count for clearer diagnostics.

Also applies to: 618-620, 685-685

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nl/nl_linux.go` at line 547, The retry loop uses "attempts" and the
conditional "if attempts--; attempts > 0" which causes one fewer retry than
intended; change the logic to perform exactly 10 retries by either initializing
a separate const retries = 10 and iterating with a clear loop (e.g. for i := 0;
i < retries; i++) or adjust the decrement/check to use >= 0 so the count matches
the intended number, and update the exhaustion error/log where the code reports
failure (the final error path that currently uses the "attempts" variable) to
include the concrete retry count (e.g., the configured retries constant or an
explicit attemptsPerformed value) for clearer diagnostics.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@nl/nl_linux.go`:
- Around line 614-620: The retry jump from inside the receive loop (the
attempts-- then goto retry path) can leave the current multipart Netlink
response undrained and can corrupt iteration semantics (leaking partial data,
duplicating entries and clobbering the callback f to dummyMsgIterFunc); fix by
draining the rest of the multipart response before performing the retry,
preserve the original callback by using a local saved callback variable instead
of overwriting f (avoid assigning dummyMsgIterFunc until after the entire
exchange is known to be finished), and ensure no samples are emitted to
Execute() accumulations until the request fully succeeds; locate the receive
loop and retry label (retry), the attempts decrement, the callback variable f,
dummyMsgIterFunc, and Execute() accumulation code to implement these changes.
- Line 547: The retry loop uses "attempts" and the conditional "if attempts--;
attempts > 0" which causes one fewer retry than intended; change the logic to
perform exactly 10 retries by either initializing a separate const retries = 10
and iterating with a clear loop (e.g. for i := 0; i < retries; i++) or adjust
the decrement/check to use >= 0 so the count matches the intended number, and
update the exhaustion error/log where the code reports failure (the final error
path that currently uses the "attempts" variable) to include the concrete retry
count (e.g., the configured retries constant or an explicit attemptsPerformed
value) for clearer diagnostics.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 37fc0b86-e0bd-4be4-b685-3cd4e8a97edb

📥 Commits

Reviewing files that changed from the base of the PR and between 98ca605 and a0f03ac.

📒 Files selected for processing (1)
  • nl/nl_linux.go

@ti-mo
Copy link
Copy Markdown
Contributor Author

ti-mo commented Mar 4, 2026

Okay, obviously putting a retry in ExecuteIter isn't great, but my goal was to put this in a spot where all callers could benefit from it.

After reading through a bit more of the surrounding code, I don't think a retry mechanism can be married with something like RouteListFilteredIter, since the point is to avoid buffering unwanted entries. Direct users of ExecuteIter will need to handle this on their own, since they'll need to invalidate their buffers and retry (or fail with ErrDumpInterrupted like they do now).

@ti-mo ti-mo changed the title Retry dumps up to 10 times in ExecuteIter before returning ErrDumpInterrupted Retry up to 10 times on ErrDumpInterrupted in NetlinkRequest.Execute Mar 4, 2026
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
nl/nl_linux.go (1)

523-538: Add explicit retry-exhaustion context for ErrDumpInterrupted.

At Line 538, exhausted dump interruptions return a generic wrapped error without attempt count. Including attempt metadata would make CI/debugging much clearer.

Suggested diff
 func (req *NetlinkRequest) Execute(sockType int, resType uint16) ([][]byte, error) {
+	const maxAttempts = 10
 	var err error
-	for range 10 {
+	for range maxAttempts {
 		var res [][]byte
 		err = req.ExecuteIter(sockType, resType, func(msg []byte) bool {
 			res = append(res, msg)
 			return true
 		})
 		if errors.Is(err, ErrDumpInterrupted) {
 			continue
 		}
 		if err == nil {
 			return res, nil
 		}
 		break
 	}
 
+	if errors.Is(err, ErrDumpInterrupted) {
+		return nil, fmt.Errorf("execute netlink request interrupted after %d attempts: %w", maxAttempts, err)
+	}
 	return nil, fmt.Errorf("execute netlink request: %w", err)
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@nl/nl_linux.go` around lines 523 - 538, The loop swallows multiple
ErrDumpInterrupted attempts and returns a generic wrapped error; add a retry
counter and when ErrDumpInterrupted keeps occurring for all retries return an
error that includes the exhausted-attempts metadata. Modify the block that calls
req.ExecuteIter (the loop around req.ExecuteIter, ErrDumpInterrupted, res and
final return) to increment an attempts variable each iteration, and on final
failure format/Wrap the returned error to include the attempt count (e.g.,
"execute netlink request: %w (attempts=%d)") so consumers can see how many
retries occurred before giving up.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@nl/nl_linux.go`:
- Around line 523-538: The loop swallows multiple ErrDumpInterrupted attempts
and returns a generic wrapped error; add a retry counter and when
ErrDumpInterrupted keeps occurring for all retries return an error that includes
the exhausted-attempts metadata. Modify the block that calls req.ExecuteIter
(the loop around req.ExecuteIter, ErrDumpInterrupted, res and final return) to
increment an attempts variable each iteration, and on final failure format/Wrap
the returned error to include the attempt count (e.g., "execute netlink request:
%w (attempts=%d)") so consumers can see how many retries occurred before giving
up.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6dd928c5-786a-4d19-b23c-f215fa832ab9

📥 Commits

Reviewing files that changed from the base of the PR and between a0f03ac and 69a8708.

📒 Files selected for processing (1)
  • nl/nl_linux.go

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@nl/nl_linux.go`:
- Around line 525-541: The loop currently discards the accumulated messages
`res` when retries exhaust with `ErrDumpInterrupted` and returns nil; change the
final error branch so that when `errors.Is(err, ErrDumpInterrupted)` you return
the last-collected `res` along with the wrapped error instead of `nil` (preserve
the results from `req.ExecuteIter`), i.e. return `res, fmt.Errorf("execute
netlink request (%d attempts): %w", attempts, err)` so callers (e.g., code using
`addr_linux.go`) receive best-effort partial results.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: cceb2dee-e3ad-4fde-93e3-ef7d90ae454b

📥 Commits

Reviewing files that changed from the base of the PR and between 69a8708 and 8ce0bea.

📒 Files selected for processing (1)
  • nl/nl_linux.go

@ti-mo ti-mo force-pushed the main branch 2 times, most recently from c9f866e to 03930dc Compare March 4, 2026 15:54
@ti-mo
Copy link
Copy Markdown
Contributor Author

ti-mo commented Mar 5, 2026

@aboch I've iterated on this a bit after some AI suggestions. Ended up moving it to Execute(), and apparently we can't wrap errors since downstream code performs type assertions on error values. Not going to update those now. Cilium CI passes with this patch.

@ti-mo
Copy link
Copy Markdown
Contributor Author

ti-mo commented Mar 10, 2026

Rebased on top of #1169 to pick up the RDMA fix and also fixed XfrmStateGet/Del on Linux 6.13 and later, so TestXfrmStateWithPcpunumWithSADir now passes and CI is back to being green.

@aboch PTAL

@aboch
Copy link
Copy Markdown
Collaborator

aboch commented Mar 10, 2026

Thank you @ti-mo !

Can you please move the xrfm fix to a dedicated PR.
This projects follows a "1 commit/topic per PR" approach.

@ti-mo
Copy link
Copy Markdown
Contributor Author

ti-mo commented Mar 10, 2026

Thank you @ti-mo !

Can you please move the xrfm fix to a dedicated PR. This projects follows a "1 commit/topic per PR" approach.

@aboch Done: #1172.

@ti-mo ti-mo force-pushed the main branch 2 times, most recently from 20293a4 to 9f69158 Compare March 10, 2026 14:53
@aboch
Copy link
Copy Markdown
Collaborator

aboch commented Mar 11, 2026

Thanks @ti-mo for the proposal.

I have been thinking if it is the correct behavior for the library to keep retrying on its own and masking the original dump interrupted issue.

Would it make sense to have this functionality controllable by a package level flag?
I was thinking to have the internal retry off by default.
Then, any integration of this netlink library can decide whether to turn it on.

WDYT?

@ti-mo
Copy link
Copy Markdown
Contributor Author

ti-mo commented Mar 18, 2026

[...] a package level flag [...]

That's rarely the answer, but I've added Handle.RetryInterrupted() and propagated it to NetlinkRequest. I plan on making a follow-up PR to deprecate Handle.DisableVFInfoCollection() and add a NewHandleWithOptions() that takes a *HandleOptions with all fields exported to make Handle configurable in a declarative manner.

IIUC, DisableVFInfoCollection() was introduced for Cilium and it looks like we're the only public user. The requirement to call a method on Handle is quite inconvenient.

@ti-mo ti-mo requested a review from aboch March 18, 2026 13:19
@ti-mo ti-mo changed the title Retry up to 10 times on ErrDumpInterrupted in NetlinkRequest.Execute NetlinkRequest: retry on ErrDumpInterrupted in NetlinkRequest.Execute Mar 18, 2026
@aboch
Copy link
Copy Markdown
Collaborator

aboch commented Mar 20, 2026

@ti-mo what is your proposal for enabling this option on the package level handle, pkgHandle ?

@ti-mo
Copy link
Copy Markdown
Contributor Author

ti-mo commented Mar 23, 2026

@aboch I've included a ConfigureHandle in #1174: 369d043

func ConfigureHandle(opts HandleOptions) error

Not great API design, since independent users of this package that want conflicting flags will inevitably step on each other. Given the shared nature of pkgHandle, not sure what other option we have. If you have any more feedback, please give it in the other PR.

@aboch
Copy link
Copy Markdown
Collaborator

aboch commented Mar 23, 2026

@aboch I've included a ConfigureHandle in #1174: 369d043

func ConfigureHandle(opts HandleOptions) error

Not great API design, since independent users of this package that want conflicting flags will inevitably step on each other. Given the shared nature of pkgHandle, not sure what other option we have. If you have any more feedback, please give it in the other PR.

I see it now.
Thank you

Copilot AI review requested due to automatic review settings April 2, 2026 14:13
The addition of ErrDumpInterrupted forced all callers to deal explicitly with
interrupted dumps, while typically the caller should simply retry.

This commit adds a retry system that makes the library retry up to 10 times
before finally giving up with ErrDumpInterrupted. This should lead to fewer
surprised users.

It can be enabled by calling Handle.RetryInterrupted.

Signed-off-by: Timo Beckers <timo@incline.eu>
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds an opt-in mechanism to retry netlink dump operations when the kernel indicates an interrupted dump (NLM_F_DUMP_INTR / ErrDumpInterrupted), aiming to reduce transient failures (per #1163).

Changes:

  • Add RetryInterrupted configuration on nl.NetlinkRequest and plumb it from netlink.Handle.
  • Update NetlinkRequest.Execute() to retry up to 10 times (when enabled) upon ErrDumpInterrupted.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
nl/nl_linux.go Adds a RetryInterrupted flag and implements bounded retries in NetlinkRequest.Execute().
handle_linux.go Adds Handle.RetryInterrupted() and passes the option through to newly created netlink requests.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@aboch
Copy link
Copy Markdown
Collaborator

aboch commented Apr 2, 2026

LGTM

@ti-mo please check the last last comment from coderabbit. If you tell me it is not relevant, we merge this.
Thank you

@ti-mo
Copy link
Copy Markdown
Contributor Author

ti-mo commented Apr 3, 2026

Yep, it's a false positive, Copilot took the wrapping comment a bit too literally. We want to avoid wrapping non-ErrDumpInterrupted errors. After a cursory local check it's fine, but I'm putting this through a full round of Cilium CI just in case: cilium/cilium#44315.

@aboch aboch merged commit 188504c into vishvananda:main Apr 3, 2026
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Automatic retries on NLM_F_DUMP_INTR

4 participants