NetlinkRequest: retry on ErrDumpInterrupted in NetlinkRequest.Execute#1170
NetlinkRequest: retry on ErrDumpInterrupted in NetlinkRequest.Execute#1170aboch merged 1 commit intovishvananda:mainfrom
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughReplaces a single-call Changes
Sequence Diagram(s)sequenceDiagram
participant Caller
participant NetlinkLib as NetlinkRequest.Execute
participant Kernel as Netlink Socket / Kernel
Caller->>NetlinkLib: Execute request
NetlinkLib->>Kernel: ExecuteIter (read dump)
Kernel-->>NetlinkLib: Messages (+ optional ErrDumpInterrupted)
alt ErrDumpInterrupted
NetlinkLib->>NetlinkLib: retry up to 10 times (collect messages)
NetlinkLib->>Kernel: ExecuteIter (retry)
Kernel-->>NetlinkLib: Messages (+/− ErrDumpInterrupted)
end
alt Success (no ErrDumpInterrupted)
NetlinkLib-->>Caller: consolidated messages
else Retries exhausted or non-interrupt error
NetlinkLib-->>Caller: wrapped error ("execute netlink request: <err>")
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nl/nl_linux.go`:
- Around line 544-550: Store the original callback reference before starting
retries and restore it on each retry: save the incoming callback parameter f
into a local variable (e.g., origF) where NetlinkSocket is created/used, and on
any retry path that reinitializes or replaces NetlinkSocket or its callback, set
the socket's callback back to origF (rather than a mutated or nil value) and
then proceed with the retry loop that uses maxRetries and i; ensure references
to NetlinkSocket and f/origF are updated so the original function is preserved
across all retry attempts.
- Around line 678-687: The callback `f` is overwritten with `dummyMsgIterFunc`
when the user callback returns false, but that mutated `f` is still used across
retries causing the original callback to be lost; to fix, capture and preserve
the original callback at the start of the function (e.g., save `origF := f`),
use `f = dummyMsgIterFunc` only for draining, and before jumping to the `retry`
label restore it (`f = origF`) so subsequent retry attempts invoke the user's
original callback; update references to `f`, `dummyMsgIterFunc`, and the `retry`
flow accordingly.
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
nl/nl_linux.go (1)
616-623:⚠️ Potential issue | 🔴 CriticalRetries currently leak attempt-local callback state and can emit duplicate data.
At Line 622, control jumps to
retryafter earlier messages in the same attempt may already have been sent tof, so a later successful attempt can re-emit data (notably affectingExecute()aggregation). Also, iffwas swapped todummyMsgIterFuncearlier, that mutated callback is reused on retry, suppressing the original callback. This contradicts the intent in Lines 618-619.🐛 Proposed fix direction
func (req *NetlinkRequest) ExecuteIter(sockType int, resType uint16, f func(msg []byte) bool) error { const maxRetries = 10 + origF := f + var staged [][]byte retry: + f = origF + staged = staged[:0] if err := s.Send(req); err != nil { return err } ... - if cont := f(m.Data); !cont { - // Drain the rest of the messages from the kernel but don't - // pass them to the iterator func. - f = dummyMsgIterFunc - } + // Stage per-attempt data; only flush after a successful, non-interrupted dump. + staged = append(staged, append([]byte(nil), m.Data...)) ... } + for _, data := range staged { + if cont := origF(data); !cont { + break + } + }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nl/nl_linux.go` around lines 616 - 623, The retry logic leaks attempt-local callback state and can re-emit or suppress messages; to fix, capture the original callback (save f to origF) and ensure each attempt uses a fresh local iterator/callback variable that you reinitialize before jumping to the retry label, and restore origF when deciding to retry so dummyMsgIterFunc or any mutated f isn't reused; additionally ensure partial messages from a failed attempt are not forwarded to origF (either buffer per-attempt messages and only forward after a successful attempt, or suppress forwarding until success), updating the code paths around the retry label, the i/maxRetries loop, and where f is set so Execute() aggregation sees each message exactly once.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nl/nl_linux.go`:
- Around line 620-623: The retry loop using the label "retry" increments i and
checks "if i < maxRetries" which makes the loop allow only maxRetries-1 retries;
change the loop condition to "if i <= maxRetries" or adjust the
increment/initialization so the loop performs exactly maxRetries attempts, and
also update the reporting that prints i to reflect whether i counts attempts or
retries (e.g., report attempts as i or retries as i-1) so the message from the
code that logs "%d retries" matches the corrected counting; make changes around
the "retry" label, the variables i and maxRetries, and the final log/print that
emits the retry count.
---
Duplicate comments:
In `@nl/nl_linux.go`:
- Around line 616-623: The retry logic leaks attempt-local callback state and
can re-emit or suppress messages; to fix, capture the original callback (save f
to origF) and ensure each attempt uses a fresh local iterator/callback variable
that you reinitialize before jumping to the retry label, and restore origF when
deciding to retry so dummyMsgIterFunc or any mutated f isn't reused;
additionally ensure partial messages from a failed attempt are not forwarded to
origF (either buffer per-attempt messages and only forward after a successful
attempt, or suppress forwarding until success), updating the code paths around
the retry label, the i/maxRetries loop, and where f is set so Execute()
aggregation sees each message exactly once.
There was a problem hiding this comment.
♻️ Duplicate comments (2)
nl/nl_linux.go (2)
614-620:⚠️ Potential issue | 🔴 CriticalCritical: Retry path can emit corrupted iteration semantics (stale/duplicate data and lost callback).
At Line 619, jumping to
retryfrom inside the receive loop retries before fully draining the current multipart response. Combined with callback invocation before interruption is known (see Line 674) and callback mutation at Line 677, this can (a) leak partial data intoExecute()accumulations, (b) duplicate entries across attempts, and (c) keepfasdummyMsgIterFuncon subsequent retries.🔧 Proposed fix sketch (drain first, preserve callback, then retry)
func (req *NetlinkRequest) ExecuteIter(sockType int, resType uint16, f func(msg []byte) bool) error { var ( s *NetlinkSocket err error - attempts = 10 + maxRetries = 10 + attempts = maxRetries ) + origF := f retry: if err := s.Send(req); err != nil { return err } @@ if m.Header.Flags&unix.NLM_F_DUMP_INTR != 0 { - if attempts--; attempts > 0 { - goto retry - } dumpIntr = true + // Finish draining this multipart response first. + f = dummyMsgIterFunc } @@ if dumpIntr { - return fmt.Errorf("dump interrupted after multiple attempts: %w", ErrDumpInterrupted) + if attempts--; attempts > 0 { + f = origF + goto retry + } + return fmt.Errorf("dump interrupted after %d retries: %w", maxRetries, ErrDumpInterrupted) }Also applies to: 674-678, 581-581
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nl/nl_linux.go` around lines 614 - 620, The retry jump from inside the receive loop (the attempts-- then goto retry path) can leave the current multipart Netlink response undrained and can corrupt iteration semantics (leaking partial data, duplicating entries and clobbering the callback f to dummyMsgIterFunc); fix by draining the rest of the multipart response before performing the retry, preserve the original callback by using a local saved callback variable instead of overwriting f (avoid assigning dummyMsgIterFunc until after the entire exchange is known to be finished), and ensure no samples are emitted to Execute() accumulations until the request fully succeeds; locate the receive loop and retry label (retry), the attempts decrement, the callback variable f, dummyMsgIterFunc, and Execute() accumulation code to implement these changes.
547-547:⚠️ Potential issue | 🟠 MajorRetry budget is off-by-one vs “retry up to 10 times”, and exhaustion context is vague.
With
attempts = 10andif attempts--; attempts > 0at Line 618, the code performs at most 9 retries after the first interrupted attempt. Also, Line 685 should include the concrete retry count for clearer diagnostics.Also applies to: 618-620, 685-685
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nl/nl_linux.go` at line 547, The retry loop uses "attempts" and the conditional "if attempts--; attempts > 0" which causes one fewer retry than intended; change the logic to perform exactly 10 retries by either initializing a separate const retries = 10 and iterating with a clear loop (e.g. for i := 0; i < retries; i++) or adjust the decrement/check to use >= 0 so the count matches the intended number, and update the exhaustion error/log where the code reports failure (the final error path that currently uses the "attempts" variable) to include the concrete retry count (e.g., the configured retries constant or an explicit attemptsPerformed value) for clearer diagnostics.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@nl/nl_linux.go`:
- Around line 614-620: The retry jump from inside the receive loop (the
attempts-- then goto retry path) can leave the current multipart Netlink
response undrained and can corrupt iteration semantics (leaking partial data,
duplicating entries and clobbering the callback f to dummyMsgIterFunc); fix by
draining the rest of the multipart response before performing the retry,
preserve the original callback by using a local saved callback variable instead
of overwriting f (avoid assigning dummyMsgIterFunc until after the entire
exchange is known to be finished), and ensure no samples are emitted to
Execute() accumulations until the request fully succeeds; locate the receive
loop and retry label (retry), the attempts decrement, the callback variable f,
dummyMsgIterFunc, and Execute() accumulation code to implement these changes.
- Line 547: The retry loop uses "attempts" and the conditional "if attempts--;
attempts > 0" which causes one fewer retry than intended; change the logic to
perform exactly 10 retries by either initializing a separate const retries = 10
and iterating with a clear loop (e.g. for i := 0; i < retries; i++) or adjust
the decrement/check to use >= 0 so the count matches the intended number, and
update the exhaustion error/log where the code reports failure (the final error
path that currently uses the "attempts" variable) to include the concrete retry
count (e.g., the configured retries constant or an explicit attemptsPerformed
value) for clearer diagnostics.
|
Okay, obviously putting a retry in ExecuteIter isn't great, but my goal was to put this in a spot where all callers could benefit from it. After reading through a bit more of the surrounding code, I don't think a retry mechanism can be married with something like |
ErrDumpInterruptedErrDumpInterrupted in NetlinkRequest.Execute
There was a problem hiding this comment.
🧹 Nitpick comments (1)
nl/nl_linux.go (1)
523-538: Add explicit retry-exhaustion context forErrDumpInterrupted.At Line 538, exhausted dump interruptions return a generic wrapped error without attempt count. Including attempt metadata would make CI/debugging much clearer.
Suggested diff
func (req *NetlinkRequest) Execute(sockType int, resType uint16) ([][]byte, error) { + const maxAttempts = 10 var err error - for range 10 { + for range maxAttempts { var res [][]byte err = req.ExecuteIter(sockType, resType, func(msg []byte) bool { res = append(res, msg) return true }) if errors.Is(err, ErrDumpInterrupted) { continue } if err == nil { return res, nil } break } + if errors.Is(err, ErrDumpInterrupted) { + return nil, fmt.Errorf("execute netlink request interrupted after %d attempts: %w", maxAttempts, err) + } return nil, fmt.Errorf("execute netlink request: %w", err) }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@nl/nl_linux.go` around lines 523 - 538, The loop swallows multiple ErrDumpInterrupted attempts and returns a generic wrapped error; add a retry counter and when ErrDumpInterrupted keeps occurring for all retries return an error that includes the exhausted-attempts metadata. Modify the block that calls req.ExecuteIter (the loop around req.ExecuteIter, ErrDumpInterrupted, res and final return) to increment an attempts variable each iteration, and on final failure format/Wrap the returned error to include the attempt count (e.g., "execute netlink request: %w (attempts=%d)") so consumers can see how many retries occurred before giving up.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@nl/nl_linux.go`:
- Around line 523-538: The loop swallows multiple ErrDumpInterrupted attempts
and returns a generic wrapped error; add a retry counter and when
ErrDumpInterrupted keeps occurring for all retries return an error that includes
the exhausted-attempts metadata. Modify the block that calls req.ExecuteIter
(the loop around req.ExecuteIter, ErrDumpInterrupted, res and final return) to
increment an attempts variable each iteration, and on final failure format/Wrap
the returned error to include the attempt count (e.g., "execute netlink request:
%w (attempts=%d)") so consumers can see how many retries occurred before giving
up.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@nl/nl_linux.go`:
- Around line 525-541: The loop currently discards the accumulated messages
`res` when retries exhaust with `ErrDumpInterrupted` and returns nil; change the
final error branch so that when `errors.Is(err, ErrDumpInterrupted)` you return
the last-collected `res` along with the wrapped error instead of `nil` (preserve
the results from `req.ExecuteIter`), i.e. return `res, fmt.Errorf("execute
netlink request (%d attempts): %w", attempts, err)` so callers (e.g., code using
`addr_linux.go`) receive best-effort partial results.
c9f866e to
03930dc
Compare
|
@aboch I've iterated on this a bit after some AI suggestions. Ended up moving it to Execute(), and apparently we can't wrap errors since downstream code performs type assertions on error values. Not going to update those now. Cilium CI passes with this patch. |
|
Thank you @ti-mo ! Can you please move the xrfm fix to a dedicated PR. |
20293a4 to
9f69158
Compare
|
Thanks @ti-mo for the proposal. I have been thinking if it is the correct behavior for the library to keep retrying on its own and masking the original dump interrupted issue. Would it make sense to have this functionality controllable by a package level flag? WDYT? |
That's rarely the answer, but I've added IIUC, DisableVFInfoCollection() was introduced for Cilium and it looks like we're the only public user. The requirement to call a method on Handle is quite inconvenient. |
ErrDumpInterrupted in NetlinkRequest.Execute|
@ti-mo what is your proposal for enabling this option on the package level handle, |
|
@aboch I've included a func ConfigureHandle(opts HandleOptions) errorNot great API design, since independent users of this package that want conflicting flags will inevitably step on each other. Given the shared nature of pkgHandle, not sure what other option we have. If you have any more feedback, please give it in the other PR. |
I see it now. |
The addition of ErrDumpInterrupted forced all callers to deal explicitly with interrupted dumps, while typically the caller should simply retry. This commit adds a retry system that makes the library retry up to 10 times before finally giving up with ErrDumpInterrupted. This should lead to fewer surprised users. It can be enabled by calling Handle.RetryInterrupted. Signed-off-by: Timo Beckers <timo@incline.eu>
There was a problem hiding this comment.
Pull request overview
This PR adds an opt-in mechanism to retry netlink dump operations when the kernel indicates an interrupted dump (NLM_F_DUMP_INTR / ErrDumpInterrupted), aiming to reduce transient failures (per #1163).
Changes:
- Add
RetryInterruptedconfiguration onnl.NetlinkRequestand plumb it fromnetlink.Handle. - Update
NetlinkRequest.Execute()to retry up to 10 times (when enabled) uponErrDumpInterrupted.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
nl/nl_linux.go |
Adds a RetryInterrupted flag and implements bounded retries in NetlinkRequest.Execute(). |
handle_linux.go |
Adds Handle.RetryInterrupted() and passes the option through to newly created netlink requests. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
LGTM @ti-mo please check the last last comment from coderabbit. If you tell me it is not relevant, we merge this. |
|
Yep, it's a false positive, Copilot took the wrapping comment a bit too literally. We want to avoid wrapping non-ErrDumpInterrupted errors. After a cursory local check it's fine, but I'm putting this through a full round of Cilium CI just in case: cilium/cilium#44315. |
This PR makes
NetlinkRequest.Execute()retry up to 10 times before finally returningErrDumpInterrupted. Given the lack of responses in #1163, I assume this isn't a controversial proposal.Fixes: #1163.
Summary by CodeRabbit