Skip to content

fix(lavasession): prevent stale reconnect candidates from polluting r…#2253

Merged
nimrod-teich merged 1 commit into
mainfrom
fix/reconnect-stale-reported-providers
Mar 26, 2026
Merged

fix(lavasession): prevent stale reconnect candidates from polluting r…#2253
nimrod-teich merged 1 commit into
mainfrom
fix/reconnect-stale-reported-providers

Conversation

@NadavLevi

@NadavLevi NadavLevi commented Mar 26, 2026

Copy link
Copy Markdown
Contributor

User description

…eported providers

ReconnectProviders could re-add a provider from a previous epoch after Reset() cleared the reported list during an epoch transition. Add a generation counter to ReportedProviders so stale candidates are discarded.

Description

Closes: #XXXX


Author Checklist

All items are required. Please add a note to the item if the item is not applicable and
please add links to any relevant follow up issues.

I have...

  • read the contribution guide
  • included the correct type prefix in the PR title, you can find examples of the prefixes below:
  • confirmed ! in the type prefix if API or client breaking change
  • targeted the main branch
  • provided a link to the relevant issue or specification
  • reviewed "Files changed" and left comments if necessary
  • included the necessary unit and integration tests
  • updated the relevant documentation or specification, including comments for documenting Go code
  • confirmed all CI checks have passed

Reviewers Checklist

All items are required. Please add a note if the item is not applicable and please add
your handle next to the items reviewed if you only reviewed selected items.

I have...

  • confirmed the correct type prefix in the PR title
  • confirmed all author checklist items have been addressed
  • reviewed state machine logic, API design and naming, documentation is accurate, tests and test coverage

Generated description

Below is a concise technical summary of the changes proposed in this PR:
Prevent stale reconnect candidates from re-adding removed providers by tracking a generation counter in ReportedProviders and gating ReconnectProviders processing during epoch transitions. Include backupProviders in reported-provider filtering so ConsumerSessionManager no longer logs missing pairing entries when backup providers are reported.

TopicDetails
Reconnect cleanup Prevent stale reconnect candidates by gating ReconnectProviders on the new generation counter, consolidating reporting logic under reportProviderLocked, and covering the epoch-transition race in a regression test so prior-epoch providers cannot reappear.
Modified files (3)
  • protocol/lavasession/consumer_session_manager_test.go
  • protocol/lavasession/reported_providers.go
  • protocol/lavasession/reported_providers_test.go
Latest Contributors(2)
UserCommitDate
NadavLevifix(session): prevent ...March 23, 2026
anna@magmadevs.comfeat(provideroptimizer...February 22, 2026
Backup lookup Include backupProviders when filtering reported entries so GetReportedProviders can handle backup pools without emitting missing-pairing errors.
Modified files (1)
  • protocol/lavasession/consumer_session_manager.go
Latest Contributors(2)
UserCommitDate
NadavLevifix(session): prevent ...March 23, 2026
anna@magmadevs.comfeat(provideroptimizer...February 22, 2026
This pull request is reviewed by Baz. Review like a pro on (Baz).

@qodo-code-review

Copy link
Copy Markdown

Review Summary by Qodo

Prevent stale reconnect candidates from polluting reported providers

🐞 Bug fix

Grey Divider

Walkthroughs

Description
• Add generation counter to ReportedProviders to prevent stale reconnect candidates
• Discard stale provider candidates after epoch transitions
• Prevent "Failed to find reported provider in pairing list" errors
• Add comprehensive test reproducing the reconnect race condition
Diagram
flowchart LR
  RC["ReconnectCandidates captures<br/>generation snapshot"]
  RC -- "epoch transition<br/>Reset() called" --> UT["UpdateAllProviders<br/>clears reportedProviders"]
  UT -- "generation incremented" --> GC["Generation check<br/>in ReconnectProviders"]
  GC -- "stale candidate<br/>generation mismatch" --> SKIP["Skip re-reporting<br/>stale provider"]
  GC -- "current generation<br/>matches" --> REPORT["Re-report provider"]
Loading

Grey Divider

File Changes

1. protocol/lavasession/reported_providers.go 🐞 Bug fix +24/-5

Add generation tracking to discard stale candidates

• Add generation field to ReportedProviders struct to track epoch transitions
• Increment generation counter in Reset() method
• Modify ReconnectCandidates() to return struct containing candidates and generation snapshot
• Add generation validation in ReconnectProviders() before re-reporting stale candidates
• Skip re-reporting if generation has changed since candidate snapshot was captured

protocol/lavasession/reported_providers.go


2. protocol/lavasession/consumer_session_manager_test.go 🧪 Tests +97/-0

Add test for reconnect race condition

• Add new test TestGetReportedProviders_ReconnectRace reproducing the race condition
• Test forces interleaving between ReconnectProviders and UpdateAllProviders
• Verify stale providers are not re-added after epoch transition
• Verify GetReportedProviders does not return providers from previous epoch

protocol/lavasession/consumer_session_manager_test.go


3. protocol/lavasession/reported_providers_test.go 🧪 Tests +2/-2

Update test for new return type

• Update existing test to access candidates field from new reconnectCandidatesResult struct
• Maintain test coverage for reconnect candidate functionality

protocol/lavasession/reported_providers_test.go


Grey Divider

Qodo Logo

@qodo-code-review

qodo-code-review Bot commented Mar 26, 2026

Copy link
Copy Markdown

Code Review by Qodo

🐞 Bugs (3) 📘 Rule violations (0) 📎 Requirement gaps (0) 📐 Spec deviations (0)

Grey Divider


Action required

1. Generation check TOCTOU 🐞 Bug ✓ Correctness
Description
In ReportedProviders.ReconnectProviders, the code checks rp.generation under an RLock but then
releases the lock before calling ReportProvider, allowing Reset() to run between the check and the
mutation and re-add a stale provider after an epoch transition. This defeats the intended
stale-candidate protection and can still lead to "reported provider not in pairing" failures
downstream.
Code

protocol/lavasession/reported_providers.go[R141-146]

+				rp.lock.RLock()
+				currentGen := rp.generation
+				rp.lock.RUnlock()
+				if currentGen == result.generation {
+					rp.ReportProvider(candidate.address, 0, 1, nil, []error{err}) // add a disconnection
+				} else {
Evidence
ReconnectProviders reads generation under rp.lock.RLock, unlocks, then conditionally calls
ReportProvider (which takes a write lock and mutates the map). Because Reset() also takes the write
lock and increments generation, a Reset can occur after the equality check but before ReportProvider
acquires the write lock, allowing a stale candidate to be re-reported into the new (post-Reset) map
despite the generation guard.

protocol/lavasession/reported_providers.go[126-154]
protocol/lavasession/reported_providers.go[32-40]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`ReconnectProviders()` compares `currentGen` to the snapshot generation, but then releases the lock before mutating state via `ReportProvider()`. A concurrent `Reset()` can happen in that gap, so a stale candidate can still be re-added post-reset.

### Issue Context
You already added a `generation` counter and capture it together with the candidate snapshot. The remaining missing piece is to ensure the **generation verification and the subsequent map mutation happen under the same critical section**.

### Fix Focus Areas
- protocol/lavasession/reported_providers.go[126-154]

### Expected fix approach
- Re-check `rp.generation == result.generation` **under `rp.lock.Lock()`**, and only then mutate the map.
- To avoid deadlocks, either:
 - add a private `reportProviderLocked(...)` helper that assumes the lock is held, and call it from both `ReportProvider()` and `ReconnectProviders()`, or
 - add a `ReportProviderIfGeneration(expectedGen, ...)` method that performs the generation check and mutation under one `Lock()`.
- Do not perform `Lock()` -> call public `ReportProvider()` (it re-locks).

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


2. Stale success removes reports 🐞 Bug ✓ Correctness
Description
On successful reconnect, ReconnectProviders always calls RemoveReport without any generation check,
so a stale reconnect attempt finishing after Reset() can delete a newly-added report for the same
provider address in the new epoch. This can silently drop valid reports and make QoS/unavailability
reporting incorrect.
Code

protocol/lavasession/reported_providers.go[R134-136]

			if err == nil {
				rp.RemoveReport(candidate.address)
			} else {
-				rp.ReportProvider(candidate.address, 0, 1, nil, []error{err}) // add a disconnection
Evidence
ReconnectProviders unconditionally removes the report on err == nil. RemoveReport deletes the
entry under a write lock, and Reset clears the map and bumps generation; without a generation guard
on the success path, a reconnect attempt from the previous generation can remove a report that was
created after Reset in the new generation.

protocol/lavasession/reported_providers.go[126-136]
protocol/lavasession/reported_providers.go[79-87]
protocol/lavasession/reported_providers.go[32-40]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
The reconnect **success** path (`err == nil`) removes the report without verifying the candidate is still in the same `generation`. A stale reconnect completion after `Reset()` can therefore delete a newly-created report in the new epoch.

### Issue Context
The error path already attempts to avoid acting on stale candidates using `generation`. The same logic needs to apply to the success path as well.

### Fix Focus Areas
- protocol/lavasession/reported_providers.go[126-154]
- protocol/lavasession/reported_providers.go[79-87]

### Expected fix approach
- Apply the same generation validation to the success path as to the error path.
- Preferably perform `if rp.generation == result.generation { remove }` under a single `Lock()` to avoid TOCTOU gaps (similar to the error-path fix).

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Remediation recommended

3. Sleep ordering in test 🐞 Bug ⛯ Reliability
Description
TestGetReportedProviders_ReconnectRace uses a fixed time.Sleep to wait for ReconnectProviders to
enter reconnectCB, which can be flaky under CI load or different scheduling. This can cause spurious
failures or make the test non-deterministic.
Code

protocol/lavasession/consumer_session_manager_test.go[R1636-1638]

+	// Give ReconnectProviders time to call ReconnectCandidates and enter reconnectCB.
+	time.Sleep(100 * time.Millisecond)
+
Evidence
The test relies on time.Sleep(100ms) to enforce an interleaving, rather than synchronizing on an
explicit signal that ReconnectCandidates() has been called and the callback is blocking. Fixed
sleeps are sensitive to machine speed and scheduler variance.

protocol/lavasession/consumer_session_manager_test.go[1628-1642]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
The test currently uses `time.Sleep(100 * time.Millisecond)` to wait for the goroutine to reach a specific point, which is inherently timing-dependent.

### Issue Context
You already have a blocking `reconnectCB`; add a second channel/handshake to deterministically signal that the callback has been entered (or that `ReconnectCandidates()` has completed) before performing the epoch transition.

### Fix Focus Areas
- protocol/lavasession/consumer_session_manager_test.go[1610-1642]

### Expected fix approach
- Add a `started := make(chan struct{})`.
- In `reconnectCB`, `close(started)` immediately before blocking on `epochTransitioned`.
- In the test, replace `time.Sleep(...)` with `<-started` (optionally with a timeout).

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

ⓘ The new review experience is currently in Beta. Learn more

Grey Divider

Qodo Logo

Comment thread protocol/lavasession/reported_providers.go Outdated
Comment thread protocol/lavasession/reported_providers.go Outdated
@codecov

codecov Bot commented Mar 26, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 84.00000% with 4 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
protocol/lavasession/consumer_session_manager.go 0.00% 1 Missing and 1 partial ⚠️
protocol/lavasession/reported_providers.go 91.30% 1 Missing and 1 partial ⚠️
Flag Coverage Δ
consensus 8.71% <ø> (ø)
protocol 33.96% <84.00%> (+0.05%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
protocol/lavasession/consumer_session_manager.go 61.99% <0.00%> (+1.50%) ⬆️
protocol/lavasession/reported_providers.go 80.41% <91.30%> (-3.74%) ⬇️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@NadavLevi NadavLevi requested a review from nimrod-teich March 26, 2026 15:42
@github-actions

github-actions Bot commented Mar 26, 2026

Copy link
Copy Markdown

Test Results

0 tests  ±0   0 ✅ ±0   0s ⏱️ ±0s
0 suites ±0   0 💤 ±0 
7 files   ±0   0 ❌ ±0 

Results for commit a7bd903. ± Comparison against base commit 4197251.

♻️ This comment has been updated with latest results.

@NadavLevi NadavLevi force-pushed the fix/reconnect-stale-reported-providers branch from 6057301 to ed7d6d8 Compare March 26, 2026 15:57
Comment thread protocol/lavasession/reported_providers.go
@NadavLevi NadavLevi force-pushed the fix/reconnect-stale-reported-providers branch from ed7d6d8 to 56cd818 Compare March 26, 2026 17:35
@NadavLevi NadavLevi requested a review from avitenzer March 26, 2026 17:40
avitenzer
avitenzer previously approved these changes Mar 26, 2026
…eported providers

ReconnectProviders could re-add a provider from a previous epoch after
Reset() cleared the reported list during an epoch transition. Add a
generation counter to ReportedProviders so stale candidates are discarded.

Both success and error paths in ReconnectProviders now verify the
generation under the same lock that performs the mutation, closing the
TOCTOU gap.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@NadavLevi NadavLevi force-pushed the fix/reconnect-stale-reported-providers branch from 56cd818 to a7bd903 Compare March 26, 2026 17:56
@nimrod-teich nimrod-teich merged commit bd37dbf into main Mar 26, 2026
32 checks passed
@nimrod-teich nimrod-teich deleted the fix/reconnect-stale-reported-providers branch March 26, 2026 18:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants