Skip to content

fix(session): prevent reported-provider/pairing race in GetReportedPr…#2249

Merged
nimrod-teich merged 1 commit into
mainfrom
fix/reported-providers-race
Mar 23, 2026
Merged

fix(session): prevent reported-provider/pairing race in GetReportedPr…#2249
nimrod-teich merged 1 commit into
mainfrom
fix/reported-providers-race

Conversation

@NadavLevi

@NadavLevi NadavLevi commented Mar 22, 2026

Copy link
Copy Markdown
Contributor

User description

…oviders

Move csm.lock.RLock acquisition to before reading reportedProviders so that UpdateAllProviders cannot reset reportedProviders and rebuild pairing between the epoch check and the pairing lookup. This eliminates the "Failed to find a reported provider in pairing list" error seen at epoch boundaries.

Description

Closes: #XXXX


Author Checklist

All items are required. Please add a note to the item if the item is not applicable and
please add links to any relevant follow up issues.

I have...

  • read the contribution guide
  • included the correct type prefix in the PR title, you can find examples of the prefixes below:
  • confirmed ! in the type prefix if API or client breaking change
  • targeted the main branch
  • provided a link to the relevant issue or specification
  • reviewed "Files changed" and left comments if necessary
  • included the necessary unit and integration tests
  • updated the relevant documentation or specification, including comments for documenting Go code
  • confirmed all CI checks have passed

Reviewers Checklist

All items are required. Please add a note if the item is not applicable and please add
your handle next to the items reviewed if you only reviewed selected items.

I have...

  • confirmed the correct type prefix in the PR title
  • confirmed all author checklist items have been addressed
  • reviewed state machine logic, API design and naming, documentation is accurate, tests and test coverage

Generated description

Below is a concise technical summary of the changes proposed in this PR:
Ensure ConsumerSessionManager obtains csm.lock.RLock before accessing reported providers and epoch so GetReportedProviders never sees a snapshot that is inconsistent with the rebuilt pairing state. Add a regression test that reproduces the epoch-transition race between GetReportedProviders and UpdateAllProviders to confirm the locking fix.

TopicDetails
Reported Locks Guard GetReportedProviders by taking csm.lock.RLock before reading reported providers so pairing lookups never race with UpdateAllProviders resetting the provider list.
Modified files (1)
  • protocol/lavasession/consumer_session_manager.go
Latest Contributors(2)
UserCommitDate
NadavLevifeat-metrics-implement...March 19, 2026
anna@magmadevs.comfeat-provideroptimizer...February 22, 2026
Epoch race test Verify the epoch-boundary race by holding the write lock while GetReportedProviders runs, ensuring the fix returns an empty result instead of stale reported providers.
Modified files (1)
  • protocol/lavasession/consumer_session_manager_test.go
Latest Contributors(2)
UserCommitDate
NadavLevifeat-smart-router-impl...March 11, 2026
anna@magmadevs.comfeat-provideroptimizer...February 22, 2026
This pull request is reviewed by Baz. Review like a pro on (Baz).

@qodo-code-review

Copy link
Copy Markdown

Review Summary by Qodo

Prevent reported-provider/pairing race in GetReportedProviders

🐞 Bug fix

Grey Divider

Walkthroughs

Description
• Move lock acquisition before reportedProviders read in GetReportedProviders
• Prevents race condition between GetReportedProviders and UpdateAllProviders
• Eliminates "Failed to find reported provider" error at epoch boundaries
• Add comprehensive test reproducing and validating the race condition fix
Diagram
flowchart LR
  A["GetReportedProviders<br/>reads reportedProviders"] -->|without lock| B["UpdateAllProviders<br/>resets reportedProviders"]
  B -->|rebuilds pairing| C["Lookup fails<br/>stale data"]
  D["Fix: acquire lock<br/>before reading"] -->|atomic operation| E["Consistent snapshot<br/>no race"]
Loading

Grey Divider

File Changes

1. protocol/lavasession/consumer_session_manager.go 🐞 Bug fix +5/-2

Move lock acquisition before reportedProviders read

• Moved csm.lock.RLock() acquisition to before reading reportedProviders
• Added explanatory comment describing the race condition and fix
• Ensures epoch check and pairing lookup are atomic under the same lock

protocol/lavasession/consumer_session_manager.go


2. protocol/lavasession/consumer_session_manager_test.go 🧪 Tests +76/-0

Add test for reported-provider race condition

• Added TestGetReportedProviders_EpochTransitionRace test function
• Reproduces the race condition by holding write lock while GetReportedProviders runs
• Validates that fix prevents stale data from being returned
• Includes detailed comments explaining the race window and expected behavior

protocol/lavasession/consumer_session_manager_test.go


Grey Divider

Qodo Logo

@qodo-code-review

qodo-code-review Bot commented Mar 22, 2026

Copy link
Copy Markdown

Code Review by Qodo

🐞 Bugs (2) 📘 Rule violations (0) 📎 Requirement gaps (0) 📐 Spec deviations (0)

Grey Divider


Action required

1. Sleep-based race test flaky 🐞 Bug ⛯ Reliability
Description
TestGetReportedProviders_EpochTransitionRace uses time.Sleep to orchestrate a concurrent
interleaving, so it can pass even if the pre-fix race still exists (the goroutine may not run until
after Reset+Unlock). This makes the regression test nondeterministic and allows false negatives on
slow/loaded CI.
Code

protocol/lavasession/consumer_session_manager_test.go[R1547-1549]

+	// Give the goroutine enough time to reach the RLock blocking point.
+	time.Sleep(100 * time.Millisecond)
+
Evidence
The test’s expected outcome is require.Empty(t, result). If the goroutine running
GetReportedProviders doesn’t execute before csm.reportedProviders.Reset() (because scheduling
delays exceed the fixed 100ms sleep), then even the old buggy implementation would read an
already-reset (empty) reportedProviders list and return empty, passing the test without validating
the fix.

protocol/lavasession/consumer_session_manager_test.go[1537-1568]
protocol/lavasession/consumer_session_manager.go[1773-1796]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`TestGetReportedProviders_EpochTransitionRace` uses `time.Sleep` to guess when the goroutine has reached the lock boundary. This can produce false negatives (test passes even when the bug exists) due to scheduler/CI variability.

### Issue Context
The test’s correctness depends on forcing a very specific interleaving. `time.Sleep` is not a synchronization primitive and cannot guarantee the goroutine reached the intended point.

### Fix Focus Areas
- Replace the `time.Sleep(...)` with deterministic coordination (e.g., a channel/WaitGroup barrier) so the test *guarantees* the goroutine is blocked on `csm.lock.RLock()` (or at least has progressed past the intended pre-lock step in the old implementation).
- protocol/lavasession/consumer_session_manager_test.go[1534-1568]

### Suggested approach (one option)
- Add an instrumented helper or hook (only in tests) that allows signaling right before/after the lock acquisition in `GetReportedProviders`, or
- Use `runtime.Gosched()` loops + a channel barrier that the goroutine closes after it reaches a known point (requires adding such a point), and only then perform the Reset/unlock sequence.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Remediation recommended

2. Test misses pairing rebuild 🐞 Bug ✓ Correctness
Description
The new test claims to reproduce the "Failed to find a reported provider in pairing list" race, but
it never rebuilds pairing to remove the reported provider, so it cannot hit the not-found path that
emits that error. As written, it only validates stale snapshot behavior after
reportedProviders.Reset().
Code

protocol/lavasession/consumer_session_manager_test.go[R1550-1554]

+	// Reset reportedProviders while holding the lock — this is what
+	// UpdateAllProviders does (along with rebuilding pairing).
+	// We only reset reportedProviders; pairing still contains A.
+	csm.reportedProviders.Reset()
+	csm.lock.Unlock()
Evidence
The production error is logged only when csm.pairing[reportedProvider.Address] lookup fails in
GetReportedProviders. The test explicitly keeps pairing intact (comment: "pairing still contains
A") and only resets reportedProviders, so the !ok branch is not exercised and the test does not
actually validate the reported production failure mode.

protocol/lavasession/consumer_session_manager.go[1782-1790]
protocol/lavasession/consumer_session_manager_test.go[1550-1568]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
The test claims to reproduce a pairing mismatch (“reported provider not found in pairing list”), but it never removes the reported provider from `csm.pairing`, so it cannot exercise the real failure mode.

### Issue Context
The logged error occurs when the reported provider address is present in the reportedProviders snapshot but missing from `csm.pairing`.

### Fix Focus Areas
- Update the test to actually rebuild `csm.pairing` under `csm.lock.Lock()` to exclude the reported provider (mirroring `UpdateAllProviders` behavior) and assert the function does not emit/trigger the not-found path (or at least assert behavior consistent with the intended fix).
- Alternatively, adjust the test name/comment to match what it really tests (stale reportedProviders snapshot only).
- protocol/lavasession/consumer_session_manager_test.go[1495-1569]
- protocol/lavasession/consumer_session_manager.go[1773-1796]

### Suggested test change
- While holding `csm.lock.Lock()`, perform both:
 - `csm.reportedProviders.Reset()`
 - mutate `csm.pairing` to remove the reported address (or swap in a new pairing map without it)
- Then release lock and assert `GetReportedProviders` returns a consistent result and does not go through the missing-pairing branch.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

ⓘ The new review experience is currently in Beta. Learn more

Grey Divider

Qodo Logo

…oviders

Move csm.lock.RLock acquisition to before reading reportedProviders so that
UpdateAllProviders cannot reset reportedProviders and rebuild pairing between
the epoch check and the pairing lookup. This eliminates the "Failed to find
a reported provider in pairing list" error seen at epoch boundaries.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@NadavLevi NadavLevi force-pushed the fix/reported-providers-race branch from 8d481d0 to 1e1ec42 Compare March 22, 2026 21:48
Comment thread protocol/lavasession/consumer_session_manager.go
@codecov

codecov Bot commented Mar 22, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.

Flag Coverage Δ
consensus 8.71% <ø> (ø)
protocol 33.21% <100.00%> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
protocol/lavasession/consumer_session_manager.go 61.02% <100.00%> (ø)
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Comment thread protocol/lavasession/consumer_session_manager_test.go
@github-actions

Copy link
Copy Markdown

Test Results

0 tests  ±0   0 ✅ ±0   0s ⏱️ ±0s
0 suites ±0   0 💤 ±0 
7 files   ±0   0 ❌ ±0 

Results for commit 1e1ec42. ± Comparison against base commit 50e9a40.

@nimrod-teich nimrod-teich merged commit 2abf3db into main Mar 23, 2026
32 checks passed
@nimrod-teich nimrod-teich deleted the fix/reported-providers-race branch March 23, 2026 11:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants