Skip to content

fix: fixing initial connection failure to cache loop#2235

Merged
nimrod-teich merged 2 commits into
mainfrom
cache-connection-loop-fix
Mar 19, 2026
Merged

fix: fixing initial connection failure to cache loop#2235
nimrod-teich merged 2 commits into
mainfrom
cache-connection-loop-fix

Conversation

@Tomelia1999

@Tomelia1999 Tomelia1999 commented Mar 15, 2026

Copy link
Copy Markdown
Contributor

User description

Description

Closes: #XXXX


Author Checklist

All items are required. Please add a note to the item if the item is not applicable and
please add links to any relevant follow up issues.

I have...

  • read the contribution guide
  • included the correct type prefix in the PR title, you can find examples of the prefixes below:
  • confirmed ! in the type prefix if API or client breaking change
  • targeted the main branch
  • provided a link to the relevant issue or specification
  • reviewed "Files changed" and left comments if necessary
  • included the necessary unit and integration tests
  • updated the relevant documentation or specification, including comments for documenting Go code
  • confirmed all CI checks have passed

Reviewers Checklist

All items are required. Please add a note if the item is not applicable and please add
your handle next to the items reviewed if you only reviewed selected items.

I have...

  • confirmed the correct type prefix in the PR title
  • confirmed all author checklist items have been addressed
  • reviewed state machine logic, API design and naming, documentation is accurate, tests and test coverage

Generated description

Below is a concise technical summary of the changes proposed in this PR:
Fix the cache client lifecycle so relayerCacheClientStore retains the gRPC connection, closes old *grpc.ClientConn instances, and surfaces detailed logging around connection attempts to prevent goroutine leaks during initial startup failures. Ensure Cache requests trigger resetOnConnectionError and that the reconnection loop logs its progress and exits promptly once codes.Unavailable failures are resolved.

TopicDetails
Reconnect logic Refine reconnection flow by logging loop lifecycle, exiting when connectClient succeeds, and resetting the cached client on codes.Unavailable errors surfaced in GetEntry/SetEntry.
Modified files (1)
  • protocol/performance/cache.go
Latest Contributors(2)
UserCommitDate
yaroms@lavanet.xyzchange-depsFebruary 02, 2025
shleikesfeat-PRT-Implement-cac...October 30, 2024
Cache connection Handle cache connection by storing *grpc.ClientConn, closing the previous connection before replacement, and adjusting logs to reflect connection states.
Modified files (1)
  • protocol/performance/cache.go
Latest Contributors(2)
UserCommitDate
yaroms@lavanet.xyzchange-depsFebruary 02, 2025
shleikesfeat-PRT-Implement-cac...October 30, 2024
This pull request is reviewed by Baz. Review like a pro on (Baz).

@qodo-code-review

Copy link
Copy Markdown

Review Summary by Qodo

Fix cache connection loop and goroutine leak issues

🐞 Bug fix

Grey Divider

Walkthroughs

Description
• Store gRPC connection to properly close on reconnect, preventing goroutine leaks
• Fix reconnection loop logic that was inverted, causing immediate exit on success
• Add connection error detection to trigger reconnection on unavailable status
• Improve logging and add gRPC imports for connection management
Diagram
flowchart LR
  A["Cache Client Store"] -->|"stores connection"| B["gRPC ClientConn"]
  A -->|"detects errors"| C["resetOnConnectionError"]
  C -->|"clears client"| A
  C -->|"triggers"| D["reconnectClient loop"]
  D -->|"on success"| E["closes old conn & stores new"]
  E -->|"exits loop"| F["Ready for requests"]
Loading

Grey Divider

File Changes

1. protocol/performance/cache.go 🐞 Bug fix +48/-8

Fix cache reconnection loop and connection leak handling

• Added conn field to store gRPC connection for proper cleanup on reconnect
• Imported gRPC packages for connection and error code handling
• Modified connectGRPCConnectionToRelayerCacheService() to return connection alongside client
• Fixed inverted reconnection loop logic (was returning on error, now returns on success)
• Added resetOnConnectionError() method to detect unavailable errors and trigger reconnection
• Integrated error handling in GetEntry() and SetEntry() to call resetOnConnectionError()
• Enhanced logging with more descriptive messages for connection lifecycle events
• Added comments explaining goroutine leak prevention and reconnection logic

protocol/performance/cache.go


Grey Divider

Qodo Logo

@qodo-code-review

qodo-code-review Bot commented Mar 15, 2026

Copy link
Copy Markdown

Code Review by Qodo

🐞 Bugs (2) 📘 Rule violations (0) 📎 Requirement gaps (0) 📐 Spec deviations (0)

Grey Divider


Action required

1. r.conn.Close() error unchecked📘 Rule violation ⛯ Reliability
Description
The new code calls r.conn.Close() without checking/handling its returned error, which commonly
triggers errcheck in golangci-lint. This can fail make lint and also hides close failures that
could matter during reconnect/shutdown.
Code

protocol/performance/cache.go[R77-80]

+			if r.conn != nil {
+				utils.LavaFormatDebug("closing previous cache gRPC connection before replacing", utils.LogAttr("address", r.address))
+				r.conn.Close()
+			}
Evidence
PR Compliance ID 5 requires zero golangci-lint warnings; grpc.ClientConn.Close() returns an
error, and the added code ignores it by calling r.conn.Close() with no check/logging.

AGENTS.md
protocol/performance/cache.go[77-80]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`r.conn.Close()` returns an error, but the new reconnect logic ignores it. This is likely to trigger `golangci-lint` (`errcheck`) and can hide close failures.
## Issue Context
The reconnect path closes a previous `*grpc.ClientConn` before replacing it.
## Fix Focus Areas
- protocol/performance/cache.go[77-80]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Remediation recommended

2. Status code mismatch 🐞 Bug ⛯ Reliability
Description
resetOnConnectionError() decodes error codes using google.golang.org/grpc/status.Code, while other
gRPC paths in this repo commonly use github.com/gogo/status.Code. Errors produced/handled via
gogo/status may not be recognized here, preventing codes.Unavailable from triggering a client reset
and reconnect.
Code

protocol/performance/cache.go[R129-132]

+	code := status.Code(err)
+	if code != codes.Unavailable {
+		return
+	}
Evidence
cache.resetOnConnectionError uses the google grpc/status package to extract codes, but core gRPC
error handling in lavasession (and other paths) uses github.com/gogo/status; mixing these can cause
code extraction to fail for gogo/status errors, so the Unavailable gate may never trip.

protocol/performance/cache.go[1-15]
protocol/performance/cache.go[123-132]
protocol/lavasession/common.go[3-28]
protocol/lavasession/common.go[59-62]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`resetOnConnectionError()` uses `google.golang.org/grpc/status.Code(err)` to decide whether to clear the cache client and trigger reconnection. Elsewhere in the repo, gRPC error-code extraction is typically done with `github.com/gogo/status`, so this mismatch can prevent recognizing `codes.Unavailable` for some error types.
### Issue Context
The cache client is built on gogoproto-generated gRPC stubs (cosmos/gogoproto), and other gRPC call paths in the repo extract codes via `github.com/gogo/status`. Aligning the status decoding avoids silent failure to trigger reconnection.
### Fix Focus Areas
- protocol/performance/cache.go[1-15]
- protocol/performance/cache.go[123-132]
- protocol/lavasession/common.go[3-28]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


3. Conn not closed on cancel 🐞 Bug ⛯ Reliability
Description
When r.ctx is canceled, reconnectClient() exits without closing the stored *grpc.ClientConn, and
Cache exposes no Close method to close it during shutdown. This can leave gRPC internal goroutines
running after shutdown and contradicts the new conn field’s stated shutdown intent.
Code

protocol/performance/cache.go[R109-111]

  	case <-r.ctx.Done():
+			utils.LavaFormatInfo("cache service reconnection loop exiting (context cancelled)", utils.LogAttr("address", r.address))
  		return
Evidence
The PR adds a stored *grpc.ClientConn with an explicit comment about closing it on shutdown, but the
only Close call is on successful reconnect replacement; the ctx.Done() shutdown path returns without
closing r.conn, and there is no Cache.Close() to do so.

protocol/performance/cache.go[17-24]
protocol/performance/cache.go[107-112]
protocol/performance/cache.go[140-191]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The cache client now stores a `*grpc.ClientConn` (comment says it can be closed on shutdown), but there is no shutdown implementation: on `ctx.Done()` the reconnect loop exits without closing the connection, and `Cache` exposes no `Close()` method.
### Issue Context
`*grpc.ClientConn` maintains internal goroutines; without `Close()` they can survive beyond the cache lifecycle in tests or in processes that continue running after cache context cancellation.
### Fix Focus Areas
- protocol/performance/cache.go[17-24]
- protocol/performance/cache.go[93-121]
- protocol/performance/cache.go[140-191]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

ⓘ The new review experience is currently in Beta. Learn more

Grey Divider

Qodo Logo

Comment thread protocol/performance/cache.go
@codecov

codecov Bot commented Mar 15, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 19.35484% with 25 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
protocol/performance/cache.go 19.35% 22 Missing and 3 partials ⚠️
Flag Coverage Δ
consensus 8.71% <ø> (ø)
protocol 33.55% <19.35%> (-0.06%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
protocol/performance/cache.go 50.00% <19.35%> (-15.72%) ⬇️

... and 8 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@github-actions

github-actions Bot commented Mar 15, 2026

Copy link
Copy Markdown

Test Results

0 tests  ±0   0 ✅ ±0   0s ⏱️ ±0s
0 suites ±0   0 💤 ±0 
7 files   ±0   0 ❌ ±0 

Results for commit a1928b4. ± Comparison against base commit 28c9192.

♻️ This comment has been updated with latest results.

@Tomelia1999 Tomelia1999 changed the title FIX: fixing initial connection failure to cache loop fix: fixing initial connection failure to cache loop Mar 17, 2026
Comment thread protocol/performance/cache.go Outdated
Comment thread protocol/performance/cache.go
@nimrod-teich nimrod-teich merged commit 6995d56 into main Mar 19, 2026
32 checks passed
@nimrod-teich nimrod-teich deleted the cache-connection-loop-fix branch March 19, 2026 11:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants