
feat(relayer) #1

Merged
gilxgil merged 17 commits into main from relayer
Mar 17, 2022

@gilxgil gilxgil commented Mar 15, 2022

Collaborator

WIP relayer feature

@gilxgil gilxgil merged commit 9fbcf3d into main Mar 17, 2022
@gilxgil gilxgil deleted the relayer branch March 17, 2022 01:13
Yaroms pushed a commit that referenced this pull request Aug 1, 2023
nimrod-teich added a commit that referenced this pull request Nov 26, 2025
## Additional Memory Leaks Found

After the initial fix (batch[].Result = nil), profiling showed the provider
still OOM'd at 115.76MB. Analysis revealed THREE additional places where
30MB Solana batch responses were being held in memory:

### 1. replyMsgs Array After Marshaling (jsonRPC.go)
**Problem**: After `json.Marshal(replyMsgs)`, the replyMsgs[] array continued
to hold all JsonrpcMessage.Result fields (30MB × 3 = 90MB) until function return.

**Fix**: Set `replyMsgs[idx].Result = nil` immediately after marshaling.
The marshaled data is in `retData`, so the RawMessage copies are no longer needed.

### 2. Cache Deep Copy (rpcprovider_server.go)
**Problem**: `protocopy.DeepCopyProtoObject()` created a FULL 30MB deep copy
of reply.Data in a goroutine for caching. This copy stayed in memory until
cache.SetEntry() completed (~100ms+).

**Fix**: Share the Data field reference instead of deep copying it. reply.Data
is never modified after creation (Go does not enforce this for a []byte; it
holds by convention), so sharing the reference is safe. Only the metadata
slice is copied.

**Impact**: Eliminates 30MB × N concurrent cached requests allocation.
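The manual struct copy could look roughly like the sketch below. The types `relayReply` and `metadata` and the function `copyForCache` are hypothetical stand-ins, not the protobuf types or code in `rpcprovider_server.go`. The copy shares the `Data` backing array and duplicates only the small metadata slice.

```go
package main

import "fmt"

// metadata and relayReply are hypothetical stand-ins for the real reply types.
type metadata struct {
	Name  string
	Value string
}

type relayReply struct {
	Data     []byte
	Metadata []metadata
}

// copyForCache returns a new reply that shares Data's backing array with the
// original and owns an independent copy of Metadata.
func copyForCache(reply *relayReply) *relayReply {
	md := make([]metadata, len(reply.Metadata))
	copy(md, reply.Metadata)
	return &relayReply{Data: reply.Data, Metadata: md}
}

func main() {
	reply := &relayReply{
		Data:     []byte("large payload"),
		Metadata: []metadata{{Name: "header", Value: "1"}},
	}
	cached := copyForCache(reply)
	reply.Metadata[0].Value = "changed"

	// The payload bytes are shared; the metadata is not.
	fmt.Println(&cached.Data[0] == &reply.Data[0])
	fmt.Println(cached.Metadata[0].Value)
}
```

The saving depends on every holder treating `Data` as read-only; a write through either reference would be visible through both.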

### 3. ConvertBatchElement Dereference
**Problem**: `result = *resultRef` in ConvertBatchElement creates a copy
of the RawMessage by dereferencing the pointer. This happens BEFORE our
`batch[idx].Result = nil` fix, so that fix was too late.

**Status**: Cannot be fixed without breaking the API. However, fixes #1 and #2 above compensate.

## Memory Reduction Summary

**Per 3-block batch request:**
- Old allocations:
  1. HTTP buffer: 30MB
  2. JsonrpcMessage.Result: 30MB
  3. batch[].Result: 30MB ← FIX 1 (previous commit)
  4. replyMsgs[].Result: 90MB ← FIX 2 (this commit, section 1 above)
  5. retData: 30MB (needed for response)
  6. Cache deep copy: 30MB ← FIX 3 (this commit, section 2 above)

- Total before all fixes: 240MB per batch
- Total after all fixes: ~90MB per batch (HTTP + retData + conversion overhead)
- **Reduction: ~62% memory savings**

**For 8 concurrent batch requests:**
- Before: 8 × 240MB = 1.92GB → OOM
- After: 8 × 90MB = 720MB, plus ~30MB baseline ≈ 750MB → should not OOM
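The arithmetic in this summary can be rechecked with a short program. The figures are the ones listed in the commit message; the function names are illustrative.

```go
package main

import "fmt"

// sumMB adds up a list of allocation sizes given in MB.
func sumMB(sizes []int) int {
	total := 0
	for _, s := range sizes {
		total += s
	}
	return total
}

// reductionPercent returns the percentage saved going from before to after.
func reductionPercent(before, after int) float64 {
	return 100 * float64(before-after) / float64(before)
}

func main() {
	// The six per-batch allocations from the list, in MB.
	before := sumMB([]int{30, 30, 30, 90, 30, 30})
	const afterMB = 90    // stated per-batch total after all fixes
	const concurrent = 8  // concurrent batch requests
	const baselineMB = 30 // stated process baseline

	fmt.Println("per batch before (MB):", before)
	fmt.Println("reduction (%):", reductionPercent(before, afterMB))
	fmt.Println("concurrent before (MB):", concurrent*before)
	fmt.Println("concurrent after (MB):", concurrent*afterMB+baselineMB)
}
```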

## Testing

- Builds successfully
- No behavioral changes (Data reference sharing is safe)
- Real-world test pending

## Files Changed

- `protocol/chainlib/jsonRPC.go`:
  - Added `replyMsgs[idx].Result = nil` after marshaling
  - Expanded comments explaining all 6 memory copies

- `protocol/rpcprovider/rpcprovider_server.go`:
  - Replaced `protocopy.DeepCopyProtoObject()` with manual struct copy
  - Share Data field reference instead of deep copying
  - Only deep copy the Metadata slice (small, mutable)
  - Removed unused protocopy import
NadavLevi added a commit that referenced this pull request Mar 16, 2026
Addresses all issues raised in code review (issues #1–9, #11–19):

Correctness:
- Add apiInterface label to endpointSelectionScore gauge and propagate it
  through SetProviderSelected interface and all call sites (#4)
- Align SetRelayNodeErrorMetric parameter order to (chainId, apiInterface,
  providerAddress, method) — was (providerAddress, chainId, apiInterface) (#7)
- Scope metric setters outside the lock in RegisterEndpoint to avoid
  re-entrant RWMutex deadlock (#11)
- Record cache latency histogram only on cache hits, not misses (#12)
- Remove dead field MeasureAfterProviderProcessingTime from RelayMetrics
  and its sole write site (#14)
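The lock-scoping item (#11) follows a common Go pattern, sketched here under assumptions: the `registry` type and its methods are hypothetical and do not reproduce the real `RegisterEndpoint`. `sync.RWMutex` is not re-entrant, so a setter that takes the read lock must not be called while the write lock is still held.

```go
package main

import (
	"fmt"
	"sync"
)

// registry is a hypothetical endpoint registry guarded by an RWMutex.
type registry struct {
	mu        sync.RWMutex
	endpoints map[string]bool
}

// count takes the read lock. Calling it while the same goroutine holds
// mu.Lock would block forever, because RWMutex is not re-entrant.
func (r *registry) count() int {
	r.mu.RLock()
	defer r.mu.RUnlock()
	return len(r.endpoints)
}

// setEndpointGauge stands in for a metric setter that reads registry state.
func (r *registry) setEndpointGauge() int {
	return r.count()
}

// RegisterEndpoint mutates state under the write lock and releases it
// before invoking the metric setter.
func (r *registry) RegisterEndpoint(url string) int {
	r.mu.Lock()
	r.endpoints[url] = true
	r.mu.Unlock()

	return r.setEndpointGauge()
}

func main() {
	r := &registry{endpoints: map[string]bool{}}
	fmt.Println(r.RegisterEndpoint("https://node-a.example"))
	fmt.Println(r.RegisterEndpoint("https://node-b.example"))
}
```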

Dead code:
- Remove dead stubs SetRelaySentToProviderMetric and SetRequestPerProvider
  from SmartRouterMetricsManager (#5)
- Delete empty consumer_metrics_manager_test.go (#18)

Simplification:
- Replace bespoke registerMetric closure with generic registerOrReuse
  helper throughout ConsumerMetricsManager (#6)
- Unify latency histogram buckets via shared LatencyBuckets variable in
  buckets.go — removes per-file duplicates (#8)
- Drop unnecessary goroutine spawns from RPCConsumerLogs metric forwarders;
  callers already run in goroutines where needed (#9)
- Replace RequestProperties struct with *RelayMetrics in RecordDirectRelayEnd,
  eliminating a duplicate of the same fields (#15)
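A generic register-or-reuse helper (#6) might have the following shape. This is a sketch against a toy in-memory registry, not the Prometheus client API and not the helper from this commit; `registry`, `counter`, and `gauge` are hypothetical, and the code needs Go 1.18 or later for generics.

```go
package main

import "fmt"

// registry is a toy in-memory registry keyed by metric name.
type registry struct {
	byName map[string]any
}

// counter and gauge are placeholder metric types.
type counter struct{ n int }
type gauge struct{ v float64 }

// registerOrReuse stores c under name on first use. On later calls it returns
// the collector already stored, or an error if that collector has another type.
func registerOrReuse[T any](r *registry, name string, c T) (T, error) {
	if existing, found := r.byName[name]; found {
		typed, ok := existing.(T)
		if !ok {
			var zero T
			return zero, fmt.Errorf("metric %q already registered with a different type", name)
		}
		return typed, nil
	}
	r.byName[name] = c
	return c, nil
}

func main() {
	r := &registry{byName: map[string]any{}}

	first, _ := registerOrReuse(r, "relay_latency", &counter{})
	second, _ := registerOrReuse(r, "relay_latency", &counter{})
	fmt.Println(first == second) // the second call reuses the first collector

	_, err := registerOrReuse(r, "relay_latency", &gauge{})
	fmt.Println(err != nil) // a type mismatch is reported as an error
}
```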

Documentation:
- Document the request-group counting invariants in SetRelayMetrics (#13)
- Add comment explaining why UpdateHealthcheckStatusBreakdown is a no-op
  in SmartRouterMetricsManager (#16)
- Fix struct comment to list all four labels including apiInterface (#17)

Tests:
- Scope gomock.Controller to each subtest in TestIsArchiveRequest,
  TestIsDebugOrTraceRequest, and TestIsBatchRequest (#19)
- Add TestConsumerSetRelayMetrics_PartitionInvariant asserting
  batch+read+write==total and success+failed==total (#13)
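The partition invariant (#13) can be illustrated with a small counter type. `relayCounts`, `group`, and `checkPartition` are hypothetical names for this sketch; the real `SetRelayMetrics` and its test are not shown in the commit message.

```go
package main

import "fmt"

// group is the request group each relay falls into; exactly one applies.
type group int

const (
	groupBatch group = iota
	groupRead
	groupWrite
)

// relayCounts holds the counters the invariant is stated over.
type relayCounts struct {
	Total, Batch, Read, Write, Success, Failed int
}

// record counts one relay in exactly one group and exactly one outcome.
func (c *relayCounts) record(g group, ok bool) {
	c.Total++
	switch g {
	case groupBatch:
		c.Batch++
	case groupRead:
		c.Read++
	default:
		c.Write++
	}
	if ok {
		c.Success++
	} else {
		c.Failed++
	}
}

// checkPartition verifies batch+read+write==total and success+failed==total.
func checkPartition(c relayCounts) error {
	if c.Batch+c.Read+c.Write != c.Total {
		return fmt.Errorf("group counts %d do not sum to total %d", c.Batch+c.Read+c.Write, c.Total)
	}
	if c.Success+c.Failed != c.Total {
		return fmt.Errorf("outcome counts %d do not sum to total %d", c.Success+c.Failed, c.Total)
	}
	return nil
}

func main() {
	var c relayCounts
	c.record(groupBatch, true)
	c.record(groupRead, false)
	c.record(groupWrite, true)
	fmt.Println(checkPartition(c) == nil)
}
```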

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
avitenzer added a commit that referenced this pull request Apr 1, 2026
Re-evaluate all 6 shortfalls against current branch state:
- #1 (standalone binary): RESOLVED
- #2 (test suite): RESOLVED
- #3 (lavasession test): pre-existing flaky test, not a regression
- #4 (dependency cleanup): cosmos-sdk removed, cosmossdk.io utilities remain
- #5 (Dockerfile): still has stale ldflags, needs update
- #6 (branch hygiene): team workflow, not code

Added action plan with must-do, should-do, and deferred items.


1 participant