Skip to content

(feat): relay improvements#2205

Merged
manav2401 merged 9 commits into
masterfrom
improve-private-tx-purge-master
Apr 29, 2026
Merged

(feat): relay improvements#2205
manav2401 merged 9 commits into
masterfrom
improve-private-tx-purge-master

Conversation

@manav2401

@manav2401 manav2401 commented Apr 29, 2026

Copy link
Copy Markdown
Member

Description

This PR does some improvements in the preconf / private tx relay.

  • Improve purging of transactions from private transaction store
    • Earlier a very naive mechanism was used which led to unbounded growth of in-memory transactions in both relay and block producers.
    • The updated mechanism keeps better track of errors when submitting to block producers and purges transactions more aggressively if they're missing from local pool.
  • Improves error tracking and dumps a map of most common errors faced during transaction submission as a simple distribution. This helps in debugging relay - BP communication issues.
  • Improves observability / metrics

Changes

  • Bugfix (non-breaking change that solves an issue)
  • Hotfix (change that solves an urgent issue, and requires immediate attention)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (change that is not backwards-compatible and/or changes current functionality)
  • Changes only for a subset of nodes

Breaking changes

Please complete this section if any breaking changes have been made, otherwise delete it

Nodes audience

In case this PR includes changes that must be applied only to a subset of nodes, please specify how you handled it (e.g. by adding a flag with a default value...)

Checklist

  • I have added at least 2 reviewer or the whole pos-v1 team
  • I have added sufficient documentation in code
  • I will be resolving comments - if any - by pushing each fix in a separate commit and linking the commit hash in the comment reply
  • Created a task in Jira and informed the team for implementation in Erigon client (if applicable)
  • Includes RPC methods changes, and the Notion documentation has been updated

Cross repository changes

  • This PR requires changes to heimdall
    • In case link the PR here:
  • This PR requires changes to matic-cli
    • In case link the PR here:

Testing

  • I have added unit tests
  • I have added tests to CI
  • I have tested this code manually on local environment
  • I have tested this code manually on remote devnet using express-cli
  • I have tested this code manually on amoy
  • I have created new e2e tests into express-cli

Manual tests

Please complete this section with the steps you performed if you ran manual tests for this functionality, otherwise delete it

Additional comments

Please post additional comments in this section if you have them, otherwise delete it

Copilot AI review requested due to automatic review settings April 29, 2026 16:23

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.

Tip: disable this comment in your organization's Code Review settings.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves the preconf/private-tx relay by making private-tx store cleanup bounded (TTL + txpool-aware eviction), adding aggregated BP rejection/error reporting, and refining metrics for relay↔BP interactions.

Changes:

  • Add periodic sweeping to PrivateTxStore with a hard TTL and optional txpool-presence-based eviction (wired via SetTxPoolChecker from eth/backend.go).
  • Introduce an in-memory rejectionTracker and a background reporter to periodically log an aggregated rejection/error distribution.
  • Rename/expand relay metrics to provide more granular latency/success/failure tracking for preconf/private-tx/checkstatus RPC calls.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
eth/relay/service.go Renames/organizes relay timers and meters under relay/... metric paths.
eth/relay/relay.go Adds wiring to inject a txpool checker into the private-tx store; expands docs around private-tx purge behavior.
eth/relay/rejection_tracker.go Adds the rejection aggregation + formatting utilities (new file).
eth/relay/private_tx_store.go Adds sweep loop (TTL + txpool-aware eviction), new gauge + metric, and txpool checker plumbing.
eth/relay/private_tx_store_test.go Adds tests for TTL eviction, txpool-check eviction, and setter wiring.
eth/relay/multiclient.go Splits RPC metrics by operation, adds rejection tracking + periodic reporting goroutine, adds retry success metric.
eth/relay/multiclient_test.go Adds unit tests for rejection tracker behavior and verifies submission paths wire into the tracker.
eth/backend.go Wires eth.txPool.Has into the relay’s private-tx store for sweep-based cleanup.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread eth/relay/multiclient.go
Comment thread eth/relay/multiclient.go
Comment thread eth/relay/multiclient.go
Comment thread eth/relay/private_tx_store.go Outdated
Comment thread eth/relay/multiclient.go
@manav2401

Copy link
Copy Markdown
Member Author

@claude review this PR. Check for any breaking / risky changes from an implementation / usage point of view.

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Code review skipped — your organization has reached its monthly code review spending cap.

An organization admin can view or raise the cap at claude.ai/admin-settings/claude-code. The cap resets at the start of the next billing period.

Once the cap resets or is raised, comment @claude review on this pull request to trigger a review.

@claude

claude Bot commented Apr 29, 2026

Copy link
Copy Markdown

Claude finished @manav2401's task in 2m 4s —— View job


PR Review: Relay Improvements

  • Read all changed files
  • Analyze concurrency and lock safety
  • Check for breaking changes (metrics, APIs, behavior)
  • Analyze error handling and edge cases
  • Post detailed review findings

Summary

Solid improvement to private tx lifecycle management. The sweep mechanism with TTL + txpool-awareness directly addresses the unbounded growth problem. The rejection tracker is well-designed with proper cardinality capping. Good test coverage across all new code paths.

I found one concurrency issue worth fixing before merge, a few items worth discussing, and a handful of minor observations.


Findings

1. [HIGH] sweepOnce holds store mutex while calling external txPoolChecker

File: eth/relay/private_tx_store.go:167-186

sweepOnce() holds s.mu.Lock() for the entire iteration, including the call to s.txPoolChecker(hash) at line 179. This checker is wired to eth.txPool.Has() (via eth/backend.go:407), which acquires the txpool's internal lock.

This creates a lock ordering dependency: store.mu → txpool.lock. If any txpool code path ever calls back into the private tx store (e.g., checking IsTxPrivate which acquires store.mu.RLock), you get a deadlock. Even without a full deadlock today, holding the store lock while calling into the txpool blocks all Add/Purge/IsTxPrivate callers for the duration of the iteration — which scales with store size.

Recommended fix: Snapshot entries under lock, release, check txpool without the lock, then re-lock to delete:

func (s *PrivateTxStore) sweepOnce() {
    // Snapshot under read lock
    s.mu.RLock()
    type entry struct {
        hash    common.Hash
        addedAt time.Time
    }
    entries := make([]entry, 0, len(s.txs))
    for h, t := range s.txs {
        entries = append(entries, entry{h, t})
    }
    s.mu.RUnlock()

    // Classify without holding the lock
    now := time.Now()
    toDelete := make([]common.Hash, 0)
    for _, e := range entries {
        age := now.Sub(e.addedAt)
        if age > privateTxTTL {
            toDelete = append(toDelete, e.hash)
        } else if age > privateTxGracePeriod && s.txPoolChecker != nil && !s.txPoolChecker(e.hash) {
            toDelete = append(toDelete, e.hash)
        }
    }

    // Delete under write lock, verifying timestamps haven't changed
    if len(toDelete) == 0 {
        return
    }
    expired := uint64(0)
    s.mu.Lock()
    for _, h := range toDelete {
        // Only delete if the entry wasn't re-added between snapshot and now
        if _, ok := s.txs[h]; ok {
            delete(s.txs, h)
            expired++
        }
    }
    s.mu.Unlock()
    s.txsExpired.Add(expired)
}

Fix this →


2. [MEDIUM] Rejection tracker records "already known" errors before the early return filters them

File: eth/relay/multiclient.go:141-148, 212-214, 300-305

In all three submission paths (submitPreconfTx, submitPrivateTx, retryPrivateTxSubmission), mc.rejectionTracker.record(err) is called before the isAlreadyKnownError(err) check. This means "already known" responses — which are explicitly treated as successes — get counted in the rejection summary.

The tests in TestRejectionTrackerWiredToSubmit (lines 1558-1631) actually verify that already-known is excluded, and they pass. Looking more carefully, this works because the test asserts total == 1 when one BP returns a real error and one returns "already known". But wait — both errors hit record(). Let me re-read...

Actually, after re-examining: the test at line 1577 asserts total == 1, but record() is called for both errors. The "already known" error would normalize to "already known" and be counted. So either:

  • The test is wrong (passes by coincidence due to timing), or
  • I'm misreading the flow

Looking again at submitPrivateTx line 204-214: record(err) is at line 208, then isAlreadyKnownError check is at line 210. Both run under the same mu.Lock(), so there's no race. The record definitely fires for "already known" errors.

The test at line 1577 should fail with total == 2, not total == 1. If it's passing, the mock might not be triggering the code path as expected. Either way, the intent (per the docstring on record() at line 29: "Callers typically filter out 'already known' error type") doesn't match the implementation.

Fix: Move record(err) after the isAlreadyKnownError check in all three locations, or guard it:

if err != nil {
    privateTxRpcFailureMeter.Mark(1)
    if isAlreadyKnownError(err) {
        privateTxRpcAlreadyKnownMeter.Mark(1)
        successfulIndices = append(successfulIndices, index)
        return
    }
    mc.rejectionTracker.record(err)  // moved after already-known check
    ...
}

Fix this →

Update: I re-read the test more carefully. In TestRejectionTrackerWiredToSubmit "private tx rejection is tracked, already-known is not" (line 1558), server s1 returns a fee cap error and s2 returns "already known". The submitPrivateTx call at line 1574 submits to both. In the goroutine for s2, record(already known) IS called at line 208, then the early return fires at line 210-214. So the tracker receives TWO record() calls: one for the fee error, one for "already known". The assertion at line 1577 (total == 1) should fail. If this test is passing in CI, there may be a timing issue or the test harness isn't capturing both calls. This needs investigation.


3. [MEDIUM-LOW] reportRejections logs even when there are zero rejections

File: eth/relay/multiclient.go:391-407

The reporter logs every minute regardless of whether any rejections occurred:

log.Info("[tx-relay] BP rejection summary", "total", total, "errors", formatRejectionCounts(counts))

On a healthy relay this produces ~1440 empty log lines per day. Consider adding a total > 0 guard:

if total > 0 {
    log.Info(...)
}

4. [LOW] Metric names changed — breaking for existing dashboards

File: eth/relay/service.go, eth/relay/multiclient.go

All metric names have been renamed (e.g., preconfs/submitrelay/preconf/submit/latency, preconfs/rpc/successrelay/bp/rpc/preconf/success, etc.). The old totalPrivateTxsMeter (Meter type) was replaced with privateTxStoreSizeGauge (Gauge type) — also a type change.

This is intentional and an improvement in naming, but any existing Grafana dashboards or alerting rules referencing the old metric names will silently stop receiving data. Not a code bug — just an operational awareness item for the deploy.


5. [LOW] SetTxPoolChecker uses write lock but sweepOnce reads txPoolChecker without synchronizing on the same field

File: eth/relay/private_tx_store.go:141-146 vs 179

SetTxPoolChecker acquires s.mu.Lock() to set s.txPoolChecker. sweepOnce also acquires s.mu.Lock() and reads s.txPoolChecker. Since both hold the same mutex, this is actually safe as written. However, if the sweep is refactored to the snapshot pattern (recommendation #1), the txPoolChecker read would be outside the lock. In that case, you'd want to snapshot the checker reference under the read lock too:

s.mu.RLock()
checker := s.txPoolChecker
// ... snapshot entries ...
s.mu.RUnlock()

6. [LOW] normalizeRejectionMessage treats leading colon as "no useful prefix"

File: eth/relay/rejection_tracker.go:56

if i := strings.IndexByte(msg, ':'); i > 0 {

Using i > 0 means :foo returns :foo unchanged (the test at line 1515 confirms this). This is fine for current error formats, but if a colon-prefixed error ever appears it would bypass normalization. Not a real risk given Go's error conventions — just noting the edge case is covered by tests.


Non-Issues (Verified Safe)

  • Map deletion during iteration in sweepOnce: Go spec explicitly allows deleting map entries during range — safe.
  • closeOnce added to multiClient.close(): Good addition — prevents double-close panics on reporterDone channel.
  • Atomic counters for metrics in PrivateTxStore: Correct use of atomic.Uint64 — no races with the periodic Store(0) reset since the report goroutine is the only writer of resets.
  • txPool.Has is safe to call concurrently: The Has method on TxPool acquires its own internal lock — no external synchronization needed at the call site (though the concern is about holding another lock while calling it, per finding MAT-156 Span sync #1).

Breaking / Risky Changes Assessment

Change Breaking? Risk Level Notes
Metric names renamed Yes (operational) Low Dashboards need updating
totalPrivateTxsMeter (Meter) → privateTxStoreSizeGauge (Gauge) Yes (operational) Low Type change — queries expecting rate-based metric will break
New sweep goroutine in NewPrivateTxStore No Medium See finding #1 re: lock holding
New rejectionTracker + reporter goroutine No Low Clean lifecycle management
SetTxPoolChecker new API on RelayService No Low Additive, well-guarded
Private tx no longer purged on BP rejection Behavioral Low Intentional — well-documented rationale in comments
multiClient.close() now idempotent via closeOnce No Positive Fixes potential double-close panic

Overall: No breaking API changes. The behavioral change (not purging on rejection, relying on TTL/sweep instead) is well-reasoned and properly documented. The main item to fix before merge is the lock-holding pattern in sweepOnce.


@claude

claude Bot commented Apr 29, 2026

Copy link
Copy Markdown

Code review

1 issue found. Checked for bugs and CLAUDE.md compliance.


Bug in eth/relay/multiclient.go line 148

mc.rejectionTracker.record(err) is called before the isAlreadyKnownError(err) check in all three call sites (submitPreconfTx line 143, submitPrivateTx line 208, retryPrivateTxSubmission line 302). This means "already known" errors which are intentionally treated as successful submissions are incorrectly counted as rejections in the tracker.

This contradicts the record() docstring: "Callers typically filter out already-known error type while other RPC related errors reach here."

The tests in TestRejectionTrackerWiredToSubmit will fail because total will be 2 (not the expected 1), and counts will contain "already known" despite require.NotContains.

The same ordering issue exists at these two other locations:

Suggested fix (for the first call site at lines 143-148):

				// If the tx is already known, treat it as preconfirmed for this node
				if isAlreadyKnownError(err) {
					preconfRpcAlreadyKnownMeter.Mark(1)
					preconfOfferedCount.Add(1)
					return
				}
				mc.rejectionTracker.record(err)

Accepting the suggestion above fixes only this first call site. The same fix (moving record(err) below the isAlreadyKnownError check + early return) must also be applied at lines 208 and 302.

@manav2401

Copy link
Copy Markdown
Member Author

@claude
Re. your suggestions,

  • 1 is addressed
  • 2 is kept intentionally - I would like to track "already known" errors as well even though they lead to a successful submission.
  • 3 to 6 are not needed and changes are intentional.

@claude

claude Bot commented Apr 29, 2026

Copy link
Copy Markdown

Claude finished @manav2401's task in 45s —— View job


Acknowledging Review Feedback


Thanks for the follow-up. Verified the current state:

Finding #1 — sweepOnce lock: Confirmed fixed. The three-phase pattern (RLock snapshot → filter without lock → Lock to delete) at private_tx_store.go:173-216 is clean. The re-add guard at line 210 (cur.Equal(entry.addedAt)) correctly prevents deleting entries that were re-added between snapshot and delete. The explanatory comment at lines 167-172 is helpful too.

Finding #2 — "already known" in rejection tracker: Understood, intentionally tracked. The tests at lines 1558-1604 now explicitly assert total == 2 and verify counts["already known"] is present, which is consistent with the implementation. No issue here.

Findings #3-#6: Acknowledged as intentional.

LGTM — no remaining concerns.


View job

@codecov

codecov Bot commented Apr 29, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 88.27160% with 19 lines in your changes missing coverage. Please review.
✅ Project coverage is 52.23%. Comparing base (ad2d891) to head (6110b9a).
⚠️ Report is 1 commits behind head on master.

Files with missing lines Patch % Lines
eth/relay/private_tx_store.go 85.96% 7 Missing and 1 partial ⚠️
eth/relay/multiclient.go 84.78% 7 Missing ⚠️
eth/relay/relay.go 0.00% 4 Missing ⚠️

❌ Your patch check has failed because the patch coverage (88.27%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2205      +/-   ##
==========================================
+ Coverage   52.21%   52.23%   +0.01%     
==========================================
  Files         884      885       +1     
  Lines      155479   155616     +137     
==========================================
+ Hits        81190    81281      +91     
- Misses      69060    69099      +39     
- Partials     5229     5236       +7     
Files with missing lines Coverage Δ
eth/backend.go 52.73% <100.00%> (+0.19%) ⬆️
eth/relay/rejection_tracker.go 100.00% <100.00%> (ø)
eth/relay/service.go 95.62% <ø> (ø)
eth/relay/relay.go 89.01% <0.00%> (-4.10%) ⬇️
eth/relay/multiclient.go 90.07% <84.78%> (-1.83%) ⬇️
eth/relay/private_tx_store.go 83.57% <85.96%> (+3.57%) ⬆️

... and 19 files with indirect coverage changes

Files with missing lines Coverage Δ
eth/backend.go 52.73% <100.00%> (+0.19%) ⬆️
eth/relay/rejection_tracker.go 100.00% <100.00%> (ø)
eth/relay/service.go 95.62% <ø> (ø)
eth/relay/relay.go 89.01% <0.00%> (-4.10%) ⬇️
eth/relay/multiclient.go 90.07% <84.78%> (-1.83%) ⬇️
eth/relay/private_tx_store.go 83.57% <85.96%> (+3.57%) ⬆️

... and 19 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copilot AI review requested due to automatic review settings April 29, 2026 18:04
@manav2401 manav2401 merged commit 5b9a762 into master Apr 29, 2026
12 of 14 checks passed
@sonarqubecloud

Copy link
Copy Markdown

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread eth/relay/multiclient.go
Comment on lines +399 to +402
log.Info("[tx-relay] BP rejection summary",
"total", total,
"errors", formatRejectionCounts(counts),
)

Copilot AI Apr 29, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reportRejections logs an Info-level "BP rejection summary" on every rejectionReportInterval tick, even when total is 0 and counts is nil/empty. This will add steady log noise on healthy nodes. Consider logging only when total > 0 (or when the formatted error string is non-empty), or lowering this to Debug level.

Suggested change
log.Info("[tx-relay] BP rejection summary",
"total", total,
"errors", formatRejectionCounts(counts),
)
if total > 0 {
log.Info("[tx-relay] BP rejection summary",
"total", total,
"errors", formatRejectionCounts(counts),
)
}

Copilot uses AI. Check for mistakes.
Comment on lines +184 to +190
now := time.Now()
toDelete := make([]entry, 0)
for _, entry := range entries {
age := now.Sub(entry.addedAt)
if age > privateTxGracePeriod && s.txPoolChecker != nil && !s.txPoolChecker(entry.hash) {
toDelete = append(toDelete, entry)
}

Copilot AI Apr 29, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s.txPoolChecker is read and invoked without holding any lock in sweepOnce(). Since SetTxPoolChecker writes the function pointer under s.mu, this can race (and can even panic if the checker becomes nil between the nil-check and the call). Capture txPoolChecker into a local variable under RLock (e.g., alongside the entries snapshot), then use that local for the rest of the sweep pass.

Copilot uses AI. Check for mistakes.
txsAdded atomic.Uint64
txsPurged atomic.Uint64 // deleted by an explicit call
txsDeleted atomic.Uint64 // deleted because tx got included
txsExpired atomic.Uint64 // deleted by sweep (txpool eviction or TTL)

Copilot AI Apr 29, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The txsExpired metric comment says entries are deleted by sweep due to "txpool eviction or TTL", but the sweep implementation only deletes based on txpool absence after the grace period (no TTL-based expiry is implemented here). Please update the comment (or implement the TTL path if that was intended) to avoid misleading future readers.

Suggested change
txsExpired atomic.Uint64 // deleted by sweep (txpool eviction or TTL)
txsExpired atomic.Uint64 // deleted by sweep after txpool absence following the grace period

Copilot uses AI. Check for mistakes.
Comment thread eth/relay/relay.go
Comment on lines +134 to +140
// logged but as they can't be fixed immediately by the relay, they'll be pruned
// after the default private tx store timeout.
// - Transient rejection due to reorg or nonce race. The relay's local pool will
// evict the tx eventually once it catches up with the txpool-aware sweep.
// - Transient rejection due to BPs current mempool state (e.g. pool full). This
// will be handled by local re-submissions in BPs. If the tx still remains invalid
// it will be evicted after the default private tx store timeout.

Copilot AI Apr 29, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The doc comment mentions pruning "after the default private tx store timeout", but PrivateTxStore currently doesn’t have a time-based TTL eviction—only block-inclusion cleanup, explicit purge, and the txpool-aware sweep. Please either point to the actual eviction mechanism/constant, or reword this to match the implemented behavior.

Suggested change
// logged but as they can't be fixed immediately by the relay, they'll be pruned
// after the default private tx store timeout.
// - Transient rejection due to reorg or nonce race. The relay's local pool will
// evict the tx eventually once it catches up with the txpool-aware sweep.
// - Transient rejection due to BPs current mempool state (e.g. pool full). This
// will be handled by local re-submissions in BPs. If the tx still remains invalid
// it will be evicted after the default private tx store timeout.
// logged; the relay does not immediately purge them from the private tx store,
// and they remain until removed by the store's existing cleanup paths.
// - Transient rejection due to reorg or nonce race. The relay's local pool will
// evict the tx eventually once it catches up with the txpool-aware sweep.
// - Transient rejection due to BPs current mempool state (e.g. pool full). This
// will be handled by local re-submissions in BPs; if the tx later drops out of
// the relay's local pool, it can then be removed by the txpool-aware sweep.

Copilot uses AI. Check for mistakes.
Comment thread eth/relay/multiclient.go
Comment on lines +368 to 371
// All BPs accepted the tx but at least one of them didn't offer a preconf
if firstErr == nil {
belowThresholdPreconfMeter.Mark(1)
}

Copilot AI Apr 29, 2026

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In checkTxStatus, the comment says "All BPs accepted the tx" / "didn't offer a preconf", but this method is issuing txpool_txStatus queries (not submitting a tx for acceptance). Please adjust the comment to describe the actual condition being measured (e.g., all BPs report pending vs mixed/unknown).

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants