Skip to content

Slow TX Observability, Per-TX Latency Metric, and Timeout-Interrupt Diagnostics#2134

Merged
lucca30 merged 14 commits intodevelopfrom
lmartins/log-slow-txs
Mar 10, 2026
Merged

Slow TX Observability, Per-TX Latency Metric, and Timeout-Interrupt Diagnostics#2134
lucca30 merged 14 commits intodevelopfrom
lmartins/log-slow-txs

Conversation

@lucca30
Copy link
Copy Markdown
Contributor

@lucca30 lucca30 commented Mar 9, 2026

Description

Summary

This branch adds miner observability and diagnostics in three areas:

  1. Top-10 slow transaction tracking in 10-minute tumbling windows (producer-only logging).
  2. Per-transaction apply latency metric for tail analysis.
  3. Timeout interruption diagnostics with precise timing context.

It also adds dedicated unit tests and benchmark coverage for the new tracker.


Behavior Changes

1) Slow tx tracking (windowed top-K)

  • Tracks the slowest successful tx applications in a bounded top 10 structure.
  • Data structure: min-heap (K=10) with:
    • O(1) fast reject when not in top-K
    • O(log K) insert/replace when qualifying
  • Logging cadence: every 10 minutes, then reset.
  • Logging routine runs only when the node is a block producer (worker.IsRunning()).
  • If not producing during a tick, tracker is reset without logging to avoid stale carryover.

2) Timeout interruption diagnostics

On block-build timeout interruption, logs now include richer context:

  • flagSetAt
  • flagToAbortDelay
  • flagToTxInterruptDelay (when observed)
  • last tx context (txHash, txIndex, sender, txElapsed)

Also includes dedicated handling/logging for vm.ErrInterrupt in tx execution flow.

3) Per-tx duration metric

  • Adds worker/txApplyDuration metric.
  • Updated on successful tx application.
  • Uses custom timer reservoir (ExpDecaySample(8192, 0.015)) to preserve tail visibility at high throughput.

Logging Format Helper

  • Added common.PrettyTime (UTC format: MM-DD|HH:MM:SS.mmm) for consistent log time rendering.
  • Includes unit test coverage in common/format_test.go.

Test Changes

  • New unit tests for top-K tracker:
    • keeps top 10
    • descending snapshot order
    • reset semantics

Benchmark Impact

Benchmark command:

go test ./miner -run '^$' -bench BenchmarkSlowTxTopWindowTrackerImpact -benchmem

Representative run (Apple M4 Pro):

  • Baseline/200tx: 204.0 ns/op

  • TrackerMostlyNoInsert/200tx: 446.7 ns/op

  • TrackerMixedInsert/200tx: 853.0 ns/op

  • Baseline/800tx: 794.0 ns/op

  • TrackerMostlyNoInsert/800tx: 1328 ns/op

  • TrackerMixedInsert/800tx: 2492 ns/op

Delta vs baseline (impact view)

Assuming 200 tx insertions per block:

  • Mostly-no-insert path: 446.7 - 204.0 = +242.7 ns/block (about +1.21 ns/tx)
  • Mixed-insert path: 853.0 - 204.0 = +649.0 ns/block (about +3.25 ns/tx)

Assuming 800 tx insertions per block:

  • Mostly-no-insert path: 1328 - 794 = +534 ns/block (about +0.67 ns/tx)
  • Mixed-insert path: 2492 - 794 = +1698 ns/block (about +2.12 ns/tx)

Allocation profile in tracker modes:

  • 480 B/op, 10 allocs/op (bounded by K=10, not by tx volume)

Conclusion:

  • For common-path behavior, overhead is sub-2 ns/tx.
  • Even for mixed insertion behavior, overhead stays within low single-digit ns/tx.
  • No materially relevant performance regression for 200/800 tx insertion assumptions.

Validation Performed

  • go test ./common
  • go test ./internal/cli/server
  • go test ./miner -run TestSlowTx
  • go test ./miner -run '^$' -bench BenchmarkSlowTxTopWindowTrackerImpact -benchmem

Changes

  • Bugfix (non-breaking change that solves an issue)
  • Hotfix (change that solves an urgent issue, and requires immediate attention)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (change that is not backwards-compatible and/or changes current functionality)
  • Changes only for a subset of nodes

Breaking changes

Please complete this section if any breaking changes have been made, otherwise delete it

Nodes audience

In case this PR includes changes that must be applied only to a subset of nodes, please specify how you handled it (e.g. by adding a flag with a default value...)

Checklist

  • I have added at least 2 reviewer or the whole pos-v1 team
  • I have added sufficient documentation in code
  • I will be resolving comments - if any - by pushing each fix in a separate commit and linking the commit hash in the comment reply
  • Created a task in Jira and informed the team for implementation in Erigon client (if applicable)
  • Includes RPC methods changes, and the Notion documentation has been updated

Cross repository changes

  • This PR requires changes to heimdall
    • In case link the PR here:
  • This PR requires changes to matic-cli
    • In case link the PR here:

Testing

  • I have added unit tests
  • I have added tests to CI
  • I have tested this code manually on local environment
  • I have tested this code manually on remote devnet using express-cli
  • I have tested this code manually on amoy
  • I have created new e2e tests into express-cli

Manual tests

Please complete this section with the steps you performed if you ran manual tests for this functionality, otherwise delete it

Additional comments

Please post additional comments in this section if you have them, otherwise delete it

@lucca30 lucca30 requested a review from a team March 9, 2026 19:27
@claude
Copy link
Copy Markdown

claude bot commented Mar 9, 2026

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

@claude
Copy link
Copy Markdown

claude bot commented Mar 9, 2026

Code review

1 issue found. Checked for bugs and CLAUDE.md compliance.

Bug: Off-by-one in txIndex diagnostic log

File: miner/worker.go:1254

When the interrupt fires after a successful commitTransaction, env.tcount has already been incremented (at line 1161), so it points to the next unprocessed slot, not the index of the transaction identified by txHash (lastTxHash, set at line 1304).

Example: If lastTxHash refers to the tx at index 4, env.tcount is already 5 by the time this log fires. An operator seeing txHash=0xabc txIndex=5 would think tx 0xabc was at index 5, when it was actually at index 4.

Suggested fix: Change env.tcount to env.tcount-1 on line 1254.

@lucca30
Copy link
Copy Markdown
Contributor Author

lucca30 commented Mar 9, 2026

@claude Do a review again, see if you find any other issues

@claude
Copy link
Copy Markdown

claude bot commented Mar 9, 2026

Claude finished @lucca30's task in 2m 7s —— View job


Code Review (Re-review)

  • Gather context and read all changed files
  • Review miner/worker.go changes
  • Review common/format.go and tests
  • Review config/flag changes
  • Review benchmark test
  • Post findings

1 issue found, 2 minor observations. Checked for bugs, races, security, and CLAUDE.md compliance.


Issue: Weak ordering guarantee between interruptFlagSetAt and interruptBlockBuilding atomics

File: miner/worker.go:2145-2148

In createInterruptTimer, the goroutine writes:

interruptFlagSetAt.Store(time.Now().UnixNano())  // line 2146
interruptBlockBuilding.Store(true)                // line 2148

The main loop reads in the opposite order:

if w.interruptBlockBuilding.Load() {              // line 1241
    if flagSetAt := w.interruptFlagSetAt.Load()   // line 1246

Go's memory model does not guarantee ordering across different atomic variables. The main goroutine could observe interruptBlockBuilding=true while still reading a stale interruptFlagSetAt=0. The flagSetAt > 0 guard means this just silently omits the timing fields from the log rather than producing incorrect data, so the impact is low — it's a diagnostic gap, not a correctness bug.

If you want to eliminate this, you could combine both values behind a single sync.Mutex, or accept the rare missing timestamp as acceptable for a diagnostic log (which I think is fine).

Severity: Low (self-inflicted, diagnostic-only impact)


Observation 1: Previous off-by-one is fixed

The previous review flagged env.tcount being used directly in the interrupt log. The new code correctly captures lastTxIndex = env.tcount before commitTransaction (which increments tcount on success), so the logged index now matches the tx identified by lastTxHash. Nice fix.

Observation 2: ErrInterrupt handler uses txs.Pop() (drops all sender txs)

File: miner/worker.go:1449

When a tx is interrupted due to timeout, txs.Pop() drops all remaining txs from the same sender. This is more aggressive than txs.Shift() (which would try the next tx). Given that the block-building timer has already fired at this point, dropping the sender's remaining txs is the right call — there's no time left to try more txs anyway, and the loop will exit on the next iteration when it re-checks interruptBlockBuilding. No issue here, just noting for completeness.


Summary

The implementation is clean and well-structured. Config plumbing follows existing patterns (DurationFlag + raw string for TOML). The benchmark test correctly handles the global metrics.Enable() one-way switch. The slow-tx aggregation (one log per block, sorted by duration) is a good design choice that avoids log spam. The ErrInterrupt case handler is a sensible addition for timeout diagnostics.

The only finding is the minor atomic ordering gap noted above, which has no impact on correctness — just a rare chance of omitting timing fields from a diagnostic log.


@kamuikatsurgi
Copy link
Copy Markdown
Member

IMO, the per tx latency isn't much useful as it would just be a another metric. It doesn't particularly tell use which txs are slower vs which are faster. I don't see how this can help us.

@lucca30
Copy link
Copy Markdown
Contributor Author

lucca30 commented Mar 9, 2026

IMO, the per tx latency isn't much useful as it would just be a another metric. It doesn't particularly tell use which txs are slower vs which are faster. I don't see how this can help us.

@kamuikatsurgi
I agree that this data has no good use for the things we have today.
But the goal is to help us refine which should be the ideal --miner.slowtxthreshold value to just capture relevant outliers. Today we have no idea on what is the avg time a tx take, how outliers behave, if they are common or not etc

@codecov
Copy link
Copy Markdown

codecov bot commented Mar 9, 2026

Codecov Report

❌ Patch coverage is 69.23077% with 44 lines in your changes missing coverage. Please review.
✅ Project coverage is 51.48%. Comparing base (a328aa9) to head (91d3170).
⚠️ Report is 4 commits behind head on develop.

Files with missing lines Patch % Lines
miner/slow_tx_tracker.go 56.89% 25 Missing ⚠️
miner/worker.go 77.10% 18 Missing and 1 partial ⚠️

❌ Your patch status has failed because the patch coverage (69.23%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #2134      +/-   ##
===========================================
+ Coverage    51.44%   51.48%   +0.04%     
===========================================
  Files          881      882       +1     
  Lines       153870   154004     +134     
===========================================
+ Hits         79157    79290     +133     
+ Misses       69542    69538       -4     
- Partials      5171     5176       +5     
Files with missing lines Coverage Δ
common/format.go 8.33% <100.00%> (+8.33%) ⬆️
miner/worker.go 69.08% <77.10%> (+0.54%) ⬆️
miner/slow_tx_tracker.go 56.89% <56.89%> (ø)

... and 23 files with indirect coverage changes

Files with missing lines Coverage Δ
common/format.go 8.33% <100.00%> (+8.33%) ⬆️
miner/worker.go 69.08% <77.10%> (+0.54%) ⬆️
miner/slow_tx_tracker.go 56.89% <56.89%> (ø)

... and 23 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@lucca30
Copy link
Copy Markdown
Contributor Author

lucca30 commented Mar 10, 2026

@claude Do an analysis for the new changes I did. Checking if any bug, error or sec breach was introduced.

@claude
Copy link
Copy Markdown

claude bot commented Mar 10, 2026

Claude finished @lucca30's task in 1m 13s —— View job


Analysis of New Changes

  • Gather context — read the full diff and changed files
  • Review miner/worker.go changes for bugs, errors, and security issues
  • Review common/format.go and common/format_test.go
  • Review test files (miner/slow_tx_tracker_test.go, miner/worker_metrics_benchmark_test.go)
  • Review config/flag changes
  • Post findings

0 bugs, 0 security issues found. Checked for bugs, data races, security breaches, and correctness issues across all 6 changed files.


What was checked

1. slowTxTopTracker data structure (worker.go:184-273)

  • Min-heap logic is correct: O(1) fast reject, O(log K) replace, bounded at K=10.
  • SnapshotAndReset correctly copies before resetting, returns descending order.
  • No concurrency issue — the tracker is only accessed from mainLoop (single goroutine) and flushSlowTxWindow (also called from mainLoop via the ticker case). Safe without a mutex.

2. Tracker integration in commitTransactions (worker.go:1495-1505)

  • w.slowTxTracker.Add(...) is only called on successful tx commit (errors.Is(err, nil) case). Correct — only tracks actually applied txs.
  • Guarded by w.IsRunning() so non-producers don't track. Consistent with the flush logic.

3. Window ticker in mainLoop (worker.go:881, 975-981)

  • Ticker created before the select loop and properly deferred for cleanup.
  • When not running, calls Reset() to avoid stale carryover. Good.

4. ErrInterrupt handler (worker.go:1537-1546)

  • The ltx.Hash reference on line 1545 is valid here — ltx was set earlier in the loop iteration (line 1391-1398) and hasn't been reassigned. This is the same pattern used by the default case on line 1551.
  • txs.Pop() correctly drops the sender's remaining txs since the block-building timer has already fired.

5. Interrupt diagnostic context (worker.go:1313-1323, 1339-1365)

  • lastTxIndex = env.tcount is captured before commitTransaction (line 1479), fixing the off-by-one from the previous review. Correct.
  • lastCommitStart zero-check guards the log fields properly — if no tx was attempted, the tx-specific fields are omitted.

6. createInterruptTimer atomic ordering (worker.go:2229-2232)

  • interruptFlagSetAt.Store(...) is now written before interruptBlockBuilding.Store(true), and the reader checks interruptBlockBuilding first then reads interruptFlagSetAt. This is the correct store-before-publish / load-after-observe pattern. The interruptCtx.Err() != context.Canceled guard ensures the timestamp is only set on actual timeout (not on cancel). Clean.
  • On the reader side (worker.go:1344), flagSetAt > 0 still guards against the race where the goroutine hasn't written the timestamp yet (e.g., if the scheduler delays between the two stores). The fallback is just omitting the timing fields — no incorrect data.

7. PrettyTime (common/format.go:45-52)

  • Simple UTC formatter. Test covers timezone conversion correctly.

8. Removed flag test (internal/cli/server/command_test.go:150-168)

  • Correctly verifies --miner.slowtxthreshold is rejected. Uses io.Discard for UI writers to avoid test noise.

9. Benchmark test (miner/worker_metrics_benchmark_test.go)

  • Sink variables prevent dead-code elimination. Benchmark modes cover all tracker code paths. No issues.

10. Security check

  • No new RPC endpoints, no new peer message handlers, no new untrusted input paths.
  • The slow tx tracker is bounded (K=10, fixed) — no unbounded allocation risk.
  • formatSlowTxs uses fmt.Sprintf on bounded data (max 10 entries) — no memory concern.
  • No secrets logged; tx hashes and sender addresses are already public data.

Previous review findings status

  • Off-by-one in txIndex: Fixed in this version. lastTxIndex = env.tcount is captured before commitTransaction.
  • Weak atomic ordering: Addressed — store order is now interruptFlagSetAt before interruptBlockBuilding, which is the correct pattern.

The implementation is solid. No bugs, errors, or security issues introduced.

Copy link
Copy Markdown
Member

@manav2401 manav2401 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, couple of small nits.

@sonarqubecloud
Copy link
Copy Markdown

Quality Gate Failed Quality Gate failed

Failed conditions
1 Security Hotspot

See analysis details on SonarQube Cloud

@lucca30 lucca30 merged commit 94b8bf9 into develop Mar 10, 2026
24 of 28 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants