Slow TX Observability, Per-TX Latency Metric, and Timeout-Interrupt Diagnostics by lucca30 · Pull Request #2134 · 0xPolygon/bor

lucca30 · 2026-03-09T19:27:38Z

Description

Summary

This branch adds miner observability and diagnostics in three areas:

Top-10 slow transaction tracking in 10-minute tumbling windows (producer-only logging).
Per-transaction apply latency metric for tail analysis.
Timeout interruption diagnostics with precise timing context.

It also adds dedicated unit tests and benchmark coverage for the new tracker.

Behavior Changes

1) Slow tx tracking (windowed top-K)

Tracks the slowest successful tx applications in a bounded top 10 structure.
Data structure: min-heap (K=10) with:
- O(1) fast reject when not in top-K
- O(log K) insert/replace when qualifying
Logging cadence: every 10 minutes, then reset.
Logging routine runs only when the node is a block producer (worker.IsRunning()).
If not producing during a tick, tracker is reset without logging to avoid stale carryover.

2) Timeout interruption diagnostics

On block-build timeout interruption, logs now include richer context:

flagSetAt
flagToAbortDelay
flagToTxInterruptDelay (when observed)
last tx context (txHash, txIndex, sender, txElapsed)

Also includes dedicated handling/logging for vm.ErrInterrupt in tx execution flow.

3) Per-tx duration metric

Adds worker/txApplyDuration metric.
Updated on successful tx application.
Uses custom timer reservoir (ExpDecaySample(8192, 0.015)) to preserve tail visibility at high throughput.

Logging Format Helper

Added common.PrettyTime (UTC format: MM-DD|HH:MM:SS.mmm) for consistent log time rendering.
Includes unit test coverage in common/format_test.go.

Test Changes

New unit tests for top-K tracker:
- keeps top 10
- descending snapshot order
- reset semantics

Benchmark Impact

Benchmark command:

go test ./miner -run '^$' -bench BenchmarkSlowTxTopWindowTrackerImpact -benchmem

Representative run (Apple M4 Pro):

Baseline/200tx: 204.0 ns/op
TrackerMostlyNoInsert/200tx: 446.7 ns/op
TrackerMixedInsert/200tx: 853.0 ns/op
Baseline/800tx: 794.0 ns/op
TrackerMostlyNoInsert/800tx: 1328 ns/op
TrackerMixedInsert/800tx: 2492 ns/op

Delta vs baseline (impact view)

Assuming 200 tx insertions per block:

Mostly-no-insert path: 446.7 - 204.0 = +242.7 ns/block (about +1.21 ns/tx)
Mixed-insert path: 853.0 - 204.0 = +649.0 ns/block (about +3.25 ns/tx)

Assuming 800 tx insertions per block:

Mostly-no-insert path: 1328 - 794 = +534 ns/block (about +0.67 ns/tx)
Mixed-insert path: 2492 - 794 = +1698 ns/block (about +2.12 ns/tx)

Allocation profile in tracker modes:

480 B/op, 10 allocs/op (bounded by K=10, not by tx volume)

Conclusion:

For common-path behavior, overhead is sub-2 ns/tx.
Even for mixed insertion behavior, overhead stays within low single-digit ns/tx.
No materially relevant performance regression for 200/800 tx insertion assumptions.

Validation Performed

go test ./common
go test ./internal/cli/server
go test ./miner -run TestSlowTx
go test ./miner -run '^$' -bench BenchmarkSlowTxTopWindowTrackerImpact -benchmem

Changes

Bugfix (non-breaking change that solves an issue)
Hotfix (change that solves an urgent issue, and requires immediate attention)
New feature (non-breaking change that adds functionality)
Breaking change (change that is not backwards-compatible and/or changes current functionality)
Changes only for a subset of nodes

Breaking changes

Please complete this section if any breaking changes have been made, otherwise delete it

Nodes audience

In case this PR includes changes that must be applied only to a subset of nodes, please specify how you handled it (e.g. by adding a flag with a default value...)

Checklist

I have added at least 2 reviewer or the whole pos-v1 team
I have added sufficient documentation in code
I will be resolving comments - if any - by pushing each fix in a separate commit and linking the commit hash in the comment reply
Created a task in Jira and informed the team for implementation in Erigon client (if applicable)
Includes RPC methods changes, and the Notion documentation has been updated

Cross repository changes

This PR requires changes to heimdall
- In case link the PR here:
This PR requires changes to matic-cli
- In case link the PR here:

Testing

I have added unit tests
I have added tests to CI
I have tested this code manually on local environment
I have tested this code manually on remote devnet using express-cli
I have tested this code manually on amoy
I have created new e2e tests into express-cli

Manual tests

Please complete this section with the steps you performed if you ran manual tests for this functionality, otherwise delete it

Additional comments

Please post additional comments in this section if you have them, otherwise delete it

claude · 2026-03-09T19:35:45Z

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

claude · 2026-03-09T19:36:39Z

Code review

1 issue found. Checked for bugs and CLAUDE.md compliance.

Bug: Off-by-one in `txIndex` diagnostic log

File: miner/worker.go:1254

When the interrupt fires after a successful commitTransaction, env.tcount has already been incremented (at line 1161), so it points to the next unprocessed slot, not the index of the transaction identified by txHash (lastTxHash, set at line 1304).

Example: If lastTxHash refers to the tx at index 4, env.tcount is already 5 by the time this log fires. An operator seeing txHash=0xabc txIndex=5 would think tx 0xabc was at index 5, when it was actually at index 4.

Suggested fix: Change env.tcount to env.tcount-1 on line 1254.

lucca30 · 2026-03-09T20:15:52Z

@claude Do a review again, see if you find any other issues

claude · 2026-03-09T20:16:09Z

Claude finished @lucca30's task in 2m 7s —— View job

Code Review (Re-review)

1 issue found, 2 minor observations. Checked for bugs, races, security, and CLAUDE.md compliance.

Issue: Weak ordering guarantee between `interruptFlagSetAt` and `interruptBlockBuilding` atomics

File: miner/worker.go:2145-2148

In createInterruptTimer, the goroutine writes:

interruptFlagSetAt.Store(time.Now().UnixNano())  // line 2146
interruptBlockBuilding.Store(true)                // line 2148

The main loop reads in the opposite order:

if w.interruptBlockBuilding.Load() {              // line 1241
    if flagSetAt := w.interruptFlagSetAt.Load()   // line 1246

Go's memory model does not guarantee ordering across different atomic variables. The main goroutine could observe interruptBlockBuilding=true while still reading a stale interruptFlagSetAt=0. The flagSetAt > 0 guard means this just silently omits the timing fields from the log rather than producing incorrect data, so the impact is low — it's a diagnostic gap, not a correctness bug.

If you want to eliminate this, you could combine both values behind a single sync.Mutex, or accept the rare missing timestamp as acceptable for a diagnostic log (which I think is fine).

Severity: Low (self-inflicted, diagnostic-only impact)

Observation 1: Previous off-by-one is fixed

The previous review flagged env.tcount being used directly in the interrupt log. The new code correctly captures lastTxIndex = env.tcount before commitTransaction (which increments tcount on success), so the logged index now matches the tx identified by lastTxHash. Nice fix.

Observation 2: `ErrInterrupt` handler uses `txs.Pop()` (drops all sender txs)

File: miner/worker.go:1449

When a tx is interrupted due to timeout, txs.Pop() drops all remaining txs from the same sender. This is more aggressive than txs.Shift() (which would try the next tx). Given that the block-building timer has already fired at this point, dropping the sender's remaining txs is the right call — there's no time left to try more txs anyway, and the loop will exit on the next iteration when it re-checks interruptBlockBuilding. No issue here, just noting for completeness.

Summary

The implementation is clean and well-structured. Config plumbing follows existing patterns (DurationFlag + raw string for TOML). The benchmark test correctly handles the global metrics.Enable() one-way switch. The slow-tx aggregation (one log per block, sorted by duration) is a good design choice that avoids log spam. The ErrInterrupt case handler is a sensible addition for timeout diagnostics.

The only finding is the minor atomic ordering gap noted above, which has no impact on correctness — just a rare chance of omitting timing fields from a diagnostic log.

kamuikatsurgi · 2026-03-09T20:31:59Z

IMO, the per tx latency isn't much useful as it would just be a another metric. It doesn't particularly tell use which txs are slower vs which are faster. I don't see how this can help us.

lucca30 · 2026-03-09T20:38:10Z

IMO, the per tx latency isn't much useful as it would just be a another metric. It doesn't particularly tell use which txs are slower vs which are faster. I don't see how this can help us.

@kamuikatsurgi
I agree that this data has no good use for the things we have today.
But the goal is to help us refine which should be the ideal --miner.slowtxthreshold value to just capture relevant outliers. Today we have no idea on what is the avg time a tx take, how outliers behave, if they are common or not etc

codecov · 2026-03-09T21:59:11Z

Codecov Report

❌ Patch coverage is 69.23077% with 44 lines in your changes missing coverage. Please review.
✅ Project coverage is 51.48%. Comparing base (a328aa9) to head (91d3170).
⚠️ Report is 4 commits behind head on develop.

Files with missing lines	Patch %	Lines
miner/slow_tx_tracker.go	56.89%	25 Missing ⚠️
miner/worker.go	77.10%	18 Missing and 1 partial ⚠️

❌ Your patch status has failed because the patch coverage (69.23%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

@@             Coverage Diff             @@
##           develop    #2134      +/-   ##
===========================================
+ Coverage    51.44%   51.48%   +0.04%     
===========================================
  Files          881      882       +1     
  Lines       153870   154004     +134     
===========================================
+ Hits         79157    79290     +133     
+ Misses       69542    69538       -4     
- Partials      5171     5176       +5

Files with missing lines	Coverage Δ
common/format.go	`8.33% <100.00%> (+8.33%)`	⬆️
miner/worker.go	`69.08% <77.10%> (+0.54%)`	⬆️
miner/slow_tx_tracker.go	`56.89% <56.89%> (ø)`

... and 23 files with indirect coverage changes

Files with missing lines	Coverage Δ
common/format.go	`8.33% <100.00%> (+8.33%)`	⬆️
miner/worker.go	`69.08% <77.10%> (+0.54%)`	⬆️
miner/slow_tx_tracker.go	`56.89% <56.89%> (ø)`

... and 23 files with indirect coverage changes

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

internal/cli/server/config.go

lucca30 · 2026-03-10T00:39:45Z

@claude Do an analysis for the new changes I did. Checking if any bug, error or sec breach was introduced.

claude · 2026-03-10T00:40:03Z

Claude finished @lucca30's task in 1m 13s —— View job

Analysis of New Changes

Gather context — read the full diff and changed files
Review miner/worker.go changes for bugs, errors, and security issues
Review common/format.go and common/format_test.go
Review test files (miner/slow_tx_tracker_test.go, miner/worker_metrics_benchmark_test.go)
Review config/flag changes
Post findings

0 bugs, 0 security issues found. Checked for bugs, data races, security breaches, and correctness issues across all 6 changed files.

What was checked

1. slowTxTopTracker data structure (worker.go:184-273)

Min-heap logic is correct: O(1) fast reject, O(log K) replace, bounded at K=10.
SnapshotAndReset correctly copies before resetting, returns descending order.
No concurrency issue — the tracker is only accessed from mainLoop (single goroutine) and flushSlowTxWindow (also called from mainLoop via the ticker case). Safe without a mutex.

2. Tracker integration in commitTransactions (worker.go:1495-1505)

w.slowTxTracker.Add(...) is only called on successful tx commit (errors.Is(err, nil) case). Correct — only tracks actually applied txs.
Guarded by w.IsRunning() so non-producers don't track. Consistent with the flush logic.

3. Window ticker in mainLoop (worker.go:881, 975-981)

Ticker created before the select loop and properly deferred for cleanup.
When not running, calls Reset() to avoid stale carryover. Good.

4. ErrInterrupt handler (worker.go:1537-1546)

The ltx.Hash reference on line 1545 is valid here — ltx was set earlier in the loop iteration (line 1391-1398) and hasn't been reassigned. This is the same pattern used by the default case on line 1551.
txs.Pop() correctly drops the sender's remaining txs since the block-building timer has already fired.

5. Interrupt diagnostic context (worker.go:1313-1323, 1339-1365)

lastTxIndex = env.tcount is captured before commitTransaction (line 1479), fixing the off-by-one from the previous review. Correct.
lastCommitStart zero-check guards the log fields properly — if no tx was attempted, the tx-specific fields are omitted.

6. createInterruptTimer atomic ordering (worker.go:2229-2232)

interruptFlagSetAt.Store(...) is now written before interruptBlockBuilding.Store(true), and the reader checks interruptBlockBuilding first then reads interruptFlagSetAt. This is the correct store-before-publish / load-after-observe pattern. The interruptCtx.Err() != context.Canceled guard ensures the timestamp is only set on actual timeout (not on cancel). Clean.
On the reader side (worker.go:1344), flagSetAt > 0 still guards against the race where the goroutine hasn't written the timestamp yet (e.g., if the scheduler delays between the two stores). The fallback is just omitting the timing fields — no incorrect data.

7. PrettyTime (common/format.go:45-52)

Simple UTC formatter. Test covers timezone conversion correctly.

8. Removed flag test (internal/cli/server/command_test.go:150-168)

Correctly verifies --miner.slowtxthreshold is rejected. Uses io.Discard for UI writers to avoid test noise.

9. Benchmark test (miner/worker_metrics_benchmark_test.go)

Sink variables prevent dead-code elimination. Benchmark modes cover all tracker code paths. No issues.

10. Security check

No new RPC endpoints, no new peer message handlers, no new untrusted input paths.
The slow tx tracker is bounded (K=10, fixed) — no unbounded allocation risk.
formatSlowTxs uses fmt.Sprintf on bounded data (max 10 entries) — no memory concern.
No secrets logged; tx hashes and sender addresses are already public data.

Previous review findings status

Off-by-one in txIndex: Fixed in this version. lastTxIndex = env.tcount is captured before commitTransaction.
Weak atomic ordering: Addressed — store order is now interruptFlagSetAt before interruptBlockBuilding, which is the correct pattern.

The implementation is solid. No bugs, errors, or security issues introduced.

manav2401

LGTM, couple of small nits.

miner/worker.go

miner/slow_tx_tracker_test.go

sonarqubecloud · 2026-03-10T13:42:06Z

Quality Gate failed

Failed conditions
1 Security Hotspot

See analysis details on SonarQube Cloud

lucca30 added 8 commits March 9, 2026 12:11

including log report for slow txs and the interrupted one

0229827

include perTxApply metrics

d07b9e8

pretty logs

8f12a06

make lint

3e7fb39

Merge remote-tracking branch 'origin/develop' into lmartins/log-slow-txs

3c8b322

solve merge conflict

6e9203b

full hash log for slow txs

f7a64fc

full tx hash log

a76c74a

lucca30 requested a review from a team March 9, 2026 19:27

make docs

ad3d691

lucca30 added 2 commits March 9, 2026 16:58

address cr

9095cca

address failing unit test

1631580

cffls reviewed Mar 9, 2026

View reviewed changes

internal/cli/server/config.go Outdated Show resolved Hide resolved

log top 10 slowest each 10 minutes

45293cf

moving code from worker.go

d571c91

cffls approved these changes Mar 10, 2026

View reviewed changes

marcello33 approved these changes Mar 10, 2026

View reviewed changes

manav2401 approved these changes Mar 10, 2026

View reviewed changes

miner/worker.go Show resolved Hide resolved

miner/slow_tx_tracker_test.go Show resolved Hide resolved

address comments

91d3170

lucca30 merged commit 94b8bf9 into develop Mar 10, 2026
24 of 28 checks passed

Conversation

lucca30 commented Mar 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Summary

Behavior Changes

1) Slow tx tracking (windowed top-K)

2) Timeout interruption diagnostics

3) Per-tx duration metric

Logging Format Helper

Test Changes

Benchmark Impact

Delta vs baseline (impact view)

Validation Performed

Changes

Breaking changes

Nodes audience

Checklist

Cross repository changes

Testing

Manual tests

Additional comments

Uh oh!

claude bot commented Mar 9, 2026

Code review

Uh oh!

claude bot commented Mar 9, 2026

Code review

Bug: Off-by-one in txIndex diagnostic log

Uh oh!

lucca30 commented Mar 9, 2026

Uh oh!

claude bot commented Mar 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review (Re-review)

Issue: Weak ordering guarantee between interruptFlagSetAt and interruptBlockBuilding atomics

Observation 1: Previous off-by-one is fixed

Observation 2: ErrInterrupt handler uses txs.Pop() (drops all sender txs)

Summary

Uh oh!

kamuikatsurgi commented Mar 9, 2026

Uh oh!

lucca30 commented Mar 9, 2026

Uh oh!

codecov bot commented Mar 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

lucca30 commented Mar 10, 2026

Uh oh!

claude bot commented Mar 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Analysis of New Changes

What was checked

Previous review findings status

Uh oh!

manav2401 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

sonarqubecloud bot commented Mar 10, 2026

Quality Gate failed

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

lucca30 commented Mar 9, 2026 •

edited

Loading

Bug: Off-by-one in `txIndex` diagnostic log

claude bot commented Mar 9, 2026 •

edited

Loading

Issue: Weak ordering guarantee between `interruptFlagSetAt` and `interruptBlockBuilding` atomics

Observation 2: `ErrInterrupt` handler uses `txs.Pop()` (drops all sender txs)

codecov bot commented Mar 9, 2026 •

edited

Loading

claude bot commented Mar 10, 2026 •

edited

Loading