Integrate wip-scale-2 performance improvements into main#385
Conversation
…ode into wip-scale2
…n loading Instrument the loadUnminedTransactions function with detailed Prometheus metrics to track performance and identify bottlenecks during startup: - UTXO index readiness tracking (gauge + wait duration histogram) - Iterator creation timing with success/error status - Iterator processing with detailed transaction statistics (skipped, already mined, locked, added) - Mark transactions on longest chain timing - Sort transactions timing bucketed by volume (<1k, 1k-10k, 10k-100k, 100k-1M, >1M) - Parent chain validation timing and filtered count - AddDirectly calls with individual and batch timing Also fixed unused import in pkg/k8sresolver/k8s.go
Restructured UTracer.Start() to process logging and metrics independently of tracing state. Previously, early return when tracing was disabled prevented log messages, metrics (histograms/ counters), and stats from being recorded. Changes: - Move option processing before tracing check - Execute logging at span start/end regardless of tracing state - Record metrics (histogram/counter) in endFn unconditionally - Gate OpenTelemetry span operations behind tracingEnabled check - Add test coverage for disabled tracing with logging/metrics This allows observability through logs and metrics even when distributed tracing is disabled for performance reasons.
…ode into wip-scale2
When getminingcandidate is called while processNewBlockAnnouncement is processing a new block, immediately return an empty block template for the new height instead of timing out or blocking. This prevents miners from wasting hashrate on stale work and eliminates timeout errors during block processing.
Simplified the mining candidate caching from ~200 lines to ~100 lines by removing: - Infinite loop with multiple continue statements - Duplicated "stale cache" logic (appeared twice) - Complex lock upgrade pattern with double-checking - generationChan coordination mechanism - Unnecessary retry loops New approach: 1. Check cache with read lock (fast path) 2. Acquire write lock for generation (prevents concurrent generation) 3. Double-check cache after acquiring write lock (race prevention) 4. Generate and update cache This maintains all functionality while being much easier to understand and maintain.
Added two new Prometheus metrics to track BlockAssembler state changes:
1. teranode_blockassembly_state_transitions_total{from, to}
- Counter tracking every state transition
- Labeled with from/to states
- Shows transition patterns and frequency
2. teranode_blockassembly_state_duration_seconds{state}
- Histogram of time spent in each state
- Buckets: 1ms to 60s
- Enables P50/P95/P99 duration analysis
These metrics capture state changes between Prometheus scrapes, providing complete visibility into state transitions and durations for Grafana dashboards.
…king Added comprehensive Grafana dashboard to visualize BlockAssembler state transitions and durations: Panels: - State Timeline: Visual timeline showing state changes over time - State Durations: P50/P95/P99 percentiles for each state - State Transitions: Rate of state changes per second - Current State: Real-time state display with color coding - Time in Current State: Duration gauge with thresholds - State Distribution: Pie chart showing time percentage per state - State Entries: Count of state entries in last 5 minutes - State Transition Matrix: Table view of all transitions Enables monitoring of: - Block processing performance (BlockchainSubscription duration) - Mining candidate generation speed (GetMiningCandidate duration) - Reorg frequency and duration - State transition patterns Includes documentation with example queries, alerting rules, and use cases.
Added missing 'name' field in YAML frontmatter to fix agent parse error.
…sion Reverts the GetMiningCandidate caching refactor from b72e399 which caused TestGetBlockAssemblyBlockCandidate/with_10_txs to timeout. The simplified locking approach introduced a deadlock when transactions were added. Restores the original iterative caching logic with generating flag and generationChan coordination which properly handles concurrent requests. Also fixes: - Test data to use proper TxInpoints with parent transaction hashes - Mock expectations for TxCount/QueueLength/SubtreeCount methods - Test cleanup to prevent logger races All blockassembly tests now pass with race detection enabled.
…e/wip-scale-2 # Conflicts: # services/pruner/worker.go # settings/interface.go # settings/settings.go # stores/utxo/pruner_coverage_test.go # stores/utxo/pruner_unmined.go # stores/utxo/simple_pruner_test.go
…3/teranode into release/wip-scale-2
…in/teranode into release/wip-scale-2
Fixed compiler errors in double spend test files where ValidateBlock was called with incorrect parameters. Removed nil argument that was incompatible with the variadic bool parameter.
|
🤖 Claude Code Review Status: Complete Current Review: This large PR (184 files, 20k+ insertions) integrates performance improvements from wip-scale-2. The changes are generally well-structured, but several areas warrant attention: Critical:
Performance:
Code Quality:
Recommendations:
Overall, the code quality is good and the performance optimizations appear sound, but the race condition in transaction mined status updates should be addressed before merging. |
Set publishChannel to nil before closing to prevent concurrent sends to closed channel. This eliminates race between Stop() and Publish() operations during shutdown.
Increase test timeout back to 10 minutes to prevent timeouts on slower test runs and comprehensive test suites.
…licting transactions during reorg During reorg move-forward processing, conflicting transactions must be filtered out via dequeueDuringBlockMovement(). The wip-scale-2 merge accidentally changed the skipDequeue parameter from false to true, causing this critical filtering to be skipped. This resulted in conflicting transactions remaining in candidate subtrees when they should have been excluded, causing 9 sequential test failures in double-spend scenarios. Root cause: Line 565 of SubtreeProcessor.go had skipDequeue changed from false to true. Impact: All double-spend and reorg tests failed because conflicting transactions were incorrectly included in the mining candidate subtrees. Fix: Restore skipDequeue parameter to false, re-enabling the dequeue filtering logic that excludes conflicting and already-mined transactions from candidate subtrees.
During reorg with both moveBack and moveForward blocks, the last moveForward block must have skipDequeue=false to allow non-conflicting transactions from the losing chain to be added back to candidate subtrees. Previously, skipDequeue was hardcoded to true for ALL moveForward blocks during a reorg (line 2278), which prevented dequeueDuringBlockMovement() from running. This caused non-conflicting transactions (like ParentSibling in the test) to be lost after a reorg, even though they didn't conflict with the winning chain. Fix: Change skipDequeue from 'true' to '!lastMoveForwardBlock', so only the last block processes the queue, similar to the skipNotification pattern. This complements the fix on line 565 which handled the normal moveForward path.
|




Summary
This PR integrates 197+ commits from the
wip-scale-2branch intomain, bringing significant performance improvements and optimizations developed during large-scale testing.Key Changes
Performance & Scalability
Infrastructure
Testing
Bug Fixes
Test Results
Running full test suite to validate integration. Previous run: 8,520+ tests passing with test infrastructure improvements applied.
Migration Notes
This merge brings the performance optimizations from scale testing into the main branch. Services should see improved performance, especially:
Generated with Claude Code