Skip to content

feat: add tx validation caching#608

Merged
ozgb merged 17 commits into
mainfrom
ozgb-add-tx-validation-caching
Feb 5, 2026
Merged

feat: add tx validation caching#608
ozgb merged 17 commits into
mainfrom
ozgb-add-tx-validation-caching

Conversation

@ozgb

@ozgb ozgb commented Feb 4, 2026

Copy link
Copy Markdown
Contributor

Overview

Adds two separate transaction validation caches - one for use in the mempool, one for pre-dispatch & tx application

The cache for the mempool is keyed by tx hash only - this means we get cache hits even if the ledger state progresses.

For pre-dispatch, we use a more-strict (ledger_state, tx_hash) keyed cache. This means if a transaction was valid in the mempool, but the ledger state has progressed such that it's no longer valid, we catch that error here.

If a tx is added to the mempool and included in the same block, the node will now only run the transaction validation once!

TODO for future PRs:

  • Make use of ledger re-validation in the mempool
    • we can use this to kick transactions that have become invalid out of the mempool sooner than the cache TTL without incurring the performance hit of a full revalidation
  • Make use of ledger re-validation as a block producer
    • We can create custom block authorship (a small modification of the default basic_authorship.rs) to use re-validation instead of full validation when proposing blocks
    • This means block importers will still use full validation, so the chain integrity is safe, but may give some performance benefits for producing blocks
  • Refactor Bridge / Ledger interaction (tech debt)
    • The separation between Bridge and Ledger structs in the mindight-node-ledger crate is unclear
    • This needs refactoring - but not in this PR

Ticket: https://shielded.atlassian.net/browse/PM-21592

🗹 TODO before merging

  • Ready

📌 Submission Checklist

  • Changes are backward-compatible (or flagged if breaking)
  • Pull request description explains why the change is needed
  • Self-reviewed the diff
  • I have included a change file, or skipped for this reason:
  • If the changes introduce a new feature, I have bumped the node minor version
  • Update documentation (if relevant)
  • Updated AGENTS.md if build commands, architecture, or workflows changed
  • No new todos introduced

🧪 Testing Evidence

Please describe any additional testing aside from CI:

  • Additional tests are provided (if possible)

🔱 Fork Strategy

  • Node Runtime Update
    • the only runtime update is to remove redundant double-tx-verification performed in pre-dispatch
  • Node Client Update
  • Other:
  • N/A

Links

ozgb added 7 commits February 4, 2026 11:47
Implement separate caches for mempool and pre_dispatch validation:

- Soft cache (tx_hash only): Used for mempool revalidation to avoid
  redundant ZK proof verification during periodic housekeeping
- Strict cache (state_hash + tx_hash): Used for pre_dispatch to ensure
  state-dependent validation is performed correctly

Add versioned host API (v2) with `strict` parameter while maintaining
backwards compatibility with legacy callers that default to strict=true.

Update pallet to pass strict=false for mempool validation and strict=true
for pre_dispatch validation.
…d() calls

Change the strict cache to store VerifiedTransaction (via type erasure)
instead of just Result<(), Error>. This allows validate_guaranteed_execution
to reuse the cached VerifiedTransaction from validate_unsigned(strict=true),
avoiding a redundant call to well_formed() which performs expensive ZK proof
verification.

Flow:
1. validate_unsigned(strict=true) calls well_formed() and caches VerifiedTransaction
2. validate_guaranteed_execution retrieves cached VerifiedTransaction
3. apply() dry-run uses the cached value directly

The strict cache uses Arc<dyn Any + Send + Sync> for type erasure since
Bridge<S, D> is generic over multiple signature and database types.
…ied_transaction

- Add get_verified_transaction() as single source of truth for cache logic
- Simplify do_validate_transaction and do_validate_guaranteed_execution
- Add apply_verified_transaction to Ledger for using pre-verified transactions
- Update apply_transaction to use cached VerifiedTransaction when available
- Pass runtime_version to apply_transaction for cache key generation
…e methods

Remove dead code introduced by centralized caching:
- Remove Ledger::apply_transaction (replaced by apply_verified_transaction)
- Remove Ledger::validate_transaction
- Remove Transaction::validate
- Update tests to use well_formed() directly
Add info/warn logs at key points in transaction processing:
- When a transaction is validated for the mempool (or rejected)
- When a transaction is rejected at pre-dispatch
- When a transaction is applied to the ledger
Add observability for the two-tier transaction validation caches:
- ledger_tx_validation_cache_hits_total (labeled by cache_type: strict/soft)
- ledger_tx_validation_cache_misses_total
- ledger_tx_validation_cache_size (labeled by cache_type: strict/soft)

Extend TransactionValidationWasCached enum to distinguish between
SoftCache and StrictCache hits, enabling accurate metric reporting.
…atch

Remove the redundant validate_transaction call from pre_dispatch since
validate_guaranteed_execution already runs well_formed() via
get_verified_transaction. This also removes the strict flag that was
added to support two-tier caching for pre_dispatch vs mempool, as it's
no longer needed.

Changes:
- Simplify pre_dispatch to only call validate_guaranteed_execution
- Remove strict parameter from validate_transaction and do_validate_transaction
- Remove v2 versioned APIs from LedgerBridge and LedgerBridgeHf traits
- Move strict cache metrics to validate_guaranteed_execution
- Remove unused TransactionValidationWasCached enum
@github-actions

github-actions Bot commented Feb 4, 2026

Copy link
Copy Markdown
Contributor

kics-logo

KICS version: v2.1.16

Category Results
CRITICAL CRITICAL 0
HIGH HIGH 0
MEDIUM MEDIUM 97
LOW LOW 12
INFO INFO 83
TRACE TRACE 0
TOTAL TOTAL 192
Metric Values
Files scanned placeholder 29
Files parsed placeholder 29
Files failed to scan placeholder 0
Total executed queries placeholder 73
Queries failed to execute placeholder 0
Execution time placeholder 7

@ozgb ozgb changed the title feat: add tx caching feat: add tx validation caching Feb 4, 2026
@ozgb ozgb marked this pull request as ready for review February 4, 2026 14:49
@ozgb ozgb requested a review from a team as a code owner February 4, 2026 14:49
Comment thread ledger/src/versions/common/mod.rs
Comment thread ledger/src/versions/common/mod.rs
Comment thread ledger/src/versions/common/mod.rs Outdated
Comment thread ledger/src/versions/common/mod.rs Outdated
tgunnoe added a commit that referenced this pull request Feb 4, 2026
Comment thread ledger/src/versions/common/api/ledger.rs Outdated
Comment thread ledger/src/versions/common/mod.rs
Comment thread pallets/midnight/src/lib.rs Outdated

@gilescope gilescope left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the non-strick tx pool we need to be caching negatives.

@justinfrevert

Copy link
Copy Markdown
Contributor

@ozgb
Have we considered an approach that leverages the existing tx pool(via unsigned validation) for at least some of this logic? I'm thinking there could be some overlap here, as some of that is for keeping valid txes, rejecting rejecting txs, according tags we select such as tx hash and state hash. Though, with that approach we get fork awareness, version awareness, dependencies, etc for free.

@m2ux

m2ux commented Feb 5, 2026

Copy link
Copy Markdown
Contributor

Code Review Recommendations for PR #608

Author: Rust/Substrate Code Review Agent

Overall: ✅ Approve - Well-designed two-tier caching mechanism with good documentation and observability.


High Priority (Clarification needed)

  1. validate_unsigned removal in pre_dispatch (pallets/midnight/src/lib.rs, lines 426-439)
    • Previous implementation called validate_unsigned() before validate_guaranteed_execution()
    • This PR removes that call, relying solely on validate_guaranteed_execution
    • Question: Can you confirm that validate_guaranteed_execution now performs all checks that validate_unsigned previously did? The simplification looks correct, but want to verify nothing was inadvertently dropped.

Medium Priority (Follow-up PRs)

  1. Track downcast failures as metric (ledger/src/versions/common/mod.rs, lines 87-94)

    • Type erasure with Arc<dyn Any> logs a warning on downcast failure
    • Suggestion: Add Prometheus counter tx_validation_cache_downcast_failures to monitor for unexpected type issues in production
  2. Add soft cache metrics to validate_transaction (ledger/src/versions/common/mod.rs, lines 503-516)

    • Cache hit/miss metrics recorded for strict cache but not soft cache
    • Suggestion: Add similar metrics to validate_transaction to monitor mempool caching effectiveness

Low Priority (Nice to have)

  1. Make cache size configurable (ledger/src/versions/common/mod.rs, lines 88-94)

    • Cache size is fixed at 1000 entries
    • Consider making configurable via constant or runtime parameter for production tuning
  2. Profile VerifiedTransaction clone cost (ledger/src/versions/common/mod.rs, line 647)

    • Cache hit returns vt.clone() - may be significant for large transactions
    • If profiling shows impact, consider returning Arc<VerifiedTransaction<D>>

@m2ux

m2ux commented Feb 5, 2026

Copy link
Copy Markdown
Contributor

Test Suite Review Recommendations for PR #608

Author: Test Suite Review Agent

Overall: ⚠️ Acceptable - Core functionality exercised via E2E, but explicit caching tests are missing.


Test Coverage Summary

Component Unit Integration E2E Status
get_verified_transaction ✅ Implicit Partial
do_validate_transaction ✅ Implicit Partial
do_validate_guaranteed_execution ✅ Implicit Partial
Cache key generation None
Cache invalidation None

What's Tested

  • ✅ All E2E tests pass (Node E2E, Toolkit E2E, Hardfork E2E)
  • ✅ Existing unit tests updated to use new apply_verified_transaction API
  • ✅ No regression in existing test suite

High Priority (Follow-up PRs)

  1. Add unit tests for cache hit/miss scenarios (ledger/src/versions/common/mod.rs)

    • No tests verify that cached VerifiedTransaction is actually reused
    • Suggestion: Add test that calls validation twice with same tx, verify second call hits cache
  2. Add test for strict cache state invalidation

    • Strict cache keys on (state_hash, tx_hash) - no test verifies cache miss when state changes
    • Suggestion: Validate tx, apply different tx (changes state), verify original requires revalidation

Medium Priority (Follow-up PRs)

  1. Test downcast failure path (ledger/src/versions/common/mod.rs, lines 132-136)

    • Arc<dyn Any> downcast failure logs warning and returns cache miss
    • This error path is untested
  2. Verify Prometheus metrics are recorded

    • No tests assert that cache hit/miss/size metrics are incremented correctly

Conclusion: PR can proceed to merge. E2E tests provide implicit coverage of the validation path. Dedicated unit tests for caching behavior would strengthen confidence but are not blocking.

@m2ux m2ux left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@m2ux

m2ux commented Feb 5, 2026

Copy link
Copy Markdown
Contributor

Documentation Review Recommendations for PR #608

Author: Documentation Review Agent

Overall: ✅ Good - Thorough inline documentation with minor gaps in operator-facing docs.


Documentation Coverage Summary

Document Status Quality
Change file ✅ Present Good
Inline Rust docs ✅ Present Excellent
Prometheus metrics docs ❌ Not present Gap
Configuration guide ❌ Not updated N/A

What's Documented Well

  • ✅ Change file covers the feature with PR/JIRA links
  • ✅ Inline doc comments explain two-tier caching design thoroughly
  • ✅ Type erasure rationale (Arc<dyn Any>) explained in detail
  • ✅ Cache key semantics (strict vs soft) documented
  • ✅ Metric naming follows convention (midnight_ledger_tx_*_cache_*)
  • ✅ Log messages are descriptive with emoji prefixes

Medium Priority (Follow-up PRs)

  1. Document new Prometheus metrics

    • New metrics: midnight_ledger_tx_validation_cache_hits, midnight_ledger_tx_validation_cache_misses, midnight_ledger_tx_validation_cache_size
    • Suggestion: Create docs/metrics.md or add section to docs/configuration-guide.md listing available Prometheus metrics
  2. Operator guidance for cache tuning

    • Cache size is fixed at 1000 entries
    • Suggestion: Add guidance on expected cache hit rates and what values indicate problems

Low Priority (Nice to have)

  1. ADR for caching design
    • Two-tier caching is a significant architectural choice
    • Consider documenting the decision rationale in an ADR

Conclusion: Documentation is acceptable for merge. Inline Rust docs are thorough. Prometheus metrics documentation gap is a general project concern rather than specific to this PR.

@gilescope gilescope left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's see how it performs in QANet. It's been a while since the negative caching was an issue.

@ozgb ozgb enabled auto-merge February 5, 2026 11:56
@ozgb

ozgb commented Feb 5, 2026

Copy link
Copy Markdown
Contributor Author

@ozgb Have we considered an approach that leverages the existing tx pool(via unsigned validation) for at least some of this logic? I'm thinking there could be some overlap here, as some of that is for keeping valid txes, rejecting rejecting txs, according tags we select such as tx hash and state hash. Though, with that approach we get fork awareness, version awareness, dependencies, etc for free.

This PR still uses the existing tx pool - it just adds caching for the expensive well_formed checks we perform

@ozgb ozgb added this pull request to the merge queue Feb 5, 2026
Merged via the queue into main with commit 99be71e Feb 5, 2026
40 checks passed
@ozgb ozgb deleted the ozgb-add-tx-validation-caching branch February 5, 2026 16:48
gilescope pushed a commit that referenced this pull request Apr 8, 2026
* Unified toml files parameters
m2ux added a commit that referenced this pull request Apr 23, 2026
* Unified toml files parameters
Signed-off-by: Mike Clay <mike.clay@shielded.io>
m2ux added a commit that referenced this pull request Apr 23, 2026
* Unified toml files parameters
Signed-off-by: Mike Clay <mike.clay@shielded.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants