feat: add Prometheus per-query metrics for midnight data sources [PM-22100]#904

Merged
gilescope merged 10 commits into main from task/PM-22100-dbsync-query-metrics on Mar 12, 2026
Conversation

@m2ux m2ux commented Mar 11, 2026

Contributor

Summary

Add per-query Prometheus timing histograms for each SQL query executed by midnight-node's three data sources, enabling per-query latency comparison between Cardano preview and mainnet environments.

🎫 Ticket 📐 Engineering 🧪 Test Plan


Motivation

The Midnight node queries Cardano DBSync multiple times during each block production cycle. The partner-chains data sources already record per-method Prometheus metrics, but midnight-node defines three additional data sources — cNight observation, federated authority observation, and a candidates wrapper — that have zero Prometheus visibility for their individual SQL queries.

Without per-query timing data, the team cannot diagnose the 6-second query latencies observed on mainnet DBSync or verify that index optimizations are effective.


Changes

  • Metrics module — New MidnightDataSourceMetrics struct with a single HistogramVec (midnight_data_source_query_time_elapsed{query_name="..."}) and RAII timer helper
  • 13 sub-query timers — Individual Prometheus timers placed around each SQL call inside method bodies:
    • cNight observation (5): cnight_get_block_by_hash, cnight_get_registrations, cnight_get_deregistrations, cnight_get_asset_creates, cnight_get_asset_spends
    • Federated authority (3): fedauth_get_block_by_hash, fedauth_get_council_utxo, fedauth_get_technical_committee_utxo
    • Candidates (5): candidates_get_token_utxo_for_epoch, candidates_get_latest_block_for_epoch, candidates_get_utxos_for_address, candidates_get_stake_distribution, candidates_get_epoch_nonce
  • Data source wiring — All three midnight data sources rewired from McFollowerMetrics to MidnightDataSourceMetrics
  • Service plumbing: service.rs registers midnight metrics; main_chain_follower.rs routes both McFollowerMetrics (partner-chains) and MidnightDataSourceMetrics (midnight-specific) to the appropriate data sources
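
To make the RAII timer pattern from the list above concrete, here is a minimal sketch using only the Rust standard library. It is not the code from this PR: `Histogram` below is a simplified stand-in for a labelled child of a Prometheus `HistogramVec`, the `sample_count` accessors exist only for illustration, and `get_block_by_hash` is a hypothetical call site with a placeholder where the SQL query would run. Only the names `MidnightDataSourceMetrics`, `SubQueryTimer`, `start_sub_query_timer` and the `cnight_get_block_by_hash` label come from the PR text.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::time::Instant;

/// Stand-in for one labelled histogram child. It only stores raw
/// observations in seconds; a real Prometheus histogram buckets them.
#[derive(Clone, Default)]
pub struct Histogram {
    observations: Arc<Mutex<Vec<f64>>>,
}

impl Histogram {
    pub fn observe(&self, seconds: f64) {
        self.observations.lock().unwrap().push(seconds);
    }

    pub fn sample_count(&self) -> usize {
        self.observations.lock().unwrap().len()
    }
}

/// Stand-in for the metrics struct: one histogram family keyed by `query_name`.
#[derive(Clone, Default)]
pub struct MidnightDataSourceMetrics {
    by_query_name: Arc<Mutex<HashMap<String, Histogram>>>,
}

impl MidnightDataSourceMetrics {
    /// Resolve the label once, then hand the child histogram to the guard.
    pub fn start_sub_query_timer(&self, query_name: &str) -> SubQueryTimer {
        let histogram = self
            .by_query_name
            .lock()
            .unwrap()
            .entry(query_name.to_string())
            .or_default()
            .clone();
        SubQueryTimer { histogram, started: Instant::now() }
    }

    /// Illustration-only accessor: number of samples recorded for a label.
    pub fn sample_count(&self, query_name: &str) -> usize {
        self.by_query_name
            .lock()
            .unwrap()
            .get(query_name)
            .map_or(0, Histogram::sample_count)
    }
}

/// RAII guard: records the elapsed time into the captured histogram on drop.
pub struct SubQueryTimer {
    histogram: Histogram,
    started: Instant,
}

impl Drop for SubQueryTimer {
    fn drop(&mut self) {
        self.histogram.observe(self.started.elapsed().as_secs_f64());
    }
}

/// Hypothetical call site: the guard lives for exactly one simulated SQL call.
pub fn get_block_by_hash(metrics: &MidnightDataSourceMetrics) -> u64 {
    // Bind to `_timer`, not `_`, so the guard is dropped at the end of the scope.
    let _timer = metrics.start_sub_query_timer("cnight_get_block_by_hash");
    42 // placeholder for the result of the awaited SQL query
}
```

Because the guard records on drop, the sample is taken on every exit path from the scope, including early returns through `?`.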

Design Decisions

  1. SQL-level timing — Each timer wraps a single SQL query call, giving precise per-query latency.
  2. Custom SubQueryTimer RAII guard — Captures the Prometheus Histogram handle at construction, avoiding a second label lookup at drop time.
  3. Separate metrics namespace: midnight_data_source_query_* is distinct from partner-chains' partner_chains_data_source_method_*, avoiding label confusion.
  4. No call counter: a Prometheus histogram already exposes a _count series (alongside _sum and _bucket), so a separate CounterVec would be redundant.
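
The "no call counter" decision can be illustrated with a toy histogram, sketched here with the Rust standard library only. `MiniHistogram` is a hypothetical stand-in, not the `prometheus` crate type; the bucket bounds and observed values are invented, and `render` only approximates the text exposition format. The point is that every histogram carries a count and a sum next to its cumulative buckets.

```rust
/// Toy stand-in for a Prometheus histogram: cumulative buckets plus sum and count.
pub struct MiniHistogram {
    upper_bounds: Vec<f64>,  // bucket upper bounds ("le"), ascending
    bucket_counts: Vec<u64>, // cumulative number of observations <= each bound
    sum: f64,
    count: u64,
}

impl MiniHistogram {
    pub fn new(upper_bounds: Vec<f64>) -> Self {
        let bucket_counts = vec![0; upper_bounds.len()];
        Self { upper_bounds, bucket_counts, sum: 0.0, count: 0 }
    }

    /// Record one duration in seconds.
    pub fn observe(&mut self, value: f64) {
        for (bound, bucket) in self.upper_bounds.iter().zip(self.bucket_counts.iter_mut()) {
            if value <= *bound {
                *bucket += 1;
            }
        }
        self.sum += value;
        self.count += 1;
    }

    /// Render the series a scrape would roughly show for one `query_name` label.
    pub fn render(&self, name: &str, query_name: &str) -> Vec<String> {
        let mut lines = Vec::new();
        for (bound, bucket) in self.upper_bounds.iter().zip(&self.bucket_counts) {
            lines.push(format!(
                "{}_bucket{{query_name=\"{}\",le=\"{}\"}} {}",
                name, query_name, bound, bucket
            ));
        }
        // The implicit +Inf bucket always equals the total count.
        lines.push(format!(
            "{}_bucket{{query_name=\"{}\",le=\"+Inf\"}} {}",
            name, query_name, self.count
        ));
        lines.push(format!("{}_sum{{query_name=\"{}\"}} {}", name, query_name, self.sum));
        lines.push(format!("{}_count{{query_name=\"{}\"}} {}", name, query_name, self.count));
        lines
    }
}
```

Observing two durations and rendering under the name midnight_data_source_query_time_elapsed yields a `_count` line whose value is the number of timed calls, which is the information a separate counter would have duplicated.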

Submission Checklist

  • Changes are backward-compatible (or flagged if breaking)
  • Pull request description explains why the change is needed
  • Self-reviewed the diff
  • I have included a change file, or skipped for this reason:
  • If the changes introduce a new feature, I have bumped the node minor version
  • Update documentation (if relevant)
  • No new todos introduced

Fork Strategy

  • Node Runtime Update
  • Node Client Update
  • Other
  • N/A

TODO before merging

  • Ready for review

@m2ux m2ux requested a review from a team as a code owner March 11, 2026 12:07
github-actions Bot commented Mar 11, 2026

Contributor

KICS version: v2.1.19

Category    Results
CRITICAL    0
HIGH        2
MEDIUM      52
LOW         3
INFO        64
TRACE       0
TOTAL       121

Metric                       Values
Files scanned                27
Files parsed                 27
Files failed to scan         0
Total executed queries       73
Queries failed to execute    0
Execution time               11

@LGLO LGLO left a comment

Contributor

This PR does not do what the ticket asks for.

Problem Statement
Current state:

The Midnight node queries Cardano's DBSync database multiple times per block and per epoch, but only a single aggregate timing histogram (data_source_method_time_elapsed) is exposed via Prometheus.

It is impossible to determine which specific DBSync query is slow — the histogram combines all queries into one bucket set.

What is desired by the ticket:

Each DBSync query has its own labelled Prometheus timing histogram, enabling per-query latency comparison between preview and mainnet.

In other words, the ticket is not about copying the Partner Chains solution wholesale, but about getting better insight than the PC metrics give. Metrics should be recorded inside the methods, at the level of individual SQL queries.

@m2ux m2ux self-assigned this Mar 11, 2026
m2ux commented Mar 11, 2026

Contributor Author

> This PR does not do what the ticket asks for.
>
> Problem Statement
> Current state:
>
> The Midnight node queries Cardano's DBSync database multiple times per block and per epoch, but only a single aggregate timing histogram (data_source_method_time_elapsed) is exposed via Prometheus.
>
> It is impossible to determine which specific DBSync query is slow: the histogram combines all queries into one bucket set.
>
> What is desired by the ticket:
>
> Each DBSync query has its own labelled Prometheus timing histogram, enabling per-query latency comparison between preview and mainnet.
>
> In other words, the ticket is not about copying the Partner Chains solution wholesale, but about getting better insight than the PC metrics give. Metrics should be recorded inside the methods, at the level of individual SQL queries.

Understood. Will re-evaluate the changes needed.

@gilescope gilescope left a comment

Contributor

I can't see any way this will slow down the db calls, so I'm all for this.
(But yeah we do need per-query resolution)

@m2ux m2ux requested a review from LGLO March 11, 2026 15:19
@m2ux m2ux changed the title from "feat: add Prometheus per-query metrics for midnight data sources" to "feat: add Prometheus per-query metrics for midnight data sources [PM-22100]" Mar 11, 2026
@m2ux m2ux enabled auto-merge March 11, 2026 16:24
m2ux and others added 8 commits March 11, 2026 16:35
Introduce MidnightDataSourceMetrics with public accessor methods,
replacing the upstream McFollowerMetrics whose accessors are
crate-private in partner-chains v1.8.1. The local observed_async_trait!
macro now records call counts and timing histograms for all six
midnight-specific data source methods (candidates, cnight observation,
federated authority observation).

JIRA: PM-22100
Made-with: Cursor
* update scanner action to latest version

Signed-off-by: Giles Cope <gilescope@gmail.com>

* update scanner action to latest version

Signed-off-by: Giles Cope <gilescope@gmail.com>

* ci: rename workflow name from build to scan

Signed-off-by: Giles Cope <gilescope@gmail.com>

---------

Signed-off-by: Giles Cope <gilescope@gmail.com>
…rces

Add individual Prometheus timing histograms for each SQL query inside
midnight-node data source methods. This provides per-query latency
visibility alongside the existing method-level timing, enabling
precise identification of slow DBSync queries on mainnet.

13 sub-query timers added across 3 data sources:
- cNight observation: 5 queries (block lookup + 4 concurrent UTXO queries)
- Federated authority: 3 queries (block lookup + 2 governance UTXOs)
- Candidates: 5 queries across 3 methods

Ref: PM-22100
Made-with: Cursor
Replace 13 inline timer patterns with a shared start_sub_query_timer()
helper and SubQueryTimer RAII guard in the metrics module. Each call
site reduces from 3 lines to 1.

Made-with: Cursor
… only

Remove the observed_async_trait! macro timing (method-level) since
per-SQL-query timing provides more useful granularity. Simplify
MidnightDataSourceMetrics to histogram-only (remove unused call counter).
Rename metric to midnight_data_source_query_time_elapsed with query_name
label for clarity.

Made-with: Cursor
@m2ux m2ux force-pushed the task/PM-22100-dbsync-query-metrics branch from 753c8e7 to ad33938 Compare March 11, 2026 16:35
@m2ux m2ux requested review from a team as code owners March 11, 2026 16:35
@m2ux m2ux added this pull request to the merge queue Mar 11, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to Branch Protection failures Mar 11, 2026
@gilescope gilescope added this pull request to the merge queue Mar 12, 2026
Merged via the queue into main with commit ade4507 Mar 12, 2026
34 checks passed
@gilescope gilescope deleted the task/PM-22100-dbsync-query-metrics branch March 12, 2026 13:17
justinfrevert pushed a commit that referenced this pull request Mar 31, 2026
…22100] (#904)

* feat: add Prometheus per-query metrics to midnight data sources

Introduce MidnightDataSourceMetrics with public accessor methods,
replacing the upstream McFollowerMetrics whose accessors are
crate-private in partner-chains v1.8.1. The local observed_async_trait!
macro now records call counts and timing histograms for all six
midnight-specific data source methods (candidates, cnight observation,
federated authority observation).

JIRA: PM-22100
Made-with: Cursor

* ci: normalise scan job name (#823)

* update scanner action to latest version

Signed-off-by: Giles Cope <gilescope@gmail.com>

* update scanner action to latest version

Signed-off-by: Giles Cope <gilescope@gmail.com>

* ci: rename workflow name from build to scan

Signed-off-by: Giles Cope <gilescope@gmail.com>

---------

Signed-off-by: Giles Cope <gilescope@gmail.com>

* chore: add change file for PM-22100 dbsync query metrics

Made-with: Cursor

* feat: add sub-query SQL-level Prometheus timing for midnight data sources

Add individual Prometheus timing histograms for each SQL query inside
midnight-node data source methods. This provides per-query latency
visibility alongside the existing method-level timing, enabling
precise identification of slow DBSync queries on mainnet.

13 sub-query timers added across 3 data sources:
- cNight observation: 5 queries (block lookup + 4 concurrent UTXO queries)
- Federated authority: 3 queries (block lookup + 2 governance UTXOs)
- Candidates: 5 queries across 3 methods

Ref: PM-22100
Made-with: Cursor

* refactor: extract sub-query timer helper to reduce repetition

Replace 13 inline timer patterns with a shared start_sub_query_timer()
helper and SubQueryTimer RAII guard in the metrics module. Each call
site reduces from 3 lines to 1.

Made-with: Cursor

* refactor: remove method-level timing, keep SQL-level sub-query timers only

Remove the observed_async_trait! macro timing (method-level) since
per-SQL-query timing provides more useful granularity. Simplify
MidnightDataSourceMetrics to histogram-only (remove unused call counter).
Rename metric to midnight_data_source_query_time_elapsed with query_name
label for clarity.

Made-with: Cursor

* style: apply rustfmt to candidates data source

Made-with: Cursor

* chore: update changes file to reflect SQL-level sub-query timing

Made-with: Cursor

---------

Signed-off-by: Giles Cope <gilescope@gmail.com>
Co-authored-by: Squirrel <giles.cope@shielded.io>
gilescope pushed a commit that referenced this pull request Apr 8, 2026
@gilescope gilescope added this to the node-1.0.0 milestone Apr 10, 2026
m2ux added a commit that referenced this pull request Apr 23, 2026
Signed-off-by: Mike Clay <mike.clay@shielded.io>
m2ux added a commit that referenced this pull request Apr 23, 2026
Signed-off-by: Mike Clay <mike.clay@shielded.io>
5 participants