Add timing for block availability #5510
Conversation
```rust
    .slot_clock
    .now_duration()
    .ok_or(AvailabilityCheckError::SlotClockError)?;
results.push(MaybeAvailableBlock::Available(AvailableBlock {
```
I'm fairly certain this function is only called during range-sync or backfill. So these blocks are not related to the current slot. Also due to the way sync works, none of the blocks hit this cache until all the data for all the blocks in the request are downloaded. Thus it's a bit meaningless to measure how far into the current slot they became available. Might be a good idea to exclude them from the data?
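As a sketch of the exclusion being suggested here (all names are hypothetical, not Lighthouse's actual API), the metric observation could be gated on whether the block belongs to the current slot, so range-sync and backfill blocks never pollute the timing data:

```rust
use std::time::Duration;

/// Hypothetical helper: only record an availability delay for blocks from the
/// current slot. Blocks from past slots arrive via range-sync or backfill, so
/// "how far into the current slot they became available" is meaningless for them.
fn availability_delay_to_record(
    block_slot: u64,
    current_slot: u64,
    seen_duration_into_slot: Duration,
) -> Option<Duration> {
    if block_slot != current_slot {
        // Range-sync / backfill block: skip the metric entirely.
        return None;
    }
    Some(seen_duration_into_slot)
}

fn main() {
    // A block for the current slot is recorded as-is.
    assert_eq!(
        availability_delay_to_record(100, 100, Duration::from_secs(3)),
        Some(Duration::from_secs(3))
    );
    // A backfill block from an old slot is excluded.
    assert_eq!(
        availability_delay_to_record(50, 100, Duration::from_secs(3)),
        None
    );
}
```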
```rust
    blobs: Option<BlobSidecarList<E>>,
    /// Timestamp at which this block first became available (UNIX timestamp, time since 1970).
    available_timestamp: Duration,
}
```
Storing the timestamp here will force you to include the timestamp for backfill & range-sync blocks, which will only pollute the data. I think a better way would be to include this timestamp in the AvailableExecutedBlock and then populate it inside make_available(). This will probably give what I believe you actually want, as make_available() is the final destination of the 3 entry points through which new data can come in and complete a block.
The other entry points don't actually hit the DA cache, as they assume all blobs are already present.
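A minimal sketch of this suggestion, using heavily simplified stand-in types (PendingComponents, AvailableBlock, and the field name are placeholders, not the real Lighthouse definitions): the availability timestamp is stamped once, inside make_available(), the single point that all entry paths converge on.

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

// Simplified stand-in for the cache entry that gathers a block's components.
struct PendingComponents { /* block, blobs, ... */ }

// Simplified stand-in for the fully-available block type.
struct AvailableBlock {
    // Time since the UNIX epoch at which the block became fully available.
    available_timestamp: Duration,
}

// Stamp availability at the moment all components are complete, so the
// timestamp is recorded exactly once, at the convergence point.
fn make_available(_components: PendingComponents) -> AvailableBlock {
    let available_timestamp = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system clock is set before the UNIX epoch");
    AvailableBlock { available_timestamp }
}

fn main() {
    let block = make_available(PendingComponents {});
    // Sanity check: the timestamp is a plausible post-1970 wall-clock value.
    assert!(block.available_timestamp.as_secs() > 0);
}
```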
I tried doing this just now but got stuck, because filling in the available_timestamp on AvailableExecutedBlock here would require it to be part of the AvailableBlock anyway:

lighthouse/beacon_node/beacon_chain/src/block_verification_types.rs, lines 194 to 200 at 3058b96
@AgeManning I pushed a commit here (73cb982) which I think fixes the issues with RPC blobs not being recorded. We now record a seen timestamp for every blob, and when we call […]. From watching the logs it seems to produce similar results. If you could try your metrics dash with the latest commit and confirm that it looks as expected, that would be great 🙏
@Mergifyio queue |
🛑 The pull request has been removed from the queue
Marking as a breaking change because this PR deletes and changes some metrics related to block processing. |
@Mergifyio queue |
🛑 The pull request has been removed from the queue
@Mergifyio requeue |
✅ This pull request will be re-embarked automatically
✅ The pull request has been merged automatically at 72a3360
```rust
);
pub static ref BEACON_BLOCK_DELAY_HEAD_SLOT_START_EXCEEDED_TOTAL: Result<IntCounter> = try_create_int_counter(
    "beacon_block_delay_head_slot_start_exceeded_total",
    "A counter that is triggered when the duration between the start of the block's slot and the current time \
```
@michaelsproul why did you change from histogram to gauge here?
that was @AgeManning. The thinking was to get exact values rather than buckets.
hmm I guess the average observation frequency is 1 every 12 seconds so kinda ok? Still weird tho
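To illustrate the trade-off being discussed: a histogram collapses each observed delay into a bucket upper bound, while a gauge keeps the exact last value, which is cheap enough at roughly one observation per 12-second slot. A toy sketch (the bucket boundaries are illustrative, not Lighthouse's actual configuration):

```rust
// Return the upper bound of the first histogram bucket that the observation
// falls into, or u64::MAX to stand in for the +Inf bucket. This is the
// information a histogram retains; a gauge would keep `delay_ms` exactly.
fn histogram_bucket(delay_ms: u64, buckets: &[u64]) -> u64 {
    buckets
        .iter()
        .copied()
        .find(|&upper| delay_ms <= upper)
        .unwrap_or(u64::MAX)
}

fn main() {
    let buckets = [250, 500, 1000, 2000, 4000];
    // An exact 1350 ms delay is only recorded as "<= 2000 ms" by a histogram;
    // a gauge would retain 1350 exactly.
    assert_eq!(histogram_bucket(1_350, &buckets), 2000);
    // A very large delay falls into the +Inf bucket.
    assert_eq!(histogram_bucket(10_000, &buckets), u64::MAX);
}
```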
Issue Addressed
Presently the block import metrics that measure block observation delay and import delay (time to import a block) are unaware of blob-processing and related delays.
This can lead to the import_delay seeming artificially high, because it also factors in the time spent waiting for blobs to arrive.

Proposed Changes

- Add an available timestamp to the block delay cache, which records when a block became fully available.
- Add an attestable timestamp that records when a block became capable of being attested to (this allows us to ignore the non-critical disk writes and similar that are present in the import_delay). Revived from an old PR: Add basic cause analysis for delayed head blocks #3232.
- Add a debug log with the exact time taken by each newPayload call to the EL. This helps per-block investigations and is clearer and more accurate than the metric histogram. We can also use the exact ms value combined with the log timestamp and the observed_delay to calculate the time taken doing other stuff prior to calling the EL (log_timestamp - slot_start - time_taken_ms - observed_delay).

Additional Info
The available delay currently reported by this PR is much higher than I expected, and sometimes seems to include things like snapshot cache misses. I need to do some more investigating to make sure I've put the observation of availability at the earliest possible point.
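The pre-EL overhead arithmetic from the proposed changes (log_timestamp - slot_start - time_taken_ms - observed_delay) could be sketched as below; the function name and signature are illustrative only, not part of Lighthouse:

```rust
use std::time::Duration;

// Residual time spent on work other than the EL call, per the PR description:
// take the wall-clock time of the newPayload debug log relative to slot start,
// then subtract the exact newPayload duration and the observed_delay.
// Returns None if any subtraction would underflow (e.g. clock skew).
fn pre_el_overhead(
    log_timestamp: Duration,    // wall-clock offset of the debug log from epoch
    slot_start: Duration,       // wall-clock offset of the slot start from epoch
    new_payload_time: Duration, // exact duration of the newPayload call
    observed_delay: Duration,   // delay before the block was first observed
) -> Option<Duration> {
    log_timestamp
        .checked_sub(slot_start)?
        .checked_sub(new_payload_time)?
        .checked_sub(observed_delay)
}

fn main() {
    // Log at 5 s into the slot, newPayload took 1.2 s, block observed at 2 s:
    // 5000 - 1200 - 2000 = 1800 ms of other pre-EL work.
    assert_eq!(
        pre_el_overhead(
            Duration::from_millis(5000),
            Duration::from_millis(0),
            Duration::from_millis(1200),
            Duration::from_millis(2000),
        ),
        Some(Duration::from_millis(1800))
    );
}
```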