Add timing for block availability #5510
Conversation
```rust
    .slot_clock
    .now_duration()
    .ok_or(AvailabilityCheckError::SlotClockError)?;
results.push(MaybeAvailableBlock::Available(AvailableBlock {
```
I'm fairly certain this function is only called during range-sync or backfill. So these blocks are not related to the current slot. Also due to the way sync works, none of the blocks hit this cache until all the data for all the blocks in the request are downloaded. Thus it's a bit meaningless to measure how far into the current slot they became available. Might be a good idea to exclude them from the data?
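As a sketch of the exclusion being suggested here (all names are hypothetical, not Lighthouse's actual API), the metric observation could be gated on whether the block belongs to the current slot, so range-sync and backfill blocks never pollute the timing data:

```rust
use std::time::Duration;

/// Hypothetical helper: only record an availability delay for blocks from the
/// current slot. Blocks from past slots arrive via range-sync or backfill, so
/// "how far into the current slot they became available" is meaningless for them.
fn availability_delay_to_record(
    block_slot: u64,
    current_slot: u64,
    seen_duration_into_slot: Duration,
) -> Option<Duration> {
    if block_slot != current_slot {
        // Range-sync / backfill block: skip the metric entirely.
        return None;
    }
    Some(seen_duration_into_slot)
}

fn main() {
    // A block for the current slot is recorded as-is.
    assert_eq!(
        availability_delay_to_record(100, 100, Duration::from_secs(3)),
        Some(Duration::from_secs(3))
    );
    // A backfill block from an old slot is excluded.
    assert_eq!(
        availability_delay_to_record(50, 100, Duration::from_secs(3)),
        None
    );
}
```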
```rust
    blobs: Option<BlobSidecarList<E>>,
    /// Timestamp at which this block first became available (UNIX timestamp, time since 1970).
    available_timestamp: Duration,
}
```
Storing the timestamp here will force you to include the timestamp for backfill & range-sync blocks, which will only pollute the data. I think a better way would be to include this timestamp in the AvailableExecutedBlock and then populate it inside make_available(). This will probably give what I believe you actually want, as make_available() is the final destination of the 3 entry points through which new data can come in and complete a block.
The other entry points don't actually hit the DA cache, as they assume all blobs are already present.
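A minimal sketch of this suggestion, using heavily simplified stand-in types (PendingComponents, AvailableBlock, and the field name are placeholders, not the real Lighthouse definitions): the availability timestamp is stamped once, inside make_available(), the single point that all entry paths converge on.

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

// Simplified stand-in for the cache entry that gathers a block's components.
struct PendingComponents { /* block, blobs, ... */ }

// Simplified stand-in for the fully-available block type.
struct AvailableBlock {
    // Time since the UNIX epoch at which the block became fully available.
    available_timestamp: Duration,
}

// Stamp availability at the moment all components are complete, so the
// timestamp is recorded exactly once, at the convergence point.
fn make_available(_components: PendingComponents) -> AvailableBlock {
    let available_timestamp = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system clock is set before the UNIX epoch");
    AvailableBlock { available_timestamp }
}

fn main() {
    let block = make_available(PendingComponents {});
    // Sanity check: the timestamp is a plausible post-1970 wall-clock value.
    assert!(block.available_timestamp.as_secs() > 0);
}
```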
I tried doing this just now but got stuck, because filling in the available_timestamp on AvailableExecutedBlock here would require it to be part of the AvailableBlock anyway:

lighthouse/beacon_node/beacon_chain/src/block_verification_types.rs, lines 194 to 200 at 3058b96
@AgeManning I pushed a commit here (73cb982) which I think fixes the issues with RPC blobs not being recorded. We now record a seen timestamp for every blob, and when we call […]. From watching the logs it seems to produce similar results. If you could try your metrics dash with the latest commit and confirm that it looks as expected, that would be great 🙏
@Mergifyio queue |
🛑 The pull request has been removed from the queue
Marking as a breaking change because this PR deletes and changes some metrics related to block processing. |
@Mergifyio queue |
🛑 The pull request has been removed from the queue
@Mergifyio requeue |
✅ This pull request will be re-embarked automatically
✅ The pull request has been merged automatically at 72a3360
```rust
);
pub static ref BEACON_BLOCK_DELAY_HEAD_SLOT_START_EXCEEDED_TOTAL: Result<IntCounter> = try_create_int_counter(
    "beacon_block_delay_head_slot_start_exceeded_total",
    "A counter that is triggered when the duration between the start of the block's slot and the current time \
```
@michaelsproul why did you change from histogram to gauge here?
that was @AgeManning. The thinking was to get exact values rather than buckets.
hmm I guess the average observation frequency is 1 every 12 seconds so kinda ok? Still weird tho
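To illustrate the trade-off being discussed: a histogram collapses each observed delay into a bucket upper bound, while a gauge keeps the exact last value, which is cheap enough at roughly one observation per 12-second slot. A toy sketch (the bucket boundaries are illustrative, not Lighthouse's actual configuration):

```rust
// Return the upper bound of the first histogram bucket that the observation
// falls into, or u64::MAX to stand in for the +Inf bucket. This is the
// information a histogram retains; a gauge would keep `delay_ms` exactly.
fn histogram_bucket(delay_ms: u64, buckets: &[u64]) -> u64 {
    buckets
        .iter()
        .copied()
        .find(|&upper| delay_ms <= upper)
        .unwrap_or(u64::MAX)
}

fn main() {
    let buckets = [250, 500, 1000, 2000, 4000];
    // An exact 1350 ms delay is only recorded as "<= 2000 ms" by a histogram;
    // a gauge would retain 1350 exactly.
    assert_eq!(histogram_bucket(1_350, &buckets), 2000);
    // A very large delay falls into the +Inf bucket.
    assert_eq!(histogram_bucket(10_000, &buckets), u64::MAX);
}
```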
Issue Addressed
Presently the block import metrics that measure block observation delay and import delay (time to import a block) are unaware of blob-processing and related delays.
This can lead to the import_delay seeming artificially high, because it also factors in the time spent waiting for blobs to arrive.

Proposed Changes

- Add an available timestamp to the block delay cache, which records when a block became fully available.
- Add an attestable timestamp that records when a block became capable of being attested to (this allows us to ignore the non-critical disk writes and similar that are present in the import_delay). Revived from an old PR: Add basic cause analysis for delayed head blocks #3232.
- Add a debug log with the exact time taken by each newPayload call to the EL. This helps per-block investigations and is clearer and more accurate than the metric histogram. We can also use the exact ms value combined with the log timestamp and the observed_delay to calculate the time taken doing other stuff prior to calling the EL (log_timestamp - slot_start - time_taken_ms - observed_delay).

Additional Info
The available delay currently reported by this PR is much higher than I expected, and sometimes seems to include things like snapshot cache misses. I need to do some more investigating to make sure I've put the observation of availability at the earliest possible point.
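The pre-EL overhead arithmetic from the proposed changes (log_timestamp - slot_start - time_taken_ms - observed_delay) could be sketched as below; the function name and signature are illustrative only, not part of Lighthouse:

```rust
use std::time::Duration;

// Residual time spent on work other than the EL call, per the PR description:
// take the wall-clock time of the newPayload debug log relative to slot start,
// then subtract the exact newPayload duration and the observed_delay.
// Returns None if any subtraction would underflow (e.g. clock skew).
fn pre_el_overhead(
    log_timestamp: Duration,    // wall-clock offset of the debug log from epoch
    slot_start: Duration,       // wall-clock offset of the slot start from epoch
    new_payload_time: Duration, // exact duration of the newPayload call
    observed_delay: Duration,   // delay before the block was first observed
) -> Option<Duration> {
    log_timestamp
        .checked_sub(slot_start)?
        .checked_sub(new_payload_time)?
        .checked_sub(observed_delay)
}

fn main() {
    // Log at 5 s into the slot, newPayload took 1.2 s, block observed at 2 s:
    // 5000 - 1200 - 2000 = 1800 ms of other pre-EL work.
    assert_eq!(
        pre_el_overhead(
            Duration::from_millis(5000),
            Duration::from_millis(0),
            Duration::from_millis(1200),
            Duration::from_millis(2000),
        ),
        Some(Duration::from_millis(1800))
    );
}
```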