feat(l1): add peer table timing metrics (Kademlia baseline)#6459
Conversation
Add Prometheus histograms to measure peer table operation latencies on the current flat IndexMap implementation. These metrics provide the baseline for comparing against the k-bucket implementation in #6458. Metrics added: - ethrex_kademlia_insert_contact_duration_seconds: times contact insertion via IndexMap::entry in do_new_contacts - ethrex_kademlia_iter_contacts_duration_seconds: times full table scan in do_get_closest_nodes Also adds three Grafana panels to the P2P Packets dashboard for visualizing p50/p99 latencies and operation rates.
🤖 Kimi Code ReviewThe PR looks good overall. The metrics are properly feature-gated, histogram buckets are appropriately sized for the expected latency ranges, and the Grafana dashboard JSON is syntactically correct. Minor observation: In #[cfg(feature = "metrics")]
use ethrex_metrics::p2p::METRICS_P2P;Correctness check:
Security/Performance:
The implementation is correct and ready to merge. Automated review by Kimi (Moonshot AI) · kimi-k2.5 · custom prompt |
🤖 Codex Code ReviewFindings:
No consensus, EVM, trie, or validation-risk issues stood out in this diff beyond the observability accuracy concerns above. I could not run a full Automated review by OpenAI Codex · gpt-5.4 · custom prompt |
Lines of code reportTotal lines added: Detailed view |
🤖 Claude Code ReviewNow I have everything I need to write the review. Review:
|
🤖 Kimi Code ReviewThe PR adds Prometheus histogram metrics for Kademlia peer table operations and corresponding Grafana dashboard panels. Overall, the implementation is correct and follows established patterns in the codebase. Minor suggestions:
Verification notes:
Conclusion: Code is correct, secure, and idiomatic. No blocking issues. Automated review by Kimi (Moonshot AI) · kimi-k2.5 · custom prompt |
Greptile SummaryThis PR adds Prometheus histogram metrics to measure peer table operation latencies (
Confidence Score: 3/5Safe to merge functionally, but the insert_contact metric will measure skewed latencies for new contacts, undermining its value as a baseline. The code is correct and won't cause crashes or regressions. However, the stated purpose of the PR is to establish a reliable baseline for comparing with the k-bucket implementation, and the timer boundary bug means the insert metric conflates IndexMap time with async notification overhead for new contacts, reducing comparability. crates/networking/p2p/peer_table.rs — timer placement in do_new_contacts
|
| Filename | Overview |
|---|---|
| crates/blockchain/metrics/p2p.rs | Adds two Prometheus Histogram fields and their observe methods to MetricsP2P; buckets and registration look correct. |
| crates/networking/p2p/peer_table.rs | Metrics instrumentation added, but the insert_contact timer encompasses the async record_new_discovery() call for new contacts, skewing the measurement. |
| metrics/provisioning/grafana/dashboards/common_dashboards/p2p_packets.json | Three new Grafana panels added (insert p50/p99, iter p50/p99, ops rate); PromQL queries and panel IDs look correct. |
Sequence Diagram
sequenceDiagram
participant Caller
participant PeerTableServer
participant IndexMap
participant Prometheus
Note over PeerTableServer: do_new_contacts()
Caller->>PeerTableServer: new_contacts(nodes)
loop for each node
PeerTableServer->>PeerTableServer: insert_start = Instant::now()
PeerTableServer->>IndexMap: contacts.entry(node_id)
alt Vacant
IndexMap-->>PeerTableServer: VacantEntry
PeerTableServer->>IndexMap: insert(Contact)
PeerTableServer->>PeerTableServer: record_new_discovery().await ⚠️ included in timer
else Occupied
IndexMap-->>PeerTableServer: OccupiedEntry
PeerTableServer->>IndexMap: add_protocol()
end
PeerTableServer->>Prometheus: observe(insert_start.elapsed())
end
Note over PeerTableServer: do_get_closest_nodes()
Caller->>PeerTableServer: get_closest_nodes(node_id)
PeerTableServer->>PeerTableServer: scan_start = Instant::now()
PeerTableServer->>IndexMap: iterate all contacts
IndexMap-->>PeerTableServer: Vec<Node>
PeerTableServer->>Prometheus: observe(scan_start.elapsed())
PeerTableServer-->>Caller: closest nodes
Prompt To Fix All With AI
This is a comment left during a code review.
Path: crates/networking/p2p/peer_table.rs
Line: 999-1017
Comment:
**Timer includes async `record_new_discovery()` for new contacts**
`insert_start` is created before `self.contacts.entry(node_id)` but the observation happens after the entire `match` block — which, for the `Vacant` arm, contains `METRICS.record_new_discovery().await`. This means the measured duration for genuinely new contacts includes the async overhead of `record_new_discovery()`, not just the IndexMap entry operation, making the metric inconsistent across the two arms and potentially noisy compared to the Occupied arm baseline.
To measure only the IndexMap operation, move the observation inside the match arms before the async call:
```rust
match self.contacts.entry(node_id) {
Entry::Vacant(vacant_entry) => {
vacant_entry.insert(Contact::new(node, protocol));
#[cfg(feature = "metrics")]
{
use ethrex_metrics::p2p::METRICS_P2P;
METRICS_P2P.observe_insert_contact_duration(insert_start.elapsed().as_secs_f64());
}
METRICS.record_new_discovery().await;
}
Entry::Occupied(mut occupied_entry) => {
occupied_entry.get_mut().add_protocol(protocol);
#[cfg(feature = "metrics")]
{
use ethrex_metrics::p2p::METRICS_P2P;
METRICS_P2P.observe_insert_contact_duration(insert_start.elapsed().as_secs_f64());
}
}
}
```
How can I resolve this? If you propose a fix, please make it concise.Reviews (1): Last reviewed commit: "feat(l1): add peer table timing metrics ..." | Re-trigger Greptile
| @@ -996,6 +1009,12 @@ impl PeerTableServer { | |||
| occupied_entry.get_mut().add_protocol(protocol); | |||
| } | |||
| } | |||
|
|
|||
| #[cfg(feature = "metrics")] | |||
| { | |||
| use ethrex_metrics::p2p::METRICS_P2P; | |||
| METRICS_P2P.observe_insert_contact_duration(insert_start.elapsed().as_secs_f64()); | |||
| } | |||
There was a problem hiding this comment.
Timer includes async
record_new_discovery() for new contacts
insert_start is created before self.contacts.entry(node_id) but the observation happens after the entire match block — which, for the Vacant arm, contains METRICS.record_new_discovery().await. This means the measured duration for genuinely new contacts includes the async overhead of record_new_discovery(), not just the IndexMap entry operation, making the metric inconsistent across the two arms and potentially noisy compared to the Occupied arm baseline.
To measure only the IndexMap operation, move the observation inside the match arms before the async call:
match self.contacts.entry(node_id) {
Entry::Vacant(vacant_entry) => {
vacant_entry.insert(Contact::new(node, protocol));
#[cfg(feature = "metrics")]
{
use ethrex_metrics::p2p::METRICS_P2P;
METRICS_P2P.observe_insert_contact_duration(insert_start.elapsed().as_secs_f64());
}
METRICS.record_new_discovery().await;
}
Entry::Occupied(mut occupied_entry) => {
occupied_entry.get_mut().add_protocol(protocol);
#[cfg(feature = "metrics")]
{
use ethrex_metrics::p2p::METRICS_P2P;
METRICS_P2P.observe_insert_contact_duration(insert_start.elapsed().as_secs_f64());
}
}
}Prompt To Fix With AI
This is a comment left during a code review.
Path: crates/networking/p2p/peer_table.rs
Line: 999-1017
Comment:
**Timer includes async `record_new_discovery()` for new contacts**
`insert_start` is created before `self.contacts.entry(node_id)` but the observation happens after the entire `match` block — which, for the `Vacant` arm, contains `METRICS.record_new_discovery().await`. This means the measured duration for genuinely new contacts includes the async overhead of `record_new_discovery()`, not just the IndexMap entry operation, making the metric inconsistent across the two arms and potentially noisy compared to the Occupied arm baseline.
To measure only the IndexMap operation, move the observation inside the match arms before the async call:
```rust
match self.contacts.entry(node_id) {
Entry::Vacant(vacant_entry) => {
vacant_entry.insert(Contact::new(node, protocol));
#[cfg(feature = "metrics")]
{
use ethrex_metrics::p2p::METRICS_P2P;
METRICS_P2P.observe_insert_contact_duration(insert_start.elapsed().as_secs_f64());
}
METRICS.record_new_discovery().await;
}
Entry::Occupied(mut occupied_entry) => {
occupied_entry.get_mut().add_protocol(protocol);
#[cfg(feature = "metrics")]
{
use ethrex_metrics::p2p::METRICS_P2P;
METRICS_P2P.observe_insert_contact_duration(insert_start.elapsed().as_secs_f64());
}
}
}
```
How can I resolve this? If you propose a fix, please make it concise.
🤖 Codex Code Review
No consensus, gas-accounting, trie, RLP, or transaction-validation concerns in this PR; it is limited to observability. I couldn’t run Automated review by OpenAI Codex · gpt-5.4 · custom prompt |
🤖 Claude Code ReviewNow I have full context. Let me write the review. Review: feat(l1): add peer table timing metrics (Kademlia baseline)Overall: Straightforward observability addition. The Bug:
|
| #[cfg(feature = "metrics")] | ||
| { | ||
| use ethrex_metrics::p2p::METRICS_P2P; | ||
| METRICS_P2P.observe_insert_contact_duration(insert_start.elapsed().as_secs_f64()); |
There was a problem hiding this comment.
The timing window here includes the METRICS.record_new_discovery().await call inside the Vacant arm (line 1005 on the PR branch). That async operation will dominate the measurement for new contacts, making the histogram reflect "insertion + async metrics recording" rather than the IndexMap::entry cost you're trying to baseline.
Consider moving insert_start to just before the match and observing just after, but wrapping only the match itself — or narrower: start the timer immediately before contacts.entry() and observe right after the match closes but before record_new_discovery. Alternatively, you could move the record_new_discovery call to after the metrics observation:
let new_contact = match self.contacts.entry(node_id) {
Entry::Vacant(vacant_entry) => {
vacant_entry.insert(Contact::new(node, protocol));
true
}
Entry::Occupied(mut occupied_entry) => {
occupied_entry.get_mut().add_protocol(protocol);
false
}
};
#[cfg(feature = "metrics")]
{
use ethrex_metrics::p2p::METRICS_P2P;
METRICS_P2P.observe_insert_contact_duration(insert_start.elapsed().as_secs_f64());
}
if new_contact {
METRICS.record_new_discovery().await;
}| ] | ||
| } | ||
| , | ||
| { |
There was a problem hiding this comment.
nit: The trailing comma before the new JSON block (}, on the previous context line followed by + ,) produces }, , which is unusual formatting. It works because the previous element already had its comma, but this leading-comma style is inconsistent with the rest of the file. Consider appending the comma to the preceding } instead (i.e., adding the new panels as a natural continuation of the array).
Summary
Add Prometheus histogram metrics to measure peer table operation latencies on the current flat
IndexMapimplementation. This provides the baseline for comparing against the k-bucket implementation in #6458.Metrics added (behind
#[cfg(feature = "metrics")]):ethrex_kademlia_insert_contact_duration_seconds— times contact insertion viaIndexMap::entryindo_new_contactsethrex_kademlia_iter_contacts_duration_seconds— times full table scan indo_get_closest_nodesGrafana panels added to P2P Packets dashboard:
Workflow
Test plan
cargo clippypasses with and withoutmetricsfeature--metricsflag