feat(errors): add structured error registry with classification, metr…#2261
Review Summary by Qodo
feat(errors): add structured error registry with classification, metrics, and observability
Walkthroughs
Description
• Introduces a centralized error classification system with a two-tier architecture (Tier 2: chain-specific, Tier 1: generic/transport)
• Implements 100+ named error codes across four layers (Protocol, Node, Blockchain, User) with structured metadata (code, name, category, retryable flag)
• Adds error matchers (CodeEquals, MessageContains, MessageRegex, HTTPStatusContains, GRPCCodeEquals) for flexible pattern matching
• Defines a ChainFamily enum mapping 50+ chain IDs to families (EVM, Solana, Bitcoin, Cosmos, Starknet, NEAR, Aptos, etc.)
• Implements a LavaError struct with Is() and ABCICode() methods for errors.Is() compatibility and gRPC wire protocol support
• Adds DetectConnectionError with three-layer detection (structured, string fallback, syscall errno) for connection-level failures
• Integrates structured logging via LogCodedError/LogCodedWarning with error_code, error_name, error_category, retryable, and chain_id fields
• Implements the Prometheus lava_errors_total counter with full label cardinality for observability
• Provides a legacy sdkerrors → LavaError mapping for backward compatibility
• Refactors all relay path error handlers (JSON-RPC, REST, gRPC, Tendermint) to use unified classification with chain context
• Enhances relay error selection with classification-aware precedence (majority consensus, external-beats-internal preference)
• Fixes a bug in JSON-RPC handling of empty error messages
• Adds 1000+ lines of test coverage, including registry invariants, classification validation, fixture-based regression tests, and metrics integration
Diagram
flowchart LR
A["Error Sources<br/>JSON-RPC, REST, gRPC,<br/>Tendermint, Connection"]
B["Error Classification<br/>Tier 2: Chain-specific<br/>Tier 1: Generic/Transport"]
C["LavaError Registry<br/>100+ Named Codes<br/>Chain Family Mapping"]
D["Structured Logging<br/>LogCodedError/<br/>LogCodedWarning"]
E["Prometheus Metrics<br/>lava_errors_total<br/>Full Label Cardinality"]
F["Relay Processing<br/>Error Selection<br/>Retry Logic"]
A -->|DetectConnectionError| B
A -->|ClassifyError| B
B -->|Match Against| C
B -->|Extract Metadata| D
D -->|Emit| E
B -->|Determine Behavior| F
File Changes
1. protocol/common/error_registry_test.go
Code Review by Qodo
fec73ab to 21dc454
avitenzer left a comment
Location: lava_errors_total Prometheus counter definition
Issue: Labels {error_code, error_name, error_category, retryable, chain_id} — with 100+ codes × N chains, cardinality could explode. error_name is redundant with error_code (1:1 mapping).
Suggestion: Drop error_name label from the counter (keep it in structured logs). Consider bucketing chain_id or using an exemplar instead.
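To make the cardinality concern concrete, the sketch below counts the distinct label tuples a counter would hold under three labelling schemes. It uses no Prometheus client, only a set of strings, and the sample codes, names, and chain IDs are illustrative assumptions rather than values from the PR.

```go
package main

import "fmt"

// sample is one observed error; all values used below are illustrative.
type sample struct{ code, name, chainID, family string }

// seriesCount returns the number of distinct label tuples, which is the
// number of time series a counter with those labels would create.
func seriesCount(samples []sample, key func(sample) string) int {
	seen := map[string]struct{}{}
	for _, s := range samples {
		seen[key(s)] = struct{}{}
	}
	return len(seen)
}

func keyFull(s sample) string     { return s.code + "|" + s.name + "|" + s.chainID }
func keyNoName(s sample) string   { return s.code + "|" + s.chainID }
func keyByFamily(s sample) string { return s.code + "|" + s.family }

var samples = []sample{
	{"3001", "CHAIN_NONCE_TOO_LOW", "ETH1", "evm"},
	{"3001", "CHAIN_NONCE_TOO_LOW", "POLYGON", "evm"},
	{"3001", "CHAIN_NONCE_TOO_LOW", "ARBITRUM", "evm"},
	{"2001", "NODE_METHOD_NOT_FOUND", "ETH1", "evm"},
	{"2001", "NODE_METHOD_NOT_FOUND", "SOLANA", "solana"},
}

func main() {
	fmt.Println("code+name+chain_id:", seriesCount(samples, keyFull))     // 5
	fmt.Println("code+chain_id:     ", seriesCount(samples, keyNoName))   // 5
	fmt.Println("code+family:       ", seriesCount(samples, keyByFamily)) // 3
}
```

Because error_name maps 1:1 to error_code, dropping it leaves the series count unchanged and only shortens each series' label set. In this toy data it is bucketing chain_id by family that actually reduces the number of series.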
86ebf88 to c99fb4a
…ics, and observability

Introduces a centralized error classification system that replaces ad-hoc error handling across the relay path with structured, two-tier classification (chain-specific Tier 2, generic/transport Tier 1) and Prometheus metrics.

Core components:
- Error registry with named error codes (error_codes.go), matchers (error_classifier.go), and chain family mappings for EVM, Solana, Cosmos, Bitcoin, Starknet, NEAR, and Aptos
- Structured logging via LogCodedError/LogCodedWarning with error_code, error_name, error_category, retryable, and chain_id fields
- lava_errors_total Prometheus counter with full label cardinality
- Legacy sdkerrors → LavaError mapping for backward compatibility
- DetectConnectionError for connection-level failures (timeout, refused, reset, GOAWAY, RST_STREAM, Envoy connection termination, ECONNRESET)

Classification coverage:
- JSON-RPC standard codes (-32700 to -32000) and EIP-1474 codes
- HTTP status codes (4xx, 5xx) including Cloudflare 520-530
- gRPC status codes (Unimplemented, Unavailable)
- Chain-specific errors: EVM tx errors, Solana/Bitcoin/Starknet/NEAR node errors, Cosmos tx errors
- Transport-level: connection reset, truncated JSON, rate limiting, method not found/supported variants

Integration:
- All relay path error handlers (JSON-RPC, REST, gRPC, Tendermint) classify and log with chain_id via error handler structs
- ResultsManager extracts JSON-RPC error codes from response bodies for accurate classification with chain_id context
- Smart router direct RPC path classifies with chain_id
- Provider session errors log with chain_id

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
c99fb4a to 2c64c23
Description
Closes: #XXXX
Author Checklist
All items are required. Please add a note to the item if the item is not applicable and
please add links to any relevant follow up issues.
I have...
! in the type prefix if API or client breaking change
main branch
Reviewers Checklist
All items are required. Please add a note if the item is not applicable and please add
your handle next to the items reviewed if you only reviewed selected items.
I have...