feat(errors): add structured error registry with classification, metr…#2261

Merged
nimrod-teich merged 1 commit into main from feature/error-registry-poc
Apr 9, 2026

Conversation

@NadavLevi

Contributor

…ics, and observability

Introduces a centralized error classification system that replaces ad-hoc error handling across the relay path with structured, two-tier classification (chain-specific Tier 2, generic/transport Tier 1) and Prometheus metrics.

Core components:

  • Error registry with named error codes (error_codes.go), matchers (error_classifier.go), and chain family mappings for EVM, Solana, Cosmos, Bitcoin, Starknet, NEAR, and Aptos
  • Structured logging via LogCodedError/LogCodedWarning with error_code, error_name, error_category, retryable, and chain_id fields
  • lava_errors_total Prometheus counter with full label cardinality
  • Legacy sdkerrors → LavaError mapping for backward compatibility
  • DetectConnectionError for connection-level failures (timeout, refused, reset, GOAWAY, RST_STREAM, Envoy connection termination, ECONNRESET)
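
A minimal sketch of how a registry of named error codes could look, assuming a trimmed-down `LavaError` shape. The `register` and `lookup` helpers, the field set, and the `Retryable` and `Category` values are illustrative assumptions; only the code/name pairs 1001 `PROTOCOL_CONNECTION_TIMEOUT` and 2005 `RATE_LIMITED`, the reserved code 0, and the panic-on-duplicate behavior are taken from the test excerpt quoted in this PR.

```go
package main

import "fmt"

// ErrorCategory mirrors the internal/external split named in the PR.
type ErrorCategory string

const (
	CategoryInternal ErrorCategory = "internal"
	CategoryExternal ErrorCategory = "external"
)

// LavaError is a hypothetical, trimmed-down registry entry.
type LavaError struct {
	Code      uint32
	Name      string
	Category  ErrorCategory
	Retryable bool
}

// ErrUnknown owns the reserved code 0 and is the fallback for lookups.
// Its category here is a guess.
var ErrUnknown = &LavaError{Code: 0, Name: "UNKNOWN_ERROR", Category: CategoryInternal}

// registry indexes entries by code; names records which code owns each name.
var (
	registry = map[uint32]*LavaError{}
	names    = map[string]uint32{}
)

// register stores an entry and panics on code 0 or on a duplicate code or
// name, so collisions surface at package initialization.
func register(le *LavaError) *LavaError {
	if le.Code == 0 {
		panic(fmt.Sprintf("error code 0 is reserved; %s must use a non-zero code", le.Name))
	}
	if existing, ok := registry[le.Code]; ok {
		panic(fmt.Sprintf("duplicate error code: %d (%s collides with %s)", le.Code, le.Name, existing.Name))
	}
	if code, ok := names[le.Name]; ok {
		panic(fmt.Sprintf("duplicate error name: %s (existing code %d)", le.Name, code))
	}
	registry[le.Code] = le
	names[le.Name] = le.Code
	return le
}

// The Retryable values below are illustrative, not taken from the PR.
var (
	ErrConnectionTimeout = register(&LavaError{Code: 1001, Name: "PROTOCOL_CONNECTION_TIMEOUT", Category: CategoryInternal, Retryable: true})
	ErrRateLimited       = register(&LavaError{Code: 2005, Name: "RATE_LIMITED", Category: CategoryExternal, Retryable: true})
)

// lookup returns the entry for a code, or ErrUnknown when none is registered.
func lookup(code uint32) *LavaError {
	if le, ok := registry[code]; ok {
		return le
	}
	return ErrUnknown
}

func main() {
	fmt.Println(lookup(1001).Name) // PROTOCOL_CONNECTION_TIMEOUT
	fmt.Println(lookup(9999).Name) // UNKNOWN_ERROR
}
```

Panicking at registration time trades a hard startup failure for the guarantee that two errors never share a code or a name in metrics and logs.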

Classification coverage:

  • JSON-RPC standard codes (-32700 to -32000) and EIP-1474 codes
  • HTTP status codes (4xx, 5xx) including Cloudflare 520-530
  • gRPC status codes (Unimplemented, Unavailable)
  • Chain-specific errors: EVM tx errors, Solana/Bitcoin/Starknet/NEAR node errors, Cosmos tx errors
  • Transport-level: connection reset, truncated JSON, rate limiting, method not found/supported variants
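
The two-tier lookup described above can be sketched roughly as follows: chain-specific (Tier 2) rules are tried first, then generic (Tier 1) rules. The `rule` type, the lower-case matcher helpers, the rule tables, and the name `SOLANA_MISSING_IN_LONG_TERM_STORAGE` are hypothetical; the JSON-RPC codes -32601 and -32700 are the standard "method not found" and "parse error" codes.

```go
package main

import (
	"fmt"
	"strings"
)

// ChainFamily is a simplified stand-in for the PR's chain family enum.
type ChainFamily int

const (
	FamilyUnknown ChainFamily = iota
	FamilyEVM
	FamilySolana
)

// matcher reports whether a node error (numeric code plus message) fits a pattern.
type matcher func(code int, msg string) bool

func codeEquals(want int) matcher {
	return func(code int, _ string) bool { return code == want }
}

func messageContains(sub string) matcher {
	sub = strings.ToLower(sub)
	return func(_ int, msg string) bool {
		return strings.Contains(strings.ToLower(msg), sub)
	}
}

type rule struct {
	match matcher
	name  string
}

// tier2 holds chain-specific rules (illustrative entries only).
var tier2 = map[ChainFamily][]rule{
	FamilySolana: {{match: codeEquals(-32009), name: "SOLANA_MISSING_IN_LONG_TERM_STORAGE"}},
	FamilyEVM:    {{match: messageContains("nonce too low"), name: "NONCE_TOO_LOW"}},
}

// tier1 holds generic rules that apply to every chain.
var tier1 = []rule{
	{match: codeEquals(-32601), name: "METHOD_NOT_FOUND"},
	{match: codeEquals(-32700), name: "PARSE_ERROR"},
	{match: messageContains("rate limit"), name: "RATE_LIMITED"},
}

// classify tries chain-specific rules first, then generic ones.
func classify(family ChainFamily, code int, msg string) string {
	for _, r := range tier2[family] {
		if r.match(code, msg) {
			return r.name
		}
	}
	for _, r := range tier1 {
		if r.match(code, msg) {
			return r.name
		}
	}
	return "UNKNOWN_ERROR"
}

func main() {
	fmt.Println(classify(FamilySolana, -32009, ""))           // SOLANA_MISSING_IN_LONG_TERM_STORAGE
	fmt.Println(classify(FamilyEVM, -32009, ""))              // UNKNOWN_ERROR
	fmt.Println(classify(FamilyEVM, -32000, "Nonce too low")) // NONCE_TOO_LOW
}
```

The same code (-32009 here) classifies differently depending on the chain family, which is why the handlers need chain context.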

Integration:

  • All relay path error handlers (JSON-RPC, REST, gRPC, Tendermint) classify and log with chain_id via error handler structs
  • ResultsManager extracts JSON-RPC error codes from response bodies for accurate classification with chain_id context
  • Smart router direct RPC path classifies with chain_id
  • Provider session errors log with chain_id

Description

Closes: #XXXX


Author Checklist

All items are required. Please add a note to the item if the item is not applicable and
please add links to any relevant follow up issues.

I have...

  • read the contribution guide
  • included the correct type prefix in the PR title, you can find examples of the prefixes below:
  • confirmed ! in the type prefix if API or client breaking change
  • targeted the main branch
  • provided a link to the relevant issue or specification
  • reviewed "Files changed" and left comments if necessary
  • included the necessary unit and integration tests
  • updated the relevant documentation or specification, including comments for documenting Go code
  • confirmed all CI checks have passed

Reviewers Checklist

All items are required. Please add a note if the item is not applicable and please add
your handle next to the items reviewed if you only reviewed selected items.

I have...

  • confirmed the correct type prefix in the PR title
  • confirmed all author checklist items have been addressed
  • reviewed state machine logic, API design and naming, documentation is accurate, tests and test coverage

@qodo-code-review


Review Summary by Qodo

feat(errors): add structured error registry with classification, metrics, and observability

✨ Enhancement 🧪 Tests 🐞 Bug fix

Walkthroughs

Description
• Introduces centralized error classification system with two-tier architecture (Tier 2:
  chain-specific, Tier 1: generic/transport)
• Implements 100+ named error codes across four layers (Protocol, Node, Blockchain, User) with
  structured metadata (code, name, category, retryable flag)
• Adds comprehensive error matchers (CodeEquals, MessageContains, MessageRegex,
  HTTPStatusContains, GRPCCodeEquals) for flexible pattern matching
• Defines ChainFamily enum mapping 50+ chain IDs to families (EVM, Solana, Bitcoin, Cosmos,
  Starknet, NEAR, Aptos, etc.)
• Implements LavaError struct with Is() and ABCICode() methods for error.Is() compatibility
  and gRPC wire protocol support
• Adds DetectConnectionError with three-layer detection (structured, string fallback, syscall
  errno) for connection-level failures
• Integrates structured logging via LogCodedError/LogCodedWarning with error_code, error_name,
  error_category, retryable, and chain_id fields
• Implements Prometheus lava_errors_total counter with full label cardinality for observability
• Provides legacy sdkerrors → LavaError mapping for backward compatibility
• Refactors all relay path error handlers (JSON-RPC, REST, gRPC, Tendermint) to use unified
  classification with chain context
• Enhances relay error selection with classification-aware precedence (majority consensus,
  external-beats-internal preference)
• Fixes JSON-RPC empty error message handling bug
• Adds 1000+ lines of comprehensive test coverage including registry invariants, classification
  validation, fixture-based regression tests, and metrics integration
Diagram
flowchart LR
  A["Error Sources<br/>JSON-RPC, REST, gRPC,<br/>Tendermint, Connection"]
  B["Error Classification<br/>Tier 2: Chain-specific<br/>Tier 1: Generic/Transport"]
  C["LavaError Registry<br/>100+ Named Codes<br/>Chain Family Mapping"]
  D["Structured Logging<br/>LogCodedError/<br/>LogCodedWarning"]
  E["Prometheus Metrics<br/>lava_errors_total<br/>Full Label Cardinality"]
  F["Relay Processing<br/>Error Selection<br/>Retry Logic"]
  
  A -->|DetectConnectionError| B
  A -->|ClassifyError| B
  B -->|Match Against| C
  B -->|Extract Metadata| D
  D -->|Emit| E
  B -->|Determine Behavior| F

File Changes

1. protocol/common/error_registry_test.go 🧪 Tests +1030/-0

Add comprehensive error registry test coverage

• Comprehensive test suite (1030 lines) for error registry integrity, classification, and matchers
• Tests registry invariants: no duplicate codes/names, code ranges, and all error codes registered
• Validates error classification across chain families (EVM, Solana, Bitcoin, Cosmos, Starknet,
 NEAR, Aptos)
• Tests connection error detection, user input validation, and unsupported method classification
• Verifies matcher precedence and prevents shadowed matchers in Tier-1 and Tier-2 mappings

protocol/common/error_registry_test.go


2. protocol/common/error_classifier.go ✨ Enhancement +657/-0

Implement error classification with Tier-1 and Tier-2 matchers

• Implements Tier-2 chain-specific error mappings for Solana, Bitcoin, Cosmos, Starknet, and NEAR
• Implements Tier-1 generic error mappings partitioned by transport (JSON-RPC, REST, gRPC)
• Adds ClassifyError central classification function with connection error precedence
• Implements DetectConnectionError with three-layer detection (structured, string fallback,
 syscall errno)
• Adds helper functions: IsUnsupportedMethodError, IsUserInputError, ClassifyMessage,
 IsClientCancellation

protocol/common/error_classifier.go
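
As a rough illustration of three-layer connection detection, the sketch below checks typed errors first, then syscall errno values, and finally falls back to substring matching. The layer order, the function name `detectConnectionError`, the returned strings (in particular `CONNECTION_RESET`), and the substring list are assumptions rather than the PR's implementation.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"os"
	"strings"
	"syscall"
)

// detectConnectionError returns a connection-level error name, or "" when the
// error does not look like a connection failure.
func detectConnectionError(err error) string {
	if err == nil {
		return ""
	}
	// Layer 1: structured checks on typed errors.
	var netErr net.Error
	if errors.Is(err, context.DeadlineExceeded) || (errors.As(err, &netErr) && netErr.Timeout()) {
		return "CONNECTION_TIMEOUT"
	}
	// Layer 2: syscall errno values found anywhere in the wrap chain.
	if errors.Is(err, syscall.ECONNREFUSED) {
		return "CONNECTION_REFUSED"
	}
	if errors.Is(err, syscall.ECONNRESET) {
		return "CONNECTION_RESET"
	}
	// Layer 3: string fallback for errors that lost their type on the way.
	msg := strings.ToLower(err.Error())
	switch {
	case strings.Contains(msg, "connection refused"):
		return "CONNECTION_REFUSED"
	case strings.Contains(msg, "connection reset"),
		strings.Contains(msg, "goaway"),
		strings.Contains(msg, "rst_stream"):
		return "CONNECTION_RESET"
	case strings.Contains(msg, "timeout"), strings.Contains(msg, "timed out"):
		return "CONNECTION_TIMEOUT"
	}
	return ""
}

// Sample errors used by main to exercise each layer.
var (
	sampleTimeout error = fmt.Errorf("relay: %w", context.DeadlineExceeded)
	sampleRefused error = &net.OpError{Op: "dial", Net: "tcp", Err: os.NewSyscallError("connect", syscall.ECONNREFUSED)}
	sampleGoaway        = errors.New("http2: server sent GOAWAY and closed the connection")
	sampleOther         = errors.New("execution reverted")
)

func main() {
	for _, err := range []error{sampleTimeout, sampleRefused, sampleGoaway, sampleOther} {
		fmt.Printf("%q -> %q\n", err.Error(), detectConnectionError(err))
	}
}
```

Keeping the string fallback last means a typed error is never reclassified by an unrelated word in its message.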


3. protocol/rpcprovider/rpcprovider_server.go ✨ Enhancement +32/-36

Integrate structured error logging and classification

• Replace utils.LavaFormatError calls with common.LogCodedError and common.LogCodedWarning for
 structured error logging
• Add error classification via common.ClassifyLegacyError and common.ExtractLavaError for
 protocol and node errors
• Refactor unsupported method error handling to use chainlib.IsUnsupportedMethodErrorType and
 return early
• Update error logging in session management, relay validation, and subscription handling

protocol/rpcprovider/rpcprovider_server.go


4. protocol/relaycore/results_manager.go ✨ Enhancement +45/-10

Add chain-aware error classification to results manager

• Add chainID field to ResultsManagerInst for chain-specific error classification
• Implement transportFromProtocolMessage helper to derive transport type from API collection
• Replace generic error logging with common.LogCodedError using classified errors
• Extract JSON-RPC error codes from response bodies for accurate classification
• Add LavaError field to RelayError struct to track classified errors

protocol/relaycore/results_manager.go


5. protocol/chainlib/tendermintRPC.go ✨ Enhancement +1/-1

Add chain context to Tendermint RPC error handler

• Initialize TendermintRPCErrorHandler with chainFamily and chainID fields for chain-specific
 error classification

protocol/chainlib/tendermintRPC.go


6. protocol/common/error_codes.go ✨ Enhancement +606/-0

Centralized error code registry with comprehensive classification

• Introduces 606 lines of structured error code definitions across four layers (Protocol, Node,
 Blockchain, User errors)
• Defines 100+ named error codes (e.g., LavaErrorConnectionTimeout, LavaErrorChainNonceTooLow)
 with metadata (code, name, category, retryable flag)
• Covers error ranges: 1000-1999 (internal protocol), 2000-2999 (node), 3000-3999 (blockchain),
 4000-4999 (user input)
• Includes chain-specific errors for Solana, Starknet, Bitcoin, NEAR, and other supported chains

protocol/common/error_codes.go
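
The numeric ranges listed for this file map directly onto layers, and the registry test quoted later in this review asserts that 1000-1999 is internal while 2000-4999 is external. A small sketch of that mapping, with hypothetical function names and string labels:

```go
package main

import "fmt"

// layerForCode maps a code to the layer named in the summary above.
func layerForCode(code uint32) string {
	switch {
	case code >= 1000 && code <= 1999:
		return "protocol"
	case code >= 2000 && code <= 2999:
		return "node"
	case code >= 3000 && code <= 3999:
		return "blockchain"
	case code >= 4000 && code <= 4999:
		return "user"
	default:
		return "unknown"
	}
}

// categoryForCode applies the internal/external split checked by the range test.
func categoryForCode(code uint32) string {
	switch layerForCode(code) {
	case "protocol":
		return "internal"
	case "node", "blockchain", "user":
		return "external"
	default:
		return "unknown"
	}
}

func main() {
	for _, code := range []uint32{1001, 2005, 3001, 4001, 0} {
		fmt.Println(code, layerForCode(code), categoryForCode(code))
	}
}
```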


7. protocol/common/error_registry.go ✨ Enhancement +555/-0

Error classification framework with matchers and chain family mapping

• Implements error classification system with ErrorCategory (internal/external),
 ErrorSubCategory (unsupported method, user error, rate limit)
• Defines ChainFamily enum mapping 50+ chain IDs to families (EVM, Solana, Bitcoin, Cosmos,
 Starknet, etc.)
• Provides error matchers (CodeEquals, MessageContains, MessageRegex, HTTPStatusContains,
 GRPCCodeEquals) for flexible error pattern matching
• Implements LavaError struct with Is() and ABCICode() methods for error.Is() compatibility
 and gRPC wire protocol support
• Includes LavaWrappedError for wrapping classified errors with context while preserving
 errors.Is() matching

protocol/common/error_registry.go
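
A sketch of how a sentinel with an `Is()` method and a wrapper type can keep `errors.Is()` working after context is added. The types below are simplified stand-ins, not the PR's `LavaError` or `LavaWrappedError`; `wrap` and `isRateLimited` are hypothetical helpers, and comparing by code is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// LavaError is a simplified sentinel carrying a numeric code.
type LavaError struct {
	Code uint32
	Name string
}

func (e *LavaError) Error() string { return fmt.Sprintf("%s (code %d)", e.Name, e.Code) }

// Is compares by code, so errors.Is matches any LavaError with the same code.
func (e *LavaError) Is(target error) bool {
	t, ok := target.(*LavaError)
	return ok && t.Code == e.Code
}

// ABCICode exposes the numeric code in the shape sdkerrors-style callers expect.
func (e *LavaError) ABCICode() uint32 { return e.Code }

// wrappedError attaches a classification to an underlying cause.
type wrappedError struct {
	lava  *LavaError
	cause error
}

func (w *wrappedError) Error() string        { return w.lava.Name + ": " + w.cause.Error() }
func (w *wrappedError) Is(target error) bool { return w.lava.Is(target) }
func (w *wrappedError) Unwrap() error        { return w.cause }

func wrap(le *LavaError, cause error) error { return &wrappedError{lava: le, cause: cause} }

var (
	ErrConnectionTimeout = &LavaError{Code: 1001, Name: "PROTOCOL_CONNECTION_TIMEOUT"}
	ErrRateLimited       = &LavaError{Code: 2005, Name: "RATE_LIMITED"}

	errUpstream = errors.New("HTTP 429 from upstream")
	errSample   = wrap(ErrRateLimited, errUpstream)
)

// isRateLimited shows the call-site check that replaces message matching.
func isRateLimited(err error) bool { return errors.Is(err, ErrRateLimited) }

func main() {
	fmt.Println(isRateLimited(errSample))                   // true
	fmt.Println(errors.Is(errSample, errUpstream))          // true
	fmt.Println(errors.Is(errSample, ErrConnectionTimeout)) // false
	fmt.Println(errSample)                                  // RATE_LIMITED: HTTP 429 from upstream
}
```

Because the wrapper implements both `Is` and `Unwrap`, callers can match on the classification and on the original cause through the same `errors.Is` call.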


8. protocol/chainlib/node_error_handler.go Refactoring +205/-172

Unified error classification path across all transport types

• Replaces custom UnsupportedMethodError and SolanaNonRetryableError types with registry-based
 classification via ClassifyNodeError()
• Introduces ExtractNodeErrorDetails() to extract numeric error codes from JSON-RPC, gRPC, and
 HTTP errors
• Implements handleAndClassify() shared path for all transports (JSON-RPC, REST, gRPC, Tendermint)
 with single-log invariant
• Adds chainFamily and chainID fields to error handler structs for Tier 2 classification context
• Refactors IsUnsupportedMethodError(), IsSolanaNonRetryableError(), and ShouldRetryError() to
 use registry lookups

protocol/chainlib/node_error_handler.go


9. protocol/rpcsmartrouter/error_mapper_test.go 🧪 Tests +245/-121

Registry-based error classification test suite

• Replaces 40+ lines of legacy error mapping tests with new registry-based classification tests
• Adds tests for classifyDirectRPCError() covering connection errors, HTTP status codes, JSON-RPC
 body extraction, and Solana-specific codes
• Introduces endpoint health classification tests (TestClassifyEndpointHealth_*) validating
 internal vs external error handling and rate-limit carve-outs
• Tests errors.Is() matching against LavaError sentinels and client cancellation carve-out logic

protocol/rpcsmartrouter/error_mapper_test.go


10. protocol/rpcconsumer/rpcconsumer_server.go ✨ Enhancement +56/-40

Structured error logging with code-based metrics integration

• Extracts cross-validation capacity validation into validateCrossValidationCapacity() method
 returning LogCodedWarning()
• Adds SetChainID() calls to UsedProviders for structured logging context
• Replaces utils.LavaFormatError() with common.LogCodedWarning() / common.LogCodedError() for
 error code metrics emission
• Adds IsUserError flag detection alongside IsUnsupportedMethod for zero-CU carve-out logic
• Introduces IsUnsupportedMethodError() and IsUserInputError() registry-based checks replacing
 message pattern matching

protocol/rpcconsumer/rpcconsumer_server.go


11. protocol/relaycore/relay_processor.go ✨ Enhancement +62/-3

Relay processor integration with error registry classification

• Adds chainID parameter to NewResultsManager() constructor for error context
• Implements HasNonRetryableUserFacingErrors() method checking both IsUnsupportedMethod and
 IsUserError flags
• Updates shouldRetryRelay() to use HasNonRetryableUserFacingErrors() instead of
 HasUnsupportedMethodErrors()
• Adds bestLavaError extraction in buildFailureResult() with LogCodedError() call for metrics

protocol/relaycore/relay_processor.go


12. protocol/chainlib/node_error_handler_test.go 🧪 Tests +32/-343

Refactor error handler tests to use structured error classification

• Simplified test assertions to use Contains instead of exact equality checks for error messages
• Replaced type-assertion based error detection with IsUnsupportedMethodErrorType() function calls
• Removed extensive test coverage for message pattern matching functions
 (IsUnsupportedMethodMessage, IsUnsupportedMethodErrorMessageBytes)
• Removed benchmark tests and error constant verification tests
• Updated smart contract error tests to use the new error classification API

protocol/chainlib/node_error_handler_test.go


13. protocol/rpcsmartrouter/rpcsmartrouter_server.go ✨ Enhancement +84/-73

Integrate structured error classification into smart router

• Added validateCrossValidationCapacity() method to extract cross-validation validation logic
• Set chainID on UsedProviders instances for chain-aware error classification
• Replaced inline error classification with call to validateCrossValidationCapacity()
• Enhanced direct RPC relay error handling with structured error classification using
 classifyDirectRPCError()
• Added client cancellation detection via common.IsClientCancellation() to avoid marking endpoints
 unhealthy on relay races
• Replaced classifyRelayError() with classifyEndpointHealth() for better error categorization

protocol/rpcsmartrouter/rpcsmartrouter_server.go


14. protocol/common/legacy_error_mapping_test.go 🧪 Tests +183/-0

Add comprehensive legacy error mapping tests

• New test file validating legacy sdkerrors to LavaError mapping
• Tests cover session errors, provider errors, protocol errors, chain tracker errors, and
 performance errors
• Verifies backward compatibility with sdkerrors (codespace, code) tuple matching
• Tests fallback to message-based classification when no mapping exists
• Validates transport-scoped fallback behavior for gRPC vs JSON-RPC

protocol/common/legacy_error_mapping_test.go


15. protocol/chainlib/handle_and_classify_test.go 🧪 Tests +206/-0

Add tests for node error classification and handling

• New test file for handleAndClassify() and ClassifyNodeError() functions
• Tests unsupported method detection, retryable vs non-retryable error classification
• Validates single-metric-per-call invariant across all error paths
• Tests gRPC status errors, HTTP errors, and plain errors
• Verifies UnwrapLavaError() behavior for wrapped and plain errors

protocol/chainlib/handle_and_classify_test.go


16. protocol/common/legacy_error_mapping.go ✨ Enhancement +135/-0

Implement legacy sdkerrors to LavaError mapping

• New file implementing ClassifyLegacyError() for sdkerrors backward compatibility
• Maps (codespace, code) tuples from legacy errors to LavaError equivalents
• Covers session, provider, protocol, chain tracker, and performance error codes
• Falls back to message-based classification via ClassifyError() when no mapping exists
• Extracts sdkerrors interface (Codespace/ABCICode) for tuple-based lookup

protocol/common/legacy_error_mapping.go
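
The tuple lookup with a message-based fallback might look like the sketch below. The `coded` interface mirrors the `Codespace()`/`ABCICode()` methods mentioned above; the codespace `example_session`, code 7, the name `SESSION_OUT_OF_SYNC`, and both helper functions are invented for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// coded is the slice of the sdkerrors interface needed for the lookup.
type coded interface {
	Codespace() string
	ABCICode() uint32
}

type legacyKey struct {
	codespace string
	code      uint32
}

// legacyToName maps (codespace, code) tuples to new error names.
// The single entry here is made up for the example.
var legacyToName = map[legacyKey]string{
	{codespace: "example_session", code: 7}: "SESSION_OUT_OF_SYNC",
}

// legacyError is a stand-in for an sdkerrors-style error value.
type legacyError struct {
	codespace string
	code      uint32
	msg       string
}

func (e *legacyError) Error() string     { return e.msg }
func (e *legacyError) Codespace() string { return e.codespace }
func (e *legacyError) ABCICode() uint32  { return e.code }

// classifyMessage is a tiny stand-in for message-based classification.
func classifyMessage(msg string) string {
	if strings.Contains(strings.ToLower(msg), "rate limit") {
		return "RATE_LIMITED"
	}
	return "UNKNOWN_ERROR"
}

// classifyLegacy prefers the tuple mapping and falls back to the message.
func classifyLegacy(err error) string {
	if err == nil {
		return ""
	}
	var c coded
	if errors.As(err, &c) {
		if name, ok := legacyToName[legacyKey{codespace: c.Codespace(), code: c.ABCICode()}]; ok {
			return name
		}
	}
	return classifyMessage(err.Error())
}

func main() {
	mapped := &legacyError{codespace: "example_session", code: 7, msg: "session out of sync"}
	unmapped := &legacyError{codespace: "example_other", code: 1, msg: "rate limit exceeded"}
	fmt.Println(classifyLegacy(fmt.Errorf("relay: %w", mapped))) // SESSION_OUT_OF_SYNC
	fmt.Println(classifyLegacy(unmapped))                        // RATE_LIMITED
}
```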


17. protocol/rpcprovider/rpcprovider_server_test.go 🧪 Tests +83/-20

Update provider tests for structured error classification

• Updated error handling tests to use IsUnsupportedMethodErrorType() instead of type assertions
• Simplified test assertions for unsupported method error properties
• Added finalizeSessionErrorPath() helper to replicate error-branch logic
• Added TestFinalizeSession_SingleMetricPerFailure() to verify single-metric invariant
• Updated tests to verify LavaError wrapping behavior instead of direct property access

protocol/rpcprovider/rpcprovider_server_test.go


18. protocol/rpcsmartrouter/error_mapper.go ✨ Enhancement +66/-63

Refactor error mapping to use structured classification

• Replaced MapDirectRPCError() with classifyDirectRPCError() and classifyAndWrap()
• Removed connection error detection functions (moved to common.DetectConnectionError())
• Added classifyEndpointHealth() to decide endpoint health based on error classification
• Implements external-beats-internal preference and rate-limit carve-out for endpoint health
• Uses extractLavaError() to unwrap classified errors from LavaWrappedError

protocol/rpcsmartrouter/error_mapper.go


19. protocol/relaycore/relay_errors_classification_test.go 🧪 Tests +102/-0

Add tests for relay error selection precedence

• New test file for GetBestErrorMessageForUser() precedence logic
• Tests majority consensus preference over individual error scores
• Tests external error preference over internal errors
• Tests score-based tiebreaking within same category
• Validates LavaError propagation through error selection

protocol/relaycore/relay_errors_classification_test.go


20. protocol/common/error_logging.go ✨ Enhancement +110/-0

Add structured error logging with metrics callbacks

• New file implementing structured error logging with metrics integration
• Defines ErrorMetricsCallback for Prometheus counter updates
• Implements LogCodedError() and LogCodedWarning() for classified error logging
• Provides EmitErrorMetric() for metric-only emission without logging
• Includes ExtractJSONRPCErrorCode() helper for JSON-RPC error code extraction

protocol/common/error_logging.go
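
A hedged sketch of logging a classified error while notifying an optional metrics callback. The function signatures, the log line format, and the chain ID `ETH1` are assumptions; the nil-defaults-to-unknown and nil-callback behaviors follow the test descriptions elsewhere in this summary. The sketch is not goroutine-safe.

```go
package main

import "fmt"

// LavaError is a simplified stand-in for the registry entry.
type LavaError struct {
	Code      uint32
	Name      string
	Category  string
	Retryable bool
}

var ErrUnknown = &LavaError{Code: 0, Name: "UNKNOWN_ERROR", Category: "internal"}

// ErrorMetricsCallback lets a metrics package observe every coded log call
// without the logging code importing the metrics package.
type ErrorMetricsCallback func(le *LavaError, chainID string)

var metricsCallback ErrorMetricsCallback

func setErrorMetricsCallback(cb ErrorMetricsCallback) { metricsCallback = cb }

// logCodedError formats a structured line and fires the callback if one is set.
// A nil LavaError is treated as ErrUnknown.
func logCodedError(msg string, le *LavaError, chainID string) string {
	if le == nil {
		le = ErrUnknown
	}
	if metricsCallback != nil {
		metricsCallback(le, chainID)
	}
	return fmt.Sprintf("ERR %s error_code=%d error_name=%s error_category=%s retryable=%t chain_id=%s",
		msg, le.Code, le.Name, le.Category, le.Retryable, chainID)
}

func main() {
	counts := map[string]int{}
	setErrorMetricsCallback(func(le *LavaError, chainID string) {
		counts[le.Name+"|"+chainID]++
	})
	rateLimited := &LavaError{Code: 2005, Name: "RATE_LIMITED", Category: "external", Retryable: true}
	fmt.Println(logCodedError("relay failed", rateLimited, "ETH1"))
	fmt.Println(logCodedError("relay failed", nil, "ETH1"))
	fmt.Println(counts["RATE_LIMITED|ETH1"], counts["UNKNOWN_ERROR|ETH1"]) // 1 1
}
```

Routing metrics through a callback keeps the dependency direction one-way: the metrics package registers itself at startup, and the logging code stays unaware of Prometheus.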


21. protocol/common/error_fixtures_test.go 🧪 Tests +116/-0

Add fixture-based error classification regression tests

• New test file loading error fixtures from JSON for regression testing
• Tests ClassifyError() against real error responses from various chains
• Validates fixture file structure and prevents duplicate test names
• Provides mechanism for adding new fixtures without code changes

protocol/common/error_fixtures_test.go


22. protocol/relaycore/relay_errors.go ✨ Enhancement +33/-5

Enhance relay error selection with classification-aware precedence

• Enhanced GetBestErrorMessageForUser() with documented precedence rules
• Implemented majority consensus check (step 1) that overrides scoring
• Added external-beats-internal preference (step 2) for scored candidates
• Added LavaError field to RelayError struct for error classification
• Fixed off-by-one error in mergeAllErrors() comma insertion logic

protocol/relaycore/relay_errors.go
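
The precedence described for this file (majority consensus first, then external over internal, then score) can be sketched as below. The `candidate` type and `bestError` function are hypothetical, and "majority" is assumed to mean a strict majority of identical messages.

```go
package main

import "fmt"

// candidate is a hypothetical view of one provider's error.
type candidate struct {
	Message  string
	External bool // true for node/chain/user errors, false for protocol-internal ones
	Score    float64
}

// bestError picks the error to show the user.
// Step 1: a message returned by a strict majority wins outright.
// Step 2: otherwise external errors beat internal ones.
// Step 3: within the same category the higher score wins.
func bestError(cands []candidate) (candidate, bool) {
	if len(cands) == 0 {
		return candidate{}, false
	}
	counts := map[string]int{}
	for _, c := range cands {
		counts[c.Message]++
	}
	for _, c := range cands {
		if counts[c.Message]*2 > len(cands) {
			return c, true
		}
	}
	best := cands[0]
	for _, c := range cands[1:] {
		switch {
		case c.External && !best.External:
			best = c
		case c.External == best.External && c.Score > best.Score:
			best = c
		}
	}
	return best, true
}

func main() {
	majority, _ := bestError([]candidate{
		{Message: "A", External: false, Score: 0.1},
		{Message: "B", External: true, Score: 0.9},
		{Message: "A", External: false, Score: 0.2},
	})
	fmt.Println(majority.Message) // A

	external, _ := bestError([]candidate{
		{Message: "A", External: false, Score: 0.9},
		{Message: "B", External: true, Score: 0.1},
	})
	fmt.Println(external.Message) // B
}
```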


23. protocol/metrics/error_metrics_integration_test.go 🧪 Tests +83/-0

Add end-to-end error metrics integration test

• New integration test verifying end-to-end error metrics flow
• Tests Prometheus counter increments for classified errors
• Validates label cardinality (error_code, error_name, error_category, retryable, chain_id)
• Verifies metric aggregation across multiple error emissions

protocol/metrics/error_metrics_integration_test.go


24. protocol/lavasession/used_providers.go ✨ Enhancement +30/-4

Add chain context to used providers for error classification

• Added chainID field to UsedProviders for chain-aware error classification
• Implemented SetChainID() method to attach chain context after construction
• Updated shouldRetryWithThisError() to accept receiver and use chain context
• Passes chainID to common.IsUnsupportedMethodError() for Tier-2 classification

protocol/lavasession/used_providers.go


25. protocol/chainlib/no_retry_test.go 🧪 Tests +11/-19

Update retry logic tests for chain-aware classification

• Updated test comment for Solana -32009 error to clarify chain context requirement
• Changed expected retry behavior for Solana missing-in-long-term-storage without chain context
• Updated IsSolanaNonRetryableErrorType() test to reflect unsupported methods as non-retryable
• Simplified SolanaNonRetryableError test to verify LavaError wrapping
• Removed error code constant verification tests

protocol/chainlib/no_retry_test.go


26. protocol/metrics/error_metrics_test.go 🧪 Tests +77/-0

Add error metrics initialization tests

• New test file for InitErrorMetrics() Prometheus counter setup
• Tests counter registration and label aggregation
• Verifies metric increments for distinct label combinations

protocol/metrics/error_metrics_test.go


27. protocol/common/error_logging_test.go 🧪 Tests +69/-0

Add error logging function tests

• New test file for LogCodedError() and LogCodedWarning() functions
• Tests nil LavaError handling (defaults to LavaErrorUnknown)
• Tests metrics callback invocation with correct parameters
• Verifies no panic when callback is nil

protocol/common/error_logging_test.go


28. protocol/rpcsmartrouter/direct_rpc_relay.go ✨ Enhancement +9/-7

Integrate chain family into direct RPC relay error handling

• Added chainFamily field to DirectRPCRelaySender for Tier-2 error classification
• Replaced MapDirectRPCError() calls with classifyAndWrap() for structured classification
• Updated JSON-RPC, REST, and gRPC relay methods to pass transport type to classification
• Passes chainFamily to classification functions for chain-specific matchers

protocol/rpcsmartrouter/direct_rpc_relay.go


29. protocol/chainlib/chainproxy/rpcInterfaceMessages/restMessage.go 📝 Documentation +9/-9

Improve REST message error handling comments

• Added clarifying comments about status-code range checking vs registry lookup
• Improved variable naming in Cosmos SDK transaction error check
• Updated comments to use em-dash for consistency

protocol/chainlib/chainproxy/rpcInterfaceMessages/restMessage.go


30. protocol/lavasession/used_providers_test.go 🧪 Tests +5/-3

Update used providers tests for instance method calls

• Updated TestShouldRetryWithThisError() to create UsedProviders instance
• Changed function calls from package-level to instance method (up.shouldRetryWithThisError())
• Tests now verify chain-aware unsupported method detection

protocol/lavasession/used_providers_test.go


31. protocol/relaycore/relay_processor_test.go 🧪 Tests +6/-10

Update relay processor tests for new classification API

• Updated test calls from IsUnsupportedMethodMessage() to IsUnsupportedMethodError()
• Changed test expectations for "method not supported" (now retryable, not unsupported)
• Updated registry-based classification test comments
• Removed "method not supported" from unsupported patterns list

protocol/relaycore/relay_processor_test.go


32. protocol/rpcprovider/provider_state_machine.go ✨ Enhancement +2/-10

Use structured error classification in provider state machine

• Replaced inline unsupported method detection with common.IsUnsupportedMethodError() call
• Removed manual HTTP status code checks for REST endpoints
• Passes chainId and statusCode to classification function for accurate detection
• Updated comment to reference error registry classification

protocol/rpcprovider/provider_state_machine.go


33. protocol/lavasession/consumer_types.go ✨ Enhancement +9/-5

Use shared client cancellation detection in consumer session

• Replaced inline context cancellation check with common.IsClientCancellation()
• Updated comments to reference shared rule from common package
• Clarifies that context.DeadlineExceeded is NOT exempt from refusal counter

protocol/lavasession/consumer_types.go
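
A minimal sketch of the cancellation rule described here, assuming the shared check boils down to `errors.Is(err, context.Canceled)`. The real `common.IsClientCancellation()` may also inspect gRPC status codes; `countsTowardRefusal` is a hypothetical helper added only to show the effect on the refusal counter.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// isClientCancellation reports whether the caller gave up on the request.
// A deadline expiry is deliberately not treated as cancellation.
func isClientCancellation(err error) bool {
	return errors.Is(err, context.Canceled)
}

// countsTowardRefusal shows the consequence: only non-cancellation errors
// are held against the provider.
func countsTowardRefusal(err error) bool {
	return err != nil && !isClientCancellation(err)
}

// Sample errors, wrapped the way a relay path might wrap them.
var (
	errClientGone = fmt.Errorf("relay aborted: %w", context.Canceled)
	errTooSlow    = fmt.Errorf("relay aborted: %w", context.DeadlineExceeded)
)

func main() {
	fmt.Println(countsTowardRefusal(errClientGone)) // false
	fmt.Println(countsTowardRefusal(errTooSlow))    // true
}
```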


34. protocol/metrics/error_metrics.go ✨ Enhancement +46/-0

Add error metrics initialization for Prometheus

• New file implementing InitErrorMetrics() for Prometheus counter setup
• Registers lava_errors_total counter with labels for error_name, error_category, retryable,
 chain_id
• Handles already-registered counter gracefully for test compatibility
• Sets error metrics callback in common package for structured logging integration

protocol/metrics/error_metrics.go


35. protocol/chainlib/rest.go ✨ Enhancement +1/-1

Initialize REST error handler with chain context

• Updated RestErrorHandler initialization to include chainFamily and chainID
• Uses common.GetChainFamilyOrDefault() to resolve chain family from chain ID

protocol/chainlib/rest.go


36. protocol/rpcconsumer/rpcconsumer.go ✨ Enhancement +1/-0

Initialize error metrics in consumer startup

• Added metrics.InitErrorMetrics() call during consumer startup
• Initializes Prometheus error counter before metrics manager creation

protocol/rpcconsumer/rpcconsumer.go


37. protocol/chainlib/debug_retry_test.go 🧪 Tests +4/-2

Update debug retry test for context-aware classification

• Updated test to use ShouldRetryErrorWithContext() with explicit transport parameter
• Added comment clarifying that gRPC transport must be specified for registry detection
• Imports common package for transport type constant

protocol/chainlib/debug_retry_test.go


38. protocol/chainlib/grpc.go ✨ Enhancement +1/-1

Initialize gRPC error handler with chain context

• Updated GRPCErrorHandler initialization to include chainFamily and chainID
• Uses common.GetChainFamilyOrDefault() to resolve chain family from chain ID

protocol/chainlib/grpc.go


39. protocol/chainlib/jsonRPC.go ✨ Enhancement +1/-1

Initialize JSON-RPC error handler with chain context

• Updated JsonRPCErrorHandler initialization to include chainFamily and chainID
• Uses common.GetChainFamilyOrDefault() to resolve chain family from chain ID

protocol/chainlib/jsonRPC.go


40. protocol/rpcsmartrouter/direct_rpc_integration_test.go 🧪 Tests +4/-4

Update direct RPC integration test assertions

• Updated timeout test assertion to expect "deadline exceeded" instead of "timeout"
• Updated server error test assertion to expect "503" instead of "service unavailable"
• Added comments clarifying that errors preserve original messages with classification in metadata

protocol/rpcsmartrouter/direct_rpc_integration_test.go


41. protocol/common/endpoints.go ✨ Enhancement +5/-0

Add user error flag to relay result

• Added IsUserError field to RelayResult struct
• Documents behavioral contract matching IsUnsupportedMethod (zero retries, zero CU, no scoring)
• Notes that user errors are not cached for subsequent requests

protocol/common/endpoints.go


42. protocol/rpcconsumer/consumer_relay_state_machine.go ✨ Enhancement +5/-3

Clarify non-retryable user-facing error detection

• Renamed internal check to HasNonRetryableUserFacingErrors() for clarity
• Updated comment to clarify coverage of both unsupported method and user input errors
• Maintains historical function name for backward compatibility

protocol/rpcconsumer/consumer_relay_state_machine.go


43. protocol/chainlib/chainproxy/rpcInterfaceMessages/restMessage_test.go 🧪 Tests +9/-0

Add test for 501 Not Implemented as node error

• Added test case for 501 Not Implemented (treated as node error, not client error)
• Verifies that 5xx status codes trigger retries on another provider

protocol/chainlib/chainproxy/rpcInterfaceMessages/restMessage_test.go


44. protocol/chainlib/chainproxy/rpcInterfaceMessages/jsonRPCMessage.go 🐞 Bug fix +5/-1

Fix JSON-RPC empty error message handling

• Fixed CheckResponseError() to return false when error message is empty
• Prevents empty error messages from being treated as valid errors

protocol/chainlib/chainproxy/rpcInterfaceMessages/jsonRPCMessage.go
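
To illustrate the fix, the sketch below treats a JSON-RPC response as an error only when the `error` object carries a non-empty message. The function signature and the response types are assumptions; the real `CheckResponseError()` is a method on the PR's message type and may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// rpcError and rpcResponse model only the fields this check needs.
type rpcError struct {
	Code    int    `json:"code"`
	Message string `json:"message"`
}

type rpcResponse struct {
	Error *rpcError `json:"error"`
}

// checkResponseError reports whether the body carries a usable error and
// returns its message. An error object with an empty message is ignored.
func checkResponseError(body []byte) (bool, string) {
	var resp rpcResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return false, ""
	}
	if resp.Error == nil || resp.Error.Message == "" {
		return false, ""
	}
	return true, resp.Error.Message
}

func main() {
	bodies := []string{
		`{"jsonrpc":"2.0","id":1,"error":{"code":-32000,"message":"execution reverted"}}`,
		`{"jsonrpc":"2.0","id":1,"error":{"code":-32000,"message":""}}`,
		`{"jsonrpc":"2.0","id":1,"result":"0x1"}`,
	}
	for _, b := range bodies {
		hasErr, msg := checkResponseError([]byte(b))
		fmt.Printf("%t %q\n", hasErr, msg)
	}
}
```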


45. .github/workflows/lava.yml Additional files +32/-0

...

.github/workflows/lava.yml


46. error-registry-design.md Additional files +601/-0

...

error-registry-design.md


47. go.mod Additional files +1/-1

...

go.mod


48. protocol/common/errors.go Additional files +0/-141

...

protocol/common/errors.go


49. protocol/common/errors_test.go Additional files +0/-412

...

protocol/common/errors_test.go


50. protocol/common/return_errors.go Additional files +6/-1

...

protocol/common/return_errors.go


51. protocol/common/svm.go Additional files +10/-6

...

protocol/common/svm.go


52. protocol/common/testdata/error_fixtures.json Additional files +839/-0

...

protocol/common/testdata/error_fixtures.json


53. protocol/rpcprovider/rpcprovider.go Additional files +1/-0

...

protocol/rpcprovider/rpcprovider.go


54. protocol/rpcprovider/test_mode_consistency_test.go Additional files +3/-4

...

protocol/rpcprovider/test_mode_consistency_test.go


55. protocol/rpcsmartrouter/rpcsmartrouter.go Additional files +1/-0

...

protocol/rpcsmartrouter/rpcsmartrouter.go


56. protocol/rpcsmartrouter/rpcsmartrouter_test.go Additional files +0/-113

...

protocol/rpcsmartrouter/rpcsmartrouter_test.go


57. testutil/e2e/allowedErrorList.go Additional files +2/-0

...

testutil/e2e/allowedErrorList.go



@qodo-code-review

qodo-code-review Bot commented Apr 9, 2026


Code Review by Qodo

🐞 Bugs (4)   📘 Rule violations (2)   📎 Requirement gaps (0)   🎨 UX Issues (0)
🐞 ≡ Correctness (1) ☼ Reliability (2) ◔ Observability (1)
📘 ⚙ Maintainability (2)


Action required

1. error-registry-design.md not snake_case 📘
Description
A new documentation file is named with hyphens (error-registry-design.md) instead of snake_case,
violating the repository naming convention requirement. This reduces consistency and can break
tooling or conventions that assume snake_case paths.
Code

error-registry-design.md[1]

+# Error Categorization & Standardized Logging Plan
Evidence
PR Compliance ID 4 requires file names to use snake_case. The PR adds error-registry-design.md,
which uses hyphens rather than underscores.

AGENTS.md
error-registry-design.md[1-1]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
A newly added documentation file uses hyphens in its filename instead of the required `snake_case` convention.

## Issue Context
Compliance requires file names to be `snake_case`.

## Fix Focus Areas
- error-registry-design.md[1-1]



2. Misnamed tests lack underscore 📘
Description
Several newly added tests do not follow the required TestComponent_Scenario naming convention
(missing the _ separator). This reduces test discoverability/consistency and violates the project
test naming standard.
Code

protocol/common/error_registry_test.go[R18-245]

+func TestRegistryNoDuplicateCodes(t *testing.T) {
+	// registerError panics on duplicate codes at init time,
+	// but let's also verify the registry is consistent.
+	seen := make(map[uint32]string)
+	for code, le := range errorRegistry {
+		if existing, ok := seen[code]; ok {
+			t.Errorf("duplicate code %d: %s and %s", code, existing, le.Name)
+		}
+		seen[code] = le.Name
+		assert.Equal(t, code, le.Code, "registry key %d doesn't match error code %d (%s)", code, le.Code, le.Name)
+	}
+}
+
+func TestRegistryNoDuplicateNames(t *testing.T) {
+	seen := make(map[string]uint32)
+	for _, le := range errorRegistry {
+		if existingCode, ok := seen[le.Name]; ok {
+			t.Errorf("duplicate name %s: codes %d and %d", le.Name, existingCode, le.Code)
+		}
+		seen[le.Name] = le.Code
+	}
+}
+
+func TestRegistryCodeRanges(t *testing.T) {
+	for _, le := range errorRegistry {
+		if le.Code == 0 {
+			continue // UNKNOWN_ERROR
+		}
+		switch {
+		case le.Code >= 1000 && le.Code < 2000:
+			assert.Equal(t, CategoryInternal, le.Category,
+				"code %d (%s) is in protocol range but not CategoryInternal", le.Code, le.Name)
+		case le.Code >= 2000 && le.Code < 5000:
+			assert.Equal(t, CategoryExternal, le.Category,
+				"code %d (%s) is in external range but not CategoryExternal", le.Code, le.Name)
+		default:
+			t.Errorf("code %d (%s) is outside valid ranges (1000-4999)", le.Code, le.Name)
+		}
+	}
+}
+
+func TestAllErrorCodesRegistered(t *testing.T) {
+	// Spot-check that key error codes are in the registry
+	// Verify UNKNOWN_ERROR is registered at code 0
+	assert.Equal(t, LavaErrorUnknown, getLavaError(0))
+	assert.Equal(t, "UNKNOWN_ERROR", getLavaError(0).Name)
+
+	codes := []uint32{
+		1001, // CONNECTION_TIMEOUT
+		1002, // CONNECTION_REFUSED
+		2001, // METHOD_NOT_FOUND
+		2005, // RATE_LIMITED
+		3001, // NONCE_TOO_LOW
+		3101, // EXECUTION_REVERTED
+		4001, // PARSE_ERROR
+	}
+	for _, code := range codes {
+		le := getLavaError(code)
+		assert.NotEqual(t, LavaErrorUnknown, le, "code %d should be registered", code)
+		assert.Equal(t, code, le.Code)
+	}
+}
+
+func TestRegisterError_RejectsReservedCodeZero(t *testing.T) {
+	assert.PanicsWithValue(t,
+		"error code 0 is reserved for LavaErrorUnknown; MY_NEW_ERROR must use a non-zero code",
+		func() {
+			registerError(&LavaError{Code: 0, Name: "MY_NEW_ERROR"})
+		},
+	)
+}
+
+func TestRegisterError_DuplicateCodeMentionsExistingName(t *testing.T) {
+	// LavaErrorConnectionTimeout has code 1001 — re-registering under the same
+	// code but a different name must panic with a message that names the existing
+	// owner so the offender can find the collision fast.
+	defer func() {
+		r := recover()
+		if r == nil {
+			t.Fatalf("expected panic on duplicate code")
+		}
+		msg, ok := r.(string)
+		if !ok {
+			t.Fatalf("expected string panic, got %T: %v", r, r)
+		}
+		assert.Contains(t, msg, "duplicate error code: 1001")
+		assert.Contains(t, msg, "PROTOCOL_CONNECTION_TIMEOUT") // existing owner
+		assert.Contains(t, msg, "COLLIDING_NAME")              // new entrant
+	}()
+	registerError(&LavaError{Code: 1001, Name: "COLLIDING_NAME"})
+}
+
+func TestRegisterError_DuplicateNameMentionsExistingCode(t *testing.T) {
+	defer func() {
+		r := recover()
+		if r == nil {
+			t.Fatalf("expected panic on duplicate name")
+		}
+		msg, ok := r.(string)
+		if !ok {
+			t.Fatalf("expected string panic, got %T: %v", r, r)
+		}
+		assert.Contains(t, msg, "duplicate error name: PROTOCOL_CONNECTION_TIMEOUT")
+		assert.Contains(t, msg, "existing code 1001")
+	}()
+	registerError(&LavaError{Code: 99999, Name: "PROTOCOL_CONNECTION_TIMEOUT"})
+}
+
+// ---------------------------------------------------------------------------
+// Lookup helper tests
+// ---------------------------------------------------------------------------
+
+func TestErrorRegistry_GetLavaErrorByCode(t *testing.T) {
+	le := getLavaError(1001)
+	assert.Equal(t, "PROTOCOL_CONNECTION_TIMEOUT", le.Name)
+
+	le = getLavaError(99999)
+	assert.Equal(t, LavaErrorUnknown, le)
+}
+
+func TestErrorRegistry_GetLavaErrorByName(t *testing.T) {
+	le := getLavaErrorByName("PROTOCOL_CONNECTION_TIMEOUT")
+	assert.Equal(t, uint32(1001), le.Code)
+
+	le = getLavaErrorByName("NONEXISTENT")
+	assert.Equal(t, LavaErrorUnknown, le)
+}
+
+func TestErrorRegistry_IsRetryableStates(t *testing.T) {
+	assert.True(t, isRetryable(1001))  // CONNECTION_TIMEOUT
+	assert.False(t, isRetryable(1004)) // TLS_MISMATCH
+	assert.False(t, isRetryable(3001)) // NONCE_TOO_LOW
+	assert.True(t, isRetryable(0))     // UNKNOWN — retryable by default
+}
+
+func TestErrorRegistry_IsInternalExternalFlags(t *testing.T) {
+	assert.True(t, IsInternal(1001))  // PROTOCOL_CONNECTION_TIMEOUT
+	assert.False(t, IsExternal(1001)) // not external
+
+	assert.True(t, IsExternal(2001))  // NODE_METHOD_NOT_FOUND
+	assert.False(t, IsInternal(2001)) // not internal
+
+	assert.True(t, IsExternal(3001)) // CHAIN_NONCE_TOO_LOW
+	assert.True(t, IsExternal(4001)) // USER_PARSE_ERROR
+	assert.True(t, IsExternal(0))    // UNKNOWN — external
+}
+
+// ---------------------------------------------------------------------------
+// LavaError as error interface + errors.Is tests
+// ---------------------------------------------------------------------------
+
+func TestLavaError_String(t *testing.T) {
+	assert.Equal(t, "[3001] CHAIN_NONCE_TOO_LOW", LavaErrorChainNonceTooLow.String())
+}
+
+func TestLavaError_ABCICode(t *testing.T) {
+	assert.Equal(t, uint32(3001), LavaErrorChainNonceTooLow.ABCICode())
+	assert.Equal(t, uint32(1001), LavaErrorConnectionTimeout.ABCICode())
+	assert.Equal(t, uint32(0), LavaErrorUnknown.ABCICode())
+}
+
+func TestLavaError_IsNonMatch(t *testing.T) {
+	// Is returns false for non-LavaError targets
+	assert.False(t, LavaErrorChainNonceTooLow.Is(errors.New("not a LavaError")))
+}
+
+func TestLavaWrappedError_EmptyContext(t *testing.T) {
+	wrapped := NewLavaError(LavaErrorChainNonceTooLow, "")
+	assert.Contains(t, wrapped.Error(), "CHAIN_NONCE_TOO_LOW")
+}
+
+func TestLavaWrappedError_IsNonMatch(t *testing.T) {
+	wrapped := NewLavaError(LavaErrorChainNonceTooLow, "context")
+	// Is returns false for non-LavaError targets
+	assert.False(t, errors.Is(wrapped, errors.New("not a LavaError")))
+}
+
+func TestLavaWrappedError_Unwrap(t *testing.T) {
+	wrapped := NewLavaError(LavaErrorChainNonceTooLow, "context")
+	unwrapped := errors.Unwrap(wrapped)
+	require.NotNil(t, unwrapped)
+	assert.Equal(t, LavaErrorChainNonceTooLow, unwrapped)
+}
+
+func TestLavaError_ErrorInterface(t *testing.T) {
+	var err error = LavaErrorChainNonceTooLow
+	assert.Contains(t, err.Error(), "CHAIN_NONCE_TOO_LOW")
+}
+
+func TestLavaError_ErrorsIs(t *testing.T) {
+	// Direct match
+	assert.True(t, errors.Is(LavaErrorChainNonceTooLow, LavaErrorChainNonceTooLow))
+	assert.False(t, errors.Is(LavaErrorChainNonceTooLow, LavaErrorConnectionTimeout))
+
+	// Wrapped with NewLavaError
+	wrapped := NewLavaError(LavaErrorChainNonceTooLow, "tx failed")
+	assert.True(t, errors.Is(wrapped, LavaErrorChainNonceTooLow))
+	assert.False(t, errors.Is(wrapped, LavaErrorConnectionTimeout))
+	assert.Contains(t, wrapped.Error(), "tx failed")
+
+	// Wrapped with fmt.Errorf %w
+	doubleWrapped := fmt.Errorf("relay error: %w", wrapped)
+	assert.True(t, errors.Is(doubleWrapped, LavaErrorChainNonceTooLow))
+}
+
+// ---------------------------------------------------------------------------
+// ErrorCategory / ErrorSubCategory tests
+// ---------------------------------------------------------------------------
+
+func TestErrorCategoryString(t *testing.T) {
+	assert.Equal(t, "internal", CategoryInternal.String())
+	assert.Equal(t, "external", CategoryExternal.String())
+	assert.Equal(t, "unknown", ErrorCategory(99).String())
+}
+
+func TestErrorSubCategoryString(t *testing.T) {
+	assert.Equal(t, "none", SubCategoryNone.String())
+	assert.Equal(t, "unsupported_method", SubCategoryUnsupportedMethod.String())
+	assert.Equal(t, "user_error", SubCategoryUserError.String())
+}
+
+func TestUnsupportedMethodSubCategory(t *testing.T) {
+	// 2002 (NODE_METHOD_NOT_SUPPORTED) is intentionally excluded: it's retryable on another provider
+	unsupportedCodes := []uint32{2001, 2008, 2009, 2010}
+	for _, code := range unsupportedCodes {
+		le := getLavaError(code)
+		require.NotEqual(t, LavaErrorUnknown, le, "code %d not registered", code)
+		assert.True(t, le.SubCategory.IsUnsupportedMethod(),
Evidence
PR Compliance ID 5 requires test names to follow TestComponent_Scenario. The added tests include
names like TestRegistryNoDuplicateCodes, TestRegistryNoDuplicateNames, and
TestErrorCategoryString, which do not contain the _ separator between component and scenario.

AGENTS.md
protocol/common/error_registry_test.go[18-245]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
New tests in `protocol/common/error_registry_test.go` do not follow the `TestComponent_Scenario` naming convention (they omit the `_` separator).

## Issue Context
The repository standard requires consistent test naming for readability and discoverability.

## Fix Focus Areas
- protocol/common/error_registry_test.go[18-245]



3. JSON-RPC code clobbered 🐞
Description
ExtractNodeErrorDetails correctly extracts JSON-RPC .error.code/.error.message from an
rpcclient.HTTPError body, but then unconditionally overwrites errorCode with the HTTP status code.
This breaks downstream classification (ClassifyNodeErrorWithDetails → ClassifyError) for
gateways/proxies that return JSON-RPC error bodies with non-2xx HTTP statuses.
Code

protocol/chainlib/node_error_handler.go[R51-70]

+	// Extract error code from HTTP-wrapped JSON-RPC errors
+	if jsonMsg := TryRecoverNodeErrorFromClientError(nodeError); jsonMsg != nil && jsonMsg.Error != nil {
+		errorCode = jsonMsg.Error.Code
+		if jsonMsg.Error.Message != "" {
+			errorMessage = jsonMsg.Error.Message
+		}
+	}

-// NewUnsupportedMethodError creates a new UnsupportedMethodError with optional method name
-func NewUnsupportedMethodError(originalError error, methodName string) *UnsupportedMethodError {
-	return &UnsupportedMethodError{
-		originalError: originalError,
-		methodName:    methodName,
+	// Extract gRPC status code
+	if st, ok := status.FromError(nodeError); ok {
+		errorCode = int(st.Code())
+		if st.Message() != "" {
+			errorMessage = st.Message()
+		}
	}
-}

-// SolanaNonRetryableError represents a Solana error that should not be retried.
-// Currently covers error code -32009 ("missing in long-term storage") which indicates
-// the slot data is permanently unavailable.
-// Note: -32007 (ledger jump) IS retryable as another provider may have the data.
-type SolanaNonRetryableError struct {
-	originalError error
-}
+	// Extract HTTP status code
+	if httpError, ok := nodeError.(rpcclient.HTTPError); ok {
+		errorCode = httpError.StatusCode
+	}
Evidence
The function documents a priority order (JSON-RPC body first, HTTP status last) but the final HTTP
status extraction overwrites the previously extracted JSON-RPC code. ClassifyNodeErrorWithDetails
then uses this overwritten code for error classification.

protocol/chainlib/node_error_handler.go[47-73]
protocol/chainlib/node_error_handler.go[349-360]
protocol/chainlib/node_error_handler.go[92-99]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`ExtractNodeErrorDetails` extracts a JSON-RPC error code from an `rpcclient.HTTPError` body, but then overwrites it with `httpError.StatusCode`. This violates the function’s documented precedence and causes misclassification.

### Issue Context
This affects `ClassifyNodeErrorWithDetails`, which relies on `ExtractNodeErrorDetails` outputs to drive `common.ClassifyError`.

### Fix Focus Areas
- protocol/chainlib/node_error_handler.go[47-73]

### Suggested fix
Refactor the extraction logic to respect precedence, e.g.:
- If a JSON-RPC body was successfully parsed (and `.error` exists), **do not** overwrite `errorCode` with HTTP status.
- Prefer explicit control flow (`if/else if/else`) or guard the HTTP status assignment with `if errorCode == 0` (or another sentinel indicating “no structured code found”).
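
A minimal sketch of the precedence rule above, using only the standard library. The `httpError` type, `rpcErrorBody`, and `extractDetails` are hypothetical stand-ins (not the real `rpcclient.HTTPError` or `ExtractNodeErrorDetails`), and the gRPC branch is omitted for brevity. It returns early once a JSON-RPC body is parsed, rather than relying on an `errorCode == 0` sentinel, so a structured code is never confused with "nothing found".

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// httpError is a hypothetical stand-in for rpcclient.HTTPError.
type httpError struct {
	StatusCode int
	Body       []byte
}

func (e httpError) Error() string {
	return fmt.Sprintf("%d: %s", e.StatusCode, e.Body)
}

// rpcErrorBody models only the JSON-RPC fields needed for classification.
type rpcErrorBody struct {
	Error *struct {
		Code    int    `json:"code"`
		Message string `json:"message"`
	} `json:"error"`
}

// extractDetails prefers the JSON-RPC code from the body and falls back to
// the HTTP status only when no structured error could be parsed.
func extractDetails(err error) (code int, message string) {
	if err == nil {
		return 0, ""
	}
	message = err.Error()

	var he httpError
	if !errors.As(err, &he) {
		return 0, message
	}

	var body rpcErrorBody
	if json.Unmarshal(he.Body, &body) == nil && body.Error != nil {
		if body.Error.Message != "" {
			message = body.Error.Message
		}
		return body.Error.Code, message // structured code wins
	}
	return he.StatusCode, message // fallback: HTTP status
}

// demoCodes runs one JSON-RPC body and one non-JSON body through the extractor.
func demoCodes() []int {
	withBody := httpError{StatusCode: 429, Body: []byte(`{"jsonrpc":"2.0","id":1,"error":{"code":-32005,"message":"limit exceeded"}}`)}
	plain := httpError{StatusCode: 502, Body: []byte(`<html>Bad Gateway</html>`)}

	c1, _ := extractDetails(withBody)
	c2, _ := extractDetails(fmt.Errorf("relay failed: %w", plain))
	return []int{c1, c2}
}

func main() {
	fmt.Println(demoCodes()) // prints "[-32005 502]"
}
```

The early return keeps the precedence visible in the control flow, which is the first of the two options suggested above.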




Remediation recommended

4. Duplicate failure logging 🐞
Description
RelayProcessor.buildFailureResult calls LogCodedError and then immediately returns
utils.LavaFormatError with the same description/error, producing duplicate error logs for one
failure. Because LogCodedError also emits the Prometheus metric, this path can double-increment
lava_errors_total for a single relay failure.
Code

protocol/relaycore/relay_processor.go[R861-869]

+	// Log with classified error code for metrics/observability
+	if bestLavaError != nil {
+		chainID, _ := rp.chainIdAndApiInterfaceGetter.GetChainIdAndApiInterface()
+		common.LogCodedError("failed relay, insufficient results", processingError, bestLavaError,
+			chainID, 0, "", utils.LogAttr("GUID", rp.guid))
+	}
+
	return returnedResult, utils.LavaFormatError("failed relay, insufficient results", processingError, utils.LogAttr("GUID", rp.guid))
}
Evidence
buildFailureResult explicitly logs via LogCodedError and then logs again via LavaFormatError;
LogCodedError itself logs by calling utils.LavaFormatError internally (and emits the metric).

protocol/relaycore/relay_processor.go[833-869]
protocol/common/error_logging.go[65-88]
utils/lavalog.go[446-452]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`buildFailureResult` logs the same failure twice: once via `common.LogCodedError` and again via `utils.LavaFormatError`. This duplicates logs and can double-count `lava_errors_total`.

### Issue Context
`LogCodedError` already logs (via `utils.LavaFormatError`) and emits the error metric.

### Fix Focus Areas
- protocol/relaycore/relay_processor.go[833-869]
- protocol/common/error_logging.go[65-88]

### Suggested fix
Choose a single logging path:
- Prefer returning the `LogCodedError(...)` result directly (include GUID as an attribute there) and remove the subsequent `utils.LavaFormatError(...)` call.
 - i.e., replace the current `LogCodedError(...)` + `return ..., LavaFormatError(...)` with `return ..., LogCodedError(...)`.
Or, if you must keep `LavaFormatError` for legacy behavior:
- Remove the `LogCodedError` call and use `common.EmitErrorMetric(...)` to increment metrics without a second log.



5. Metrics errors ignored 🐞
Description
InitErrorMetrics ignores Prometheus registration errors except AlreadyRegisteredError, so if
registration fails for another reason the system silently loses lava_errors_total. The callback
still increments a counter that may not be registered in the default registry, making missing
metrics hard to detect.
Code

protocol/metrics/error_metrics.go[R24-31]

+		// Best-effort registration — if already registered (e.g., in tests), reuse.
+		if err := prometheus.Register(counter); err != nil {
+			if existing, ok := err.(prometheus.AlreadyRegisteredError); ok {
+				if reused, ok := existing.ExistingCollector.(*prometheus.CounterVec); ok {
+					counter = reused
+				}
+			}
+		}
Evidence
The code only handles AlreadyRegisteredError and drops all other registration failures on the floor,
without logging or failing fast.

protocol/metrics/error_metrics.go[15-45]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`InitErrorMetrics` silently ignores Prometheus registration failures (other than AlreadyRegistered), which can lead to missing `lava_errors_total` with no signal.

### Issue Context
This runs during server init and is foundational observability; silent failure makes production debugging harder.

### Fix Focus Areas
- protocol/metrics/error_metrics.go[24-31]

### Suggested fix
On `prometheus.Register(counter)` error:
- If `AlreadyRegisteredError`, reuse the existing collector (as today).
- Otherwise, log and/or fail fast (panic/return error) so startup makes the issue visible.



6. Errno unwrap too narrow 🐞
Description
DetectConnectionError’s net.OpError errno path uses errors.As into *syscall.Errno, which only
matches when an error in the chain is a pointer to Errno. syscall.Errno is normally carried by
value (often wrapped in *os.SyscallError), so this path can miss common syscall error shapes,
leaving Layer-3 detection unreliable unless the string fallback happens to match.
Code

protocol/common/error_classifier.go[R518-531]

+	// Layer 3: net.OpError with a raw syscall errno.
+	var opErr *net.OpError
+	if errors.As(err, &opErr) {
+		var syscallErr *syscall.Errno
+		if errors.As(opErr.Err, &syscallErr) {
+			switch *syscallErr {
+			case syscall.ECONNREFUSED:
+				return LavaErrorConnectionRefused
+			case syscall.ECONNRESET:
+				return LavaErrorConnectionReset
+			case syscall.ENETUNREACH, syscall.EHOSTUNREACH:
+				return LavaErrorNetworkUnreachable
+			}
+		}
Evidence
The code only attempts to extract *syscall.Errno, and tests construct pointer-shaped errno values
(e.g., Err: &errno) to exercise this path; other errno shapes won’t be covered by Layer-3 and will
depend on brittle string matching.

protocol/common/error_classifier.go[518-533]
protocol/rpcsmartrouter/error_mapper_test.go[21-38]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
Layer-3 connection detection in `DetectConnectionError` only extracts `*syscall.Errno`. This is unnecessarily strict and can miss syscall errors that aren’t pointer-shaped.

### Issue Context
This path is intended to be a structured fallback after string matching; it should be robust across typical Go errno wrapping shapes.

### Fix Focus Areas
- protocol/common/error_classifier.go[518-533]

### Suggested fix
Update the errno extraction to handle both pointer and value forms, e.g.:
- Try `var errno syscall.Errno; errors.As(opErr.Err, &errno)` and switch on `errno`.
- Optionally also keep a pointer-form check for compatibility.
Add/adjust a unit test that passes a `net.OpError{Err: syscall.ECONNREFUSED}` (non-pointer) and asserts it’s detected.




Comment thread error-registry-design.md
Comment thread protocol/common/error_registry_test.go Outdated
Comment thread protocol/chainlib/node_error_handler.go Outdated
@codecov

codecov Bot commented Apr 9, 2026


Codecov Report

❌ Patch coverage is 62.18612% with 256 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| protocol/rpcsmartrouter/rpcsmartrouter_server.go | 0.00% | 48 Missing ⚠️ |
| protocol/rpcconsumer/rpcconsumer_server.go | 9.75% | 35 Missing and 2 partials ⚠️ |
| protocol/common/error_registry.go | 83.12% | 27 Missing ⚠️ |
| protocol/rpcprovider/rpcprovider_server.go | 0.00% | 27 Missing ⚠️ |
| protocol/common/error_classifier.go | 80.50% | 21 Missing and 2 partials ⚠️ |
| protocol/metrics/error_metrics.go | 0.00% | 22 Missing ⚠️ |
| protocol/relaycore/relay_processor.go | 25.92% | 13 Missing and 7 partials ⚠️ |
| protocol/chainlib/node_error_handler.go | 83.95% | 13 Missing ⚠️ |
| protocol/common/error_logging.go | 68.29% | 13 Missing ⚠️ |
| protocol/lavasession/used_providers.go | 50.00% | 6 Missing ⚠️ |
| ... and 10 more | | |

| Flag | Coverage Δ |
|---|---|
| consensus | 8.96% <80.35%> (+0.22%) ⬆️ |
| protocol | 34.67% <62.18%> (+0.46%) ⬆️ |


| Files with missing lines | Coverage Δ |
|---|---|
| ...lib/chainproxy/rpcInterfaceMessages/restMessage.go | 48.75% <100.00%> (-0.64%) ⬇️ |
| protocol/chainlib/grpc.go | 45.94% <100.00%> (ø) |
| protocol/chainlib/jsonRPC.go | 44.84% <100.00%> (ø) |
| protocol/chainlib/rest.go | 42.99% <100.00%> (ø) |
| protocol/chainlib/tendermintRPC.go | 40.58% <100.00%> (ø) |
| protocol/common/endpoints.go | 0.00% <ø> (ø) |
| protocol/common/legacy_error_mapping.go | 100.00% <100.00%> (ø) |
| protocol/lavasession/consumer_types.go | 74.10% <100.00%> (ø) |
| ...otocol/rpcconsumer/consumer_relay_state_machine.go | 76.58% <100.00%> (ø) |
| protocol/rpcprovider/provider_state_machine.go | 62.08% <100.00%> (+1.15%) ⬆️ |
| ... and 20 more | |


@NadavLevi NadavLevi force-pushed the feature/error-registry-poc branch from fec73ab to 21dc454 Compare April 9, 2026 11:39
@github-actions

github-actions Bot commented Apr 9, 2026


Test Results

0 tests  ±0   0 ✅ ±0   0s ⏱️ ±0s
0 suites ±0   0 💤 ±0 
7 files   ±0   0 ❌ ±0 

Results for commit 2c64c23. ± Comparison against base commit 5c22939.

♻️ This comment has been updated with latest results.

@avitenzer avitenzer left a comment

Collaborator


Location: lava_errors_total Prometheus counter definition
Issue: Labels {error_code, error_name, error_category, retryable, chain_id} — with 100+ codes × N chains, cardinality could explode. error_name is redundant with error_code (1:1 mapping).
Suggestion: Drop error_name label from the counter (keep it in structured logs). Consider bucketing chain_id or using an exemplar instead.

Comment thread protocol/common/error_registry.go
Comment thread protocol/common/error_logging.go
Comment thread protocol/metrics/error_metrics.go
Comment thread protocol/chainlib/node_error_handler.go
@NadavLevi NadavLevi force-pushed the feature/error-registry-poc branch 2 times, most recently from 86ebf88 to c99fb4a Compare April 9, 2026 12:39
@NadavLevi NadavLevi requested a review from avitenzer April 9, 2026 12:39
…ics, and observability

Introduces a centralized error classification system that replaces ad-hoc
error handling across the relay path with structured, two-tier classification
(chain-specific Tier 2, generic/transport Tier 1) and Prometheus metrics.

Core components:
- Error registry with named error codes (error_codes.go), matchers
  (error_classifier.go), and chain family mappings for EVM, Solana,
  Cosmos, Bitcoin, Starknet, NEAR, and Aptos
- Structured logging via LogCodedError/LogCodedWarning with error_code,
  error_name, error_category, retryable, and chain_id fields
- lava_errors_total Prometheus counter with full label cardinality
- Legacy sdkerrors → LavaError mapping for backward compatibility
- DetectConnectionError for connection-level failures (timeout, refused,
  reset, GOAWAY, RST_STREAM, Envoy connection termination, ECONNRESET)

Classification coverage:
- JSON-RPC standard codes (-32700 to -32000) and EIP-1474 codes
- HTTP status codes (4xx, 5xx) including Cloudflare 520-530
- gRPC status codes (Unimplemented, Unavailable)
- Chain-specific errors: EVM tx errors, Solana/Bitcoin/Starknet/NEAR
  node errors, Cosmos tx errors
- Transport-level: connection reset, truncated JSON, rate limiting,
  method not found/supported variants

Integration:
- All relay path error handlers (JSON-RPC, REST, gRPC, Tendermint)
  classify and log with chain_id via error handler structs
- ResultsManager extracts JSON-RPC error codes from response bodies
  for accurate classification with chain_id context
- Smart router direct RPC path classifies with chain_id
- Provider session errors log with chain_id

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@NadavLevi NadavLevi force-pushed the feature/error-registry-poc branch from c99fb4a to 2c64c23 Compare April 9, 2026 13:15
@nimrod-teich nimrod-teich merged commit bec08b8 into main Apr 9, 2026
30 checks passed
@nimrod-teich nimrod-teich deleted the feature/error-registry-poc branch April 9, 2026 14:42


3 participants