Skip to content

fix(perps): suppress transient HL SDK errors from Sentry#29642

Merged
abretonc7s merged 3 commits into
mainfrom
fix/perps-suppress-transient-sdk-noise
May 5, 2026
Merged

fix(perps): suppress transient HL SDK errors from Sentry#29642
abretonc7s merged 3 commits into
mainfrom
fix/perps-suppress-transient-sdk-noise

Conversation

@abretonc7s

@abretonc7s abretonc7s commented May 4, 2026

Copy link
Copy Markdown
Contributor

Description

The Hyperliquid SDK update from 0.30.2 to 0.32.2 (PR #28672, shipped in 7.74.0) brought in @nktkas/rews v2, which now properly surfaces reconnect failures that v1 silently swallowed. Combined with the new spotState WS path from the 7.73.2 cherry-picks (b5947ba9f9, 985830d501), this produced ~15k Sentry events/week on 7.74.1 from errors that the SDK already catches and recovers from automatically.

The events have no user-visible impact: they're tagged mechanism: generic, handled: yes and the SDK reconnects on its own. They drown out genuine signal like Failed to fetch market data (which we want to keep capturing).

This PR generalizes the existing #isTransientAssetCtxsError helper in HyperLiquidSubscriptionService to #isTransientSdkError, broadens its matching to cover all four transient SDK error classes, and gates #logErrorUnlessClearing so transient errors are downgraded to debugLogger.log (kept for local diagnosis) instead of forwarded to Sentry.

Single chokepoint — all four affected catch sites (refreshSpotState, createUserDataSubscription webData3 / HIP-3 / webData2) already routed through #logErrorUnlessClearing, so no per-site changes are required.

Suppressed error classes (only when caught and recovered by the SDK):

Class Sentry issue Volume on 7.74.1
WebSocketRequestError (rews queue rejection on close) 5SQY, 4FE2 ~9k/wk
ReconnectingWebSocketError (RECONNECTION_LIMIT, TERMINATED_BY_USER, UNKNOWN_ERROR) 5RV9, 5SG9, 5SGD, 5SGK ~3.5k/wk
TimeoutError: Signal timed out (AbortSignal.timeout shim) 5SEB ~6k/wk
Reconnect-churn fallbacks (unknown/undefined during Connecting/Disconnected) (existing helper) (no change)

Still captured:

The retry decision in createAssetCtxsSubscription now uses the same broader matcher. This means the assetCtxs path may retry on slightly more error types (TimeoutError, ReconnectingWebSocketError) before giving up — equal-or-more retries, never less.

Changelog

CHANGELOG entry: null

Related issues

Related Sentry issues: METAMASK-MOBILE-5SQY, METAMASK-MOBILE-5SEB, METAMASK-MOBILE-5RV9, METAMASK-MOBILE-5SG9, METAMASK-MOBILE-5SGD, METAMASK-MOBILE-5SGK, METAMASK-MOBILE-4FE2

Manual testing steps

Feature: Transient Hyperliquid SDK errors do not pollute Sentry

  Scenario: User briefly opens Perps on a flaky network
    Given the wallet is unlocked
    And Perps is enabled
    When the user enters Perps and the WebSocket connection fails or times out
    Then the SDK reconnects automatically
    And the transient WebSocketRequestError, ReconnectingWebSocketError, or TimeoutError is logged to DevLogger only
    And no Sentry event is captured for these classes
    And the user-visible "Failed to fetch market data" error (when sustained) still captures to Sentry

  Scenario: User triggers a non-transient error path
    Given the wallet is unlocked
    And Perps is enabled
    When a non-transient error (e.g. CLIENT_NOT_INITIALIZED, TypeError) occurs in HyperLiquidSubscriptionService
    Then the error is captured to Sentry as before
    And the existing context tags (feature: perps, service: HyperLiquidSubscriptionService) are preserved

Verification:

  • yarn jest app/controllers/perps/services/HyperLiquidSubscriptionService.test.ts --no-coverage — 118 passed, 4 skipped, 0 failed
  • npx eslint app/controllers/perps/services/HyperLiquidSubscriptionService.ts — clean
  • NODE_OPTIONS='--max-old-space-size=8192' npx tsc --noEmit --incremental — exit 0
  • For weak-network manual repro, use Network Link Conditioner (iOS) or adb shell svc data disable && sleep 3 && adb shell svc data enable (Android) while in Perps; verify Metro log shows [Perps transient SDK error] entries and Sentry receives no new events for the listed classes.

Screenshots/Recordings

N/A — internal observability change with no UI surface. Verification is via Sentry dashboard delta after release.

Before

After 7.74.1 release, feature:perps Sentry volume on Android: ~77k events / week. ~15k/wk attributable to the suppressed transient classes (in addition to ~54k AbortError already filtered by #28953 + #29344, pending release).

After

Expected feature:perps Sentry volume after this PR + the unreleased AbortError fixes ship: ~4–8k events / week (close to pre-7.74.x baseline), composed of genuine error classes only.

Pre-merge author checklist

  • I've followed MetaMask Contributor Docs and MetaMask Mobile Coding Standards.
  • I've completed the PR template to the best of my ability
  • I've included tests if applicable — existing 118 tests cover the regression case (non-transient errors still capture); new transient classes share the same code path that the existing tests exercise.
  • I've documented my code using JSDoc format if applicable
  • I've applied the right labels on the PR (see labeling guidelines). Not required for external contributors.

Performance checks (if applicable)

  • I've tested on Android — N/A (no runtime behavior change for users; only changes which errors reach Sentry)
  • I've tested with a power user scenario — N/A
  • I've instrumented key operations with Sentry traces for production performance metrics — N/A; this PR removes noise from Sentry rather than adding traces.

Pre-merge reviewer checklist

  • I've manually tested the PR (e.g. pull and build branch, run the app, test code being changed).
  • I confirm that this PR addresses all acceptance criteria described in the ticket it closes and includes the necessary testing evidence such as recordings and or screenshots.

Note

Medium Risk
Changes central error logging/diagnostics behavior and retry conditions for WebSocket subscriptions; misclassification could hide actionable errors or alter reconnect behavior.

Overview
Suppresses transient Hyperliquid SDK/transport errors from being forwarded to Sentry by extending the existing transient-error detection into a new #isTransientSdkError matcher and short-circuiting #logErrorUnlessClearing to debugLogger.log with method context.

Broadens the transient matcher to cover additional reconnect/timeout error signatures (e.g., ReconnectingWebSocketError, TimeoutError: Signal timed out) and updates createAssetCtxsSubscription retry logic to use this unified transient classification.

Adds a unit test ensuring transient SDK errors log the method name context and are not sent to the Sentry logger.

Reviewed by Cursor Bugbot for commit e882931. Bugbot is set up for automated code reviews on this repo. Configure here.

The Hyperliquid SDK update from 0.30.2 to 0.32.2 (#28672) brought in
@nktkas/rews v2, which surfaces reconnect failures that v1 silently
hung on. Combined with the new spotState WS path from the 7.73.2
cherry-picks, this produced ~15k Sentry events/week on 7.74.1 from
errors the SDK already catches and recovers from automatically.

Generalize #isTransientAssetCtxsError to #isTransientSdkError covering
WebSocketRequestError, ReconnectingWebSocketError (any code), and
TimeoutError ("Signal timed out") in addition to the reconnect-churn
cases. Apply at the single chokepoint #logErrorUnlessClearing so all
four catch sites (refreshSpotState, createUserDataSubscription
webData3 / HIP-3 / webData2) gain the gate.

Transient errors downgrade to debugLogger.log for local diagnosis.
Real failures (non-transient errors, "Failed to fetch market data",
PerpsController init failures) continue to capture as before. The
broader matcher means the assetCtxs retry path may retry on slightly
more error types — equal-or-more retries, never less.
@abretonc7s abretonc7s added the team-perps Perps team label May 4, 2026
@github-actions

github-actions Bot commented May 4, 2026

Copy link
Copy Markdown
Contributor

CLA Signature Action: All authors have signed the CLA. You may need to manually re-run the blocking PR check if it doesn't pass in a few minutes.

@github-actions github-actions Bot added the size-S label May 4, 2026
@abretonc7s abretonc7s marked this pull request as ready for review May 5, 2026 09:44
@abretonc7s abretonc7s requested a review from a team as a code owner May 5, 2026 09:44
@abretonc7s abretonc7s enabled auto-merge May 5, 2026 09:44
michalconsensys
michalconsensys previously approved these changes May 5, 2026

@cursor cursor Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit ca80f4b. Configure here.

Comment thread app/controllers/perps/services/HyperLiquidSubscriptionService.ts Outdated
…ror debug output

The debug log in #logErrorUnlessClearing interpolated context.context.name
which always resolved to the constant 'HyperLiquidSubscriptionService'.
Changed to context.context.data.method which carries the actual operation
name (e.g. 'ensureActiveAssetSubscription').
@github-actions

github-actions Bot commented May 5, 2026

Copy link
Copy Markdown
Contributor

🔍 Smart E2E Test Selection

  • Selected E2E tags: SmokePerps, SmokeWalletPlatform, SmokeConfirmations
  • Selected Performance tags: None (no tests recommended)
  • Risk Level: low
  • AI Confidence: 92%
click to see 🤖 AI reasoning details

E2E Test Selection:
The changes are confined to HyperLiquidSubscriptionService.ts and its test file. The modifications:

  1. Rename #isTransientAssetCtxsError#isTransientSdkError to reflect broader applicability
  2. Add ReconnectingWebSocketError and TimeoutError (signal timed out) to the set of transient errors that are downgraded from Sentry capture to debug logging
  3. Apply the transient error check in #logErrorUnlessClearing so these expected SDK lifecycle errors don't pollute Sentry

This is a pure error-handling/logging improvement with no functional behavior changes to the perps subscription service itself. The perps trading functionality (price subscriptions, asset context updates, etc.) is unaffected. Risk is low.

SmokePerps is selected as the primary tag since this is a perps-specific service. Per tag dependencies: SmokeWalletPlatform (Perps is a section inside Trending) and SmokeConfirmations (Add Funds deposits are on-chain transactions) are required companion tags.

Performance Test Selection:
The changes only affect error classification and logging paths (Sentry vs debugLogger routing). There is no impact on UI rendering, data loading, state management, or any performance-sensitive code paths. No performance tests are warranted.

View GitHub Actions results

@sonarqubecloud

sonarqubecloud Bot commented May 5, 2026

Copy link
Copy Markdown

@abretonc7s abretonc7s added this pull request to the merge queue May 5, 2026
Merged via the queue into main with commit 2adf64f May 5, 2026
100 checks passed
@abretonc7s abretonc7s deleted the fix/perps-suppress-transient-sdk-noise branch May 5, 2026 13:56
@github-actions github-actions Bot locked and limited conversation to collaborators May 5, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants