fix(perps): suppress transient HL SDK errors from Sentry #29642
The Hyperliquid SDK update from 0.30.2 to 0.32.2 (#28672) brought in @nktkas/rews v2, which surfaces reconnect failures that v1 silently hung on. Combined with the new spotState WS path from the 7.73.2 cherry-picks, this produced ~15k Sentry events/week on 7.74.1 from errors the SDK already catches and recovers from automatically. Generalize #isTransientAssetCtxsError to #isTransientSdkError covering WebSocketRequestError, ReconnectingWebSocketError (any code), and TimeoutError ("Signal timed out") in addition to the reconnect-churn cases. Apply at the single chokepoint #logErrorUnlessClearing so all four catch sites (refreshSpotState, createUserDataSubscription webData3 / HIP-3 / webData2) gain the gate. Transient errors downgrade to debugLogger.log for local diagnosis. Real failures (non-transient errors, "Failed to fetch market data", PerpsController init failures) continue to capture as before. The broader matcher means the assetCtxs retry path may retry on slightly more error types — equal-or-more retries, never less.
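A minimal TypeScript sketch of what a generalized matcher like `#isTransientSdkError` could look like. It is not the PR's code: matching on `error.name` (rather than `instanceof` against the SDK's classes) is an assumption, the constant and function names are hypothetical, and the pre-existing reconnect-churn cases are omitted because the description does not spell them out.

```typescript
// Hypothetical sketch, not the PR's implementation. Errors are matched by
// `name` so the sketch runs without importing the SDK or @nktkas/rews;
// whether those classes set `name` exactly this way is an assumption.
const TRANSIENT_ERROR_NAMES: ReadonlySet<string> = new Set([
  'WebSocketRequestError', // rews queue rejection when the socket closes
  'ReconnectingWebSocketError', // any code: RECONNECTION_LIMIT, TERMINATED_BY_USER, UNKNOWN_ERROR
]);

function isTransientSdkError(error: unknown): boolean {
  if (!(error instanceof Error)) {
    return false; // non-Error throwables are never treated as transient
  }
  if (TRANSIENT_ERROR_NAMES.has(error.name)) {
    return true;
  }
  // Per the PR, the AbortSignal.timeout shim aborts with a TimeoutError
  // whose message is "Signal timed out"; only that variant is transient here.
  return (
    error.name === 'TimeoutError' &&
    error.message.includes('Signal timed out')
  );
}
```

For example, `isTransientSdkError(new TypeError('x'))` is `false`, so a `TypeError` would still be forwarded to Sentry.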
CLA Signature Action: All authors have signed the CLA.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit ca80f4b.
…ror debug output

The debug log in `#logErrorUnlessClearing` interpolated `context.context.name`, which always resolved to the constant `'HyperLiquidSubscriptionService'`. Changed to `context.context.data.method`, which carries the actual operation name (e.g. `'ensureActiveAssetSubscription'`).
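A hypothetical sketch of the chokepoint gate, showing why one logging function is enough: every catch site calls it, so a single `isTransient` check covers all of them. The `ErrorContext` shape mirrors the `context.context.data.method` path described in this commit; the `isClearing` flag (inferred from the function name), the injected `LoggerDeps`, and all other names are assumptions for illustration.

```typescript
// Hypothetical sketch of the gate in #logErrorUnlessClearing, not the PR's code.
interface ErrorContext {
  context: {
    name: string; // constant service name, e.g. 'HyperLiquidSubscriptionService'
    data: { method: string }; // failing operation, e.g. 'ensureActiveAssetSubscription'
  };
}

interface LoggerDeps {
  isTransient: (error: unknown) => boolean;
  debugLog: (message: string, error: unknown) => void;
  captureToSentry: (error: unknown, context: ErrorContext) => void;
}

function logErrorUnlessClearing(
  error: unknown,
  context: ErrorContext,
  isClearing: boolean,
  deps: LoggerDeps,
): void {
  if (isClearing) {
    return; // assumed: errors raised during teardown are dropped entirely
  }
  if (deps.isTransient(error)) {
    // Local-only diagnostics. data.method names the operation, whereas
    // context.name would always print the service name.
    deps.debugLog(`[Perps transient SDK error] ${context.context.data.method}`, error);
    return;
  }
  deps.captureToSentry(error, context); // real failures still reach Sentry
}
```

Injecting the matcher and sinks keeps the routing rule testable without a Sentry client.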
🔍 Smart E2E Test Selection
E2E Test Selection:
This is a pure error-handling/logging improvement with no functional behavior changes to the perps subscription service itself. The perps trading functionality (price subscriptions, asset context updates, etc.) is unaffected. Risk is low. SmokePerps is selected as the primary tag since this is a perps-specific service. Per tag dependencies: SmokeWalletPlatform (Perps is a section inside Trending) and SmokeConfirmations (Add Funds deposits are on-chain transactions) are required companion tags.
Performance Test Selection:
Description
The Hyperliquid SDK update from 0.30.2 to 0.32.2 (PR #28672, shipped in 7.74.0) brought in `@nktkas/rews` v2, which now properly surfaces reconnect failures that v1 silently swallowed. Combined with the new spotState WS path from the 7.73.2 cherry-picks (b5947ba9f9, 985830d501), this produced ~15k Sentry events/week on 7.74.1 from errors that the SDK already catches and recovers from automatically.

The events have no user-visible impact: they're tagged `mechanism: generic, handled: yes` and the SDK reconnects on its own. They drown out genuine signal like `Failed to fetch market data` (which we want to keep capturing).

This PR generalizes the existing `#isTransientAssetCtxsError` helper in `HyperLiquidSubscriptionService` to `#isTransientSdkError`, broadens its matching to cover all four transient SDK error classes, and gates `#logErrorUnlessClearing` so transient errors are downgraded to `debugLogger.log` (kept for local diagnosis) instead of forwarded to Sentry.

Single chokepoint: all four affected catch sites (`refreshSpotState`, `createUserDataSubscription` webData3 / HIP-3 / webData2) already routed through `#logErrorUnlessClearing`, so no per-site changes are required.

Suppressed error classes (only when caught and recovered by the SDK):
- `WebSocketRequestError` (rews queue rejection on close)
- `ReconnectingWebSocketError` (`RECONNECTION_LIMIT`, `TERMINATED_BY_USER`, `UNKNOWN_ERROR`)
- `TimeoutError: Signal timed out` (`AbortSignal.timeout` shim)

Still captured:
- Non-transient errors (`TypeError`, `RangeError`, etc.)
- `Failed to fetch market data` with `wsState=disconnected` (5GMV): real downstream user-visible signal
- `CLIENT_NOT_INITIALIZED` (5R41): real init failures
- `ValiError` (TAT-3093 path, fixed separately by #29420, "fix(perps): non-EVM address passed to HyperLiquid validator via usePerpsPositionForAsset")

The retry decision in `createAssetCtxsSubscription` now uses the same broader matcher. This means the assetCtxs path may retry on slightly more error types (`TimeoutError`, `ReconnectingWebSocketError`) before giving up: equal-or-more retries, never less.

Changelog
CHANGELOG entry: null
Related issues
Related Sentry issues: METAMASK-MOBILE-5SQY, METAMASK-MOBILE-5SEB, METAMASK-MOBILE-5RV9, METAMASK-MOBILE-5SG9, METAMASK-MOBILE-5SGD, METAMASK-MOBILE-5SGK, METAMASK-MOBILE-4FE2
Manual testing steps
Verification:
- `yarn jest app/controllers/perps/services/HyperLiquidSubscriptionService.test.ts --no-coverage`: 118 passed, 4 skipped, 0 failed
- `npx eslint app/controllers/perps/services/HyperLiquidSubscriptionService.ts`: clean
- `NODE_OPTIONS='--max-old-space-size=8192' npx tsc --noEmit --incremental`: exit 0
- `adb shell svc data disable && sleep 3 && adb shell svc data enable` (Android) while in Perps; verify the Metro log shows `[Perps transient SDK error]` entries and Sentry receives no new events for the listed classes.

Screenshots/Recordings
N/A — internal observability change with no UI surface. Verification is via Sentry dashboard delta after release.
Before
After the 7.74.1 release, `feature:perps` Sentry volume on Android was ~77k events/week. ~15k/week is attributable to the suppressed transient classes (in addition to ~54k/week of `AbortError` already filtered by #28953 + #29344, pending release).

After
Expected
`feature:perps` Sentry volume after this PR + the unreleased `AbortError` fixes ship: ~4–8k events/week (close to the pre-7.74.x baseline), composed of genuine error classes only.

Pre-merge author checklist
Performance checks (if applicable)
Pre-merge reviewer checklist
Note
Medium Risk
Changes central error logging/diagnostics behavior and retry conditions for WebSocket subscriptions; misclassification could hide actionable errors or alter reconnect behavior.
Overview
Suppresses transient Hyperliquid SDK/transport errors from being forwarded to Sentry by extending the existing transient-error detection into a new `#isTransientSdkError` matcher and short-circuiting `#logErrorUnlessClearing` to `debugLogger.log` with method context.

Broadens the transient matcher to cover additional reconnect/timeout error signatures (e.g., `ReconnectingWebSocketError`, `TimeoutError: Signal timed out`) and updates `createAssetCtxsSubscription` retry logic to use this unified transient classification.

Adds a unit test ensuring transient SDK errors log the method name context and are not sent to the Sentry logger.
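The "equal-or-more retries, never less" property follows from how the matcher is widened: OR-ing the previous predicate with new cases can only accept more errors. A hypothetical sketch of that shape; `broaden`, `shouldRetry`, the attempt budget, and name-based matching are assumptions, not the actual `createAssetCtxsSubscription` code.

```typescript
// Hypothetical sketch of a retry decision that reuses a widened matcher.
type ErrorMatcher = (error: unknown) => boolean;

// Error classes this PR adds to the transient set (matched by name: an assumption).
const isAddedTransient: ErrorMatcher = (e) =>
  e instanceof Error &&
  (e.name === 'WebSocketRequestError' ||
    e.name === 'ReconnectingWebSocketError' ||
    (e.name === 'TimeoutError' && e.message.includes('Signal timed out')));

// OR-ing with the previous matcher yields a superset: anything the old
// matcher accepted is still accepted, so retries can only increase.
function broaden(previous: ErrorMatcher): ErrorMatcher {
  return (e) => previous(e) || isAddedTransient(e);
}

// Retry only while attempts remain and the failure looks transient.
function shouldRetry(
  error: unknown,
  attempt: number,
  maxAttempts: number,
  isTransient: ErrorMatcher,
): boolean {
  return attempt < maxAttempts && isTransient(error);
}
```

With `attempt = 1` and `maxAttempts = 3`, a `TimeoutError: Signal timed out` is retried under `broaden(previous)` even if `previous` rejects it, while a `TypeError` is not retried under either matcher.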
Reviewed by Cursor Bugbot for commit e882931. Bugbot is set up for automated code reviews on this repo.