fix: add retry logic for transient wallet errors in parallel store creation #1521

Merged
TheLastCicada merged 18 commits into develop from fix/parallel-store-creation-retry
Mar 12, 2026
@TheLastCicada TheLastCicada commented Mar 6, 2026

Summary

  • `_createStoresInParallel` had no retry logic for transient wallet errors, causing V2 org creation to fail permanently when the Chia wallet's DataLayer wallet was in a transitional state
  • Added per-store retry logic (10 attempts, 30s delay) matching the existing pattern in `addV2ToExistingGovernanceBody`. The following are treated as retryable transient errors: "DataLayer Wallet already exists", "DataLayerWallet not available", "Wallet needs to be fully synced", and "No spendable coins"
  • Creating 3+ stores in parallel compounds the Chia wallet race condition in which `get_dl_wallet()` fails even though the wallet exists, which makes this retry logic critical
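
The per-store retry described in the summary can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from this PR: `createStoreWithRetry`, `createStoresInParallel` (without the underscore), `createStore`, `isTransient`, and the injectable `wait` helper are hypothetical names. Only the attempt count and delay come from the description.

```javascript
// Hypothetical sketch of per-store retry; not the PR's actual implementation.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// createStore: assumed async function performing one create_new_dl call.
// isTransient: assumed predicate deciding whether an error is retryable.
async function createStoreWithRetry(createStore, isTransient, options = {}) {
  const { maxAttempts = 10, delayMs = 30_000, wait = sleep } = options;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await createStore();
    } catch (error) {
      // Non-transient errors and the final attempt fail immediately.
      if (!isTransient(error) || attempt === maxAttempts) throw error;
      console.warn(
        `Transient wallet error (attempt ${attempt}/${maxAttempts}): ${error.message}`,
      );
      await wait(delayMs);
    }
  }
  // Only reachable when maxAttempts is less than 1.
  throw new Error('maxAttempts must be at least 1');
}

// Each store retries independently, so one slow store does not restart the others.
function createStoresInParallel(creators, isTransient, options) {
  return Promise.allSettled(
    creators.map((create) => createStoreWithRetry(create, isTransient, options)),
  );
}
```

Passing the delay function in as an option keeps the loop testable without waiting 30 seconds per attempt.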

Context

Observed in CI run #22746247842: all 3 parallel `create_new_dl` RPC calls failed at the same moment with "DataLayer Wallet already exists for this key". This is a race condition in the Chia wallet in which the DataLayer wallet exists but is not yet accessible via `get_dl_wallet()`. The org creation failed permanently with no retry, even though the wallet recovered moments later.

Test plan

  • V2 integration tests pass (simulator mode — no wallet involved, fast path unchanged)
  • V2 live API tests pass (exercises the actual retry path against real Chia wallet)
  • Verify retry logging appears in CADT logs when transient wallet errors occur

Note

Medium Risk
Touches wallet/DataLayer operational flows (store creation, syncing, coin splitting, mirror creation), where timing and balance edge cases can affect production behavior. Changes are bounded to retry/timeout/guard logic but could impact org-creation latency and background tasks under failure modes.

Overview
Hardens V2 org creation against transient Chia wallet/DataLayer states. Parallel store creation now retries per-store (10 attempts, 30s delay) when known transient wallet errors occur, and reuses a centralized transient-error matcher.

Adds guardrails around DataLayer operations. syncService now enforces a 10-minute sync wait timeout with a configurable polling interval, and mirror creation’s “insufficient funds” path downgrades to a warning when a coin split is in progress.

Refactors coin management to avoid dust/spam-filter issues and expose split state. Coin size is derived from DEFAULT_COIN_AMOUNT + fee with a dust floor, split execution is centralized with a splitInProgress flag exported via isSplitInProgress(), and related integration tests/docs are updated; Dependabot pip reviewers are also adjusted.
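
One way the split-state flag and the mirror-creation log downgrade could fit together is sketched below. Only `splitInProgress` and `isSplitInProgress()` are named in the overview. `executeSplit`, `runSplit`, and `insufficientFundsLogLevel` are hypothetical names introduced for illustration.

```javascript
// Hypothetical sketch: module-level flag that is set while a coin split runs.
let splitInProgress = false;

const isSplitInProgress = () => splitInProgress;

// runSplit is an assumed async function that performs the actual split.
async function executeSplit(runSplit) {
  if (splitInProgress) return false; // a split is already running; do not start another
  splitInProgress = true;
  try {
    await runSplit();
    return true;
  } finally {
    splitInProgress = false; // cleared even if the split throws
  }
}

// Mirror creation can choose a log level for "insufficient funds" from the flag.
function insufficientFundsLogLevel() {
  return isSplitInProgress() ? 'warn' : 'error';
}
```

Clearing the flag in `finally` means a failed split cannot leave later balance errors permanently downgraded to warnings.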

Written by Cursor Bugbot for commit ffbf4a3.

The reclaim-home endpoint documentation was present in both V1 and V2
API docs but missing from their tables of contents, making it
undiscoverable when browsing.
V1: add Get Organization Creation Status, Commit all projects in
STAGING, and Commit specific STAGING records by UUID to the TOC.

V2: add List/Filter project and unit GET examples (by orgUid,
program data, marketplace units, tokenized) and Create tokenized
unit on Chia POST example to the TOC.
…eation

_createStoresInParallel had no retry logic for transient wallet errors,
causing V2 org creation to fail permanently when the Chia wallet's
DataLayer wallet was in a transitional state. This was especially
likely with parallel store creation since multiple simultaneous
create_new_dl RPC calls compound the race condition.

Add per-store retry logic (10 attempts, 30s delay) matching the
existing pattern in addV2ToExistingGovernanceBody. Transient errors
including "DataLayer Wallet already exists" (downstream symptom of
"DataLayerWallet not available" race) are now retried instead of
causing immediate failure.
Comment thread src/models/v2/organizations-v2.model.js Outdated
ChiaAutomation and others added 5 commits March 11, 2026 16:16
The waitForSync loop in getSubscribedStoreData() could block
indefinitely if a store never finishes syncing. Add a 10-minute
deadline (matching the timeout used in the v2
getRegistryStoreIdFromSingleton) so a stuck store throws instead
of causing an infinite blocking loop.
fix: add 10-minute timeout to getSubscribedStoreData sync wait loop
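
A deadline-bounded wait loop of the kind this commit describes might look like the sketch below. The 10-minute limit comes from the commit message. The function signature, the `isSynced` predicate, the 5-second default polling interval, and the injectable `now` and `wait` options are assumptions made for illustration.

```javascript
// Hypothetical sketch of a deadline-bounded sync wait; names are illustrative.
const SYNC_WAIT_TIMEOUT_MS = 10 * 60 * 1000; // 10 minutes, per the commit message

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// isSynced is an assumed async predicate reporting whether the store finished syncing.
async function waitForSync(isSynced, options = {}) {
  const {
    timeoutMs = SYNC_WAIT_TIMEOUT_MS,
    pollIntervalMs = 5_000, // assumed default; the real interval is configurable
    now = Date.now,
    wait = sleep,
  } = options;
  const deadline = now() + timeoutMs;
  while (!(await isSynced())) {
    if (now() >= deadline) {
      // A stuck store now throws instead of blocking forever.
      throw new Error(`Store did not finish syncing within ${timeoutMs} ms`);
    }
    await wait(pollIntervalMs);
  }
}
```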
…r checks

The case-sensitive includes('wallet') check didn't match any of the
specific Chia error messages (which all use uppercase 'Wallet') and
instead acted as a catch-all for unrelated errors containing the
substring, causing unnecessary retries of up to 5 minutes.
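
The mismatch is easy to reproduce, because `String.prototype.includes` is case-sensitive. The snippet below checks the four error strings named in this PR against the lowercase substring. The `unrelated` message is a made-up example of an error the broad check would have retried.

```javascript
// The four error strings named in this PR.
const chiaErrors = [
  'DataLayer Wallet already exists',
  'DataLayerWallet not available',
  'Wallet needs to be fully synced',
  'No spendable coins',
];

// Lowercase 'wallet' matches none of them: three use 'Wallet', one has no such word.
const matchedByLowercase = chiaErrors.filter((message) => message.includes('wallet'));

// Hypothetical unrelated error that the broad check would have treated as retryable.
const unrelated = 'Could not read wallet config file';
const unrelatedMatches = unrelated.includes('wallet');

console.log(matchedByLowercase.length, unrelatedMatches); // prints: 0 true
```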

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Missing "DataLayer Wallet already exists" in `upgradeFromV1` retry
    • Added "DataLayer Wallet already exists" to the `upgradeFromV1` transient retry checks so store creation now retries this known transitional wallet error.

Preview (559f3cde55)
diff --git a/src/models/v2/organizations-v2.model.js b/src/models/v2/organizations-v2.model.js
--- a/src/models/v2/organizations-v2.model.js
+++ b/src/models/v2/organizations-v2.model.js
@@ -922,6 +922,7 @@
             const isTransient =
               error.message?.includes('Wallet needs to be fully synced') ||
               error.message?.includes('DataLayerWallet not available') ||
+              error.message?.includes('DataLayer Wallet already exists') ||
               error.message?.includes('No spendable coins');
 
             if (isTransient && attempt < maxStoreCreateRetries) {

Comment thread src/models/v2/organizations-v2.model.js
TheLastCicada and others added 4 commits March 11, 2026 14:18
docs: add reclaim-home endpoint to table of contents
COIN_SIZE was hardcoded to 1,000,000 mojos while CADT operations
require DEFAULT_COIN_AMOUNT + DEFAULT_FEE (typically 600,000,000
mojos). This caused a perpetual splitting loop where coins were
created 600x too small to be usable, wasting fees and temporarily
draining spendable balance.

Set COIN_SIZE = DEFAULT_COIN_AMOUNT + DEFAULT_FEE so each split coin
can independently fund one full DataLayer operation. Add a
splitInProgress flag so mirror-check tasks log a warning instead of
an error when balance is temporarily reduced during a split.
fix: size split coins to match operational requirements
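
The sizing arithmetic behind this commit can be checked with a few lines. The commit message states only the 600,000,000 mojo total, so the two 300,000,000 values below are assumptions chosen to add up to it.

```javascript
// Assumed individual values: only the 600,000,000 total is stated in the commit message.
const DEFAULT_COIN_AMOUNT = 300_000_000; // mojos (assumption)
const DEFAULT_FEE = 300_000_000; // mojos (assumption)

const OLD_COIN_SIZE = 1_000_000; // previous hardcoded value, in mojos
const requiredPerOperation = DEFAULT_COIN_AMOUNT + DEFAULT_FEE;

// How many old-size coins a single operation would have needed.
const shortfallFactor = requiredPerOperation / OLD_COIN_SIZE;

// New sizing: each split coin funds one full operation on its own.
const COIN_SIZE = requiredPerOperation;

console.log(requiredPerOperation, shortfallFactor, COIN_SIZE >= requiredPerOperation);
// prints: 600000000 600 true
```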
…etry check

The isTransient check in upgradeFromV1 was missing this error string
after the overly broad includes('wallet') was removed, causing the
v1-to-v2 upgrade path to fail permanently on this transient error
instead of retrying.
Comment thread src/models/v2/organizations-v2.model.js
…nParallel

If the for loop somehow exhausted without returning (e.g. maxRetries
changed to 0), the async callback would return undefined, causing a
TypeError when downstream code accesses result.success.
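
The guard described here amounts to an explicit fallback return after the loop. The sketch below assumes a result object with a `success` field, as the commit message implies. `createOneStore`, `attemptCreate`, and the other result fields are hypothetical.

```javascript
// Hypothetical shape of the per-store callback; attemptCreate is an assumed async function.
async function createOneStore(name, attemptCreate, maxRetries) {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      const storeId = await attemptCreate(name);
      return { success: true, name, storeId };
    } catch (error) {
      if (attempt === maxRetries) {
        return { success: false, name, error: error.message };
      }
    }
  }
  // Reached only if the loop body never ran (for example maxRetries === 0).
  // An explicit failure result keeps result.success safe to read downstream.
  return { success: false, name, error: 'Store creation was never attempted' };
}
```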
Chia's default xch_spam_amount is 1,000,000 mojos. Coins smaller than
this may be filtered out by the wallet's spam filter. Since this setting
isn't available via RPC and CADT may run on a different host, use the
default as a floor so split coins are never below the dust threshold.
Comment thread tests/v2/integration/coin-management.spec.js Outdated
Test used COIN_SIZE = MIN_USABLE_COIN_SIZE (3,300) but production
computes COIN_SIZE = Math.max(MIN_USABLE_COIN_SIZE, DUST_FILTER_FLOOR)
which equals 1,000,000. Updated constants and assertions to match
the actual coin-splitting arithmetic.
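
The floor calculation the test now mirrors can be reproduced directly. The constant names and values come from the commit messages above. `coinsFromBalance` is a hypothetical helper that ignores fees and only shows how the floor changes the split count.

```javascript
// Names and values taken from the commit messages above.
const DUST_FILTER_FLOOR = 1_000_000; // Chia's default xch_spam_amount, in mojos
const MIN_USABLE_COIN_SIZE = 3_300; // value used by the test configuration

// Never split into coins smaller than the spam-filter threshold.
const COIN_SIZE = Math.max(MIN_USABLE_COIN_SIZE, DUST_FILTER_FLOOR);

// Hypothetical helper: whole coins obtainable from a balance, ignoring fees.
const coinsFromBalance = (balanceMojos) => Math.floor(balanceMojos / COIN_SIZE);

console.log(COIN_SIZE, coinsFromBalance(5_500_000)); // prints: 1000000 5
```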
Comment thread src/models/v2/organizations-v2.model.js Outdated
Both _createStoresInParallel and upgradeFromV1 maintained independent
copies of the transient wallet error list. This duplication already
caused a drift bug caught during this PR. Extract to a single
module-level helper.
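
A single module-level matcher of the kind this commit describes could look like the following sketch. The four strings are the ones listed in the PR description. The constant and function names are hypothetical, since the PR page does not show the real helper.

```javascript
// Single source of truth for retryable wallet errors (strings from the PR description).
const TRANSIENT_WALLET_ERROR_MESSAGES = Object.freeze([
  'DataLayer Wallet already exists',
  'DataLayerWallet not available',
  'Wallet needs to be fully synced',
  'No spendable coins',
]);

// Hypothetical helper name; both retry paths would call this instead of keeping their own list.
function isTransientWalletError(error) {
  const message = error?.message ?? '';
  return TRANSIENT_WALLET_ERROR_MESSAGES.some((text) => message.includes(text));
}
```

With one list, adding a new transient error string updates every caller at once, which avoids the drift described above.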

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Comment thread src/models/v2/organizations-v2.model.js
@TheLastCicada TheLastCicada merged commit 70afd41 into develop Mar 12, 2026
23 checks passed
@TheLastCicada TheLastCicada deleted the fix/parallel-store-creation-retry branch March 12, 2026 18:37