fix: sync registry mismatch backoff, org concurrency guard, and remote sync test revamp by TheLastCicada · Pull Request #1531 · Chia-Network/cadt

TheLastCicada · 2026-03-13T19:27:13Z

Summary

Add exponential backoff for sync registry mismatch polling to reduce unnecessary retries
Add in-memory concurrency guard for organization operations (create, import, sync) to prevent duplicate parallel operations
Add token-based ownership to org-operation lock with staleness TTL
Mirror governance stores on subscriber nodes
Skip wallet availability check for GET requests in global middleware
Subscribe to org store before sync check in v1 importOrganization
Add wallet readiness check and retry logic to V1 store creation
Revamp test-v2-remote-sync CI job: replace manual subscribe/import with automatic governance-based org discovery, add faucet funding, mirror creation validation, and mirror cleanup
New test-v1-remote-sync CI job: parallel job with identical structure using V1 config, governance body, and data validation against the V1 participant test instance

Test plan

test-v2-remote-sync CI job passes: governance sync, org discovery, data validation (21 entity types), mirror creation, mirror cleanup
test-v1-remote-sync CI job passes: governance sync, org discovery, data validation (10 projects, 10 units, nested counts), mirror creation, mirror cleanup
Both remote sync jobs run concurrently with no needs: dependencies
Existing CI jobs unaffected (v1/v2 integration, live-api, governance tests)
Org-operation lock integration tests pass
Sync mismatch backoff prevents tight polling loops

Note

Medium Risk
Touches organization create/upgrade/reclaim flows and background tasks, adding new locking and retry behavior that could block operations or change failure modes if incorrect. CI workflow changes are large but isolated to test infrastructure.

Overview
Adds a global, token-based in-memory lock to serialize home-organization operations (V1/V2 create, upgrade, reclaim, and recovery), returning 409 Conflict with a structured operationStatus when another operation is running and wiring lock progress into /v2/organizations/creation-status via a new liveStatus field.

Hardens org creation/upgrade reliability by requiring spendable coin availability before starting/resuming, retrying transient wallet errors during parallel store creation, and improving V1 import by subscribing to org stores before checking sync status.

Reduces noisy sync polling by introducing per-organization exponential backoff for datalayer generation/target mismatches in sync-registries (v1/v2).

Mirroring and ops tweaks: mirroring is now conditional on AUTO_MIRROR_EXTERNAL_STORES and a valid DATALAYER_FILE_SERVER_URL, mirror-check tasks can mirror governance stores on subscriber nodes using configured GOVERNANCE_BODY_ID, and middleware skips wallet-availability assertions for GET requests.

CI revamp: updates remote sync workflows (disables auto-mirroring by default in tests, adds faucet funding, governance-driven org discovery/validation, and adds a new test-v1-remote-sync job) and removes the prior “delete all mirrors” cleanup step; docs updated to reflect upgrade status and 409 conflict behavior.

^{Written by Cursor Bugbot for commit 0fbd6ee. This will update automatically on new commits. Configure here.}

When a data layer store is stuck syncing (e.g., a delta file missing from all mirrors), the sync-registries task would detect the root history mismatch every 5 seconds, log a warn, and return -- repeating identically forever. This caused massive log noise, unnecessary CPU from repeated root history + sync status queries, and no useful signal after the first log entry. Add per-org mismatch tracking with exponential backoff (30s initial, 2x multiplier, 10min cap) and log throttling (warn on first hit, debug on retries, info summary every 5 minutes). When the mismatch resolves, the tracker clears and normal 5s polling resumes immediately.

Update Managed Files

Prevent simultaneous execution of organization create, upgrade, and reclaim operations with a process-level in-memory lock. Conflicting requests now receive 409 Conflict with live operation status including the operation name, current step, start time, and elapsed seconds. - Add org-operation-lock module with tryAcquire/release/status tracking - Guard V1 and V2 create, upgrade, and reclaim controller endpoints - Add updateOrgLockStatus milestone calls in model creation/upgrade flows - Integrate lock acquire/release into V1 and V2 startup recovery tasks - Enhance GET /v2/organizations/creation-status with liveStatus field - Add unit tests for lock module and integration tests for endpoint guards - Document 409 behavior and liveStatus in cadt_rpc_api_v2.md

Move duplicated per-org exponential backoff logic from both sync-registries.js and sync-registries-v2.js into a shared SyncMismatchBackoff class. Algorithm and log output are identical; future tuning only needs to change one place.

Add a 1-hour staleness TTL to the org operation lock so that a hung background operation (promise that never resolves/rejects) cannot permanently block all future organization operations. When a lock exceeds MAX_LOCK_AGE_MS, tryAcquireOrgLock force-releases it with a console warning and allows the new acquisition. Also add the missing releaseOrgLock() call in the V2 create endpoint's outer catch block, which could permanently hold the lock if a database query threw between lock acquisition and the async creation branches.

The outer catch blocks in all 6 guarded handlers unconditionally called releaseOrgLock(), but the lock is acquired partway through the try block (after assertions like assertWalletIsSynced). If a pre-lock assertion throws while another operation holds the lock, the catch would release that other operation's lock, defeating the concurrency guard. Add a lockAcquired flag to each handler, set after successful tryAcquireOrgLock(), and gate the outer catch release on it.

Replace all bare releaseOrgLock() calls in handler try blocks with a local releaseLock() closure that atomically checks and clears the lockAcquired flag before releasing. This eliminates two classes of bugs: 1. Early-release paths (e.g. V1 org detected) left lockAcquired=true. Any subsequent await could yield, letting another request acquire the lock, and if an error then reached the outer catch it would release that other operation's lock. 2. Reclaim handlers had an inner try/finally that released the lock, then the outer catch also released — a double-release that could free another operation's lock if one was acquired in between. The only remaining bare releaseOrgLock() calls are inside async .finally() callbacks on background operations, where lockAcquired is explicitly set to false at handoff time.

releaseOrgLock() had no concept of who held the lock. When a background .finally() ran after its operation's lock was force-released due to TTL expiry, it silently released a different operation's lock, defeating the concurrency guard. tryAcquireOrgLock() now returns an opaque ownership token (or null on failure). releaseOrgLock(token) only releases if the token matches the current holder, making stale releases a safe no-op. All controllers and recovery tasks pass captured tokens through their async .finally() and try/finally blocks.

…k release updateOrgLockStatus() unconditionally wrote to currentStatus without verifying the caller owns the lock. After a stale lock force-release (1-hour TTL), background code from the old operation could overwrite the new holder's status on the creation-status endpoint. Change updateOrgLockStatus(status) to updateOrgLockStatus(token, status) so only the current lock holder can update progress. Thread the lock token through createHomeOrganization, _resumeOrganizationCreation, _executeOrganizationCreation, and upgradeFromV1 in both V1 and V2 models, controllers, and recovery tasks. Also move releaseLock() in the V2 create handler's V1-org check block from before the async singleton data fetch to just before each return, closing a window where a concurrent request could acquire the lock while async I/O was still in flight. Remove getOrgLockOperation and isOrgLocked exports (only consumed by tests, not production code). Refactor tests to use getOrgLockStatus() and add coverage for stale-token rejection.

When a lock-held operation (e.g. upgrade) was running but no Meta-based creation existed, the creation-status response had inProgress: true alongside message: "No organization creation in progress" because the lock-status spread overrode inProgress but not message. Include a coherent message derived from the lock operation name and current step. Update the API docs example and add a test assertion for the message field.

…ency-guard feat: add in-memory concurrency guard for organization operations

V1 _createStoresInParallel called createDataLayerStore without checking wallet sync status, causing permanent failure when the wallet temporarily desyncs. The V2 equivalent already has this protection. Add waitForSpendableCoins() before V1 store creation and resume paths, and per-store retry with exponential backoff for transient wallet errors (matching the V2 implementation). Extract TRANSIENT_WALLET_ERRORS and isTransientWalletError into wallet.js so both V1 and V2 models share a single definition. Replace hardcoded [v2]: log prefix in wallet.js with [wallet]: since the function is now called from both V1 and V2 paths. Use actual needed coin count from getStoresToCreate(state) in resume paths instead of hardcoded 4.

The v1 importOrganization had a chicken-and-egg bug where it checked datalayer sync status before subscribing to the store. For stores that were never subscribed, get_sync_status returns an error (store unknown to datalayer), which was interpreted as "not synced" causing an early return. The subscription call was never reached, so the store was never subscribed, and every retry hit the same wall. Port the subscribe-first pattern already used in v2 importOrganization: subscribe to the store first (no-op if already subscribed), then check sync status. This ensures new stores begin syncing on the first pass and subsequent retries find them progressively more synced.

…et-retry fix: add wallet readiness check and retry logic to V1 store creation

…e-first fix: subscribe to org store before sync check in v1 importOrganization

The mirror-check task only mirrored governance stores for governance body owners (nodes with governanceBodyId/mainGoveranceBodyId in meta). Subscriber nodes subscribe to governance stores via GOVERNANCE_BODY_ID in config but the mirror task had no code path to mirror them. Read GOVERNANCE_BODY_ID from config when meta entries are absent, mirror the main governance store, then resolve and mirror its version store via the data model version key.

…ware The global middleware asserted wallet sync status on every non-health request, including read-only GETs. When the wallet transiently desyncs while processing new datalayer block confirmations, all GET endpoints return 400 even though they only read from the local database. Skip assertWalletIsAvailable() for GET requests, matching the existing pattern where the home org sync check already allows GETs through. Write operations still require the wallet to be synced.

…-on-get fix: skip wallet availability check for GET requests in global middleware

Replace manual subscribe/import flow in test-v2-remote-sync with automatic governance-based discovery. Add faucet funding, mirror creation validation, and mirror cleanup. Create new parallel test-v1-remote-sync job with identical structure using V1 config, governance, and data validation against the V1 participant instance.

Mirrors created on shared governance stores during CI runs leave orphaned coins that can never be deleted (the CI wallet key is destroyed after each run). These accumulate over time and contribute to wallet sync delays in subsequent runs. Decouple inline store mirroring from the background mirror-check task: createDataLayerStore now mirrors based on DATALAYER_FILE_SERVER_URL being configured rather than AUTO_MIRROR_EXTERNAL_STORES. This lets CI create mirrors on ephemeral org stores (safe) while keeping the background task disabled so governance stores are never mirrored. - Set AUTO_MIRROR_EXTERNAL_STORES=false in all CI jobs - Set DATALAYER_FILE_SERVER_URL=null for sync, upgrade, and governance jobs (no mirrors at all) - Keep DATALAYER_FILE_SERVER_URL set for v1/v2 live-api jobs (mirrors only on org-created stores, validated by existing test helpers) - Remove mirror verify and delete steps from sync jobs - Remove mirror delete step from v2 live-api job

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

^{Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

Replace nested try/finally inside try/catch with a flat try/catch/finally in both V1 and V2 reclaimHome handlers. The inner finally nulled the lock token, making the outer catch release a no-op. A single finally block handles lock release on all exit paths.

subscribeToStoreOnDataLayer unconditionally created a mirror on every subscribe, ignoring AUTO_MIRROR_EXTERNAL_STORES. Gate the addMirror call on the config setting (defaulting to true) so users who set it to false do not get mirrors for external stores. Also update the README to clarify that both AUTO_MIRROR_EXTERNAL_STORES and DATALAYER_FILE_SERVER_URL must be configured for external store mirroring to function.

ChiaAutomation and others added 3 commits March 12, 2026 16:21

chore: Update dependabot

ebc5907

Merge pull request #1527 from Chia-Network/managed-files

d0632db

Update Managed Files

cursor Bot reviewed Mar 13, 2026

View reviewed changes

Comment thread src/tasks/sync-registries-v2.js

TheLastCicada and others added 15 commits March 13, 2026 13:01

refactor: extract shared SyncMismatchBackoff utility

38ee5cc

Move duplicated per-org exponential backoff logic from both sync-registries.js and sync-registries-v2.js into a shared SyncMismatchBackoff class. Algorithm and log output are identical; future tuning only needs to change one place.

Merge pull request #1532 from Chia-Network/feat/org-operation-concurr…

a62252a

…ency-guard feat: add in-memory concurrency guard for organization operations

Merge pull request #1533 from Chia-Network/fix/v1-store-creation-wall…

b545e25

…et-retry fix: add wallet readiness check and retry logic to V1 store creation

Merge pull request #1534 from Chia-Network/fix/v1-import-org-subscrib…

83fd225

…e-first fix: subscribe to org store before sync check in v1 importOrganization

TheLastCicada mentioned this pull request Mar 16, 2026

fix: skip wallet availability check for GET requests in global middleware #1535

Merged

3 tasks

TheLastCicada and others added 2 commits March 16, 2026 14:51

Merge pull request #1535 from Chia-Network/fix/skip-wallet-sync-check…

b913b87

…-on-get fix: skip wallet availability check for GET requests in global middleware

TheLastCicada changed the title ~~fix: add exponential backoff for sync registry mismatch polling~~ fix: sync registry mismatch backoff, org concurrency guard, and remote sync test revamp Mar 16, 2026

cursor Bot reviewed Mar 17, 2026

View reviewed changes

Comment thread src/controllers/organization.controller.js Outdated

Comment thread src/datalayer/writeService.js

TheLastCicada added 2 commits March 16, 2026 22:46

TheLastCicada merged commit 9abd65c into develop Mar 17, 2026
24 checks passed

TheLastCicada deleted the fix/sync-registry-mismatch-backoff branch March 17, 2026 15:44

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: sync registry mismatch backoff, org concurrency guard, and remote sync test revamp#1531

fix: sync registry mismatch backoff, org concurrency guard, and remote sync test revamp#1531
TheLastCicada merged 23 commits into
developfrom
fix/sync-registry-mismatch-backoff

TheLastCicada commented Mar 13, 2026 •

edited by cursor Bot

Loading

Uh oh!

Uh oh!

cursor Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

TheLastCicada commented Mar 13, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

TheLastCicada commented Mar 13, 2026 •

edited by cursor Bot

Loading