fix: sync registry mismatch backoff, org concurrency guard, and remote sync test revamp#1531
Merged
Merged
Conversation
When a data layer store is stuck syncing (e.g., a delta file missing from all mirrors), the sync-registries task would detect the root history mismatch every 5 seconds, log a warn, and return -- repeating identically forever. This caused massive log noise, unnecessary CPU from repeated root history + sync status queries, and no useful signal after the first log entry. Add per-org mismatch tracking with exponential backoff (30s initial, 2x multiplier, 10min cap) and log throttling (warn on first hit, debug on retries, info summary every 5 minutes). When the mismatch resolves, the tracker clears and normal 5s polling resumes immediately.
Update Managed Files
Prevent simultaneous execution of organization create, upgrade, and reclaim operations with a process-level in-memory lock. Conflicting requests now receive 409 Conflict with live operation status including the operation name, current step, start time, and elapsed seconds. - Add org-operation-lock module with tryAcquire/release/status tracking - Guard V1 and V2 create, upgrade, and reclaim controller endpoints - Add updateOrgLockStatus milestone calls in model creation/upgrade flows - Integrate lock acquire/release into V1 and V2 startup recovery tasks - Enhance GET /v2/organizations/creation-status with liveStatus field - Add unit tests for lock module and integration tests for endpoint guards - Document 409 behavior and liveStatus in cadt_rpc_api_v2.md
Move duplicated per-org exponential backoff logic from both sync-registries.js and sync-registries-v2.js into a shared SyncMismatchBackoff class. Algorithm and log output are identical; future tuning only needs to change one place.
Add a 1-hour staleness TTL to the org operation lock so that a hung background operation (promise that never resolves/rejects) cannot permanently block all future organization operations. When a lock exceeds MAX_LOCK_AGE_MS, tryAcquireOrgLock force-releases it with a console warning and allows the new acquisition. Also add the missing releaseOrgLock() call in the V2 create endpoint's outer catch block, which could permanently hold the lock if a database query threw between lock acquisition and the async creation branches.
The outer catch blocks in all 6 guarded handlers unconditionally called releaseOrgLock(), but the lock is acquired partway through the try block (after assertions like assertWalletIsSynced). If a pre-lock assertion throws while another operation holds the lock, the catch would release that other operation's lock, defeating the concurrency guard. Add a lockAcquired flag to each handler, set after successful tryAcquireOrgLock(), and gate the outer catch release on it.
Replace all bare releaseOrgLock() calls in handler try blocks with a local releaseLock() closure that atomically checks and clears the lockAcquired flag before releasing. This eliminates two classes of bugs: 1. Early-release paths (e.g. V1 org detected) left lockAcquired=true. Any subsequent await could yield, letting another request acquire the lock, and if an error then reached the outer catch it would release that other operation's lock. 2. Reclaim handlers had an inner try/finally that released the lock, then the outer catch also released — a double-release that could free another operation's lock if one was acquired in between. The only remaining bare releaseOrgLock() calls are inside async .finally() callbacks on background operations, where lockAcquired is explicitly set to false at handoff time.
releaseOrgLock() had no concept of who held the lock. When a background .finally() ran after its operation's lock was force-released due to TTL expiry, it silently released a different operation's lock, defeating the concurrency guard. tryAcquireOrgLock() now returns an opaque ownership token (or null on failure). releaseOrgLock(token) only releases if the token matches the current holder, making stale releases a safe no-op. All controllers and recovery tasks pass captured tokens through their async .finally() and try/finally blocks.
…k release updateOrgLockStatus() unconditionally wrote to currentStatus without verifying the caller owns the lock. After a stale lock force-release (1-hour TTL), background code from the old operation could overwrite the new holder's status on the creation-status endpoint. Change updateOrgLockStatus(status) to updateOrgLockStatus(token, status) so only the current lock holder can update progress. Thread the lock token through createHomeOrganization, _resumeOrganizationCreation, _executeOrganizationCreation, and upgradeFromV1 in both V1 and V2 models, controllers, and recovery tasks. Also move releaseLock() in the V2 create handler's V1-org check block from before the async singleton data fetch to just before each return, closing a window where a concurrent request could acquire the lock while async I/O was still in flight. Remove getOrgLockOperation and isOrgLocked exports (only consumed by tests, not production code). Refactor tests to use getOrgLockStatus() and add coverage for stale-token rejection.
When a lock-held operation (e.g. upgrade) was running but no Meta-based creation existed, the creation-status response had inProgress: true alongside message: "No organization creation in progress" because the lock-status spread overrode inProgress but not message. Include a coherent message derived from the lock operation name and current step. Update the API docs example and add a test assertion for the message field.
…ency-guard feat: add in-memory concurrency guard for organization operations
V1 _createStoresInParallel called createDataLayerStore without checking wallet sync status, causing permanent failure when the wallet temporarily desyncs. The V2 equivalent already has this protection. Add waitForSpendableCoins() before V1 store creation and resume paths, and per-store retry with exponential backoff for transient wallet errors (matching the V2 implementation). Extract TRANSIENT_WALLET_ERRORS and isTransientWalletError into wallet.js so both V1 and V2 models share a single definition. Replace hardcoded [v2]: log prefix in wallet.js with [wallet]: since the function is now called from both V1 and V2 paths. Use actual needed coin count from getStoresToCreate(state) in resume paths instead of hardcoded 4.
The v1 importOrganization had a chicken-and-egg bug where it checked datalayer sync status before subscribing to the store. For stores that were never subscribed, get_sync_status returns an error (store unknown to datalayer), which was interpreted as "not synced" causing an early return. The subscription call was never reached, so the store was never subscribed, and every retry hit the same wall. Port the subscribe-first pattern already used in v2 importOrganization: subscribe to the store first (no-op if already subscribed), then check sync status. This ensures new stores begin syncing on the first pass and subsequent retries find them progressively more synced.
…et-retry fix: add wallet readiness check and retry logic to V1 store creation
…e-first fix: subscribe to org store before sync check in v1 importOrganization
The mirror-check task only mirrored governance stores for governance body owners (nodes with governanceBodyId/mainGoveranceBodyId in meta). Subscriber nodes subscribe to governance stores via GOVERNANCE_BODY_ID in config but the mirror task had no code path to mirror them. Read GOVERNANCE_BODY_ID from config when meta entries are absent, mirror the main governance store, then resolve and mirror its version store via the data model version key.
…ware The global middleware asserted wallet sync status on every non-health request, including read-only GETs. When the wallet transiently desyncs while processing new datalayer block confirmations, all GET endpoints return 400 even though they only read from the local database. Skip assertWalletIsAvailable() for GET requests, matching the existing pattern where the home org sync check already allows GETs through. Write operations still require the wallet to be synced.
3 tasks
…-on-get fix: skip wallet availability check for GET requests in global middleware
Replace manual subscribe/import flow in test-v2-remote-sync with automatic governance-based discovery. Add faucet funding, mirror creation validation, and mirror cleanup. Create new parallel test-v1-remote-sync job with identical structure using V1 config, governance, and data validation against the V1 participant instance.
Mirrors created on shared governance stores during CI runs leave orphaned coins that can never be deleted (the CI wallet key is destroyed after each run). These accumulate over time and contribute to wallet sync delays in subsequent runs. Decouple inline store mirroring from the background mirror-check task: createDataLayerStore now mirrors based on DATALAYER_FILE_SERVER_URL being configured rather than AUTO_MIRROR_EXTERNAL_STORES. This lets CI create mirrors on ephemeral org stores (safe) while keeping the background task disabled so governance stores are never mirrored. - Set AUTO_MIRROR_EXTERNAL_STORES=false in all CI jobs - Set DATALAYER_FILE_SERVER_URL=null for sync, upgrade, and governance jobs (no mirrors at all) - Keep DATALAYER_FILE_SERVER_URL set for v1/v2 live-api jobs (mirrors only on org-created stores, validated by existing test helpers) - Remove mirror verify and delete steps from sync jobs - Remove mirror delete step from v2 live-api job
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Replace nested try/finally inside try/catch with a flat try/catch/finally in both V1 and V2 reclaimHome handlers. The inner finally nulled the lock token, making the outer catch release a no-op. A single finally block handles lock release on all exit paths.
subscribeToStoreOnDataLayer unconditionally created a mirror on every subscribe, ignoring AUTO_MIRROR_EXTERNAL_STORES. Gate the addMirror call on the config setting (defaulting to true) so users who set it to false do not get mirrors for external stores. Also update the README to clarify that both AUTO_MIRROR_EXTERNAL_STORES and DATALAYER_FILE_SERVER_URL must be configured for external store mirroring to function.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.

Summary
test-v2-remote-syncCI job: replace manual subscribe/import with automatic governance-based org discovery, add faucet funding, mirror creation validation, and mirror cleanuptest-v1-remote-syncCI job: parallel job with identical structure using V1 config, governance body, and data validation against the V1 participant test instanceTest plan
test-v2-remote-syncCI job passes: governance sync, org discovery, data validation (21 entity types), mirror creation, mirror cleanuptest-v1-remote-syncCI job passes: governance sync, org discovery, data validation (10 projects, 10 units, nested counts), mirror creation, mirror cleanupneeds:dependenciesNote
Medium Risk
Touches organization create/upgrade/reclaim flows and background tasks, adding new locking and retry behavior that could block operations or change failure modes if incorrect. CI workflow changes are large but isolated to test infrastructure.
Overview
Adds a global, token-based in-memory lock to serialize home-organization operations (V1/V2 create, upgrade, reclaim, and recovery), returning
409 Conflictwith a structuredoperationStatuswhen another operation is running and wiring lock progress into/v2/organizations/creation-statusvia a newliveStatusfield.Hardens org creation/upgrade reliability by requiring spendable coin availability before starting/resuming, retrying transient wallet errors during parallel store creation, and improving V1 import by subscribing to org stores before checking sync status.
Reduces noisy sync polling by introducing per-organization exponential backoff for datalayer generation/target mismatches in
sync-registries(v1/v2).Mirroring and ops tweaks: mirroring is now conditional on
AUTO_MIRROR_EXTERNAL_STORESand a validDATALAYER_FILE_SERVER_URL, mirror-check tasks can mirror governance stores on subscriber nodes using configuredGOVERNANCE_BODY_ID, and middleware skips wallet-availability assertions forGETrequests.CI revamp: updates remote sync workflows (disables auto-mirroring by default in tests, adds faucet funding, governance-driven org discovery/validation, and adds a new
test-v1-remote-syncjob) and removes the prior “delete all mirrors” cleanup step; docs updated to reflect upgrade status and 409 conflict behavior.Written by Cursor Bugbot for commit 0fbd6ee. This will update automatically on new commits. Configure here.