Skip to content

fix: sync registry mismatch backoff, org concurrency guard, and remote sync test revamp#1531

Merged
TheLastCicada merged 23 commits into
developfrom
fix/sync-registry-mismatch-backoff
Mar 17, 2026
Merged

fix: sync registry mismatch backoff, org concurrency guard, and remote sync test revamp#1531
TheLastCicada merged 23 commits into
developfrom
fix/sync-registry-mismatch-backoff

Conversation

@TheLastCicada

@TheLastCicada TheLastCicada commented Mar 13, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Add exponential backoff for sync registry mismatch polling to reduce unnecessary retries
  • Add in-memory concurrency guard for organization operations (create, import, sync) to prevent duplicate parallel operations
  • Add token-based ownership to org-operation lock with staleness TTL
  • Mirror governance stores on subscriber nodes
  • Skip wallet availability check for GET requests in global middleware
  • Subscribe to org store before sync check in v1 importOrganization
  • Add wallet readiness check and retry logic to V1 store creation
  • Revamp test-v2-remote-sync CI job: replace manual subscribe/import with automatic governance-based org discovery, add faucet funding, mirror creation validation, and mirror cleanup
  • New test-v1-remote-sync CI job: parallel job with identical structure using V1 config, governance body, and data validation against the V1 participant test instance

Test plan

  • test-v2-remote-sync CI job passes: governance sync, org discovery, data validation (21 entity types), mirror creation, mirror cleanup
  • test-v1-remote-sync CI job passes: governance sync, org discovery, data validation (10 projects, 10 units, nested counts), mirror creation, mirror cleanup
  • Both remote sync jobs run concurrently with no needs: dependencies
  • Existing CI jobs unaffected (v1/v2 integration, live-api, governance tests)
  • Org-operation lock integration tests pass
  • Sync mismatch backoff prevents tight polling loops

Note

Medium Risk
Touches organization create/upgrade/reclaim flows and background tasks, adding new locking and retry behavior that could block operations or change failure modes if incorrect. CI workflow changes are large but isolated to test infrastructure.

Overview
Adds a global, token-based in-memory lock to serialize home-organization operations (V1/V2 create, upgrade, reclaim, and recovery), returning 409 Conflict with a structured operationStatus when another operation is running and wiring lock progress into /v2/organizations/creation-status via a new liveStatus field.

Hardens org creation/upgrade reliability by requiring spendable coin availability before starting/resuming, retrying transient wallet errors during parallel store creation, and improving V1 import by subscribing to org stores before checking sync status.

Reduces noisy sync polling by introducing per-organization exponential backoff for datalayer generation/target mismatches in sync-registries (v1/v2).

Mirroring and ops tweaks: mirroring is now conditional on AUTO_MIRROR_EXTERNAL_STORES and a valid DATALAYER_FILE_SERVER_URL, mirror-check tasks can mirror governance stores on subscriber nodes using configured GOVERNANCE_BODY_ID, and middleware skips wallet-availability assertions for GET requests.

CI revamp: updates remote sync workflows (disables auto-mirroring by default in tests, adds faucet funding, governance-driven org discovery/validation, and adds a new test-v1-remote-sync job) and removes the prior “delete all mirrors” cleanup step; docs updated to reflect upgrade status and 409 conflict behavior.

Written by Cursor Bugbot for commit 0fbd6ee. This will update automatically on new commits. Configure here.

ChiaAutomation and others added 3 commits March 12, 2026 16:21
When a data layer store is stuck syncing (e.g., a delta file missing
from all mirrors), the sync-registries task would detect the root
history mismatch every 5 seconds, log a warn, and return -- repeating
identically forever. This caused massive log noise, unnecessary CPU
from repeated root history + sync status queries, and no useful signal
after the first log entry.

Add per-org mismatch tracking with exponential backoff (30s initial,
2x multiplier, 10min cap) and log throttling (warn on first hit, debug
on retries, info summary every 5 minutes). When the mismatch resolves,
the tracker clears and normal 5s polling resumes immediately.
Comment thread src/tasks/sync-registries-v2.js
TheLastCicada and others added 15 commits March 13, 2026 13:01
Prevent simultaneous execution of organization create, upgrade, and
reclaim operations with a process-level in-memory lock. Conflicting
requests now receive 409 Conflict with live operation status including
the operation name, current step, start time, and elapsed seconds.

- Add org-operation-lock module with tryAcquire/release/status tracking
- Guard V1 and V2 create, upgrade, and reclaim controller endpoints
- Add updateOrgLockStatus milestone calls in model creation/upgrade flows
- Integrate lock acquire/release into V1 and V2 startup recovery tasks
- Enhance GET /v2/organizations/creation-status with liveStatus field
- Add unit tests for lock module and integration tests for endpoint guards
- Document 409 behavior and liveStatus in cadt_rpc_api_v2.md
Move duplicated per-org exponential backoff logic from both
sync-registries.js and sync-registries-v2.js into a shared
SyncMismatchBackoff class. Algorithm and log output are identical;
future tuning only needs to change one place.
Add a 1-hour staleness TTL to the org operation lock so that a hung
background operation (promise that never resolves/rejects) cannot
permanently block all future organization operations. When a lock
exceeds MAX_LOCK_AGE_MS, tryAcquireOrgLock force-releases it with a
console warning and allows the new acquisition.

Also add the missing releaseOrgLock() call in the V2 create endpoint's
outer catch block, which could permanently hold the lock if a database
query threw between lock acquisition and the async creation branches.
The outer catch blocks in all 6 guarded handlers unconditionally called
releaseOrgLock(), but the lock is acquired partway through the try
block (after assertions like assertWalletIsSynced). If a pre-lock
assertion throws while another operation holds the lock, the catch
would release that other operation's lock, defeating the concurrency
guard.

Add a lockAcquired flag to each handler, set after successful
tryAcquireOrgLock(), and gate the outer catch release on it.
Replace all bare releaseOrgLock() calls in handler try blocks with a
local releaseLock() closure that atomically checks and clears the
lockAcquired flag before releasing. This eliminates two classes of bugs:

1. Early-release paths (e.g. V1 org detected) left lockAcquired=true.
   Any subsequent await could yield, letting another request acquire
   the lock, and if an error then reached the outer catch it would
   release that other operation's lock.

2. Reclaim handlers had an inner try/finally that released the lock,
   then the outer catch also released — a double-release that could
   free another operation's lock if one was acquired in between.

The only remaining bare releaseOrgLock() calls are inside async
.finally() callbacks on background operations, where lockAcquired is
explicitly set to false at handoff time.
releaseOrgLock() had no concept of who held the lock. When a background
.finally() ran after its operation's lock was force-released due to TTL
expiry, it silently released a different operation's lock, defeating the
concurrency guard.

tryAcquireOrgLock() now returns an opaque ownership token (or null on
failure). releaseOrgLock(token) only releases if the token matches the
current holder, making stale releases a safe no-op. All controllers and
recovery tasks pass captured tokens through their async .finally() and
try/finally blocks.
…k release

updateOrgLockStatus() unconditionally wrote to currentStatus without
verifying the caller owns the lock. After a stale lock force-release
(1-hour TTL), background code from the old operation could overwrite
the new holder's status on the creation-status endpoint.

Change updateOrgLockStatus(status) to updateOrgLockStatus(token, status)
so only the current lock holder can update progress. Thread the lock
token through createHomeOrganization, _resumeOrganizationCreation,
_executeOrganizationCreation, and upgradeFromV1 in both V1 and V2
models, controllers, and recovery tasks.

Also move releaseLock() in the V2 create handler's V1-org check block
from before the async singleton data fetch to just before each return,
closing a window where a concurrent request could acquire the lock
while async I/O was still in flight.

Remove getOrgLockOperation and isOrgLocked exports (only consumed by
tests, not production code). Refactor tests to use getOrgLockStatus()
and add coverage for stale-token rejection.
When a lock-held operation (e.g. upgrade) was running but no Meta-based
creation existed, the creation-status response had inProgress: true
alongside message: "No organization creation in progress" because the
lock-status spread overrode inProgress but not message.

Include a coherent message derived from the lock operation name and
current step. Update the API docs example and add a test assertion
for the message field.
…ency-guard

feat: add in-memory concurrency guard for organization operations
V1 _createStoresInParallel called createDataLayerStore without checking
wallet sync status, causing permanent failure when the wallet
temporarily desyncs. The V2 equivalent already has this protection.

Add waitForSpendableCoins() before V1 store creation and resume paths,
and per-store retry with exponential backoff for transient wallet errors
(matching the V2 implementation).

Extract TRANSIENT_WALLET_ERRORS and isTransientWalletError into
wallet.js so both V1 and V2 models share a single definition. Replace
hardcoded [v2]: log prefix in wallet.js with [wallet]: since the
function is now called from both V1 and V2 paths. Use actual needed
coin count from getStoresToCreate(state) in resume paths instead of
hardcoded 4.
The v1 importOrganization had a chicken-and-egg bug where it checked
datalayer sync status before subscribing to the store. For stores that
were never subscribed, get_sync_status returns an error (store unknown
to datalayer), which was interpreted as "not synced" causing an early
return. The subscription call was never reached, so the store was never
subscribed, and every retry hit the same wall.

Port the subscribe-first pattern already used in v2 importOrganization:
subscribe to the store first (no-op if already subscribed), then check
sync status. This ensures new stores begin syncing on the first pass
and subsequent retries find them progressively more synced.
…et-retry

fix: add wallet readiness check and retry logic to V1 store creation
…e-first

fix: subscribe to org store before sync check in v1 importOrganization
The mirror-check task only mirrored governance stores for governance
body owners (nodes with governanceBodyId/mainGoveranceBodyId in meta).
Subscriber nodes subscribe to governance stores via GOVERNANCE_BODY_ID
in config but the mirror task had no code path to mirror them.

Read GOVERNANCE_BODY_ID from config when meta entries are absent,
mirror the main governance store, then resolve and mirror its version
store via the data model version key.
…ware

The global middleware asserted wallet sync status on every non-health
request, including read-only GETs. When the wallet transiently desyncs
while processing new datalayer block confirmations, all GET endpoints
return 400 even though they only read from the local database.

Skip assertWalletIsAvailable() for GET requests, matching the existing
pattern where the home org sync check already allows GETs through.
Write operations still require the wallet to be synced.
TheLastCicada and others added 2 commits March 16, 2026 14:51
…-on-get

fix: skip wallet availability check for GET requests in global middleware
Replace manual subscribe/import flow in test-v2-remote-sync with
automatic governance-based discovery. Add faucet funding, mirror
creation validation, and mirror cleanup. Create new parallel
test-v1-remote-sync job with identical structure using V1 config,
governance, and data validation against the V1 participant instance.
@TheLastCicada TheLastCicada changed the title fix: add exponential backoff for sync registry mismatch polling fix: sync registry mismatch backoff, org concurrency guard, and remote sync test revamp Mar 16, 2026
Mirrors created on shared governance stores during CI runs leave
orphaned coins that can never be deleted (the CI wallet key is
destroyed after each run). These accumulate over time and contribute
to wallet sync delays in subsequent runs.

Decouple inline store mirroring from the background mirror-check task:
createDataLayerStore now mirrors based on DATALAYER_FILE_SERVER_URL
being configured rather than AUTO_MIRROR_EXTERNAL_STORES. This lets
CI create mirrors on ephemeral org stores (safe) while keeping the
background task disabled so governance stores are never mirrored.

- Set AUTO_MIRROR_EXTERNAL_STORES=false in all CI jobs
- Set DATALAYER_FILE_SERVER_URL=null for sync, upgrade, and governance
  jobs (no mirrors at all)
- Keep DATALAYER_FILE_SERVER_URL set for v1/v2 live-api jobs (mirrors
  only on org-created stores, validated by existing test helpers)
- Remove mirror verify and delete steps from sync jobs
- Remove mirror delete step from v2 live-api job

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Fix All in Cursor

Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Comment thread src/controllers/organization.controller.js Outdated
Comment thread src/datalayer/writeService.js
Replace nested try/finally inside try/catch with a flat
try/catch/finally in both V1 and V2 reclaimHome handlers. The inner
finally nulled the lock token, making the outer catch release a no-op.
A single finally block handles lock release on all exit paths.
subscribeToStoreOnDataLayer unconditionally created a mirror on every
subscribe, ignoring AUTO_MIRROR_EXTERNAL_STORES. Gate the addMirror
call on the config setting (defaulting to true) so users who set it
to false do not get mirrors for external stores.

Also update the README to clarify that both AUTO_MIRROR_EXTERNAL_STORES
and DATALAYER_FILE_SERVER_URL must be configured for external store
mirroring to function.
@TheLastCicada TheLastCicada merged commit 9abd65c into develop Mar 17, 2026
24 checks passed
@TheLastCicada TheLastCicada deleted the fix/sync-registry-mismatch-backoff branch March 17, 2026 15:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants