Skip to content

fix: make background sync non-blocking on unsynced stores#1600

Merged
TheLastCicada merged 6 commits into
v2-rc2from
fix/sync-blocking-regression
Apr 23, 2026
Merged

fix: make background sync non-blocking on unsynced stores#1600
TheLastCicada merged 6 commits into
v2-rc2from
fix/sync-blocking-regression

Conversation

@TheLastCicada

@TheLastCicada TheLastCicada commented Apr 23, 2026

Copy link
Copy Markdown
Contributor

Summary

Eliminate long blocking waits in background reconciliation and governance sync without changing current subscribe/resync request-path semantics.

  • Background reconciliation pre-checks. Both V1 Organization.reconcileOrganization and V2 OrganizationsV2.reconcileOrganization now subscribe to the org store first, then check getDataLayerStoreSyncStatus before entering any blocking subscription flow. If either the org store or its derived singleton store is unsynced, the task logs at INFO level and returns — the periodic task retries on the next run.
  • Governance sync is now a single attempt per task run. Both V1 Governance.sync and V2 GovernanceV2.sync check the governance body store and versioned governance store sync status up front and return immediately if either is unsynced. On any other failure the function logs and returns; the scheduler re-runs the task on its normal cadence instead of an in-task retry loop. Cached governance data is preserved across skips.
  • Governance sync interval standardized to 5 minutes (300 s) across src/utils/defaultConfig.js, both sync-governance-body*.js fallbacks, docker-compose-example.yml, and tests/v2/config/test-config.yaml. The sync-status pre-check is a single lightweight RPC, so running every 5 minutes is safe.
  • Request-path getRegistryStoreIdFromSingleton wait-loop preserved. The production wait-loop in V2's getRegistryStoreIdFromSingleton is kept so POST /v2/organizations/subscribe and the org-import flow still work on newly-discovered stores. The reconcile pre-check makes the wait a no-op on the background path.
  • Focused tests. 22 new tests across three files covering the simulator fast-path, cached-data preservation across skips, method contracts, upsertGovernanceDownload idempotency, and the default interval constant.

Behavior change callouts

  • Default GOVERNANCE_SYNC_TASK_INTERVAL changes from 86400 (tasks fallback) / 3600 (defaultConfig) to 300. Operators who pinned the env var explicitly are unaffected; operators relying on the default will see a governance sync every 5 minutes instead of every 1 or 24 hours.
  • Governance.sync / GovernanceV2.sync no longer accept a retryCounter parameter and no longer throw on missing GOVERNANCE_BODY_ID — they log and return.

Test plan

  • npm run test:v2 — 3 new spec files (22 tests) pass; full v2 suite passes except for two pre-existing flaky tests (project-v2 cascade, issuance-v2 DELETE) that also fail on v2-rc2 without these changes.
  • Verify on a real CADT deployment that reconcile tasks complete in seconds on nodes that haven't finished syncing subscribed stores.
  • Verify governance sync picks up governance-body updates within ~5 minutes on a healthy node.

Note

Medium Risk
Changes core background sync/reconciliation control flow to return early on unsynced stores and removes in-function retry loops, which could affect freshness/timing of governance/org data if the new skip conditions are too aggressive. Also increases default governance sync frequency to every 5 minutes, impacting task load and log volume.

Overview
Makes background governance sync and org reconciliation non-blocking when datalayer stores aren’t ready. Governance.sync/GovernanceV2.sync now follow a subscribe-first + sync-status precheck pattern and do a single best-effort attempt per scheduled run (no internal retry loop); failures log and return while preserving cached governance rows.

Adds a skipOnUnsynced option to org reconciliation and uses it in periodic validation tasks. Organization.reconcileOrganization and OrganizationsV2.reconcileOrganization now pre-subscribe and check org + singleton store sync status and either skip (background tasks) or throw (request paths) to avoid long waits and prevent destructive follow-on operations from running on stale state.

Standardizes governance sync cadence to 300s (5 min). Updates defaults and fallbacks (defaultConfig, task jobs, example compose/test config), removes “kick governance sync” behavior from default-org sync tasks, and adds integration tests covering schedule defaults, subscribe-first ordering, idempotent upserts, and simulator fast paths.

Reviewed by Cursor Bugbot for commit d98198f. Bugbot is set up for automated code reviews on this repo. Configure here.

Background reconciliation and governance sync tasks could block for up
to 10 minutes per call on known-unsynced datalayer stores, and the V1
governance sync recursively retried up to 50 times on every failure.
This caused visible startup and task-cadence stalls.

- Add sync-status pre-checks to V1 and V2 reconcileOrganization: subscribe
  to the org store so it starts syncing, then skip reconcile on the next
  task run if either the org store or its derived singleton is unsynced.
- Replace recursive retry in V1/V2 governance sync with a single attempt
  per task invocation.  On failure the function logs and returns; the
  scheduler re-runs the task on its normal cadence, so the coroutine
  never blocks its scheduler slot.
- Skip governance sync immediately when the body store or its versioned
  governance store reports unsynced, preserving cached governance data.
- Keep the production wait-loop in getRegistryStoreIdFromSingleton so
  request-path callers (subscribe controller, import flow) still work
  on newly-discovered stores; the reconcile pre-check makes the wait a
  no-op on the background path.
- Standardize the governance sync interval to 300 seconds (5 minutes)
  across defaults, task fallbacks, example env, and test config.  The
  sync pre-check is a single lightweight RPC so running every 5 minutes
  is safe.
- Add focused tests for reconcile and governance non-blocking behavior,
  idempotency, and schedule configuration.
Comment thread src/models/governance/governance.model.js
Bugbot flagged a permanent-skip loop in Governance.sync (both v1 and v2):
getDataLayerStoreSyncStatus returns false for a store the DataLayer node
has no record of, isDlStoreSynced(undefined) is false, sync() early-exits,
and because sync() was the only place that called
subscribeToStoreOnDataLayer (via getSubscribedStoreData), the store never
got subscribed and the skip repeated forever on every scheduled run.

Mirror the subscribe-first pattern already used in
OrganizationsV2.reconcileOrganization: subscribe first (idempotent), then
check sync status.  Applied to both the governance body store and the
versioned-governance store in both v1 and v2 models.

subscribeToStoreOnDataLayer returns false on the common
datalayer-unreachable / RPC-failure paths (it does not throw), so guard
on both a falsy return and thrown exceptions.  Without the falsy-return
check, the dominant failure mode would slip past the try/catch and the
subsequent status-check skip would fire with a misleading log.

Adds tests/v2/integration/governance-sync-subscribe-first.spec.js
covering five branches of GovernanceV2.sync:
- subscribeToStoreOnDataLayer runs before getDataLayerStoreSyncStatus on
  the body store
- sync() returns cleanly when the body-store subscribe throws
- sync() returns cleanly when the body-store subscribe returns falsy
- sync() subscribes to the versioned store before its status check (call
  order: body-subscribe -> body-status -> body-fetch -> versioned-subscribe
  -> versioned-status), and does not upsert when the versioned store is
  unsynced
- sync() returns cleanly when the versioned-store subscribe returns falsy

The v1 model captures USE_SIMULATOR at module import time so the same
test pattern cannot reach its production path; the v1 change is
structurally identical to v2 and is reviewed by code reading.
Comment thread src/models/organizations/organizations.model.js
…falsy return

Bugbot flagged an inconsistency introduced by af06a65: governance sync
now correctly treats both a thrown exception and a falsy return from
subscribeToStoreOnDataLayer as a failure, but V1/V2 reconcileOrganization
still only guarded thrown exceptions.

subscribeToStoreOnDataLayer in src/datalayer/persistance.js returns
false (without throwing) on the two most common failure paths:
  - No storeId provided
  - getSubscriptions() RPC failure (datalayer unreachable)

Without the falsy-return check, the dominant datalayer-unreachable
failure slips past the try/catch, and the subsequent
getDataLayerStoreSyncStatus call (which also fails when the store is
not subscribed) produces a misleading "not yet synced" log instead of
a clear "could not subscribe" message.

Mirrors the subscribe-first pattern af06a65 added to governance sync.
Both subscribe-first and the falsy-return guard are no-ops in simulator
mode (wrapped in !USE_SIMULATOR for V2, skipped entirely for V1 via the
function's top-level simulator guard).
Comment thread src/models/organizations/organizations.model.js
TheLastCicada added a commit that referenced this pull request Apr 23, 2026
…zationMeta

Mirrors the subscribe-first pattern established in #1600 af06a65 (for
governance sync) and 6d41b71 (for reconcileOrganization).

Without subscribe-first, the getDataLayerStoreSyncStatus pre-check added
to syncOrganizationMeta has a permanent-skip-loop failure mode: if the
DataLayer node has no record of an org store (e.g. after a DataLayer DB
reset), the status check returns falsy, the loop iteration skips the
org, and no subsequent run ever subscribes — the org's metadata stops
syncing forever.

Also adds a falsy-return guard on subscribeToStoreOnDataLayer: the
wrapper in src/datalayer/persistance.js returns false (without throwing)
on the common failure paths (no storeId, getSubscriptions RPC failure).
Without the falsy check, a datalayer-unreachable failure would slip past
the try/catch and produce a misleading "not yet synced" log.

Applies to both V1 Organization.syncOrganizationMeta and V2
OrganizationsV2.syncOrganizationMeta.
TheLastCicada added a commit that referenced this pull request Apr 23, 2026
…sy return

Mirrors the pattern established in #1600 af06a65 (for governance sync)
and 6d41b71 (for reconcileOrganization).

subscribeToStoreOnDataLayer in src/datalayer/persistance.js returns
false (without throwing) on the common failure paths:
  - No storeId provided
  - getSubscriptions() RPC failure (datalayer unreachable)

Without the falsy-return check, the dominant datalayer-unreachable
failure would slip past the try/catch, and the subsequent
getDataLayerStoreSyncStatus call (which also fails when the store is
not subscribed) would produce a misleading "not yet synced" log.

Applies to both V1 and V2 importOrganization at two subscribe call
sites each: the pre-existing orgUid subscribe at the top of the
function, and the new singleton-store subscribe added in this PR.
Both are adjacent in the same function; fixing only one would leave
an internal inconsistency that a reviewer would flag.
…ta loss

Bugbot flagged a High Severity regression on 6d41b71: reconcileOrganization
now silently returns (no throw) when a store is unsynced or the subscribe
call fails, but the user-facing resyncOrganization controller
(POST /organizations/resync and POST /v2/organizations/resync) calls
reconcileOrganization and then unconditionally:

  1. Resets registryHash / registry_hash to '0'
  2. Destroys ModelKeys rows (V1) / AuditV2 rows (V2) for the orgUid

Before this PR, reconcileOrganization would either complete or throw; the
throw would hit the controller's catch, roll back the transaction, and
return 400 without destroying data.  After 6d41b71, a transient datalayer
hiccup (subscribe returns false, sync-status RPC fails, etc.) causes the
resync endpoint to "succeed" with silent data deletion.

Fix: add a { skipOnUnsynced } option to both V1 and V2
reconcileOrganization.
  - Default (false): throw on any subscribe-failed / store-unsynced /
    status-check-failed condition.  Restores pre-6e76235 behavior for
    request-path callers.
  - true: silent skip with an INFO log.  Preserves the sync-blocking
    regression fix for background task callers
    (validate-organization-table-and-subscriptions*.js).

Background task callers updated to pass { skipOnUnsynced: true }.  The
two resync controllers are unchanged — they continue to call
reconcileOrganization with the (now throwing) default behavior, which
hits their existing catch block and rolls back the transaction.

Adds six contract tests covering the new option in simulator mode
(signature acceptance, no-option back-compat, no signature errors for
either value).  Production throw/skip behavior remains covered by live
integration tests; simulator mode cannot reach the pre-check block
(wrapped in !USE_SIMULATOR for V2, guarded at the top of the function
for V1).

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 9cf6cb7. Configure here.

Comment thread src/models/organizations/organizations.model.js Outdated
…ned helper

Bugbot flagged a High Severity ReferenceError in my previous commit 9cf6cb7:
stale refactor code sitting in my worktree from an earlier exploration of a
`subscribeOrSkip` helper got swept into `git add -A` and landed in the
Organization.importOrganization method.  The helper was never defined,
never imported, and never exported — so the commit compiled but would
crash at runtime with `ReferenceError: subscribeOrSkip is not defined`
on the first non-simulator-mode call to importOrganization.

importOrganization is out of scope for this PR; its improvements live in
#1603.  Revert that function's "Subscribe to the org store first, then
check sync status" block to match origin/v2-rc2 verbatim.  All
intentional #1600 changes (reconcileOrganization + skipOnUnsynced,
governance sync, task intervals, tests) are untouched.

Verified by `git diff origin/v2-rc2 -- src/models/organizations/organizations.model.js`
now showing only reconcileOrganization changes.
Adopt the orthogonal pieces of #1590 (which supersedes that PR's governance
sync rewrite with #1600's simpler single-attempt-per-run design while
keeping the #1600 5-minute cadence).

- tasks/sync-default-organizations{,-v2}.js: remove the nested
  Governance.sync() trigger. Governance sync is now owned exclusively by
  sync-governance-body{,-v2}.js on its scheduler cadence; this task just
  reads whatever the local DB has and no-ops if orgList hasn't landed
  yet. Fixes a race where the task could block on governance sync before
  doing its own work.
- utils/data-loaders{,-v2}.js: drop the 50-retry recursive wait loop in
  getDefaultOrganizationList{,V2}. On a fresh install where governance
  data hasn't been synced, return [] instead of retrying internally; the
  scheduler will re-run us on cadence. Also move the USE_SIMULATOR /
  USE_DEVELOPMENT_MODE destructure into the functions so config reloads
  aren't stale.
- controllers/governance{,-v2}.controller.js: harden POST
  /governance/sync and POST /v2/governance/sync as fire-and-forget. The
  model sync() methods already return cleanly on any failure (subscribe-
  first pattern, scheduler retries on cadence); attach .catch defensively
  so future exceptions are logged instead of becoming unhandled
  rejections.

Does NOT adopt #1590's two-tier bootstrap/steady cadence (keeping #1600's
flat 300s default), the src/tasks/index.js scheduler refactor (not needed
without the promotion machinery), or the importOrganization
{ allowLongWait } option (that belongs to #1603).

Closes #1590 by folding in its compatible pieces.
@TheLastCicada TheLastCicada merged commit 1ac4cae into v2-rc2 Apr 23, 2026
26 checks passed
@TheLastCicada TheLastCicada deleted the fix/sync-blocking-regression branch April 23, 2026 23:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant