fix: make background sync non-blocking on unsynced stores#1600
Merged
Conversation
Background reconciliation and governance sync tasks could block for up to 10 minutes per call on known-unsynced datalayer stores, and the V1 governance sync recursively retried up to 50 times on every failure. This caused visible startup and task-cadence stalls. - Add sync-status pre-checks to V1 and V2 reconcileOrganization: subscribe to the org store so it starts syncing, then skip reconcile on the next task run if either the org store or its derived singleton is unsynced. - Replace recursive retry in V1/V2 governance sync with a single attempt per task invocation. On failure the function logs and returns; the scheduler re-runs the task on its normal cadence, so the coroutine never blocks its scheduler slot. - Skip governance sync immediately when the body store or its versioned governance store reports unsynced, preserving cached governance data. - Keep the production wait-loop in getRegistryStoreIdFromSingleton so request-path callers (subscribe controller, import flow) still work on newly-discovered stores; the reconcile pre-check makes the wait a no-op on the background path. - Standardize the governance sync interval to 300 seconds (5 minutes) across defaults, task fallbacks, example env, and test config. The sync pre-check is a single lightweight RPC so running every 5 minutes is safe. - Add focused tests for reconcile and governance non-blocking behavior, idempotency, and schedule configuration.
2 tasks
2 tasks
Bugbot flagged a permanent-skip loop in Governance.sync (both v1 and v2): getDataLayerStoreSyncStatus returns false for a store the DataLayer node has no record of, isDlStoreSynced(undefined) is false, sync() early-exits, and because sync() was the only place that called subscribeToStoreOnDataLayer (via getSubscribedStoreData), the store never got subscribed and the skip repeated forever on every scheduled run. Mirror the subscribe-first pattern already used in OrganizationsV2.reconcileOrganization: subscribe first (idempotent), then check sync status. Applied to both the governance body store and the versioned-governance store in both v1 and v2 models. subscribeToStoreOnDataLayer returns false on the common datalayer-unreachable / RPC-failure paths (it does not throw), so guard on both a falsy return and thrown exceptions. Without the falsy-return check, the dominant failure mode would slip past the try/catch and the subsequent status-check skip would fire with a misleading log. Adds tests/v2/integration/governance-sync-subscribe-first.spec.js covering five branches of GovernanceV2.sync: - subscribeToStoreOnDataLayer runs before getDataLayerStoreSyncStatus on the body store - sync() returns cleanly when the body-store subscribe throws - sync() returns cleanly when the body-store subscribe returns falsy - sync() subscribes to the versioned store before its status check (call order: body-subscribe -> body-status -> body-fetch -> versioned-subscribe -> versioned-status), and does not upsert when the versioned store is unsynced - sync() returns cleanly when the versioned-store subscribe returns falsy The v1 model captures USE_SIMULATOR at module import time so the same test pattern cannot reach its production path; the v1 change is structurally identical to v2 and is reviewed by code reading.
…falsy return Bugbot flagged an inconsistency introduced by af06a65: governance sync now correctly treats both a thrown exception and a falsy return from subscribeToStoreOnDataLayer as a failure, but V1/V2 reconcileOrganization still only guarded thrown exceptions. subscribeToStoreOnDataLayer in src/datalayer/persistance.js returns false (without throwing) on the two most common failure paths: - No storeId provided - getSubscriptions() RPC failure (datalayer unreachable) Without the falsy-return check, the dominant datalayer-unreachable failure slips past the try/catch, and the subsequent getDataLayerStoreSyncStatus call (which also fails when the store is not subscribed) produces a misleading "not yet synced" log instead of a clear "could not subscribe" message. Mirrors the subscribe-first pattern af06a65 added to governance sync. Both subscribe-first and the falsy-return guard are no-ops in simulator mode (wrapped in !USE_SIMULATOR for V2, skipped entirely for V1 via the function's top-level simulator guard).
TheLastCicada
added a commit
that referenced
this pull request
Apr 23, 2026
…zationMeta Mirrors the subscribe-first pattern established in #1600 af06a65 (for governance sync) and 6d41b71 (for reconcileOrganization). Without subscribe-first, the getDataLayerStoreSyncStatus pre-check added to syncOrganizationMeta has a permanent-skip-loop failure mode: if the DataLayer node has no record of an org store (e.g. after a DataLayer DB reset), the status check returns falsy, the loop iteration skips the org, and no subsequent run ever subscribes — the org's metadata stops syncing forever. Also adds a falsy-return guard on subscribeToStoreOnDataLayer: the wrapper in src/datalayer/persistance.js returns false (without throwing) on the common failure paths (no storeId, getSubscriptions RPC failure). Without the falsy check, a datalayer-unreachable failure would slip past the try/catch and produce a misleading "not yet synced" log. Applies to both V1 Organization.syncOrganizationMeta and V2 OrganizationsV2.syncOrganizationMeta.
TheLastCicada
added a commit
that referenced
this pull request
Apr 23, 2026
…sy return Mirrors the pattern established in #1600 af06a65 (for governance sync) and 6d41b71 (for reconcileOrganization). subscribeToStoreOnDataLayer in src/datalayer/persistance.js returns false (without throwing) on the common failure paths: - No storeId provided - getSubscriptions() RPC failure (datalayer unreachable) Without the falsy-return check, the dominant datalayer-unreachable failure would slip past the try/catch, and the subsequent getDataLayerStoreSyncStatus call (which also fails when the store is not subscribed) would produce a misleading "not yet synced" log. Applies to both V1 and V2 importOrganization at two subscribe call sites each: the pre-existing orgUid subscribe at the top of the function, and the new singleton-store subscribe added in this PR. Both are adjacent in the same function; fixing only one would leave an internal inconsistency that a reviewer would flag.
…ta loss Bugbot flagged a High Severity regression on 6d41b71: reconcileOrganization now silently returns (no throw) when a store is unsynced or the subscribe call fails, but the user-facing resyncOrganization controller (POST /organizations/resync and POST /v2/organizations/resync) calls reconcileOrganization and then unconditionally: 1. Resets registryHash / registry_hash to '0' 2. Destroys ModelKeys rows (V1) / AuditV2 rows (V2) for the orgUid Before this PR, reconcileOrganization would either complete or throw; the throw would hit the controller's catch, roll back the transaction, and return 400 without destroying data. After 6d41b71, a transient datalayer hiccup (subscribe returns false, sync-status RPC fails, etc.) causes the resync endpoint to "succeed" with silent data deletion. Fix: add a { skipOnUnsynced } option to both V1 and V2 reconcileOrganization. - Default (false): throw on any subscribe-failed / store-unsynced / status-check-failed condition. Restores pre-6e76235 behavior for request-path callers. - true: silent skip with an INFO log. Preserves the sync-blocking regression fix for background task callers (validate-organization-table-and-subscriptions*.js). Background task callers updated to pass { skipOnUnsynced: true }. The two resync controllers are unchanged — they continue to call reconcileOrganization with the (now throwing) default behavior, which hits their existing catch block and rolls back the transaction. Adds six contract tests covering the new option in simulator mode (signature acceptance, no-option back-compat, no signature errors for either value). Production throw/skip behavior remains covered by live integration tests; simulator mode cannot reach the pre-check block (wrapped in !USE_SIMULATOR for V2, guarded at the top of the function for V1).
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 9cf6cb7. Configure here.
…ned helper Bugbot flagged a High Severity ReferenceError in my previous commit 9cf6cb7: stale refactor code sitting in my worktree from an earlier exploration of a `subscribeOrSkip` helper got swept into `git add -A` and landed in the Organization.importOrganization method. The helper was never defined, never imported, and never exported — so the commit compiled but would crash at runtime with `ReferenceError: subscribeOrSkip is not defined` on the first non-simulator-mode call to importOrganization. importOrganization is out of scope for this PR; its improvements live in #1603. Revert that function's "Subscribe to the org store first, then check sync status" block to match origin/v2-rc2 verbatim. All intentional #1600 changes (reconcileOrganization + skipOnUnsynced, governance sync, task intervals, tests) are untouched. Verified by `git diff origin/v2-rc2 -- src/models/organizations/organizations.model.js` now showing only reconcileOrganization changes.
Adopt the orthogonal pieces of #1590 (which supersedes that PR's governance sync rewrite with #1600's simpler single-attempt-per-run design while keeping the #1600 5-minute cadence). - tasks/sync-default-organizations{,-v2}.js: remove the nested Governance.sync() trigger. Governance sync is now owned exclusively by sync-governance-body{,-v2}.js on its scheduler cadence; this task just reads whatever the local DB has and no-ops if orgList hasn't landed yet. Fixes a race where the task could block on governance sync before doing its own work. - utils/data-loaders{,-v2}.js: drop the 50-retry recursive wait loop in getDefaultOrganizationList{,V2}. On a fresh install where governance data hasn't been synced, return [] instead of retrying internally; the scheduler will re-run us on cadence. Also move the USE_SIMULATOR / USE_DEVELOPMENT_MODE destructure into the functions so config reloads aren't stale. - controllers/governance{,-v2}.controller.js: harden POST /governance/sync and POST /v2/governance/sync as fire-and-forget. The model sync() methods already return cleanly on any failure (subscribe- first pattern, scheduler retries on cadence); attach .catch defensively so future exceptions are logged instead of becoming unhandled rejections. Does NOT adopt #1590's two-tier bootstrap/steady cadence (keeping #1600's flat 300s default), the src/tasks/index.js scheduler refactor (not needed without the promotion machinery), or the importOrganization { allowLongWait } option (that belongs to #1603). Closes #1590 by folding in its compatible pieces.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.

Summary
Eliminate long blocking waits in background reconciliation and governance sync without changing current subscribe/resync request-path semantics.
Organization.reconcileOrganizationand V2OrganizationsV2.reconcileOrganizationnow subscribe to the org store first, then checkgetDataLayerStoreSyncStatusbefore entering any blocking subscription flow. If either the org store or its derived singleton store is unsynced, the task logs at INFO level and returns — the periodic task retries on the next run.Governance.syncand V2GovernanceV2.synccheck the governance body store and versioned governance store sync status up front and return immediately if either is unsynced. On any other failure the function logs and returns; the scheduler re-runs the task on its normal cadence instead of an in-task retry loop. Cached governance data is preserved across skips.src/utils/defaultConfig.js, bothsync-governance-body*.jsfallbacks,docker-compose-example.yml, andtests/v2/config/test-config.yaml. The sync-status pre-check is a single lightweight RPC, so running every 5 minutes is safe.getRegistryStoreIdFromSingletonwait-loop preserved. The production wait-loop in V2'sgetRegistryStoreIdFromSingletonis kept soPOST /v2/organizations/subscribeand the org-import flow still work on newly-discovered stores. The reconcile pre-check makes the wait a no-op on the background path.upsertGovernanceDownloadidempotency, and the default interval constant.Behavior change callouts
GOVERNANCE_SYNC_TASK_INTERVALchanges from86400(tasks fallback) /3600(defaultConfig) to300. Operators who pinned the env var explicitly are unaffected; operators relying on the default will see a governance sync every 5 minutes instead of every 1 or 24 hours.Governance.sync/GovernanceV2.syncno longer accept aretryCounterparameter and no longer throw on missingGOVERNANCE_BODY_ID— they log and return.Test plan
npm run test:v2— 3 new spec files (22 tests) pass; full v2 suite passes except for two pre-existing flaky tests (project-v2 cascade,issuance-v2 DELETE) that also fail onv2-rc2without these changes.Note
Medium Risk
Changes core background sync/reconciliation control flow to return early on unsynced stores and removes in-function retry loops, which could affect freshness/timing of governance/org data if the new skip conditions are too aggressive. Also increases default governance sync frequency to every 5 minutes, impacting task load and log volume.
Overview
Makes background governance sync and org reconciliation non-blocking when datalayer stores aren’t ready.
Governance.sync/GovernanceV2.syncnow follow a subscribe-first + sync-status precheck pattern and do a single best-effort attempt per scheduled run (no internal retry loop); failures log and return while preserving cached governance rows.Adds a
skipOnUnsyncedoption to org reconciliation and uses it in periodic validation tasks.Organization.reconcileOrganizationandOrganizationsV2.reconcileOrganizationnow pre-subscribe and check org + singleton store sync status and either skip (background tasks) or throw (request paths) to avoid long waits and prevent destructive follow-on operations from running on stale state.Standardizes governance sync cadence to 300s (5 min). Updates defaults and fallbacks (
defaultConfig, task jobs, example compose/test config), removes “kick governance sync” behavior from default-org sync tasks, and adds integration tests covering schedule defaults, subscribe-first ordering, idempotent upserts, and simulator fast paths.Reviewed by Cursor Bugbot for commit d98198f. Bugbot is set up for automated code reviews on this repo. Configure here.