
Commit 47481a0

fix: init mirror models eagerly and recover lost writes on reconnect
Fixes a silent startup race in CADT's MySQL mirror: if MySQL was not yet reachable when CADT started, every mirror Model.init() call was permanently skipped. Subsequent mirror writes threw "Cannot read properties of undefined (reading 'constructor')" from deep inside Sequelize and the mirror stayed non-functional for the life of the process. Observed on the cadt-testneta-observer k8s pod.

Core fix (V1 + V2):
- safeMirrorDbHandler[V2] no longer wraps Model.init(). A new initMirrorModel[V2] helper initialises models synchronously at module-load time, which is safe because Sequelize.init is pure metadata registration and does not require a live connection. All 34 V1+V2 *.model.mirror.js files updated.
- safeMirrorDbHandler[V2] now detects disconnect->reconnect transitions (and "setup never ran at startup") and runs prepareMysqlMirror[V2] + backfillMirror[V2] before the current callback, so writes dropped during an outage are recovered without requiring a CADT restart.
- backfillMirror[V2] also sweeps orphan rows (mirror rows whose PK no longer exists in source) so DELETEs issued during an outage eventually propagate. Pagination now has an ORDER BY on the PK so concurrent writes don't cause skipped rows. V1 previously had no backfill at all - that is added.
- validateMirrorDbNames now fails loud instead of being swallowed by the mirror-setup try/catch, so a V1/V2 DB_NAME collision aborts startup the same way it did before this PR's refactor.

Tests:
- New integration tests: assert every mirror model is initialised at module load (with a single-column-PK invariant guarding the orphan sweep); orphan sweep removes source-deleted rows from the mirror; reconnect detection runs backfill; steady-state connects do NOT run backfill.
- V1 live-api data-short.js now verifies records land in the V1 MySQL mirror (symmetric with V2). Ports mysql-mirror-helpers to V1 and fixes a date-then-numeric comparison fallback in both helpers so '2024-01-01' no longer equals '2024-06-15' by parseFloat.

CI:
- test-v2-live-api and test-v1-live-api now stop MariaDB mid-setup and restart it ~30s after CADT starts, exercising the startup race plus the reconnect+backfill recovery path end-to-end. Both jobs hard-fail if MariaDB is still reachable after service mariadb stop so a degraded race test cannot silently pass. V1 Start CADT now polls the V1 API endpoint with exit-on-timeout, matching V2.

Related fix (out of scope for the bug but same test surface):
- wallet-health.live.spec.js and datalayer-test-helpers.js now honour CERTIFICATE_FOLDER_PATH via a new shared live-api helper, matching src/datalayer/wallet.js. Previously hardcoded \${chiaRoot}/config/ssl would false-negative in deployments using a non-default Chia SSL directory.
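
Illustrative note on the date-then-numeric comparison fix: parseFloat stops at the first character it cannot parse, so both '2024-01-01' and '2024-06-15' parse to 2024. The sketch below shows one way to order the checks so that date-like values never reach the numeric fallback. It is not the code from mysql-mirror-helpers; the names valuesMatch, looksLikeDate, and ISO_DATE are hypothetical and chosen for this example only.

```javascript
// Hypothetical helper, not the project's actual implementation: compare a
// source value with a mirrored value, trying a date interpretation first.
const ISO_DATE = /^\d{4}-\d{2}-\d{2}/;

function looksLikeDate(value) {
  return value instanceof Date || (typeof value === 'string' && ISO_DATE.test(value));
}

function valuesMatch(expected, actual) {
  // Date-like values are compared as timestamps and never fall through to
  // a numeric parse, which would only read the leading year.
  if (looksLikeDate(expected) || looksLikeDate(actual)) {
    const a = new Date(expected).getTime();
    const b = new Date(actual).getTime();
    return !Number.isNaN(a) && a === b;
  }
  // Numeric comparison for values such as '10.50' versus 10.5.
  const x = Number(expected);
  const y = Number(actual);
  if (!Number.isNaN(x) && !Number.isNaN(y)) {
    return x === y;
  }
  // Last resort: plain string equality.
  return String(expected) === String(actual);
}

// The pitfall: both strings parse to the number 2024.
console.log(parseFloat('2024-01-01') === parseFloat('2024-06-15')); // true
console.log(valuesMatch('2024-01-01', '2024-06-15')); // false
console.log(valuesMatch('10.50', 10.5)); // true
```

The ordering is the point of the sketch: once a value looks like a date, a mismatch is reported as a mismatch instead of being retried under a looser numeric interpretation.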
1 parent 374c9f3 commit 47481a0

46 files changed

Lines changed: 2167 additions & 141 deletions


.github/workflows/tests.yaml

Lines changed: 191 additions & 4 deletions
@@ -161,6 +161,34 @@ jobs:
           }
           echo "✓ MariaDB (MySQL-compatible) mirror database is ready for testing"

+          # Stop MariaDB so CADT starts against a DOWN mirror. This exercises
+          # the startup race that k8s observers hit: CADT up before MySQL is
+          # reachable. The subsequent "Delay-start MariaDB (exercise mirror
+          # reconnect recovery)" step will restart MariaDB ~30s after CADT,
+          # and the V2 mirror reconnect path must transparently run
+          # migrations + backfill and catch up without losing any writes.
+          # If the test fails, either the init race fix or the
+          # reconnect-backfill logic has regressed.
+          #
+          # The race test is load-bearing on MariaDB actually being down
+          # when CADT starts - fail the job if mysqladmin can still ping it
+          # after 10 seconds of retries rather than letting the test
+          # silently degrade into a no-op.
+          echo "Stopping MariaDB so CADT starts against a down mirror..."
+          service mariadb stop
+          for STOP_ATTEMPT in $(seq 1 10); do
+            if ! mysqladmin ping -h localhost --silent 2>/dev/null; then
+              echo "✓ MariaDB stopped after ${STOP_ATTEMPT}s - CADT startup will race against delayed restart"
+              break
+            fi
+            if [ "$STOP_ATTEMPT" -eq 10 ]; then
+              echo "ERROR: MariaDB still responding 10s after 'service mariadb stop'."
+              echo "The race test's precondition (MariaDB down when CADT starts) cannot be met."
+              exit 1
+            fi
+            sleep 1
+          done
+
       - name: Configure Chia
         shell: bash
         run: |
@@ -174,7 +202,11 @@ jobs:
       - name: Configure CADT
         shell: bash
         run: |
-          # Run cadt momentarily to create config file
+          # Run cadt momentarily to create config file. Note: MariaDB is
+          # intentionally stopped at this point to exercise the mirror
+          # init race on first boot - CADT must log the setup failure
+          # non-fatally and continue, then recover automatically once
+          # MariaDB restarts.
           pm2 start npm --no-autorestart --name "cadt" -- start
           sleep 10
           pm2 logs cadt --nostream
@@ -283,6 +315,10 @@ jobs:
       - name: Start CADT
         shell: bash
         run: |
+          # MariaDB is still stopped at this point (see "Configure MySQL").
+          # CADT must start successfully, log the mirror setup failure as
+          # non-fatal, and expose its health endpoint. This exercises the
+          # mirror init race that broke production observers on k8s.
           pm2 start npm --no-autorestart --name "cadt" -- start
           echo "Waiting for CADT health endpoint..."
           MAX_CADT_HEALTH_ATTEMPTS=30
@@ -299,6 +335,37 @@ jobs:
             sleep 2
           done

+      - name: Delay-start MariaDB (exercise mirror reconnect recovery)
+        shell: bash
+        run: |
+          # Restart MariaDB now that CADT has been running without it for a
+          # while. The next V2 mirror operation should:
+          # 1. Detect authenticate success following previous failures.
+          # 2. Retry CREATE DATABASE + migrations via prepareMysqlMirrorV2
+          # (they never ran at startup because MariaDB was down).
+          # 3. Run backfillMirrorV2 to catch up any writes + deletes that
+          # happened while MariaDB was unavailable.
+          # The live-api tests' verifyMirrorRecordsBatch assertions will
+          # fail if any step regresses.
+          echo "Waiting a few seconds so CADT logs multiple mirror-down messages..."
+          sleep 10
+          echo "Starting MariaDB (CADT should reconnect and recover)..."
+          service mariadb start
+          for i in {1..30}; do
+            if mysqladmin ping -h localhost --silent 2>/dev/null; then
+              echo "MariaDB is back up"
+              break
+            fi
+            sleep 1
+          done
+          # Re-verify the DB and test user persist - they were created
+          # before the stop, and the data directory is unchanged.
+          mysql -u cadt_test -pcadt_test_password -e "SELECT 1 as test;" || {
+            echo "Failed to reconnect to MariaDB after restart"
+            exit 1
+          }
+          echo "✓ MariaDB back online - CADT mirror recovery path will engage"
+
       - name: Wait for wallet sync and datalayer readiness
         shell: bash
         run: |
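
Illustrative note on the hunk above: the three numbered steps in the "Delay-start MariaDB" comments (detect a successful authenticate after failures, re-run setup, then backfill) amount to a small state machine. The sketch below is one possible shape for it, not the actual safeMirrorDbHandlerV2. The factory createMirrorHandler and its injected authenticate, prepare, and backfill functions are hypothetical stand-ins for the real connection check, prepareMysqlMirrorV2, and backfillMirrorV2.

```javascript
// Hypothetical sketch of reconnect detection; all names are assumptions.
function createMirrorHandler({ authenticate, prepare, backfill }) {
  // True at startup ("setup never ran") and again after any failed
  // connection check, so the next successful check triggers recovery.
  let needsRecovery = true;

  return async function handle(callback) {
    try {
      await authenticate();
    } catch (error) {
      needsRecovery = true;
      return false; // mirror is down: skip this write without throwing
    }
    if (needsRecovery) {
      await prepare();  // create database + run migrations
      await backfill(); // catch up on writes and deletes missed while down
      needsRecovery = false;
    }
    await callback();
    return true;
  };
}

// Simulated outage: the first write is dropped, the second one triggers
// recovery, and the third runs in steady state without another backfill.
async function demo() {
  const calls = [];
  let mirrorUp = false;
  const handle = createMirrorHandler({
    authenticate: async () => {
      if (!mirrorUp) throw new Error('ECONNREFUSED');
    },
    prepare: async () => { calls.push('prepare'); },
    backfill: async () => { calls.push('backfill'); },
  });

  await handle(async () => { calls.push('write-1'); });
  mirrorUp = true;
  await handle(async () => { calls.push('write-2'); });
  await handle(async () => { calls.push('write-3'); });
  return calls;
}

demo().then((calls) => console.log(calls.join(', ')));
// prints "prepare, backfill, write-2, write-3"
```

In this sketch a failure inside prepare or backfill leaves needsRecovery set, so recovery is attempted again on the next call; whether the real handler swallows or propagates such errors is not stated in the commit message.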
@@ -593,7 +660,63 @@ jobs:
           echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/chia.gpg] https://repo.chia.net/chia-tools/debian/ stable main" | tee /etc/apt/sources.list.d/chia-tools.list > /dev/null

           apt-get update
-          apt-get install -y yq jq bc iproute2 chia-blockchain-cli chia-tools
+          apt-get install -y yq jq bc iproute2 default-mysql-server default-mysql-client chia-blockchain-cli chia-tools
+
+      - name: Configure MySQL for V1 mirror database testing
+        shell: bash
+        run: |
+          echo "Starting MariaDB service..."
+          service mariadb start
+
+          echo "Waiting for MariaDB to be ready..."
+          for i in {1..30}; do
+            if mysqladmin ping -h localhost --silent 2>/dev/null; then
+              echo "MariaDB is ready!"
+              break
+            fi
+            echo "Waiting for MariaDB... ($i/30)"
+            sleep 1
+          done
+
+          echo "Creating V1 test database and user..."
+          mysql -u root <<EOF
+          CREATE DATABASE IF NOT EXISTS cadt_mirror_test;
+          CREATE USER IF NOT EXISTS 'cadt_test'@'localhost' IDENTIFIED BY 'cadt_test_password';
+          GRANT ALL PRIVILEGES ON cadt_mirror_test.* TO 'cadt_test'@'localhost';
+          FLUSH PRIVILEGES;
+          EOF
+
+          echo "Verifying MariaDB setup..."
+          mysql -u cadt_test -pcadt_test_password -e "SELECT 1 as test;" || {
+            echo "Failed to connect as test user"
+            exit 1
+          }
+          echo "✓ MariaDB (V1 mirror database) is ready for testing"
+
+          # Stop MariaDB so CADT starts against a DOWN mirror. This
+          # exercises the V1 mirror init race (same bug class as V2: the
+          # fire-and-forget handler would skip model.init() when MySQL was
+          # unreachable at module load). The "Delay-start MariaDB" step
+          # below restarts MariaDB ~30s after CADT, and the V1 mirror
+          # reconnect path must transparently run migrations + backfill.
+          #
+          # Fail the job if mysqladmin can still ping MariaDB after 10
+          # seconds of retries rather than letting the race test silently
+          # degrade into a no-op against an always-up DB.
+          echo "Stopping MariaDB so CADT starts against a down mirror..."
+          service mariadb stop
+          for STOP_ATTEMPT in $(seq 1 10); do
+            if ! mysqladmin ping -h localhost --silent 2>/dev/null; then
+              echo "✓ MariaDB stopped after ${STOP_ATTEMPT}s - CADT V1 startup will race against delayed restart"
+              break
+            fi
+            if [ "$STOP_ATTEMPT" -eq 10 ]; then
+              echo "ERROR: MariaDB still responding 10s after 'service mariadb stop'."
+              echo "The race test's precondition (MariaDB down when CADT starts) cannot be met."
+              exit 1
+            fi
+            sleep 1
+          done

       - name: Configure Chia
         shell: bash
@@ -608,7 +731,9 @@ jobs:
       - name: Configure CADT
         shell: bash
         run: |
-          # Run cadt momentarily to create config file
+          # Run cadt momentarily to create config file. MariaDB is
+          # intentionally stopped at this point to exercise the V1 mirror
+          # init race on first boot.
           pm2 start npm --no-autorestart --name "cadt" -- start
           sleep 10
           pm2 logs cadt --nostream
@@ -636,6 +761,16 @@ jobs:
           yq -yi '.V1.ENABLE = true' ~/.chia/mainnet/cadt/config.yaml
           yq -yi '.V2.ENABLE = false' ~/.chia/mainnet/cadt/config.yaml

+          # Configure MySQL mirror database for V1 testing.
+          # V1.MIRROR_DB is the V1 analog of V2.MIRROR_DB; the V1 live-api
+          # test runner (tests/v1/live-api/data-short.js) reads this path
+          # and asserts each record is mirrored correctly.
+          echo "Configuring MySQL mirror database for V1..."
+          yq -yi '.V1.MIRROR_DB.DB_HOST = "localhost"' ~/.chia/mainnet/cadt/config.yaml
+          yq -yi '.V1.MIRROR_DB.DB_USERNAME = "cadt_test"' ~/.chia/mainnet/cadt/config.yaml
+          yq -yi '.V1.MIRROR_DB.DB_PASSWORD = "cadt_test_password"' ~/.chia/mainnet/cadt/config.yaml
+          yq -yi '.V1.MIRROR_DB.DB_NAME = "cadt_mirror_test"' ~/.chia/mainnet/cadt/config.yaml
+
           # Show the config file
           echo "Showing the config file after configuration..."
           cat ~/.chia/mainnet/cadt/config.yaml
@@ -708,8 +843,60 @@ jobs:
       - name: Start CADT
         shell: bash
         run: |
+          # MariaDB is still stopped at this point (see "Configure MySQL
+          # for V1 mirror database testing"). CADT must start successfully,
+          # log the mirror setup failure as non-fatal, and expose its API.
+          # The subsequent "Delay-start MariaDB" step restarts MariaDB
+          # shortly after so the V1 mirror reconnect path is exercised
+          # while CADT is still booting.
           pm2 start npm --no-autorestart --name "cadt" -- start
-          sleep 30
+          echo "Waiting for CADT API endpoint (V1 port ${CW_PORT:-31310})..."
+          MAX_CADT_HEALTH_ATTEMPTS=30
+          for CADT_ATTEMPT in $(seq 1 "$MAX_CADT_HEALTH_ATTEMPTS"); do
+            # V1 API doesn't expose /health - probe /v1/organizations instead
+            if curl -fsS --max-time 5 http://127.0.0.1:31310/v1/organizations >/dev/null 2>&1; then
+              echo "CADT V1 API is ready"
+              break
+            fi
+            if [ "$CADT_ATTEMPT" -eq "$MAX_CADT_HEALTH_ATTEMPTS" ]; then
+              echo "ERROR: CADT V1 API did not become ready in time"
+              pm2 logs cadt --lines 200 --nostream || true
+              exit 1
+            fi
+            sleep 2
+          done
+
+      - name: Delay-start MariaDB (exercise V1 mirror reconnect recovery)
+        shell: bash
+        run: |
+          # Restart MariaDB now that CADT has been running without it.
+          # The next V1 mirror operation should run prepareMysqlMirror +
+          # backfillMirror via the reconnect path, catching up any writes
+          # or deletes that would otherwise be lost during the outage
+          # window. The V1 live-api test runner
+          # (tests/v1/live-api/data-short.js) asserts each record lands
+          # in the MySQL mirror, so a regression in the init race or
+          # reconnect/backfill logic will fail this job.
+          #
+          # Small delay first so CADT accumulates multiple "mirror down"
+          # log entries - matches the V2 job's sequencing and gives the
+          # reconnect path a non-trivial backlog to catch up on.
+          echo "Waiting a few seconds so CADT logs multiple mirror-down messages..."
+          sleep 10
+          echo "Starting MariaDB (CADT should reconnect and recover)..."
+          service mariadb start
+          for i in {1..30}; do
+            if mysqladmin ping -h localhost --silent 2>/dev/null; then
+              echo "MariaDB is back up"
+              break
+            fi
+            sleep 1
+          done
+          mysql -u cadt_test -pcadt_test_password -e "SELECT 1 as test;" || {
+            echo "Failed to reconnect to MariaDB after restart"
+            exit 1
+          }
+          echo "✓ MariaDB back online - CADT V1 mirror recovery path will engage"

       - name: Wait for wallet sync and datalayer readiness
         shell: bash
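
Illustrative note on the "writes or deletes" that the V1 reconnect path is expected to catch up: the commit message describes a backfill that copies missing rows and an orphan sweep that removes mirror rows whose PK no longer exists in the source, both paginated in PK order. The sketch below models that with in-memory Maps instead of SQL tables. It is not the project's backfillMirror: the names pagesOrderedByPk and backfillAndSweep are hypothetical, and it uses keyset pagination (each page starts after the last PK seen), which is a swapped-in technique, since the commit only states that an ORDER BY on the PK was added.

```javascript
// Hypothetical in-memory model: `source` and `mirror` are Maps keyed by a
// single-column string primary key.
function* pagesOrderedByPk(table, pageSize) {
  // Keyset pagination: every page starts strictly after the last PK seen,
  // so rows added or removed between pages cannot shift later pages.
  let lastPk = null;
  while (true) {
    const page = [...table.keys()]
      .sort()
      .filter((pk) => lastPk === null || pk > lastPk)
      .slice(0, pageSize);
    if (page.length === 0) return;
    yield page;
    lastPk = page[page.length - 1];
  }
}

function backfillAndSweep(source, mirror, pageSize = 2) {
  let upserted = 0;
  let deleted = 0;

  // Backfill: copy rows that are missing or stale in the mirror.
  for (const page of pagesOrderedByPk(source, pageSize)) {
    for (const pk of page) {
      if (mirror.get(pk) !== source.get(pk)) {
        mirror.set(pk, source.get(pk));
        upserted += 1;
      }
    }
  }

  // Orphan sweep: delete mirror rows whose PK no longer exists in source.
  for (const page of pagesOrderedByPk(mirror, pageSize)) {
    for (const pk of page) {
      if (!source.has(pk)) {
        mirror.delete(pk);
        deleted += 1;
      }
    }
  }
  return { upserted, deleted };
}

// During the simulated outage, rows 'b' and 'd' were written to the source
// and row 'c' was deleted from it.
const source = new Map([['a', 1], ['b', 2], ['d', 4]]);
const mirror = new Map([['a', 1], ['c', 3]]);
const result = backfillAndSweep(source, mirror);
console.log(result.upserted, result.deleted); // prints "2 1"
console.log([...mirror.keys()].sort().join(',')); // prints "a,b,d"
```

The single-column PK assumption in the sketch mirrors the invariant that the commit's integration tests guard for the orphan sweep.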
