test(blockchain/sql): share one Postgres test container instead of one per call#1045
Conversation
…e per call SetupTestPostgresContainer started a new postgres:16-alpine container on every call, so packages with many Postgres tests spun up dozens of containers per run (each a ~30s readiness wait). Under -race the cumulative startups blew the package timeout and container-creation contention caused transient connection errors — a major unit-test flakiness/slowness source. Start one container per test binary lazily (sync.Once) and hand each caller its own database (testdb_<atomic counter>) created on the postgres system DB and serialized behind a mutex (template-lock-safe even if a future caller adds t.Parallel). Cleanup drops the per-test DB WITH (FORCE); the shared container is reaped by Ryuk at process exit. Public signature unchanged; skip-on-no-Docker preserved. Audited: 8 callers, none use t.Parallel or assume container identity.
|
🤖 Claude Code Review Status: Complete All previously reported issues have been addressed. The implementation is solid with proper error handling, concurrency controls, and resource management. Key strengths:
History: |
There was a problem hiding this comment.
Pull request overview
This PR optimizes Postgres-backed Go tests by starting a single postgres:16-alpine testcontainers instance per test binary (lazy sync.Once) and creating an isolated per-call database on that shared container, reducing CI timeouts/flakiness from repeated container startups.
Changes:
- Introduces a shared container lifecycle (initialized once) and per-call unique DB creation via an atomic counter.
- Adds admin helpers to swap DB in the connection string, execute short-lived admin SQL, and ensure
sslmode=disable. - Updates readiness/validation logic and retry/backoff behavior for more reliable container startup.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- forbidigo: replace fmt.Errorf with errors.NewProcessingError (5 sites) - validateDatabaseConnection: wrap lastErr as the cause instead of fmt.Sprintf into the message, preserving the error chain (Copilot) - correct the CREATE DATABASE comment: connecting the admin session to the postgres system DB matters because that session isn't counted as an active template1 user; CREATE DATABASE still copies template1 by default (Copilot) - terminate the container if ConnectionString fails after creation, so a failed startup can't leak a stray container (Copilot) - document that the shared container is intentionally not retained and is reaped by the testcontainers Ryuk reaper at process exit
|
Addressed review feedback in 7d2b802: golangci-lint (forbidigo): replaced all Copilot (3 inline, resolved):
Claude Code Review: finding 1 (error formatting) and finding 2 (Ryuk cleanup documentation) are both covered by the above. On the |
Benchmark Comparison ReportBaseline: Current: Summary
All benchmark results (sec/op)
Threshold: >10% with p < 0.05 | Generated: 2026-06-08 14:27 UTC |
ordishs
left a comment
There was a problem hiding this comment.
Approve with comments. The design is correct and the isolation guarantees hold — one container per test binary via sync.Once plus a per-call testdb_N, public signature unchanged, and I verified all callers immediately parse the connStr/build a store with 0 use of t.Parallel().
A few non-blocking items to address before merge:
1. sync.Once poisons the binary on first-call failure. Previously each call ran its own 3-attempt retry; now if startSharedContainer() fails on the first caller (transient port contention, slow image pull), containerErr is permanent and every subsequent test in that binary fails. The 3 internal attempts mitigate this, but a transient first-start failure now takes out a whole package's tests instead of one. Worth a note in the PR body.
2. Error-handling inconsistency. New code mostly uses stdlib fmt.Errorf(... %w ...), while the file also uses errors.NewProcessingError (the repo's github.com/bsv-blockchain/teranode/errors). Mixing both in one file is inconsistent — prefer the custom errors package for the new wrap sites to match project convention.
3. Misleading comments (minor):
- The
CREATE DATABASEcomment says connecting topostgresavoids template access "since postgres system DB is never used as a template source" — butCREATE DATABASEstill copiestemplate1by default regardless of connection DB. The code is correct; the real protections are (a) nobody connects to template1, (b) creates are serialized bycreateMu. The reasoning just misstates why. - The
varblock comment says "sharedContainer holds the single shared PostgreSQL container" but there is no container variable — onlysharedConnStr string. The container is reaped by Ryuk, never explicitly terminated. Fix the comment to match.
4. Verification. The test-plan checkboxes are unchecked, and this PR's entire justification is empirical (~100s→6s, removes the -race SIGTERM class). Please confirm the listed go test -race commands were actually run and paste timings before merge.
Address review (Simon #3b): the var-block comment described a 'sharedContainer' that does not exist — the retained state is sharedConnStr (a connStr); the container handle is not kept (lifecycle bound to the test process / CI runner teardown, Ryuk disabled). Also notes the sync.Once first-start poisoning tradeoff.
…shared-test-container
|
@ordishs all four addressed:
|
|



Summary
SetupTestPostgresContainer()started a brand-newpostgres:16-alpinecontainer on every call. Packages with many Postgres tests (e.g.stores/blockchain/sql) spun up dozens of containers per run, each with a ~30s readiness wait. Under-racethe cumulative startups blew the per-package timeout and the container-creation contention produced transient connection errors — a major source of unit-test flakiness and slow CI.This shares one container per test binary (lazy
sync.Once) and gives each caller its own isolated database on it. The public signature is unchanged, so no callers need updating.Change (
test/utils/postgres/container.go)postgres:16-alpinecontainer lazily viasync.Once(keeps the 3-attempt startup retry + readiness validation).SetupTestPostgresContainer()call creates a unique database (testdb_<atomic counter>) on that container and returns a connection string pointing at it → full per-test isolation, identical to today's fresh-container behaviour (each store runs fresh migrations/genesis on an empty DB).CREATE DATABASEis issued against thepostgressystem DB (never a template) and serialized behind a mutex, so it cannot hittemplate1 is being accessed by other userseven if a future caller addst.Parallel().WITH (FORCE)(terminates the store's lingering backends; Postgres 16). The shared container handle is not retained — its lifecycle is bound to the test process / CI runner teardown (Ryuk is disabled in CI; locally, if enabled, the testcontainers reaper cleans it up at process exit).t.Skipf).Blast radius
Audited all callers repo-wide: 8 caller files, 0 use
t.Parallel(), none assume container identity/port/lifecycle — all immediately parse the connStr and build a store on a fresh DB. Fully compatible with shared-container + per-test-DB.Known tradeoff —
sync.OncepoisoningPreviously each call ran its own 3-attempt retry, so a transient first-start failure (slow image pull, port contention) took out only that one test. Now the first caller's
startSharedContainer()runs undersync.Once; if it fails,containerErris permanent and every subsequent test in that binary fails rather than just one. The 3 internal start attempts mitigate this (the failure has to persist across all three), but the blast radius of a genuinely stuck first start is wider. Accepted as the cost of not paying a fresh ~30s container start per test; ephemeral CI runners make a poisoned binary a re-run, not a stuck state.Impact
stores/blockchain/sqlnon-fork Postgres subtests: ~100s → ~6s for 8 iterations (one container instead of eight). Removes the-racetimeout/SIGTERM class on the unit-test job.Test plan
go test -race -tags testtxmetacache -count=1 ./stores/blockchain/sql/...(primary shared-container caller), run locally on this branch:-race. Previously the per-call container startups blew the per-package timeout — this is the SIGTERM class this PR removes.template1 is being accessed/ stale-DB errors across the run.GetLatestBlockHeaderFromBlockLocator/fork_scenario, which is the pre-existingon_main_chainstartup-rebuild race fixed in fix(blockchain/sql): snapshot-isolate on_main_chain rebuild to prevent startup race #1044 — not caused by this change. Confirmed by A/B: it fails on this (racy) branch and passes 5/5 with the fix(blockchain/sql): snapshot-isolate on_main_chain rebuild to prevent startup race #1044 fix applied. This PR goes green once fix(blockchain/sql): snapshot-isolate on_main_chain rebuild to prevent startup race #1044 lands and this branch picks upmain.