Skip to content

test(blockchain/sql): share one Postgres test container instead of one per call#1045

Merged
oskarszoon merged 5 commits into
bsv-blockchain:mainfrom
oskarszoon:fix/blockchain-sql-shared-test-container
Jun 8, 2026
Merged

test(blockchain/sql): share one Postgres test container instead of one per call#1045
oskarszoon merged 5 commits into
bsv-blockchain:mainfrom
oskarszoon:fix/blockchain-sql-shared-test-container

Conversation

@oskarszoon

@oskarszoon oskarszoon commented Jun 6, 2026

Copy link
Copy Markdown
Contributor

Summary

SetupTestPostgresContainer() started a brand-new postgres:16-alpine container on every call. Packages with many Postgres tests (e.g. stores/blockchain/sql) spun up dozens of containers per run, each with a ~30s readiness wait. Under -race the cumulative startups blew the per-package timeout and the container-creation contention produced transient connection errors — a major source of unit-test flakiness and slow CI.

This shares one container per test binary (lazy sync.Once) and gives each caller its own isolated database on it. The public signature is unchanged, so no callers need updating.

Change (test/utils/postgres/container.go)

  • Start one postgres:16-alpine container lazily via sync.Once (keeps the 3-attempt startup retry + readiness validation).
  • Each SetupTestPostgresContainer() call creates a unique database (testdb_<atomic counter>) on that container and returns a connection string pointing at it → full per-test isolation, identical to today's fresh-container behaviour (each store runs fresh migrations/genesis on an empty DB).
  • CREATE DATABASE is issued against the postgres system DB (never a template) and serialized behind a mutex, so it cannot hit template1 is being accessed by other users even if a future caller adds t.Parallel().
  • The returned cleanup drops the per-test DB with WITH (FORCE) (terminates the store's lingering backends; Postgres 16). The shared container handle is not retained — its lifecycle is bound to the test process / CI runner teardown (Ryuk is disabled in CI; locally, if enabled, the testcontainers reaper cleans it up at process exit).
  • Skip-on-no-Docker behaviour is preserved (init error is returned to every caller, so callers still t.Skipf).

Blast radius

Audited all callers repo-wide: 8 caller files, 0 use t.Parallel(), none assume container identity/port/lifecycle — all immediately parse the connStr and build a store on a fresh DB. Fully compatible with shared-container + per-test-DB.

Known tradeoff — sync.Once poisoning

Previously each call ran its own 3-attempt retry, so a transient first-start failure (slow image pull, port contention) took out only that one test. Now the first caller's startSharedContainer() runs under sync.Once; if it fails, containerErr is permanent and every subsequent test in that binary fails rather than just one. The 3 internal start attempts mitigate this (the failure has to persist across all three), but the blast radius of a genuinely stuck first start is wider. Accepted as the cost of not paying a fresh ~30s container start per test; ephemeral CI runners make a poisoned binary a re-run, not a stuck state.

Impact

stores/blockchain/sql non-fork Postgres subtests: ~100s → ~6s for 8 iterations (one container instead of eight). Removes the -race timeout/SIGTERM class on the unit-test job.

Test plan

go test -race -tags testtxmetacache -count=1 ./stores/blockchain/sql/... (primary shared-container caller), run locally on this branch:

…e per call

SetupTestPostgresContainer started a new postgres:16-alpine container on
every call, so packages with many Postgres tests spun up dozens of
containers per run (each a ~30s readiness wait). Under -race the cumulative
startups blew the package timeout and container-creation contention caused
transient connection errors — a major unit-test flakiness/slowness source.

Start one container per test binary lazily (sync.Once) and hand each caller
its own database (testdb_<atomic counter>) created on the postgres system
DB and serialized behind a mutex (template-lock-safe even if a future
caller adds t.Parallel). Cleanup drops the per-test DB WITH (FORCE); the
shared container is reaped by Ryuk at process exit. Public signature
unchanged; skip-on-no-Docker preserved. Audited: 8 callers, none use
t.Parallel or assume container identity.
@github-actions

github-actions Bot commented Jun 6, 2026

Copy link
Copy Markdown
Contributor

🤖 Claude Code Review

Status: Complete

All previously reported issues have been addressed. The implementation is solid with proper error handling, concurrency controls, and resource management.

Key strengths:

  • Correctly serializes CREATE DATABASE operations to prevent template1 conflicts
  • Properly wraps errors for debugging flaky container startups
  • Cleans up container on startup failures to prevent resource leaks
  • Well-documented concurrency patterns and tradeoffs

History:
✅ Fixed: Error wrapping in validateDatabaseConnection
✅ Fixed: Misleading comment about CREATE DATABASE mechanism
✅ Fixed: Container leak on ConnectionString failure

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR optimizes Postgres-backed Go tests by starting a single postgres:16-alpine testcontainers instance per test binary (lazy sync.Once) and creating an isolated per-call database on that shared container, reducing CI timeouts/flakiness from repeated container startups.

Changes:

  • Introduces a shared container lifecycle (initialized once) and per-call unique DB creation via an atomic counter.
  • Adds admin helpers to swap DB in the connection string, execute short-lived admin SQL, and ensure sslmode=disable.
  • Updates readiness/validation logic and retry/backoff behavior for more reliable container startup.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread test/utils/postgres/container.go Outdated
Comment thread test/utils/postgres/container.go Outdated
Comment thread test/utils/postgres/container.go
- forbidigo: replace fmt.Errorf with errors.NewProcessingError (5 sites)
- validateDatabaseConnection: wrap lastErr as the cause instead of fmt.Sprintf
  into the message, preserving the error chain (Copilot)
- correct the CREATE DATABASE comment: connecting the admin session to the
  postgres system DB matters because that session isn't counted as an active
  template1 user; CREATE DATABASE still copies template1 by default (Copilot)
- terminate the container if ConnectionString fails after creation, so a
  failed startup can't leak a stray container (Copilot)
- document that the shared container is intentionally not retained and is
  reaped by the testcontainers Ryuk reaper at process exit
@oskarszoon

Copy link
Copy Markdown
Contributor Author

Addressed review feedback in 7d2b802:

golangci-lint (forbidigo): replaced all fmt.Errorf with errors.NewProcessingError (6 sites). Verified locally — golangci-lint run ./test/utils/postgres/ → 0 issues.

Copilot (3 inline, resolved):

  • validateDatabaseConnection now wraps lastErr as the cause instead of fmt.Sprintf-ing it into the message — error chain preserved.
  • Corrected the CREATE DATABASE comment (template1 is still the copy source; the postgres-system-DB admin session matters only because it isn't counted as an active template1 user).
  • startSharedContainer now terminates the container if ConnectionString fails after creation; documented that the shared container is otherwise reaped by the Ryuk reaper at process exit.

Claude Code Review: finding 1 (error formatting) and finding 2 (Ryuk cleanup documentation) are both covered by the above.

On the test job: it is failing via SIGTERM (exit 143) at ~3m28s — the full -race unit suite being evicted early by the self-hosted runner under resource pressure, not a fault in this change (same signature on unrelated branches). This PR cuts per-package container count dozens→1, which helps, but fully greening the test job needs a CI-infra change (shard the job, raise runner memory, or run -race nightly rather than on the full suite per PR).

@github-actions

github-actions Bot commented Jun 6, 2026

Copy link
Copy Markdown
Contributor

Benchmark Comparison Report

Baseline: main (unknown)

Current: PR-1045 (c7b46fa)

Summary

  • Regressions: 0
  • Improvements: 0
  • Unchanged: 132
  • Significance level: p < 0.05
All benchmark results (sec/op)
Benchmark Baseline Current Change p-value
_NewBlockFromBytes-4 1.822µ 1.822µ ~ 1.000
SplitSyncedParentMap_SetIfNotExists/256_buckets-4 61.79n 62.19n ~ 0.400
SplitSyncedParentMap_SetIfNotExists/16_buckets-4 62.07n 62.36n ~ 0.400
SplitSyncedParentMap_SetIfNotExists/1_bucket-4 61.82n 62.54n ~ 0.100
SplitSyncedParentMap_ConcurrentSetIfNotExists/256_buckets... 31.79n 30.39n ~ 0.400
SplitSyncedParentMap_ConcurrentSetIfNotExists/16_buckets_... 52.00n 55.65n ~ 0.700
SplitSyncedParentMap_ConcurrentSetIfNotExists/1_bucket_pa... 112.2n 114.1n ~ 0.100
MiningCandidate_Stringify_Short-4 276.3n 278.9n ~ 0.200
MiningCandidate_Stringify_Long-4 1.993µ 2.032µ ~ 0.100
MiningSolution_Stringify-4 1.033µ 1.039µ ~ 0.300
BlockInfo_MarshalJSON-4 1.857µ 1.877µ ~ 0.100
NewFromBytes-4 152.3n 139.6n ~ 1.000
AddTxBatchColumnar_Validation-4 2.317µ 2.423µ ~ 0.100
OffsetValidationLoop-4 718.4n 545.0n ~ 0.100
Mine_EasyDifficulty-4 67.41µ 67.00µ ~ 0.100
Mine_WithAddress-4 7.281µ 7.482µ ~ 0.700
DirectSubtreeAdd/4_per_subtree-4 59.84n 57.98n ~ 1.000
DirectSubtreeAdd/64_per_subtree-4 29.84n 29.28n ~ 0.100
DirectSubtreeAdd/256_per_subtree-4 28.87n 28.23n ~ 0.100
DirectSubtreeAdd/1024_per_subtree-4 27.89n 27.23n ~ 0.100
DirectSubtreeAdd/2048_per_subtree-4 27.50n 26.85n ~ 0.100
SubtreeProcessorAdd/4_per_subtree-4 301.2n 303.1n ~ 1.000
SubtreeProcessorAdd/64_per_subtree-4 295.5n 293.1n ~ 1.000
SubtreeProcessorAdd/256_per_subtree-4 291.9n 283.1n ~ 0.300
SubtreeProcessorAdd/1024_per_subtree-4 296.2n 302.3n ~ 0.700
SubtreeProcessorAdd/2048_per_subtree-4 294.1n 293.7n ~ 1.000
SubtreeProcessorRotate/4_per_subtree-4 293.9n 304.1n ~ 0.400
SubtreeProcessorRotate/64_per_subtree-4 289.8n 302.7n ~ 0.100
SubtreeProcessorRotate/256_per_subtree-4 299.5n 292.6n ~ 0.700
SubtreeProcessorRotate/1024_per_subtree-4 288.7n 274.2n ~ 0.100
SubtreeNodeAddOnly/4_per_subtree-4 54.26n 54.20n ~ 0.800
SubtreeNodeAddOnly/64_per_subtree-4 34.10n 34.05n ~ 0.400
SubtreeNodeAddOnly/256_per_subtree-4 33.23n 33.19n ~ 0.700
SubtreeNodeAddOnly/1024_per_subtree-4 32.58n 32.51n ~ 0.400
SubtreeCreationOnly/4_per_subtree-4 101.9n 118.7n ~ 0.400
SubtreeCreationOnly/64_per_subtree-4 439.4n 474.8n ~ 0.700
SubtreeCreationOnly/256_per_subtree-4 1.404µ 1.364µ ~ 0.100
SubtreeCreationOnly/1024_per_subtree-4 4.565µ 5.339µ ~ 0.100
SubtreeCreationOnly/2048_per_subtree-4 8.311µ 9.110µ ~ 0.100
SubtreeProcessorOverheadBreakdown/64_per_subtree-4 296.7n 291.5n ~ 0.700
SubtreeProcessorOverheadBreakdown/1024_per_subtree-4 292.5n 289.8n ~ 0.200
ParallelGetAndSetIfNotExists/1k_nodes-4 11.15m 13.81m ~ 0.100
ParallelGetAndSetIfNotExists/10k_nodes-4 14.21m 16.48m ~ 0.100
ParallelGetAndSetIfNotExists/50k_nodes-4 17.02m 21.83m ~ 0.100
ParallelGetAndSetIfNotExists/100k_nodes-4 19.61m 17.82m ~ 0.400
SequentialGetAndSetIfNotExists/1k_nodes-4 10.72m 13.34m ~ 0.100
SequentialGetAndSetIfNotExists/10k_nodes-4 14.76m 18.58m ~ 0.100
SequentialGetAndSetIfNotExists/50k_nodes-4 23.77m 23.33m ~ 0.400
SequentialGetAndSetIfNotExists/100k_nodes-4 32.06m 32.20m ~ 1.000
ProcessOwnBlockSubtreeNodesParallel/1k_nodes-4 11.90m 13.01m ~ 0.100
ProcessOwnBlockSubtreeNodesParallel/10k_nodes-4 14.20m 14.25m ~ 0.700
ProcessOwnBlockSubtreeNodesParallel/100k_nodes-4 19.45m 19.47m ~ 0.700
ProcessOwnBlockSubtreeNodesSequential/1k_nodes-4 12.49m 15.23m ~ 0.200
ProcessOwnBlockSubtreeNodesSequential/10k_nodes-4 15.81m 16.02m ~ 0.100
ProcessOwnBlockSubtreeNodesSequential/100k_nodes-4 63.32m 62.85m ~ 0.700
DiskTxMap_SetIfNotExists-4 3.438µ 3.460µ ~ 1.000
DiskTxMap_SetIfNotExists_Parallel-4 3.438µ 3.226µ ~ 0.400
DiskTxMap_ExistenceOnly-4 307.6n 319.0n ~ 0.100
Queue-4 188.5n 186.2n ~ 0.700
AtomicPointer-4 5.010n 4.835n ~ 1.000
TxMapSetIfNotExists-4 56.13n 52.75n ~ 0.700
TxMapSetIfNotExistsDuplicate-4 43.33n 43.04n ~ 0.400
ChannelSendReceive-4 569.2n 567.9n ~ 0.200
BlockAssembler_AddTx-4 0.02702n 0.02658n ~ 1.000
AddNode-4 10.96 10.93 ~ 1.000
AddNodeWithMap-4 11.47 11.64 ~ 0.400
CalcBlockWork-4 513.6n 517.7n ~ 0.700
CalculateWork-4 701.5n 711.1n ~ 0.700
CheckOldBlockIDs/on-chain-prefetch/1000-4 56.82µ 62.30µ ~ 0.100
CheckOldBlockIDs/off-chain-prefetch/1000-4 47.17µ 46.27µ ~ 0.100
CheckOldBlockIDs/on-chain-prefetch/10000-4 424.6µ 426.6µ ~ 0.400
CheckOldBlockIDs/off-chain-prefetch/10000-4 346.5µ 341.4µ ~ 0.100
BuildBlockLocatorString_Helpers/Size_10-4 1.384µ 1.378µ ~ 0.400
BuildBlockLocatorString_Helpers/Size_100-4 13.32µ 13.14µ ~ 0.100
BuildBlockLocatorString_Helpers/Size_1000-4 132.1µ 130.1µ ~ 0.100
CatchupWithHeaderCache-4 104.5m 104.6m ~ 0.700
_BufferPoolAllocation/16KB-4 2.800µ 3.762µ ~ 0.100
_BufferPoolAllocation/32KB-4 7.980µ 5.794µ ~ 0.100
_BufferPoolAllocation/64KB-4 13.29µ 11.74µ ~ 0.100
_BufferPoolAllocation/128KB-4 23.91µ 21.58µ ~ 0.100
_BufferPoolAllocation/512KB-4 97.78µ 85.72µ ~ 0.100
_BufferPoolConcurrent/32KB-4 14.44µ 14.53µ ~ 1.000
_BufferPoolConcurrent/64KB-4 22.85µ 22.42µ ~ 0.400
_BufferPoolConcurrent/512KB-4 109.2µ 111.9µ ~ 0.700
_SubtreeDeserializationWithBufferSizes/16KB-4 493.6µ 513.0µ ~ 0.400
_SubtreeDeserializationWithBufferSizes/32KB-4 490.5µ 513.9µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/64KB-4 490.4µ 519.4µ ~ 0.100
_SubtreeDeserializationWithBufferSizes/128KB-4 492.3µ 519.2µ ~ 0.700
_SubtreeDeserializationWithBufferSizes/512KB-4 496.1µ 468.4µ ~ 0.100
_SubtreeDataDeserializationWithBufferSizes/16KB-4 28.52m 27.97m ~ 0.100
_SubtreeDataDeserializationWithBufferSizes/32KB-4 28.74m 27.99m ~ 0.100
_SubtreeDataDeserializationWithBufferSizes/64KB-4 28.94m 27.90m ~ 0.100
_SubtreeDataDeserializationWithBufferSizes/128KB-4 28.80m 28.16m ~ 0.400
_SubtreeDataDeserializationWithBufferSizes/512KB-4 28.22m 28.04m ~ 0.100
_PooledVsNonPooled/Pooled-4 646.1n 642.5n ~ 0.100
_PooledVsNonPooled/NonPooled-4 5.766µ 5.504µ ~ 0.400
_MemoryFootprint/Current_512KB_32concurrent-4 4.977µ 4.896µ ~ 0.400
_MemoryFootprint/Proposed_32KB_32concurrent-4 7.281µ 7.325µ ~ 0.700
_MemoryFootprint/Alternative_64KB_32concurrent-4 6.944µ 6.983µ ~ 1.000
SubtreeSizes/10k_tx_4_per_subtree-4 1.324m 1.289m ~ 0.700
SubtreeSizes/10k_tx_16_per_subtree-4 306.9µ 309.3µ ~ 1.000
SubtreeSizes/10k_tx_64_per_subtree-4 73.57µ 72.34µ ~ 0.100
SubtreeSizes/10k_tx_256_per_subtree-4 17.87µ 18.20µ ~ 0.400
SubtreeSizes/10k_tx_512_per_subtree-4 8.977µ 8.936µ ~ 0.700
SubtreeSizes/10k_tx_1024_per_subtree-4 4.473µ 4.452µ ~ 0.500
SubtreeSizes/10k_tx_2k_per_subtree-4 2.214µ 2.161µ ~ 0.400
BlockSizeScaling/10k_tx_64_per_subtree-4 71.10µ 69.16µ ~ 0.100
BlockSizeScaling/10k_tx_256_per_subtree-4 17.85µ 17.40µ ~ 0.400
BlockSizeScaling/10k_tx_1024_per_subtree-4 4.424µ 4.341µ ~ 0.400
BlockSizeScaling/50k_tx_64_per_subtree-4 375.4µ 369.9µ ~ 0.700
BlockSizeScaling/50k_tx_256_per_subtree-4 88.28µ 88.38µ ~ 0.700
BlockSizeScaling/50k_tx_1024_per_subtree-4 21.74µ 21.44µ ~ 0.200
SubtreeAllocations/small_subtrees_exists_check-4 149.4µ 149.7µ ~ 0.700
SubtreeAllocations/small_subtrees_data_fetch-4 160.1µ 158.5µ ~ 0.200
SubtreeAllocations/small_subtrees_full_validation-4 310.8µ 307.2µ ~ 0.100
SubtreeAllocations/medium_subtrees_exists_check-4 8.985µ 8.995µ ~ 1.000
SubtreeAllocations/medium_subtrees_data_fetch-4 9.384µ 9.299µ ~ 0.200
SubtreeAllocations/medium_subtrees_full_validation-4 17.56µ 17.47µ ~ 0.700
SubtreeAllocations/large_subtrees_exists_check-4 2.106µ 2.100µ ~ 0.700
SubtreeAllocations/large_subtrees_data_fetch-4 2.234µ 2.257µ ~ 0.400
SubtreeAllocations/large_subtrees_full_validation-4 4.394µ 4.365µ ~ 0.100
_prepareTxsPerLevel-4 380.1m 378.1m ~ 0.700
_prepareTxsPerLevelOrdered-4 4.266m 4.590m ~ 0.400
_prepareTxsPerLevel_Comparison/Original-4 391.1m 383.1m ~ 0.100
_prepareTxsPerLevel_Comparison/Optimized-4 4.485m 3.932m ~ 0.100
StoreBlock_Sequential/BelowCSVHeight-4 259.8µ 259.2µ ~ 0.400
StoreBlock_Sequential/AboveCSVHeight-4 262.0µ 268.1µ ~ 0.400
GetUtxoHashes-4 170.9n 172.9n ~ 0.200
GetUtxoHashes_ManyOutputs-4 30.74µ 28.89µ ~ 0.400
_NewMetaDataFromBytes-4 116.3n 115.5n ~ 1.000
_Bytes-4 220.3n 222.9n ~ 0.700
_MetaBytes-4 72.55n 72.66n ~ 0.700

Threshold: >10% with p < 0.05 | Generated: 2026-06-08 14:27 UTC

@ordishs ordishs left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approve with comments. The design is correct and the isolation guarantees hold — one container per test binary via sync.Once plus a per-call testdb_N, public signature unchanged, and I verified all callers immediately parse the connStr/build a store with 0 use of t.Parallel().

A few non-blocking items to address before merge:

1. sync.Once poisons the binary on first-call failure. Previously each call ran its own 3-attempt retry; now if startSharedContainer() fails on the first caller (transient port contention, slow image pull), containerErr is permanent and every subsequent test in that binary fails. The 3 internal attempts mitigate this, but a transient first-start failure now takes out a whole package's tests instead of one. Worth a note in the PR body.

2. Error-handling inconsistency. New code mostly uses stdlib fmt.Errorf(... %w ...), while the file also uses errors.NewProcessingError (the repo's github.com/bsv-blockchain/teranode/errors). Mixing both in one file is inconsistent — prefer the custom errors package for the new wrap sites to match project convention.

3. Misleading comments (minor):

  • The CREATE DATABASE comment says connecting to postgres avoids template access "since postgres system DB is never used as a template source" — but CREATE DATABASE still copies template1 by default regardless of connection DB. The code is correct; the real protections are (a) nobody connects to template1, (b) creates are serialized by createMu. The reasoning just misstates why.
  • The var block comment says "sharedContainer holds the single shared PostgreSQL container" but there is no container variable — only sharedConnStr string. The container is reaped by Ryuk, never explicitly terminated. Fix the comment to match.

4. Verification. The test-plan checkboxes are unchecked, and this PR's entire justification is empirical (~100s→6s, removes the -race SIGTERM class). Please confirm the listed go test -race commands were actually run and paste timings before merge.

Address review (Simon #3b): the var-block comment described a 'sharedContainer'
that does not exist — the retained state is sharedConnStr (a connStr); the
container handle is not kept (lifecycle bound to the test process / CI runner
teardown, Ryuk disabled). Also notes the sync.Once first-start poisoning tradeoff.
@oskarszoon

Copy link
Copy Markdown
Contributor Author

@ordishs all four addressed:

  1. sync.Once poisoning — documented in the PR body ("Known tradeoff") and in the var-block comment: a first-start failure poisons the binary, mitigated by the 3 internal start attempts; ephemeral runners make it a re-run, not a stuck state.
  2. error-pkg consistency — already all errors.NewProcessingError, no fmt.Errorf left (landed in 7d2b802, just before your approve — the diff you saw was slightly stale).
  3. comments — (a) the template1 reasoning was already corrected; (b) the var-block comment no longer claims a "sharedContainer" — it now describes sharedConnStr accurately (handle not retained, lifecycle bound to the test process / CI runner teardown, Ryuk disabled in CI). c90861c.
  4. verificationgo test -race -tags testtxmetacache ./stores/blockchain/sql/...: ~10 Postgres subtests now share one container, completes in ~69s, no template1 is being accessed / stale-DB errors. The only failure was fork_scenario, which was the on_main_chain startup race — i.e. fix(blockchain/sql): snapshot-isolate on_main_chain rebuild to prevent startup race #1044, now merged. Merged main into this branch (1d5ad1b); fork_scenario passes 3/3 locally under -race.

@oskarszoon oskarszoon requested a review from sugh01 June 8, 2026 14:16
@sonarqubecloud

sonarqubecloud Bot commented Jun 8, 2026

Copy link
Copy Markdown

@oskarszoon oskarszoon merged commit 095a62f into bsv-blockchain:main Jun 8, 2026
34 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants