feat(uaerospike): expose connection-semaphore multiplier as a setting#933
Conversation
The util/uaerospike.Client wraps the aerospike-client-go client with an
in-process semaphore sized at the policy's ConnectionQueueSize (or
DefaultConnectionQueueSize = 128 when no policy supplies one). Every
Put / PutBins / Delete / Get / Operate / BatchOperate call must acquire
a slot before proceeding; release on return.
Under heavy parallel workloads — especially batch-decorate paths that
fan out hundreds-to-thousands of concurrent goroutines into Aerospike —
the parked-goroutine memory tax dominates: each waiter pins its
NewKey, NewBatchRead, addAbstractedBins, and FieldNamesToStrings
allocations until a permit frees. On scaling-2 a single subtree-validator
pod parked ~21 K goroutines on acquirePermit, holding ~14 GB of closure
state, and OOM-killed at the 180 GiB memory limit.
This change adds an option-pattern knob so the in-process throttle can
be tuned (or removed) without rebuilding the binary:
aerospike_semaphore_multiplier (float64, default 1.0)
0 disables the semaphore entirely. All acquirePermit /
releasePermit calls become no-ops; only the underlying
aerospike client's own connection pool governs concurrency.
Use when the caller already bounds concurrency upstream
(e.g. via a batcher) and the parking overhead is undesirable.
1.0 preserves prior behavior (one semaphore slot per
ConnectionQueueSize-derived permit).
<1 over-restricts — useful when sharing the aerospike server
with other clients.
>1 raises the cap — only safe when the deployment has been
verified to handle more concurrent operations than the
default queue size implies.
Negative values are treated as 0 (semaphore disabled).
Implementation:
* util/uaerospike.WithSemaphoreMultiplier(float64) — a ClientOption
applied via the new variadic-options NewClient(host, port, opts...)
and the slice-args NewClientWithPolicyAndHostOpts(policy, hosts, opts).
The existing NewClientWithPolicyAndHost(policy, hosts...) signature
is preserved verbatim — it now forwards to the opts-aware variant
with a nil options slice.
* buildConnSemaphore(queueSize, multiplier) computes
max(1, round(queueSize * multiplier)) or returns nil when the
multiplier <= 0. acquirePermit / releasePermit no-op when the
underlying channel is nil.
* settings/aerospike_settings.go adds the float64 SemaphoreMultiplier
field with longdesc covering all four value bands.
* util/aerospike.go threads the setting through at the production
NewClientWithPolicyAndHostOpts call site.
Unit tests in util/uaerospike/client_test.go cover:
* buildConnSemaphore capacity math across default / disabled /
fractional / huge / tiny-positive-clamps-to-1 multipliers.
* acquirePermit / releasePermit no-op behavior on a Client whose
semaphore was disabled (1000 sequential cycles).
* newClientConfig option application + nil-option tolerance.
|
🤖 Claude Code Review Status: Complete No critical issues found. The PR implements a well-designed semaphore multiplier feature with comprehensive test coverage and proper handling of edge cases. Copilot Comment Addressed:
|
There was a problem hiding this comment.
Pull request overview
Adds a configurable multiplier (aerospike_semaphore_multiplier) to scale or disable the in-process connection semaphore used by util/uaerospike.Client, enabling operators to reduce goroutine parking/memory pressure under highly parallel workloads without rebuilding.
Changes:
- Introduces
ClientOptionplumbing inutil/uaerospikeand implementsWithSemaphoreMultiplier,buildConnSemaphore, and nil-semaphore no-op behavior inacquirePermit/releasePermit. - Threads the new setting through the production Aerospike client construction path in
util/aerospike.go. - Adds the new setting to
settings/AerospikeSettingsand adds unit tests for the sizing/disable behavior.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| util/uaerospike/client.go | Adds option-driven semaphore sizing/disable behavior and nil-safe permit acquire/release. |
| util/uaerospike/client_test.go | Adds tests covering semaphore sizing, disable/no-op semantics, and option application. |
| util/aerospike.go | Applies the new setting by passing WithSemaphoreMultiplier(...) into client construction. |
| settings/aerospike_settings.go | Exposes aerospike_semaphore_multiplier as a documented float64 setting. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Benchmark Comparison ReportBaseline: Current: Summary
All benchmark results (sec/op)
Threshold: >10% with p < 0.05 | Generated: 2026-05-26 09:27 UTC |
oskarszoon
left a comment
There was a problem hiding this comment.
Request changes — default is broken.
Critical: SemaphoreMultiplier float64 has default:"1.0" in the struct tag, but the imperative settings loader at settings/settings.go:158-175 doesn't include a getFloat64("aerospike_semaphore_multiplier", 1.0, ...) entry. Struct tag is documentation-only. Empirically settings.NewSettings() returns SemaphoreMultiplier=0 — every prod deployment without an explicit override gets the semaphore disabled. Opposite of the documented "default preserves prior behavior".
Cascading regression: Client.GetConnectionQueueSize() returns cap(nil chan) = 0. The pruner at stores/utxo/aerospike/pruner/pruner_service.go:399 computes recommendedMax = 0 × threshold = 0 and auto-adjusts chunkGroupLimit to 1 — pruner concurrency collapses everywhere.
One-line fix in the loader block:
SemaphoreMultiplier: getFloat64("aerospike_semaphore_multiplier", 1.0, alternativeContext...),Plus a regression test that asserts the runtime default via settings.NewSettings() (current tests bypass the broken seam by calling newClientConfig / buildConnSemaphore directly).
Worth fixing in the same PR:
GetConnectionQueueSizereturning 0 when semaphore is disabled is conceptually wrong — should return the underlying aerospike client's queue size so the pruner heuristic keeps working when an operator deliberately opts out.- No upper clamp on
buildConnSemaphore.1.0e10typo wouldmake(chan, 1.28e12)≈ 10 GB. - PR #927 (credential redaction in the same file) merges cleanly, no conflict.
… size, clamp NaN/Inf Three correctness fixes from review of the original PR: 1. settings/settings.go was missing the imperative loader entry for aerospike_semaphore_multiplier, so the struct-tag default of "1.0" never took effect — settings.NewSettings() returned 0.0 and every deployment without an explicit override silently ran with the in-process semaphore disabled. Add the getFloat64 call so the documented 1.0 default actually applies. 2. Client.GetConnectionQueueSize returned cap(nil) = 0 when the semaphore was deliberately disabled, which made the pruner's recommendedMax = 0 × threshold and collapsed chunkGroupLimit to 1. Store the resolved underlying connection-queue size on Client and report it when the semaphore is nil so external pool-capacity heuristics survive an opt-out. 3. buildConnSemaphore did not guard NaN/+Inf or runaway scaled values before make(chan, ...). NaN now disables the semaphore (treated as garbage input); +Inf and any scaled value above maxSemaphoreCapacity (1<<20) are clamped so a typo like 1.0e10 cannot allocate a multi-GB channel. Adds a settings.NewSettings() regression test that asserts the runtime default is 1.0 — the existing tests bypass the broken seam by calling newClientConfig / buildConnSemaphore directly. Also extends TestBuildConnSemaphore with NaN/+Inf/clamp cases and adds an explicit fallback test for GetConnectionQueueSize.
|
@oskarszoon thanks for the catch — pushed 1. Broken default (the critical one) 2. 3. No upper clamp / NaN-Inf guard in
Tests added
Verification run locally
|
oskarszoon
left a comment
There was a problem hiding this comment.
Approve — all three prior concerns addressed in 1665f77e.
- Loader default wired at
settings/settings.go:173. Empirical default now1.0(was0). NewTestAerospikeSemaphoreMultiplier_DefaultcallsNewSettings()directly so the runtime-default seam is now protected by a regression test. GetConnectionQueueSizefalls back toconnQueueSize(captured at construction from underlying client policy) when the semaphore is nil. Right architectural shape — pruner heuristic keeps a meaningful pool-capacity hint when an operator opts out.- NaN treated as garbage-input (disabled), positive Inf and runaway scaled values clamped to
maxSemaphoreCapacity = 1<<20. Switch catches the overflow cases beforeint(math.Round(...)). New tests cover NaN, +Inf, 1.0e10, exact-at-max.
No new issues. 191 tests pass with -race (+8 from new tests).
|
Closing briefly to force CI re-trigger on the netsync fixture flake fix (push at 13:09 didn't fire a PR-tests workflow run for some reason). Reopening immediately. |
buildOOBFixture and buildInRangeFixture created their SyncedMap with limit=2 and then inserted exactly 2 entries. The _NilParentTx tests subsequently Set on an existing key; once len == limit, go-tx-map's setUnlocked evicts a random item from the map before writing — which ~50% of the time drops the child the SUT is about to look up, leaving the test failing with "tx <hash> not found in txMap". Reproduced locally with -count=20 under -race. Use an unbounded SyncedMap in the fixtures (the production path sizes by block tx count, not 2) and leave a comment explaining the eviction trap so future fixtures don't fall into the same hole. Duplicates the fix in PR #931 so this PR's CI is unblocked independently; will fold into theirs if #931 lands first. (Amended to force a fresh CI trigger — the prior push at 13:09 didn't fire a pull_request event.)
1064267 to
1ddd8f5
Compare
…ader # Conflicts: # services/legacy/netsync/handle_block_test.go
|



Summary
Adds a knob to scale (or disable) the in-process connection-semaphore inside
util/uaerospike.Client. The semaphore exists to cap concurrent Aerospike operations so over-parallel callers don't oversubscribe the underlyingaerospike-client-goconnection pool — but under heavy parallel workloads it can flip from "helpful throttle" to "memory tax": each parked waiter pins its per-call closure state (NewKey,NewBatchRead,addAbstractedBins,FieldNamesToStringsclones) until a permit frees.On scaling-2 the subtree-validator under cold-cache load was parking ~21 K goroutines on
acquirePermit, holding ~14 GB of closure state, and OOM-killing at the 180 GiB pod limit. This setting lets that throttle be turned down — or off — without rebuilding the binary.Change
New setting:
0acquirePermit/releasePermitcalls become no-ops; only the underlying aerospike client's own connection pool governs concurrency. Use when the caller already bounds concurrency upstream.1.0ConnectionQueueSize-derived permit).<1>1Negative values are treated as
0(semaphore disabled).Implementation
util/uaerospike.WithSemaphoreMultiplier(float64)— aClientOptionapplied via:NewClient(host, port, opts ...ClientOption)— variadic, backwards-compatible.NewClientWithPolicyAndHostOpts(policy, hosts, opts)— slice-args variant (the two variadic slice arguments couldn't coexist on the existing function).NewClientWithPolicyAndHost(policy, hosts...)keeps its existing signature verbatim and now forwards to the opts-aware variant withniloptions.buildConnSemaphore(queueSize, multiplier)— returnsmake(chan struct{}, max(1, round(queueSize * multiplier)))ornilwhen multiplier<= 0.acquirePermit/releasePermitno-op when the underlying channel isnil.settings/aerospike_settings.goadds the float64SemaphoreMultiplierfield with longdesc covering all four value bands.util/aerospike.gothreads the setting through at the productionNewClientWithPolicyAndHostOptscall site.Tests
util/uaerospike/client_test.goadds:TestBuildConnSemaphore— covers default1.0(passthrough),0/-1(disabled), fractional with round-half-up, double, tiny positive (clamps to 1), and large (8×) multipliers.TestWithSemaphoreMultiplier_DisableMakesAcquireNoOp— proves 1000 sequential acquire/release cycles complete without blocking when the semaphore is disabled.TestWithSemaphoreMultiplier_Scaling— confirmsnewClientConfigapplies the option and toleratesnilentries in the option slice.All existing tests in
util/uaerospikeandsettingscontinue to pass under-race.Test plan
go build ./...cleango vet ./util/uaerospike/... ./util/ ./settings/cleango test -race ./util/uaerospike/... ./settings/passaerospike_semaphore_multiplier=0and verify the goroutine count onacquirePermitdrops to zero and the cold-cache OOM cycle breaks.Risk
Low. Default value preserves prior behavior exactly. Callers wanting the new behavior opt in via the setting. The existing
NewClientWithPolicyAndHostsignature is preserved (it forwards to the opts-aware variant withniloptions), so no caller break.