Releases: oxia-db/oxia
v0.16.1
Oxia v0.16.1 Release Notes
Released: March 12, 2026
This patch release focuses on production stability and client resilience, and lays the groundwork for shard splitting. It includes fixes for several failure modes observed in distributed deployments (particularly around leadership transitions, follower recovery, and Kubernetes health checking), along with retry support for streaming client operations.
Highlights
Shard Split Foundations (#911)
This release introduces the protobuf definitions, data model types, and CLI tooling needed for shard splitting. A new oxia admin split-shard command and a SplitShard RPC in the coordination service establish the operational interface, while a SplitPhase enumeration (Init → Bootstrap → CatchUp → Cutover → Cleanup) models the lifecycle of a split operation. This is groundwork for the full shard splitting workflow coming in a future release.
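As a rough illustration of the lifecycle described above, the phases can be modeled as an ordered enumeration with a single forward transition. This is a minimal sketch: the Go identifiers (SplitPhaseInit, Next, and so on) and the numeric values are assumptions made for the example and are not taken from the generated protobuf code.

```go
package main

import "fmt"

// SplitPhase mirrors the lifecycle named in the release notes.
// The identifiers and numeric values are illustrative only.
type SplitPhase int

const (
	SplitPhaseInit SplitPhase = iota
	SplitPhaseBootstrap
	SplitPhaseCatchUp
	SplitPhaseCutover
	SplitPhaseCleanup
)

var splitPhaseNames = [...]string{"Init", "Bootstrap", "CatchUp", "Cutover", "Cleanup"}

// String returns the phase name, or a numeric placeholder for unknown values.
func (p SplitPhase) String() string {
	if p < SplitPhaseInit || p > SplitPhaseCleanup {
		return fmt.Sprintf("SplitPhase(%d)", int(p))
	}
	return splitPhaseNames[p]
}

// Next returns the following phase and true, or the same phase and false
// once the split has reached Cleanup.
func (p SplitPhase) Next() (SplitPhase, bool) {
	if p >= SplitPhaseCleanup {
		return p, false
	}
	return p + 1, true
}

func main() {
	// Walk the lifecycle from Init to Cleanup.
	for p, ok := SplitPhaseInit, true; ok; p, ok = p.Next() {
		fmt.Println(p)
	}
}
```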
Graceful Metadata Leadership Loss (#935)
Previously, the coordinator would panic on a metadata version conflict during Store() operations. The coordinator now returns a typed ErrMetadataBadVersion error instead of crashing, and WaitToBecomeLeader() exposes a channel that closes when leadership is lost (whether via configmap lease expiry or raft leader change). The gRPC server monitors this channel and cleanly restarts the coordinator, enabling proper failover without panic-induced crashes.
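The two ideas in this fix (a typed error instead of a panic, and a channel that closes on leadership loss) can be sketched as follows. Only the name ErrMetadataBadVersion comes from the notes. The metadataStore type, its Store signature, and the waitOutcome helper are hypothetical stand-ins, not the coordinator's actual API.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrMetadataBadVersion is named in the release notes; this definition is a stand-in.
var ErrMetadataBadVersion = errors.New("metadata bad version")

// metadataStore is a hypothetical versioned store with compare-and-set semantics.
type metadataStore struct{ version int64 }

// Store succeeds only if the caller saw the latest version.
// On a conflict it returns a typed error rather than panicking.
func (s *metadataStore) Store(expectedVersion int64) error {
	if expectedVersion != s.version {
		return ErrMetadataBadVersion
	}
	s.version++
	return nil
}

// waitOutcome blocks until leadership is lost (channel closed) or a stop
// signal arrives, and reports what the caller should do next.
func waitOutcome(leadershipLost, stop <-chan struct{}) string {
	select {
	case <-leadershipLost:
		return "restart"
	case <-stop:
		return "shutdown"
	}
}

func main() {
	s := &metadataStore{version: 3}
	if err := s.Store(2); errors.Is(err, ErrMetadataBadVersion) {
		fmt.Println("version conflict: stepping down instead of panicking")
	}

	lost := make(chan struct{})
	close(lost) // simulates lease expiry or a raft leader change
	fmt.Println(waitOutcome(lost, nil))
}
```

Closing a channel is a convenient broadcast in Go: every goroutine waiting on it is released at once, which suits a one-shot event such as losing leadership.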
Follower Recovery After Data Cleanup (#929)
When a follower node restarted with clean data (entering NOT_MEMBER status), the leader couldn't recover it—the error was indistinguishable from other status failures and the leader's cursor retained stale offsets. A new dedicated ErrNodeIsNotMember error code (112) lets the leader detect this state, reset the cursor to trigger snapshot transmission, and bring the follower back into the replication group automatically.
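The recovery path described above can be sketched as an error check on the leader side. ErrNodeIsNotMember and the code 112 are taken from the notes. The followerCursor fields, the noOffset sentinel, and handleReplicationError are assumptions made for this example, not Oxia's actual cursor implementation.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNodeIsNotMember corresponds to the dedicated error code (112) in the notes.
// The Go definition here is a stand-in.
var ErrNodeIsNotMember = errors.New("node is not a member (code 112)")

// noOffset is a hypothetical sentinel meaning "nothing acknowledged yet".
const noOffset int64 = -1

// followerCursor is a simplified view of the leader's per-follower state.
type followerCursor struct {
	ackOffset    int64 // last offset the follower acknowledged
	sendSnapshot bool  // whether the next step is a snapshot transfer
}

// handleReplicationError resets stale state when the follower reports that it
// is no longer a member, and reports whether the error was handled.
func (c *followerCursor) handleReplicationError(err error) bool {
	if errors.Is(err, ErrNodeIsNotMember) {
		c.ackOffset = noOffset // drop offsets from before the data cleanup
		c.sendSnapshot = true  // rebuild the follower from a snapshot
		return true
	}
	return false
}

func main() {
	c := &followerCursor{ackOffset: 42}
	handled := c.handleReplicationError(ErrNodeIsNotMember)
	fmt.Println(handled, c.ackOffset, c.sendSnapshot)
}
```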
Client Resilience
- Retry support for List operations (#937) — List calls previously had no retry logic and would fail immediately when hitting a non-leader node. They now retry with exponential backoff and extract leader hints, matching the pattern used by read/write batches.
- Retry support for RangeScan operations (#938) — Same retry-with-backoff treatment applied to range scans for consistent client behavior across all query types.
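The retry pattern described in these two items can be sketched as a loop that doubles its delay and follows leader hints. This is an illustration only: notLeaderError, retryWithBackoff, and the node names are hypothetical, and the real client's error types and backoff parameters may differ.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// notLeaderError is a hypothetical error carrying a leader hint.
type notLeaderError struct{ LeaderHint string }

func (e *notLeaderError) Error() string {
	return "node is not leader; hint: " + e.LeaderHint
}

// retryWithBackoff calls op against target, doubling the delay after each
// not-leader failure and switching to the hinted leader when one is given.
// It returns the target that was used last.
func retryWithBackoff(target string, maxAttempts int, initial time.Duration,
	op func(target string) error) (string, error) {
	backoff := initial
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = op(target); err == nil {
			return target, nil
		}
		var nl *notLeaderError
		if !errors.As(err, &nl) {
			return target, err // treated as non-retriable in this sketch
		}
		if nl.LeaderHint != "" {
			target = nl.LeaderHint
		}
		if attempt+1 < maxAttempts {
			time.Sleep(backoff)
			backoff *= 2
		}
	}
	return target, fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

func main() {
	// A fake List call that only succeeds on node-b.
	list := func(target string) error {
		if target != "node-b" {
			return &notLeaderError{LeaderHint: "node-b"}
		}
		return nil
	}
	target, err := retryWithBackoff("node-a", 3, 10*time.Millisecond, list)
	fmt.Println(target, err)
}
```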
Operational Improvements
- gRPC health server starts before leader election (#930) — In Kubernetes StatefulSet deployments, non-leader coordinator pods were stuck in crash loops because the health server only started after winning the leader election. Health probes now succeed immediately on all pods regardless of leadership status.
- GetInfo RPC called only on state transitions (#933) — The coordinator was calling GetInfo on every health check (~every 2 seconds), flooding logs and generating unnecessary network traffic. It now fires only when a server transitions from NotRunning to Running.
Bug Fixes
- Data race in followerCursor (#946) — Fixed a race between followerCursor.Close() calling stream.CloseSend() and the entries loop calling stream.Send() on the same gRPC stream. The fix removes the explicit CloseSend() since context cancellation already handles stream teardown.
- Test stability — Resolved flakiness in TestShardController_SwapNodeWithLeaderElectionFailure (#927), TestSyncClientImpl_InternalKeys (#939), and TestLeaderBalancedNodeAdded (#947).
Build & Dependencies
Contributors
@merlimat, @mattisonchao, @dependabot
Full changelog: v0.16.0...v0.16.1
v0.16.0
Oxia v0.16.0 Release Notes
Released: March 2, 2026
This is a significant feature release introducing database integrity checksums, feature negotiation for safe rolling upgrades, leader hints for faster client retries, and WAL-based control request replication. It also includes important crash-safety and data-race fixes across the storage and replication layers.
Highlights
Database Fingerprint Checksum (#877, #890, #891)
Oxia now computes a chained CRC32 checksum over all mutation operations, providing a fingerprint of the database state. This enables operators to verify data consistency across replicas by comparing checksum values. The checksum is exposed as a WAL gauge metric and is computed incrementally — each entry's CRC is derived from the previous entry's CRC plus the current payload, forming a tamper-evident chain.
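The chaining idea can be illustrated with Go's standard hash/crc32 package, where crc32.Update continues a checksum from a previous value. This is a sketch under assumptions: the Castagnoli polynomial, the payload contents, and the helper names nextCRC and chainCRC are chosen for the example and may not match what Oxia uses internally.

```go
package main

import (
	"fmt"
	"hash/crc32"
)

// The polynomial is an assumption for this sketch.
var table = crc32.MakeTable(crc32.Castagnoli)

// nextCRC derives an entry's CRC from the previous entry's CRC and its payload.
func nextCRC(prev uint32, payload []byte) uint32 {
	return crc32.Update(prev, table, payload)
}

// chainCRC folds a sequence of entries into one fingerprint, starting from seed.
func chainCRC(seed uint32, entries ...[]byte) uint32 {
	crc := seed
	for _, e := range entries {
		crc = nextCRC(crc, e)
	}
	return crc
}

func main() {
	ops := [][]byte{[]byte("put a=1"), []byte("put b=2"), []byte("delete a")}
	leader := chainCRC(0, ops...)
	follower := chainCRC(0, ops...)
	fmt.Printf("leader=%08x follower=%08x match=%v\n", leader, follower, leader == follower)

	// A replica that starts mid-stream can continue the chain from a seed.
	seed := chainCRC(0, ops[:2]...)
	fmt.Println(chainCRC(seed, ops[2]) == leader)
}
```

Because each value depends on everything before it, two replicas that applied the same operations in the same order end with the same fingerprint, and a replica that missed or reordered an entry is very likely to end with a different one.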
Negotiated Features for Safe Rolling Upgrades (#878)
Clusters running mixed versions during rolling upgrades can now safely negotiate a common feature set. On leader election, the coordinator collects each node's supported features via a new GetInfo RPC. The system computes the intersection — only capabilities supported by all quorum members become active. Nodes lacking the GetInfo endpoint (older versions) are treated as supporting no new features, ensuring backward compatibility without manual intervention.
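The negotiation rule (activate only what every quorum member supports) is a set intersection. The sketch below assumes a hypothetical negotiate function and made-up feature names. A node that lacks the GetInfo endpoint is represented by an empty feature list.

```go
package main

import (
	"fmt"
	"sort"
)

// negotiate returns the features supported by every node, sorted by name.
// The map key is a node name and the value is that node's advertised features.
func negotiate(supported map[string][]string) []string {
	counts := make(map[string]int)
	for _, features := range supported {
		seen := make(map[string]bool) // guard against duplicates from one node
		for _, f := range features {
			if !seen[f] {
				seen[f] = true
				counts[f]++
			}
		}
	}
	active := []string{}
	for f, n := range counts {
		if n == len(supported) {
			active = append(active, f)
		}
	}
	sort.Strings(active)
	return active
}

func main() {
	quorum := map[string][]string{
		"node-1": {"fingerprint", "leader-hint"},
		"node-2": {"fingerprint"},
		"node-3": {"fingerprint", "leader-hint"},
	}
	fmt.Println(negotiate(quorum))

	// An older node without GetInfo advertises nothing, so nothing is activated.
	quorum["node-2"] = nil
	fmt.Println(negotiate(quorum))
}
```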
Control Request Replication via WAL (#882)
Control-plane commands (such as feature enablement) are now durably replicated through the WAL state machine alongside regular data writes. Previously these commands existed only in-memory and could be lost on failure. A new statemachine package provides a unified Proposal interface where each proposal type (write or control) knows how to serialize and apply itself, ensuring all replicas converge to the same state.
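One way to picture a unified proposal interface is a single log that carries both kinds of entries, each knowing how to serialize and apply itself. The shapes below are assumptions for illustration. The actual statemachine package may define Proposal differently, and the text encoding and the state struct are invented for this sketch.

```go
package main

import "fmt"

// state is a toy replica state holding data and enabled features.
type state struct {
	data     map[string]string
	features map[string]bool
}

func newState() *state {
	return &state{data: map[string]string{}, features: map[string]bool{}}
}

// Proposal is a hypothetical version of the unified interface.
type Proposal interface {
	Serialize() []byte // bytes that would be written to the WAL
	Apply(s *state)    // effect on the replica state
}

type writeProposal struct{ key, value string }

func (p writeProposal) Serialize() []byte { return []byte("W|" + p.key + "|" + p.value) }
func (p writeProposal) Apply(s *state)    { s.data[p.key] = p.value }

type controlProposal struct{ feature string }

func (p controlProposal) Serialize() []byte { return []byte("C|" + p.feature) }
func (p controlProposal) Apply(s *state)    { s.features[p.feature] = true }

// replay applies a log in order; replicas replaying the same log converge.
func replay(log []Proposal) *state {
	s := newState()
	for _, p := range log {
		p.Apply(s)
	}
	return s
}

func main() {
	log := []Proposal{
		controlProposal{feature: "fingerprint"},
		writeProposal{key: "a", value: "1"},
	}
	for _, p := range log {
		fmt.Println(string(p.Serialize()))
	}
	s := replay(log)
	fmt.Println(s.data["a"], s.features["fingerprint"])
}
```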
Leader Hint Support (#883)
When a client sends a request to a non-leader node, the error response now includes a LeaderHint indicating which node currently holds leadership. The client extracts this hint and retries directly to the correct leader instead of falling back to potentially stale shard assignments. This significantly reduces retry latency during leadership transitions. Stale write stream connections are also invalidated on hint receipt.
Bug Fixes
- Dangling index files on crash (#920) — Swapped deletion order so the .idx file is removed before the .txn file. Previously, a crash between the two deletions could leave orphaned .idx files permanently, since segment discovery scans for .txn files.
- Panic on empty WAL index files (#912) — Added a length validation before reading the CRC header. A 0-byte .idx file previously triggered a slice-bounds panic; it now returns ErrDataCorrupted so the index is rebuilt from the transaction file.
- WAL CRC chain divergence after snapshot (#901) — When a follower installs a snapshot, the WAL is cleared, resetting the CRC chain to zero. A new previous_entry_crc field in the Append message now seeds the follower's CRC chain from the leader, maintaining checksum consistency across replicas.
- Infinite loop in balanceHighestNode (#895) — When swapShard returned an error with strict anti-affinity, a continue statement skipped the iterator advance, causing the balancer goroutine to retry the same shard forever. The fix always advances the iterator regardless of swap outcome.
- Type assertion panic in Cache.Get() (#881) — Negative cache entries (for non-existent keys) used a double-wrapped optional type that didn't match the type assertion on retrieval. Fixed by aligning the empty entry type with the non-empty entry structure.
- EOF not treated as retriable on write streams (#917) — When a server closes a bidirectional write stream before delivering a gRPC status, the client receives io.EOF, which wasn't classified as retriable. The client now retries on EOF, allowing leader hints to work correctly during leadership transitions.
- Data races (#918, #907, #903, #893) — Fixed multiple concurrent access issues across the codebase, including races in replication, caching, and health-check paths.
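The empty-index fix in the list above comes down to checking the buffer length before slicing out the header. The sketch below assumes a 4-byte big-endian CRC header followed by the index body, which is an invented layout for illustration. Only the name ErrDataCorrupted comes from the notes, while verifyIndex and encodeIndex are hypothetical helpers.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"hash/crc32"
)

// ErrDataCorrupted is named in the release notes; this definition is a stand-in.
var ErrDataCorrupted = errors.New("data corrupted")

// crcHeaderSize is an assumed header layout: 4 bytes of CRC before the body.
const crcHeaderSize = 4

// encodeIndex prepends the CRC of body, producing the assumed on-disk layout.
func encodeIndex(body []byte) []byte {
	buf := make([]byte, crcHeaderSize, crcHeaderSize+len(body))
	binary.BigEndian.PutUint32(buf, crc32.ChecksumIEEE(body))
	return append(buf, body...)
}

// verifyIndex validates the length before touching the header, so a 0-byte
// file yields ErrDataCorrupted instead of a slice-bounds panic.
func verifyIndex(buf []byte) ([]byte, error) {
	if len(buf) < crcHeaderSize {
		return nil, ErrDataCorrupted
	}
	want := binary.BigEndian.Uint32(buf[:crcHeaderSize])
	body := buf[crcHeaderSize:]
	if crc32.ChecksumIEEE(body) != want {
		return nil, ErrDataCorrupted
	}
	return body, nil
}

func main() {
	_, err := verifyIndex(nil) // simulates a 0-byte index file
	fmt.Println(errors.Is(err, ErrDataCorrupted))

	body, err := verifyIndex(encodeIndex([]byte{1, 2, 3}))
	fmt.Println(body, err)
}
```

Returning a sentinel error lets the caller decide to rebuild the index from the transaction file, which is the recovery path the fix describes.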
Improvements
- Follower state applier retry logic — The follower stateApplier goroutine now retries on transient errors instead of failing permanently, improving resilience during temporary storage issues.
- Follower controller lifecycle refactor — Cleaner startup/shutdown sequence for the follower controller, reducing the surface for resource leaks and race conditions.
- Dynamic log level changes (#914) — Runtime log level changes (via configuration) now propagate to existing loggers, not just newly created ones.
- Manual Docker build workflow (#919) — Added a GitHub Actions workflow for manual Docker image builds and pushes.
- Test stability (#915, #916) — Reduced flakiness in integration and replication tests.
Build & Dependencies
- OpenTelemetry SDK and other dependency updates via Dependabot
Contributors
@merlimat, @coderzc, @labuladong, @mattisonchao, @dependabot
Full changelog: v0.15.3...v0.16.0
v0.15.3
What's Changed
- feat: introduce the auto release by @mattisonchao in #866
- Fix release workflow by using Makefile target for consistent builds by @Copilot in #868
- Fix: Binary releases missing execute permissions by @Copilot in #870
- feat: add static key file support for OIDC authentication with per-issuer configuration by @Copilot in #874
- Revert "feat: add static key file support for OIDC authentication with per-issuer configuration (#874)" by @mattisonchao in #875
- Feat: add per-issuer OIDC configuration with static key file support by @mattisonchao in #876
- Allow to override version id and modification count by @merlimat in #872
Full Changelog: v0.15.2...v0.15.3
v0.15.2
⚠️ Critical Fix: This release fixes a data loss issue when upgrading from <= v0.14.4.
Versions v0.15.0 and v0.15.1 had a database format conversion bug that could cause data loss during upgrades. If you are on v0.14.4 or earlier, upgrade directly to this version or later. See #846 for details.
⚠️ Critical Fix: Data loss with node expansion (affects <= v0.14.4).
Versions v0.14.4 and earlier had a non-atomic shard moving deletion that could cause data loss when expanding nodes. See #845 for details.
What's Changed
🛠 Bug Fixes
- fix(kvstore): Prevent data loss from crash during DB format conversion by @mattisonchao in #838
- fixes: configurable WAL sync by @mattisonchao in #857
🚀 Features
- feat: Introduce dataserver and coordinator configuration (part.1) by @mattisonchao in #848
- feat: Introduce dataserver and coordinator configuration (part.2) by @mattisonchao in #859
- feat: Introduce dataserver and coordinator configuration (part.3) by @mattisonchao in #860
- feat: make the assignment information become debug level by @mattisonchao in #861
- fix the wrong default cluster config causes breaking by @mattisonchao in #863
- fixes the wrong cluster config parameter name by @mattisonchao in #864
Full Changelog: v0.15.1...v0.15.2
v0.15.1
⚠️ Known Issue: Data loss when upgrading from <= v0.14.4 to this version.
Upgrading from v0.14.4 or earlier to v0.15.0 or v0.15.1 can cause data loss due to a database format conversion issue. Upgrade directly to v0.15.2 or later instead. See #846 for details.
What's Changed
- Set the global Otel meter provider by @merlimat in #830
- Moved server common to oxiad/common by @merlimat in #829
- feat: move load balancer logs to debug level by @mattisonchao in #831
- fixes: wrong package import by @mattisonchao in #832
Full Changelog: v0.15.0...v0.15.1
v0.15.0
What's Changed
- Pin Pebble disk format to a specific version by @merlimat in #748
- Fixed panic in coordinator when not initialized by @merlimat in #751
- Let the coordinator wait to become leader, in case multiple instances come up by @merlimat in #752
- Updated to point to CNCF Code of Conduct by @merlimat in #754
- Switched Copyright to "The Oxia Authors" by @merlimat in #755
- Rename public gRPC to io.oxia.proto.v1.OxiaClient by @merlimat in #756
- Alpine 3.22 by @merlimat in #758
- Upgrade to Go 1.25 and follow newer lint rules by @merlimat in #759
- Use StandaloneServer.ServiceAddr() to reduce boilerplate code in tests by @merlimat in #761
- Use original context if available in Timer.Done() by @merlimat in #762
- Fixed secondary index get when notifications are disabled by @merlimat in #763
- Introduced key encoding for hierarchical sorting by @merlimat in #764
- Fix data race in Zerolog console writer by @merlimat in #765
- Fixed race condition in get sequences test by @merlimat in #766
- Upgrade to Pebble 2.1.0 by @merlimat in #760
- Cache Go deps in CI builds by @merlimat in #768
- Refactored Dockerfile to allow caching of Go deps by @merlimat in #769
- Enhance kv interface to work with different encoders by @merlimat in #767
- Fixed issue with db iterator closed after the callback by @merlimat in #770
- Cache docker builds for Trivy scanner by @merlimat in #771
- Fixed initialization of logger when not using json by @merlimat in #773
- Allow to configure db for different scenarios by @merlimat in #774
- Disable colored log in tests to avoid data race detection in zerolog by @merlimat in #775
- Raft based metadata provider by @merlimat in #757
- Fixed logic for checking metadata version in raft metadata provider by @merlimat in #777
- Coordinator: swapNode() should keep retrying leader election until it succeeds by @merlimat in #781
- Use wait group in more idiomatic way by @merlimat in #784
- Apply timeout to mocked server responses in coordinator test by @merlimat in #785
- Minor clean up of using util methods in shard controller mocked tests by @merlimat in #787
- Handle deletion of swapped out node in a background periodic task by @merlimat in #786
- Removed unnecessary warning in coordinator shutdown by @merlimat in #789
- fixes: flaky test - TestShardController_StartingWithLeaderAlreadyPresent by @mattisonchao in #790
- fixes: deadlock when swap waitForFollowersToCatchUp by @mattisonchao in #788
- refactor: introduce thread safe shard metadata by @mattisonchao in #791
- Delete charts as it removed to oxia-db/helm-charts by @dao-jun in #782
- refactor: move election leader logic into a separate struct by @mattisonchao in #793
- [feat][admin] Initial Oxia admin commit and add list-namespaces command. by @dao-jun in #792
- fix: using async write for wal benchmark by @mattisonchao in #795
- Fix coordinator NPE by @dao-jun in #797
- Add listnodes admin tool by @dao-jun in #800
- fix: data lost by change ensemble by @mattisonchao in #798
- Add modules for oxia common and client by @merlimat in #802
- Ensure internal keys always sort last for natural ordering by @merlimat in #803
- refactor: rename 'server/node' to 'dataserver' for better semantic clarity by @mattisonchao in #806
- Use range-scan API for CLI list operation by @merlimat in #807
- Fixed server -> dataserver import in client tests by @merlimat in #808
- Revert "Add modules for oxia common and client (#802)" by @merlimat in #809
- Hide internal keys by default in list and range-scan by @merlimat in #810
- refactor: Introduce go workspace by @mattisonchao in #813
- Fixed dependency issue with genproto and workspace by @merlimat in #817
- Allow to configure key-sorting natural/hierarchical on namespaces by @merlimat in #818
- refactor: reorganise the dataserver directory by @mattisonchao in #819
- refactor: rename some plural packages to singular by @mattisonchao in #820
- fixes: fix trivy upload error due to insufficient permission by @mattisonchao in #821
- feat: rename nodeController to dataserverController by @mattisonchao in #822
- Updated jwt, crypto and go-jose dependencies for CVEs by @merlimat in #823
- Removed unused go tools from CI by @merlimat in #825
- Moved proto under common/proto by @merlimat in #824
- fix: rename node to data server in shard controller by @mattisonchao in #826
- Added script to tag the releases by @merlimat in #827
New Contributors
Full Changelog: v0.14.4...v0.15.0
v0.14.4
⚠️ Known Issue: Data loss with node expansion.
This version has a non-atomic shard moving deletion that can cause data loss when expanding nodes. Fixed in v0.15.1. See #845 for details.
⚠️ Known Issue: Data loss when upgrading to v0.15.0 or v0.15.1.
Do not upgrade to v0.15.0 or v0.15.1. Upgrade directly to v0.15.2 or later to avoid data loss from a database format conversion bug. See #846 for details.
What's Changed
- fix(obj): fix the potential pooled object leak by @mattisonchao in #743
- fix(session): avoid complete rpc without response by @mattisonchao in #744
Full Changelog: v0.14.3...v0.14.4
v0.14.3
What's Changed
- build(deps): bump github.com/go-viper/mapstructure/v2 from 2.3.0 to 2.4.0 by @dependabot[bot] in #741
- fix(server): fix the ephemeral key leak by non-atomic operation by @mattisonchao in #742
- feat: introduce committed version IDs to avoid accidentally writing dirty version ID trackers by @mattisonchao in #740
Full Changelog: v0.14.2...v0.14.3
v0.14.2
What's Changed
- feat: support leader rebalance by @mattisonchao in #726
- fixes: fix GHSA-fv92-fjc5-jj9h by @mattisonchao in #735
- fixes: support closable health server to avoid being stuck by @mattisonchao in #734
- feat(coordinator): support node controller graceful close by @mattisonchao in #733
- Always validate shard id not to be null by @merlimat in #736
- fix(leader): fix the concurrent write causes disordered offset by @mattisonchao in #738
Full Changelog: v0.14.1...v0.14.2
v0.14.1
What's Changed
- Removed documentation since it's moved to different repository by @merlimat in #724
- Fixed more usages of old repository name by @merlimat in #725
- fix(tla+): fix wrong name epoch by @mattisonchao in #727
- fix: change doc link to official website by @mattisonchao in #728
- fixes: fixes data race condition by @mattisonchao in #732
- feat: support secondary index name for cmd read operations by @mattisonchao in #731
Full Changelog: v0.14.0...v0.14.1