@merlimat

Oxia v0.16.1 Release Notes

Released: March 12, 2026

This patch release focuses on production stability and client resilience, and lays the groundwork for shard splitting. It fixes several failure modes observed in distributed deployments, particularly around leadership transitions, follower recovery, and Kubernetes health checking, and adds retry support for streaming client operations.

Highlights

Shard Split Foundations (#911)

This release introduces the protobuf definitions, data model types, and CLI tooling needed for shard splitting. A new oxia admin split-shard command and a SplitShard RPC in the coordination service establish the operational interface, while a SplitPhase enumeration (Init → Bootstrap → CatchUp → Cutover → Cleanup) models the lifecycle of a split operation. This is groundwork for the full shard splitting workflow coming in a future release.
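
As a rough illustration, the phase lifecycle can be modeled as an ordered enumeration with a single forward transition. The Go type, constant, and method names below are hypothetical; the real definitions are the protobuf types added in #911 and may differ.

```go
package main

import "fmt"

// SplitPhase is a hypothetical Go mirror of the split lifecycle
// (Init → Bootstrap → CatchUp → Cutover → Cleanup).
type SplitPhase int

const (
	PhaseInit SplitPhase = iota
	PhaseBootstrap
	PhaseCatchUp
	PhaseCutover
	PhaseCleanup
)

var phaseNames = [...]string{"Init", "Bootstrap", "CatchUp", "Cutover", "Cleanup"}

// String returns the phase name, or "Unknown" for out-of-range values.
func (p SplitPhase) String() string {
	if p < PhaseInit || p > PhaseCleanup {
		return "Unknown"
	}
	return phaseNames[p]
}

// Next returns the following phase, and false once Cleanup has been reached.
func (p SplitPhase) Next() (SplitPhase, bool) {
	if p < PhaseInit || p >= PhaseCleanup {
		return p, false
	}
	return p + 1, true
}

func main() {
	// Walk the lifecycle from the first phase to the last.
	for p, ok := PhaseInit, true; ok; p, ok = p.Next() {
		fmt.Println(p)
	}
}
```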

Graceful Metadata Leadership Loss (#935)

Previously, the coordinator would panic on a metadata version conflict during Store() operations. The coordinator now returns a typed ErrMetadataBadVersion error instead of crashing, and WaitToBecomeLeader() exposes a channel that closes when leadership is lost (whether via configmap lease expiry or raft leader change). The gRPC server monitors this channel and cleanly restarts the coordinator, enabling proper failover without panic-induced crashes.
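
The restart-on-loss pattern can be sketched as a small supervisor loop. This is only an approximation under stated assumptions: superviseLeadership, acquire, start, and stop are hypothetical names, and the real WaitToBecomeLeader() signature is not shown in these notes.

```go
package main

import (
	"context"
	"fmt"
)

// superviseLeadership is a hypothetical supervisor loop. acquire blocks until
// this instance becomes leader and returns a channel that is closed when
// leadership is lost. The function returns the number of terms served.
func superviseLeadership(ctx context.Context, acquire func() <-chan struct{}, start, stop func()) int {
	terms := 0
	for {
		lost := acquire()
		start()
		terms++
		select {
		case <-lost:
			stop() // leadership lost: shut down cleanly, then campaign again
		case <-ctx.Done():
			stop()
			return terms
		}
	}
}

// demo simulates two leadership losses followed by a process shutdown.
func demo() int {
	ctx, cancel := context.WithCancel(context.Background())
	calls := 0
	acquire := func() <-chan struct{} {
		calls++
		ch := make(chan struct{})
		if calls < 3 {
			close(ch) // simulate an immediate leadership loss
		} else {
			cancel() // third term: simulate shutdown instead
		}
		return ch
	}
	return superviseLeadership(ctx, acquire,
		func() { fmt.Println("coordinator started") },
		func() { fmt.Println("coordinator stopped") })
}

func main() {
	fmt.Println("terms served:", demo())
}
```

Signalling loss by closing a channel lets any number of goroutines observe it without extra coordination.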

Follower Recovery After Data Cleanup (#929)

When a follower node restarted with clean data (entering NOT_MEMBER status), the leader couldn't recover it—the error was indistinguishable from other status failures and the leader's cursor retained stale offsets. A new dedicated ErrNodeIsNotMember error code (112) lets the leader detect this state, reset the cursor to trigger snapshot transmission, and bring the follower back into the replication group automatically.
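
A minimal sketch of the leader-side handling, assuming a simplified cursor. Only the error name and code 112 come from this release; the followerCursor fields and the onReplicationError method are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// CodeNodeIsNotMember is the error code named in the release note.
const CodeNodeIsNotMember = 112

// ErrNodeIsNotMember is a plain-Go stand-in for the dedicated error.
var ErrNodeIsNotMember = errors.New("node is not a member")

// followerCursor is a hypothetical, reduced view of the leader's
// per-follower replication state.
type followerCursor struct {
	ackOffset    int64
	sendSnapshot bool
}

// onReplicationError reports whether the error was handled by discarding the
// stale offset and scheduling a snapshot for the follower.
func (c *followerCursor) onReplicationError(err error) bool {
	if !errors.Is(err, ErrNodeIsNotMember) {
		return false // some other failure: leave the cursor untouched
	}
	c.ackOffset = -1
	c.sendSnapshot = true
	return true
}

func main() {
	c := &followerCursor{ackOffset: 42}
	handled := c.onReplicationError(fmt.Errorf("append failed: %w", ErrNodeIsNotMember))
	fmt.Println(CodeNodeIsNotMember, handled, c.ackOffset, c.sendSnapshot)
}
```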

Client Resilience

Retry support for List operations (#937) — List calls previously had no retry logic and would fail immediately when hitting a non-leader node. They now retry with exponential backoff and extract leader hints, matching the pattern used by read/write batches.
Retry support for RangeScan operations (#938) — Same retry-with-backoff treatment applied to range scans for consistent client behavior across all query types.
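
The retry behavior described above can be approximated as follows. This is a sketch, not the client's code: notLeaderError, retryWithHint, and the node names are hypothetical, and the backoff bounds are arbitrary.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// notLeaderError is a hypothetical error carrying a leader hint.
type notLeaderError struct{ LeaderHint string }

func (e *notLeaderError) Error() string { return "not leader, hint: " + e.LeaderHint }

// retryWithHint calls op against target, retrying with exponential backoff
// while the failure is a notLeaderError and following any leader hint.
func retryWithHint(ctx context.Context, target string, maxAttempts int,
	initial, maxBackoff time.Duration, op func(target string) error) (string, error) {
	backoff := initial
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = op(target); err == nil {
			return target, nil
		}
		var nl *notLeaderError
		if !errors.As(err, &nl) {
			return target, err // not retriable
		}
		if nl.LeaderHint != "" {
			target = nl.LeaderHint
		}
		select {
		case <-time.After(backoff):
		case <-ctx.Done():
			return target, ctx.Err()
		}
		if backoff *= 2; backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
	return target, err
}

// demo simulates a List call that first lands on a follower.
func demo() (string, int, error) {
	attempts := 0
	op := func(target string) error {
		attempts++
		if target != "node-2" {
			return &notLeaderError{LeaderHint: "node-2"}
		}
		return nil
	}
	target, err := retryWithHint(context.Background(), "node-1", 5,
		time.Millisecond, 10*time.Millisecond, op)
	return target, attempts, err
}

func main() {
	target, attempts, err := demo()
	fmt.Println(target, attempts, err)
}
```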

Operational Improvements

gRPC health server starts before leader election (#930) — In Kubernetes StatefulSet deployments, non-leader coordinator pods were stuck in crash loops because the health server only started after winning the leader election. Health probes now succeed immediately on all pods regardless of leadership status.
GetInfo RPC called only on state transitions (#933) — The coordinator was calling GetInfo on every health check (~every 2 seconds), flooding logs and generating unnecessary network traffic. It now fires only when a server transitions from NotRunning to Running.
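
The transition-only GetInfo call amounts to edge detection on the health state. A minimal sketch, with hypothetical nodeMonitor and observe names standing in for the coordinator's health-check loop:

```go
package main

import "fmt"

type nodeStatus int

const (
	NotRunning nodeStatus = iota
	Running
)

// nodeMonitor is a hypothetical stand-in for the coordinator's per-server
// health tracking. getInfo represents the GetInfo RPC.
type nodeMonitor struct {
	status  nodeStatus
	getInfo func()
}

// observe records one health-check result and calls getInfo only on the
// NotRunning → Running edge, not on every successful check.
func (m *nodeMonitor) observe(healthy bool) {
	next := NotRunning
	if healthy {
		next = Running
	}
	if m.status == NotRunning && next == Running {
		m.getInfo()
	}
	m.status = next
}

// countGetInfoCalls replays a series of health checks and counts the calls.
func countGetInfoCalls(checks []bool) int {
	calls := 0
	m := &nodeMonitor{getInfo: func() { calls++ }}
	for _, healthy := range checks {
		m.observe(healthy)
	}
	return calls
}

func main() {
	// Up, still up, down, up again, still up.
	fmt.Println(countGetInfoCalls([]bool{true, true, false, true, true}))
}
```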

Bug Fixes

Data race in followerCursor (#946) — Fixed a race between followerCursor.Close() calling stream.CloseSend() and the entries loop calling stream.Send() on the same gRPC stream. The fix removes the explicit CloseSend() since context cancellation already handles stream teardown.
Test stability — Resolved flakiness in TestShardController_SwapNodeWithLeaderElectionFailure (#927), TestSyncClientImpl_InternalKeys (#939), and TestLeaderBalancedNodeAdded (#947).

Build & Dependencies

Upgraded to Go 1.26 (#926)
OpenTelemetry SDK bumped from 1.35.0 to 1.40.0 (#925)

Contributors

@merlimat, @mattisonchao, @dependabot

Full changelog: v0.16.0...v0.16.1

@merlimat

Oxia v0.16.0 Release Notes

Released: March 2, 2026

This is a significant feature release introducing database integrity checksums, feature negotiation for safe rolling upgrades, leader hints for faster client retries, and WAL-based control request replication. It also includes important crash-safety and data-race fixes across the storage and replication layers.

Highlights

Database Fingerprint Checksum (#877, #890, #891)

Oxia now computes a chained CRC32 checksum over all mutation operations, providing a fingerprint of the database state. This enables operators to verify data consistency across replicas by comparing checksum values. The checksum is exposed as a WAL gauge metric and is computed incrementally — each entry's CRC is derived from the previous entry's CRC plus the current payload, forming a tamper-evident chain.
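
The chaining idea can be sketched with Go's standard hash/crc32 package. The Castagnoli polynomial, the zero seed, and the function names are assumptions for illustration; the notes only state that a chained CRC32 is used.

```go
package main

import (
	"fmt"
	"hash/crc32"
)

// castagnoli is an assumed polynomial choice; the release note only says CRC32.
var castagnoli = crc32.MakeTable(crc32.Castagnoli)

// chainCRC derives an entry's CRC from the previous entry's CRC and the
// current payload.
func chainCRC(prev uint32, payload []byte) uint32 {
	return crc32.Update(prev, castagnoli, payload)
}

// fingerprint folds every entry into one running checksum, starting from zero.
func fingerprint(entries [][]byte) uint32 {
	var crc uint32
	for _, e := range entries {
		crc = chainCRC(crc, e)
	}
	return crc
}

func main() {
	leader := [][]byte{[]byte("put a=1"), []byte("put b=2"), []byte("delete a")}
	follower := [][]byte{[]byte("put a=1"), []byte("put b=2"), []byte("delete a")}

	// Replicas that applied the same mutations in the same order agree.
	fmt.Println("replicas match:", fingerprint(leader) == fingerprint(follower))

	// The chain can be extended one entry at a time without rescanning history.
	next := chainCRC(fingerprint(leader), []byte("put c=3"))
	fmt.Printf("next fingerprint: %08x\n", next)
}
```

Because each value depends only on the previous CRC and the new payload, a replica needs to keep just one 32-bit value to continue the chain.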

Negotiated Features for Safe Rolling Upgrades (#878)

Clusters running mixed versions during rolling upgrades can now safely negotiate a common feature set. On leader election, the coordinator collects each node's supported features via a new GetInfo RPC. The system computes the intersection — only capabilities supported by all quorum members become active. Nodes lacking the GetInfo endpoint (older versions) are treated as supporting no new features, ensuring backward compatibility without manual intervention.
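
The intersection rule can be illustrated in a few lines. The negotiate function, node names, and feature names are hypothetical; a node that cannot answer GetInfo is modeled as reporting no features.

```go
package main

import (
	"fmt"
	"sort"
)

// negotiate returns the features reported by every node, sorted by name.
// A node that could not answer GetInfo is represented by a nil slice.
func negotiate(supported map[string][]string) []string {
	counts := make(map[string]int)
	for _, features := range supported {
		seen := make(map[string]bool) // ignore duplicates within one node
		for _, f := range features {
			if !seen[f] {
				seen[f] = true
				counts[f]++
			}
		}
	}
	common := []string{}
	for f, n := range counts {
		if n == len(supported) {
			common = append(common, f)
		}
	}
	sort.Strings(common)
	return common
}

func main() {
	upgraded := map[string][]string{
		"node-1": {"db-checksum", "leader-hint"},
		"node-2": {"db-checksum"},
		"node-3": {"leader-hint", "db-checksum"},
	}
	fmt.Println(negotiate(upgraded))

	// One node still runs an older version without GetInfo.
	mixed := map[string][]string{
		"node-1": {"db-checksum"},
		"node-2": nil,
	}
	fmt.Println(negotiate(mixed))
}
```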

Control Request Replication via WAL (#882)

Control-plane commands (such as feature enablement) are now durably replicated through the WAL state machine alongside regular data writes. Previously these commands existed only in-memory and could be lost on failure. A new statemachine package provides a unified Proposal interface where each proposal type (write or control) knows how to serialize and apply itself, ensuring all replicas converge to the same state.
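
A much-reduced sketch of the idea, assuming a toy state and omitting serialization. The State, WriteProposal, ControlProposal, and replay names are hypothetical and are not the statemachine package's actual API.

```go
package main

import "fmt"

// State is a toy replica state: user data plus enabled features.
type State struct {
	Data     map[string]string
	Features map[string]bool
}

// Proposal is a hypothetical, reduced version of a unified proposal
// interface: every log entry knows how to apply itself.
type Proposal interface {
	Apply(s *State)
}

// WriteProposal is a regular data write.
type WriteProposal struct{ Key, Value string }

func (p WriteProposal) Apply(s *State) { s.Data[p.Key] = p.Value }

// ControlProposal is a control-plane command, here a feature enablement.
type ControlProposal struct{ EnableFeature string }

func (p ControlProposal) Apply(s *State) { s.Features[p.EnableFeature] = true }

// replay applies a log in order, as a replica would after reading its WAL.
func replay(log []Proposal) *State {
	s := &State{Data: map[string]string{}, Features: map[string]bool{}}
	for _, p := range log {
		p.Apply(s)
	}
	return s
}

func main() {
	log := []Proposal{
		WriteProposal{Key: "k", Value: "v1"},
		ControlProposal{EnableFeature: "db-checksum"},
		WriteProposal{Key: "k", Value: "v2"},
	}
	leader, follower := replay(log), replay(log)
	fmt.Println(leader.Data["k"] == follower.Data["k"],
		leader.Features["db-checksum"] == follower.Features["db-checksum"])
}
```

Because control commands travel through the same ordered log as writes, a replica that replays the log after a restart recovers them as well.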

Leader Hint Support (#883)

When a client sends a request to a non-leader node, the error response now includes a LeaderHint indicating which node currently holds leadership. The client extracts this hint and retries directly to the correct leader instead of falling back to potentially stale shard assignments. This significantly reduces retry latency during leadership transitions. Stale write stream connections are also invalidated on hint receipt.

Bug Fixes

Dangling index files on crash (#920) — Swapped deletion order so the .idx file is removed before the .txn file. Previously, a crash between the two deletions could leave orphaned .idx files permanently, since segment discovery scans for .txn files.
Panic on empty WAL index files (#912) — Added a length validation before reading the CRC header. A 0-byte .idx file previously triggered a slice-bounds panic; it now returns ErrDataCorrupted so the index is rebuilt from the transaction file.
WAL CRC chain divergence after snapshot (#901) — When a follower installs a snapshot, the WAL is cleared, resetting the CRC chain to zero. A new previous_entry_crc field in the Append message now seeds the follower's CRC chain from the leader, maintaining checksum consistency across replicas.
Infinite loop in balanceHighestNode (#895) — When swapShard returned an error with strict anti-affinity, a continue statement skipped the iterator advance, causing the balancer goroutine to retry the same shard forever. The fix always advances the iterator regardless of swap outcome.
Type assertion panic in Cache.Get() (#881) — Negative cache entries (for non-existent keys) used a double-wrapped optional type that didn't match the type assertion on retrieval. Fixed by aligning the empty entry type with the non-empty entry structure.
EOF not treated as retriable on write streams (#917) — When a server closes a bidirectional write stream before delivering a gRPC status, the client receives io.EOF which wasn't classified as retriable. The client now retries on EOF, allowing leader hints to work correctly during leadership transitions.
Data races (#918, #907, #903, #893) — Fixed multiple concurrent access issues across the codebase, including races in replication, caching, and health-check paths.
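
The length check from #912 follows a common defensive pattern, sketched here. The 4-byte big-endian header layout and the readCRCHeader name are assumptions; only ErrDataCorrupted is named in these notes.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// ErrDataCorrupted is a stand-in for the error named in the release note.
var ErrDataCorrupted = errors.New("data corrupted")

// crcHeaderSize is an assumed header size for illustration.
const crcHeaderSize = 4

// readCRCHeader validates the buffer length before slicing, so an empty or
// truncated index file yields an error instead of a slice-bounds panic.
func readCRCHeader(idx []byte) (uint32, error) {
	if len(idx) < crcHeaderSize {
		return 0, ErrDataCorrupted
	}
	return binary.BigEndian.Uint32(idx[:crcHeaderSize]), nil
}

func main() {
	// A 0-byte index file: the caller can fall back to rebuilding the index.
	if _, err := readCRCHeader(nil); errors.Is(err, ErrDataCorrupted) {
		fmt.Println("index unreadable, rebuild from the transaction file")
	}

	crc, err := readCRCHeader([]byte{0, 0, 0, 7, 0xAA})
	fmt.Println(crc, err)
}
```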

Improvements

Follower state applier retry logic — The follower stateApplier goroutine now retries on transient errors instead of failing permanently, improving resilience during temporary storage issues.
Follower controller lifecycle refactor — Cleaner startup/shutdown sequence for the follower controller, reducing the surface for resource leaks and race conditions.
Dynamic log level changes (#914) — Runtime log level changes (via configuration) now propagate to existing loggers, not just newly created ones.
Manual Docker build workflow (#919) — Added a GitHub Actions workflow for manual Docker image builds and pushes.
Test stability (#915, #916) — Reduced flakiness in integration and replication tests.

Build & Dependencies

OpenTelemetry SDK and other dependency updates via Dependabot

Contributors

@merlimat, @coderzc, @labuladong, @mattisonchao, @dependabot

Full changelog: v0.15.3...v0.16.0

@mattisonchao

v0.15.3

What's Changed

feat: introduce the auto release by @mattisonchao in #866
Fix release workflow by using Makefile target for consistent builds by @Copilot in #868
Fix: Binary releases missing execute permissions by @Copilot in #870
feat: add static key file support for OIDC authentication with per-issuer configuration by @Copilot in #874
Revert "feat: add static key file support for OIDC authentication with per-issuer configuration (#874)" by @mattisonchao in #875
Feat: add per-issuer OIDC configuration with static key file support by @mattisonchao in #876
Allow to override version id and modification count by @merlimat in #872

Full Changelog: v0.15.2...v0.15.3

@mattisonchao

v0.15.2

⚠️ Critical Fix: This release fixes a data loss issue when upgrading from <= v0.14.4.
Versions v0.15.0 and v0.15.1 had a database format conversion bug that could cause data loss during upgrades. If you are on v0.14.4 or earlier, upgrade directly to this version or later. See #846 for details.

⚠️ Critical Fix: Data loss with node expansion (affects <= v0.14.4).
Versions v0.14.4 and earlier had a non-atomic shard moving deletion that could cause data loss when expanding nodes. See #845 for details.

What's Changed

🛠 Bug Fixes

fix(kvstore): Prevent data loss from crash during DB format conversion by @mattisonchao in #838
fixes: configurable WAL sync by @mattisonchao in #857

🚀 Features

feat: Introduce dataserver and coordinator configuration (part.1) by @mattisonchao in #848
feat: Introduce dataserver and coordinator configuration (part.2) by @mattisonchao in #859
feat: Introduce dataserver and coordinator configuration (part.3) by @mattisonchao in #860
feat: make the assignment information become debug level by @mattisonchao in #861
fix the wrong default cluster config causes breaking by @mattisonchao in #863
fixes the wrong cluster config parameter name by @mattisonchao in #864

Full Changelog: v0.15.1...v0.15.2

@merlimat

v0.15.1

⚠️ Known Issue: Data loss when upgrading from <= v0.14.4 to this version.
Upgrading from v0.14.4 or earlier to v0.15.0 or v0.15.1 can cause data loss due to a database format conversion issue. Upgrade directly to v0.15.2 or later instead. See #846 for details.

What's Changed

Set the global Otel meter provider by @merlimat in #830
Moved server common to oxiad/common by @merlimat in #829
feat: move load balancer logs to debug level by @mattisonchao in #831
fixes: wrong package import by @mattisonchao in #832

Full Changelog: v0.15.0...v0.15.1

@merlimat

v0.15.0

What's Changed

Pin Pebble disk format to a specific version by @merlimat in #748
Fixed panic in coordinator when not initialized by @merlimat in #751
Let the coordinator wait to become leader, in case multiple instances come up by @merlimat in #752
Updated to point to CNCF Code of Conduct by @merlimat in #754
Switched Copyright to "The Oxia Authors" by @merlimat in #755
Rename public gRPC to io.oxia.proto.v1.OxiaClient by @merlimat in #756
Alpine 3.22 by @merlimat in #758
Upgrade to Go 1.25 and follow newer lint rules by @merlimat in #759
Use StandaloneServer.ServiceAddr() to reduce boilerplate code in tests by @merlimat in #761
Use original context if available in Timer.Done() by @merlimat in #762
Fixed secondary index get when notifications are disabled by @merlimat in #763
Introduced key encoding for hierarchical sorting by @merlimat in #764
Fix data race in Zerolog console writer by @merlimat in #765
Fixed race condition in get sequences test by @merlimat in #766
Upgrade to Pebble 2.1.0 by @merlimat in #760
Cache Go deps in CI builds by @merlimat in #768
Refactored Dockerfile to allow caching of Go deps by @merlimat in #769
Enhance kv interface to work with different encoders by @merlimat in #767
Fixed issue with db iterator closed after the callback by @merlimat in #770
Cache docker builds for Trivy scanner by @merlimat in #771
Fixed initialization of logger when not using json by @merlimat in #773
Allow to configure db for different scenarios by @merlimat in #774
Disable colored log in tests to avoid data race detection in zerolog by @merlimat in #775
Raft based metadata provider by @merlimat in #757
Fixed logic for checking metadata version in raft metadata provider by @merlimat in #777
Coordinator: swapNode() should keep retrying leader election until it succeeds by @merlimat in #781
Use wait group in more idiomatic way by @merlimat in #784
Apply timeout to mocked server responses in coordinator test by @merlimat in #785
Minor clean up of using util methods in shard controller mocked tests by @merlimat in #787
Handle deletion of swapped out node in a background periodic task by @merlimat in #786
Removed unnecessary warning in coordinator shutdown by @merlimat in #789
fixes: flaky test - TestShardController_StartingWithLeaderAlreadyPresent by @mattisonchao in #790
fixes: deadlock when swap waitForFollowersToCatchUp by @mattisonchao in #788
refactor: introduce thread safe shard metadata by @mattisonchao in #791
Delete charts as it removed to oxia-db/helm-charts by @dao-jun in #782
refactor: move election leader logic into a separate struct by @mattisonchao in #793
[feat][admin] Initial Oxia admin commit and add list-namespaces command. by @dao-jun in #792
fix: using async write for wal benchmark by @mattisonchao in #795
Fix coordinator NPE by @dao-jun in #797
Add listnodes admin tool by @dao-jun in #800
fix: data lost by change ensemble by @mattisonchao in #798
Add modules for oxia common and client by @merlimat in #802
Ensure internal keys always sort last for natural ordering by @merlimat in #803
refactor: rename 'server/node' to 'dataserver' for better semantic clarity by @mattisonchao in #806
Use range-scan API for CLI list operation by @merlimat in #807
Fixed server -> dataserver import in client tests by @merlimat in #808
Revert "Add modules for oxia common and client (#802)" by @merlimat in #809
Hide internal keys by default in list and range-scan by @merlimat in #810
refactor: Introduce go workspace by @mattisonchao in #813
Fixed dependency issue with genproto and workspace by @merlimat in #817
Allow to configure key-sorting natural/hierarchical on namespaces by @merlimat in #818
refactor: reorganise the dataserver directory by @mattisonchao in #819
refactor: rename some plural packages to singular by @mattisonchao in #820
fixes: fix trivy upload error due to insufficient permission by @mattisonchao in #821
feat: rename nodeController to dataserverController by @mattisonchao in #822
Updated jwt, crypto and go-jose dependencies for CVEs by @merlimat in #823
Removed unused go tools from CI by @merlimat in #825
Moved proto under common/proto by @merlimat in #824
fix: rename node to data server in shard controller by @mattisonchao in #826
Added script to tag the releases by @merlimat in #827

New Contributors

@dao-jun made their first contribution in #782

Full Changelog: v0.14.4...v0.15.0

@mattisonchao

v0.14.4

⚠️ Known Issue: Data loss with node expansion.
This version has a non-atomic shard moving deletion that can cause data loss when expanding nodes. Fixed in v0.15.1. See #845 for details.

⚠️ Known Issue: Data loss when upgrading to v0.15.0 or v0.15.1.
Do not upgrade to v0.15.0 or v0.15.1. Upgrade directly to v0.15.2 or later to avoid data loss from a database format conversion bug. See #846 for details.

What's Changed

fix(obj): fix the potential pooled object leak by @mattisonchao in #743
fix(session): avoid complete rpc without response by @mattisonchao in #744

Full Changelog: v0.14.3...v0.14.4

@mattisonchao

v0.14.3

What's Changed

build(deps): bump github.com/go-viper/mapstructure/v2 from 2.3.0 to 2.4.0 by @dependabot[bot] in #741
fix(server): fix the ephemeral key leak by non-atomic operation by @mattisonchao in #742
feat: introduce committed version IDs to avoid accidentally writing dirty version ID trackers by @mattisonchao in #740

Full Changelog: v0.14.2...v0.14.3

@mattisonchao

v0.14.2

What's Changed

feat: support leader rebalance by @mattisonchao in #726
fixes: fix GHSA-fv92-fjc5-jj9h by @mattisonchao in #735
fixes: support closable health server to avoid being stuck by @mattisonchao in #734
feat(coordinator): support node controller graceful close by @mattisonchao in #733
Always validate shard id not to be null by @merlimat in #736
fix(leader): fix the concurrent write causes disordered offset by @mattisonchao in #738

Full Changelog: v0.14.1...v0.14.2

@merlimat

v0.14.1

What's Changed

Removed documentation since it's moved to different repository by @merlimat in #724
Fixed more usages of old repository name by @merlimat in #725
fix(tla+): fix wrong name epoch by @mattisonchao in #727
fix: change doc link to official website by @mattisonchao in #728
fixes: fixes data race condition by @mattisonchao in #732
feat: support secondary index name for cmd read operations by @mattisonchao in #731

Full Changelog: v0.14.0...v0.14.1

Releases: oxia-db/oxia