Skip to content

Write-behind log for async AOF-based durability#3381

Open
jjuleslasarte wants to merge 46 commits into
valkey-io:unstablefrom
jjuleslasarte:aof-durability
Open

Write-behind log for async AOF-based durability#3381
jjuleslasarte wants to merge 46 commits into
valkey-io:unstablefrom
jjuleslasarte:aof-durability

Conversation

@jjuleslasarte

@jjuleslasarte jjuleslasarte commented Mar 18, 2026

Copy link
Copy Markdown
Contributor

AOF-based Durability

fixes: #3994

Summary

This PR optimizes appendfsync always by moving the AOF flush+fsync off the
main thread (a write-behind log), so synchronous-durability workloads aren't
bottlenecked on the event loop. It preserves the existing zero-data-loss
guarantee — and extends it to reads of not-yet-durable keys — by holding
client replies in the client output buffer (COB) until the write is fsynced,
rather than blocking the main thread. It is milestone one in the durability
plan (here).

So a client writing SET foo bar still won't receive +OK until the data is
fsynced to disk (no application-level WAITAOF needed), but the main thread is
free to keep serving other clients while the fsync is in flight.

How it's enabled

The feature is opt-in and gated behind a hidden feature flag. All of the
following must hold for reply-blocking to engage on a node:

bio-aof-offload-enabled yes   # hidden flag, defaults to "no" (opt-in)
appendonly yes                # AOF must be on
appendfsync always            # synchronous fsync
# ...and the node must be a primary

With the default bio-aof-offload-enabled no, none of the reply-blocking code
runs, so stock Valkey is unaffected. There is no separate durability
config (an earlier bool was removed); behavior derives from the flag + AOF
settings. Changing appendfsync/appendonly triggers a reply-blocking reset.

Design decisions

  • Response blocking, not command blocking (reply-blocking vs WAL): Clients
    are never paused before executing — the command runs immediately and its
    reply is buffered in the COB. There is no command-rejection path;
    consistency is enforced entirely by holding the reply. A byte-precise
    "disallowed byte offset" boundary in the COB prevents the networking layer
    from flushing past it until the durably-committed offset catches up. A single
    client can hold a list of boundaries (deeper pipelines unblock
    incrementally), and _writeToClient()/writevToClient() cap each syscall at
    the boundary so a write physically cannot push blocked bytes out.
  • Uncommitted key tracking: When enabled, modified keys are tracked
    per-database with the replication offset they were dirtied at. Read commands
    touching an uncommitted key are response-blocked the same as writes: their
    replies are held until the offset at which the key was last modified is
    durably committed. This holds even for non-replicating replies (e.g. SETNX
    returning 0 because of a not-yet-durable write) — any reply can encode
    not-yet-durable state. Committed keys are cleaned up by iterating the per-DB
    hashtable when the committed offset advances — no separate tracking queue.
  • Deferred side-effects: Client (pub/sub) keyspace notifications and
    client-side-cache (tracking) invalidations are deferred until the write is
    acknowledged, so subscribers and caches don't learn about changes before
    they're durable. Module keyspace notifications are delivered inline at
    change time (not deferred) — modules filter on their own event_mask
    independent of notify-keyspace-events, and inline delivery is the only model
    that (a) preserves read-after-write for change-time consumers and (b) composes
    with module client-blocking (Fix engine crash on module client blocking during keyspace events #1819). The deferred client notification
    re-enters notifyKeyspaceEvent() at ack time with NOTIFY_IN_POST_COMMIT_TASK
    set.
  • AOF flush+fsync offloaded to BIO: To avoid blocking the main thread on
    every fsync under appendfsync always, the write+fsync runs as a background
    (BIO) job that advances the AOF-acked offset on completion. (This was
    initially built on IO threads, then corrected to the BIO path.)

How to Review

The branch isn't squashed; below is a subsystem-oriented guide (roughly the
recommended reading order). Reviewers are also welcome to review the whole thing
at once.

# Area Files What to look for
1 AOF flush+fsync offload aof.c, bio.c, bio.h The unified BIO_AOF_FSYNC job (buf != NULL ⇒ write+fsync+notify path), bioCreateAofFsyncNotifyJob, and the acked-offset tracking. Standalone perf improvement.
2 Reply-blocking orchestration reply_blocking.c/.h Core layer: per-command offset tracking, computeCommandBlockingOffset (replicating vs non-replicating paths), the COB boundary mechanism, getDurablyCommittedOffset(), and notifyReplyBlockingProgress unblocking from beforeSleep.
3 Uncommitted key tracking uncommitted_keys.c/.h Per-DB hashtable, lazy purge on read, hashtable-iteration cleanup (no queue), MULTI/EXEC buffering, and FLUSHDB/FLUSHALL/SWAPDB + function-store handling.
4 Post-commit task system post_commit_task.c/.h Task type registry, client-pointer lifetime in invalidation tasks, pending→ack promotion flow.
5 Server integration server.c, db.c, networking.c, notify.c, script.c, module.c, object.c, evict.c Pre/post command hooks, COB write-cap and clientHasPendingReplies same-buffer invariant, inline-module / deferred-client notification split, deferred-reply boundary-on-commit hooks.
6 Command spec change commands/*.json, commands.def New CMD_KEYSPACE_INFORMATIONAL flag on read-only introspection commands (KEYS, SCAN, EXISTS, TTL, TYPE, OBJECT *, DUMP, RANDOMKEY, …) so they're tracked for reply-blocking.
7 Config & build config.c, server.h, src/Makefile, cmake/Modules/SourceFiles.cmake, src/unit/Makefile, src/unit/CMakeLists.txt Hidden bio-aof-offload-enabled flag, struct fields, source wiring.
8 Tests src/unit/test_reply_blocking.cpp, tests/durability/reply_blocking.tcl, tests/unit/moduleapi/block_keyspace_notification_durability.tcl gtest (blocking/unblocking mechanics, offset tracking, task lifecycle, invalid-task rejection) + Tcl end-to-end (pause/resume, uncommitted-key blocking, MULTI/EXEC, Lua, function store, module client-blocking).

New INFO section

INFO reply_blocking reports:

  • reply_blocking_enabled1 when reply-blocking is active on this node
    (hidden flag on + appendonly yes + appendfsync always + primary), else 0.
    When 0, no other fields are emitted.
  • reply_blocking_read_blocked_count — cumulative number of read command
    replies that have been held in the COB (because they touched a not-yet-durable
    key). Counts since reply-blocking was last (re)enabled; not a live gauge.
  • reply_blocking_write_blocked_count — cumulative number of write command
    replies that have been held in the COB awaiting their durability ack. Same
    reset semantics as the read counter.
  • reply_blocking_clients_waiting_ack — live gauge: how many clients
    currently have at least one reply held in the COB waiting for the committed
    offset to advance.
  • reply_blocking_uncommitted_keys — live gauge: number of keys currently
    tracked as modified-but-not-yet-durable across all databases.
  • reply_blocking_previous_acked_offset — the most recent durably-committed
    replication offset the node has advanced to (the offset up to which replies
    have been released).
  • reply_blocking_primary_repl_offset — the node's current primary
    replication offset. The gap between this and
    reply_blocking_previous_acked_offset is the amount of written-but-not-yet-durable
    data — i.e. how far ahead execution is of the fsync.

(plus internal cumulative timing counters — per-type blocked-time totals in
microseconds — used for latency measurement)

New DEBUG subcommands (testing levers)

  • DEBUG reply-blocking-pause aof — freeze the AOF-acked offset to
    deterministically hold blocked replies.
  • DEBUG reply-blocking-resume aof — resume and let the offset catch up.
  • DEBUG client-enforce-reply-list <0|1> — force replies to skip the static
    buffer and go straight to c->reply, to exercise the reply-list boundary
    branch.
  • DEBUG set-io-last-written <client-id> <bufpos> <data_len> — inject a
    post-partial-write io_last_written state to probe the boundary comparison.

Changes since initial review round

Addressed feedback from @murphyjacob4, @zuiderkwast, @sushilpaneru1,
@sumitk163, @QuChen88 (and CodeRabbit):

  • Removed the provider registry abstraction (durability_provider.{c,h}) —
    inlined AOF logic directly since only one provider exists; renamed
    getDurabilityConsensusOffsetgetDurablyCommittedOffset,
    anyDurabilityProviderEnabledisAofDurabilityEnabled (per @zuiderkwast).
  • Restored the BIO path for AOF flush (it was incorrectly using IO threads);
    unified BIO_AOF_ALWAYS_FLUSH into BIO_AOF_FSYNC, distinguishing the
    write+fsync+notify path by buf != NULL.
  • Removed the separate durability config; gated the whole subsystem behind
    the hidden bio-aof-offload-enabled flag derived alongside appendfsync always (per @murphyjacob4, @zuiderkwast).
  • Replaced the offset-tracker queue with hashtable iteration for
    uncommitted-key cleanup — simpler, less memory, better hot-key behavior (per
    @sumitk163).
  • Renaming pass: durability*replyBlocking*/postCommit*;
    durable_task.*post_commit_task.*; INFO durability_*
    reply_blocking_*; DEBUG durability-provider-pausereply-blocking-pause.
  • Fixed dropped module keyspace notifications: module notifications are now
    delivered inline and unconditionally (independent of notify-keyspace-events
    and foreground/background origin); only client pub/sub is deferred. Removed the
    redundant moduleNotifyKeyspaceEvent() call at certification time.
  • Fixed module client-blocking (Fix engine crash on module client blocking during keyspace events #1819) durability hole: apply the durability
    boundary when the deferred reply is committed into the COB, with a fresh COB
    snapshot taken at commit (avoids a use-after-free on a stale reply-block
    pointer across the module-blocked window).
  • Fixed MOVE/COPY out-of-range destination DB: getIntFromObject only
    range-checks INT_MIN/INT_MAX, so a rejected MOVE key <dbnum..INT_MAX>
    read server.db[dest_dbid] out of bounds; both MOVE branches now bound
    dest_dbid to [0, dbnum).
  • Fixed SIGSEGV: zero-initialize the per-client reply-blocking state in
    createClient() (zmalloc doesn't zero).
  • Fixed MULTI+FLUSHALL hang: clear stale uncommitted_keys entries when
    FLUSHALL sets all_dbs_dirty inside a transaction.
  • Added a same-buffer assertion (enable-debug-assert) guarding the
    clientHasPendingReplies cursor-vs-boundary comparison.
  • Various: 32-bit build fixes (_FILE_OFFSET_BITS=64), macOS clang CI fixes,
    _Atomic C++ compatibility in server.h, dangling event-string fix in the
    keyspace-notify task, bufpos-vs-data_len copy-avoidance fix,
    getKeysUsingKeySpecs fix, CMake source-list rename
    (durable_task.cpost_commit_task.c), and removal of
    sync-replication / replica-rejection leftovers.

Known follow-ups (out of scope for this PR)

  • Module APIs to interact with durability
    (#4027): expose
    module-facing APIs so modules can observe/interact with the durability
    (reply-blocking) state rather than relying on internal hooks. Tracked
    separately.
  • Per-module keyspace-notification timing opt-in (raised in review; deferred
    per the review-bot summary): module keyspace notifications fire inline at
    change time by default. Durability-sensitive modules (external forwarding,
    audit/CDC) may want delivery gated on the write becoming durable. A follow-up
    should add a module option (mirroring the internal
    DELAY_KEYSPACE_NOTIFICATION_FOR_ZDL flag) to opt notifications into
    durable/delayed delivery while the default stays inline. Not in this PR.
  • Output-buffer back-pressure / OOM risk (raised by @yairgott): because
    replies are held in the COB until durable rather than pausing the client
    before execution, a sustained stream of not-yet-acked writes can grow the COB
    unboundedly and risk OOM. There is no pre-execution memory gate today; the
    likely mitigation is reactive throttling — suspend command ingestion for a
    client once its buffered output crosses a threshold, resuming when it drops
    back below — rather than the connection-closing client-output-buffer-limit.
    Not addressed in this PR.
  • Sync-replication / additional durability providers (next milestone): this
    PR ships only the AOF path. The broader durability plan adds replica-based
    acknowledgement as additional providers, at which point the committed offset
    becomes a consensus across providers. Future milestone.
  • Performance validation
    (#4029): throughput/latency
    of the BIO offload and the reply-blocking overhead are benchmarked separately;
    results are posted in Performance testing of improvements #4029.
  • DEBUG set-io-last-written input validation (CodeRabbit, minor): the
    subcommand has no debug-build guard and does not range-check its
    bufpos/data_len arguments, so a negative/oversized cast is reachable. It is
    a test-only DEBUG path; hardening is a low-priority follow-up.

Move the expensive AOF write+fsync off the main thread when IO threads
are available. This prevents the main thread from blocking on disk I/O
when appendfsync is set to 'always'.

Add a generic trySendJobToIOThreads() API to io_threads with round-robin
distribution, and an aof IO flush state machine (IDLE/PENDING/DONE/ERR)
with atomic coordination between main and IO threads.

The adjustIOThreadsByEventLoad() function gains a has_background_work
parameter to ensure IO threads stay active when AOF fsync work is
pending, even during low-traffic periods.

Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>

Signed-off-by: Jules Lasarte <jules.lasarte@gmail.com>
Introduce a provider registry that allows multiple durability backends
(AOF fsync, replicas, etc.) to register and contribute to a consensus
offset. The overall durability consensus is the MIN (AND) of all enabled
providers' acknowledged offsets.

Include the built-in AOF provider that tracks fsynced_reploff_pending
when appendfsync=always, and transparently passes through when not.

Add pause/resume support for providers (used via DEBUG commands) to
enable deterministic testing by freezing a provider's acknowledged
offset at a point-in-time snapshot.

Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>

Signed-off-by: Jules Lasarte <jules.lasarte@gmail.com>
Add a task registry that defers side-effects (keyspace notifications,
key invalidations, flush invalidations) until durability providers
acknowledge the associated write offset.

Each task type registers create/destroy/execute/onClientDestroy
handlers. Tasks are created during command execution with a deferred
offset, then moved to an official waiting list once the replication
offset is known. When the consensus offset advances past a task's
offset, the task is executed and freed.

Key invalidation tasks track the originating client pointer and
properly handle client disconnection before task execution.

Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>

Signed-off-by: Jules Lasarte <jules.lasarte@gmail.com>
Track which keys have been modified but not yet acknowledged by
durability providers using a per-database hashtable. This enables
rejecting reads of uncommitted keys to ensure clients only see
durable data (zero-data-loss semantics).

Each uncommitted key stores the replication offset at which it was
last modified. Keys are purged when the durability consensus offset
advances past their stored offset.

Include incremental cleanup via serverCron that scans databases
round-robin with a configurable time limit, plus immediate purging
on read access (lazy cleanup).

Also handle database-level modifications (FLUSHDB, FLUSHALL, SWAPDB)
and function store dirty tracking for transactions.

Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>

Signed-off-by: Jules Lasarte <jules.lasarte@gmail.com>
Add the core orchestration layer that blocks client responses in the
client output buffer (COB) until durability providers confirm the write
offset, then unblocks and flushes responses to clients.

reply_blocking.c/h contains:
- durabilityInit/Cleanup/Reset lifecycle management
- beforeCommandTrackReplOffset/afterCommandTrackReplOffset for tracking
  which replication offsets each command produces
- preCommandExec: rejects commands accessing uncommitted keys
- postCommandExec: blocks client responses until providers acknowledge
- notifyDurabilityProgress: called from beforeSleep to unblock clients
  whose offsets have been acknowledged
- blockClientOnReplOffset/unblockResponsesWithAckOffset
- Function store dirty tracking for FUNCTION LOAD/DELETE
- INFO durability stats generation

Integration points across the server:
- server.c: init/cleanup in server lifecycle, pre/post command hooks in
  call() and processCommand(), notifyDurabilityProgress in beforeSleep,
  uncommitted keys cleanup in serverCron, per-DB init, INFO section
- server.h: durable_t in server struct, clientDurabilityInfo in client,
  uncommitted_keys/dirty_repl_offset in serverDb, new client flag
- config.c: 'durability' bool config with dynamic update callback
- db.c: durabilitySignalModifiedKey/durabilitySignalFlushedDb hooks
- networking.c: client durability init/reset, COB reply limiting
- notify.c: defer keyspace notifications when durability is enabled
- script.c/module.c: pre-script checks for uncommitted data access
- replication.c: clear durability state on primary change
- debug.c: durability-provider-pause/resume DEBUG subcommands
- object.c: getIntFromObject utility

Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>

Signed-off-by: Jules Lasarte <jules.lasarte@gmail.com>
Add reply_blocking.c, durable_task.c, durability_provider.c, and
uncommitted_keys.c to the build system (both Makefile and CMake).

Also fix a clang compatibility issue in unit test CMakeLists.txt:
-fno-var-tracking-assignments is GCC-only, so guard it with a
compiler ID check.

Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>

Signed-off-by: Jules Lasarte <jules.lasarte@gmail.com>
Add comprehensive gtest-based unit tests covering the reply blocking
subsystem including:
- Client output buffer blocking and unblocking mechanics
- Offset tracking through command execution
- Multi-command transaction (MULTI/EXEC) offset handling
- Durability provider consensus calculations
- Deferred task lifecycle (create, execute, cleanup)
- Uncommitted key tracking and purging
- Edge cases: client disconnection, provider pause/resume

Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>

Signed-off-by: Jules Lasarte <jules.lasarte@gmail.com>
Add Tcl-based integration tests (1,051 lines) covering end-to-end
durability behavior including:
- AOF-based response blocking with appendfsync=always
- Provider pause/resume via DEBUG commands for deterministic testing
- Uncommitted key rejection (reads return error for dirty keys)
- MULTI/EXEC transaction durability semantics
- Lua script and FCALL durability checks
- Function store (FUNCTION LOAD/DELETE) durability blocking
- Client disconnection during blocked state
- Multiple concurrent clients with interleaved blocking/unblocking
- INFO durability stats verification

Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>

Signed-off-by: Jules Lasarte <jules.lasarte@gmail.com>
@jjuleslasarte

Copy link
Copy Markdown
Contributor Author

@murphyjacob4

Copy link
Copy Markdown
Contributor

durability yes

Do you think we need a separate config for this? If you set up fsync always, can we imply that durability yes makes sense? I would prefer to kill the durability no and appendfsync always combination if it doesn't have any value

@yairgott

Copy link
Copy Markdown
Contributor

Response blocking vs. command blocking (or, WAL vs WBL): Clients are not paused before executing — the command runs immediately and the response is buffered in the client output buffer (COB).

Should we factor in available memory before executing the command to avoid the over-buffering which may introduce OOM risk?

@jjuleslasarte

Copy link
Copy Markdown
Contributor Author

durability yes

Do you think we need a separate config for this? If you set up fsync always, can we imply that durability yes makes sense? I would prefer to kill the durability no and appendfsync always combination if it doesn't have any value

Yeah, I went back and forth on this. I had the flag left over from the initial draft and figured it might be useful to not enable this since I wasn't sure whether we'd do a major version or minor with this change. I can remove

@jjuleslasarte

Copy link
Copy Markdown
Contributor Author

Response blocking vs. command blocking (or, WAL vs WBL): Clients are not paused before executing — the command runs immediately and the response is buffered in the client output buffer (COB).

Should we factor in available memory before executing the command to avoid the over-buffering which may introduce OOM risk?

Good point, we should have a mechanism for this. Let me think through the options -- a proactive one might be harder (as we need to estimate the output before execution) but we can probably track the ammount of pending responses (or pending writes to the durability providers) and start throttling (rejecting) writes after a certain threshold?

@zuiderkwast

Copy link
Copy Markdown
Contributor

Regarding the durability config, the way I see it is that appendfsync always already gives durability (client won't receive +OK until the data is fsynced to disk). This feature just optimizes it, so a new config is not necessary IMO.

@yairgott

yairgott commented Mar 19, 2026

Copy link
Copy Markdown
Contributor

Good point, we should have a mechanism for this. Let me think through the options -- a proactive one might be harder (as we need to estimate the output before execution) but we can probably track the ammount of pending responses (or pending writes to the durability providers) and start throttling (rejecting) writes after a certain threshold?

Yeah, proactive would be challenging but a reactive approach might be good enough. We could track the total consumed output buffer and initiate throttling once a predefined threshold is reached.

Valkey’s existing client-output-buffer-limit supports hard and soft limits, but its semantics, which lead to closing the connection, might be too disruptive in this context. Ideally, once a threshold is breached, we would suspend command ingestion for that specific client and only resume once the output buffer consumption falls back below the threshold.

In addition, pointing out that this client suspension should be conditioned on the ability to zero-copy responses (e.g., the requested key is not robj based).

Signed-off-by: Jules Lasarte <jules.lasarte@gmail.com>
@jjuleslasarte

Copy link
Copy Markdown
Contributor Author

Regarding the durability config, the way I see it is that appendfsync always already gives durability (client won't receive +OK until the data is fsynced to disk). This feature just optimizes it, so a new config is not necessary IMO.

Yeah, makes sense to me. I will remove it in the next commit, along with other feedback!

Signed-off-by: Jules Lasarte <jules.lasarte@gmail.com>
Comment thread src/networking.c Outdated
Comment thread src/durable_task.c
Comment thread src/networking.c
@zuiderkwast

Copy link
Copy Markdown
Contributor

Just regarding the title and description of the PR. This is not really changing the AOF durability. It just makes it possible to do fsync in the background so it gets faster.

I think we can name the PR something like Write-behind log for async AOF durability.

The main benefit for the future is that it introduces the WBL into the code base.

Also it isn't doing anything to sync replication, other than preparing for introducing it in the future, so I don't think it should be highlighted that much.

@jjuleslasarte

Copy link
Copy Markdown
Contributor Author

Just regarding the title and description of the PR. This is not really changing the AOF durability. It just makes it possible to do fsync in the background so it gets faster.

I think we can name the PR something like Write-behind log for async AOF durability.

The main benefit for the future is that it introduces the WBL into the code base.

Also it isn't doing anything to sync replication, other than preparing for introducing it in the future, so I don't think it should be highlighted that much.

Ack! yeah, my brain is focused on the long term but you are correct. I will address some of the other feedback and change the description.

@murphyjacob4 murphyjacob4 self-requested a review March 27, 2026 00:11

@murphyjacob4 murphyjacob4 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some first pass comments, not a complete review

Comment thread src/durability_provider.c Outdated
Comment thread src/unit/test_entry.cpp Outdated
Comment thread src/unit/test_files.h Outdated
Comment thread src/notify.c Outdated
Comment thread src/uncommitted_keys.c Outdated
Comment thread src/uncommitted_keys.c Outdated
@zuiderkwast zuiderkwast changed the title Aof durability Write-behind log for async AOF-based durability Mar 30, 2026
Remove the standalone 'durability' / 'sync-replication' bool config.
Durability is now implied by the combination of appendonly + appendfsync
always — there is no need for a separate knob.

- isDurabilityEnabled() now checks aof_state != AOF_OFF && aof_fsync == AOF_FSYNC_ALWAYS
- Removed 'enabled' field from durable_t struct
- AOF durability provider isEnabled() now requires appendfsync always
  (instead of a pass-through returning primary_repl_offset)
- updateAppendFsync() and updateAppendOnly() call durabilityReset()
- Updated integration and unit tests to use appendfsync always/everysec

Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>

Signed-off-by: Jules Lasarte <jules.lasarte@gmail.com>
… queue

Separate the two concerns in uncommitted key tracking:

1. Is a key dirty? Mark it immediately when mutated, even inside
   MULTI/EXEC (using LLONG_MAX as placeholder offset so reads block).

2. When can we clean it? Use a FIFO offset tracker queue in the
   durability system. After a transaction completes, update the
   placeholder to the real offset and enqueue for cleanup.

drainCommittedKeys() pops from the queue head when the committed
offset advances, replacing the old periodic scan-based
clearUncommittedKeysAcknowledged() cron job entirely.

- Removed keys_cleanup_time_limit_ms and curr_db_scan_idx from durable_t
- Removed getUncommittedKeysCleanupTimeLimit()
- drainCommittedKeys() called from notifyDurabilityProgress()
- Removed periodic clearUncommittedKeysAcknowledged() from serverCron

Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>

Signed-off-by: Jules Lasarte <jules.lasarte@gmail.com>
@zuiderkwast

Copy link
Copy Markdown
Contributor

@valkey-review-bot Please review this PR and also summarize the open review comments.

@valkey-review-bot

valkey-review-bot Bot commented Jun 18, 2026

Copy link
Copy Markdown

Reviewed the reply-blocking / write-behind-AOF change end to end (AOF BIO offload, the reply_blocking/post_commit_task/uncommitted_keys modules, and the db.c/networking.c/notify.c/server.c integration). Three issues stood out where I have concrete evidence; details inline.

Summary of the currently-open review threads, per your request:

  • Module keyspace events on the durable path (coderabbit on notify.c:147, murphyjacob4 on post_commit_task.c:294 / module.c:6818): partially open. The author's reply that "modules receive callbacks immediately" holds only when a deferred task is queued; the class-disabled branch drops them entirely (flagged inline). The ASM-style opt-in for module KSN was deferred to a follow-up.
  • COB write-boundary gating (sushilpaneru1 on networking.c:2732): author acknowledged the write handler must cap the boundary and said a test would be added; the bufpos-based cap in _writeToClient/writevToClient/clientHasPendingReplies is the result. The copy-avoidance data_len-vs-bufpos concern was addressed by switching to io_last_written.bufpos.
  • Expire/evict dirty-key tracking (sushilpaneru1 on db.c:757): addressed — handleUncommittedKeyForClient(NULL, …) is now called on the active-expire and eviction paths.
  • Uncommitted-key cleanup mechanism (sumitk163 on uncommitted_keys.c:146): resolved — the offset-tracker queue was replaced with safe-iterator hashtable drain.
  • Replica read rejection (murphyjacob4 on uncommitted_keys.c:450) and clearing dirty keys on role change (sushilpaneru1 on replication.c:1476): confirmed sync-replication leftovers, explicitly out of scope for this PR.
  • AOF offload depth (sumitk163 on aof.c:1207/1192): incremental concurrent-write and aof-fsync-always latency tracking deferred to follow-ups.
  • Naming / file layout (zuiderkwast; sushilpaneru1 on reply_blocking.h:55): much of the "durability" prefix was renamed to reply_blocking/post_commit; keeping the three files separate and deferring further renames was accepted to limit churn.
  • config.c apply ordering (coderabbit on config.c:2610): author argues bioDrainWorker(BIO_AOF_FSYNC) in updateAppendFsync() covers it; thread left open by the bot but not by a human.
  • DEBUG set-io-last-written input validation (coderabbit on debug.c:1104): open; subcommand has no debug-build guard, so the negative/oversized cast is reachable, but it's a test-only DEBUG path.

Comment thread src/notify.c
Comment thread src/reply_blocking.c Outdated
Comment thread cmake/Modules/SourceFiles.cmake Outdated
CPUmaker added a commit to jjuleslasarte/valkey that referenced this pull request Jun 18, 2026
…io#3381)

clientHasPendingReplies() compares io_last_written.bufpos against the
blocked-response boundary, which is only valid when both refer to the
same buffer. That holds today via implicit invariants but is fragile.
Add a debugServerAssert (active only under enable-debug-assert) that
io_last_written.buf is NULL or the exact buffer the boundary sits in, so
a future change that breaks the invariants fails tests.

Also add an integration test for the previously-untested reply-list
branch: a multi-block allowed prefix is released while a pipelined
blocked write reply is withheld until the provider acks.

Signed-off-by: Chris Li <1070743423@qq.com>
CPUmaker added a commit to jjuleslasarte/valkey that referenced this pull request Jun 19, 2026
…e list (valkey-io#3381)

A rejected MOVE (e.g. MOVE key 100000) reaches the reply-blocking offset path and indexed server.db[dest_dbid] without checking dbnum, an out-of-bounds read. Bound dest_dbid to [0, dbnum) in both MOVE branches (COPY was already guarded via getDbIdFromRobj) and add a MOVE/COPY out-of-range regression test.

Also point the CMake source list at the renamed post_commit_task.c (was durable_task.c), which broke cmake configure.

Signed-off-by: Chris Li <1070743423@qq.com>
CPUmaker added a commit to jjuleslasarte/valkey that referenced this pull request Jun 23, 2026
…#3381)

Under reply-blocking, module keyspace notifications were delivered only
via the deferred post-commit task, which is registered only when the
client's notify-keyspace-events matches the event. With the default
empty config, or for background jobs (expiry/eviction) that bypass
certification, module subscribers were dropped entirely -- yet modules
filter by their own event mask and should see every event. Notify
modules inline (as upstream does) and defer only the client pub/sub;
certification no longer re-notifies modules.

Inline notification also lets a module block the client from its
callback (valkey-io#1819) under reply-blocking. To keep that durable, re-apply
the reply-blocking boundary when the parked reply is committed on
unblock, unless the write is already acked. The boundary is snapshotted
at commit time, not reused from command start, to avoid a use-after-free
on a reply buffer node freed during the block.

Add tests/unit/moduleapi/block_keyspace_notification_durability.tcl.

Signed-off-by: Chris Li <1070743423@qq.com>
jjuleslasarte and others added 5 commits June 23, 2026 20:28
…IO jobs, rename APIs

Address @zuiderkwast's review comments on the AOF durability PR:

Architecture:
- Remove durability_provider.{h,c} registry abstraction — inline AOF
  logic directly since only one provider exists
- Unify BIO_AOF_ALWAYS_FLUSH into BIO_AOF_FSYNC — use buf != NULL to
  distinguish write+fsync+notify path from plain fsync path

Naming:
- getDurabilityConsensusOffset -> getDurablyCommittedOffset
- anyDurabilityProviderEnabled -> isAofDurabilityEnabled
- clientDurabilityInfo -> clientReplyBlockingState
- durabilityInit/Cleanup/Reset -> replyBlockingInit/Cleanup/Reset
- bioCreateAofAlwaysFlushJob -> bioCreateAofFsyncNotifyJob
- aof_io_flush_* -> aof_bio_flush_*, AOF_IO_FLUSH_* -> AOF_BIO_FLUSH_*
- DEBUG durability-provider-pause -> DEBUG reply-blocking-pause
- taskWaitingAckType fields to snake_case

Struct layout:
- Move durable_t durability and bio_aof_offload_enabled from top of
  valkeyServer to AOF persistence section

Style:
- Convert doxygen /** to compact /* */ throughout PR-introduced code
- Convert // comments to /* */ in PR-introduced code
- Fix tabs/spaces in src/Makefile

Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>
Signed-off-by: Jules Lasarte <jules.lasarte@gmail.com>
…e cap fix

Naming refactor:
- Rename durability* → replyBlocking*/postCommit* across all files
- Rename durable_task.c/h → post_commit_task.c/h
- Rename INFO fields: durability_* → reply_blocking_*
- Fix comment style (single-line /* */ → //)

Write cap fix:
- Cap _writeToClient() and writevToClient() at disallowed_byte_offset
- Prevents single syscall from flushing past the reply-blocking boundary

Module boundary cleanup:
- Move all_dbs_dirty_in_current_cmd into uncommitted_keys module
- Pass previous_acked_offset as parameter instead of reading server struct
- Decouple uncommitted_keys from reply_blocking internals

Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>

Signed-off-by: Jules Lasarte <jules.lasarte@gmail.com>
Address review comments from @murphyjacob4:

- Add a dedicated CMD_KEYSPACE_INFORMATIONAL command flag instead of
  overloading the KEYSPACE ACL category to identify read-only keyspace
  introspection commands; tag the 18 matching commands and simplify the
  IS_KEYSPACE_INFORMATIONAL macro.
- Rename call-site-named reply-blocking hooks to describe behavior:
  recordReplOffsetBaseline, computeCommandBlockingOffset,
  beginCommandReplyBlocking, validateScriptForReplyBlocking,
  finalizeCommandReplyBlocking.
- Move NOTIFY_IN_POST_COMMIT_TASK into the NOTIFY_ flag block in server.h.
- Use C-style comments and fresh bitfield numbering in reply_blocking.h
  and drop redundant macro comments.
- Replace the ambiguous "ZDL" comment with "reply-blocking".

Also fix an incomplete rename from the durability -> reply-blocking
refactor: aof.c still called notifyDurabilityProgress(); renamed it to
notifyReplyBlockingProgress() so the build links.

Signed-off-by: jjuleslasarte <jules.lasarte@gmail.com>
…io#3381)

clientHasPendingReplies() compares io_last_written.bufpos against the
blocked-response boundary, which is only valid when both refer to the
same buffer. That holds today via implicit invariants but is fragile.
Add a debugServerAssert (active only under enable-debug-assert) that
io_last_written.buf is NULL or the exact buffer the boundary sits in, so
a future change that breaks the invariants fails tests.

Also add an integration test for the previously-untested reply-list
branch: a multi-block allowed prefix is released while a pipelined
blocked write reply is withheld until the provider acks.

Signed-off-by: Chris Li <1070743423@qq.com>
…e list (valkey-io#3381)

A rejected MOVE (e.g. MOVE key 100000) reaches the reply-blocking offset path and indexed server.db[dest_dbid] without checking dbnum, an out-of-bounds read. Bound dest_dbid to [0, dbnum) in both MOVE branches (COPY was already guarded via getDbIdFromRobj) and add a MOVE/COPY out-of-range regression test.

Also point the CMake source list at the renamed post_commit_task.c (was durable_task.c), which broke cmake configure.

Signed-off-by: Chris Li <1070743423@qq.com>
CPUmaker added a commit to jjuleslasarte/valkey that referenced this pull request Jun 23, 2026
…#3381)

Under reply-blocking, module keyspace notifications were delivered only
via the deferred post-commit task, which is registered only when the
client's notify-keyspace-events matches the event. With the default
empty config, or for background jobs (expiry/eviction) that bypass
certification, module subscribers were dropped entirely -- yet modules
filter by their own event mask and should see every event. Notify
modules inline (as upstream does) and defer only the client pub/sub;
certification no longer re-notifies modules.

Inline notification also lets a module block the client from its
callback (valkey-io#1819) under reply-blocking. To keep that durable, re-apply
the reply-blocking boundary when the parked reply is committed on
unblock, unless the write is already acked. The boundary is snapshotted
at commit time, not reused from command start, to avoid a use-after-free
on a reply buffer node freed during the block.

Add tests/unit/moduleapi/block_keyspace_notification_durability.tcl.

Signed-off-by: Chris Li <1070743423@qq.com>
CPUmaker added a commit to jjuleslasarte/valkey that referenced this pull request Jun 23, 2026
…#3381)

Under reply-blocking, module keyspace notifications were delivered only
via the deferred post-commit task, which is registered only when the
client's notify-keyspace-events matches the event. With the default
empty config, or for background jobs (expiry/eviction) that bypass
certification, module subscribers were dropped entirely -- yet modules
filter by their own event mask and should see every event. Notify
modules inline (as upstream does) and defer only the client pub/sub;
certification no longer re-notifies modules.

Inline notification also lets a module block the client from its
callback (valkey-io#1819) under reply-blocking. To keep that durable, re-apply
the reply-blocking boundary when the parked reply is committed on
unblock, unless the write is already acked. The boundary is snapshotted
at commit time, not reused from command start, to avoid a use-after-free
on a reply buffer node freed during the block.

Add tests/unit/moduleapi/block_keyspace_notification_durability.tcl.

Signed-off-by: Chris Li <1070743423@qq.com>
…#3381)

Under reply-blocking, module keyspace notifications were delivered only
via the deferred post-commit task, which is registered only when the
client's notify-keyspace-events matches the event. With the default
empty config, or for background jobs (expiry/eviction) that bypass
certification, module subscribers were dropped entirely -- yet modules
filter by their own event mask and should see every event. Notify
modules inline (as upstream does) and defer only the client pub/sub;
certification no longer re-notifies modules.

Inline notification also lets a module block the client from its
callback (valkey-io#1819) under reply-blocking. To keep that durable, re-apply
the reply-blocking boundary when the parked reply is committed on
unblock, unless the write is already acked. The boundary is snapshotted
at commit time, not reused from command start, to avoid a use-after-free
on a reply buffer node freed during the block.

Add tests/unit/moduleapi/block_keyspace_notification_durability.tcl.

Signed-off-by: Chris Li <1070743423@qq.com>
Resolve conflict in src/debug.c by keeping DEBUG subcommands from both
sides: reply-blocking-pause/resume and set-io-last-written (feature
branch) alongside force-free-primary-async and protect-client (unstable).

Signed-off-by: Chris Li <1070743423@qq.com>
@CPUmaker

CPUmaker commented Jun 24, 2026

Copy link
Copy Markdown

Reviewed the reply-blocking / write-behind-AOF change end to end (AOF BIO offload, the reply_blocking/post_commit_task/uncommitted_keys modules, and the db.c/networking.c/notify.c/server.c integration). Three issues stood out where I have concrete evidence; details inline.

The open threads mentioned above are addressed or tracked as follow-ups of separate issues.

jjuleslasarte and others added 3 commits June 25, 2026 09:40
…#3381)

Per the maintainers' decision, reply-blocking observability is internal
and test-only, not customer-facing. Move the reply_blocking_* fields out
of the standalone (all/everything-visible) INFO section and into the
hidden Debug section, which is emitted only for an explicit INFO debug --
excluded from INFO, INFO all, and INFO everything.

Update the durability tests to read INFO debug. Further field-level
changes (dropping reply_blocking_enabled and the offset fields, and
relocating the committed/uncommitted offsets to the Replication section)
are tracked separately pending TSC approval.

Signed-off-by: Chris Li <1070743423@qq.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

major-decision-pending Major decision pending by TSC team

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Implement WBL for AOF

9 participants