Durability API guarantee broken in single node cluster #14370
Description
I observed a possible data-loss scenario and would like the community to comment, or to correct me if I am wrong.
Before explaining it, I would like to describe the happy path when a user does a PUT <key, value>. I have tried to include only the steps necessary to focus on this issue, and I consider a single etcd instance.
====================================================================================
----------api thread --------------
1. User calls etcdctl PUT k v.
2. The request lands in the v3_server.go::put function with the message about k, v.
3. The call delegates through a series of function calls and enters v3_server.go::processInternalRaftRequestOnce.
4. It registers for a signal with the wait utility against this keyid.
5. The call delegates further through a series of function calls and enters raft/node.go::stepWithWaitOption(..message..).
6. It wraps this message in a msgResult channel and updates its result channel, then sends the message to the propc channel.
7. After sending, it waits on msgResult.channel.
----------api thread waiting --------------
8. On seeing a message in the propc channel, raft/node.go::run() wakes up, and a sequence of calls adds message.Entries to raftLog.
9. It notifies msgResult.channel.
----------api thread wakes--------------
10. On seeing the notification on msgResult.channel, the api thread wakes, returns back down the stack to v3_server.go::processInternalRaftRequestOnce, and waits for the signal that it registered in step #4.
----------api thread waiting --------------
11. In the next iteration of raft/node.go::run(), it gets the entry from raftLog and adds it to readyc.
12. etcdserver/raft.go::start wakes up on seeing this entry in readyc and adds the entry to the applyc channel,
13. and synchronously writes to the wal log ---------------------> wal log
14. etcdserver/server.go wakes up on seeing the entry in the applyc channel (added in step #12).
15. From step #14, the call goes through a series of calls and lands in server.go::applyEntryNormal.
16. applyEntryNormal calls applyV3.apply, which eventually puts the KV into the mvcc kvstore txn kvindex.
17. applyEntryNormal now sends the signal for this key, which wakes up the api thread that is waiting in step #10.
----------api thread wakes--------------
18. The api thread wakes here and sends the acknowledgement back to the user.
----------user sees ok--------------
19. The batcher flushes the entries added to the kvstore txn kvindex to the database file. (This can also happen before step 18, depending on its timer.)
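The register/signal handshake between steps 4, 10 and 17 can be pictured with a small sketch. This is not etcd's wait utility; the type `waitRegistry` and its methods are hypothetical names chosen for illustration, and the result is simplified to a string.

```go
package main

import (
	"fmt"
	"sync"
)

// waitRegistry is a hypothetical stand-in for the wait utility mentioned in
// step 4: it maps a request ID to a channel that the apply path signals later.
type waitRegistry struct {
	mu sync.Mutex
	m  map[uint64]chan string
}

func newWaitRegistry() *waitRegistry {
	return &waitRegistry{m: make(map[uint64]chan string)}
}

// Register corresponds to step 4: the api thread asks to be signalled for id.
func (w *waitRegistry) Register(id uint64) <-chan string {
	w.mu.Lock()
	defer w.mu.Unlock()
	ch := make(chan string, 1) // buffered so Trigger never blocks
	w.m[id] = ch
	return ch
}

// Trigger corresponds to step 17: the apply path wakes the waiting api thread.
// Triggering an unknown id is a no-op.
func (w *waitRegistry) Trigger(id uint64, result string) {
	w.mu.Lock()
	ch, ok := w.m[id]
	delete(w.m, id)
	w.mu.Unlock()
	if ok {
		ch <- result
		close(ch)
	}
}

func main() {
	w := newWaitRegistry()
	ch := w.Register(42)   // step 4: register before proposing
	go w.Trigger(42, "OK") // step 17: apply path signals completion
	fmt.Println(<-ch)      // steps 10 and 18: api thread wakes and acknowledges
}
```

Note that nothing in this handshake refers to the wal write in step 13: the signal depends only on the apply path reaching step 17.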
In the flow above, if the thread running step #13 is pre-empted by the underlying operating system and rescheduled only after step #18 completes, and there is a power failure at the end of step #18 (after the user has seen OK), then the KV is written neither to the wal nor to the database file.
I think this is not seen today because the window is small: the server has to restart immediately after step 18, and immediately after step 12 the underlying OS must have pre-empted etcdserver/raft.go::start and moved it to the end of the run queue. Given these multiple conditions, it appears that we don't see data loss in practice.
But it appears from the code that it is possible. To simulate it, I added a sleep after steps 12 and 19 (and an exit after step 12). I was able to see OK, but the data was in neither the wal nor the db.
If I am wrong, my apologies; please correct my understanding.
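The interleaving described here can be reproduced in a toy model. The sketch below is not etcd code: `simulate`, the channel names other than `applyc`, and the event labels are all invented for illustration, and the `waitForWAL` flag shows one hypothetical ordering in which the applier waits for persistence, not necessarily how etcd would address the issue.

```go
package main

import (
	"fmt"
	"sync"
)

// simulate runs one entry through a toy two-goroutine pipeline and returns
// the order of events. With waitForWAL=false the applier acknowledges without
// waiting for persistence, mirroring the window described above.
func simulate(waitForWAL bool) []string {
	var mu sync.Mutex
	var events []string
	record := func(e string) {
		mu.Lock()
		events = append(events, e)
		mu.Unlock()
	}

	applyc := make(chan string, 1)
	walGate := make(chan struct{}) // closing it models the OS rescheduling the writer
	walDone := make(chan struct{})
	acked := make(chan struct{})

	// Stand-in for the raft loop: hand the entry to the applier (step 12),
	// then write it to the wal (step 13) once it is scheduled again.
	go func() {
		applyc <- "k=v"
		<-walGate // pre-empted here
		record("wal-write")
		close(walDone)
	}()

	// Stand-in for the apply path (steps 14 to 18).
	go func() {
		<-applyc
		if waitForWAL {
			<-walDone // hypothetical ordering: apply only after persistence
		}
		record("apply")
		record("ack")
		close(acked)
	}()

	if waitForWAL {
		close(walGate)
		<-acked
	} else {
		<-acked // the user already sees OK ...
		close(walGate) // ... before the writer gets to run
	}
	<-walDone
	return events
}

func main() {
	fmt.Println(simulate(false)) // prints [apply ack wal-write]
	fmt.Println(simulate(true))  // prints [wal-write apply ack]
}
```

In the first run, a crash between "ack" and "wal-write" would lose an acknowledged write, which is the scenario this issue describes.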
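For readers who want to picture this kind of fault injection, here is a minimal sketch of a flag-file-controlled sleep and exit. It is not the actual patch used for this report; `flagPresent` and `maybeCrash` are hypothetical helper names, and only the path `/tmp/exitnow` is taken from the repro steps in this issue.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// flagFile is the control file used in the repro steps of this issue.
const flagFile = "/tmp/exitnow"

// flagPresent reports whether the control file exists.
func flagPresent(path string) bool {
	_, err := os.Stat(path)
	return err == nil
}

// maybeCrash is a hypothetical injection point. If the control file exists,
// it sleeps to widen the race window and then exits the process abruptly,
// imitating a crash before the code that follows the call site can run.
func maybeCrash(path string, delay time.Duration) {
	if !flagPresent(path) {
		return
	}
	time.Sleep(delay)
	os.Exit(1)
}

func main() {
	fmt.Println("flag present:", flagPresent(flagFile))
	maybeCrash(flagFile, 2*time.Second)
	fmt.Println("no flag file, continuing normally")
}
```

Gating the injection on a file lets the same binary run normally after `rm /tmp/exitnow`, so the restart check in the repro uses unmodified behaviour.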
Before repro please do the changes:
2. Do the code changes in tx.go

- Rebuild etcd server
Now follow the steps to repro
//1. Start etcd server with changes
//2. Add a key value. Allow etcdserver to acknowledge and exit immediately (with just sleep and exit to simulate the explanation)
$ touch /tmp/exitnow; ./bin/etcdctl put /k1 v1
OK
//3. Remove this control flag file and restart the etcd server
$ rm /tmp/exitnow
//4. Check if key present
$ ./bin/etcdctl get /k --prefix
$
// We can see no key-value
