[Fourth solution] Fix the potential data loss for clusters with only one member (raft layer change) #14411
Closed
ahrtr wants to merge 2 commits into etcd-io:main from
Codecov Report
@@ Coverage Diff @@
## main #14411 +/- ##
==========================================
- Coverage 75.56% 75.26% -0.30%
==========================================
Files 457 458 +1
Lines 37183 37202 +19
==========================================
- Hits 28098 28001 -97
- Misses 7335 7432 +97
- Partials 1750 1769 +19
ptabor (Contributor) approved these changes on Sep 2, 2022
Thank you. This change looks good to me.
The fixed tests are of huge value, regardless of whether we go with Step or the in-line approach.
For a cluster with only one member, raft always sends identical unstable entries and committed entries to etcdserver, and etcd responds to the client once it finishes (in fact, only partially finishes) the apply workflow.

When the client receives the response, that does not mean etcd has already persisted the data to both BoltDB and the WAL, because:
1. etcd commits the BoltDB transaction periodically instead of on each request;
2. etcd saves WAL entries in parallel with applying the committed entries.

Accordingly, data may be lost if etcd crashes immediately after responding to the client and before BoltDB and the WAL have successfully saved the data to disk. Note that this issue can only happen for clusters with only one member.

For clusters with multiple members, this isn't an issue, because etcd will not commit and apply the data before it has been replicated to a majority of members. When the client receives the response, the data must have been applied, which in turn means it must have been committed. Note: for clusters with multiple members, raft will never send identical unstable entries and committed entries to etcdserver.

Signed-off-by: Benjamin Wang <wachao@vmware.com>
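
The single-member condition described in the commit message above (the same entries arriving as both unstable and committed) can be illustrated with a small Go sketch. This is not the change made in this PR and it does not use the real raft package: `Entry`, `Ready`, and `mustPersistBeforeApply` are hypothetical, simplified stand-ins, and the check assumes the entries in each slice are ordered by index.

```go
package main

import "fmt"

// Entry is a hypothetical, simplified stand-in for a raft log entry.
type Entry struct {
	Index uint64
	Data  string
}

// Ready is a hypothetical, simplified stand-in for the batch that raft
// hands to the server: Entries still have to be written to the WAL
// (they are "unstable"), CommittedEntries are ready to be applied.
type Ready struct {
	Entries          []Entry
	CommittedEntries []Entry
}

// mustPersistBeforeApply reports whether at least one committed entry in
// rd is also still unstable. In that case applying in parallel with the
// WAL write could acknowledge data that is not yet on disk.
// Assumption: both slices are ordered by Index.
func mustPersistBeforeApply(rd Ready) bool {
	if len(rd.Entries) == 0 || len(rd.CommittedEntries) == 0 {
		return false
	}
	firstUnstable := rd.Entries[0].Index
	lastCommitted := rd.CommittedEntries[len(rd.CommittedEntries)-1].Index
	return lastCommitted >= firstUnstable
}

func main() {
	// Single-member cluster: the same entry is both unstable and committed.
	single := Ready{
		Entries:          []Entry{{Index: 5, Data: "put k v"}},
		CommittedEntries: []Entry{{Index: 5, Data: "put k v"}},
	}
	// Multi-member cluster: committed entries are strictly older than the
	// unstable ones, because they were replicated to a majority first.
	multi := Ready{
		Entries:          []Entry{{Index: 6, Data: "put k2 v2"}},
		CommittedEntries: []Entry{{Index: 5, Data: "put k v"}},
	}
	fmt.Println(mustPersistBeforeApply(single)) // true
	fmt.Println(mustPersistBeforeApply(multi))  // false
}
```

In this sketch a caller would wait for the WAL write to finish before responding to the client whenever the function returns true; the PR itself addresses the problem inside the raft layer instead.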
1. Added one more command, "report-status", so that the leader can acknowledge that the entries have already been persisted.
2. Regenerated some test data.

Signed-off-by: Benjamin Wang <wachao@vmware.com>
The fourth solution to fix #14370