[Fourth solution] Fix the potential data loss for clusters with only one member (raft layer change)#14411

Closed
ahrtr wants to merge 2 commits intoetcd-io:mainfrom
ahrtr:one_member_data_loss_raft_protocol_ptabor
Conversation

@ahrtr
Member

@ahrtr ahrtr commented Sep 1, 2022

The fourth solution to fix #14370

@ahrtr ahrtr force-pushed the one_member_data_loss_raft_protocol_ptabor branch 2 times, most recently from 83bada5 to cf9306b Compare September 1, 2022 10:41
@codecov-commenter

codecov-commenter commented Sep 1, 2022

Codecov Report

Merging #14411 (e60cb56) into main (5707147) will decrease coverage by 0.29%.
The diff coverage is 92.30%.

@@            Coverage Diff             @@
##             main   #14411      +/-   ##
==========================================
- Coverage   75.56%   75.26%   -0.30%     
==========================================
  Files         457      458       +1     
  Lines       37183    37202      +19     
==========================================
- Hits        28098    28001      -97     
- Misses       7335     7432      +97     
- Partials     1750     1769      +19     
Flag Coverage Δ
all 75.26% <92.30%> (-0.30%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...st/interaction_env_handler_leader_report_status.go 77.77% <77.77%> (ø)
raft/node.go 87.02% <100.00%> (-0.53%) ⬇️
raft/raft.go 90.20% <100.00%> (+0.02%) ⬆️
raft/rafttest/interaction_env_handler.go 83.33% <100.00%> (+0.52%) ⬆️
server/etcdserver/api/rafthttp/peer_status.go 87.87% <0.00%> (-12.13%) ⬇️
server/auth/simple_token.go 80.00% <0.00%> (-8.47%) ⬇️
server/etcdserver/api/rafthttp/peer.go 87.01% <0.00%> (-8.45%) ⬇️
client/pkg/v3/fileutil/purge.go 66.03% <0.00%> (-7.55%) ⬇️
client/v3/experimental/recipes/queue.go 58.62% <0.00%> (-6.90%) ⬇️
client/v3/namespace/watch.go 87.87% <0.00%> (-6.07%) ⬇️
... and 21 more


@tbg tbg self-requested a review September 1, 2022 13:42
@ahrtr ahrtr force-pushed the one_member_data_loss_raft_protocol_ptabor branch from cf9306b to d1957fe Compare September 2, 2022 00:57
Contributor

@ptabor ptabor left a comment


Thank you. This change looks good to me.
The fixed tests are of huge value, regardless of whether we go with Step or the in-line approach.

@ahrtr ahrtr force-pushed the one_member_data_loss_raft_protocol_ptabor branch from 0773176 to a799416 Compare September 2, 2022 08:59
For a cluster with only one member, raft always sends identical
unstable entries and committed entries to etcdserver, and etcd
responds to the client once it finishes (actually only partially
finishes) the apply workflow.

When the client receives the response, it doesn't mean etcd has already
successfully saved the data to BoltDB and the WAL, because:
   1. etcd commits the BoltDB transaction periodically instead of on each request;
   2. etcd saves WAL entries in parallel with applying the committed entries.
Accordingly, data may be lost if etcd crashes immediately after
responding to the client but before BoltDB and the WAL have
persisted the data to disk.
Note that this issue can only happen in clusters with only one member.

For clusters with multiple members, this isn't an issue, because etcd will
not commit & apply the data before it has been replicated to a majority of members.
When the client receives the response, the data must have been applied,
which in turn means the data must have been committed.
Note: for clusters with multiple members, raft never sends identical
unstable entries and committed entries to etcdserver.

Signed-off-by: Benjamin Wang <wachao@vmware.com>
1. added one more command "report-status" so that the leader can acknowledge
   that the entries have already been persisted.
2. regenerated some test data.

Signed-off-by: Benjamin Wang <wachao@vmware.com>
@ahrtr ahrtr force-pushed the one_member_data_loss_raft_protocol_ptabor branch from a799416 to e60cb56 Compare September 2, 2022 18:03
@ahrtr
Member Author

ahrtr commented Sep 5, 2022

leave it to @tbg to fix the raft side in #14413



Successfully merging this pull request may close these issues.

Durability API guarantee broken in single node cluster
