Skip to content

raft: improve commentary on Ready#10861

Closed
tbg wants to merge 2 commits intoetcd-io:masterfrom
tbg:ready-docs
Closed

raft: improve commentary on Ready#10861
tbg wants to merge 2 commits intoetcd-io:masterfrom
tbg:ready-docs

Conversation

@tbg
Copy link
Copy Markdown
Contributor

@tbg tbg commented Jun 27, 2019

The correct way to handle a Ready was elusive and comes with a surprising
amount of pitfalls. There are also certain optimizations that need
justification and were previously not documented.

Fix both.

tbg added 2 commits June 27, 2019 14:19
The correct way to handle a Ready was elusive and comes with a
surprising amount of pitfalls. There are also certain optimizations
that need justification and were previously not documented.

Fix both.
softState() allocates for no good reason, but plenty of callers use it
in the hot path. This is probably not major or we would've knocked it
out already.
@codecov-io
Copy link
Copy Markdown

Codecov Report

Merging #10861 into master will decrease coverage by 0.09%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff            @@
##           master   #10861     +/-   ##
=========================================
- Coverage   62.82%   62.73%   -0.1%     
=========================================
  Files         398      398             
  Lines       37446    37446             
=========================================
- Hits        23526    23492     -34     
- Misses      12339    12375     +36     
+ Partials     1581     1579      -2
Impacted Files Coverage Δ
raft/node.go 92.74% <ø> (ø) ⬆️
raft/log.go 79.56% <ø> (ø) ⬆️
raft/raft.go 91.98% <ø> (+0.69%) ⬆️
clientv3/balancer/grpc1.7-health.go 17.44% <0%> (-43.03%) ⬇️
auth/options.go 57.5% <0%> (-35%) ⬇️
proxy/grpcproxy/register.go 69.44% <0%> (-11.12%) ⬇️
clientv3/naming/grpc.go 68.42% <0%> (-7.02%) ⬇️
clientv3/namespace/watch.go 87.87% <0%> (-6.07%) ⬇️
auth/range_perm_cache.go 56.96% <0%> (-3.8%) ⬇️
etcdserver/api/rafthttp/msgappv2_codec.go 69.56% <0%> (-1.74%) ⬇️
... and 24 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 948e276...64ad156. Read the comment docs.

// optimizations for simplicity. These applications will use the following
// sequence of steps:
//
// 1 apply snapshot (if any) and sync
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add dot after the number like line 60 to 62?

// be saved to stable storage, committed or sent to other peers.
// All fields in Ready are read-only.
//
// Handling the Ready can be done with or without performance optimizations.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i feel we should move this explanation to https://github.com/etcd-io/etcd/blob/master/raft/README.md and add a link here.

// appended and synced before the HardState is updated. This avoids the
// situation in which the HardState references a Commit index which is not
// durably appended yet. This does not penalize performance because updates to
// the HardState must only be synced when MustSync is set, and those occur only
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't actually true. Unfortunately, we set MustSync to true if there are entries to append but no change to HardState.Vote or HardState.Term. We might want to expose a new MustSyncHardState or something.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, right. I forgot that what I wanted to do was to change this to MustSyncHardState and add an accessor func (rd *Ready) MustSync() { return len(rd.Entries) > 0 || rd.MustSyncHardState }.

// before persisting its Entries, it won't be the leader for its old term any
// more and thus will not handle the MsgAppResp. Additionally, the fact that a
// commit index that is higher than that ultimately persisted on the crashed
// leader is now known to a follower (via the MsgApp) is benign. And, for one
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"the fact ... is benign"

Is that what you meant to say?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heh, not quite. Will fix.

Going to ping when it's ready, won't be today

@tbg
Copy link
Copy Markdown
Contributor Author

tbg commented Jul 18, 2019

Reminder to self: work this new wall of text into the growing body of this documentation project 😄

// A Snapshot is sent by the leader to catch up a follower that has fallen behind.
// The snapshot corresponds to the on-disk replicated state of the distributed
// state machine as of the log index described by the metadata's Index and Term.
// When raft asks a follower to apply a Snapshot (as part of Ready handling), it
// is informing the application that the contents of the Raft log are made obsolete
// by the Snapshot. 
//
// If all data is on the same storage engine and can be manipulated atomically,
// the procedure is easy: apply the snapshot and the remainder of the Ready (it
// will typically at least include an updated HardState to reflect a new Commit
// index) in one atomic operation.
//
// If atomicity between the two is not available, a more complicated approach
// has to be taken. Assume that the application state (i.e the Data field) can
// be written atomically along with the index it belongs to (Metadata.Index).
// Assume further that the Raft log and HardState are on different storage
// engines, respectively, and no atomicity is available between the three.
//
// The Raft log poses the smallest problem. We must not delete from it before
// the snapshot data is durably applied (otherwise we break all of the promises
// we made by accepting entries into it). Assuming that, it won't matter whether
// the log gets removed before or after the HardState is updated. Care must be
// taken when initializing the system after a crash to understand that the Raft
// log needs to be compacted synchronously if the applied index is ahead of the
// last index. To avoid watering down detection of problems in general, this
// should be considered only if the data partition holds the immediate result
// of a snapshot application following the procedure detailed later in this
// comment.
//
// The bigger
// constraint exists between the HardState and snapshot data because they
// interact via HardState.Commit: if that is updated but then the system
// crash-restarts before the snapshot data is persisted, we'll see a commit
// index ahead of the last index, which is an undefined state and will panic. On
// the other hand, if the snapshot data is persisted but not the commit index,
// we will see the applied index ahead of the commit index, which is similarly
// problematic. To avoid that, the app can adapt Storage.InitialState so that it
// takes the commit index not only from the persisted HardState, but also takes
// into account any larger applied index in its data engine. Naively this waters
// down Raft's internal detection against data loss (if the storage engine
// unexpectedly lost HardState updates, this mechanism could mask it) but the
// app can additionally make this behavior conditional: when snapshot data is
// written to the data engine, it (atomically) includes a marker. At startup,
// only when the marker is present is the applied index taken into account and
// the HardState transparently updated, and then the flag removed again.
//
// It remains to detail how the Raft log is removed. As discussed, this is the
// last step and requires essentially carrying out a log compaction
// (occasionally called truncation) that removes the entirety of the log
// entries.
message Snapshot {
	optional bytes            data     = 1;
	optional SnapshotMetadata metadata = 2 [(gogoproto.nullable) = false];
}

@stale
Copy link
Copy Markdown

stale bot commented Apr 6, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@ptabor
Copy link
Copy Markdown
Contributor

ptabor commented Sep 2, 2022

(reopening)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Development

Successfully merging this pull request may close these issues.

5 participants