Conversation
The correct way to handle a Ready was elusive and comes with a surprising amount of pitfalls. There are also certain optimizations that need justification and were previously not documented. Fix both.
softState() allocates for no good reason, but plenty of callers use it in the hot path. This is probably not major or we would've knocked it out already.
Codecov Report
@@ Coverage Diff @@
## master #10861 +/- ##
=========================================
- Coverage 62.82% 62.73% -0.1%
=========================================
Files 398 398
Lines 37446 37446
=========================================
- Hits 23526 23492 -34
- Misses 12339 12375 +36
+ Partials 1581 1579 -2
Continue to review full report at Codecov.
|
| // optimizations for simplicity. These applications will use the following | ||
| // sequence of steps: | ||
| // | ||
| // 1 apply snapshot (if any) and sync |
There was a problem hiding this comment.
add dot after the number like line 60 to 62?
| // be saved to stable storage, committed or sent to other peers. | ||
| // All fields in Ready are read-only. | ||
| // | ||
| // Handling the Ready can be done with or without performance optimizations. |
There was a problem hiding this comment.
i feel we should move this explanation to https://github.com/etcd-io/etcd/blob/master/raft/README.md and add a link here.
| // appended and synced before the HardState is updated. This avoids the | ||
| // situation in which the HardState references a Commit index which is not | ||
| // durably appended yet. This does not penalize performance because updates to | ||
| // the HardState must only be synced when MustSync is set, and those occur only |
There was a problem hiding this comment.
This isn't actually true. Unfortunately, we set MustSync to true if there are entries to append but no change to HardState.Vote or HardState.Term. We might want to expose a new MustSyncHardState or something.
There was a problem hiding this comment.
Oh, right. I forgot that what I wanted to do was to change this to MustSyncHardState and add an accessor func (rd *Ready) MustSync() { return len(rd.Entries) > 0 || rd.MustSyncHardState }.
| // before persisting its Entries, it won't be the leader for its old term any | ||
| // more and thus will not handle the MsgAppResp. Additionally, the fact that a | ||
| // commit index that is higher than that ultimately persisted on the crashed | ||
| // leader is now known to a follower (via the MsgApp) is benign. And, for one |
There was a problem hiding this comment.
"the fact ... is benign"
Is that what you meant to say?
There was a problem hiding this comment.
Heh, not quite. Will fix.
Going to ping when it's ready, won't be today
|
Reminder to self: work this new wall of text into the growing body of this documentation project 😄 // A Snapshot is sent by the leader to catch up a follower that has fallen behind.
// The snapshot corresponds to the on-disk replicated state of the distributed
// state machine as of the log index described by the metadata's Index and Term.
// When raft asks a follower to apply a Snapshot (as part of Ready handling), it
// is informing the application that the contents of the Raft log are made obsolete
// by the Snapshot.
//
// If all data is on the same storage engine and can be manipulated atomically,
// the procedure is easy: apply the snapshot and the remainder of the Ready (it
// will typically at least include an updated HardState to reflect a new Commit
// index) in one atomic operation.
//
// If atomicity between the two is not available, a more complicated approach
// has to be taken. Assume that the application state (i.e the Data field) can
// be written atomically along with the index it belongs to (Metadata.Index).
// Assume further that the Raft log and HardState are on different storage
// engines, respectively, and no atomicity is available between the three.
//
// The Raft log poses the smallest problem. We must not delete from it before
// the snapshot data is durably applied (otherwise we break all of the promises
// we made by accepting entries into it). Assuming that, it won't matter whether
// the log gets removed before or after the HardState is updated. Care must be
// taken when initializing the system after a crash to understand that the Raft
// log needs to be compacted synchronously if the applied index is ahead of the
// last index. To avoid watering down detection of problems in general, this
// should be considered only if the data partition holds the immediate result
// of a snapshot application following the procedure detailed later in this
// comment.
//
// The bigger
// constraint exists between the HardState and snapshot data because they
// interact via HardState.Commit: if that is updated but then the system
// crash-restarts before the snapshot data is persisted, we'll see a commit
// index ahead of the last index, which is an undefined state and will panic. On
// the other hand, if the snapshot data is persisted but not the commit index,
// we will see the applied index ahead of the commit index, which is similarly
// problematic. To avoid that, the app can adapt Storage.InitialState so that it
// takes the commit index not only from the persisted HardState, but also takes
// into account any larger applied index in its data engine. Naively this waters
// down Raft's internal detection against data loss (if the storage engine
// unexpectedly lost HardState updates, this mechanism could mask it) but the
// app can additionally make this behavior conditional: when snapshot data is
// written to the data engine, it (atomically) includes a marker. At startup,
// only when the marker is present is the applied index taken into account and
// the HardState transparently updated, and then the flag removed again.
//
// It remains to detail how the Raft log is removed. As discussed, this is the
// last step and requires essentially carrying out a log compaction
// (occasionally called truncation) that removes the entirety of the log
// entries.
message Snapshot {
optional bytes data = 1;
optional SnapshotMetadata metadata = 2 [(gogoproto.nullable) = false];
} |
|
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
|
(reopening) |
The correct way to handle a Ready was elusive and comes with a surprising
amount of pitfalls. There are also certain optimizations that need
justification and were previously not documented.
Fix both.