Add doc on master elections in DistributedArchitectureGuide#142435
inespot merged 10 commits into elastic:main
Details master eligibility, node roles, the election flow and failure cases.
Force-pushed from 70f489d to cf25b4b.
Pinging @elastic/es-distributed (Team:Distributed)
Pinging @elastic/core-docs (Team:Docs)
| (A node can coordinate a search across several other nodes, when the node itself does not have the data, and then return a result to the caller. Explain this coordinating role) | ||
| ### Cluster State |
Outlined some additional subsections outside of Master Elections, to tackle in subsequent PRs.
| [CoordinationMetadata]:https://github.com/elastic/elasticsearch/blob/main/server/src/main/java/org/elasticsearch/cluster/coordination/CoordinationMetadata.java | ||
| [VotingConfiguration]: https://github.com/elastic/elasticsearch/blob/v9.3.0/server/src/main/java/org/elasticsearch/cluster/coordination/CoordinationMetadata.java#L326 |
This PR uses the v9.3.0 tag for all links not pointing to top-level classes to make sure the lines stay consistent. The existing documentation is a bit varied. Some sections use specific commits (like Snapshot Repository), and others (like HTTP Server) don't use links at all, just plain function names. If people have strong opinions on which is best, happy to adjust
+1 to using a release tag like v9.3.0 because it's immutable (in practice) but please don't refer to a branch like main as these things change over time.
Sounds good, will adjust for top level classes as well!
DaveCTurner left a comment
Great stuff, thanks for this.
| The cluster maintains at most a single master at all times. If no master is |
This is the conceptual goal but it's surprisingly tricksy to define what it even means for two nodes to be master at the same time. You can definitely have two nodes which each believe they are the master (and e.g. will service TransportMasterNodeAction requests) for a while; the key point is that all but at most one of them will not be able to update the cluster state.
Maybe too early to get into this level of detail? But it is worth saying somewhere, to avoid confusion about the exact invariants on which we can rely. You mention below that we guarantee there will be at most one master in each term, and that the terms of committed cluster state updates are nondecreasing, so in a sense the term acts as a logical clock. Perhaps say in the Terms section that different nodes may be at different logical times (i.e. terms) at the same physical time?
Suggested change:
- The cluster maintains at most a single master at all times. If no master is
+ The cluster maintains (conceptually at least) at most a single master at all times. If no master is
Also maybe worth highlighting at the top that the point of electing a master (and everything else here) is purely to update the cluster state. The elected master also does other things too but the cluster-state-updating bit is the only essential bit.
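To make the "term as a logical clock" point concrete, here is a minimal sketch of a node that refuses cluster state updates from an older term. It is a toy model under stated assumptions: `TermGate` and `offer` are hypothetical names, and the acceptance rule is simplified rather than taken from [CoordinationState].

```java
// Toy model only: TermGate and offer(...) are hypothetical names, and the rule
// below is a simplification, not the logic of CoordinationState.
final class TermGate {
    private long currentTerm = 0;
    private long lastAcceptedVersion = 0;

    /** Returns true if the update is accepted; stale terms and versions are rejected. */
    boolean offer(long term, long version) {
        if (term < currentTerm) {
            return false; // sender still believes it is master, but in an older term
        }
        if (term == currentTerm && version <= lastAcceptedVersion) {
            return false; // not newer than what was already accepted in this term
        }
        currentTerm = term; // terms of accepted updates never decrease
        lastAcceptedVersion = version;
        return true;
    }

    long currentTerm() {
        return currentTerm;
    }
}
```

In this model, two nodes can both believe they are master at the same physical time (one in term 3, one in term 4), but once a node has accepted an update in term 4 it rejects anything the term-3 master sends.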
| any [ClusterState] changes until a new master is elected. | ||
| To elect a master, Elasticsearch uses a consensus algorithm derived | ||
| from [Paxos](https://lamport.azurewebsites.net/pubs/lamport-paxos.pdf). This algorithm is formally defined in a TLA+ |
That's the original paper but maybe worth also linking these for a gentler introduction too:
| To elect a master, Elasticsearch uses a consensus algorithm derived | ||
| from [Paxos](https://lamport.azurewebsites.net/pubs/lamport-paxos.pdf). This algorithm is formally defined in a TLA+ | ||
| specification referenced from the [CoordinationState] class. The [Coordinator] class handles the core logic of the | ||
| election, and manages how nodes transition between `CANDIDATE`, `LEADER`, and |
| #### Election Flow | ||
| The overall election flow looks like this: |
I think I'd rather this was more prose-like (e.g. so you can copy-paste the sentences elsewhere) - I'm not sure the boxes and arrows really add much to this straight-line flow, and they are a royal pain to maintain in future edits.
Something like this perhaps?
- Leader failure detected.
  Follower detects current master failure.
  See: LeaderChecker, Coordinator.onLeaderFailure()
- Node becomes `CANDIDATE`.
  Follower transitions to `CANDIDATE` mode, which triggers the discovery process.
  See: Coordinator.becomeCandidate(), Mode.CANDIDATE, PeerFinder.activate(...)

etc.
Works for me, I'll switch this to be pure prose
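For readers following the thread, the two steps spelled out in the suggestion above could be sketched roughly as below. This is an illustration only: `ModeSketch` and its methods are hypothetical, `FOLLOWER` is assumed as the third mode, and the real transitions live in [Coordinator], which does considerably more.

```java
// Illustrative sketch only: ModeSketch is a hypothetical class, not Coordinator.
final class ModeSketch {
    // FOLLOWER is an assumption here; the suggestion above only says a
    // "follower" transitions to CANDIDATE mode.
    enum Mode { CANDIDATE, LEADER, FOLLOWER }

    private Mode mode = Mode.FOLLOWER;
    private boolean discoveryActive = false;

    /** Step 1: the follower detects that the current master has failed. */
    void onLeaderFailure() {
        if (mode == Mode.FOLLOWER) {
            becomeCandidate();
        }
    }

    /** Step 2: switching to CANDIDATE mode triggers the discovery process. */
    private void becomeCandidate() {
        mode = Mode.CANDIDATE;
        discoveryActive = true; // stands in for activating peer discovery
    }

    Mode mode() {
        return mode;
    }

    boolean discoveryActive() {
        return discoveryActive;
    }
}
```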
| └───────────────────────────────────────────────┴──────────────────────────────────┘ | ||
| ``` | ||
| #### Failure Detection |
You asked a good question in the onboarding session about the reasons for having checks in both directions - would you cover that point here?
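As a possible aid for that section, here is a minimal sketch of the mechanics a check in either direction might share: count consecutive failed checks and report the peer as failed once a threshold is reached. Everything here is an assumption for illustration (`ConsecutiveFailureDetector`, `record`, and the `retryCount` parameter are hypothetical names), and it does not attempt to explain why checks run in both directions.

```java
// Hypothetical sketch: not the actual LeaderChecker implementation.
final class ConsecutiveFailureDetector {
    private final int retryCount; // assumed threshold of consecutive failures
    private int consecutiveFailures = 0;

    ConsecutiveFailureDetector(int retryCount) {
        if (retryCount < 1) {
            throw new IllegalArgumentException("retryCount must be at least 1");
        }
        this.retryCount = retryCount;
    }

    /** Records one check result; returns true once the peer should be treated as failed. */
    boolean record(boolean checkSucceeded) {
        if (checkSucceeded) {
            consecutiveFailures = 0; // any success resets the count
            return false;
        }
        consecutiveFailures++;
        return consecutiveFailures >= retryCount;
    }
}
```

In this model a follower would hold one detector for the leader, and the leader would hold one detector per follower, so each side can react independently when its checks stop succeeding.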
| the next `handleWakeUp` iteration | ||
| When [receiving](https://github.com/elastic/elasticsearch/blob/v9.3.0/server/src/main/java/org/elasticsearch/discovery/PeerFinder.java#L534) | ||
| a [PeersResponse], [PeerFinder] will reach out to all peers specified in the response, including a potential master. If |
Nit but maybe worth mentioning that we also reach back out to nodes that send us requests for peers.
| [DiscoveryPlugin]: https://github.com/elastic/elasticsearch/blob/main/server/src/main/java/org/elasticsearch/plugins/DiscoveryPlugin.java | ||
| Discovery is a fast "gossip-like" protocol by which a node in `CANDIDATE` mode locates master-eligible nodes in the |
Not sure if you want to mention how fast we mean by "fast" but FWIW it will discover every other master-eligible node in at most something like ⌈log₂(D)+1⌉ steps where D is the diameter of the graph of seed host configurations.
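To give a feel for the logarithmic bound, here is a toy simulation, not a model of [PeerFinder] itself. It assumes nodes arranged in an undirected line (so the diameter D is n - 1), and that in each round every node learns everything its known peers knew at the start of the round. `GossipSketch`, `roundsToDiscoverAll`, and `stepBound` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Toy model under the assumptions stated above; not PeerFinder's algorithm.
final class GossipSketch {
    /** Rounds until every node on an undirected line of n nodes knows all n nodes. */
    static int roundsToDiscoverAll(int n) {
        List<BitSet> known = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            BitSet s = new BitSet(n);
            s.set(i); // a node knows itself
            if (i > 0) s.set(i - 1); // and its immediate neighbours
            if (i < n - 1) s.set(i + 1);
            known.add(s);
        }
        int rounds = 0;
        while (!allComplete(known, n)) {
            List<BitSet> next = new ArrayList<>();
            for (int i = 0; i < n; i++) {
                BitSet current = known.get(i);
                BitSet merged = (BitSet) current.clone();
                // Merge in what each known peer knew at the start of this round.
                for (int j = current.nextSetBit(0); j >= 0; j = current.nextSetBit(j + 1)) {
                    merged.or(known.get(j));
                }
                next.add(merged);
            }
            known = next;
            rounds++;
        }
        return rounds;
    }

    private static boolean allComplete(List<BitSet> known, int n) {
        for (BitSet s : known) {
            if (s.cardinality() != n) return false;
        }
        return true;
    }

    /** ceil(log2(diameter) + 1) for diameter >= 1, computed with integer arithmetic. */
    static int stepBound(int diameter) {
        return (32 - Integer.numberOfLeadingZeros(diameter - 1)) + 1;
    }
}
```

In this model the distance each node can "see" doubles every round, which is where the logarithm comes from: a line of 9 nodes (D = 8) completes within the bound of ⌈log₂(8)+1⌉ = 4 steps.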
Details the master election flow.
ES-14214