Skip to content

Add CRUD doc to the DistributedArchitectureGuide#144710

Merged
inespot merged 21 commits intoelastic:mainfrom
inespot:doc/crud
Mar 31, 2026
Merged

Add CRUD doc to the DistributedArchitectureGuide#144710
inespot merged 21 commits intoelastic:mainfrom
inespot:doc/crud

Conversation

@inespot
Copy link
Copy Markdown
Contributor

@inespot inespot commented Mar 22, 2026

Documents the full write path under the Indexing / CRUD subsection (coordinator to primary to replication), plus seq_no / primary term and checkpoint behavior (including out-of-order replication and MSU vs LCP).

@github-actions
Copy link
Copy Markdown
Contributor

github-actions bot commented Mar 22, 2026

🔍 Preview links for changed docs

@github-actions
Copy link
Copy Markdown
Contributor

ℹ️ Important: Docs version tagging

👋 Thanks for updating the docs! Just a friendly reminder that our docs are now cumulative. This means all 9.x versions are documented on the same page and published off of the main branch, instead of creating separate pages for each minor version.

We use applies_to tags to mark version-specific features and changes.

Expand for a quick overview

When to use applies_to tags:

✅ At the page level to indicate which products/deployments the content applies to (mandatory)
✅ When features change state (e.g. preview, ga) in a specific version
✅ When availability differs across deployments and environments

What NOT to do:

❌ Don't remove or replace information that applies to an older version
❌ Don't add new information that applies to a specific version without an applies_to tag
❌ Don't forget that applies_to tags can be used at the page, section, and inline level

🤔 Need help?

@inespot
Copy link
Copy Markdown
Contributor Author

inespot commented Mar 25, 2026

I left out Lucene and locking details here since they’re covered (or slated) in other parts of the doc. But happy to reconsider and add short subsection in this section as a FLUP if preferred.

@inespot inespot marked this pull request as ready for review March 25, 2026 14:46
@inespot inespot added the :Distributed/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. label Mar 25, 2026
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Meta label for distributed team. label Mar 25, 2026
@elasticsearchmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@inespot inespot added documentation >non-issue and removed Team:Distributed Meta label for distributed team. labels Mar 25, 2026
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Meta label for distributed team. label Mar 25, 2026
@fcofdez fcofdez self-requested a review March 25, 2026 15:28
[routedBasedOnClusterVersion](https://github.com/elastic/elasticsearch/blob/v9.3.0/server/src/main/java/org/elasticsearch/action/support/replication/TransportReplicationAction.java#L1021)
field. Each forward updates it to the forwarding node’s cluster state version so the next receiver
[is on at least](https://github.com/elastic/elasticsearch/blob/v9.3.0/server/src/main/java/org/elasticsearch/action/support/replication/TransportReplicationAction.java#L898)
that version before it redirects again.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

learning: if i read code correctly, it will then retry once again on next cluster update, what happens if that fails? 🤔 fail but then what happens after

not sure whether worth adding info in here

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I am not mistaken, the retries are bounded by the request's timeout. So on each retry, the node will check whether or not the timeout has expired and fail the request if so.

Copy link
Copy Markdown
Contributor Author

@inespot inespot Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can add a note about this in the doc! 196dc4

The primary will first try to
[acquire](https://github.com/elastic/elasticsearch/blob/v9.3.0/server/src/main/java/org/elasticsearch/action/support/replication/TransportReplicationAction.java#L484)
an operation permit via [IndexShardOperationPermits]. Note that recovery and relocation are operations that can
intentionally grab all available permits with the goal of blocking in-flight writes. If the permit is successfully
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

learning: what happens if we can't grab a permit 🤔

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We delay the operation until permits are again available in "expected" situations (e.g. recovery or relocation holding all permits) or throw a scary error in unexpected ones, which then fails the request.
Happy to add a note about the delayed case!

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 196dc4


The Distributed team owns Elasticsearch's write path. A typical write begins as an HTTP/REST request, is routed to
the appropriate primary shard, is applied to the shard's engine and translog, and is finally replicated across replicas
before an ack is sent back to the client.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure whether worth documenting stateless behavior here 🤔

public knowledge AFAIK and might be handy, up to you

Suggested change
before an ack is sent back to the client.
the appropriate primary shard, is applied to the shard's engine and translog, and is finally replicated across replicas
before an ack is sent back to the client.
On stateless ack is sent after operation is added to translog and uploaded to blob storage, no replication needed for durability.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stateless probably deserves its own dedicated section (which I was actually planning on tackling next 😇), but happy to add a note here in the meantime too!

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

leaving it with yourself, looks great as is! :)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a note in 196dc4

@szybia
Copy link
Copy Markdown
Contributor

szybia commented Mar 25, 2026

these are so useful for learning btw, ty!

Copy link
Copy Markdown
Contributor

@fcofdez fcofdez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! I left a few comments

before an ack is sent back to the client.

Note that in serverless Elasticsearch, durability is instead provided by writing to the translog and uploading to blob
storage. There is no replication process. See this
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say that we rely on the blob store properties to achieve the same reliability and fault tolerance.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in 1a8073


### Coordinator: REST to Transport

A user sends a `PUT` or `POST` to `/{index}/_doc/{id}` (or `POST /{index}/_doc` with no id for auto-generated ids) to
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that it's more interesting to describe the bulk API and maybe describe a bit the ingest pipelines and all the process that happens on the coordinating layer?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 61c853

If the client did not supply `_id`, one is
[generated](https://github.com/elastic/elasticsearch/blob/v9.3.0/server/src/main/java/org/elasticsearch/action/bulk/BulkOperation.java#L333)
before routing. The latest applied [ClusterState](#cluster-state) supplies routing for the chosen index (`routingTable`,
[ShardRouting](#cluster-state)). A routing key is chosen (by default the document `_id` but the request may
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should mention that the default routing key is computed differently depending on the index type.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 1a8073 but let me know if there are any additional details you want to add

request to the engine (typically [InternalEngine], see [Engine](#engine) section for more details) that
[generates](https://github.com/elastic/elasticsearch/blob/v9.3.0/server/src/main/java/org/elasticsearch/index/engine/InternalEngine.java#L1239)
the sequence number for the operation, [updates](https://github.com/elastic/elasticsearch/blob/v9.3.0/server/src/main/java/org/elasticsearch/index/engine/InternalEngine.java#L1262)
Lucene via an `IndexWriter`, and then
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we can mention why we append to the translog after indexing into Lucene?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a tentative note in 1a8073, but please let me know if that does capture all the relevant reasons!


The primary term is a monotonically increasing counter per shard, recorded in cluster state metadata (see
[IndexMetadata#primaryTerm](https://github.com/elastic/elasticsearch/blob/v9.3.0/server/src/main/java/org/elasticsearch/cluster/metadata/IndexMetadata.java#L1199)).
It advances when a primary is (re)selected, e.g. after a full restart or when a replica is promoted. The shard stamps
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when a replica is promoted can be a bit misleading as that happens during a regular relocation and in that case we don't bump the primary term. Maybe mention somehow that it's bumped after a primary crash?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 1a8073

and each [ReplicaResponse](https://github.com/elastic/elasticsearch/blob/v9.3.0/server/src/main/java/org/elasticsearch/action/support/replication/TransportReplicationAction.java#L1284)
returns the replica’s local checkpoint and last synced global checkpoint).

Note that the `InternalEngine` also tracks the
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we should mention that when _ids are auto generated we don't need to check for duplicates under certain circumstances too?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a note about it in 61c853

which is the sequence number up to which all in-sync copies have
processed every earlier operation. The [ReplicationTracker] computes it as the minimum of those per-copy local checkpoints among in-sync shard copies.
The primary [updates](https://github.com/elastic/elasticsearch/blob/v9.3.0/server/src/main/java/org/elasticsearch/index/seqno/ReplicationTracker.java#L1375)
the global checkpoint whenever an in-sync copy’s local checkpoint advances, when the primary activates or takes over
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we can mention GlobalCheckpointSyncAction too?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 1a8073

Copy link
Copy Markdown
Contributor

@michalborek michalborek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Copy Markdown
Contributor

@fcofdez fcofdez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for adding this!

@inespot inespot merged commit 1b38b00 into elastic:main Mar 31, 2026
12 checks passed
szybia added a commit to szybia/elasticsearch that referenced this pull request Mar 31, 2026
…rics

* upstream/main: (21 commits)
  Mute org.elasticsearch.xpack.esql.qa.mixed.MixedClusterEsqlSpecIT test {csv-spec:external-basic.topSnippetsFunction} elastic#145353
  Mute org.elasticsearch.xpack.esql.qa.mixed.MixedClusterEsqlSpecIT test {csv-spec:external-basic.scoreFunction} elastic#145352
  [DiskBBQ] Fix bug in NeighborQueue#popRawAndAddRaw (elastic#145324)
  Fix dense_vector default index options when using BFLOAT16 (elastic#145202)
  Use checked exceptions in entitlement constructor rules (elastic#145234)
  ESQL: DS: datasource file plugins should not return TEXT types (elastic#145334)
  Plumb DLM error store through to DlmFrozenTransition classes (elastic#145243)
  Make Settings.Builder.remove() fluent (elastic#145294)
  Add FLS tests for METRICS_INFO and TS_INFO (elastic#145211)
  Fix flaky SecurityFeatureResetTests (elastic#145063)
  [DOCS] Fix conflict markers in ESQL processing command list (elastic#145338)
  Skip certain metric assertions on Windows (elastic#144933)
  [ES|QL] Add schema reconciliation for multi-file external sources (elastic#145220)
  Simplify DiskBBQ dynamic visit ratio to linear (elastic#142784)
  ESQL: Disallow unmapped_fields=load with partial non-KEYWORD (elastic#144109)
  [Transform] Track Linked Projects (elastic#144399)
  Fix bulk scoring to process last batch instead of falling through to scalar tail (elastic#145316)
  Clean up TickerScheduleEngineTests (elastic#145303)
  [CI] ShardBulkInferenceActionFilterIT testRestart - Ensuring that secrets-inference index is available after full restart and unmuting test (elastic#145317)
  Add CRUD doc to the DistributedArchitectureGuide (elastic#144710)
  ...
ncordon pushed a commit to ncordon/elasticsearch that referenced this pull request Apr 1, 2026
* Add CRUD doc to the DistributedArchitectureGuide

* Primary routing and execution

* Nit

* Fill out Replication

* Primary Terms & Sequence Numbers

* Fill out the checkpoint and gaps section

* Style nits

* Nits and typos

* Retries, failures and serverless notes

* Refer to blog post instead of serverless section

* Clarify result returned to transport caller

* Some clarifications and notes

* Bulk request + auto generated id optimization

* Typo

* Fix internal marker

* Small inaccuracy
mromaios pushed a commit to mromaios/elasticsearch that referenced this pull request Apr 9, 2026
* Add CRUD doc to the DistributedArchitectureGuide

* Primary routing and execution

* Nit

* Fill out Replication

* Primary Terms & Sequence Numbers

* Fill out the checkpoint and gaps section

* Style nits

* Nits and typos

* Retries, failures and serverless notes

* Refer to blog post instead of serverless section

* Clarify result returned to transport caller

* Some clarifications and notes

* Bulk request + auto generated id optimization

* Typo

* Fix internal marker

* Small inaccuracy
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Distributed/Distributed A catch all label for anything in the Distributed Area. Please avoid if you can. documentation >non-issue Team:Distributed Meta label for distributed team. v9.4.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants