[ES|QL] Add schema reconciliation for multi-file external sources by costin · Pull Request #145220 · elastic/elasticsearch

costin · 2026-03-30T16:41:28Z

Adds schema reconciliation for external file sources that span multiple files
with potentially different schemas. Users select a strategy via the WITH clause:

FROM "s3://bucket/data/*.parquet" WITH {"schema_resolution": "strict"}
FROM "s3://bucket/data/*.parquet" WITH {"schema_resolution": "union_by_name"}

Problem

FIRST_FILE_WINS (the current default) silently ignores schema differences
across files. This causes wrong column values when files have different column
ordering, or runtime errors when types don't match — with no clear message
pointing to the root cause.

Strategies

- STRICT: all files must share the exact same schema. Fails at planning
  time with a descriptive error naming the offending file and column, and a
  hint to use union_by_name.
- UNION_BY_NAME: merges schemas into a superset by column name. Missing
  columns are NULL-filled. Only lossless type widening is allowed (INTEGER→LONG,
  INTEGER→DOUBLE, DATETIME→DATE_NANOS). LONG→DOUBLE is rejected (lossy above
  2^53), matching DuckDB and Spark consensus.
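
To see why LONG→DOUBLE counts as lossy while INTEGER→DOUBLE does not, here is a small standalone check. It is plain Java arithmetic and does not use any Elasticsearch class; the class and method names are made up for illustration.

```java
// Standalone arithmetic check; names are hypothetical and not part of the PR.
final class LongToDoubleLoss {
    // True if converting v to double and back yields the same long.
    static boolean roundTrips(long v) {
        return (long) (double) v == v;
    }

    public static void main(String[] args) {
        long limit = 1L << 53; // 2^53
        // 2^53 itself is exactly representable as a double.
        System.out.println(roundTrips(limit));
        // 2^53 + 1 is not: the nearest double is 2^53, so the value changes.
        System.out.println(roundTrips(limit + 1));
        // Every 32-bit integer fits in a double's 53-bit significand.
        System.out.println(roundTrips(Integer.MAX_VALUE));
        System.out.println(roundTrips(Integer.MIN_VALUE));
    }
}
```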

The default remains first_file_wins for backward compatibility.
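
As a rough illustration of the STRICT behavior described above, the sketch below compares every file's schema with the first file's and fails with a message naming the file and column. It assumes a simplified model (schemas as maps from column name to type name) and compares names and types only; whether the PR also treats column order as part of the "exact same schema" is not stated here. `StrictCheckSketch` and `check` are hypothetical names, and the error text is invented.

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

// Hypothetical sketch; not the PR's SchemaReconciliation class.
final class StrictCheckSketch {
    // Throws if any file's schema differs from the first file's schema.
    // schemasByFile maps a file path to its (column name -> type name) schema.
    static void check(Map<String, Map<String, String>> schemasByFile) {
        String referenceFile = null;
        Map<String, String> reference = null;
        for (Map.Entry<String, Map<String, String>> entry : schemasByFile.entrySet()) {
            if (reference == null) {
                referenceFile = entry.getKey();
                reference = entry.getValue();
                continue;
            }
            // Look at every column seen in either schema so that missing
            // and extra columns are both reported.
            Set<String> columns = new LinkedHashSet<>(reference.keySet());
            columns.addAll(entry.getValue().keySet());
            for (String column : columns) {
                String expected = reference.get(column);
                String actual = entry.getValue().get(column);
                if (!Objects.equals(expected, actual)) {
                    throw new IllegalArgumentException(
                        "Schema mismatch in [" + entry.getKey() + "] for column [" + column
                            + "]: expected [" + expected + "] as in [" + referenceFile
                            + "] but found [" + actual
                            + "]; consider \"schema_resolution\": \"union_by_name\"");
                }
            }
        }
    }
}
```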

Design rationale

Informed by a survey of DuckDB, Spark, ClickHouse, and Cribl — see
SCHEMA_RECONCILIATION_DESIGN.md. Both strategies scan all file metadata
eagerly at planning time (footer reads only, parallelized with bounded
concurrency). This also collects per-file statistics for aggregate pushdown
Phase 2 at zero extra cost.
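
One way to bound the number of concurrent footer reads is a semaphore around each submitted task. The sketch below assumes that approach; `BoundedMetadataReads`, `readAll`, and the limit of 4 are illustrative choices, not the PR's `ExternalSourceResolver` or its actual `MAX_PARALLEL_METADATA_READS` value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.Semaphore;
import java.util.function.Function;

// Hypothetical sketch of bounded parallel metadata reads.
final class BoundedMetadataReads {
    // Assumed limit for illustration only.
    static final int MAX_PARALLEL_READS = 4;

    // Applies readFooter to every path with at most MAX_PARALLEL_READS calls in flight.
    // Results are returned in the same order as the input paths.
    static <T> List<T> readAll(List<String> paths, Function<String, T> readFooter) {
        Semaphore permits = new Semaphore(MAX_PARALLEL_READS);
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            List<Future<T>> futures = new ArrayList<>();
            for (String path : paths) {
                permits.acquire(); // blocks the submitter once the limit is reached
                futures.add(pool.submit(() -> {
                    try {
                        return readFooter.apply(path);
                    } finally {
                        permits.release(); // free the slot even if the read fails
                    }
                }));
            }
            List<T> results = new ArrayList<>();
            for (Future<T> future : futures) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while reading metadata", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Metadata read failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Because the same pass visits every file, the function passed as `readFooter` could return both the schema and per-file statistics in one result object.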

Type widening uses a custom schemaWiden() rather than EsqlDataTypeConverter.commonType(),
which allows the lossy LONG→DOUBLE conversion. Column matching is case-sensitive,
consistent with ESQL's field resolution semantics.
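
A minimal sketch of union-by-name merging with restricted widening follows. It assumes a simplified model in which a schema is an ordered map from column name to a type enum; `UnionByNameSketch`, `Type`, `widen`, and `union` are hypothetical names, and `widen` is only a stand-in for the PR's `schemaWiden()`, whose real signature is not shown here.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model; not the PR's SchemaReconciliation class.
final class UnionByNameSketch {
    enum Type { INTEGER, LONG, DOUBLE, DATETIME, DATE_NANOS, KEYWORD }

    // Returns the wider of two types, or throws if no lossless widening exists.
    static Type widen(Type a, Type b) {
        if (a == b) return a;
        if (isWidening(a, b)) return b;
        if (isWidening(b, a)) return a;
        throw new IllegalArgumentException("Incompatible types: " + a + " vs " + b);
    }

    // Only the widenings listed in the PR description are accepted;
    // LONG -> DOUBLE is deliberately absent.
    private static boolean isWidening(Type from, Type to) {
        return (from == Type.INTEGER && (to == Type.LONG || to == Type.DOUBLE))
            || (from == Type.DATETIME && to == Type.DATE_NANOS);
    }

    // Merges per-file schemas into a superset keyed by column name.
    // Map keys are compared exactly, so "id" and "Id" stay separate columns.
    static Map<String, Type> union(List<Map<String, Type>> fileSchemas) {
        Map<String, Type> merged = new LinkedHashMap<>();
        for (Map<String, Type> schema : fileSchemas) {
            for (Map.Entry<String, Type> column : schema.entrySet()) {
                merged.merge(column.getKey(), column.getValue(), UnionByNameSketch::widen);
            }
        }
        return merged;
    }
}
```

Columns that a given file lacks simply do not appear in that file's map; at read time they would be NULL-filled against the merged schema.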

Developed with AI-assisted tooling

elasticsearchmachine · 2026-03-30T16:41:54Z

Pinging @elastic/es-analytical-engine (Team:Analytics)

elasticsearchmachine · 2026-03-30T16:41:55Z

Hi @costin, I've created a changelog YAML for you.

Adds planning-time schema reconciliation for external file sources that span multiple files with potentially different schemas. Supports STRICT (exact match) and UNION_BY_NAME (merge by column name with safe type widening) strategies via the schema_resolution WITH clause parameter.

Core changes:

- SchemaReconciliation: reconciliation algorithms with ColumnMapping that handles both planning-time use and wire serialization (CastType enum ordinals, no strings on the wire)
- SchemaAdaptingIterator: execution-time adapter that reorders columns, inserts NULL blocks for missing columns, and casts blocks for type widening (INTEGER→LONG, INTEGER→DOUBLE, DATETIME→DATE_NANOS)
- ExternalSourceResolver: parallel metadata reading with bounded concurrency (semaphore + MAX_PARALLEL_METADATA_READS)
- FileSplit carries nullable ColumnMapping; FileSplitProvider shares the same instance across all splits from the same file via dedup cache

Developed with AI-assisted tooling

Wire SchemaAdaptingIterator into the async execution path and fix partition column dimensionality in both sync/async paths by using attributes.subList(0, columnCount()). Demote per-query schema reconciliation timing log to debug. Add ColumnMapping round-trip serialization tests and unit tests for SchemaAdaptingIterator covering all cast types, null fill, reorder, empty page, failure cleanup, and close delegation.

- Sync factory: mapping-aware column projection in openFileSplit
- SchemaAdaptingIterator: constructor invariant for schema/mapping size
- Async factory: wire adaptSchema into startMultiFileRead path
- Remove unused logger field and trivial FileSplit adaptSchema overload
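
The execution-time adaptation described in these commits (reorder, NULL-fill, cast) can be pictured with the sketch below. It is a deliberately simplified model: columns are plain `Object[]` arrays rather than ES|QL pages and blocks, only the two integer casts are modeled, and `SchemaAdaptSketch`, `ColumnMap`, and `adapt` are hypothetical names rather than the PR's `SchemaAdaptingIterator` or `ColumnMapping`.

```java
// Hypothetical model of per-file schema adaptation; not the PR's implementation.
final class SchemaAdaptSketch {
    enum Cast { NONE, INT_TO_LONG, INT_TO_DOUBLE }

    // Describes one output column: which file column feeds it and how to cast it.
    static final class ColumnMap {
        final int sourceIndex; // -1 when the file does not contain the column
        final Cast cast;

        ColumnMap(int sourceIndex, Cast cast) {
            this.sourceIndex = sourceIndex;
            this.cast = cast;
        }
    }

    // Rearranges file columns into the unified schema layout.
    // fileColumns[c][r] is the value of file column c at row r.
    static Object[][] adapt(Object[][] fileColumns, int rowCount, ColumnMap[] mapping) {
        Object[][] out = new Object[mapping.length][];
        for (int i = 0; i < mapping.length; i++) {
            ColumnMap m = mapping[i];
            if (m.sourceIndex < 0) {
                out[i] = new Object[rowCount]; // missing column: every row is null
            } else {
                out[i] = castColumn(fileColumns[m.sourceIndex], m.cast);
            }
        }
        return out;
    }

    // Widens Integer values; null values stay null.
    private static Object[] castColumn(Object[] values, Cast cast) {
        if (cast == Cast.NONE) return values;
        Object[] result = new Object[values.length];
        for (int row = 0; row < values.length; row++) {
            if (values[row] == null) continue;
            int v = (Integer) values[row];
            if (cast == Cast.INT_TO_LONG) {
                result[row] = (long) v;
            } else {
                result[row] = (double) v;
            }
        }
        return result;
    }
}
```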

bpintea

LG, except the serialization.

bpintea · 2026-03-31T06:52:42Z

...plugin/esql/src/main/java/org/elasticsearch/xpack/esql/datasources/SchemaReconciliation.java

+                if (ordinal < 0 || ordinal >= VALUES.length) {
+                    throw new IllegalArgumentException("Unknown cast ordinal: " + ordinal);
+                }
+                return VALUES[ordinal];


This will probably cause bwc issues when we update this enum. We usually serialize the strings to prevent that.

Switch from hand-rolled ordinal byte encoding to the standard ES writeEnum/readEnum pattern. Add assertEnumSerialization test to pin ordinal-to-value mapping per ES convention.
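
The backward-compatibility concern with ordinals is that reordering or removing a constant silently changes what an old node decodes. A pinning test guards against that. The sketch below shows the idea in plain Java, without Elasticsearch's stream or test utilities; the `CastType` constants listed are assumed for illustration, since the real enum is not shown in this thread, and `assertPinned` is a hypothetical helper.

```java
import java.util.List;

// Hypothetical stand-in for an ordinal-serialized enum and its pinning check.
final class EnumPinningSketch {
    // Assumed constants; appending new ones at the end keeps old ordinals stable.
    enum CastType { NONE, INT_TO_LONG, INT_TO_DOUBLE, DATETIME_TO_DATE_NANOS }

    // Decodes an ordinal read from the wire, rejecting out-of-range values.
    static CastType fromOrdinal(int ordinal) {
        CastType[] values = CastType.values();
        if (ordinal < 0 || ordinal >= values.length) {
            throw new IllegalArgumentException("Unknown cast ordinal: " + ordinal);
        }
        return values[ordinal];
    }

    // Fails when the declared order no longer matches the expected wire order,
    // for example after a constant is reordered, removed, or inserted.
    static void assertPinned(List<CastType> expectedInOrdinalOrder) {
        CastType[] actual = CastType.values();
        if (actual.length != expectedInOrdinalOrder.size()) {
            throw new AssertionError("Enum size changed: " + actual.length);
        }
        for (int i = 0; i < actual.length; i++) {
            if (actual[i] != expectedInOrdinalOrder.get(i)) {
                throw new AssertionError("Ordinal " + i + " is now " + actual[i]);
            }
        }
    }
}
```

A test would call `assertPinned` with the constants written out in their wire order, so any change to the enum forces a conscious update of that list.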

bpintea

🤖-assisted review.

bpintea · 2026-03-31T14:27:41Z

@costin, just noticed, the PR description needs updating too.

costin added the >enhancement, :Analytics/ES|QL (AKA ESQL), v9.4.0, and ES|QL|DS (ES|QL datasources) labels Mar 30, 2026

costin requested a review from bpintea March 30, 2026 16:41

costin enabled auto-merge (squash) March 30, 2026 16:41

elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Mar 30, 2026

costin added 4 commits March 31, 2026 00:59

Update docs/changelog/145220.yaml

2f1bcad

costin force-pushed the esql/schema-reconciliation branch from c8489aa to 2f1bcad Compare March 31, 2026 09:45

bpintea reviewed Mar 31, 2026

costin force-pushed the esql/schema-reconciliation branch from 190084c to e5ef28e Compare March 31, 2026 12:51

Use writeEnum/readEnum for CastType serialization

b69c1d8

costin force-pushed the esql/schema-reconciliation branch from e5ef28e to b69c1d8 Compare March 31, 2026 13:03

bpintea approved these changes Mar 31, 2026

costin merged commit ef3486e into elastic:main Mar 31, 2026
33 of 35 checks passed

costin deleted the esql/schema-reconciliation branch March 31, 2026 14:26

costin merged 5 commits into elastic:main from costin:esql/schema-reconciliation
