First, kudos to @jimczi for the elegant work on synthetic vectors (#130382) and reindex handling (#130834)! 🙌 These PRs inspired me to explore a generalized solution for _source pruning.
Current Limitation
When _source is configured with include/exclude (docs), the excluded fields are physically removed from the stored _source. This can cause irreversible data loss: operations like reindex, update, and update_by_query rely on an intact _source, so they silently drop the pruned fields. While #130834 patched this for vector fields, the problem persists for every other field type. Users can reasonably expect that document sources are reconstructed from doc values (as we now do for vectors).
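To make the failure mode concrete, here is a toy model of the behavior described above. It is not Elasticsearch code: `prune_source` and `reindex` are hypothetical helpers, and the pattern matching only covers top-level field names, whereas real source filtering works on full field paths.

```python
from fnmatch import fnmatchcase


def prune_source(source, excludes):
    """Drop top-level fields whose name matches any exclude pattern.

    Hypothetical stand-in for index-time _source pruning.
    """
    return {
        name: value
        for name, value in source.items()
        if not any(fnmatchcase(name, pattern) for pattern in excludes)
    }


def reindex(stored_sources):
    """Copy documents using only what is stored in _source."""
    return [dict(doc) for doc in stored_sources]


original = {"title": "report", "views": 42, "embedding": [0.1, 0.2]}

# Index time: the excluded fields never reach the stored _source.
stored = prune_source(original, excludes=["views", "embed*"])

# Reindex time: only the stored _source is available to copy.
copied = reindex([stored])[0]

# Fields that existed in the original document but did not survive.
missing = sorted(set(original) - set(copied))
print(missing)
```

In this sketch the copy succeeds without any error, which is the point: nothing signals that `views` and `embedding` were lost.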
The synthetic_vectors workaround demonstrates that the problem is solvable, but it only moves the cost into hidden technical debt by creating:
Field-specific switches that need constant maintenance
Special-case logic that must be reimplemented per data type
Proposed Solution
Extend the hybrid model from #130382 to all field types by:
Automatic mode detection
If _source is intact → use the traditional stored _source (Mode.STORED).
If _source is in synthetic mode → use synthetic source (Mode.SYNTHETIC).
If _source has include/exclude rules → automatically enable hybrid source (Mode.HYBRID).
graph LR
A[_source config] -->|has includes/excludes| B[Hybrid Mode]
B --> C[Stored: included fields]
B --> D[Synthetic: excluded fields from doc_values]
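The detection rules above could be sketched roughly as follows. This is an illustration only, not the actual mapper code: `detect_mode` is a hypothetical function, `Mode.HYBRID` is the mode proposed in this issue rather than an existing one, and checking synthetic mode before include/exclude is my assumption, since the proposal does not define a precedence.

```python
from enum import Enum


class Mode(Enum):
    STORED = "stored"
    SYNTHETIC = "synthetic"
    HYBRID = "hybrid"  # proposed in this issue, not an existing mode


def detect_mode(source_config):
    """Pick a source mode from a _source mapping config (a plain dict here)."""
    # Assumption: an explicit synthetic mode wins over include/exclude rules.
    if source_config.get("mode") == "synthetic":
        return Mode.SYNTHETIC
    # Any non-empty include or exclude list switches to the hybrid model.
    if source_config.get("includes") or source_config.get("excludes"):
        return Mode.HYBRID
    # No pruning configured: _source is intact.
    return Mode.STORED
```

For example, `detect_mode({"excludes": ["embeddings"]})` returns `Mode.HYBRID`, while an empty config returns `Mode.STORED`.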
Universal field support
Reconstruct ANY pruned field (not just vectors) using existing doc_values:
Numerics, dates, keywords → Direct from doc_values
Vectors → Existing vector reconstruction
Geo/nested → New reconstruction handlers
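One possible shape for this, sketched under assumptions: a registry of per-type handlers that rebuilds pruned fields from doc values and merges them into the stored part of _source. All names here (`HANDLERS`, `hybrid_source`, the handler functions) are hypothetical, and doc values are modeled as plain Python lists rather than the real Lucene structures.

```python
def reconstruct_scalar(values):
    """Numerics, dates, keywords: a single value, or a list if multi-valued."""
    return values[0] if len(values) == 1 else list(values)


def reconstruct_vector(values):
    """Vectors: return the components as a list of floats."""
    return [float(v) for v in values]


# Hypothetical registry: field type -> reconstruction handler.
# Geo and nested types are deliberately absent; they would need new handlers.
HANDLERS = {
    "long": reconstruct_scalar,
    "date": reconstruct_scalar,
    "keyword": reconstruct_scalar,
    "dense_vector": reconstruct_vector,
}


def hybrid_source(stored_source, doc_values, field_types):
    """Merge the stored fields with pruned fields rebuilt from doc values."""
    source = dict(stored_source)
    for field, values in doc_values.items():
        if field in source:
            continue  # the stored _source already has this field
        handler = HANDLERS.get(field_types[field])
        if handler is None:
            raise NotImplementedError(f"no handler for type {field_types[field]!r}")
        source[field] = handler(values)
    return source


stored = {"title": "report"}
doc_values = {"views": [42], "tags": ["a", "b"], "embedding": [0.1, 0.2]}
field_types = {"views": "long", "tags": "keyword", "embedding": "dense_vector"}
rebuilt = hybrid_source(stored, doc_values, field_types)
```

Keeping the handlers in one registry is what would replace the per-field switches: supporting a new type means adding one entry instead of a new setting.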
Zero-config upgrade
Fully backward compatible. Users get hybrid behavior automatically when pruning _source.
Advantages
Generality: It solves the problem of missing pruned fields in reindex/update operations for any field type, not just vectors.
Transparency: Users don't need to enable a separate setting (like synthetic_vectors) for specific fields. The behavior is automatically triggered by the _source pruning configuration.
Consistency: It unifies the handling of pruned fields and synthetic source.
Eliminates the need for field-specific switches like synthetic_vectors
Unlocks new use cases:
// Keep small metadata in _source, reconstruct heavy fields on demand
"_source": {
  "includes": ["meta.*"],
  "excludes": ["embeddings", "logs"]
}
Next Steps
I would like to get feedback from the team (@jimczi and others involved in the related PRs, cc @benwtrent @martijnvg) on the feasibility and desirability of this approach.
If the team agrees this aligns with Elasticsearch's direction, I am willing to contribute. I plan to start by writing a unit test that demonstrates the problem (reindex silently drops pruned non-vector fields) and then propose a solution following the hybrid model.
Looking forward to your thoughts!