ESQL: Improve field reference tracking in FORK command#137678
ESQL: Improve field reference tracking in FORK command#137678ioanatia merged 7 commits intoelastic:mainfrom
FORK command#137678Conversation
- Enhance field reference tracking for FORK command branches - Add branch-specific keep references tracking with `currentBranchKeepRefs` - Refine logic for determining when to project all fields in FORK branches - Update field collection strategy to handle more complex query scenarios - Modify test cases to validate new field reference tracking behavior The changes improve the field reference collection mechanism in ESQL's FORK command, providing more precise field selection and projection logic across different branch scenarios.
|
The following tests in
|
|
Pinging @elastic/es-analytical-engine (Team:Analytics) |
|
Pinging @elastic/es-search-relevance (Team:Search Relevance) |
fang-xing-esql
left a comment
There was a problem hiding this comment.
Thanks for the PR, @kanoshiou! Looking at the changes in the existing tests, I noticed some FORK tests changed to ALL_FIELDS. Is there a specific reason why they need to be ALL_FIELDS? The old(existing) behavior of FORK looks fine to me, but let me know if I'm missing something.
The original issue was created for subqueries, so I didn't expect the behavior changes to FORK unless there is a good reason.
Also a heaps up to @ioanatia, to confirm whether the changes related to FORK are fine.
| (STATS x = count(*), y=min(z)) | ||
| | WHERE x > y | ||
| """, Set.of("_index", "x", "y", "a", "c", "z", "y.*", "x.*", "z.*", "a.*", "c.*")); | ||
| """, ALL_FIELDS); |
There was a problem hiding this comment.
A few of the FORK related tests change to ALL_FIELDS, I wonder why they should be ALL_FIELDS?
|
Thanks for the review, @fang-xing-esql! I believe this is a bug in it should return all fields, but currently only returns I think this could be a good opportunity to fix the |
ioanatia
left a comment
There was a problem hiding this comment.
thank you for looking into this.
while I agree this fixes a bug we currently have, @fang-xing-esql correctly noted that we might be returning more fields than we need
I left a suggestion for a change.
In the long term, I think FieldNameUtils would benefit from a refactor, as I myself find it hard to follow up what is going on.
| ( EVAL x = "abc" | EVAL y = "aaa" ) | ||
| | STATS a = count(*) WHERE _fork == "fork1", | ||
| b = max(_fork)""", Set.of("_index", "first_name", "emp_no", "last_name", "last_name.*", "first_name.*", "emp_no.*")); | ||
| b = max(_fork)""", ALL_FIELDS); |
There was a problem hiding this comment.
because we have a STATS after FORK, we don't need to return all fields.
| // - "fork (eval x = 1 | keep x) (eval y = 2 | keep y)" → specific fields only (both branches have KEEP) | ||
| if (currentBranchKeepRefs.get().isEmpty() | ||
| && (referencesBuilder.get().isEmpty() | ||
| || false == forkBranch.anyMatch(forkPlan -> shouldCollectReferencedFields(forkPlan, inlinestatsAggs)))) { |
There was a problem hiding this comment.
I think what we are missing here is another check to see if there are any plans after FORK that reduce the columns to a known set - such as Project or Aggregate.
If we have from test | fork (eval x = 1 | keep x) (where true) | stats c = count(*), with the approach from this PR we will be requesting all the fields.
we could have something like:
Holder<Boolean> reduceColumnsAfterFork = new Holder<>(false);
forEachDownProcessor.set((LogicalPlan p, Holder<Boolean> breakEarly) -> {// go over each plan top-down
if (lastSeenFork.get() == null && shouldCollectReferencedFields(p, inlinestatsAggs)) {
reduceColumnsAfterFork.set(true);
}
if (p instanceof Fork fork) {
...
then this check could be:
if (currentBranchKeepRefs.get().isEmpty()
&& (referencesBuilder.get().isEmpty()
|| false == forkBranch.anyMatch(forkPlan -> shouldCollectReferencedFields(forkPlan, inlinestatsAggs)))
&& false == reduceColumnsAfterFork.get()) {
but we would also need to make sure we set lastSeenFork earlier, not after we collected the references from the FORK branches.
Add a `reduceColumnsAfterFork` flag to track if there are plans after FORK that reduce columns to a known set (e.g., Project, Aggregate). Previously, for queries like: from test | fork (eval x = 1 | keep x) (where true) | stats c = count(*) The second branch would trigger projectAll because it has no KEEP command, causing all fields to be requested from field_caps. With this change, we now check if there's a column-reducing plan after FORK before deciding to request all fields. If such a plan exists (like `stats`), we only request the necessary fields.
|
Thank you @ioanatia, for the review and the detailed guidance! I've implemented the Updated and ready for another look! |
|
@elasticsearchmachine test this please |
|
@elasticsearchmachine test this please |
|
Hi @ioanatia, The tests pass locally for me. It might be an issue specific to the CI runner. |
|
Thanks a lot for your efforts on this PR, @ioanatia @fang-xing-esql! 😊 |
This change fixes the field resolution inefficiency in
FORKcommands where queries likewere unnecessarily requesting all fields (*) from
field_capsinstead of only the minimal required fields.The fix introduces branch-specific
KEEPreference tracking throughcurrentBranchKeepRefsand replaces the overly conservative empty reference check with nuanced logic that properly distinguishes between branches that genuinely need all fields (like filter-only branches) versus those with explicit field constraints, resulting in more efficient field resolution and reduced overhead during the logical plan optimization step.Closes #137283