Fix field resolution for FORK#128193
Conversation
|
Pinging @elastic/es-analytical-engine (Team:Analytics) |
astefan
left a comment
There was a problem hiding this comment.
From code and tests in csv-spec files I noticed that FORK is "generating" a _fork (sort of hidden) field. Can you, please, add tests with queries that make use of this _fork field?
astefan
left a comment
There was a problem hiding this comment.
The query below (from rerank.csv-spec) is using as a field name _fork which is strange (many other queries using fork and rrf do this). This is the full list of field names it's asking for: book_no, author, title, _fork, title.*, _fork.*, author.*, book_no.*.
FROM books METADATA _id, _index, _score
| FORK ( WHERE title:"Tolkien" | SORT _score, _id DESC | LIMIT 3 )
( WHERE author:"Tolkien" | SORT _score, _id DESC | LIMIT 3 )
| RRF
| RERANK "Tolkien" ON title WITH test_reranker
| LIMIT 2
| KEEP book_no, title, author
EsqlSession.fieldNames automatically ignores MetadataAttributes (like _id, _score, _index and others). I'm sure you've discussed about this: what was the reason for not considering _fork as a MetadataAttribute?
| | WHERE a > 2000 | ||
| | EVAL b = a + 100 | ||
| | FORK (WHERE c > 1 AND a < 10000 | EVAL d = a + 500) | ||
| (STATS x = count(*), y=min(z)) |
There was a problem hiding this comment.
hmmm - now that I am looking at this more - I think we should return all fields, because one branch is not bounded by any command like KEEP or STATS that resets the output to a known list of fields. will fix 🤔
There was a problem hiding this comment.
There is one more query where I am confused, in part because of probably not knowing what is the expectation for fork.
FROM employees | FORK ( WHERE true | stats min(salary) by gender) ( WHERE true | LIMIT 3 )
This shows these columns:
min(salary) | gender | _fork | salary
Should these be a "union" kind of set of columns and fieldNames is only limiting it to what it "sees" in the fork's first branch? If that's true, this means that fieldNames should consider a union kind of field names from all the branches of fork. As a shortcut, the first branch that it finds that's the "widest" it should stop checking the rest.
There was a problem hiding this comment.
The FORK output should be a union of the outputs of the fork branches.
If we detect that the same field exists in multiple fork branches outputs, but the field has different type, we fail with a validation error.
FORK will pad with null columns the output of FORK branches that are missing fields.
For example:
ROW a = [1, 2, 3], b = "foo"
| MV_EXPAND a
| FORK (WHERE true | LIMIT 2)
(STATS x = count(*))
(WHERE a % 2 == 0 | EVAL y = 2)
| SORT _fork, a
| KEEP _fork, a, b, x, y
Will produce:
_fork | a | b | x | y
---------------+---------------+---------------+---------------+---------------
fork1 |1 |foo |null |null
fork1 |2 |foo |null |null
fork2 |null |null |3 |null
fork3 |2 |foo |null |2
I think fixed the field resolution for this case.
In this test example here we should ask for all fields, since not all branches are bounded by an Aggregate or Project.
We could consider |
Should close #127208
Part of #121950
AFAICS we don't need to ask for all fields when FORK is used and the current field resolution should work since FORK is a N-ary plan.