Setup
We use the colors index from the ES|QL CSV tests. To load it:
./gradlew :x-pack:plugin:esql:qa:testFixtures:loadCsvSpecData --args="http://elastic:changeme@localhost:9200"
Index a doc that is missing the rgb_vector dense_vector field:
POST colors/_doc/90002
{
"color": "aaabbbb",
"hex_code": "#0"
}
Bug 1: Docs with missing dense_vector have priority
In the following example, the subretriever is returning at most 100 documents:
POST colors/_search
{
"retriever": {
"diversify": {
"type": "mmr",
"field": "rgb_vector",
"lambda": 0.5,
"rank_window_size": 100,
"size": 3,
"retriever": {
"standard": {
"query": {
"match_all": {}
},
"sort": {
"hex_code": "asc"
}
}
}
}
}
}
However, the diversify retriever chooses to keep the document that is missing a value for the dense_vector field:
{
"took": 7,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 66,
"relation": "eq"
},
"max_score": 1,
"hits": [
{
"_index": "colors",
"_id": "MSRA_5sB5WgNsq7we649",
"_score": 1,
"_source": {
"id": 21,
"color": "navy",
"hex_code": "#000080",
"primary": "false"
}
},
{
"_index": "colors",
"_id": "UiRA_5sB5WgNsq7we649",
"_score": 1,
"_source": {
"id": 54,
"color": "black",
"hex_code": "#000000",
"primary": "true"
}
},
{
"_index": "colors",
"_id": "90002",
"_score": 1,
"_source": {
"color": "aaabbbb",
"hex_code": "#0"
}
}
]
}
}
The same happens for the equivalent ES|QL query:
FROM colors METADATA _id
| SORT hex_code
| LIMIT 100
| MMR ON rgb_vector LIMIT 3
| KEEP _id, color, rgb_vector
response:
_id | color | rgb_vector
--------------------+---------------+-----------------
90002 |aaabbbb |null
UiRA_5sB5WgNsq7we649|black |[0.0, 0.0, 0.0]
MSRA_5sB5WgNsq7we649|navy |[0.0, 0.0, 128.0]
Bug 2: Diversify retriever fails with unhelpful error
When the subretriever returns only documents that are missing the dense_vector field, we raise an error:
POST colors/_search
{
"retriever": {
"diversify": {
"type": "mmr",
"field": "rgb_vector",
"lambda": 0.5,
"retriever": {
"standard": {
"query": {
"ids": {
"values": [
90002
]
}
}
}
}
}
}
}
response:
{
"error": {
"root_cause": [
{
"type": "status_exception",
"reason": "[diversify] search failed - retrievers '[diversify]' returned errors. All failures are attached as suppressed exceptions.",
"suppressed": [
{
"type": "status_exception",
"reason": "Failed to retrieve vectors for field [rgb_vector]. Is it a [dense_vector] field?"
}
]
}
],
"type": "status_exception",
"reason": "[diversify] search failed - retrievers '[diversify]' returned errors. All failures are attached as suppressed exceptions.",
"suppressed": [
{
"type": "status_exception",
"reason": "Failed to retrieve vectors for field [rgb_vector]. Is it a [dense_vector] field?"
}
]
},
"status": 400
}
Side note: if the subretriever uses match_none, so that no docs are returned at all, no error is raised. It is odd that we raise an error only when the subretriever returns docs and all of them are missing the dense_vector field.
The equivalent ES|QL command does not return an error:
FROM colors METADATA _id
| WHERE _id == "90002"
| LIMIT 100
| MMR ON rgb_vector LIMIT 3
| KEEP _id, color, rgb_vector
response:
_id | color | rgb_vector
---------------+---------------+---------------
90002 |aaabbbb |null
Bug 3: Docs with missing dense_vector are dropped
This contradicts Bug 1, but it looks like when the subretriever returns fewer documents than the MMR limit, we drop the documents that are missing the dense_vector field:
POST colors/_search
{
"retriever": {
"diversify": {
"type": "mmr",
"field": "rgb_vector",
"lambda": 0.5,
"retriever": {
"standard": {
"query": {
"bool": {
"should": [
{
"ids": {
"values": [
90002
]
}
},
{
"match": {
"color": "red"
}
}
]
}
}
}
}
}
}
}
response:
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": 4.0350413,
"hits": [
{
"_index": "colors",
"_id": "ICRA_5sB5WgNsq7we649",
"_score": 4.0350413,
"_source": {
"id": 4,
"color": "red",
"hex_code": "#FF0000",
"primary": "true"
}
}
]
}
}
The same happens in ES|QL:
FROM colors METADATA _id
| WHERE _id == "90002" OR color:"red"
| LIMIT 100
| MMR ON rgb_vector LIMIT 10
| KEEP _id, color, rgb_vector
response:
_id | color | rgb_vector
--------------------+---------------+-----------------
ICRA_5sB5WgNsq7we649|red |[255.0, 0.0, 0.0]
Expected behaviour
When the subretriever returns documents that are missing the dense_vector field, we'd expect the diversify retriever to first keep documents that have a dense_vector value.
After that, if we still have fewer documents than the size param, the diversify retriever could keep some of the docs that have a null value for the dense_vector field, so that we return up to size documents.
The same should apply to ES|QL.
Alternatively, we could go the other way and say that the diversify retriever / MMR command always drops documents that are missing the dense_vector field. But the first option is preferable: users who want to drop the docs that are missing the dense_vector field before applying MMR can do so either by adding a WHERE field IS NOT NULL in ES|QL before MMR, or by adding a filter to the diversify retriever.
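To make the preferred option concrete, here is a minimal Python sketch of the selection order described above. It is not the Elasticsearch implementation. The names `mmr_select`, `cosine` and `mmr_score` are hypothetical, and the use of cosine similarity, the use of the doc score as relevance, and breaking ties by input order are all assumptions made for illustration.

```python
import math
from typing import List, Optional, Sequence, Tuple

# (id, relevance score, vector or None when the dense_vector field is missing)
Doc = Tuple[str, float, Optional[Sequence[float]]]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero norm (assumption)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_score(doc: Doc, selected: List[Doc], lam: float) -> float:
    """lam * relevance minus (1 - lam) * highest similarity to an already selected doc."""
    redundancy = max((cosine(doc[2], s[2]) for s in selected), default=0.0)
    return lam * doc[1] - (1 - lam) * redundancy


def mmr_select(docs: List[Doc], size: int, lam: float = 0.5) -> List[str]:
    """Hypothetical selection: docs with a vector first, then pad with null-vector docs."""
    candidates = [d for d in docs if d[2] is not None]
    without_vector = [d for d in docs if d[2] is None]
    selected: List[Doc] = []

    # Step 1: greedy MMR over the docs that have a vector.
    while candidates and len(selected) < size:
        best = max(candidates, key=lambda d: mmr_score(d, selected, lam))
        candidates.remove(best)
        selected.append(best)

    # Step 2: if there is still room, pad with docs that are missing the vector,
    # keeping the order in which the subretriever returned them.
    selected.extend(without_vector[: size - len(selected)])
    return [d[0] for d in selected]


# Shape of Bug 3: one doc with a vector and one without, with a limit above 2.
print(mmr_select([("red", 4.0, [255.0, 0.0, 0.0]), ("90002", 1.0, None)], size=10))
```

Under these assumptions, the doc that is missing the vector is never picked ahead of a doc that has one (Bug 1), a result set made up only of such docs is returned rather than rejected (Bug 2), and such docs are kept when there is room left (Bug 3).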