MMR: Inconsistent behaviour for null values #142939

### Setup

We are using the `colors` index from the ES|QL CSV tests. To load it:

```
./gradlew :x-pack:plugin:esql:qa:testFixtures:loadCsvSpecData --args="http://elastic:changeme@localhost:9200"
```

Index a doc that is missing the dense_vector field `rgb_vector`:

```
POST colors/_doc/90002
{
    "color": "aaabbbb",
    "hex_code": "#0"
}
```

### Bug 1: Docs with missing dense_vector have priority

In the following example, the subretriever is returning at most 100 documents:

```
POST colors/_search
{
  "retriever": {
    "diversify": {
      "type": "mmr",
      "field": "rgb_vector",
      "lambda": 0.5,
      "rank_window_size": 100,
      "size": 3,
      "retriever": {
        "standard": {
          "query": {
            "match_all": {}
          },
          "sort": {
            "hex_code": "asc"
          }
        }
      }
    }
  }
}

```

However, even though plenty of documents with a vector are available, the diversify retriever chooses to keep the document that is missing a value for the dense_vector field `rgb_vector`:

```
{
  "took": 7,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 66,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "colors",
        "_id": "MSRA_5sB5WgNsq7we649",
        "_score": 1,
        "_source": {
          "id": 21,
          "color": "navy",
          "hex_code": "#000080",
          "primary": "false"
        }
      },
      {
        "_index": "colors",
        "_id": "UiRA_5sB5WgNsq7we649",
        "_score": 1,
        "_source": {
          "id": 54,
          "color": "black",
          "hex_code": "#000000",
          "primary": "true"
        }
      },
      {
        "_index": "colors",
        "_id": "90002",
        "_score": 1,
        "_source": {
          "color": "aaabbbb",
          "hex_code": "#0"
        }
      }
    ]
  }
}

```

The same happens for the equivalent ES|QL query:

```
 FROM colors METADATA _id
        | SORT hex_code
        | LIMIT 100
        | MMR ON rgb_vector LIMIT 3
        | KEEP _id, color, rgb_vector
```

response:

```
        _id         |     color     |   rgb_vector    
--------------------+---------------+-----------------
90002               |aaabbbb        |null             
UiRA_5sB5WgNsq7we649|black          |[0.0, 0.0, 0.0]  
MSRA_5sB5WgNsq7we649|navy           |[0.0, 0.0, 128.0]
```

### Bug 2: Diversify retriever fails with unhelpful error

When the subretriever returns only documents that are missing the dense_vector field, we raise an error:

```
POST colors/_search
{
  "retriever": {
    "diversify": {
      "type": "mmr",
      "field": "rgb_vector",
      "lambda": 0.5,
      "retriever": {
        "standard": {
          "query": {
            "ids": {
              "values": [
                90002
              ]
            }
          }
        }
      }
    }
  }
}
```

response:

```
{
  "error": {
    "root_cause": [
      {
        "type": "status_exception",
        "reason": "[diversify] search failed - retrievers '[diversify]' returned errors. All failures are attached as suppressed exceptions.",
        "suppressed": [
          {
            "type": "status_exception",
            "reason": "Failed to retrieve vectors for field [rgb_vector]. Is it a [dense_vector] field?"
          }
        ]
      }
    ],
    "type": "status_exception",
    "reason": "[diversify] search failed - retrievers '[diversify]' returned errors. All failures are attached as suppressed exceptions.",
    "suppressed": [
      {
        "type": "status_exception",
        "reason": "Failed to retrieve vectors for field [rgb_vector]. Is it a [dense_vector] field?"
      }
    ]
  },
  "status": 400
}

```
Side note: if the subretriever uses `match_none`, so that no docs are returned at all, no error is raised. It is odd that we only raise an error when the subretriever does return docs and all of them are missing the dense_vector field.


The equivalent ES|QL command does not return an error:

```
  FROM colors METADATA _id
        | WHERE _id == "90002"
        | LIMIT 100
        | MMR ON rgb_vector LIMIT 3
        | KEEP _id, color, rgb_vector
```
response:

```
      _id      |     color     |  rgb_vector   
---------------+---------------+---------------
90002          |aaabbbb        |null           

```



### Bug 3: Docs with missing dense_vector are dropped

This contradicts Bug 1, but it looks like when the subretriever returns fewer documents than the MMR limit, we drop the documents that are missing the dense_vector field:

```
POST colors/_search
{
  "retriever": {
    "diversify": {
      "type": "mmr",
      "field": "rgb_vector",
      "lambda": 0.5,
      "retriever": {
        "standard": {
          "query": {
            "bool": {
              "should": [
                {
                  "ids": {
                    "values": [
                      90002
                    ]
                  }
                },
                {
                    "match": {
                      "color": "red"
                    }
                }
              ]
            }
          }
        }
      }
    }
  }
}

```

response (note that `hits.total.value` is 2, but only one hit is returned):

```
{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 4.0350413,
    "hits": [
      {
        "_index": "colors",
        "_id": "ICRA_5sB5WgNsq7we649",
        "_score": 4.0350413,
        "_source": {
          "id": 4,
          "color": "red",
          "hex_code": "#FF0000",
          "primary": "true"
        }
      }
    ]
  }
}
```


The same happens in ES|QL:

```
        FROM colors METADATA _id
        | WHERE _id == "90002" OR color:"red"
        | LIMIT 100
        | MMR ON rgb_vector LIMIT 10
        | KEEP _id, color, rgb_vector
```

```
        _id         |     color     |   rgb_vector    
--------------------+---------------+-----------------
ICRA_5sB5WgNsq7we649|red            |[255.0, 0.0, 0.0]
```


### Expected behaviour

When the subretriever returns documents that are missing the dense_vector field, we'd expect the diversify retriever to first keep documents that have a `dense_vector` value.
After that, if we still have fewer documents than the `size` param, the diversify retriever could keep some of the docs that have a null value for the dense_vector field, so that we return up to `size` documents.
The same should apply to ES|QL.
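
To make the proposed ordering concrete, here is a minimal Python sketch of "vectors first, nulls as filler". It is not the Elasticsearch implementation: the function name `mmr_nulls_last`, the `(doc_id, relevance, vector)` tuple shape, and the use of cosine similarity are all assumptions made for illustration.

```python
import math
from typing import List, Optional, Sequence, Tuple

# Hypothetical document shape: (doc_id, relevance, vector or None).
Doc = Tuple[str, float, Optional[Sequence[float]]]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; treated as 0.0 when either vector has zero norm."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def mmr_nulls_last(docs: List[Doc], size: int, lam: float = 0.5) -> List[str]:
    """Select up to `size` doc ids: MMR over docs with a vector, then null-vector docs."""
    with_vec = [d for d in docs if d[2] is not None]
    without_vec = [d for d in docs if d[2] is None]

    selected: List[Doc] = []
    candidates = list(with_vec)
    while candidates and len(selected) < size:
        def mmr_score(doc: Doc) -> float:
            # Penalise similarity to the closest already-selected document.
            max_sim = max((cosine(doc[2], s[2]) for s in selected), default=0.0)
            return lam * doc[1] - (1.0 - lam) * max_sim

        best = max(candidates, key=mmr_score)  # ties keep subretriever order
        selected.append(best)
        candidates.remove(best)

    # Only if there is still room, fill with null-vector docs in their original order.
    selected.extend(without_vec[: size - len(selected)])
    return [d[0] for d in selected]
```

Under this sketch, the Bug 1 input would no longer return `90002` while three docs with a vector are available, the Bug 2 input would return `90002` instead of an error, and the Bug 3 input would return both `red` and `90002`.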

Or we could go the other way around, and say that the diversify retriever / MMR command always drops documents that are missing the dense_vector field. But the first option is preferable: users who want to drop the docs that are missing the dense_vector field before applying MMR can already do so, either by adding a `WHERE field IS NOT NULL` in ES|QL before MMR, or by adding a `filter` to the diversify retriever.
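
As an illustration of the opt-out path, the sketch below builds a search body that excludes null-vector docs before diversification. It makes two assumptions: that an `exists` query matches only docs with a value for the dense_vector field, and that the filter is placed on the `standard` subretriever rather than on the diversify retriever itself. The helper name `diversify_body_without_nulls` is hypothetical.

```python
import json


def diversify_body_without_nulls(field: str, size: int, rank_window_size: int) -> dict:
    """Build a _search body whose subretriever only returns docs that have `field`."""
    return {
        "retriever": {
            "diversify": {
                "type": "mmr",
                "field": field,
                "lambda": 0.5,
                "rank_window_size": rank_window_size,
                "size": size,
                "retriever": {
                    "standard": {
                        "query": {"match_all": {}},
                        # Assumption: `exists` drops docs missing the vector field.
                        "filter": {"exists": {"field": field}},
                    }
                },
            }
        }
    }


# Body for `POST colors/_search`, mirroring the Bug 1 request.
print(json.dumps(diversify_body_without_nulls("rgb_vector", 3, 100), indent=2))
```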




