MMR: Inconsistent behaviour for null values #142939

@ioanatia

Setup

We use the colors index from the ES|QL CSV tests. To load it:

./gradlew :x-pack:plugin:esql:qa:testFixtures:loadCsvSpecData --args="http://elastic:changeme@localhost:9200"

Index a doc that is missing the dense_vector field rgb_vector:

POST colors/_doc/90002
{
    "color": "aaabbbb",
    "hex_code": "#0"
}

Bug 1: Docs with missing dense_vector have priority

In the following example, the subretriever returns at most 100 documents:

POST colors/_search
{
  "retriever": {
    "diversify": {
      "type": "mmr",
      "field": "rgb_vector",
      "lambda": 0.5,
      "rank_window_size": 100,
      "size": 3,
      "retriever": {
        "standard": {
          "query": {
            "match_all": {}
          },
          "sort": {
            "hex_code": "asc"
          }
        }
      }
    }
  }
}

However, the diversify retriever chooses to keep the document that has no value for the dense_vector field:

{
  "took": 7,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 66,
      "relation": "eq"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "colors",
        "_id": "MSRA_5sB5WgNsq7we649",
        "_score": 1,
        "_source": {
          "id": 21,
          "color": "navy",
          "hex_code": "#000080",
          "primary": "false"
        }
      },
      {
        "_index": "colors",
        "_id": "UiRA_5sB5WgNsq7we649",
        "_score": 1,
        "_source": {
          "id": 54,
          "color": "black",
          "hex_code": "#000000",
          "primary": "true"
        }
      },
      {
        "_index": "colors",
        "_id": "90002",
        "_score": 1,
        "_source": {
          "color": "aaabbbb",
          "hex_code": "#0"
        }
      }
    ]
  }
}

The same happens for the equivalent ES|QL query:

FROM colors METADATA _id
| SORT hex_code
| LIMIT 100
| MMR ON rgb_vector LIMIT 3
| KEEP _id, color, rgb_vector

response:

        _id         |     color     |   rgb_vector    
--------------------+---------------+-----------------
90002               |aaabbbb        |null             
UiRA_5sB5WgNsq7we649|black          |[0.0, 0.0, 0.0]  
MSRA_5sB5WgNsq7we649|navy           |[0.0, 0.0, 128.0]

Bug 2: Diversify retriever fails with unhelpful error

When the subretriever returns only documents that are missing the dense_vector field, we raise an error:

POST colors/_search
{
  "retriever": {
    "diversify": {
      "type": "mmr",
      "field": "rgb_vector",
      "lambda": 0.5,
      "retriever": {
        "standard": {
          "query": {
            "ids": {
              "values": [
                90002
              ]
            }
          }
        }
      }
    }
  }
}

response:

{
  "error": {
    "root_cause": [
      {
        "type": "status_exception",
        "reason": "[diversify] search failed - retrievers '[diversify]' returned errors. All failures are attached as suppressed exceptions.",
        "suppressed": [
          {
            "type": "status_exception",
            "reason": "Failed to retrieve vectors for field [rgb_vector]. Is it a [dense_vector] field?"
          }
        ]
      }
    ],
    "type": "status_exception",
    "reason": "[diversify] search failed - retrievers '[diversify]' returned errors. All failures are attached as suppressed exceptions.",
    "suppressed": [
      {
        "type": "status_exception",
        "reason": "Failed to retrieve vectors for field [rgb_vector]. Is it a [dense_vector] field?"
      }
    ]
  },
  "status": 400
}

Side note: if the subretriever uses match_none, so that no docs are returned at all, no error is raised. It is odd that we raise an error only when the subretriever returns docs and all of them are missing the dense_vector field.

The equivalent ES|QL command does not return an error:

FROM colors METADATA _id
| WHERE _id == "90002"
| LIMIT 100
| MMR ON rgb_vector LIMIT 3
| KEEP _id, color, rgb_vector

response:

      _id      |     color     |  rgb_vector   
---------------+---------------+---------------
90002          |aaabbbb        |null           

Bug 3: Docs with missing dense_vector are dropped

This contradicts the first bug: when the subretriever returns fewer documents than the MMR limit, it looks like we drop the documents that are missing the dense_vector field:

POST colors/_search
{
  "retriever": {
    "diversify": {
      "type": "mmr",
      "field": "rgb_vector",
      "lambda": 0.5,
      "retriever": {
        "standard": {
          "query": {
            "bool": {
              "should": [
                {
                  "ids": {
                    "values": [
                      90002
                    ]
                  }
                },
                {
                  "match": {
                    "color": "red"
                  }
                }
              ]
            }
          }
        }
      }
    }
  }
}

response:

{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 2,
      "relation": "eq"
    },
    "max_score": 4.0350413,
    "hits": [
      {
        "_index": "colors",
        "_id": "ICRA_5sB5WgNsq7we649",
        "_score": 4.0350413,
        "_source": {
          "id": 4,
          "color": "red",
          "hex_code": "#FF0000",
          "primary": "true"
        }
      }
    ]
  }
}

The same happens in ES|QL:

        FROM colors METADATA _id
        | WHERE _id == "90002" OR color:"red"
        | LIMIT 100
        | MMR ON rgb_vector LIMIT 10
        | KEEP _id, color, rgb_vector

response:

        _id         |     color     |   rgb_vector    
--------------------+---------------+-----------------
ICRA_5sB5WgNsq7we649|red            |[255.0, 0.0, 0.0]

Expected behaviour

When the subretriever returns documents that are missing the dense_vector field, we'd expect the diversify retriever to first keep documents that have a dense_vector value.
After that, if we still have fewer documents than the size param, the diversify retriever could keep some of the docs that have a null value for the dense_vector field, so that we return up to size documents.
The same should apply to ES|QL.

Or we could go the other way around and say that the diversify retriever / MMR command always drops documents that are missing the dense_vector field. But the first option is preferable: users who want to drop the docs that are missing the dense_vector field before applying MMR can already do so, either by adding a WHERE field IS NOT NULL in ES|QL before MMR, or by adding a filter to the diversify retriever.
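
To make the preferred (first) option concrete, here is a minimal Python sketch of MMR selection that diversifies over docs with a vector first and only then backfills with docs whose vector is null. It is an illustration of the proposed semantics, not the Elasticsearch implementation: the function names, the tuple layout, and the use of cosine similarity (treated as 0 for zero-norm vectors such as black) are all assumptions made for this example.

```python
import math
from typing import List, Optional, Sequence, Tuple

# Hypothetical document shape: (doc_id, relevance_score, vector or None).
Doc = Tuple[str, float, Optional[Sequence[float]]]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; assumed to be 0.0 when either vector has zero norm."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def mmr_nulls_last(docs: List[Doc], size: int, lam: float = 0.5) -> List[str]:
    """Pick up to `size` doc ids: MMR over docs with vectors, then null-vector docs."""
    with_vec = [d for d in docs if d[2] is not None]
    without_vec = [d for d in docs if d[2] is None]

    selected: List[Doc] = []
    candidates = list(with_vec)
    while candidates and len(selected) < size:

        def mmr_score(doc: Doc) -> float:
            # Redundancy is the highest similarity to anything already selected.
            redundancy = max((cosine(doc[2], s[2]) for s in selected), default=0.0)
            return lam * doc[1] - (1.0 - lam) * redundancy

        best = max(candidates, key=mmr_score)  # ties keep sub-retriever order
        selected.append(best)
        candidates.remove(best)

    # Backfill with null-vector docs only if there is still room.
    selected.extend(without_vec[: size - len(selected)])
    return [d[0] for d in selected]


# Mirrors the Bug 3 setup: one doc without a vector, one with.
docs = [("90002", 1.0, None), ("red", 4.0350413, [255.0, 0.0, 0.0])]
print(mmr_nulls_last(docs, size=10))  # ['red', '90002']
```

Under these assumptions, the Bug 2 input (only doc 90002) would return that doc rather than an error, and the Bug 1 input would return 90002 only when fewer than size docs with a vector are available.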

Metadata

Labels

:Search Relevance/ES|QL (Search functionality in ES|QL)
:Search Relevance/Search (Catch all for Search Relevance)
>bug
Team:Search Relevance (Meta label for the Search Relevance team in Elasticsearch)
priority:normal (A label for assessing bug priority to be used by ES engineers)
