Add SparseVectorStats #108793

Merged
kderusso merged 22 commits into elastic:main from kderusso:kderusso/sparse-vector-stats
Jun 17, 2024

@kderusso
Member

@kderusso kderusso commented May 17, 2024

Relates to #98275

Adds statistics on the number of sparse_vector fields in an index or cluster.

Because we can't pull dimensionality from the Lucene index, this relies on the index's sparse_vector mappings to identify the documents whose fields should be counted.

Here is an example script to test:

PUT my-index-1
{
  "mappings": {
    "properties": {
      "sparse_field1": {
        "type": "sparse_vector"
      },
      "sparse_field2": {
        "type": "sparse_vector"
      },
      "dense_field": {
        "type": "dense_vector"
      },
      "nonsparse_field": {
        "type": "keyword"
      }
    }
  },
  "settings": {
    "index.number_of_shards": 1,
    "index.number_of_replicas": 1
  }
}

// my-index-2 has 3 sparse_vector fields
PUT my-index-2
{
  "mappings": {
    "properties": {
      "sparse_field1": {
        "type": "sparse_vector"
      },
      "sparse_field2": {
        "type": "sparse_vector"
      },
      "sparse_field3": {
        "type": "sparse_vector"
      },
      "nonsparse_field": {
        "type": "keyword"
      }
    }
  },
  "settings": {
    "index.number_of_shards": 1,
    "index.number_of_replicas": 1
  }
}

PUT my-index-1/_doc/1
{
  "sparse_field1": { "a": 1, "b": 2 },
  "sparse_field2": { "c": 3, "d": 4, "e": 5 },
  "dense_field": [1, 2, 3]
}

PUT my-index-1/_doc/2
{
  "sparse_field1": { "f": 1, "g": 2, "h": 3, "i": 4 },
  "sparse_field2": { "j": 5, "k": 6, "l": 7 },
  "dense_field": [2, 3, 4]
}

PUT my-index-2/_doc/1
{
  "sparse_field1": { "m": 1, "n": 2 },
  "sparse_field2": { "o": 3, "p": 4, "q": 5 },
  "sparse_field3": { "a": 1, "b": 2 },
  "nonsparse_field": "cupcakes"
}

PUT my-index-2/_doc/2
{
  "nonsparse_field": "eclairs"
}

POST my-index-1/_refresh

POST my-index-2/_refresh

// _all returns 7 sparse_vector fields, which are then broken down by index
GET _stats/sparse_vector

// correctly returns 4 sparse_vector fields
GET my-index-1/_stats/sparse_vector

// correctly returns 3 sparse_vector fields
GET my-index-2/_stats/sparse_vector

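// node-level stats, for checking the counts per node
GET /_nodes/stats

// per-shard dense_vector (dvc) and sparse_vector (svc) counts
GET /_cat/shards?h=i,dvc,svc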
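
As a sanity check on the expected counts in the comments above, here is a minimal Python sketch of the counting rule this description implies: one count for every sparse_vector-mapped field that is present in a document. It is an illustration only, not the Java implementation in this PR. The helper names (`sparse_vector_field_names`, `count_sparse_vector_values`) are hypothetical, and the sketch assumes top-level fields only.

```python
from typing import Any


def sparse_vector_field_names(mappings: dict[str, Any]) -> set[str]:
    """Return the names of top-level fields mapped as sparse_vector."""
    return {
        name
        for name, spec in mappings["properties"].items()
        if spec.get("type") == "sparse_vector"
    }


def count_sparse_vector_values(
    mappings: dict[str, Any], docs: list[dict[str, Any]]
) -> int:
    """Count one value for each sparse_vector field present in each document."""
    names = sparse_vector_field_names(mappings)
    return sum(1 for doc in docs for field in doc if field in names)


# Mappings and documents copied from the example script above.
index_1_mappings = {
    "properties": {
        "sparse_field1": {"type": "sparse_vector"},
        "sparse_field2": {"type": "sparse_vector"},
        "dense_field": {"type": "dense_vector"},
        "nonsparse_field": {"type": "keyword"},
    }
}
index_2_mappings = {
    "properties": {
        "sparse_field1": {"type": "sparse_vector"},
        "sparse_field2": {"type": "sparse_vector"},
        "sparse_field3": {"type": "sparse_vector"},
        "nonsparse_field": {"type": "keyword"},
    }
}
index_1_docs = [
    {
        "sparse_field1": {"a": 1, "b": 2},
        "sparse_field2": {"c": 3, "d": 4, "e": 5},
        "dense_field": [1, 2, 3],
    },
    {
        "sparse_field1": {"f": 1, "g": 2, "h": 3, "i": 4},
        "sparse_field2": {"j": 5, "k": 6, "l": 7},
        "dense_field": [2, 3, 4],
    },
]
index_2_docs = [
    {
        "sparse_field1": {"m": 1, "n": 2},
        "sparse_field2": {"o": 3, "p": 4, "q": 5},
        "sparse_field3": {"a": 1, "b": 2},
        "nonsparse_field": "cupcakes",
    },
    {"nonsparse_field": "eclairs"},
]

counts = {
    "my-index-1": count_sparse_vector_values(index_1_mappings, index_1_docs),
    "my-index-2": count_sparse_vector_values(index_2_mappings, index_2_docs),
}
total = sum(counts.values())
print(counts, total)
```

Under this rule the second document in my-index-2 contributes nothing, because it contains no sparse_vector field, which is why that index reports 3 rather than 6.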

@github-actions
Contributor

Documentation preview:

Member

I think the docs refactoring could belong to a separate PR, labeled with "docs", so it's easier to review

Member Author

Leaving this here for now, as it's consistent with other PRs. Apologies for the whitespace formatting. If you feel strongly about it, I can create 2 PRs.

@kderusso kderusso force-pushed the kderusso/sparse-vector-stats branch from eadbf60 to 254a1d6 Compare May 24, 2024 17:41
@kderusso kderusso changed the title WIP: Add SparseVectorStats Add SparseVectorStats May 29, 2024
@kderusso kderusso force-pushed the kderusso/sparse-vector-stats branch 3 times, most recently from 1158acc to eff9791 Compare May 30, 2024 12:43
@kderusso kderusso force-pushed the kderusso/sparse-vector-stats branch from 4c244dd to 65374ff Compare June 10, 2024 17:40
@kderusso
Member Author

kderusso commented Jun 13, 2024

Note: This is modeled after some of the dense vector changes in #107962

@kderusso kderusso marked this pull request as ready for review June 13, 2024 21:08
@elasticsearchmachine elasticsearchmachine added the "needs:triage" label (Requires assignment of a team area label) Jun 13, 2024
@kderusso kderusso added the ">enhancement" label and removed the "needs:triage" label (Requires assignment of a team area label) Jun 13, 2024
@kderusso kderusso added the ":Search/Search" (Search-related issues that do not fall into other categories) and "Team:Search Relevance" (Meta label for the Search Relevance team in Elasticsearch) labels Jun 13, 2024
@kderusso kderusso requested review from a team, carlosdelest and jimczi June 13, 2024 21:08
@elasticsearchmachine elasticsearchmachine added the "Team:Search" label (Meta label for search team) Jun 13, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search (Team:Search)

@elasticsearchmachine
Collaborator

Hi @kderusso, I've created a changelog YAML for you.

Contributor

@jimczi jimczi left a comment

LGTM

Labels

>enhancement
:Search/Search (Search-related issues that do not fall into other categories)
Team:Search Relevance (Meta label for the Search Relevance team in Elasticsearch)
Team:Search (Meta label for search team)
v8.15.0

6 participants