Add the total dense vector count in the indices stats output#98275

Merged
jimczi merged 22 commits into elastic:main from jimczi:dense_vector_stats
Aug 11, 2023

Conversation

@jimczi
Contributor

@jimczi jimczi commented Aug 8, 2023

This change adds the total dense vector count to the output of the indices stats. This is useful for observability in order to track the number of indexed vectors in a cluster.

jimczi added the >feature, :Search/Search, :Core/Infra/Stats, and v8.10.0 labels Aug 8, 2023
elasticsearchmachine added the Team:Data Management (obsolete) and Team:Search labels Aug 8, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search (Team:Search)

@elasticsearchmachine
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

@jimczi
Contributor Author

jimczi commented Aug 8, 2023

I was unsure about the appropriate location to place the count, so I opted for the simplest solution, which involves adding it into the docs section. If there is a strong consensus that it doesn't belong there, I am willing to create a new section. However, it's essential to note that all the failing tests currently expect the docs section to be without the extra field. Until we finalize the best section for the new metric, I will hold off on making any fixes to these tests.

@ChrisHegarty
Contributor

> I was unsure about the appropriate location to place the count, so I opted for the simplest solution, which involves adding it into the docs section. If there is a strong consensus that it doesn't belong there, I am willing to create a new section.

For me this kinda depends on what, if any, related data we may want to add in the future. E.g., separately, the total number of byte / float vectors? Or maybe the total size of byte / float vectors? (What else would be interesting for observability purposes?) But maybe these things are not all that interesting, or are more appropriate at a different (lower-level) API.

@benwtrent
Member

@jimczi could I understand what actions are useful for this o11y?

Are we talking about just knowing vectors for telemetry? Or do we want to know how much off-heap ram would be required given a vector count (if this is the case, we need to know dims & kind or store size...)?

It seems like a "doc_field_stats" object should be added if all we want to do is count the number of fields a document has that fit within a certain mapped category kind.

@jimczi
Contributor Author

jimczi commented Aug 8, 2023

> Are we talking about just knowing vectors for telemetry? Or do we want to know how much off-heap ram would be required given a vector count (if this is the case, we need to know dims & kind or store size...)?

Yes, this is just to know the number of vectors indexed per deployment for telemetry.

> It seems like a "doc_field_stats" object should be added if all we want to do is count the number of fields a document has that fits within a certain mapped category kind.

Not sure I understand, do you mean a top level section? This is not about the number of fields though, we want to know the total number of vectors indexed.

@benwtrent
Member

> Not sure I understand, do you mean a top level section? This is not about the number of fields though, we want to know the total number of vectors indexed.

I explained my idea poorly. I am talking about indexed field kind stats.

So, we would have "text_value_count" or "keyword_value_count" or "numeric_value_count" or "dense_vector_value_count"

@jimczi
Contributor Author

jimczi commented Aug 8, 2023

> So, we would have "text_value_count" or "keyword_value_count" or "numeric_value_count" or "dense_vector_value_count"

Ah, thank you for the explanation. I'm uncertain, though, about the necessity of the value count for the other types. In my opinion, for a detailed examination of the fields and their costs, the disk usage API should be the preferred option.

Continuing on the concept introduced in the completion section, I wonder if a straightforward approach like this could suffice:

"dense_vector": {
  "value_count": 0
}

This would enable the incorporation of additional statistics related to dense_vector in the future, as hinted by @ChrisHegarty and would keep the focus on vectors since that's the original intent for these stats.
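
To illustrate why a dedicated section is easy to extend, here is a minimal sketch in Python. It is not Elasticsearch code: the class `DenseVectorStats` and its `add` and `to_dict` methods are hypothetical names, and the only thing taken from the proposal above is the JSON shape with a single `value_count` field. It assumes per-shard counts are combined by summing.

```python
from dataclasses import dataclass


@dataclass
class DenseVectorStats:
    """Hypothetical stand-in for a per-shard "dense_vector" stats section."""

    value_count: int = 0  # total number of indexed dense vectors

    def add(self, other: "DenseVectorStats") -> None:
        # Assumption: shard-level counts are merged by summing. A future
        # field would get its own merge rule here, leaving other
        # sections (such as docs) untouched.
        self.value_count += other.value_count

    def to_dict(self) -> dict:
        # Same shape as the JSON proposed in the comment above.
        return {"dense_vector": {"value_count": self.value_count}}


# Merge three illustrative shard-level sections into one total.
total = DenseVectorStats()
for shard in (DenseVectorStats(3), DenseVectorStats(0), DenseVectorStats(5)):
    total.add(shard)

print(total.to_dict())  # {'dense_vector': {'value_count': 8}}
```

Keeping the count inside its own object means any later statistic only changes this one section.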

@jimczi
Contributor Author

jimczi commented Aug 9, 2023

I proceeded with the implementation of the new section concept and introduced the dense_vector at the root level. This approach allows for the potential inclusion of additional statistics related to the indexed dense vector without disrupting other sections. Consequently, the integrity of tests and external expectations for the docs section remains unaffected.
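
As a rough sketch of how a consumer might read the new root-level section, the Python below walks a hand-written sample response. The response layout (`indices`, `primaries`, and the sample index names) is an assumption made for illustration and is not copied from this PR; only the `dense_vector` / `value_count` names come from the discussion. The helper `dense_vector_count` is hypothetical.

```python
def dense_vector_count(stats_group: dict) -> int:
    """Return the dense vector count from one stats group, or 0 if the
    "dense_vector" section is absent (for example, in older output)."""
    return stats_group.get("dense_vector", {}).get("value_count", 0)


# Hand-written sample; the nesting is assumed, not taken from real output.
sample_response = {
    "indices": {
        "with-vectors": {
            "primaries": {
                "docs": {"count": 8, "deleted": 0},
                "dense_vector": {"value_count": 8},
            }
        },
        "no-section": {
            "primaries": {"docs": {"count": 2, "deleted": 0}},
        },
    }
}

per_index = {
    name: dense_vector_count(stats["primaries"])
    for name, stats in sample_response["indices"].items()
}
print(per_index)                # {'with-vectors': 8, 'no-section': 0}
print(sum(per_index.values()))  # 8
```

Because the count lives beside `docs` rather than inside it, code that only reads `docs` keeps working unchanged.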

Contributor

@ChrisHegarty ChrisHegarty left a comment

LGTM

Member

@benwtrent benwtrent left a comment

I like the new top level thing

Member

@benwtrent benwtrent left a comment

:shipit:

jimczi merged commit a5d21ce into elastic:main Aug 11, 2023
jimczi deleted the dense_vector_stats branch August 11, 2023 14:17
csoulios pushed a commit to csoulios/elasticsearch that referenced this pull request Aug 18, 2023
…#98275)

This change adds the total dense vector count to the output of the indices stats.
This is useful for observability in order to track the number of indexed vectors
in a cluster.

---------

Co-authored-by: Benjamin Trent <ben.w.trent@gmail.com>

Labels

:Core/Infra/Stats, >feature, :Search/Search, Team:Data Management (obsolete), Team:Search, v8.10.0

4 participants