Field usage API #73944

@ywelsch

Understanding which fields are being used, and how often, helps users choose between indexed and runtime fields. Fields that are never or only rarely accessed might be good candidates to be turned into runtime fields. Conversely, a runtime field that is queried frequently might be a good candidate to be turned into a proper indexed field. This is complementary to disk usage stats on a per-field level. Having both storage cost and frequency of use available will make it much easier for a user to determine when runtime fields should be used.

This will also be a building block for further usage classification, e.g. we can record which types of queries are being run against a given field. If we then see, for example, that a text or keyword field is mostly receiving wildcard queries, we can recommend using a wildcard field type instead. Similarly, we can recommend the match_only_text field type in cases where queries on a text field never make use of term frequencies and positions.

First step

Add a field usage API that reports shard-level statistics about which Lucene fields have been accessed, and which parts of the Lucene data structures have been accessed:

  • terms (incl. postings)
  • stored_fields
  • doc_values
  • points
  • norms
  • term_vectors
  • frequencies
  • positions
  • offsets
  • payloads

Together with the disk usage API, this will yield a list of candidate fields that can be turned into runtime fields, or for which various Lucene data structures can be disabled (e.g. by adapting index_options in the mappings).
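
To make the shape of these statistics concrete, here is a minimal sketch of a per-shard tracker that counts accesses per field and per Lucene data structure. It is only an illustration of the idea, not the proposed implementation or response format: `UsageType`, `ShardFieldUsageTracker`, `record`, `stats` and `unused_structures` are all hypothetical names, and the real hook points inside Lucene readers are not modelled.

```python
from collections import defaultdict
from enum import Enum


class UsageType(Enum):
    """The data structures listed in the issue (names are illustrative)."""
    TERMS = "terms"
    STORED_FIELDS = "stored_fields"
    DOC_VALUES = "doc_values"
    POINTS = "points"
    NORMS = "norms"
    TERM_VECTORS = "term_vectors"
    FREQUENCIES = "frequencies"
    POSITIONS = "positions"
    OFFSETS = "offsets"
    PAYLOADS = "payloads"


class ShardFieldUsageTracker:
    """Hypothetical per-shard counter: field name -> usage type -> count."""

    def __init__(self):
        self._counts = defaultdict(lambda: defaultdict(int))

    def record(self, field, usage_type):
        # Would be called whenever a data structure of `field` is accessed.
        self._counts[field][usage_type] += 1

    def stats(self):
        # Report every usage type for each seen field, including zeros.
        return {
            field: {u.value: per_field.get(u, 0) for u in UsageType}
            for field, per_field in self._counts.items()
        }

    def unused_structures(self, field):
        # Structures never accessed for this field: candidates for disabling.
        per_field = self._counts.get(field, {})
        return [u.value for u in UsageType if per_field.get(u, 0) == 0]
```

In this sketch, a text field whose `unused_structures` result includes `frequencies` and `positions` would be the kind of field for which changing index_options, or switching field type, could be suggested.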

Second step

Field usage during searches happens in a number of contexts, which we would like to distinguish:

  • query: Usage of a field as part of a query, e.g. a term query
  • aggregation: Usage of a field to aggregate on, e.g. a terms aggregation. Note that aggregations can also run queries, e.g. as part of a filter aggregation; we will classify such usage as query usage.
  • post-filter
  • rescore
  • suggest
  • fetch phase (incl. top_hits)
  • sort (count as query or agg usage, or separately?)
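
One way to picture this classification is a recorder that attributes each field access to the innermost active context, so that a query run inside a filter aggregation is counted as query usage. The sketch below assumes this "innermost context wins" rule. `ContextualUsageRecorder` and the context labels are hypothetical, and `sort` is given its own label purely as a placeholder, since the issue leaves open how sort usage should be counted.

```python
from collections import Counter
from contextlib import contextmanager

# Illustrative labels; "sort" is a placeholder for an undecided question.
CONTEXTS = ("query", "aggregation", "post_filter", "rescore",
            "suggest", "fetch", "sort")


class ContextualUsageRecorder:
    """Hypothetical recorder keyed by (field, search context)."""

    def __init__(self):
        self._stack = []
        self.counts = Counter()

    @contextmanager
    def context(self, name):
        if name not in CONTEXTS:
            raise ValueError(f"unknown context: {name}")
        self._stack.append(name)
        try:
            yield
        finally:
            self._stack.pop()

    def record(self, field):
        if not self._stack:
            raise RuntimeError("field accessed outside of any search context")
        # Attribute the access to the innermost active context.
        self.counts[(field, self._stack[-1])] += 1


recorder = ContextualUsageRecorder()
with recorder.context("aggregation"):
    recorder.record("status")        # field being aggregated on
    with recorder.context("query"):  # query run by a filter aggregation
        recorder.record("user")      # counted as query usage
```

Keeping a stack rather than a single current context means the outer aggregation context is restored automatically once the nested query finishes.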
