Field usage API #73944

Understanding which fields are used, and how often, helps users choose between indexed and runtime fields. Fields that are never or only rarely accessed might be good candidates to be turned into runtime fields. Conversely, a runtime field that is queried frequently might be a good candidate to be turned into a proper indexed field. This is complementary to [disk usage stats on a per-field level](https://github.com/elastic/elasticsearch/issues/68508). Having both the storage cost and the frequency of use at their disposal will make it much easier for users to determine when runtime fields should be used.

This will also be a building block for further usage classification, e.g. we can record which types of queries are being run against a given field. If we then see, for example, that a `text` or `keyword` field is mostly receiving wildcard queries, we can recommend using a [wildcard field type](https://www.elastic.co/guide/en/elasticsearch/reference/7.13/keyword.html#wildcard-field-type) instead. Similarly, we can recommend the [`match_only_text` field type](https://www.elastic.co/guide/en/elasticsearch/reference/7.x/text.html#match-only-text-field-type) in cases where queries on a `text` field never make use of term frequencies and positions.
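To make the recommendation idea concrete, here is a minimal sketch of how per-field query-type counts could drive such a suggestion. Everything in it is an assumption for illustration: the function name `recommend_field_type`, the shape of `query_counts`, the `uses_freqs_or_positions` flag, and the 0.5 threshold standing in for "mostly" are hypothetical and not part of any Elasticsearch API.

```python
def recommend_field_type(field_type, query_counts, uses_freqs_or_positions,
                         wildcard_share=0.5):
    """Suggest an alternative field type, or None if nothing stands out.

    field_type: current mapping type, e.g. "text" or "keyword".
    query_counts: hypothetical {query_type: count} recorded for the field.
    uses_freqs_or_positions: True if any query read term frequencies or positions.
    wildcard_share: assumed cut-off for "mostly wildcard queries".
    """
    total = sum(query_counts.values())
    if total == 0:
        return None  # no recorded queries, nothing to base a suggestion on

    wildcard_ratio = query_counts.get("wildcard", 0) / total
    if field_type in ("text", "keyword") and wildcard_ratio > wildcard_share:
        return "wildcard"

    if field_type == "text" and not uses_freqs_or_positions:
        return "match_only_text"

    return None
```

A `keyword` field with nine wildcard queries and one term query would be flagged for the wildcard type, while a `text` field whose queries never touch frequencies or positions would be flagged for `match_only_text`.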

**First step**

Add a field usage API that reports shard-level statistics about which Lucene fields have been accessed, and which parts of the Lucene data structures have been accessed:

- terms (incl. postings)
- stored_fields
- doc_values
- points
- norms
- term_vectors
- frequencies
- positions
- offsets
- payloads
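As a rough sketch of what shard-level bookkeeping for the structures listed above could look like, the following keeps one counter per (field, structure) pair. The class `ShardFieldUsage` and its methods are hypothetical and only illustrate the idea; they do not describe how Elasticsearch or Lucene would track this internally.

```python
from collections import Counter

# Data structure names taken from the list above.
STRUCTURES = (
    "terms", "stored_fields", "doc_values", "points", "norms",
    "term_vectors", "frequencies", "positions", "offsets", "payloads",
)


class ShardFieldUsage:
    """Hypothetical per-shard usage counters keyed by (field, structure)."""

    def __init__(self, shard_id):
        self.shard_id = shard_id
        self._counts = Counter()

    def record(self, field_name, structure):
        """Count one access to `structure` of `field_name`."""
        if structure not in STRUCTURES:
            raise ValueError(f"unknown structure: {structure}")
        self._counts[(field_name, structure)] += 1

    def report(self):
        """Return {field: {structure: count}} for every field seen so far."""
        out = {}
        for (field_name, structure), count in self._counts.items():
            per_field = out.setdefault(field_name, dict.fromkeys(STRUCTURES, 0))
            per_field[structure] = count
        return out


shard = ShardFieldUsage(shard_id=0)
shard.record("message", "terms")
shard.record("message", "terms")
shard.record("timestamp", "points")
```

Reporting zeros for untouched structures is a deliberate choice in this sketch: the unused parts of a field are exactly what the API is meant to surface.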

Together with the [disk usage API](https://github.com/elastic/elasticsearch/issues/68508), this will yield a list of candidate fields that can be turned into runtime fields, or where various Lucene data structures can be disabled (e.g. by adapting [`index_options`](https://www.elastic.co/guide/en/elasticsearch/reference/current/index-options.html) in the mappings).
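One way to picture how the two sources could be combined is sketched below. The input shapes (`usage` as access counts and `disk_bytes` as on-disk sizes, both keyed by field and structure) and the function `find_candidates` are assumptions made for this example, not the response format of either API.

```python
def find_candidates(usage, disk_bytes):
    """Split fields into runtime-field candidates and structure-disable candidates.

    usage: hypothetical {field: {structure: access_count}}.
    disk_bytes: hypothetical {field: {structure: size_in_bytes}}.
    Returns (runtime_candidates, disable_candidates), where disable_candidates
    is a list of (field, [unused structures that occupy disk space]).
    """
    runtime_candidates = []
    disable_candidates = []
    for field_name, sizes in disk_bytes.items():
        counts = usage.get(field_name, {})
        if not any(count > 0 for count in counts.values()):
            # The field takes up space but was never accessed at all.
            runtime_candidates.append(field_name)
            continue
        unused = sorted(
            structure
            for structure, size in sizes.items()
            if size > 0 and counts.get(structure, 0) == 0
        )
        if unused:
            disable_candidates.append((field_name, unused))
    return runtime_candidates, disable_candidates


usage = {"message": {"terms": 12, "positions": 0, "norms": 0}, "user_agent": {}}
disk = {
    "message": {"terms": 1000, "positions": 4000, "norms": 50},
    "user_agent": {"terms": 300, "doc_values": 200},
}
runtime, disable = find_candidates(usage, disk)
```

Here `user_agent` is never accessed and ends up as a runtime-field candidate, while `message` is in use but its positions and norms are not, which hints at a mapping change rather than a conversion.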

**Second step**

Field usage during searches happens in a number of contexts, which we would like to distinguish:

- query: Usage of a field as part of a query, e.g. a term query
- aggregation: Usage of a field to aggregate on, e.g. a terms aggregation. Note that aggregations can also run queries, e.g. as part of a filter aggregation; we will classify such usage as query usage.
- post-filter
- rescore
- suggest
- fetch phase (incl. top_hits)
- sort (count as query or agg usage, or separately?)
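The contexts above, including the rule that a query run inside an aggregation counts as query usage, can be sketched with a small tracker in which the innermost active context wins. The names `UsageContext` and `UsageTracker` are hypothetical, and treating sort as its own context is an assumption made only for this sketch, since the list above leaves that question open.

```python
from collections import Counter
from contextlib import contextmanager
from enum import Enum


class UsageContext(Enum):
    QUERY = "query"
    AGGREGATION = "aggregation"
    POST_FILTER = "post_filter"
    RESCORE = "rescore"
    SUGGEST = "suggest"
    FETCH = "fetch"
    SORT = "sort"  # assumption: tracked separately; the proposal has not decided


class UsageTracker:
    """Hypothetical tracker that attributes field accesses to the innermost context."""

    def __init__(self):
        self._stack = []
        self.counts = Counter()

    @contextmanager
    def within(self, context):
        """Mark `context` as active for the duration of the with-block."""
        self._stack.append(context)
        try:
            yield
        finally:
            self._stack.pop()

    def record(self, field_name):
        if not self._stack:
            raise RuntimeError("field accessed outside of any usage context")
        self.counts[(field_name, self._stack[-1])] += 1


tracker = UsageTracker()
with tracker.within(UsageContext.AGGREGATION):
    tracker.record("status")  # e.g. a terms aggregation on `status`
    with tracker.within(UsageContext.QUERY):
        tracker.record("level")  # e.g. the query inside a filter aggregation
```

With this stack-based approach, the access to `level` is attributed to the query context even though it happens while an aggregation is running, matching the classification described in the aggregation item.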
