Understanding which fields are used, and how often, helps users choose between indexed and runtime fields. Fields that are never or only rarely accessed might be good candidates for conversion to runtime fields. Conversely, a runtime field that is queried frequently might be a good candidate for conversion to a proper indexed field. This is complementary to disk usage stats on a per-field level. Having both storage cost and frequency of use available will make it much easier for a user to determine when runtime fields should be used.
This will also be a building block for further usage classification, e.g. we can record which types of queries are run against a given field. If we then see, for example, that a text or keyword field mostly receives wildcard queries, we can recommend using a wildcard field type instead. Similarly, we can recommend the match_only_text field type in cases where queries on a text field never make use of term frequencies and positions.
First step
Add a field usage API that reports shard-level statistics about which Lucene fields have been accessed, and which parts of the Lucene data structures have been accessed:
- terms (incl. postings)
- stored_fields
- doc_values
- points
- norms
- term_vectors
- frequencies
- positions
- offsets
- payloads
Together with the disk usage API, this will yield a list of candidate fields that can be turned into runtime fields, or for which various Lucene data structures can be disabled (e.g. by adapting index_options in the mappings).
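To illustrate how the two sources of information could be combined, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `FieldStats` record, the `runtime_field_candidates` helper, the thresholds, and the sample numbers are hypothetical and do not reflect the response format of the proposed API or of the disk usage API.

```python
from dataclasses import dataclass


# Hypothetical per-field record; the names are illustrative only and are
# not taken from any real API response.
@dataclass
class FieldStats:
    name: str
    usage_count: int  # total accesses across all Lucene data structures
    disk_bytes: int   # on-disk size as reported by per-field disk usage stats


def runtime_field_candidates(stats, max_usage=0, min_bytes=1):
    """Return fields that are rarely accessed but occupy disk space, largest first."""
    picked = [
        s for s in stats
        if s.usage_count <= max_usage and s.disk_bytes >= min_bytes
    ]
    return sorted(picked, key=lambda s: s.disk_bytes, reverse=True)


# Made-up sample data for the sketch.
stats = [
    FieldStats("message", usage_count=1200, disk_bytes=5_000_000),
    FieldStats("debug_payload", usage_count=0, disk_bytes=9_000_000),
    FieldStats("legacy_id", usage_count=0, disk_bytes=40_000),
]

candidates = [s.name for s in runtime_field_candidates(stats)]
print(candidates)
```

Sorting by disk size puts the fields with the largest potential savings first; a real recommendation would probably also need to weigh the query-time cost of a runtime field.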
Second step
Field usage during searches happens in a number of contexts, which we would like to distinguish:
- query: Usage of a field as part of a query, e.g. a term query
- aggregation: Usage of a field to aggregate on, e.g. a terms aggregation. Note that aggregations can also run queries, e.g. as part of a filter aggregation; we will classify that as query usage.
- post-filter
- rescore
- suggest
- fetch phase (incl. top_hits)
- sort (count as query or agg usage, or separately?)
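One way to picture the bookkeeping for these contexts is a counter keyed by field and context. The following Python sketch is purely illustrative: `UsageContext`, `FieldUsageTracker`, and their methods are hypothetical names, and tracking sort as its own context is an assumption made only for the example, since that question is still open above.

```python
from collections import Counter
from enum import Enum


class UsageContext(Enum):
    QUERY = "query"
    AGGREGATION = "aggregation"
    POST_FILTER = "post_filter"
    RESCORE = "rescore"
    SUGGEST = "suggest"
    FETCH = "fetch"
    SORT = "sort"  # assumption: tracked separately in this sketch


class FieldUsageTracker:
    """Hypothetical tracker that counts field accesses per search context."""

    def __init__(self):
        self.counts = Counter()

    def record(self, field, context):
        self.counts[(field, context)] += 1

    def record_aggregation(self, field, filter_query_fields=()):
        # The aggregated field counts as aggregation usage ...
        self.record(field, UsageContext.AGGREGATION)
        # ... while fields used by queries nested inside the aggregation
        # (e.g. a filter aggregation) are classified as query usage.
        for query_field in filter_query_fields:
            self.record(query_field, UsageContext.QUERY)


tracker = FieldUsageTracker()
tracker.record("status", UsageContext.QUERY)
tracker.record_aggregation("host", filter_query_fields=["status"])
print(tracker.counts[("status", UsageContext.QUERY)])
```

Keeping the context as part of the key means the same field can be reported separately for each kind of usage without changing how the counts are stored.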