[Feature Request] Use of Binary DocValue for high cardinality fields to improve aggregations performance

### Is your feature request related to a problem? Please describe

The doc value type for a keyword field is always `SORTED_SET`. This works well for low- and medium-cardinality fields. For high-cardinality fields, however, it adds overhead, because reading values means iterating over ordinals and then looking each ordinal up in the term dictionary.
Lucene 9 also started always compressing the term dictionaries for sorted doc values (https://issues.apache.org/jira/browse/LUCENE-9843), disregarding the compression mode associated with the codec. This makes ordinal lookups slower when sorted doc values are used, which slows high-cardinality aggregation queries even further.
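
To make the access pattern concrete, here is a minimal toy model in Python. It is not Lucene or OpenSearch code: `ToySortedSetColumn`, `ToyBinaryColumn`, and the `lookups` counter are hypothetical names made up for illustration, and the exact `set` stands in for the approximate sketch a real cardinality aggregation would use. It only shows that the per-document value path over a sorted-set layout pays one dictionary lookup per ordinal, while a binary layout hands back the bytes directly.

```python
from bisect import bisect_left


class ToySortedSetColumn:
    """Toy SORTED_SET-like layout: a sorted term dictionary plus per-document ordinals."""

    def __init__(self, docs):
        # docs is a list of lists of strings, one inner list per document.
        self.terms = sorted({v for values in docs for v in values})
        self.doc_ords = [
            sorted({bisect_left(self.terms, v) for v in values}) for values in docs
        ]
        self.lookups = 0  # number of simulated term-dictionary lookups

    def lookup_ord(self, ordinal):
        # In this toy the lookup is a list index; the counter marks where a real
        # term dictionary (possibly compressed) would have to be consulted.
        self.lookups += 1
        return self.terms[ordinal]


class ToyBinaryColumn:
    """Toy BINARY-like layout: each document stores its own values directly."""

    def __init__(self, docs):
        self.doc_values = [list(values) for values in docs]


def cardinality_via_ordinals(column):
    # Resolve every ordinal of every document back to its term, then count distinct terms.
    seen = set()
    for ords in column.doc_ords:
        for ordinal in ords:
            seen.add(column.lookup_ord(ordinal))
    return len(seen)


def cardinality_via_binary(column):
    # No ordinal indirection: values are read straight from each document.
    seen = set()
    for values in column.doc_values:
        seen.update(values)
    return len(seen)


if __name__ == "__main__":
    docs = [[f"event-{i}"] for i in range(1000)]  # every value is unique
    sorted_set = ToySortedSetColumn(docs)
    binary = ToyBinaryColumn(docs)
    print(cardinality_via_ordinals(sorted_set), sorted_set.lookups)
    print(cardinality_via_binary(binary))
```

In this model, when every value is unique the ordinal path performs one lookup per document and the dictionary deduplicates nothing, which is the overhead described above.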

### Describe the solution you'd like

Using binary doc values for high-cardinality fields can significantly improve the performance of the cardinality aggregation, and of other aggregations too. The catch is that the doc value type is an index-time setting, and we can't store both types because that would significantly increase the index size for keyword fields.

We can do one of the following (feel free to suggest other solutions):
1. Introduce a new field type for such high-cardinality fields that uses binary doc values.
2. Introduce a configuration option within the keyword field (this is what I did for the PoC, as a hack). I'm against this solution because of the complexity it adds to the keyword field type.

Shortcomings of having only binary doc values for a given field type, compared to sorted set doc values:
1. Larger index size, depending on the amount of duplication present. Also, Lucene 9 always compresses the term dictionary for sorted doc values, which is not the case for binary doc values, so that further increases index size when the default `best_speed` compression mode is used.
2. Aggregations or any other code paths that rely on ordinals, like ordinalCollector for CardinalityAggregation, can never be used. I believe this is acceptable for high-cardinality fields, where the ordinals overhead will always be very high and ordinals shouldn't be used anyway.
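
The size trade-off in the first point can be sketched with a back-of-the-envelope estimate. The Python below is a rough model, not a measurement of Lucene's on-disk format: `estimate_raw_sizes` is a hypothetical helper, it assumes one value per document, it ignores all compression (including the term dictionary compression mentioned above), and it charges the sorted-set layout the minimum number of bits per ordinal.

```python
import math


def estimate_raw_sizes(values):
    """Return (sorted_set_bytes, binary_bytes) for one value per document.

    Sorted-set layout: each unique term stored once, plus one ordinal per document.
    Binary layout: every document stores its full value.
    """
    encoded = [v.encode("utf-8") for v in values]
    unique = set(encoded)
    dictionary_bytes = sum(len(term) for term in unique)
    # Minimum bits needed to address any term in the dictionary (at least 1).
    bits_per_ordinal = max(1, math.ceil(math.log2(len(unique)))) if unique else 0
    ordinal_bytes = math.ceil(len(encoded) * bits_per_ordinal / 8)
    binary_bytes = sum(len(value) for value in encoded)
    return dictionary_bytes + ordinal_bytes, binary_bytes


if __name__ == "__main__":
    # Heavy duplication: the dictionary holds a single term.
    print(estimate_raw_sizes(["status-ok"] * 1000))
    # No duplication: the dictionary saves nothing and ordinals are pure overhead.
    print(estimate_raw_sizes([f"id-{i:06d}" for i in range(1000)]))
```

Under these assumptions the sorted-set layout is far smaller when values repeat heavily, and the gap closes (or reverses) as the values approach all-unique; real index sizes will differ once compression is taken into account.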

### Related component

Search:Performance

### Describe alternatives you've considered

_No response_

### Additional context

I [tweaked the code](https://github.com/rishabhmaurya/OpenSearch/commit/49fe52ad694507ca6cbc655541d08fd8a2f9a2a6) to add both sorted set and binary doc values for the keyword field type, and added a way to configure which one is used for `FieldData`, which is used for aggregations.
Running OSB against the Big5 workload for a high-cardinality field showed a significant improvement, about 8.8x, from 28.9 s to 3.3 s:

Query: 
```json
{ 
  "size": 0, 
  "aggs": {
    "agent": {
      "cardinality": {
        "field": "event.id.keyword"
      }
    }
  }
}
```

Using sorted set doc values:
```json
{
  "took" : 28851,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "agent" : {
      "value" : 180250
    }
  }
}
```

Using binary doc values:
```json
{
  "took" : 3266,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "agent" : {
      "value" : 180250
    }
  }
}
```
