String field mappings and fielddata / doc values settings

This issue addresses a few topics:
- separating `string` fields out into `text` and `keyword` fields (#11901)
- deprecating in-memory fielddata for field-types that support doc-values (to remove in 3.x)
- `fielddata` and `doc_values` settings (#8693) and `norms`
- good out-of-the-box dynamic mappings for string fields
## `string` → `text`/ `keyword`

Today, we use `string` both for full-text and for structured keywords. We don't support doc-values on `analyzed` string fields, which means that strings which are essentially keywords (but, e.g., need to be lowercased) cannot use doc-values.

**Proposal:**
- deprecate `string` fields
- add `text` fields which support the full analysis chain and don't support doc-values
- add `keyword` fields which support only the `keyword` tokenizer, and have doc-values enabled by default
- change `index` to accept `true` | `false` (instead of today's `analyzed` | `not_analyzed` | `no`)
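
As a rough sketch of how the proposal above might look in a mapping (the field names `body` and `tag` are hypothetical, and the syntax is the proposed one, not a released API):

```
{
  "body": {
    "type": "text",
    "analyzer": "standard",
    "index": true
  },
  "tag": {
    "type": "keyword",
    "index": true
  }
}
```

Here `body` would go through the full analysis chain without doc-values, while `tag` would be indexed as a single token with doc-values enabled by default.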

Question:  Should `keyword` fields allow token filters that introduce new tokens? 
## Deprecating fielddata for fields that support doc values

In-memory fielddata is limited by the size of the heap, and has been one of the biggest pain points for users. Doc-values are slightly slower, but they: (1) don't suffer from the same latency as fielddata, (2) are not limited by heap size, (3) don't impact garbage collection, and (4) allow much greater scaling.

All fields that support doc values already have them enabled by default.  

**Proposal:**
- Deprecate fielddata implementations (except for analyzed string fields) in 2.x 
- Remove them in 3.x.  

The question arises: what happens if the user disables doc values and then decides that they do want to aggregate on that field after all? The answer is the same as if they had set a field to `index:false`: they have to reindex.
## Fielddata and doc values settings 

Today we have these settings:
- `doc_values`: `true`|`false`
- `fielddata.format`: `disabled` | `doc_values` | `paged_bytes` | `array`
- `fielddata.loading`: `lazy` | `eager` | `eager_global_ordinals`
- `fielddata.filters`: `frequency:{}`, `regex:{}`

These become much simpler if we deprecate fielddata for all but analyzed string fields.

**Proposal for fields that support doc values:**
- `doc_values` : `true` (default) | `false`
- `global_ordinals` : `lazy` (default) | `eager`
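
A minimal sketch of the two proposed settings on a field that supports doc values (the field name `status_code` is hypothetical, and `global_ordinals` is the proposed setting name from the list above, not an existing one):

```
{
  "status_code": {
    "type": "keyword",
    "doc_values": true,
    "global_ordinals": "eager"
  }
}
```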

**Proposal for analyzed string fields:**
- `fielddata`: `disabled` (default) | `lazy` | `eager`
- `global_ordinals` : `lazy` (default) | `eager`
- `fielddata.filters`: `frequency:{}`, `regex:{}`
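
For illustration only, an analyzed field that opts back in to fielddata under this proposal might look like the sketch below. The field name `title` is hypothetical, the key `fielddata_filters` is an assumption that borrows the spelling used in the default-settings listing later in this issue, and the `frequency` parameters are assumed to carry over from the existing fielddata frequency filter:

```
{
  "title": {
    "type": "text",
    "fielddata": "lazy",
    "global_ordinals": "lazy",
    "fielddata_filters": {
      "frequency": {
        "min": 0.001,
        "max": 0.1,
        "min_segment_size": 500
      }
    }
  }
}
```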

If, in the future, we can automatically figure out which global ordinals need to be built eagerly, then we can remove the `global_ordinals` setting.
## Norms settings

Similar to the above, we have:
- `norms.enabled` : `true` | `false`
- `norms.loading` : `lazy` | `eager`

In Lucene 5.3, norms are disk based, so the lazy/eager issue is less important (eager in this case would mean force-loading the norms into the file system cache, a decision which we can probably make automatically in the future).

**Proposal:**
- `norms`: `true` | `false`
- only supported on `text` fields
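
A sketch of the simplified setting, assuming a hypothetical `log_message` field where length normalization is not needed for scoring:

```
{
  "log_message": {
    "type": "text",
    "norms": false
  }
}
```
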
## Good out-of-the-box dynamic mappings for string fields

Today, when we detect a new string field, we add it as an `analyzed` `string`, with `lazy` fielddata loading enabled. While this allows users to get going with full text search, sorting and aggregations (with limitations, e.g. a value like `New York` is aggregated as the separate terms `new` and `york`), it's a poor default for heap usage.

**Proposal:**

Add a `text` main field (with fielddata loading disabled) and a `keyword` multi-field by default, i.e.:

```
{
  "my_string": {
    "type": "text",
    "fields": {
      "keyword": {
        "type": "keyword",
        "ignore_above": 256
      }
    }
  }
}
```

With the default settings these fields would look like this:

```
{
  "my_string": {
    "type":                "text",
    "analyzer":            "default",
    "boost":               1,
    "fielddata":           "disabled",
    "fielddata_filters":   {},
    "ignore_above":        -1,
    "include_in_all":      true,
    "index":               true,
    "index_options":       "positions",
    "norms":               true,
    "null_value":          null,
    "position_offset_gap": 0,
    "search_analyzer":     "default",
    "similarity":          "default",
    "store":               false,
    "term_vector":         "no"
  }
}

{
  "my_string.keyword": {
    "type":                "keyword",
    "analyzer":            "keyword",
    "boost":               1,
    "doc_values":          true,
    "ignore_above":        256,
    "include_in_all":      true,
    "index":               true,
    "index_options":       "docs",
    "null_value":          null,
    "position_offset_gap": 0,
    "search_analyzer":     "keyword",
    "similarity":          "default",
    "store":               false,
    "term_vector":         "no"
  }
}
```


String field mappings and fielddata / doc values settings #12394
