String field mappings and fielddata / doc values settings #12394

@clintongormley

This issue addresses a few topics:

string → text / keyword

Today, we use string both for full-text and for structured keywords. We don't support doc-values on analyzed string fields, which means that strings which are essentially keywords but need some analysis (eg lowercasing) cannot use doc-values.

Proposal:

  • deprecate string fields
  • add text fields which support the full analysis chain and don't support doc-values
  • add keyword fields which support only the keyword tokenizer, and have doc-values enabled by default
  • change index to accept true | false (instead of analyzed | not_analyzed | no)
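
As a rough sketch of how this could look in index settings and a mapping: the type name my_type, the field names body, tag and payload, and the analyzer name lowercase_keyword are made up for illustration. The text / keyword types and the boolean index are only proposed here, so the final syntax may differ.

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "lowercase_keyword": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "my_type": {
      "properties": {
        "body":    { "type": "text" },
        "tag":     { "type": "keyword", "analyzer": "lowercase_keyword" },
        "payload": { "type": "keyword", "index": false }
      }
    }
  }
}
```

Here tag still produces a single (lowercased) token per value, so it could use doc-values, which is not possible today with an analyzed string field.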

Question: Should keyword fields allow token filters that introduce new tokens?

Deprecating fielddata for fields that support doc values

In-memory fielddata is limited by the size of the heap, and has been one of the biggest pain-points for users. Doc-values are slightly slower, but they: (1) don't suffer from the same loading latency as fielddata, (2) are not limited by heap size, (3) don't impact garbage collection, (4) allow much greater scaling.

All fields that support doc values already have them enabled by default.

Proposal:

  • Deprecate fielddata implementations (except for analyzed string fields) in 2.x
  • Remove them in 3.x.

The question arises: what happens if the user disables doc values and then decides that actually they DO want to aggregate on that field after all? The answer is the same as if they had set a field to index:false - they have to reindex.

Fielddata and doc values settings

Today we have these settings:

  • doc_values: true|false
  • fielddata.format: disabled | doc_values | paged_bytes | array
  • fielddata.loading: lazy | eager | eager_global_ordinals
  • fielddata.filters: frequency:{}, regex:{}

These become a lot simpler if we deprecate fielddata for all but analyzed string fields.

Proposal for fields that support doc values:

  • doc_values : true (default) | false
  • global_ordinals : lazy (default) | eager

Proposal for analyzed string fields:

  • fielddata: disabled (default) | lazy | eager
  • global_ordinals : lazy (default) | eager
  • fielddata.filters: frequency:{}, regex:{}

If, in the future, we can automatically figure out which global ordinals need to be built eagerly, then we can remove the global_ordinals setting.
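
To make the two proposals concrete, a mapping fragment might look something like this. The field names status and title are hypothetical, the filter setting is spelled fielddata_filters to match the default settings listed further down, and the frequency parameters are borrowed from the existing fielddata frequency filter rather than being part of this proposal.

```json
{
  "properties": {
    "status": {
      "type": "keyword",
      "doc_values": true,
      "global_ordinals": "eager"
    },
    "title": {
      "type": "text",
      "fielddata": "lazy",
      "global_ordinals": "lazy",
      "fielddata_filters": {
        "frequency": { "min": 0.001, "max": 0.1, "min_segment_size": 500 }
      }
    }
  }
}
```

The keyword field never touches in-memory fielddata; only the text field has to opt in to it explicitly.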

Norms settings

Similar to the above, we have:

  • norms.enabled : true | false
  • norms.loading : lazy | eager

In Lucene 5.3, norms are disk-based, so the lazy/eager issue is less important (eager in this case would mean force-loading the norms into the file system cache, a decision which we can probably make automatically in the future).

Proposal:

  • norms: true | false
  • only supported on text fields
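
For illustration only (the field names title and log_line are made up), the simplified setting would collapse what is written today as "norms": { "enabled": false } into a plain boolean:

```json
{
  "properties": {
    "title":    { "type": "text", "norms": true },
    "log_line": { "type": "text", "norms": false }
  }
}
```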

Good out-of-the-box dynamic mappings for string fields

Today, when we detect a new string field, we add it as an analyzed string, with lazy fielddata loading enabled. While this allows users to get going with full text search, sorting and aggregations (with limitations, eg "New York" is aggregated as the separate terms new + york), it's a poor default for heap usage.

Proposal:

Add a text main field (with fielddata loading disabled) and a keyword multi-field by default, ie:

{
  "my_string": {
    "type": "text",
    "fields": {
      "keyword": {
        "type": "keyword",
        "ignore_above": 256
      }
    }
  }
}

With the default settings these fields would look like this:

{
  "my_string": {
    "type":                "text",
    "analyzer":            "default",
    "boost":               1,
    "fielddata":           "disabled",
    "fielddata_filters":   {},
    "ignore_above":        -1,
    "include_in_all":      true,
    "index":               true,
    "index_options":       "positions",
    "norms":               true,
    "null_value":          null,
    "position_offset_gap": 0,
    "search_analyzer":     "default",
    "similarity":          "default",
    "store":               false,
    "term_vector":         "no"
  }
}

{
  "my_string.keyword": {
    "type":                "keyword",
    "analyzer":            "keyword",
    "boost":               1,
    "doc_values":          true,
    "ignore_above":        256,
    "include_in_all":      true,
    "index":               true,
    "index_options":       "docs",
    "null_value":          null,
    "position_offset_gap": 0,
    "search_analyzer":     "keyword",
    "similarity":          "default",
    "store":               false,
    "term_vector":         "no"
  }
}
