This issue addresses a few topics:
string → text / keyword
Today, we use string both for full-text and for structured keywords. We don't support doc-values on analyzed string fields, which means that strings which are essentially keywords (but which, e.g., need to be lowercased) cannot use doc-values.
Proposal:
- deprecate string fields
- add text fields, which support the full analysis chain and don't support doc-values
- add keyword fields, which support only the keyword tokenizer and have doc-values enabled by default
- change index to accept true | false
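To make the split concrete, here is a minimal Python sketch of how an existing string mapping might be translated under this proposal. The helper name migrate_string_field is hypothetical, and the handling of index: no (mapped to an unindexed keyword) is an assumption for illustration; the proposal itself does not specify it.

```python
def migrate_string_field(mapping):
    """Translate a legacy `string` mapping into the proposed text/keyword form.

    Hypothetical helper: these rules illustrate the proposal and are not an
    existing Elasticsearch API.
    """
    if mapping.get("type") != "string":
        return dict(mapping)  # only string fields are affected

    # Carry over every setting except the two that change meaning.
    migrated = {k: v for k, v in mapping.items() if k not in ("type", "index")}
    legacy_index = mapping.get("index", "analyzed")  # legacy default

    if legacy_index == "analyzed":
        # Full-text field: full analysis chain, no doc-values.
        migrated["type"] = "text"
        migrated["index"] = True
        migrated.pop("doc_values", None)
    elif legacy_index == "not_analyzed":
        # Structured keyword: doc-values enabled unless explicitly disabled.
        migrated["type"] = "keyword"
        migrated["index"] = True
        migrated.setdefault("doc_values", True)
    elif legacy_index == "no":
        # Assumption: an unindexed string becomes an unindexed keyword.
        migrated["type"] = "keyword"
        migrated["index"] = False
        migrated.setdefault("doc_values", True)
    else:
        raise ValueError(f"unknown legacy index value: {legacy_index!r}")
    return migrated
```

For example, migrate_string_field({"type": "string", "index": "not_analyzed"}) yields a keyword mapping with index set to true and doc_values enabled.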
Question: Should keyword fields allow token filters that introduce new tokens?
Deprecating fielddata for fields that support doc values
In-memory fielddata is limited by the size of the heap, and has been one of the biggest pain-points for users. Doc-values are slightly slower but: (1) don't suffer from the same latency as fielddata, (2) are not limited by heap size, (3) don't impact garbage collection, (4) allow much greater scaling.
All fields that support doc values already have them enabled by default.
Proposal:
- Deprecate fielddata implementations (except for analyzed string fields) in 2.x
- Remove them in 3.x.
The question arises: what happens if the user disables doc values then decides that actually they DO want to aggregate on that field after all? The answer is the same as if they have set a field to index:false - they have to reindex.
Fielddata and doc values settings
Today we have these settings:
doc_values: true|false
fielddata.format: disabled | doc_values | paged_bytes | array
fielddata.loading: lazy | eager | eager_global_ordinals
fielddata.filters: frequency:{}, regex:{}
These become a lot easier to simplify if we deprecate fielddata for all but analyzed string fields.
Proposal for fields that support doc values:
doc_values : true (default) | false
global_ordinals : lazy (default) | eager
Proposal for analyzed string fields:
fielddata: disabled (default) | lazy | eager
global_ordinals : lazy (default) | eager
fielddata.filters: frequency:{}, regex:{}
If, in the future, we can automatically figure out which global ordinals need to be built eagerly, then we can remove the global_ordinals setting.
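As a rough illustration of how the current settings could collapse into the proposed ones, the following Python sketch maps a legacy settings dict onto the new shape. The function name, the nested-dict layout of the legacy settings, and the fielddata_filters key are assumptions made for this example.

```python
def simplify_fielddata_settings(old, analyzed_string):
    """Map legacy doc_values/fielddata settings onto the proposed settings.

    Hypothetical sketch. `old` is assumed to look like:
    {"doc_values": bool, "fielddata": {"format": ..., "loading": ..., "filters": ...}}
    """
    fielddata = old.get("fielddata", {})
    loading = fielddata.get("loading", "lazy")

    # Only eager_global_ordinals asked for eagerly built global ordinals.
    global_ordinals = "eager" if loading == "eager_global_ordinals" else "lazy"

    if not analyzed_string:
        # Fields that support doc values: fielddata settings disappear.
        return {
            "doc_values": old.get("doc_values", True),
            "global_ordinals": global_ordinals,
        }

    # Analyzed string fields keep in-memory fielddata as an opt-in.
    if fielddata.get("format") == "disabled":
        mode = "disabled"
    elif loading == "lazy":
        mode = "lazy"
    else:
        mode = "eager"  # legacy "eager" or "eager_global_ordinals"

    new = {"fielddata": mode, "global_ordinals": global_ordinals}
    if "filters" in fielddata:
        new["fielddata_filters"] = fielddata["filters"]
    return new
```

Note that this translates explicitly configured legacy values only; a field with no fielddata settings at all would take the proposed default (disabled) rather than the legacy lazy behaviour.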
Norms settings
Similar to the above, we have:
norms.enabled : true | false
norms.loading : lazy | eager
In Lucene 5.3, norms are disk-based, so the lazy/eager issue is less important (eager in this case would mean force-loading the norms into the file system cache, a decision which we can probably make automatically in the future).
Proposal:
norms: true | false
- only supported on text fields
Good out-of-the-box dynamic mappings for string fields
Today, when we detect a new string field, we add it as an analyzed string, with lazy fielddata loading enabled. While this allows users to get going with full-text search, sorting and aggregations (with limitations, e.g. "New York" is aggregated as the separate terms new and york), it's a poor default for heap usage.
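The aggregation limitation can be shown with a small Python sketch. The analyze function below is a crude stand-in (lowercase, split on whitespace) and not the real default analyzer; terms_counts is a hypothetical helper that counts each distinct term once per document, loosely mimicking a terms aggregation.

```python
from collections import Counter


def analyze(value):
    """Crude stand-in for an analyzer: lowercase, then split on whitespace."""
    return value.lower().split()


def terms_counts(values, analyzed):
    """Count documents per term, for an analyzed or a keyword-style field."""
    counts = Counter()
    for value in values:
        # An analyzed field contributes its tokens; a keyword field
        # contributes the whole original value as a single term.
        terms = set(analyze(value)) if analyzed else {value}
        counts.update(terms)
    return counts


cities = ["New York", "New Orleans", "York"]
analyzed_counts = terms_counts(cities, analyzed=True)
keyword_counts = terms_counts(cities, analyzed=False)
```

With the analyzed field, the buckets are new, york and orleans, and "New York" never appears as a bucket; with the keyword-style field each city is its own bucket.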
Proposal:
Add a text main field (with fielddata loading disabled) and a keyword multi-field by default, i.e.:
{
  "my_string": {
    "type": "text",
    "fields": {
      "keyword": {
        "type": "keyword",
        "ignore_above": 256
      }
    }
  }
}
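The effect of this default can be sketched in Python. The helper names are hypothetical, and the sketch assumes that ignore_above works as it does for existing string fields (values longer than the limit are skipped for that field) and that a negative value means no limit.

```python
IGNORE_ABOVE = 256  # default taken from the proposed dynamic mapping


def dynamic_string_mapping(field_name):
    """Build the proposed default mapping for a newly detected string field."""
    return {
        field_name: {
            "type": "text",
            "fields": {
                "keyword": {"type": "keyword", "ignore_above": IGNORE_ABOVE}
            },
        }
    }


def indexed_subfields(field_name, value):
    """List the field paths that would index `value` under that mapping."""
    props = dynamic_string_mapping(field_name)[field_name]
    targets = [field_name]  # the text main field always indexes the value
    for sub_name, sub in props["fields"].items():
        limit = sub.get("ignore_above", -1)
        # Assumption: a negative limit means "no limit".
        if limit < 0 or len(value) <= limit:
            targets.append(f"{field_name}.{sub_name}")
    return targets
```

A short value such as "New York" is indexed in both my_string and my_string.keyword, while a value longer than 256 characters is indexed in my_string only, which keeps very long strings out of doc values.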
With the default settings these fields would look like this:
{
  "my_string": {
    "type": "text",
    "analyzer": "default",
    "boost": 1,
    "fielddata": "disabled",
    "fielddata_filters": {},
    "ignore_above": -1,
    "include_in_all": true,
    "index": true,
    "index_options": "positions",
    "norms": true,
    "null_value": null,
    "position_offset_gap": 0,
    "search_analyzer": "default",
    "similarity": "default",
    "store": false,
    "term_vector": "no"
  }
}
{
  "my_string.keyword": {
    "type": "keyword",
    "analyzer": "keyword",
    "boost": 1,
    "doc_values": true,
    "ignore_above": 256,
    "include_in_all": true,
    "index": true,
    "index_options": "docs",
    "null_value": null,
    "position_offset_gap": 0,
    "search_analyzer": "keyword",
    "similarity": "default",
    "store": false,
    "term_vector": "no"
  }
}
Related issues:
- Rethink string versus not_analyzed string mappings and support #11901 (string fields split out into text and keyword fields)
- Improve fielddata mappings #8693 (fielddata and doc_values settings)