Problem
Elasticsearch currently supports indexing string data either as analyzed strings, which are great for unstructured, full-text search, or as not_analyzed strings, which are great for structured search (e.g., exact matches). However, there is frequently an in-between case where you want exact matches, but you want them to ignore case or accented characters (AA == aa == Ââ). This forces the use of analyzers purely for normalization.
Partial Workaround
For those scenarios, you are currently forced to use the analyzed string variant with a specific analyzer. This generally leads to users forgetting to disable a lot of things like norms, positions, and frequencies. Even if you happen to get all of that right, you still cannot take advantage of doc values.
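For illustration, the workaround today looks roughly like the following (the analyzer name my_normalizer is hypothetical): a custom analyzer built from the keyword tokenizer plus normalizing filters, with norms and positions manually disabled on the field.

```json
PUT /my-index
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "my_normalizer" : {
          "type" : "custom",
          "tokenizer" : "keyword",
          "filter" : [ "lowercase", "trim" ]
        }
      }
    }
  },
  "mappings" : {
    "my-type" : {
      "properties" : {
        "constant_string" : {
          "type" : "string",
          "analyzer" : "my_normalizer",
          "norms" : { "enabled" : false },
          "index_options" : "docs"
        }
      }
    }
  }
}
```

Even with every setting correct, the field is still an analyzed string, so doc values remain unavailable for sorting and aggregations.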
Potential Solution
It would be interesting to rethink strings and how they are mapped. For analyzed strings, there really isn't much need for improvement beyond perhaps the naming. For not_analyzed strings, there is a lot of room for improvement.
Mockup
PUT /my-index
{
  "mappings" : {
    "my-type" : {
      "properties" : {
        "full_text" : {
          "type" : "string",
          "analyzer" : "standard"
        },
        "constant_string" : {
          "type" : "constant_string",
          "filter" : [ "lowercase", "trim" ],
          "char_filter" : [ "..." ]
        }
      }
    }
  }
}
Note: the difference is that analyzed strings stay "string" and not_analyzed strings become "constant_string". It's unlikely that we could easily move away from "string" for analyzed text, but if we could, then perhaps analyzed strings could become "text" and not_analyzed strings could become just "string".
This avoids a lot of questions and recurring problems. If you choose not to supply a filter or char_filter for a constant_string, it provides exactly the functionality we have today. But it also adds flexibility: users can finally use doc values with filtered text, so normalized values can be reasonably sorted and aggregated, without the possibility of confusingly tokenizing the string or unnecessarily storing norms, position, or frequency data.
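Assuming the constant_string mapping from the mockup above (a proposed type that does not exist yet, so this is purely illustrative), exact matching and aggregation on the normalized value might look like this, with "  FOO-Bar  " stored and matched as "foo-bar" after the lowercase and trim filters run:

```json
PUT /my-index/my-type/1
{ "constant_string" : "  FOO-Bar  " }

GET /my-index/_search
{
  "query" : {
    "term" : { "constant_string" : "foo-bar" }
  },
  "aggs" : {
    "by_value" : {
      "terms" : { "field" : "constant_string" }
    }
  }
}
```

The term query matches because both the indexed value and (under this proposal) the query input pass through the same normalization chain, while the terms aggregation can run off doc values holding the single normalized token.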