Add doc values support for JSON fields.#40069
Add doc values support for JSON fields.#40069jtibshirani merged 17 commits intoelastic:object-fieldsfrom
Conversation
|
Pinging @elastic/es-search |
|
@elasticmachine run elasticsearch-ci/default-distro |
a0f1f77 to
b2e0b99
Compare
jimczi
left a comment
There was a problem hiding this comment.
It looks great @jtibshirani, I left some general comments regarding the implementation but I think that it's the right approach.
There was a problem hiding this comment.
Since ordinals are sorted it should be possible to compute the range of ordinals that belongs to a single prefix (field). You can perform a binary search to find the first and last ordinals that contains the field prefix and use this information to remove the need to call lookupOrd on each value.
There was a problem hiding this comment.
Nice, will try that out!
There was a problem hiding this comment.
I looked into this more closely, and wanted to check I understand your suggestion: when creating a new KeyedJsonAtomicFieldData, we can do a one-time calculation to figure out the relevant range of ordinals. If the set of documents considered during the search is very restricted, this might be slightly more expensive than the current approach, but in general it should cut down a lot on lookups/ bytes comparisons.
There was a problem hiding this comment.
That's the idea yes, we'll need to recompute the min/max for each query (since we want to cache the fielddata only once for all fields) but since it's a binary search it shouldn't be too expensive.
There was a problem hiding this comment.
This is not related to this pr but KeyedJsonFieldType#existsQuery should also use the _field_names field instead of a prefix query ?
There was a problem hiding this comment.
Thanks, I will give this some thought and follow up in a separate issue/ PR.
There was a problem hiding this comment.
The SortField will use the original doc values to perform the sort so you'll need to create a custom SortedSetSortField that uses a KeyedJsonDocValues in order to filter the values that should not participate in the sort.
There was a problem hiding this comment.
I took a closer look at SortedSetDVOrdinalsIndexFieldData to understand how this should be implemented, and was trying to figure out whether this optimization was relevant: https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/index/fielddata/plain/SortedSetDVOrdinalsIndexFieldData.java#L76-L79. Would you be able to give a little bit of background on this logic (originally introduced in #23827)?
There was a problem hiding this comment.
This is not an optimization but a change that was needed to handle index sorting. Index sorting at the index writer level accepts the Lucene's FieldSort only and since we compare index sort with the query sort to activate early termination we needed to make the query sort compatible. This shouldn't be an issue for the json field since this should not be allowed to use it for index sorting.
There was a problem hiding this comment.
That makes sense, thanks!
server/src/test/java/org/elasticsearch/index/fielddata/IndexFieldDataServiceTests.java
Outdated
Show resolved
Hide resolved
6a2375b to
5a83e1f
Compare
b2e0b99 to
ff4ce21
Compare
romseygeek
left a comment
There was a problem hiding this comment.
This looks great! One nit, other than that I think @jimczi covered suggestions I would make
There was a problem hiding this comment.
This should be delegated as well
|
Thanks @jimczi and @romseygeek for reviewing. I'll remove the WIP label and will work on the changes -- will ping you when it's ready for another look. |
5a83e1f to
577b7f6
Compare
c41fd03 to
24ff0e2
Compare
577b7f6 to
1093962
Compare
5e06859 to
b6ae960
Compare
1093962 to
3e35ccd
Compare
b6ae960 to
6da1e1d
Compare
|
@jimczi @romseygeek this is now ready for another review. There is no rush though, as I'll be focused on another project for the rest of the week. |
jimczi
left a comment
There was a problem hiding this comment.
It looks great, I left some minor comments and an idea to optimize the memory needed for aggregations that we discussed with Adrien earlier (remapping of ordinals).
There was a problem hiding this comment.
If there is no match for this key (minOrd==-1) you can directly return DocValues#emptySortedSet ?
There was a problem hiding this comment.
+1 it might help simplify KeyedJsonDocValues a bit as well
There was a problem hiding this comment.
👍 this is certainly cleaner.
There was a problem hiding this comment.
One small optimization that could save some memory in the terms aggregation is to remap the documents ordinals to [0, (maxOrd-minOrd)] during the collection. With global ordinals the terms aggregation allocates one big array based on the size reported by getValueCount so if you have a lot inner fields with lots of different values we'll allocate much more than what we really need. The remapping should happen in lookupOrd (to retrieve the original ordinal) and nextOrd (to remap based on minOrd), getValueCount can return maxOrd-minOrd.
There was a problem hiding this comment.
+1 We might need to keep it for a follow-up however because I think it's going to make IndexOrdinalsFieldData#getOrdinalMap a bit hard to implement given that OrdinalMap is not designed for being extended.
jpountz
left a comment
There was a problem hiding this comment.
I left some suggestions but this looks great in general.
There was a problem hiding this comment.
+1 it might help simplify KeyedJsonDocValues a bit as well
There was a problem hiding this comment.
Doc values already have an optimized way to find the first term that has a prefix via their terms enum, so we could do something like that. In general the underlying implementation does a binary search as well, but often it is a bit more efficient eg. by working directly on the compressed data.
TermsEnum te = sortedSetDocValues.termsEnum();
if (te.seekCeil(prefix) != SeekStatus.END && StringHelper.startsWith(te.term(), prefix)) {
return te.ord();
} else {
return -1;
}There was a problem hiding this comment.
For SortedSetDocValues it looks like this will call into SortedDocValues#lookupTerm, which is very similar to the current implementation. I find the current approach cleaner in that findMinOrd and findMaxOrd follow the same template, and that comparisons happen only on the prefixes. Maybe I could keep it as is for now, but see if it makes a difference when I run some benchmarks (on the TODO list)?
There was a problem hiding this comment.
Calls to advanceExact() are often followed by calls to nextOrd(). With such an implementation, we will iterate ords of the underlying doc values twice. Could we somehow cache the first ord that is greater than or equal to minOrd so that we don't have to call advanceExact again on the underlying doc values here and the first call to nextOrd() will return the cached first matching ord?
There was a problem hiding this comment.
This is a nice idea, will try it out.
There was a problem hiding this comment.
Even though it's impossible here because we are dealing with longs, a good practice when implementing binary search with signed indexes is to use unsigned shifts or unsigned division to avoid issues in case of overflow, ie. (low + high) >>> 1, or Long.divideUnsigned(low + high, 2).
There was a problem hiding this comment.
UncheckedIOException would be a better fit.
There was a problem hiding this comment.
+1 We might need to keep it for a follow-up however because I think it's going to make IndexOrdinalsFieldData#getOrdinalMap a bit hard to implement given that OrdinalMap is not designed for being extended.
There was a problem hiding this comment.
We should fail with out-of-range ords, some features are going to call this method on out-of-range ords, such as if you pass min_doc_count: 0 to a terms agg. For the record, rebasing ordinals would address this issue (but has some challenges as highlighted in another comment).
There was a problem hiding this comment.
We should probably document this limitation.
There was a problem hiding this comment.
This is good to know, I wasn't aware that we scanned the full ordinals array in some places. I added an explicit check for out-of-range ordinals.
I started to add documentation around the aggregations features that were not supported, but it was hard to justify and came off as unintuitive. I'm going to look into rebasing the global ordinals in a follow-up PR to try to solve the issue (and if that's not possible will brainstorm some compromise/ explanation).
3e35ccd to
4ff0edd
Compare
…rch. This strategy reduces the number of lookups + comparisons needed when filtering the prefixed doc values.
1cc1387 to
54ac640
Compare
|
This PR is now ready for another look. I had to force-push the branch because of CI failures -- the first commit since you reviewed is I looked into 'rebasing' the ordinals to lie in the range
|
|
@elasticmachine run elasticsearch-ci/1 |
Now that we error on out-of-range ordinals, supplying a missing sort value no longer works. This will be addressed in a follow-up PR.
jimczi
left a comment
There was a problem hiding this comment.
The change looks great, thanks @jtibshirani
I agree that getOrdinalMap requires some refactoring so let's discuss the remapping in a follow up.
jpountz
left a comment
There was a problem hiding this comment.
LGTM, the caching of the first matching ord looks good to me.
| } | ||
|
|
||
| long ord = delegate.nextOrd(); | ||
| if (ord != NO_MORE_ORDS && ord <= maxOrd) { |
There was a problem hiding this comment.
nit: can we add an assert ord >= minOrd?
| public boolean advanceExact(int target) throws IOException { | ||
| if (delegate.advanceExact(target)) { | ||
| for (long ord = delegate.nextOrd(); ord != NO_MORE_ORDS; ord = delegate.nextOrd()) { | ||
| if (minOrd <= ord && ord <= maxOrd) { |
There was a problem hiding this comment.
nit: we could actually break the loop if ord > maxOrd
|
@elasticmachine run elasticsearch-ci/bwc |
When `doc_values` are enabled, we now add two `SortedSetDocValuesFields` for each token: one containing the raw `value`, and another with `key\0value`. The root JSON field uses the standard `SortedSetDVOrdinalsIndexFieldData`. For keyed fields, this PR introduces a new type ` KeyedJsonIndexFieldData` that wraps the standard ordinals field data and filters out values that do not match the right prefix. This gives support for sorting on JSON fields, as well as simple keyword-style aggregations like `terms`. One slightly tricky aspect is caching of these doc values. Given a keyed JSON field, we need to make sure we don't store values filtered on a certain prefix under the same cache key as ones filtered on a different prefix. However, we also want to load and cache global ordinals only once per keyed JSON field, as opposed to having a separate cache entry per prefix.
When `doc_values` are enabled, we now add two `SortedSetDocValuesFields` for each token: one containing the raw `value`, and another with `key\0value`. The root JSON field uses the standard `SortedSetDVOrdinalsIndexFieldData`. For keyed fields, this PR introduces a new type ` KeyedJsonIndexFieldData` that wraps the standard ordinals field data and filters out values that do not match the right prefix. This gives support for sorting on JSON fields, as well as simple keyword-style aggregations like `terms`. One slightly tricky aspect is caching of these doc values. Given a keyed JSON field, we need to make sure we don't store values filtered on a certain prefix under the same cache key as ones filtered on a different prefix. However, we also want to load and cache global ordinals only once per keyed JSON field, as opposed to having a separate cache entry per prefix.
When `doc_values` are enabled, we now add two `SortedSetDocValuesFields` for each token: one containing the raw `value`, and another with `key\0value`. The root JSON field uses the standard `SortedSetDVOrdinalsIndexFieldData`. For keyed fields, this PR introduces a new type ` KeyedJsonIndexFieldData` that wraps the standard ordinals field data and filters out values that do not match the right prefix. This gives support for sorting on JSON fields, as well as simple keyword-style aggregations like `terms`. One slightly tricky aspect is caching of these doc values. Given a keyed JSON field, we need to make sure we don't store values filtered on a certain prefix under the same cache key as ones filtered on a different prefix. However, we also want to load and cache global ordinals only once per keyed JSON field, as opposed to having a separate cache entry per prefix.
When `doc_values` are enabled, we now add two `SortedSetDocValuesFields` for each token: one containing the raw `value`, and another with `key\0value`. The root JSON field uses the standard `SortedSetDVOrdinalsIndexFieldData`. For keyed fields, this PR introduces a new type ` KeyedJsonIndexFieldData` that wraps the standard ordinals field data and filters out values that do not match the right prefix. This gives support for sorting on JSON fields, as well as simple keyword-style aggregations like `terms`. One slightly tricky aspect is caching of these doc values. Given a keyed JSON field, we need to make sure we don't store values filtered on a certain prefix under the same cache key as ones filtered on a different prefix. However, we also want to load and cache global ordinals only once per keyed JSON field, as opposed to having a separate cache entry per prefix.
When `doc_values` are enabled, we now add two `SortedSetDocValuesFields` for each token: one containing the raw `value`, and another with `key\0value`. The root JSON field uses the standard `SortedSetDVOrdinalsIndexFieldData`. For keyed fields, this PR introduces a new type ` KeyedJsonIndexFieldData` that wraps the standard ordinals field data and filters out values that do not match the right prefix. This gives support for sorting on JSON fields, as well as simple keyword-style aggregations like `terms`. One slightly tricky aspect is caching of these doc values. Given a keyed JSON field, we need to make sure we don't store values filtered on a certain prefix under the same cache key as ones filtered on a different prefix. However, we also want to load and cache global ordinals only once per keyed JSON field, as opposed to having a separate cache entry per prefix.
This is a 'work in progress' PR: I am new to the doc values/ aggregations area, and was hoping to get some early feedback on the approach.
When
doc_valuesare enabled, we now add twoSortedSetDocValuesFieldsfor each token: one containing the rawvalue, and another withkey\0value. The root JSON field uses the standardSortedSetDVOrdinalsIndexFieldData. For keyed fields, this PR introduces a new typeKeyedJsonIndexFieldDatathat wraps the standard ordinals field data and filters out values that do not match the right prefix. This gives support for sorting on JSON fields, as well as simple keyword-style aggregations liketerms.One slightly tricky aspect is caching of these doc values. Given a keyed JSON field, we need to make sure we don't store values filtered on a certain prefix under the same cache key as ones filtered on a different prefix. However, we also want to load and cache global ordinals only once per keyed JSON field, as opposed to having a separate cache entry per prefix.