Add doc values support for JSON fields. by jtibshirani · Pull Request #40069 · elastic/elasticsearch

jtibshirani · 2019-03-14T19:35:29Z

This is a 'work in progress' PR: I am new to the doc values/aggregations area, and was hoping to get some early feedback on the approach.

When doc_values are enabled, we now add two SortedSetDocValuesFields for each token: one containing the raw value, and another with key\0value. The root JSON field uses the standard SortedSetDVOrdinalsIndexFieldData. For keyed fields, this PR introduces a new type KeyedJsonIndexFieldData that wraps the standard ordinals field data and filters out values that do not match the right prefix. This gives support for sorting on JSON fields, as well as simple keyword-style aggregations like terms.

One slightly tricky aspect is caching of these doc values. Given a keyed JSON field, we need to make sure we don't store values filtered on a certain prefix under the same cache key as ones filtered on a different prefix. However, we also want to load and cache global ordinals only once per keyed JSON field, as opposed to having a separate cache entry per prefix.
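To make the `key\0value` layout concrete, here is a minimal sketch that uses plain Java strings in place of Lucene's `BytesRef` values. Only the `key\0value` encoding comes from the description above; the class and method names (`KeyedValueSketch`, `encode`, `valuesForKey`) are hypothetical and are not taken from the PR.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class KeyedValueSketch {
    // Separator between key and value, as in the description (key\0value).
    static final char SEPARATOR = '\0';

    // Build the keyed doc value for one token (hypothetical helper).
    static String encode(String key, String value) {
        return key + SEPARATOR + value;
    }

    // Keep only the values stored under the given key and strip the prefix.
    static List<String> valuesForKey(List<String> keyedValues, String key) {
        String prefix = key + SEPARATOR;
        List<String> result = new ArrayList<>();
        for (String keyed : keyedValues) {
            if (keyed.startsWith(prefix)) {
                result.add(keyed.substring(prefix.length()));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> docValues = Arrays.asList(
            encode("labels.priority", "urgent"),
            encode("labels.release", "v1.2"),
            encode("labels.release", "v1.3"));
        System.out.println(valuesForKey(docValues, "labels.release"));
    }
}
```

Because the prefix includes the separator, a key such as `a` does not accidentally match values stored under `ab`.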

elasticmachine · 2019-03-14T19:35:31Z

Pinging @elastic/es-search

jtibshirani · 2019-03-14T21:20:22Z

@elasticmachine run elasticsearch-ci/default-distro

jimczi

It looks great @jtibshirani, I left some general comments regarding the implementation but I think that it's the right approach.

jimczi · 2019-03-14T22:03:45Z

server/src/main/java/org/elasticsearch/index/mapper/KeyedJsonAtomicFieldData.java

Since ordinals are sorted, it should be possible to compute the range of ordinals that belongs to a single prefix (field). You can perform a binary search to find the first and last ordinals that contain the field prefix and use this information to remove the need to call lookupOrd on each value.

Nice, will try that out!

I looked into this more closely, and wanted to check I understand your suggestion: when creating a new KeyedJsonAtomicFieldData, we can do a one-time calculation to figure out the relevant range of ordinals. If the set of documents considered during the search is very restricted, this might be slightly more expensive than the current approach, but in general it should cut down a lot on lookups and bytes comparisons.

That's the idea yes, we'll need to recompute the min/max for each query (since we want to cache the fielddata only once for all fields) but since it's a binary search it shouldn't be too expensive.
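A rough sketch of the idea discussed in this thread, assuming the terms dictionary can be modeled as a sorted `String[]` whose index plays the role of the ordinal. The names `PrefixOrdRangeSketch`, `comparePrefix`, `findMinOrd` and `findMaxOrd` are used for illustration only and the bodies are not the PR's implementation; Lucene sorts terms by UTF-8 bytes, whereas this stand-in relies on Java's natural `String` order.

```java
class PrefixOrdRangeSketch {
    // Compare a term against the prefix, looking only at the first prefix.length() chars.
    // Returns 0 when the term starts with the prefix.
    static int comparePrefix(String term, String prefix) {
        int len = Math.min(term.length(), prefix.length());
        int cmp = term.substring(0, len).compareTo(prefix.substring(0, len));
        if (cmp != 0) {
            return cmp;
        }
        // A term shorter than the prefix sorts before every prefixed term.
        return term.length() < prefix.length() ? -1 : 0;
    }

    // Smallest ordinal whose term starts with the prefix, or -1 if there is none.
    static long findMinOrd(String[] sortedTerms, String prefix) {
        long low = 0, high = sortedTerms.length - 1, result = -1;
        while (low <= high) {
            long mid = (low + high) >>> 1;
            int cmp = comparePrefix(sortedTerms[(int) mid], prefix);
            if (cmp == 0) {
                result = mid;      // candidate; keep looking to the left
                high = mid - 1;
            } else if (cmp < 0) {
                low = mid + 1;
            } else {
                high = mid - 1;
            }
        }
        return result;
    }

    // Largest ordinal whose term starts with the prefix, or -1 if there is none.
    static long findMaxOrd(String[] sortedTerms, String prefix) {
        long low = 0, high = sortedTerms.length - 1, result = -1;
        while (low <= high) {
            long mid = (low + high) >>> 1;
            int cmp = comparePrefix(sortedTerms[(int) mid], prefix);
            if (cmp == 0) {
                result = mid;      // candidate; keep looking to the right
                low = mid + 1;
            } else if (cmp < 0) {
                low = mid + 1;
            } else {
                high = mid - 1;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] terms = {"a\0x", "a\0y", "b\0p", "c\0q", "c\0r"};
        System.out.println(findMinOrd(terms, "c\0") + ".." + findMaxOrd(terms, "c\0"));
    }
}
```

Once the range is known, filtering a document's ordinals only needs two integer comparisons per value instead of a term lookup.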

jimczi · 2019-03-14T22:16:17Z

server/src/main/java/org/elasticsearch/index/mapper/JsonFieldMapper.java

This is not related to this PR, but shouldn't KeyedJsonFieldType#existsQuery also use the _field_names field instead of a prefix query?

Thanks, I will give this some thought and follow up in a separate issue/ PR.

jimczi · 2019-03-14T22:28:38Z

server/src/main/java/org/elasticsearch/index/mapper/JsonFieldMapper.java

The SortField will use the original doc values to perform the sort so you'll need to create a custom SortedSetSortField that uses a KeyedJsonDocValues in order to filter the values that should not participate in the sort.

I took a closer look at SortedSetDVOrdinalsIndexFieldData to understand how this should be implemented, and was trying to figure out whether this optimization was relevant: https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/index/fielddata/plain/SortedSetDVOrdinalsIndexFieldData.java#L76-L79. Would you be able to give a little bit of background on this logic (originally introduced in #23827)?

This is not an optimization but a change that was needed to handle index sorting. Index sorting at the index writer level accepts only Lucene's FieldSort, and since we compare the index sort with the query sort to activate early termination, we needed to make the query sort compatible. This shouldn't be an issue for the JSON field, since it should not be allowed for index sorting.

That makes sense, thanks!

server/src/test/java/org/elasticsearch/index/fielddata/IndexFieldDataServiceTests.java

romseygeek

This looks great! One nit, other than that I think @jimczi covered suggestions I would make

romseygeek · 2019-03-17T11:24:06Z

server/src/main/java/org/elasticsearch/index/mapper/KeyedJsonAtomicFieldData.java

This should be delegated as well

jtibshirani · 2019-03-18T19:06:02Z

Thanks @jimczi and @romseygeek for reviewing. I'll remove the WIP label and will work on the changes -- will ping you when it's ready for another look.

jtibshirani · 2019-03-19T21:29:44Z

@jimczi @romseygeek this is now ready for another review. There is no rush though, as I'll be focused on another project for the rest of the week.

jimczi

It looks great, I left some minor comments and an idea to optimize the memory needed for aggregations that we discussed with Adrien earlier (remapping of ordinals).

jimczi · 2019-03-22T14:39:36Z

server/src/main/java/org/elasticsearch/index/mapper/KeyedJsonAtomicFieldData.java

If there is no match for this key (minOrd==-1), can you directly return DocValues#emptySortedSet?

+1 it might help simplify KeyedJsonDocValues a bit as well

👍 this is certainly cleaner.

jimczi · 2019-03-22T14:52:07Z

docs/reference/mapping/types/json.asciidoc

WHen -> When

jimczi · 2019-03-22T15:17:10Z

server/src/main/java/org/elasticsearch/index/mapper/KeyedJsonAtomicFieldData.java

One small optimization that could save some memory in the terms aggregation is to remap the documents' ordinals to [0, (maxOrd-minOrd)] during the collection. With global ordinals, the terms aggregation allocates one big array based on the size reported by getValueCount, so if you have a lot of inner fields with lots of different values we'll allocate much more than what we really need. The remapping should happen in lookupOrd (to retrieve the original ordinal) and nextOrd (to remap based on minOrd); getValueCount can return maxOrd-minOrd.

+1 We might need to keep it for a follow-up however because I think it's going to make IndexOrdinalsFieldData#getOrdinalMap a bit hard to implement given that OrdinalMap is not designed for being extended.
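For readers following the remapping idea, a minimal sketch is shown here. It models the global terms as a sorted `String[]` and assumes `minOrd` and `maxOrd` are both inclusive, so the rebased range [0, maxOrd-minOrd] holds `maxOrd - minOrd + 1` values. `RebasedOrdsSketch` and `rebase` are hypothetical names, and the sketch does not attempt the `getOrdinalMap` part that makes the real change hard.

```java
class RebasedOrdsSketch {
    private final String[] globalTerms; // sorted terms; index = original ordinal
    private final long minOrd;          // first ordinal for the key (inclusive)
    private final long maxOrd;          // last ordinal for the key (inclusive)

    RebasedOrdsSketch(String[] globalTerms, long minOrd, long maxOrd) {
        this.globalTerms = globalTerms;
        this.minOrd = minOrd;
        this.maxOrd = maxOrd;
    }

    // Number of distinct values for this key only, so per-ordinal arrays stay small.
    long getValueCount() {
        return maxOrd - minOrd + 1;
    }

    // What a rebased nextOrd() would hand out for an original ordinal.
    long rebase(long originalOrd) {
        return originalOrd - minOrd;
    }

    // Map a rebased ordinal back to the original one before looking up the term.
    String lookupOrd(long rebasedOrd) {
        if (rebasedOrd < 0 || rebasedOrd >= getValueCount()) {
            throw new IllegalArgumentException("ordinal out of range: " + rebasedOrd);
        }
        return globalTerms[(int) (rebasedOrd + minOrd)];
    }

    public static void main(String[] args) {
        String[] terms = {"a\0x", "a\0y", "b\0p", "c\0q", "c\0r"};
        RebasedOrdsSketch ords = new RebasedOrdsSketch(terms, 3, 4);
        System.out.println(ords.getValueCount());
        System.out.println(ords.rebase(4));
    }
}
```

With rebased ordinals, a caller that scans every ordinal in [0, getValueCount()) never sees a value outside the key's range.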

jpountz

I left some suggestions but this looks great in general.

jpountz · 2019-03-25T18:05:01Z

server/src/main/java/org/elasticsearch/index/mapper/KeyedJsonAtomicFieldData.java

+1 it might help simplify KeyedJsonDocValues a bit as well

jpountz · 2019-03-25T18:14:39Z

server/src/main/java/org/elasticsearch/index/mapper/KeyedJsonAtomicFieldData.java

Doc values already have an optimized way to find the first term that has a prefix via their terms enum, so we could do something like that. In general the underlying implementation does a binary search as well, but often it is a bit more efficient eg. by working directly on the compressed data.

TermsEnum te = sortedSetDocValues.termsEnum(); if (te.seekCeil(prefix) != SeekStatus.END && StringHelper.startsWith(te.term(), prefix)) { return te.ord(); } else { return -1; }

For SortedSetDocValues it looks like this will call into SortedDocValues#lookupTerm, which is very similar to the current implementation. I find the current approach cleaner in that findMinOrd and findMaxOrd follow the same template, and that comparisons happen only on the prefixes. Maybe I could keep it as is for now, but see if it makes a difference when I run some benchmarks (on the TODO list)?

Fine with me.
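As a rough stdlib analogue of the seek-based lookup discussed in this thread, the sketch below uses `Arrays.binarySearch` to find the first term greater than or equal to the prefix and then checks that it actually starts with the prefix. This is only a stand-in: it does not use Lucene's `TermsEnum`, and `SeekCeilSketch` is a hypothetical name.

```java
import java.util.Arrays;

class SeekCeilSketch {
    // Returns the index of the first term that starts with the prefix, or -1.
    static int findMinOrd(String[] sortedTerms, String prefix) {
        int idx = Arrays.binarySearch(sortedTerms, prefix);
        // When the prefix is not itself a term, binarySearch returns
        // -(insertionPoint) - 1; the insertion point is the "ceiling" position.
        int ceil = idx >= 0 ? idx : -idx - 1;
        if (ceil < sortedTerms.length && sortedTerms[ceil].startsWith(prefix)) {
            return ceil;
        }
        return -1;
    }

    public static void main(String[] args) {
        String[] terms = {"a\0x", "a\0y", "b\0p", "c\0q", "c\0r"};
        System.out.println(findMinOrd(terms, "c\0"));
    }
}
```

The ceiling position is enough for the minimum ordinal because any term that sorts between the prefix and a prefixed term must itself start with the prefix.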

jpountz · 2019-03-25T18:17:48Z

server/src/main/java/org/elasticsearch/index/mapper/KeyedJsonAtomicFieldData.java

Calls to advanceExact() are often followed by calls to nextOrd(). With such an implementation, we will iterate ords of the underlying doc values twice. Could we somehow cache the first ord that is greater than or equal to minOrd so that we don't have to call advanceExact again on the underlying doc values here and the first call to nextOrd() will return the cached first matching ord?

This is a nice idea, will try it out.
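One way the caching could look, sketched with a `long[][]` standing in for the underlying doc values (each row holds one document's sorted ordinals). The class name and the array-backed delegate are assumptions made for illustration; `NO_MORE_ORDS` mirrors the `-1` sentinel of Lucene's `SortedSetDocValues`.

```java
class CachedFirstOrdSketch {
    static final long NO_MORE_ORDS = -1;

    private final long[][] docOrds; // stand-in delegate: sorted ordinals per document
    private final long minOrd;      // inclusive
    private final long maxOrd;      // inclusive

    private long[] current;
    private int pos;
    private long cachedNextOrd = NO_MORE_ORDS;

    CachedFirstOrdSketch(long[][] docOrds, long minOrd, long maxOrd) {
        this.docOrds = docOrds;
        this.minOrd = minOrd;
        this.maxOrd = maxOrd;
    }

    // Position on the document and remember the first ordinal in [minOrd, maxOrd].
    boolean advanceExact(int doc) {
        current = docOrds[doc];
        pos = 0;
        cachedNextOrd = NO_MORE_ORDS;
        while (pos < current.length) {
            long ord = current[pos++];
            if (ord > maxOrd) {
                break; // ordinals are sorted, so nothing later can match
            }
            if (ord >= minOrd) {
                cachedNextOrd = ord;
                return true;
            }
        }
        return false;
    }

    // The first call after advanceExact returns the cached ordinal without re-reading.
    long nextOrd() {
        if (cachedNextOrd != NO_MORE_ORDS) {
            long ord = cachedNextOrd;
            cachedNextOrd = NO_MORE_ORDS;
            return ord;
        }
        if (current == null || pos >= current.length) {
            return NO_MORE_ORDS;
        }
        long ord = current[pos++];
        return ord <= maxOrd ? ord : NO_MORE_ORDS;
    }
}
```

The underlying ordinals for a document are consumed once: `advanceExact` stops at the first match, and `nextOrd` continues from that position.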

jpountz · 2019-03-25T18:23:33Z

server/src/main/java/org/elasticsearch/index/mapper/KeyedJsonAtomicFieldData.java

Even though overflow is impossible here because we are dealing with longs, a good practice when implementing binary search with signed indexes is to use unsigned shifts or unsigned division to avoid issues in case of overflow, i.e. (low + high) >>> 1, or Long.divideUnsigned(low + high, 2).
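A small illustration of the overflow concern, using `int` indexes so that the wrap-around is easy to trigger. The class and method names are made up for this example.

```java
class MidpointSketch {
    // Signed division: wrong once low + high wraps past Integer.MAX_VALUE.
    static int signedMid(int low, int high) {
        return (low + high) / 2;
    }

    // Unsigned shift: treats the wrapped sum as an unsigned value.
    static int unsignedMid(int low, int high) {
        return (low + high) >>> 1;
    }

    public static void main(String[] args) {
        int low = 1 << 30;
        int high = Integer.MAX_VALUE;
        System.out.println(signedMid(low, high));   // negative: the sum wrapped around
        System.out.println(unsignedMid(low, high)); // a valid midpoint between low and high
    }
}
```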

jpountz · 2019-03-25T18:34:36Z

server/src/main/java/org/elasticsearch/index/mapper/KeyedJsonAtomicFieldData.java

UncheckedIOException would be a better fit.

jpountz · 2019-03-25T18:35:58Z

server/src/main/java/org/elasticsearch/index/mapper/KeyedJsonAtomicFieldData.java

+1 We might need to keep it for a follow-up however because I think it's going to make IndexOrdinalsFieldData#getOrdinalMap a bit hard to implement given that OrdinalMap is not designed for being extended.

jpountz · 2019-03-25T18:44:04Z

server/src/main/java/org/elasticsearch/index/mapper/KeyedJsonAtomicFieldData.java

We should fail with out-of-range ords: some features are going to call this method on out-of-range ords, for example if you pass min_doc_count: 0 to a terms agg. For the record, rebasing ordinals would address this issue (but has some challenges, as highlighted in another comment).

We should probably document this limitation.

This is good to know, I wasn't aware that we scanned the full ordinals array in some places. I added an explicit check for out-of-range ordinals.

I started to add documentation around the aggregations features that were not supported, but it was hard to justify and came off as unintuitive. I'm going to look into rebasing the global ordinals in a follow-up PR to try to solve the issue (and if that's not possible will brainstorm some compromise/ explanation).


jtibshirani · 2019-03-28T22:41:57Z

This PR is now ready for another look. I had to force-push the branch because of CI failures -- the first commit since you reviewed is 7058eeb.

I looked into 'rebasing' the ordinals to lie in the range [0, (maxOrd-minOrd)], and as @jpountz anticipated it's quite tricky due to the changes needed to KeyedJsonIndexFieldData#getOrdinalMap. Here are the changes I'm planning for follow-up PRs:

Try to rebase the ordinals for keyed JSON fields to start at 0.
Fix or document the issues around scanning through the whole ordinals array (for example, setting min_doc_count: 0 fails). This is related to the above work, as it wouldn't be an issue if the ordinals always lay in the range [0, (maxOrd-minOrd)].
I also realized that eager_global_ordinals is only applied to the root JSON field, and is being skipped for the keyed field data. I plan to address this in a follow-up, since it might require a small refactor to JsonFieldMapper.

jtibshirani · 2019-03-29T19:15:02Z

@elasticmachine run elasticsearch-ci/1
@elasticmachine run elasticsearch-ci/2
@elasticmachine run elasticsearch-ci/bwc

Now that we error on out-of-range ordinals, supplying a missing sort value no longer works. This will be addressed in a follow-up PR.

jimczi

The change looks great, thanks @jtibshirani
I agree that getOrdinalMap requires some refactoring, so let's discuss the remapping in a follow-up.

jpountz

LGTM, the caching of the first matching ord looks good to me.

jpountz · 2019-04-01T13:18:56Z

server/src/main/java/org/elasticsearch/index/mapper/KeyedJsonAtomicFieldData.java

+            }
+
+            long ord = delegate.nextOrd();
+            if (ord != NO_MORE_ORDS && ord <= maxOrd) {


nit: can we add an assert ord >= minOrd?

jpountz · 2019-04-01T13:20:58Z

server/src/main/java/org/elasticsearch/index/mapper/KeyedJsonAtomicFieldData.java

+        public boolean advanceExact(int target) throws IOException {
+            if (delegate.advanceExact(target)) {
+                for (long ord = delegate.nextOrd(); ord != NO_MORE_ORDS; ord = delegate.nextOrd()) {
+                     if (minOrd <= ord && ord <= maxOrd) {


nit: we could actually break the loop if ord > maxOrd

jtibshirani · 2019-04-01T19:00:41Z

@elasticmachine run elasticsearch-ci/bwc


jtibshirani added the >enhancement and :Search Foundations/Mapping labels Mar 14, 2019

jtibshirani requested review from jimczi and romseygeek March 14, 2019 19:36

jtibshirani added the WIP label Mar 14, 2019

jtibshirani force-pushed the json-doc-values branch from a0f1f77 to b2e0b99 Compare March 14, 2019 21:56

jimczi reviewed Mar 14, 2019


jtibshirani force-pushed the object-fields branch from 6a2375b to 5a83e1f Compare March 14, 2019 22:46

jtibshirani force-pushed the json-doc-values branch from b2e0b99 to ff4ce21 Compare March 14, 2019 22:48

romseygeek requested changes Mar 17, 2019


jtibshirani removed the WIP label Mar 18, 2019

jtibshirani force-pushed the object-fields branch from 5a83e1f to 577b7f6 Compare March 19, 2019 01:35

jtibshirani force-pushed the json-doc-values branch from c41fd03 to 24ff0e2 Compare March 19, 2019 01:39

jtibshirani force-pushed the object-fields branch from 577b7f6 to 1093962 Compare March 19, 2019 03:32

jtibshirani force-pushed the json-doc-values branch 2 times, most recently from 5e06859 to b6ae960 Compare March 19, 2019 17:39

jtibshirani force-pushed the object-fields branch from 1093962 to 3e35ccd Compare March 19, 2019 20:09

jtibshirani force-pushed the json-doc-values branch from b6ae960 to 6da1e1d Compare March 19, 2019 20:17

jimczi reviewed Mar 22, 2019


jpountz reviewed Mar 25, 2019


jtibshirani requested a review from romseygeek March 28, 2019 22:20

jtibshirani force-pushed the object-fields branch from 3e35ccd to 4ff0edd Compare March 28, 2019 22:30

jtibshirani added 4 commits March 28, 2019 15:30

Create doc values fields for JSON fields.

1719b02

Support filtered doc values for keyed JSON fields.

d634d3b

Add an eager_global_ordinals setting.

ef3851c

Add checks around caching to IndexFieldDataServiceTests.

94d4425

jtibshirani added 11 commits March 28, 2019 15:30

Add javadoc.

4f9598e

Make sure to delegate getChildResources.

5124a6f

Make sure we always filter non-matching prefixes when sorting.

08cc504

Calculate the first and last ordinals for a prefix through binary search.

7dd443a

This strategy reduces the number of lookups + comparisons needed when filtering the prefixed doc values.

Update the reference documentation.

3097ed1

Refactor to return DocValues.emptySortedSet() when no terms match.

7058eeb

Use an unsigned shift in the binary search methods.

19e7d2a

Fix a typo in the docs.

f482d3e

Use UncheckedIOException.

e0a5505

Cache the first ordinal in a document.

a390337

Error when an out-of-bounds ordinal is passed to lookupOrd.

54ac640

jtibshirani force-pushed the json-doc-values branch from 1cc1387 to 54ac640 Compare March 28, 2019 22:40

This was referenced Mar 28, 2019

Flattened object fields design + implementation #33003

Closed

Support flattened field type from Elasticsearch elastic/kibana#25820

Open

Fix FieldSortIT#testJsonField.

be00cf4

Now that we error on out-of-range ordinals, supplying a missing sort value no longer works. This will be addressed in a follow-up PR.

jimczi approved these changes Apr 1, 2019


jpountz approved these changes Apr 1, 2019


Break out of advanceExact when ord > maxOrd.

c0c1a9b

jtibshirani merged commit 80b0c08 into elastic:object-fields Apr 1, 2019

jtibshirani deleted the json-doc-values branch April 1, 2019 22:49

jtibshirani mentioned this pull request Apr 15, 2019

[DRAFT] Rebase keyed JSON ordinals to start from zero. #41220

Closed
