TSDB: Add support for composite aggregation on _tsid field#81998
TSDB: Add support for composite aggregation on _tsid field#81998csoulios merged 17 commits intoelastic:masterfrom
_tsid field#81998Conversation
|
Pinging @elastic/es-analytics-geo (Team:Analytics) |
1 similar comment
|
Pinging @elastic/es-analytics-geo (Team:Analytics) |
| } else if (obj instanceof Map<?, ?> && configs[i].fieldType().getClass() == TimeSeriesIdFieldType.class) { | ||
| // If input is a _tsid map, encode the map to the _tsid BytesRef | ||
| // TODO: Must add validations | ||
| values[i] = configs[i].format().parseBytesRef(obj); |
There was a problem hiding this comment.
In the after part of the query, the _tsid field is passed as a Map.
Here we encode the Map to its BytesRef representation. From this point on, we treat the tsid after key as a binary value
There was a problem hiding this comment.
Another option could be to encode the json map in a string ? These values are not formatted for visualizations, they are part of the after key so the content an be obfuscated.
There was a problem hiding this comment.
We could base64 encode the tsid and slap it in the string. That'd obfuscate it plenty. I didn't like that idea before Christmas but I can't really remember why now....
There was a problem hiding this comment.
For things like search_after I think it's nice to have a readable after key. That we're formatting dates in composite makes me think we want to format tsid - and the map is the formatted version. But if we just need some key - the base64 encoded version is easier.
server/src/main/java/org/elasticsearch/search/searchafter/SearchAfterBuilder.java
Outdated
Show resolved
Hide resolved
jimczi
left a comment
There was a problem hiding this comment.
The approach makes sense, another option would be to consider the _tsid as an encoded string. The benefit of the navigable map is to expose all dimensions individually so I understand why it is preferred. Although, it should be the responsibility of the _tsid mapper to validate that the map is parseable.
| - skip: | ||
| version: " - 8.0.99" | ||
| reason: introduced in 8.1.0 | ||
|
|
There was a problem hiding this comment.
Since it's a new field type, I'd recommend to also add unit tests in CompositeAggregatorTests for coverage.
_tsid field_tsid field
| - skip: | ||
| version: " - 8.0.99" | ||
| reason: introduced in 8.1.0 | ||
|
|
| SortedMap<String, BytesReference> dimensionFields = context.doc().getDimensionBytes(); | ||
| if (dimensionFields.isEmpty()) { | ||
| BytesReference timeSeriesId = buildTsidField(dimensionFields); | ||
| assert timeSeriesId != null : "In time series mode _tsid cannot be null"; |
There was a problem hiding this comment.
I don't think the method can really return null anyway. Maybe not worth checking.
| } else if (obj instanceof Map<?, ?> && configs[i].fieldType().getClass() == TimeSeriesIdFieldType.class) { | ||
| // If input is a _tsid map, encode the map to the _tsid BytesRef | ||
| // TODO: Must add validations | ||
| values[i] = configs[i].format().parseBytesRef(obj); |
There was a problem hiding this comment.
We could base64 encode the tsid and slap it in the string. That'd obfuscate it plenty. I didn't like that idea before Christmas but I can't really remember why now....
| } else if (obj instanceof Map<?, ?> && configs[i].fieldType().getClass() == TimeSeriesIdFieldType.class) { | ||
| // If input is a _tsid map, encode the map to the _tsid BytesRef | ||
| // TODO: Must add validations | ||
| values[i] = configs[i].format().parseBytesRef(obj); |
There was a problem hiding this comment.
For things like search_after I think it's nice to have a readable after key. That we're formatting dates in composite makes me think we want to format tsid - and the map is the formatted version. But if we just need some key - the base64 encoded version is easier.
nik9000
left a comment
There was a problem hiding this comment.
LGTM, though I pinged @romseygeek for a second opinion And! Can you look at search_after? It's quite related to composite and it probably should also "just work" after this PR. Or, at least, it should be close.
|
|
||
| @Override | ||
| public BytesRef parseBytesRef(Object value) { | ||
| if (value instanceof Map<?, ?> m) { |
There was a problem hiding this comment.
I think an early return would be more readable here, even if you have to sacrifice the pattern matching. I don't know enough about Java's pattern matching to know if you have to though.
There was a problem hiding this comment.
Added an early return and a cast to Map to make it more readable. Dropped the pattern matching
| @@ -199,7 +200,12 @@ private static Object convertValueFromSortType(String fieldName, SortField.Type | |||
|
|
|||
| case STRING_VAL: | |||
| case STRING: | |||
There was a problem hiding this comment.
Oh boy. This is a funny case where STRING actually means "sort by a byte array. Oh yeah, and usually that's utf-8 encoded string, but not always". Yikes. Sad Nik.
There was a problem hiding this comment.
But I don't think there is much you can do about it because its a lucene concept.
There was a problem hiding this comment.
@romseygeek may be able to double check on this. I think it's right, but still sad.
There was a problem hiding this comment.
BytesRefs compare byte-by-byte, no need to worry about UTF-8 encoding or whatever.
There was a problem hiding this comment.
TSID is stored as a SortedSetDocValuesField, so what will actually get compared here is their ordinal number - SSDV is stored as a terms dictionary containing all the (sorted and deduplicated) values, and then a per-document ordinal. So to retrieve the value you get the ord for the current doc and then get the value from the terms dictionary, but to sort on it you can just compare their ordinal values - much faster than comparing byte by byte.
There was a problem hiding this comment.
what will actually get compared here is their ordinal number
Yeah. I should have said "sort as though by byte array" - it's using ordinals. The thing that bothers me is that it uses the word STRING here when it means BYTES. But I'm being grumbly.
_tsid field_tsid field
Currently, composite aggregation supports only string or numeric values for the fields in the
sourcespart of the composite aggregation. Although_tsidis encoded and stored as a byte array, it is formatted as map{"dim1": "value1", "dim2": value2", ...}for input/output.This PR adds support for composite aggregation on
_tsidfield.Relates to #74660