Skip to content

TSDB: Add support for composite aggregation on _tsid field#81998

Merged
csoulios merged 17 commits intoelastic:masterfrom
csoulios:tsdb-composite
Jan 19, 2022
Merged

TSDB: Add support for composite aggregation on _tsid field#81998
csoulios merged 17 commits intoelastic:masterfrom
csoulios:tsdb-composite

Conversation

@csoulios
Copy link
Copy Markdown
Contributor

@csoulios csoulios commented Dec 21, 2021

Currently, composite aggregation supports only string or numeric values for the fields in the sources part of the composite aggregation. Although _tsid is encoded and stored as a byte array, it is formatted as map {"dim1": "value1", "dim2": value2", ...} for input/output.

This PR adds support for composite aggregation on _tsid field.

Relates to #74660

@elasticmachine elasticmachine added Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) labels Dec 21, 2021
@elasticmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-analytics-geo (Team:Analytics)

1 similar comment
@elasticmachine
Copy link
Copy Markdown
Collaborator

Pinging @elastic/es-analytics-geo (Team:Analytics)

Comment on lines +241 to +244
} else if (obj instanceof Map<?, ?> && configs[i].fieldType().getClass() == TimeSeriesIdFieldType.class) {
// If input is a _tsid map, encode the map to the _tsid BytesRef
// TODO: Must add validations
values[i] = configs[i].format().parseBytesRef(obj);
Copy link
Copy Markdown
Contributor Author

@csoulios csoulios Dec 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the after part of the query, the _tsid field is passed as a Map.

Here we encode the Map to its BytesRef representation. From this point on, we treat the tsid after key as a binary value

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option could be to encode the json map in a string ? These values are not formatted for visualizations, they are part of the after key so the content an be obfuscated.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could base64 encode the tsid and slap it in the string. That'd obfuscate it plenty. I didn't like that idea before Christmas but I can't really remember why now....

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For things like search_after I think it's nice to have a readable after key. That we're formatting dates in composite makes me think we want to format tsid - and the map is the formatted version. But if we just need some key - the base64 encoded version is easier.

@csoulios csoulios requested a review from jimczi December 22, 2021 12:59
Copy link
Copy Markdown
Contributor

@jimczi jimczi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The approach makes sense, another option would be to consider the _tsid as an encoded string. The benefit of the navigable map is to expose all dimensions individually so I understand why it is preferred. Although, it should be the responsibility of the _tsid mapper to validate that the map is parseable.

- skip:
version: " - 8.0.99"
reason: introduced in 8.1.0

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since it's a new field type, I'd recommend to also add unit tests in CompositeAggregatorTests for coverage.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@csoulios csoulios changed the title WIP: Add support for composite aggregation on _tsid field [TSDB] Add support for composite aggregation on _tsid field Dec 22, 2021
- skip:
version: " - 8.0.99"
reason: introduced in 8.1.0

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

SortedMap<String, BytesReference> dimensionFields = context.doc().getDimensionBytes();
if (dimensionFields.isEmpty()) {
BytesReference timeSeriesId = buildTsidField(dimensionFields);
assert timeSeriesId != null : "In time series mode _tsid cannot be null";
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the method can really return null anyway. Maybe not worth checking.

Comment on lines +241 to +244
} else if (obj instanceof Map<?, ?> && configs[i].fieldType().getClass() == TimeSeriesIdFieldType.class) {
// If input is a _tsid map, encode the map to the _tsid BytesRef
// TODO: Must add validations
values[i] = configs[i].format().parseBytesRef(obj);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could base64 encode the tsid and slap it in the string. That'd obfuscate it plenty. I didn't like that idea before Christmas but I can't really remember why now....

Comment on lines +241 to +244
} else if (obj instanceof Map<?, ?> && configs[i].fieldType().getClass() == TimeSeriesIdFieldType.class) {
// If input is a _tsid map, encode the map to the _tsid BytesRef
// TODO: Must add validations
values[i] = configs[i].format().parseBytesRef(obj);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For things like search_after I think it's nice to have a readable after key. That we're formatting dates in composite makes me think we want to format tsid - and the map is the formatted version. But if we just need some key - the base64 encoded version is easier.

@csoulios csoulios requested a review from nik9000 January 5, 2022 14:44
Copy link
Copy Markdown
Member

@nik9000 nik9000 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, though I pinged @romseygeek for a second opinion And! Can you look at search_after? It's quite related to composite and it probably should also "just work" after this PR. Or, at least, it should be close.


@Override
public BytesRef parseBytesRef(Object value) {
if (value instanceof Map<?, ?> m) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think an early return would be more readable here, even if you have to sacrifice the pattern matching. I don't know enough about Java's pattern matching to know if you have to though.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added an early return and a cast to Map to make it more readable. Dropped the pattern matching

@@ -199,7 +200,12 @@ private static Object convertValueFromSortType(String fieldName, SortField.Type

case STRING_VAL:
case STRING:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh boy. This is a funny case where STRING actually means "sort by a byte array. Oh yeah, and usually that's utf-8 encoded string, but not always". Yikes. Sad Nik.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I don't think there is much you can do about it because its a lucene concept.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@romseygeek may be able to double check on this. I think it's right, but still sad.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BytesRefs compare byte-by-byte, no need to worry about UTF-8 encoding or whatever.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TSID is stored as a SortedSetDocValuesField, so what will actually get compared here is their ordinal number - SSDV is stored as a terms dictionary containing all the (sorted and deduplicated) values, and then a per-document ordinal. So to retrieve the value you get the ord for the current doc and then get the value from the terms dictionary, but to sort on it you can just compare their ordinal values - much faster than comparing byte by byte.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what will actually get compared here is their ordinal number

Yeah. I should have said "sort as though by byte array" - it's using ordinals. The thing that bothers me is that it uses the word STRING here when it means BYTES. But I'm being grumbly.

@csoulios csoulios changed the title [TSDB] Add support for composite aggregation on _tsid field TSDB: Add support for composite aggregation on _tsid field Jan 19, 2022
@csoulios csoulios merged commit 8ae9781 into elastic:master Jan 19, 2022
@csoulios csoulios deleted the tsdb-composite branch May 3, 2022 16:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

>non-issue :StorageEngine/TSDB You know, for Metrics Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.1.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants