feat(discover2) Add more aggregate field support#14321
Conversation
This is an alternate approach to modifying the discover endpoints to fit the requirements of discover2/events search. By extending the existing events endpoint with rollup and additional functions/aliases we can get what we need. This approach will require some shim code in the UI to convert saved discover queries into a compatible field list and query string, but that work now seems simpler than building out the remaining ancillary endpoints in discover. Refs SEN-811
| raise OrganizationEventsError(exc.message) | ||
|
|
||
| # Filter out special aggregates. | ||
| self._filter_unspecified_special_fields_in_conditions(snuba_args, set()) |
There was a problem hiding this comment.
This was removed as we won't be supporting filtering on aggregate columns in the short term, and with fully dynamic aggregate columns we would need expand the query parser to accomodate aggregate filtering.
There was a problem hiding this comment.
Can we go ahead and remove the function completely? sorry, just saw you did this
| if any(field for field in groupby if field not in ALLOWED_GROUPINGS): | ||
| message = ('Invalid groupby value requested. Allowed values are ' + | ||
| ', '.join(ALLOWED_GROUPINGS)) | ||
| return Response({'detail': message}, status=400) |
There was a problem hiding this comment.
This was removed because the endpoint no longer allows the end user to define groupby. Instead groupings are implicit based on the fields requested.
| 'device.simulator', 'error.handled', 'issue.id', 'stack.colno', | ||
| 'stack.in_app', 'stack.lineno', 'stack.stack_level', | ||
| # OrganizationEvents aggregations | ||
| 'event_count', 'user_count', |
There was a problem hiding this comment.
Removing these as they aren't a thing anymore.
src/sentry/api/event_search.py
Outdated
| # OrganizationEvents aggregations | ||
| 'event_count', 'user_count', | ||
|
|
||
| 'duration', |
There was a problem hiding this comment.
Is this going to break regular event/issue search if someone is using these field names?
There was a problem hiding this comment.
The removed fields were only really available on events-v2, which no customers have access to.
There was a problem hiding this comment.
This code is not just used in events-v2 though, it's used for event search generally. Try typing duration:something into issue or event search and you will see what I mean
There was a problem hiding this comment.
I doubt anyone knows about these fields since they were never documented
There was a problem hiding this comment.
Unfortunately people do use duration as a tag though. And 500ing on a query with this term in it doesn't seem great
There was a problem hiding this comment.
I can remove duration for now. Given that some customers are using this as a tag would we want to expose the real duration as event.duration?
| u'{}_{}'.format(match.group('function'), match.group('column')) | ||
| ]) | ||
|
|
||
| rollup = snuba_args.get('rollup') |
There was a problem hiding this comment.
Rollup gets used by timeseries endpoints, which I'm planning to update soon.
There was a problem hiding this comment.
Is rollup synonymous with granularity in snuba?
src/sentry/api/event_search.py
Outdated
| # Ensure fields we require to build a functioning interface | ||
| # are present. We don't add fields when using a rollup as the additional fields | ||
| # would be aggregated away. | ||
| if 'project.id' not in columns: |
There was a problem hiding this comment.
Similar to id on the line below, won't the aggregations check be needed?
There was a problem hiding this comment.
We always need a project id in order to generate a link to the event in the modal. This complicates cross project aggregation. Another possibility is to use argMax(project_id, timestamp) to get the project for the latest event.
There was a problem hiding this comment.
What if the logic in this block that is related to building a functioning interface and modal links lived in the user interface instead instead of here? I think the API could be solely responsible for making sure queries are valid and then running them, not modifying queries so that the UI gets the field it needs. Otherwise this behavior would be super confusing if someone is using the API directly and getting back random fields that they didn't request just because our UI needs it.
There was a problem hiding this comment.
My concern with that is we'll be storing field lists in the URL and in the database. Storing all the decomposed fields makes URLs more fragile as removing an expanded field could cause incorrect rendering. It also makes future changes harder. If we want to modify how a field is expanded we can't easily do that because the expansion is saved incorrectly in the database.
| sortField: 'event_count', | ||
| // TODO generalize this. | ||
| 'count(id)': { | ||
| sortField: 'count_id', |
There was a problem hiding this comment.
Can you only sort by one column?
There was a problem hiding this comment.
The links only add/change a single sort condition.
There was a problem hiding this comment.
Ok. For events we use -timestamp, -event_id or timestamp, event_id so we have a stable order, which we don't have here. Not sure if it matters though.
| columns.append(field) | ||
| continue | ||
|
|
||
| validate_aggregate(field, match) |
There was a problem hiding this comment.
It could be nice if all the validations were done in one place up front, but i'm guessing this is not as easy?
There was a problem hiding this comment.
This function (resolve_field_list) is called pretty early in the organization_events endpoint processing. I would like to use a DRF serializer for this work but other request validation steps get_filter_params() are embedded in endpoint logic.
src/sentry/api/event_search.py
Outdated
| if aggregations and 'latest_event' not in fields: | ||
| columns.extend(deepcopy(FIELD_ALIASES['latest_event']['fields'])) | ||
|
|
||
| basic_fields = [col for col in columns if isinstance(col, six.string_types)] |
There was a problem hiding this comment.
When is a column name not a string? Wouldn't that be a validation error if that ever happened?
There was a problem hiding this comment.
The column name can be a non-string value when a field alias is used. For example the latest_event field gets expanded into a function using argMax.
There was a problem hiding this comment.
I've added validation to disallow end users from providing non-string field names.
There was a problem hiding this comment.
Gotcha. IMO it's easier to read if you put the special expanded fields (which are actually aggregations) in a separate list since they have a different type and need to be handled differently
There was a problem hiding this comment.
All the field aliases that expand into aggregates now update aggregations. I've had to hack in a string function call for latest_event as snuba doesn't support multiple columns in aggregate functions just yet. I've opened a pull to fix that though.
src/sentry/api/event_search.py
Outdated
|
|
||
| basic_fields = [col for col in columns if isinstance(col, six.string_types)] | ||
| if rollup and basic_fields: | ||
| raise InvalidSearchQuery('You cannot combine rollup with non-aggregate fields') |
There was a problem hiding this comment.
Isn't that ok since they get added to groupby clause?
There was a problem hiding this comment.
Maybe, my understanding of rollup is that it cuts the data up into timeslices. Getting a non-aggregate column in that kind of result set is not helpful as it breaks the rollup slices.
There was a problem hiding this comment.
Why won't it be useful? I could see something like Errors grouped by version rolled up by week being pretty useful.
There was a problem hiding this comment.
Fair point. I'll take this out.
There was a problem hiding this comment.
On second thought, this scenario is an error if there are no aggregates being included. However, basic fields + aggregates should be valid.
| VALID_AGGREGATES = { | ||
| 'count_unique': { | ||
| 'snuba_name': 'uniq', | ||
| 'fields': '*', |
There was a problem hiding this comment.
I can imagine also wanting to do all fields excluding some small list
There was a problem hiding this comment.
@lynnagara im wondering, is there an aggregate function where we would want to exclude certain columns?
There was a problem hiding this comment.
I personally would prefer to use safe-lists rather than ban-lists for the acceptable aggregate columns.
There was a problem hiding this comment.
The situation where I was thinking we might need a ban list is where we want to support the aggregate on custom user-defined tag columns (can't use a safe list) but might need to exclude some internal stuff
Generally not a useful scenario
Don't inject a project.id field when aggregating across non-project fields. Instead use argMax() to get a project.name from the latest event.
| }, | ||
| # This doesn't work yet, but is an illustration of how it could work | ||
| 'p75': { | ||
| 'snuba_name': 'quantileTiming(0.75)', |
There was a problem hiding this comment.
I think we wanted to stop passing function calls like this explicitly, and try something like
[quantileTiming, [0.75]]. Not sure if this is something that would be directly passed to snuba though?
There was a problem hiding this comment.
I'm fine with an array form as well. The quantile functions are a bit funny as they need to end up in clickhouse like quantileTiming(0.75)(duration) I'm not sure we've done that kind of function call in snuba before.
There was a problem hiding this comment.
We might need snuba to translate that for us, not sure. I feel like we had something similar to this before, can't remember exactly what it was though. Could potentially do [quantileTiming, [0.75, 'duration']] and have snuba sort it out. I think other patterns we've done were to extract information from the name, but that's probably not a great practice (something like [quantileTiming_75, ['duration']])
There was a problem hiding this comment.
Yeah the implementation of the quantile functions is still imcomplete, but we can figure that out once storage for transactions is sorted.
src/sentry/api/event_search.py
Outdated
| # OrganizationEvents aggregations | ||
| 'event_count', 'user_count', | ||
|
|
||
| 'duration', |
There was a problem hiding this comment.
I doubt anyone knows about these fields since they were never documented
dashed
left a comment
There was a problem hiding this comment.
Looks good to me.
I'll defer to @wedamija and @lynnagara for approval.
We need to use string function names in the short term as snuba's json schema validation prevents aggregates on more than one column. I've also changed the logic to allow rollup with basic fields as long as at least one aggregate is also specified.
We have a use-case in sentry where we want to use `argMax()` which requires multiple columns to work. While this function is allowed in `selected_columns` it is actually an aggregate value and should probably be in `aggregations`. This expands the schema and query builder to allow lists of columns in aggregations functions. Refs getsentry/sentry#14321
We have a use-case in sentry where we want to use `argMax()` which requires multiple columns to work. While this function is allowed in `selected_columns` it is actually an aggregate value and should probably be in `aggregations`. This expands the schema and query builder to allow lists of columns in aggregations functions. Refs getsentry/sentry#14321
Duration is currently in use as a tag by a few customers.
wedamija
left a comment
There was a problem hiding this comment.
lgtm, probably worth lyn having final signoff
We have a use-case in sentry where we want to use `argMax()` which requires multiple columns to work. While this function is allowed in `selected_columns` it is actually an aggregate value and should probably be in `aggregations`. This expands the schema and query builder to allow lists of columns in aggregations functions. Refs getsentry/sentry#14321
We have a use-case in sentry where we want to use `argMax()` which requires multiple columns to work. While this function is allowed in `selected_columns` it is actually an aggregate value and should probably be in `aggregations`. This expands the schema and query builder to allow lists of columns in aggregations functions. Refs getsentry/sentry#14321
This is an alternate approach to modifying the discover endpoints to fit the requirements of discover2/events search. By extending the existing events endpoint with rollup and additional functions/aliases we can get what we need.
The rollup parameter isn't used right now but will be used when we update events-stats which will be one of the next tasks.
This approach will require some shim code in the UI to convert saved discover queries into a compatible field list and query string, but that work now seems simpler than building out the remaining ancillary endpoints in discover.
Refs SEN-810