-
-
Notifications
You must be signed in to change notification settings - Fork 4.6k
feat(occurrences on eap): Implement double reads of bucketed group counts #105691
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…nto double-read-query-groups-past-counts-eap
|
bugbot run |
wedamija
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it matter much that we're adding a couple more queries here and slowing down processing? If so, it could be worth delegating the extra queries to a separate thread
| results = EAPOccurrencesComparator.check_and_choose( | ||
| snuba_results, | ||
| eap_results, | ||
| "issues.escalating.query_groups_past_counts", | ||
| is_experimental_data_a_null_result=len(eap_results) == 0, | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be helpful to include a reasonable match comparator or something along those lines. I think that we're likely to have small differences between the two systems just due to the fact that we don't ingest data at the same rate. It'd be good to have some kind of similarity % to compare even better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't include a reasonable match comparator in this case (since it's a little more involved), but we have that on some of the other read paths we've migrated so far. Right now we're only issuing reads in S4S (where we've been writing to EAP for almost 3 months at this point, so we should be almost at the point where the two data stores are in parity for groups in retention).
Re increased latency by adding more queries: yep this is on my todo list. We've only migrated a handful of read paths over to double reads so far, but as we do more we'll just generally be adding latency across the app which is not ideal. I'm thinking that the best solution here is to switch to a sampling approach (where we double read only a portion of the time) rather than double reading on every query, so that we'll still get representative metrics for comparison but reduce the latency impact.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, the sampling approach sounds decent. It's also not too hard to delegate to threads if you want to go down that path
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In terms of data accuracy - historically it'll probably match up, but recently ingested data can be out of sync since there's no guarantee that writes to EAP are synced up with writes to snuba. It'll be a small discrepancy, but might be worth taking into account if you're seeing some queries not match up. Something you can handle later on, anyway
thetruecpaul
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally LGTM, one Q
| all_results += _query_groups_eap_by_org(error_groups) | ||
| all_results += _query_groups_eap_by_org(other_groups) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Q here: why not make a single call with error_groups + other_groups?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was only to keep parity with the Snuba implementation, since that query orders the results such that error groups come first and then issue platform groups. Of course we don't have the same distinction in EAP, so we could definitely combine the queries, but then there's no guarantee that we get results in the same order as Snuba and therefore have exact_match on the comparator be informative.
Maybe we could keep this double query for the rollout to make sure we maintain parity and I can make a ticket to eventually condense this into one query?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, that sounds good! Definitely not a blocker.
Adds EAP double-reading support for the
query_groups_past_countsfunction used in escalating issue detection.Changes:
_query_groups_past_counts_eapthat queries hourly event counts grouped byproject_idandgroup_idquery_groups_past_countsto use the double-read pattern withEAPOccurrencesComparatorRan some test queries to compare results on local devserver: