Analytics service: possible improvements and ideas for the future #2311
Description
Is your feature request related to a problem? Please describe.
This issue has been opened to collect possible improvements for the analytics API, in case of need, after questions about performance were raised in #2307 prior to the refactoring in #2308 (AFAIK yet to be measured/verified for the use case that led to #2307).
Describe the solution you'd like
There are possible incremental solutions (not all are alternatives to each other; some can be applied in combination). Some require a trivial amount of effort, others require more work.
Theoretically all should be measured against the status quo to understand if needed or not.
Possible solutions
Hard limiting the date range of analytics that can be requested
The proposed solution would put a hard limit on how wide a range of dates a client can request from the analytics API. As mentioned by @Zhao-Andy, some public APIs like Twitter's limit requests to up to 30 days.
Pros
- setting a cap on the amount of data a client can request is a good idea for obvious reasons
- most analytics clients will use rolling windows on dates anyway or are interested in a specific period
Cons
- it would probably require special care if any client is to be treated differently (e.g. a commercial partnership where the client asks to access more data at the same time)
- it could lead to more HTTP requests, and hence more queries, from clients interested in data for more than 30 days
My recommendation
I think it's a great idea and should definitely be considered, though it should be applied as the result of a clear data policy (maybe even explicitly explained in the documentation), not just for performance reasons (performance issues, if encountered, should be solved to the best of our abilities, not just "hidden" by limiting the date range).
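A cap like this can be enforced with a small guard before any query runs. A minimal sketch in plain Ruby (the method name and the 30-day cap are assumptions for illustration, not existing code):

```ruby
require "date"

# Hypothetical guard: reject analytics requests whose date range exceeds
# a hard cap (30 days here, mirroring the Twitter-style limit mentioned above).
MAX_RANGE_DAYS = 30

def validate_date_range!(start_date, end_date)
  days = (end_date - start_date).to_i
  raise ArgumentError, "start date must precede end date" if days.negative?
  raise ArgumentError, "range capped at #{MAX_RANGE_DAYS} days" if days > MAX_RANGE_DAYS

  days
end

validate_date_range!(Date.new(2019, 4, 1), Date.new(2019, 4, 15)) # => 14
```

In the app this check would live in the controller (or a params object), returning a 422 instead of raising; any special-cased partner client could be given a different cap at that layer.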
Adding indexes to columns in where conditions
Most queries in the AnalyticsService filter on the created_at columns of the tables reactions, comments and page_views. The proposed solution is to add indexes on those columns to improve retrieval speed for large data sets.
Pros
- a request that asks for a large number of articles over a wide date range should be aided by indexes
Cons
- indexes can slow down writes for tables with a high frequency of writes
- as it currently stands the code splits the queries over each day, so the difference in most cases might not be noticeable
My recommendation
I think this is a quick win and should seriously be considered as a companion to any other solution. WHERE conditions on dates are a classic place where an index helps. Ideally one should open the SQL console and measure the query before and after with EXPLAIN ANALYZE, but it's not mandatory. I also don't believe, AFAIK, there's going to be any impact on writes, because what I perceive to be the biggest table of the three, reactions, doesn't get millions of INSERTs at a time. Anyhow, that can be monitored with the slow query tab on Heroku in case the problem presents itself in the future.
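As a sketch, the indexes could be added with a standard Rails migration (class name is made up; the Rails version in the superclass is an assumption, so adjust to the project's). `algorithm: :concurrently` is PostgreSQL-specific and avoids locking the tables against writes while the index is built, at the cost of requiring `disable_ddl_transaction!`:

```ruby
# Hypothetical migration adding indexes on the created_at columns used by
# the AnalyticsService date filters.
class AddCreatedAtIndexesForAnalytics < ActiveRecord::Migration[5.2]
  # CREATE INDEX CONCURRENTLY cannot run inside a transaction.
  disable_ddl_transaction!

  def change
    add_index :reactions,  :created_at, algorithm: :concurrently
    add_index :comments,   :created_at, algorithm: :concurrently
    add_index :page_views, :created_at, algorithm: :concurrently
  end
end
```

After migrating, `EXPLAIN ANALYZE` on one of the date-ranged queries should show an index scan instead of a sequential scan.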
Use grouping
The proposed solution is to let the DB group data and return aggregates, instead of computing aggregates with a separate query for each day in the range of dates.
Pros
- the database is generally really good at mixing grouping and where conditions (provided there are indexes on the needed columns)
- it would lower the amount of queries
Cons
- it requires testing. Ideally one should write a test that populates the DB with a data sample spanning 2 or 3 days, write an assertion/expectation about the result, then go about improving the internals of the service, making sure the expected data matches the new structure of querying (usual testing, nothing to see here hehe)
- it requires some trial and error and knowledge of PostgreSQL's GROUP BY clause
My recommendation
I'm obviously biased in favor of this ("let the DB do its job" as a mantra), provided the expected result allows for it (I haven't looked deeply into it, so it might not be feasible). Rails natively supports GROUP BY and HAVING (the clauses needed for grouping and for filtering on groups) through the ActiveRecord methods group and having. @lightalloy has mentioned the gem groupdate, geared specifically at grouping by dates through an agnostic interface that works on all supported DBs; it could come in handy for this and future date-based aggregations. The syntax in vanilla Rails is pretty straightforward:
```
[7] pry(main)> PageView.where(created_at: 2.days.ago..Time.current).group("date(created_at)").sum(:counts_for_number_of_views)
   (0.8ms)  SELECT SUM("page_views"."counts_for_number_of_views") AS sum_counts_for_number_of_views, date(created_at) AS date_created_at FROM "page_views" WHERE "page_views"."created_at" BETWEEN $1 AND $2 GROUP BY date(created_at)  [["created_at", "2019-04-03 08:47:39.869415"], ["created_at", "2019-04-05 08:47:39.869509"]]
=> {Thu, 04 Apr 2019=>7}
```

You can verify that the query works correctly with a quick look at the data:

```
[8] pry(main)> PageView.all.pluck(:counts_for_number_of_views, :created_at)
   (0.5ms)  SELECT "page_views"."counts_for_number_of_views", "page_views"."created_at" FROM "page_views"
=> [[1, Thu, 04 Apr 2019 21:28:07 UTC +00:00],
 [1, Thu, 04 Apr 2019 21:28:56 UTC +00:00],
 [1, Thu, 04 Apr 2019 21:39:25 UTC +00:00],
 [1, Thu, 04 Apr 2019 22:39:33 UTC +00:00],
 [1, Thu, 04 Apr 2019 22:41:03 UTC +00:00],
 [1, Thu, 04 Apr 2019 22:43:49 UTC +00:00],
 [1, Thu, 04 Apr 2019 22:45:07 UTC +00:00]]
```

Pre-compute data using batching and compute the difference in real time
The proposed solution is to add an analytics table that contains aggregated data, populated by a scheduled recurring job.
Pros
- this has the obvious pro of near-instantaneous read time: the client asks for a date range, a SELECT is issued against the aggregated table, and the data is returned
- it can lead to further optimizations down the line (if combined with caching or other changing business requirements I can't foresee right now)
Cons
- it requires a bit of effort upfront
- it requires a merge function that still has to go to the live tables (for when the aggregated data covers less than the requested range)
- it may require an eviction policy (how long are aggregates kept in storage?)
- what happens to the old rows of aggregated data if there's a bug in the analytics function or if new metrics are added?
My recommendation
I'd consider this once the other options have been exhausted. If there's a massive amount of data continuously requested by HTTP clients, and indexes and grouping are failing you, then a switch in approach is the way to go. I'm always a bit reluctant to go "batching first" because most of the time there are other avenues to exhaust before. Batching also requires some logic for merging, eviction and coping with stale data (or having to recompute older rows in the background).
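To make the merge concern above concrete, here is a minimal plain-Ruby sketch (all names are hypothetical; the aggregate table and the live query are stubbed as a Hash and a lambda):

```ruby
require "date"

# Sketch of the "merge" step: serve each requested day from the
# pre-computed aggregates when present, and fall back to the live tables
# (stubbed here by a lambda) for days not yet batched.
def merge_analytics(precomputed, requested_dates, live_lookup)
  requested_dates.each_with_object({}) do |date, result|
    result[date] = precomputed.fetch(date) { live_lookup.call(date) }
  end
end

precomputed = { Date.new(2019, 4, 3) => 12, Date.new(2019, 4, 4) => 7 }
live_lookup = ->(_date) { 3 } # stand-in for a real-time COUNT on the live tables
dates = (Date.new(2019, 4, 3)..Date.new(2019, 4, 5)).to_a

merge_analytics(precomputed, dates, live_lookup)
# 2019-04-03 and 2019-04-04 come from the aggregate table; only 2019-04-05
# hits the live lookup
```

The same shape also makes the eviction question explicit: rows deleted from the aggregate table simply fall back to the (slower) live path.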
Low level caching on results
The proposed solution is to guard one or each metric with a cache, so that an individual client requesting a specific range twice will have its data returned from memory the second time.
Pros
- as with batching the data can be returned quickly once the cache is primed
- by knowing usage patterns in advance, one could pre-compute cached values for specific users/organizations that have particularly massive computations ahead of them
- it shouldn't have a big impact on the memory occupied by the caching server
Cons
- it's unlikely that an organization will ask for the same date range many times, hence caching might defeat the purpose (organizations usually write scheduled scripts that retrieve analytics and update internal dashboards :D)
- caching individual dates would require too many trips to the caching server, and a SQL query might be faster in that case
- as with batching (which is basically a cache written to disk), caches need to be managed: if tomorrow a new metric is added, there has to be a mechanism to re-populate older caches, or at least empty all existing ones. This requires a bit of analysis
My recommendation
I'd go down this route after exhausting all "live optimizations", though it can be extremely beneficial for bigger organizations, extremely popular users, or organizations/users whose analytics are requested continuously. That said, aggregated analytics are usually asked for once a day by each client, so it's easy to plan the impact on the systems.
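The pattern here is essentially Rails.cache.fetch wrapped around the metric computation. A self-contained sketch, with an in-memory Hash standing in for the cache store and an illustrative key scheme (both are assumptions, not existing code):

```ruby
# In-memory stand-in for Rails.cache: in the app this would be
# Rails.cache.fetch(key, expires_in: ...) backed by the caching server.
CACHE = {}

def cached_metric(cache_key)
  # Run the expensive block once per key; later calls for the same
  # key return the stored value without recomputing.
  CACHE.fetch(cache_key) { CACHE[cache_key] = yield }
end

computations = 0
key = "analytics-views-org42-2019-04-01..2019-04-30" # hypothetical key scheme

first  = cached_metric(key) { computations += 1; 7 } # computed
second = cached_metric(key) { computations += 1; 7 } # served from the cache
# first == second == 7, and computations is still 1
```

Embedding the date range (and a schema version) in the key sidesteps part of the invalidation problem: adding a new metric means new keys, and the old entries simply expire.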