
Simplify AnalyticsService code#2308

Merged
benhalpern merged 13 commits into forem:master from rhymes:rhymes/simplify-analytics
Apr 4, 2019

Conversation

@rhymes
Contributor

@rhymes rhymes commented Apr 4, 2019

What type of PR is this? (check all applicable)

  • Refactor
  • Feature
  • Bug Fix
  • Documentation Update

Description

This refactoring mainly does three things:

  • simplifies AnalyticsService interface
  • uses SQL queries instead of in-memory calculations to retrieve counts
  • uses date objects for ranges

A few things I've left out that should be discussed/decided:

  • AnalyticsService should probably be renamed to Analytics::Pro as suggested by @lightalloy
  • date.strftime("%a, %m/%d") is locale dependent and can repeat itself (if start_date and end_date span more than a year, two different dates can produce the same string). Also, users outside the US could misread a date like "03/04" (3rd of March or 4th of April?). I suggest using the ISO format "2019-04-01" as the hash key.
  • I think the regexp-based date validation is unnecessary now that dates are parsed:
      def valid_date_params?
        date_regex = /\A\d{4}-\d{1,2}-\d{1,2}\Z/ # for example, 2019-03-22 or 2019-2-1
        if params[:end]
          (params[:start] =~ date_regex)&.zero? && (params[:end] =~ date_regex)&.zero?
        else
          (params[:start] =~ date_regex)&.zero?
        end
      end
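For illustration, both suggestions could be sketched in plain Ruby (hypothetical helper names, not code from this PR):

```ruby
require "date"

# Use the unambiguous ISO-8601 form as the hash key instead of the
# locale-dependent and potentially repeating "%a, %m/%d".
def date_key(date)
  date.iso8601 # e.g. "2019-04-01"
end

# Parse-based validation instead of the regex: strptime raises
# ArgumentError on malformed input, and unlike the regex it also
# rejects impossible dates such as "2019-13-01".
def valid_date_param?(value)
  return false if value.nil?
  Date.strptime(value, "%Y-%m-%d")
  true
rescue ArgumentError
  false
end
```

`%m` and `%d` accept one or two digits, so inputs like "2019-2-1" (which the old regex allowed) still validate.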

Related Tickets & Documents

Closes #2307

Mobile & Desktop Screenshots/Recordings (if there are UI changes)

Added to documentation?

  • docs.dev.to
  • readme
  • no documentation needed

@rhymes
Contributor Author

rhymes commented Apr 4, 2019

@Zhao-Andy I'm pinging you here

(let me know what you think 😀)

@Zhao-Andy
Contributor

Thanks for the refactor! One concern I had with using SQL queries, though, is that it could mean a lot of queries when covering a timeframe of n days. For example, with 7 days we would be issuing 7 * 9 queries to get the data formatted properly, and that's not counting the queries that happen at initialization. What are your thoughts on SQL queries vs. memory?

@rhymes
Contributor Author

rhymes commented Apr 4, 2019

@Zhao-Andy you're right, that could be a concern.

If the requester asks for a huge range of dates, that could mean a lot of queries, but each query is still bounded by the requested date range, so in theory there's no risk of querying millions of records. Also, we can add indexes in the future to speed up queries (for example, none of those tables has an index on created_at, which I'd consider adding in either case).

One could benchmark the two versions in terms of speed and memory and see what happens, but you'd need a realistic DB (and a bit of spare time) to do that, and I can't with the development DB.

There's a third option, which is to see if we can reorganize some of the queries to take advantage of group by and reduce the number of queries.

What do you think about the other points I've left out?

Contributor

@Zhao-Andy Zhao-Andy left a comment


Left some comments!

Zhao-Andy and others added 2 commits April 4, 2019 20:10
@Zhao-Andy
Contributor

> If the requester asks for a huge range of dates that could mean a lot of queries but it's still inside the requested date range, so there's no risk of querying millions of records in theory.

We might want to add a hard limit on how large the date range can be. Twitter's API only allows requesting up to 4 weeks at a time. A 30-day period would mean 270 queries, but I really don't know enough about performance to say whether this is better or worse than memory. :(
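A hard cap like that could be enforced before running any queries. A minimal pure-Ruby sketch, with a hypothetical MAX_RANGE_DAYS constant (the 4-week figure borrowed from the Twitter comparison above):

```ruby
require "date"

MAX_RANGE_DAYS = 28 # hypothetical cap, mirroring Twitter's 4-week window

# Reject the request up front if the range is inverted or too large,
# so the per-day queries are never issued for an oversized range.
def range_allowed?(start_date, end_date)
  return false if end_date < start_date
  (end_date - start_date).to_i <= MAX_RANGE_DAYS
end
```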

> Also, we can add indexes in the future to speed up queries (for example all those tables don't have indexes on created_at which I'd consider adding in both cases)

I think adding indexes would make sense.

> One could benchmark the two versions in terms of speed and memory and see what happens but you'd need a realistic DB (and a bit of spare time) to do that and I can't with the development DB

I might try this; I'll let you know how it goes if I do.

> There's a third option, which is to see if we can re-organize some of the queries to take advantage of group by and lower the amount of queries.

I actually used group_by before in chart_decorator.rb:
https://github.com/thepracticaldev/dev.to/blob/fc39f24ee36a7bcc816e78f4ca69edf368c04183/app/decorators/chart_decorator.rb#L15-L17

The issue with it was that it left out dates that had no data. I guess we can solve that by setting a default value of sorts before grouping. Personally I'm leaning toward an implementation that uses SQL + group_by, but again I don't know enough to make a call (no pun intended).
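The missing-dates problem can be solved by merging the grouped results against the full requested range. A pure-Ruby sketch (illustrative names, not the chart_decorator.rb code):

```ruby
require "date"

# Group timestamps by calendar day, then walk the whole requested
# range so days with no data appear with a zero count instead of
# being dropped.
def daily_counts(timestamps, start_date, end_date)
  grouped = timestamps.group_by(&:to_date)
  (start_date..end_date).each_with_object({}) do |date, counts|
    counts[date.iso8601] = grouped.fetch(date, []).size
  end
end
```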

@aspittel
Contributor

aspittel commented Apr 4, 2019

Quick idea: make a model/table that stores the data by day/week and only calculate for new data, since the old data will stay the same. We can use whenever jobs to create the "cached" data each day/week. Then it's only one query when the user retrieves the data, and a more expensive one once a day or once a week.
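The core of this idea is that historical days never change, so only "today" needs a live query. A pure-Ruby sketch of the merge step (hypothetical names; the real version would read cached rows from a model refreshed by a whenever job):

```ruby
require "date"

# cached maps ISO date strings to precomputed counts for past days.
# Only the current day falls back to the live query (passed in as a
# lambda so the expensive work runs at most once per request).
def counts_for_range(cached, start_date, end_date, today, live_count_for_today)
  (start_date..end_date).each_with_object({}) do |date, out|
    out[date.iso8601] =
      date == today ? live_count_for_today.call : cached.fetch(date.iso8601, 0)
  end
end
```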

@rhymes
Contributor Author

rhymes commented Apr 4, 2019

> We might want to add a hard limit to how large the date range can be. Twitter's API only allows requesting up to 4 weeks at a time. A 30 day period would mean 270 queries, but I really don't know enough about performance to say whether or not this is better/worse than memory. :(

Oh, that's a good idea. Maybe in a separate PR, though.

> I think adding indexes would make sense.

I'd add these in another PR as well. First I would merge this and see if we still have those timeouts you mentioned in #2307. One step at a time; performance improvements should always have some kind of benchmark or self-evident reason behind them eheh

> I actually used group_by before in chart_decorator.rb:

That's Enumerable's group_by, though, which requires all the objects to be in memory. The group by I was referring to is SQL/ActiveRecord's: https://guides.rubyonrails.org/active_record_querying.html#group

But once again, I would do this only if and when needed.

> The issue with it was that it left out dates that had no data. I guess we can solve that with setting a default value of sorts before grouping. Personally leaning toward an implementation that uses SQL + group_by, but again don't know enough to make a call (no pun intended).

ahhahaha great pun :D

@rhymes
Contributor Author

rhymes commented Apr 4, 2019

> Quick idea: make a model/table to store the data by day/week and only calculate for new data since the old data will stay the same. We can use whenever jobs to create the "cached" data each day/week. Then it's only one query when the user retrieves the data and a more expensive one once a week or once a day.

Hi @aspittel I have a question: is it still slow with this PR? I can't see it from my end.

Because going from real time to batching requires a lot more code and coordination (as you said, you need a new model and recurring jobs to update the table/view for each pro user and to calculate the "diff").

@aspittel
Contributor

aspittel commented Apr 4, 2019

Everything has been fast for me so far. My concern is that since we're using this for a data dashboard with presumably a lot of weeks on it, we may want to consider that down the road. I think this is great for now, but we may want to think about further optimizations in the future!

@rhymes rhymes changed the title from [WIP] Simplify AnalyticsService code to Simplify AnalyticsService code Apr 4, 2019
@pr-triage pr-triage bot added the PR: unreviewed label Apr 4, 2019
@benhalpern
Contributor

It's not crystal clear what the right answer is, but I'm down to merge this.

I think @aspittel's serialization ideas make sense but as @rhymes noted, it's a lot of new code!

What if we make use of low-level/Russian-doll Rails caching wherever possible to solve these problems for now? Once the API stabilizes and our bigger concerns are perf/scaling, we can look at solving those problems then.

@pr-triage pr-triage bot added the PR: reviewed-approved label and removed the PR: unreviewed label Apr 4, 2019
@benhalpern benhalpern merged commit 5e9867e into forem:master Apr 4, 2019
@pr-triage pr-triage bot removed the PR: reviewed-approved label Apr 4, 2019
@pr-triage pr-triage bot added the PR: merged label Apr 4, 2019
@rhymes
Contributor Author

rhymes commented Apr 4, 2019

@benhalpern

No, it's not clear what the right answer is; there's a bit of trial and error to go through.

A summary of possible improvements for posterity:

  • adding a hard limit on the range of dates the client can request
  • adding indexes on created_at in each table queried for analytics
  • optimizing the queries by grouping inside the DB
  • using batching
  • using low-level caching on results

I think that's all that has been proposed in this discussion.

Thanks for the merge!

@rhymes rhymes deleted the rhymes/simplify-analytics branch April 4, 2019 20:35
@lightalloy
Contributor

I missed the whole discussion, but I still want to note that it's a great refactoring, thanks @rhymes.
Maybe the results of the discussion should be moved to a separate issue to help with further optimizations.

There's a gem that could possibly be used to help with more effective queries.

@rhymes
Contributor Author

rhymes commented Apr 5, 2019

@lightalloy I collected all possible improvements discussed and my thoughts here #2311 (it's a longish read :D)
