
Simplify AnalyticsService code#2308

Merged
benhalpern merged 13 commits into forem:master from rhymes:rhymes/simplify-analytics
Apr 4, 2019

Conversation

@rhymes
Contributor

@rhymes rhymes commented Apr 4, 2019

What type of PR is this? (check all applicable)

  • Refactor
  • Feature
  • Bug Fix
  • Documentation Update

Description

This refactoring mainly does three things:

  • simplifies AnalyticsService interface
  • uses SQL queries instead of in-memory calculations to retrieve counts
  • uses date objects for ranges

A few things I've left out that should be discussed/decided:

  • AnalyticsService should probably be renamed to Analytics::Pro as suggested by @lightalloy
  • date.strftime("%a, %m/%d") is locale dependent and can repeat itself (if start_date and end_date span more than a year, two different dates can produce the same string). Also, users outside the US could misread a date like "03/04" (3rd of March or 4th of April?). I suggest using the ISO format "2019-04-01" as the hash key.
  • I think the regexp-based date validation is unnecessary now that dates are parsed:
      def valid_date_params?
        date_regex = /\A\d{4}-\d{1,2}-\d{1,2}\Z/ # for example, 2019-03-22 or 2019-2-1
        if params[:end]
          (params[:start] =~ date_regex)&.zero? && (params[:end] =~ date_regex)&.zero?
        else
          (params[:start] =~ date_regex)&.zero?
        end
      end
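For illustration, both suggestions could be sketched in plain Ruby (hypothetical helper names, not code from this PR):

```ruby
require "date"

# Use the unambiguous ISO-8601 form as the hash key instead of the
# locale-dependent and potentially repeating "%a, %m/%d".
def date_key(date)
  date.iso8601 # e.g. "2019-04-01"
end

# Parse-based validation instead of the regex: strptime raises
# ArgumentError on malformed input, and unlike the regex it also
# rejects impossible dates such as "2019-13-01".
def valid_date_param?(value)
  return false if value.nil?
  Date.strptime(value, "%Y-%m-%d")
  true
rescue ArgumentError
  false
end
```

`%m` and `%d` accept one or two digits, so inputs like "2019-2-1" (which the old regex allowed) still validate.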

Related Tickets & Documents

Closes #2307

Mobile & Desktop Screenshots/Recordings (if there are UI changes)

Added to documentation?

  • docs.dev.to
  • readme
  • no documentation needed

@rhymes
Contributor Author

rhymes commented Apr 4, 2019

@Zhao-Andy I'm pinging you here

(let me know what you think 😀)

@Zhao-Andy
Contributor

Thanks for the refactor! One concern I had with using SQL queries, though, is that it could mean a lot of queries when covering a timeframe of n days. For example, with 7 days we would be issuing 7 * 9 queries to get the data formatted properly, and that's not counting the queries that happen at initialization. What are your thoughts on SQL queries vs. memory?

@rhymes
Contributor Author

rhymes commented Apr 4, 2019

@Zhao-Andy you're right, that could be a concern.

If the requester asks for a huge range of dates, that could mean a lot of queries, but each query is still bounded by the requested date range, so in theory there's no risk of querying millions of records. Also, we can add indexes in the future to speed up queries (for example, none of those tables has an index on created_at, which I'd consider adding in either case).

One could benchmark the two versions in terms of speed and memory and see what happens, but you'd need a realistic DB (and a bit of spare time) to do that, and I can't with the development DB.

There's a third option, which is to see if we can reorganize some of the queries to take advantage of group by and reduce the number of queries.

What do you think about the other points I've left out?

Contributor

@Zhao-Andy Zhao-Andy left a comment


Left some comments!

Zhao-Andy and others added 2 commits April 4, 2019 20:10
@Zhao-Andy
Contributor

> If the requester asks for a huge range of dates that could mean a lot of queries but it's still inside the requested date range, so there's no risk of querying millions of records in theory.

We might want to add a hard limit on how large the date range can be. Twitter's API only allows requesting up to 4 weeks at a time. A 30-day period would mean 270 queries, but I really don't know enough about performance to say whether this is better or worse than memory. :(
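A hard cap like that could be enforced before running any queries. A minimal pure-Ruby sketch, with a hypothetical MAX_RANGE_DAYS constant (the 4-week figure borrowed from the Twitter comparison above):

```ruby
require "date"

MAX_RANGE_DAYS = 28 # hypothetical cap, mirroring Twitter's 4-week window

# Reject the request up front if the range is inverted or too large,
# so the per-day queries are never issued for an oversized range.
def range_allowed?(start_date, end_date)
  return false if end_date < start_date
  (end_date - start_date).to_i <= MAX_RANGE_DAYS
end
```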

> Also, we can add indexes in the future to speed up queries (for example all those tables don't have indexes on created_at which I'd consider adding in both cases)

I think adding indexes would make sense.

> One could benchmark the two versions in terms of speed and memory and see what happens but you'd need a realistic DB (and a bit of spare time) to do that and I can't with the development DB

I might try this; I'll let you know how it goes if I do.

> There's a third option, which is to see if we can re-organize some of the queries to take advantage of group by and lower the amount of queries.

I actually used group_by before in chart_decorator.rb:
https://github.com/thepracticaldev/dev.to/blob/fc39f24ee36a7bcc816e78f4ca69edf368c04183/app/decorators/chart_decorator.rb#L15-L17

The issue with it was that it left out dates that had no data. I guess we can solve that by setting a default value of sorts before grouping. Personally I'm leaning toward an implementation that uses SQL + group_by, but again I don't know enough to make a call (no pun intended).
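The missing-dates problem can be solved by merging the grouped results against the full requested range. A pure-Ruby sketch (illustrative names, not the chart_decorator.rb code):

```ruby
require "date"

# Group timestamps by calendar day, then walk the whole requested
# range so days with no data appear with a zero count instead of
# being dropped.
def daily_counts(timestamps, start_date, end_date)
  grouped = timestamps.group_by(&:to_date)
  (start_date..end_date).each_with_object({}) do |date, counts|
    counts[date.iso8601] = grouped.fetch(date, []).size
  end
end
```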

@aspittel
Contributor

aspittel commented Apr 4, 2019

Quick idea: make a model/table that stores the data by day/week and only calculate for new data, since the old data will stay the same. We can use whenever jobs to create the "cached" data each day/week. Then it's only one query when the user retrieves the data, and a more expensive one once a day or once a week.
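The core of this idea is that historical days never change, so only "today" needs a live query. A pure-Ruby sketch of the merge step (hypothetical names; the real version would read cached rows from a model refreshed by a whenever job):

```ruby
require "date"

# cached maps ISO date strings to precomputed counts for past days.
# Only the current day falls back to the live query (passed in as a
# lambda so the expensive work runs at most once per request).
def counts_for_range(cached, start_date, end_date, today, live_count_for_today)
  (start_date..end_date).each_with_object({}) do |date, out|
    out[date.iso8601] =
      date == today ? live_count_for_today.call : cached.fetch(date.iso8601, 0)
  end
end
```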

@rhymes
Contributor Author

rhymes commented Apr 4, 2019

> We might want to add a hard limit to how large the date range can be. Twitter's API only allows requesting up to 4 weeks at a time. A 30 day period would mean 270 queries, but I really don't know enough about performance to say whether or not this is better/worse than memory. :(

Oh, that's a good idea. Maybe in a separate PR, though.

> I think adding indexes would make sense.

I'd add these in another PR as well. First I would merge this and see if we still have those timeouts you mentioned in #2307. One step at a time; performance improvements should always have some kind of benchmark or self-evident reason behind them eheh

> I actually used group_by before in chart_decorator.rb:

That's Enumerable's group_by, though, which requires all the objects to be in memory. The group by I was referring to is SQL/ActiveRecord's: https://guides.rubyonrails.org/active_record_querying.html#group

But once again, I would do this only if and when needed.

> The issue with it was that it left out dates that had no data. I guess we can solve that with setting a default value of sorts before grouping. Personally leaning toward an implementation that uses SQL + group_by, but again don't know enough to make a call (no pun intended).

ahhahaha great pun :D

@rhymes
Contributor Author

rhymes commented Apr 4, 2019

> Quick idea: make a model/table to store the data by day/week and only calculate for new data since the old data will stay the same. We can use whenever jobs to create the "cached" data each day/week. Then it's only one query when the user retrieves the data and a more expensive one once a week or once a day.

Hi @aspittel I have a question: is it still slow with this PR? I can't see it from my end.

Because going from real time to batching requires a lot more code and coordination (as you said, you need a new model and recurring jobs to update the table/view for each pro user and to calculate the "diff").

@aspittel
Contributor

aspittel commented Apr 4, 2019

Everything has been fast for me so far. My concern is that since we're using this for a data dashboard with presumably a lot of weeks on it, we may want to consider that down the road. I think this is great for now, but we may want to think about further optimizations in the future!

@rhymes rhymes changed the title from [WIP] Simplify AnalyticsService code to Simplify AnalyticsService code Apr 4, 2019
@pr-triage pr-triage bot added the PR: unreviewed label Apr 4, 2019
@benhalpern
Contributor

It's not crystal clear what the right answer is, but I'm down to merge this.

I think @aspittel's serialization ideas make sense but as @rhymes noted, it's a lot of new code!

What if we make use of low-level/Russian-doll Rails caching wherever possible to solve these problems for now? Once the API stabilizes and our bigger concerns are perf/scaling, we can look at solving those problems then.

@pr-triage pr-triage bot added the PR: reviewed-approved label and removed the PR: unreviewed label Apr 4, 2019
@benhalpern benhalpern merged commit 5e9867e into forem:master Apr 4, 2019
@pr-triage pr-triage bot removed the PR: reviewed-approved label Apr 4, 2019
@pr-triage pr-triage bot added the PR: merged label Apr 4, 2019
@rhymes
Contributor Author

rhymes commented Apr 4, 2019

@benhalpern

No, it's not clear what the right answer is; there's a bit of trial and error to go through.

A summary of possible improvements for posterity:

  • adding a hard limit on the range of dates the client can request
  • adding indexes on created_at in each table queried for analytics
  • optimizing the queries by grouping inside the DB
  • using batching
  • using low-level caching on results

I think that's all that has been proposed in this discussion.

Thanks for the merge!

@rhymes rhymes deleted the rhymes/simplify-analytics branch April 4, 2019 20:35
@lightalloy
Contributor

I missed the whole discussion, but I still want to note that it's a great refactoring, thanks @rhymes.
Maybe the results of the discussion should be moved to a separate issue to help with further optimizations.

There's a gem that could possibly be used to help with more effective queries.

@rhymes
Contributor Author

rhymes commented Apr 5, 2019

@lightalloy I collected all possible improvements discussed and my thoughts here #2311 (it's a longish read :D)
