
Tracking is done on the main thread#11293

Merged
kidinov merged 7 commits into trunk from 11292-tracking-is-done-on-the-main-thread on Apr 24, 2024

Conversation

@kidinov
Contributor

@kidinov kidinov commented Apr 11, 2024

Closes: #11292

Description

While working on another issue, I noticed that in many cases we call the "track" method on the main thread. Inside it, we hit the device's IO. This can make the app less responsive and jerky. The PR wraps this code in a coroutine with the appCoroutineScope scope. As this is a potentially dangerous change, @hichamboushaba @JorgeMucientes, would you mind thinking through potential issues with this approach?

Testing instructions

  • Use the app
  • Verify that tracking of events still works (the Tracks live view can be used)
  • I have considered if this change warrants release notes and have added them to RELEASE-NOTES.txt if necessary. Use the "[Internal]" label for non-user-facing changes.

@dangermattic
Collaborator

dangermattic commented Apr 11, 2024

1 Warning
⚠️ This PR is assigned to the milestone 18.4. This milestone is due in less than 2 days.
Please make sure to get it merged by then or assign it to a milestone with a later deadline.
1 Message
📖

This PR contains changes to Tracks-related logic. Please ensure (author and reviewer) the following are completed:

  • The tracks events must be validated in the Tracks system.
  • Verify the internal Tracks spreadsheet has also been updated.
  • Please consider registering any new events.
  • The PR must be assigned the category: tracks label.

Generated by 🚫 Danger

@kidinov kidinov added the category: tracks Related to analytics, including Tracks Events. label Apr 11, 2024
@kidinov kidinov added this to the 18.2 milestone Apr 11, 2024
@wpmobilebot
Collaborator

wpmobilebot commented Apr 11, 2024

📲 You can test the changes from this Pull Request in WooCommerce Android by scanning the QR code below to install the corresponding build.

App Name: WooCommerce Android
Flavor: Jalapeno
Build Type: Debug
Commit: 84d4aea
Direct Download: woocommerce-prototype-build-pr11293-84d4aea.apk

@codecov-commenter

codecov-commenter commented Apr 11, 2024

Codecov Report

Attention: Patch coverage is 0%, with 32 lines in your changes missing coverage. Please review.

Project coverage is 41.39%. Comparing base (94c3c42) to head (ca058e9).
Report is 61 commits behind head on trunk.

❗ Current head ca058e9 differs from pull request most recent head 84d4aea. Consider uploading reports for the commit 84d4aea to get more accurate results

Files Patch % Lines
.../woocommerce/android/analytics/AnalyticsTracker.kt 0.00% 26 Missing ⚠️
...n/kotlin/com/woocommerce/android/AppInitializer.kt 0.00% 6 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##              trunk   #11293      +/-   ##
============================================
- Coverage     41.40%   41.39%   -0.02%     
  Complexity     5217     5217              
============================================
  Files          1057     1057              
  Lines         61889    61911      +22     
  Branches       8450     8452       +2     
============================================
  Hits          25627    25627              
- Misses        33959    33981      +22     
  Partials       2303     2303              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@kidinov kidinov modified the milestones: 18.2, 18.3 Apr 11, 2024
Base automatically changed from 11264-track-wc-core-version to trunk April 11, 2024 14:04
@kidinov kidinov linked an issue Apr 11, 2024 that may be closed by this pull request
Member

@hichamboushaba hichamboushaba left a comment


@kidinov sorry for the delay, I was AFK last week.

I have some concerns about this: switching to a coroutine here will cause the code to be dispatched to a different thread (because of the use of the default Dispatchers.Default), which introduces two issues:

  1. The event timestamp will be slightly behind the actual tracking, this would be in the order of tens or hundreds of milliseconds depending on the load of the Dispatchers.Default queue.
  2. The ordering of events won't be guaranteed anymore.

Regarding 2, I googled a bit, and it seems to be the case: the official guarantee for coroutine dispatch is "almost FIFO", not strictly FIFO, and I was able to reproduce the reordering by adding some logging:

Index: WooCommerce/src/main/kotlin/com/woocommerce/android/analytics/AnalyticsTracker.kt
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/WooCommerce/src/main/kotlin/com/woocommerce/android/analytics/AnalyticsTracker.kt b/WooCommerce/src/main/kotlin/com/woocommerce/android/analytics/AnalyticsTracker.kt
--- a/WooCommerce/src/main/kotlin/com/woocommerce/android/analytics/AnalyticsTracker.kt	(revision 1ee5b1704eb57f0098c441d555620d24e039b131)
+++ b/WooCommerce/src/main/kotlin/com/woocommerce/android/analytics/AnalyticsTracker.kt	(date 1713181146718)
@@ -68,6 +68,7 @@
     }
 
     private fun track(stat: AnalyticsEvent, properties: Map<String, *>) {
+        WooLog.i(T.UTILS, "Tracking ${stat.name.lowercase()}, Properties: $properties")
         appCoroutineScope.launch {
             if (tracksClient == null) {
                 return@launch

Which resulted in the following:

2024-04-15 12:39:47.565 11021-11021 WooCommerce-UTILS       com.woocommerce.android.dev          I  Tracking app_feedback_prompt, Properties: {action=shown}
2024-04-15 12:39:47.565 11021-11021 WooCommerce-UTILS       com.woocommerce.android.dev          I  Tracking view_shown, Properties: {name=MyStoreFragment}
2024-04-15 12:39:47.568 11021-11193 WooCommerce-UTILS       com.woocommerce.android.dev          I  🔵 Tracked: view_shown, Properties: {"name":"MyStoreFragment","is_debug":true,"site_url":"https:\/\/hichamwootest.blog","cached_woo_core_version":"8.7.0"}
2024-04-15 12:39:47.568 11021-11192 WooCommerce-UTILS       com.woocommerce.android.dev          I  🔵 Tracked: app_feedback_prompt, Properties: {"action":"shown","blog_id":226227340,"is_wpcom_store":true,"was_ecommerce_trial":true,"plan_product_slug":"wooexpress-small-bundle-yearly","store_id":"1d798d53-bded-499f-b0ef-a2a7a312f35d","is_debug":true,"site_url":"https:\/\/hichamwootest.blog","cached_woo_core_version":"8.7.0"}

Notice that view_shown was tracked before app_feedback_prompt even though they were requested in the opposite order.
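The reordering can be reproduced outside the app with a minimal sketch (assuming kotlinx.coroutines; the event names and delays are illustrative, not the app's code):

```kotlin
import kotlinx.coroutines.*

// Two independent launches on Dispatchers.Default can land on different
// worker threads, so completion order may differ from call order.
fun main() = runBlocking {
    val scope = CoroutineScope(Dispatchers.Default)
    val first = scope.launch {
        Thread.sleep(50) // simulate per-event preparation work (IO etc.)
        println("Tracked: app_feedback_prompt")
    }
    val second = scope.launch {
        println("Tracked: view_shown")
    }
    joinAll(first, second) // "view_shown" usually prints first here
}
```

The sleep exaggerates the effect, but any variance in per-event work or worker-thread scheduling produces the same kind of inversion.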


I get that the issue here is that we added a DB call when logging the WC version in GetWooCorePluginCachedVersion. If I'm right in this assumption, then I suggest we cache the version in memory and use it instead of hitting the DB for all events; this would also offer better performance given the frequency of Tracks calls.
Since we fetch the site's plugins on app launch and when switching sites here, those would be good moments to cache the value 🤔, WDYT?

@ThomazFB ThomazFB modified the milestones: 18.3, 18.4 Apr 18, 2024
@kidinov
Contributor Author

kidinov commented Apr 22, 2024

@hichamboushaba

Thank you for the really good points here! 🙏

I get that the issue here is that we added a DB call when logging the WC version in GetWooCorePluginCachedVersion, if I'm right in this assumption

Yep, this is correct. But I'd still really want to fix this in a generic way, as we hit IO not only in GetWooCorePluginCachedVersion, but also in SelectedSite::get at least once, and appPrefs.getWCStoreID seems to be called on almost every tracking call. And potentially more as the app evolves; as we can see, it's not really clear in many cases that this runs on the main thread, so people keep adding more and more work here.

Overall, we should strive to do as little work as possible on the main thread, and this looks like a relatively simple way of improving the situation with the app being jerky.

then I suggest we cache the version in memory and use it instead of using the DB for all events, this would also offer a better performance due to the frequency of Tracks.

That is cached on the backend itself, and then we cache it in the DB, too. It's already quite challenging to make sure we track the relevant value, and we actually already know that the value won't be relevant all the time. Introducing another level of cache would create yet another level of ambiguity and complexity, which I'd like to avoid.

The event timestamp will be slightly behind the actual tracking, this would be in the order of tens or hundreds of milliseconds depending on the load of the Dispatchers.Default queue.

Using the local timestamp may not be accurate because the time on the device can be set to any value. Additionally, setting the time on the main thread does not guarantee that we record the actual event time, since we record the time when the main thread managed to reach TracksClient::track, but before it does all the preparation of the event, including various IO operations. Moreover, we sometimes call tracking from coroutines already. Therefore, it is unclear whether we can or should pursue milliseconds here. I wouldn't worry about it much because, by design, this is not something we should treat as super precise and rely on. Wdyt?

The ordering of events won't be guaranteed anymore.

This sounds like a more important issue to me, as tracks may be used for debugging sometimes. Having said that, we already don't really have a guaranteed order, as we sometimes call this from inside coroutines. I added a mutex to mitigate that: c464f20
wdyt?
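The mutex mitigation looks roughly like this (a sketch, assuming kotlinx.coroutines, not the exact code from c464f20): the lock ensures event bodies never interleave, but coroutines may still reach the lock in a different order than track() was called, so strict FIFO ordering is still not guaranteed.

```kotlin
import kotlinx.coroutines.*
import kotlinx.coroutines.sync.Mutex
import kotlinx.coroutines.sync.withLock

// Serializes event processing: only one event at a time, but the arrival
// order at the lock can differ from the call order of track().
private val trackMutex = Mutex()

fun CoroutineScope.track(event: String) {
    launch(Dispatchers.Default) {
        trackMutex.withLock {
            println("tracked $event") // serialized, order not guaranteed
        }
    }
}

fun main() = runBlocking {
    track("view_shown")
    track("app_feedback_prompt")
    coroutineContext.job.children.forEach { it.join() }
}
```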

@hichamboushaba
Member

Thanks @kidinov for the remarks. One suggestion that might work without the issues of the current solution is to launch the coroutine on a dispatcher that uses a single thread (newSingleThreadContext), or alternatively to not use coroutines at all and instead use a separate thread that handles the logging of the events. I think that by forcing a single background thread we'll avoid the issues I shared above, but I didn't test it to confirm.
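A sketch of the single-thread suggestion (assuming kotlinx.coroutines; names are illustrative): because all work is confined to one thread, events submitted from the main thread run strictly in submission order.

```kotlin
import kotlinx.coroutines.*

// A dedicated single-threaded dispatcher: its queue is drained by one
// thread, so launches from the same caller execute in FIFO order.
@OptIn(DelicateCoroutinesApi::class, ObsoleteCoroutinesApi::class)
val trackingContext = newSingleThreadContext("tracking")
// A lighter alternative with the same ordering property:
// val trackingContext = Dispatchers.Default.limitedParallelism(1)

fun main() {
    runBlocking {
        repeat(3) { i ->
            launch(trackingContext) { println("tracked event $i") }
        }
    }
    trackingContext.close() // release the backing thread when done
}
```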

@kidinov
Contributor Author

kidinov commented Apr 23, 2024

@hichamboushaba

Do you mean to have a constantly running thread that reads from some kind of queue of events? Does that solve the timing issue?

Regarding using a native thread: what advantages do you see? To me, it sounds like:

  • We need to stop that somehow to not leak
  • That is just way heavier than a coroutine in terms of resources

The timing issue could be solved by recording the time before handing the event to any parallel-running code, but to do that we would have to change the Tracks library; as of now, it sets the timestamp internally. And as I mentioned above, I don't think this is something we should worry about.

As I realized that the mutex approach doesn't give us a guarantee of sequential/ordered tracking, I changed it to a producer-consumer with a channel in between. Wdyt?
ca058e9
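An illustrative sketch of the producer-consumer approach (assuming kotlinx.coroutines; not the exact code from ca058e9): track() only enqueues, and a single consumer coroutine drains the channel, so events are processed off the calling thread in strict FIFO order.

```kotlin
import kotlinx.coroutines.*
import kotlinx.coroutines.channels.Channel

class ChannelTracker(scope: CoroutineScope) {
    private val events = Channel<String>(Channel.UNLIMITED)

    init {
        // Single consumer: events are handled one at a time, in
        // submission order, off the caller's thread.
        scope.launch {
            for (event in events) {
                println("tracked $event") // heavy work (IO etc.) goes here
            }
        }
    }

    // Cheap, non-suspending enqueue; safe to call from the main thread.
    fun track(event: String) {
        events.trySend(event)
    }

    fun close() = events.close()
}

fun main() = runBlocking {
    val tracker = ChannelTracker(this)
    tracker.track("view_shown")
    tracker.track("app_feedback_prompt")
    tracker.close() // lets the consumer loop finish so main can return
}
```

The unbounded buffer keeps track() non-blocking; the single consumer is what restores the ordering guarantee that the plain Dispatchers.Default launches lacked.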

@kidinov kidinov requested a review from hichamboushaba April 23, 2024 13:42
Member

@hichamboushaba hichamboushaba left a comment


Do you mean to have a constantly running thread that reads from some kind of queue of events? Does that solve the timing issue?

Yes, it should fix the ordering issue, as it'll handle events one by one, which means no parallelism.
The timestamp would still be impacted, but it's the same for all solutions that involve some background work.

Regarding using a native thread. What advantages do you see?

I didn't say it's better, I offered it just as a similar suggestion, as it's essentially the same as the single-threaded dispatcher with coroutines, just with a different API.

We need to stop that somehow to not leak

Why? The AnalyticsTracker runs for the whole duration of the app, same as the appCoroutineScope, which we don't have to cancel.

That is just way heavier than coroutine in terms of resources

I don't think so. Coroutines are more lightweight than threads when we talk about a big number of them, but in this case we'll create a thread for either the single-threaded dispatcher or this solution, so the native-thread solution might actually be lighter, as it doesn't involve any additional coroutines machinery.

And as I mentioned above I don't think that this is something we should worry about

I agree, it's a minor issue, the ordering is the most serious issue.

As I realized that the mutex approach doesn't give us the guarantee of sequential/ordered tracking, I changed that to producer-consumer with the channel in between. Wdyt?

Yes, this solution looks better, thanks for working on it. As I didn't test the changes, I'll wait for @JorgeMucientes to review the PR as well.

@JorgeMucientes
Contributor

Hey @kidinov thanks for working on this! Great job 🥇
The code looks good, and after following the discussion the proposed solution looks good to me: it moves the "heavy" work off the main thread while keeping the events' order 👍🏼.
I smoke-tested the app while checking logcat for tracking events being fired in order, and everything looks good.

I just left a minor suggestion.

@wzieba
Contributor

wzieba commented Apr 23, 2024

hi there @kidinov 👋 I've spotted this PR and have a question:

Inside this, we hit the device's IO.

Could you please point me to the line you mean? I can't spot it in AnalyticsTracker#track (other than the SharedPreferences reference), and TracksClient seems to already have a safety mechanism similar to the one here: a thread for reading from the buffer into the DB, a thread to read from the DB, and one for sending to the remote.

…AnalyticsTracker.kt

Co-authored-by: Jorge Mucientes <jorge.mucientes@automattic.com>
@kidinov
Contributor Author

kidinov commented Apr 24, 2024

@wzieba 👋

Could you please point me to the line you mean?

Yeah. We hit SharedPreferences here https://github.com/woocommerce/woocommerce-android/pull/11293/files#diff-1b467124790123903cb4bc17bf7d091c5076e40bcc4d2eaa72f6c030fda0a6d9R123

which is IO. SharedPreferences is just an XML file on disk

Also, the site is stored in the DB; we cache it internally in memory, but the first hit will still be IO, here:
https://github.com/woocommerce/woocommerce-android/pull/11293/files#diff-1b467124790123903cb4bc17bf7d091c5076e40bcc4d2eaa72f6c030fda0a6d9R116

@kidinov kidinov enabled auto-merge April 24, 2024 07:50
@hichamboushaba
Member

hichamboushaba commented Apr 24, 2024

which is IO. SharedPreferences is just an XML file on disk

Just pointing out something here 🙂: I think it's generally OK to read SharedPreferences on the main thread. Once they are initialized, they are cached in memory; the original issue with them was writing with commit, which wrote to disk synchronously, and apply fixed that.
That being said, there are some issues with SharedPreferences when having a very large number of keys or large objects saved, but I don't think this impacts our app, nor is it the subject here.
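For context, the commit-vs-apply difference mentioned here can be sketched with the standard Android SharedPreferences API (the file name and key below are illustrative):

```kotlin
import android.content.Context

fun savePref(context: Context) {
    val prefs = context.getSharedPreferences("app_prefs", Context.MODE_PRIVATE)

    // commit() writes to disk synchronously on the calling thread --
    // this was the historical source of main-thread jank.
    prefs.edit().putString("store_id", "123").commit()

    // apply() updates the in-memory cache immediately and persists to disk
    // asynchronously, so it is considered safe to call from the main thread.
    prefs.edit().putString("store_id", "123").apply()
}
```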

@kidinov kidinov merged commit f093c01 into trunk Apr 24, 2024
@kidinov kidinov deleted the 11292-tracking-is-done-on-the-main-thread branch April 24, 2024 08:24
@wzieba
Contributor

wzieba commented Apr 24, 2024

@kidinov Have you noticed or measured any improvements related to this change? I'm concerned that this PR makes the AnalyticsTracker implementation more complex and easier to misuse in the future - e.g. anyone can easily add DB queries in the track method now. As I mentioned, we already have a very similar mechanism (a buffer) implemented internally in Automattic-Tracks-Android, which addresses a similar issue. The library is designed to work in a sync manner, so maybe we can work more on just getting the data from a cache instead? It seems we don't have to do this for SharedPreferences, as @hichamboushaba explained above.

@kidinov
Contributor Author

kidinov commented Apr 24, 2024

@wzieba 👋

I haven't measured it in an exact manner, though, just by recording the time before/after the method execution. It sometimes takes 2ms, sometimes 14, sometimes 40. On some screens we track a few events at the same time, so the overall lag can be noticeable, and that adds up to the whole experience of the app being quite jerky.
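The rough measurement described can be done with something like this (illustrative; the sleep stands in for the real track() call):

```kotlin
import kotlin.system.measureTimeMillis

fun main() {
    val elapsed = measureTimeMillis {
        Thread.sleep(14) // stand-in for AnalyticsTracker.track(...)
    }
    println("track() took ${elapsed}ms")
}
```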

implementation more complex and easier to misuse in the future - e.g. anyone can easily add db queries in track method now.

If we do it in the background, I don't see why it would be a misuse. Seems like a legitimate use to me.

This library is designed to work in a sync manner

Is there anything specific regarding why it has to be sync? And what exactly do you mean by sync: only from the main thread, or non-concurrent access?

We already call it off the main thread in quite a few places. And I don't add any concurrency - it's called one by one, the same as before, just without using the main thread.

It seems we don't have to do this for SharedPreferences as @hichamboushaba explained above.

Yep, it is cached in memory, but at the same time there is some blocking synchronization going on too. While it may be "ok" to access it from the main thread occasionally, here we do it 3 or more times per user action, right after the action, which in my opinion is unnecessary and better avoided.

Having said that, if you have a specific issue in mind, I am totally fine with fixing it.

@hichamboushaba
Member

Thanks @kidinov for the comment.

Yep, It is cached in the memory, but at the same time, there is some blocking synchronization going on

For my learning, can you please elaborate on this? It's not clear to me what's meant by blocking synchronization.

That sometimes takes 2ms, sometimes 14, sometimes 40

Just to confirm before @wzieba shares his opinion: were these measurements done with the DB call for getting the WooCommerce version, or without it? If with it, then I think a fairer approach would be to measure with the value cached in memory.

@kidinov
Contributor Author

kidinov commented Apr 24, 2024

@hichamboushaba 👋

For my learning, can you please elaborate on this? It's not clear to me what's meants by a blocking synchronization.

Every access method at your link has:

    synchronized (mLock) {
        awaitLoadedLocked();

That lock seems to be acquired not only on read access, but also on writes to memory and to disk. There is also a `startReloadIfChangedUnexpectedly` method which is public, so it's not clear when it's called, but it also locks all read access (I don't want to spend time researching when that happens, tbh).

So basically, that shared lock is for the whole SharedPreferences instance, not per key. So if there is any access to any key in the shared prefs, we wait for it to finish.

Just to confirm before @wzieba can share his opinion, were these measurements done with the DB call for getting the WooCommere version or without it? If yes, then I think a more faire approach would be to measure with the value cached in memory.

I measured without DB access

@hichamboushaba
Member

That lock seems to be acquired not only on read access, but also on writes to memory and to disk. There is also a `startReloadIfChangedUnexpectedly` method which is public, so it's not clear when it's called, but it also locks all read access (I don't want to spend time researching when that happens, tbh)

So basically, that shared lock is for the whole SharedPreferences instance, not per key. So if there is any access to any key in the shared prefs, we wait for it to finish.

Thanks so much @kidinov for the explanation 🙏. I checked the code, and it seems the mLock lock is used only as a way to prevent reading values while data is loading: it's used only for in-memory read/write and for the initial load from disk. From the look of it, it won't cause any delay, as awaitLoadedLocked becomes a no-op once mLoaded is set to true, and my understanding of startReloadIfChangedUnexpectedly is that it's used by the Context to reload the shared preferences if the file was modified outside of the app's process (a case that doesn't happen except on rooted phones).
IMHO, since the documentation doesn't tell us to access SharedPreferences off the main thread, and the class is not deprecated, I'd keep to the official position that it's fine for synchronous access as long as we keep the size of the preferences small.

In all cases, this was just for my learning, and I appreciate it. When it comes to the changes in this PR, I'm OK with either approach, and I think the concern raised by @wzieba is a valid one that should be discussed.

@wzieba
Contributor

wzieba commented Apr 24, 2024

@kidinov @hichamboushaba thanks for sharing more context and leaving thoughts!

If we do it in the background, I don't see why it would be misused. Seems legit use to me

I'd discourage pursuing the idea of having DB queries for each tracking method invocation, especially as, as you say, a few of them might happen at the same time. I just think it's an inefficient use of resources that might, at a bigger scale (in terms of access frequency), slow down the app (the database is, in the end, a single resource that has to be locked and unlocked).

This library is designed to work in a sync manner

Is there anything specific regarding why it has to be sync? And what exactly do you mean by sync: only from the main thread, or non-concurrent access?

What I meant is that the Tracks SDK is designed to be synchronous to save consumers from problems like the one we're addressing here. I'd argue "we're doubling the effort" rather than "we shouldn't do this because a side effect will happen".


To better understand the problem we're discussing here, I've deep-dived into the profiles we collect via Sentry Performance Monitoring. I've searched for outliers that could highlight any issues. For example, let's take this profile: https://a8c.sentry.io/profiling/profile/woocommerce-android/010df37d5406486f86ff194484b78b61/flamegraph/?colorCoding=by+system+frame&fov=2945841152%2C19%2C1205526016%2C17&query=AnalyticsTracker.track&sorting=call+order&tid=17511&view=top+down

and look for the AnalyticsTracker#track execution. Its execution time is 1.21s (!), which is definitely longer than expected.

Going deeper into the execution tree, we can see that at the end of it BinderProxy.transactNative is executed. It's used to get the DeviceInformation.isBluetoothEnabled value. Something is wrong there, as this call takes 1.19s.

[Screenshot: Sentry profile flame graph]

Here's another outlier: https://a8c.sentry.io/profiling/profile/woocommerce-android/81299bea922e4f2dbfe2ea7a785ec6a6/flamegraph/?colorCoding=by+application+frame&fov=929282944%2C24%2C710713984%2C21&frameName=com.woocommerce.android.analytics.AnalyticsTracker.track%28com.woocommerce.android.analytics.AnalyticsEvent%2C+java.util.Map%29&framePackage=com.woocommerce.android.analytics&query=&sorting=left+heavy&tid=20761&view=top+down

[Screenshot: Sentry profile flame graph]

getCurrentNetworkOperator takes 710 ms 😬 .

Looking at the source code of the Tracks SDK, I can see that, even though the SDK was designed to be synchronous, before adding an event to the queue it performs multiple operations which can be time-consuming.

I believe this PR addresses this problem, so we should see the improvement. That's good news 😃.

But generally, whenever we address those performance problems in the Tracks SDK, I'd suggest reconsidering this change and validating whether keeping tracking off the main thread is still necessary.

Thanks for discussion @kidinov @hichamboushaba, I've learned more thanks to it 🙌

@JorgeMucientes
Contributor

JorgeMucientes commented Apr 25, 2024

Thanks for the excellent investigation @wzieba, and thanks @kidinov and @hichamboushaba for keeping the discussion going. I'm learning so much from it 👏🏼 🥇

@wzieba really nice find on which tasks running inside the Tracks lib are really time-consuming. TIL about the Sentry profiles.
Just a question out of complete ignorance: when looking at those Sentry profiles, how do you see average/global numbers? How do we know that, for example, DeviceInformation.isBluetoothEnabled taking ~1.19s is not something anecdotal or specific to a device OS, hardware, or a device running low on CPU resources when it happened?

@wzieba
Contributor

wzieba commented Apr 25, 2024

Thanks!

When looking at those Sentry profiles, how do you see average/global numbers?

This is a good question @JorgeMucientes! For now, I was only able to see the average time of specific methods via the summaries on the main page (https://a8c.sentry.io/profiling/?project=1459556&statsPeriod=90d)
[Screenshot: Sentry profiling summary page]

From it, we can see that the P75 of the AnalyticsTracker#track method duration is ~14ms.

How do we know that, for example DeviceInformation.isBluetoothEnabled taking ~ 1.19s is not something anecdotical or specific to a device OS, hardware, or device running low on CPU resources when that happened?

I think it is indeed anecdotal. In fact, we profile 1% of 2% of transaction events, which is, I guess, far from statistical significance.

But the fact that DeviceInformation.isBluetoothEnabled can take ~1.19s, and that it sometimes happens, combined with the fact that we use AnalyticsTracker#track dozens of times for each user action, is probably bad enough to prioritise working on fixing this issue.

@wzieba
Contributor

wzieba commented Apr 25, 2024

I found this view: https://a8c.sentry.io/profiling/summary/woocommerce-android/?environment=release&project=1459556&statsPeriod=90d&transaction=OrderDetails&view=flamegraph which shows the "Aggregated duration of this frame across different samples".

On it we can see that some track calls average as high as ~600ms (74 samples), while some only ~25ms.
[Screenshots: aggregated frame durations from the Sentry profiling summary]

@hichamboushaba
Member

Really nice find @wzieba, thank you so much.

But generally, whenever we address those performance problems in the Tracks SDK, I'd suggest reconsidering this change and validating whether keeping tracking off the main thread is still necessary.

➕ to this.

