How to go from “Looks like something changed in a Firefox Desktop version” to “Here is a list of potential culprit bugs”

This will mostly be helpful to Firefox Desktop folks, so if you’re not one of those, please instead enjoy a different blogpost. I recommend this one about the three roles of data engagements.

So you’ve found yourself a plot that looks like this:

A timeseries bar plot that begins as an uptake curve then has a sudden drop around February 22. There is no legend and no y-axis as they are unimportant, and sometimes I like to be cagey about absolute figures.

You suspect this has something to do with a code change because, wouldn’t you know it, the sharp decline starts around Feb 22 and we released Firefox 123 on Feb 20. But where do you go from here? Here’s a step-by-step of how I went from this plot arriving in the #data-help channel in Slack to finding the bugfix that most likely caused the change:

1. Ensure this is actually a version-specific change

It’s interesting that the cliff in the plot happened near a release day, and it’s an excellent intuition to consider code releases for these sorts of sea changes in data volume or character. But we should verify that this is the case by grouping by mozfun.norm.truncate_version(app_version, 'major') AS major_version, which in our case gives us:

The same timeseries bar plot as before, but coloured to show groups by major Firefox Desktop version. The cliff happens solely in the colours for Firefox 123 and above.

Sure enough, in this case the volume cliff happens entirely within the Firefox 123+ colours. If this isn’t what you get, then it’s somewhat less likely that this is caused by a client code change and this guide might not help you. But for us this is near-certain confirmation that the change in the data is caused by a code change that landed in Firefox 123… but which one?

(This is where I spent a little time checking some frequent “gotcha” changes that could’ve happened. I checked:

  • Was it because data went from all-channel to pre-release-only? No: the probe definitions didn’t change, and the fall isn’t severe enough for that (it would look more like an order of magnitude).
  • Was it because specific instrumentation within the group happened to expire in Fx123? No: the first plot is grouped by specific probe, and all of the groups shared the same shape as their sum.
  • Was it an incredibly-successful engagement-boosting experiment that ended? No: there haven’t been any relevant experiments since last July.)
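For reference, here’s roughly what the grouped query looks like. This is a minimal sketch: the mozfun function is the real one mentioned above, but the table name, column names, and date range are placeholders for whatever your original plot was built on.

SELECT
  DATE(submission_timestamp) AS submission_date,
  -- 'major' truncation turns a version like "123.0.1" into just "123".
  mozfun.norm.truncate_version(app_version, 'major') AS major_version,
  COUNT(*) AS row_count
FROM
  `mozdata.telemetry.your_table_here`  -- placeholder: whatever table the plot queries
WHERE
  DATE(submission_timestamp) >= '2024-01-15'  -- illustrative window around the cliff
GROUP BY
  submission_date,
  major_version
ORDER BY
  submission_date,
  major_version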

2. Figure out which Nightly builds are affected

Firefox Desktop releases new software versions twice a day on the Nightly channel. We can look at the numbers reported by these builds to narrow down the specific 12-hour window in which the code that caused this drastic shift landed. Or, well, you’d think we could, but when you group by build_id you get:

Another bar plot, but instead of the x-axis being time it is now "build id" which is a timestamp of a sort. The data is all over the place and patchy with no or little clear pattern.

Because our Nightly population isn’t randomly distributed across timezones, there are usage patterns that affect which parts of the population use which build on which day. And sometimes there are “respins” where specific days have more than two nightlies. And since our Nightly population is so small (You Can Help! Download Nightly Today!), and this data is a little sparse to begin with, little changes have big effects.

No, far more commonly the correct thing to do is to look at what I call a “build day”. This is how GLAM makes things useful, and this is how I make patterns visible. So group by SUBSTR(build_id, 1, 8) AS build_day, and you get:

It looks like a timeseries bar plot, but the x-axis is "build day" so it isn't quite. Notably, there's a sudden cliff starting with the nightlies for January 18.

Much better. We can see that the change likely landed in Jan 18’s nightlies. That Jan 18-20 are all at the same level suggests to me that it probably ended up in all of Jan 18’s nightly builds (if it had only landed in one of the (normally) two nightly builds, we’d expect to see a short fall-off where Jan 18 would look more like an average of Jan 17 and 19).
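In query form, the build-day grouping is just a tweak on the version query from step 1. Another hedged sketch (the table name, channel filter, and dates are illustrative):

SELECT
  -- build_id looks like "20240118093248"; its first 8 characters are the build day.
  SUBSTR(build_id, 1, 8) AS build_day,
  COUNT(*) AS row_count
FROM
  `mozdata.telemetry.your_table_here`  -- placeholder, as before
WHERE
  normalized_channel = 'nightly'
  AND DATE(submission_timestamp) >= '2024-01-01'  -- illustrative window
GROUP BY
  build_day
ORDER BY
  build_day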

Regardless of when during the day, we’re pretty sure we have this nailed down to only one day’s worth of patches! That’s good… but it could be better.

3. Going from build days to pushlog

Ever since I was the human glue keeping the (now-decommissioned) automated regression detection system “alerts.tmo” working, I’ve had a document on my disk reminding me how to transform build days or build_ids into a “pushlog” of changes that landed in the suspect builds. This is how it works:

  1. Get the hg revisions of the suspect builds by looking through this list of all Firefox releases for the suspect builds’ ids. You want the final build of the day before the first suspect build day and the final build of the final suspect build day, which in this case are Jan 17 and Jan 18, so we get f593f07c9772 and 9c0c2aaab123:
A visual excerpt of the firefox releases list https://hg.mozilla.org/mozilla-central/firefoxreleases. For illustration only.
  2. Put them into this template: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange={}&tochange={} — which gives us https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=f593f07c9772&tochange=9c0c2aaab123

This gives you a list of all changes that are in the suspect builds, plus links to the specific code changes and the relevant bugs, with the topic sentence from each commit right there for you. Handy!

4. Going from a pushlog to a culprit

This is where human pattern matching, domain expertise, organizational memory, culture and practices, and institutional conventions all combine… or, to put it another way, I don’t know how to help you get from the list of all code that could have caused your data change to the one (or more) likely suspects. My brain has handily built me a heuristic and not handed me the source code, alas. But I’ve noticed some patterns:

  • Any change that is backed out can be disregarded. Changes are often backed out (usually because of test failures) and relanded later. Sometimes that’s later the same day. Sometimes that’s outside our pushlog. To skip changes that have been backed out, disregard any commits from a bug that appear before a commit saying “Backed out N changesets (bug ###)…”.
  • You can often luck out by just text searching for keywords. It is customary at Mozilla to try to be descriptive about the “what” of a change in the commit’s topic, so you could try looking for “telemetry” or “ping” or “glean” to see if there’s anything from the data collection system itself in there. Or, since this particular example had to do with Firefox Relay’s integration with Firefox Desktop, I looked for “relay” (no hits) and then “form” (which hit a few times, like on the word “information”… but also on the culprit, which was in the form detector code).
  • This is a web view on the source code, so you’re not limited to what it gives you. If you have a mozilla-central checkout yourself, you can pull up the commits (if you’re using git-cinnabar you can use its hg2git functionality to change the revs from hg to git) and dump their sum-total changes to a viewer, or pipe it through grep, or turn it into a spreadsheet you can go through row-by-row, or anything you want. I’m lazy so I always try keywording on the pushlog first, but these are always there for when I strike out.

5. Getting it wrong

Just because you found the one and only commit that landed in a suspect build that is at all related, even if that commit’s bug specifically mentions that it fixed a double-counting issue, even if there’s commentary in the code review that explains that they expect to see this exact change you just saw… you might be wrong.

Do not be brusque in your reporting. Do not cast blame. And for goodness’ sake be kind. Even if you are correct, being the person who caused a change that resulted in this investigation can be a not-fun experience. Ask Me How I Know.

Firefox Desktop is a complex system, and complex systems fail. It’s in their nature.


And that’s it! If you have any comments, questions, or (better yet) improvements, please find me on the #glean:mozilla.org channel on Matrix; I’d love to chat.

:chutten

Never Look at the Data: Why did we start getting so many pings from Korea?

Something happened on January 5, 2023. All of a sudden we abruptly started receiving a number of pings from Firefox Desktop clients in Korea equal to two times the size of the entire Korean Firefox Desktop population.

What happened? How did we notice it? What did we do about it?

Let’s back up.

I can’t remember where I learned it, but I’d already started reciting as dogma in my first year of University: “The most important part about any feature is the ability to turn it off”. It’s served me well through my studies and my career. I’ve also found it to be especially true for data collection systems where, for whatever reason, as a user you might decide you no longer want the software you’re using to continue to send data. In some places this is even enshrined in laws where you can request the deletion of data that has already been collected.

Law or not, Mozilla has before, does now, and will always make it easy for you to decide whether to send data to Mozilla. We may not understand why you make that choice, and it definitely will make it harder for us to ensure our products meet your needs, but we’ll respect the heck out of your choice in our processes and in our products.

This is why, when Mozilla’s data collection system Glean is told the user went from allowing data upload to forbidding it, we send one final “deletion-request” ping before shutting down. The “deletion-request” ping contains all the internal identifiers we’ve used to longitudinally group data (if we receive ten crash reports it’s important to know whether it’s the same Firefox crashing ten times or if it’s ten Firefoxes crashing once), and we use those identifiers to (well) identify what data we’ve collected that we’re now going to delete.

For the purposes of this story you’ll need to know that there are two times when Glean notices the product’s gone from “data upload: on” to “data upload: off”: while Glean is running, and during Glean startup. If Glean’s running, then we just handle things – we were told the setting changed from “data upload: on” to “data upload: off” and away we go. But Glean knows that it isn’t always listening to the data upload setting, so if it starts up with “data upload: off” when the last time it shut down the setting was “data upload: on”, we’ll send a specific “at_init”-reason “deletion-request” ping.

We in the Data Org monitor how Glean is behaving. One thing we’ve learned about how Glean behaves is that the number of “deletion-request” pings is roughly constant over time. And the proportion of “deletion-request” pings that have the “at_init” reason should remain fairly fixed as well.

What shouldn’t happen is for Firefox Desktop-sent “at_init”-reason “deletion-request” pings to spike like this on January 5:

time-series plot of ping volumes from December 2022 until mid-January 2023 showing abnormal abrupt increases in volume starting on January 5.
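A plot like that falls out of a query along these lines. This is a hedged sketch: the table follows the usual Glean ping-table layout (one table per ping, with ping_info.reason and geo metadata), but treat the exact dataset name as illustrative:

SELECT
  DATE(submission_timestamp) AS submission_date,
  ping_info.reason AS reason,  -- e.g. "at_init"
  COUNT(*) AS ping_count
FROM
  `mozdata.firefox_desktop.deletion_request`  -- illustrative dataset name
WHERE
  DATE(submission_timestamp) >= '2022-12-01'
  -- AND metadata.geo.country = 'KR'  -- add a filter like this for the Korea-only plots below
GROUP BY
  submission_date,
  reason
ORDER BY
  submission_date,
  reason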

What we do when we notice things like this is file a bug. As the one responsible for Glean’s integration in Firefox Desktop, and as someone with a long history of looking into anomalies, I took a look. At this initial point I was pretty sure it’d be a single actor (a single user, a single company, a single internet cafe) doing something odd… but alas, the evidence was inconclusive:

Evidence consistent with a single actor being responsible for it all:

  • All the pings were coming from the same internet provider. Korea Telecom is responsible for only a bare majority of Firefox Desktop data delivery from Korea, yet the spikes came entirely from that ISP.
  • The Mozilla Community in Korea could offer no explanation of any wide-spread computer or software event that matched the timeline.
  • “at_init”-reason “deletion-request” pings could be a result of automation changing the files on disk to read “data upload: off” between runs of Firefox Desktop.

Evidence inconsistent with a single actor being responsible for it all:

  • The data came from a mix of Firefox Desktop versions: 101.0.1, 104.0, and 108.0.2.
  • The data came from a range of different regions, more or less following the population density of Korea itself.
  • “at_init”-reason “deletion-request” pings could instead be the result of users changing the setting to “data upload: off” early enough during Firefox Desktop startup that Glean hasn’t yet been initialized.

Regardless of why it was happening, it quickly became more important that we learn what we needed to do about it. We spun up an Incident, which is how we organize ourselves when there’s something happening that requires cross-functional collaboration and isn’t getting better on its own. Once there we ascertained that we could respond very quickly and decisively and do

Nothing at all.

The volume of these pings vastly eclipsed any other “deletion-request” pings we would otherwise have received, so you’d be forgiven for thinking that it was terribly expensive to receive, store, and process them all. In reality, we batch these requests. And even before this spike, every batch of requests required editing every partition of every table. Adding another list of identifiers to delete equal in size to two times the peak Firefox Desktop population in Korea just doesn’t matter all that much.

The pressure was off. Even if it got worse… which it did:

Time-series plot of "deletion-request" pings isolated to just those from Korea. Spikes begin January 25 and dwarf other reports. A plateau begins March 26 and continues to the right edge of the plot around April 10.

On March 26, when it reached and maintained a peak of five times the volume of the Firefox Desktop population in Korea, it still wasn’t harming our data platform’s ability to serve business needs or costing us all that much in operational spend. We didn’t need to invest effort into running down the source, so we didn’t.

And so I just kept an occasional eye on it until, just as suddenly but not quite as abruptly as it began, on April 12 the ping volumes began to decrease. By April 18, we were back to normal levels.

Time-series plot of "deletion-request" pings isolated to just those from Korea. Very similar to the previous plot, but continues until April 18. Spikes begin January 25 and dwarf other reports. A plateau begins March 26 and stays up there until April 12, when it falls away to nothing over the course of five days or so.

We had successfully ignored it until it went away.

So what happened to Korean Firefox Desktop users from Jan 5 to April 12, 2023? We never figured it out. If you know about something happening across those dates in Korea: please get in touch. As little as it needed solving for the sake of business needs, it still needs solving for the sake of my curiosity. 

:chutten

This Week in Glean: Page Load Data, Three Ways (Or, How Expensive Are Events?)

(“This Week in Glean” is a series of blog posts that the Glean Team at Mozilla is using to try to communicate better about our work. They could be release notes, documentation, hopes, dreams, or whatever: so long as it is inspired by Glean. All “This Week in Glean” blog posts are listed in the TWiG index).

At Mozilla we make, among other things, Web Browsers which we tend to call Firefox. The central activity in a Web Browser like Firefox is loading a web page. It gets done a lot by each and every one of our users, and so you can imagine that data about pageloads is of important business interest to us.

But exactly because this is done a lot and by every one of our users, this inspires concerns of scale and cost. How much does it cost us to learn more about pageloads?[0]

As with all things in Data, the answer is the same: “Well, it depends.”

In this case it depends on how you record the data. How you record the data depends on what questions you hope to answer with it. We’re going to stick to the simplest of questions to keep this (highly-suspect) comparison even remotely fair.

Option 1: Just the Counts, Ma’am

I say page loads are done a lot, but how much is “a lot”? If that’s our only question, maybe the data we need is simply a count of pageloads. Glean already has a metric type for counting things, so it should be fairly quick to implement.

This should be cheap, right? Just a single number? Well, it depends.

Scale 1: Frequency

The count of pageloads is just a single number. One, maybe as many as eight, bytes to record, store, transmit, retain, and analyze. But Firefox has to report it more than once, so we need to first scale our cost of “one, maybe as many as eight, bytes” by the number of times we send this information.

When we first implemented Firefox’s pageload count in Glean, I wanted to send it on the builtin “metrics” ping which is sent once a day from anyone running Firefox that day[1]. In an effort to gain more complete and timely data, we ended up adding it to the builtin “baseline” ping which is sent (on average for Firefox Desktop) 8 or more times per day.

For our frequency scale we thus use 8/day.

Scale 2: Population

These 8 recordings per day are sent by about 200M users over a month. Days and months aren’t easy to scale between as not all users use Firefox every day, and our population gains new users and loses old users at variable rates… so I recalculated the Frequency scale to be in terms of months and found that we get 68 pings per month from these roughly 200M users.
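If you wanted to reproduce a number like that 68-per-month figure, the shape of the calculation is something like the sketch below. The dataset name follows the usual Glean table conventions and the month is arbitrary; treat both as illustrative.

SELECT
  -- Average "baseline" pings per client over one month.
  COUNT(*) / COUNT(DISTINCT client_info.client_id) AS baseline_pings_per_client
FROM
  `mozdata.firefox_desktop.baseline`  -- illustrative dataset name
WHERE
  DATE(submission_timestamp) BETWEEN '2023-01-01' AND '2023-01-31'  -- one illustrative month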

So the cost is pretty easy to calculate then? Whatever the cost is of storing and transmitting 200M x 68/month x eight bytes ~= 109 GB?

Not entirely. But until and unless those other costs differ meaningfully between options, we can just treat them as noise. This cost, rendered in the size of the data, of about 109 GB? It’ll do.

Option 2: What an Event

Page loads are interesting not just in how many of them there are, but also about what type of load they are and how long the load took. The order of a page load in between other events might also be of interest: did it happen before or after some network trouble? Did a bunch of pageloads happen all at once, or spread across the day? We might wish to instrument page loads as Glean events.

Events are each more expensive than a count. They carry a timestamp (eight bytes) and repeat their names each time they’re recorded (some strings, say fifteen bytes).

(We are not counting the load type or how long the load took in our calculations of the size of an individual sample as we’re still trying to compare methods of answering the same “How many page loads are there?” question.)

Scale 3: Page Loads

“Each time they’re recorded”, huh. Guess that means we get to multiply by the number of page loads. Each Firefox Desktop user, over the course of a month, loads on average 1190 pages[2]. This means instead of sending 68 numbers a month, we’re sending 1190 batches of strings a month.

So the comparable cost is whatever the cost is of storing and transmitting 200M x (eight bytes plus fifteen bytes) x 1190 ~= 5.47 TB.

We’ve jumped an order of magnitude here. And we’re not done.

Option 3: Custom Pings, and Custom Pings Only

What if the context we wish to record alongside the event of a page load cannot fit inside Glean’s prudent “event” metric type limits? What if the collected pageload data would benefit from a retention limit or access control list different from other counts or events? What if you want to submit this data to be uploaded as soon as it has been recorded? In that case, we could send a pageload as a Glean custom ping.

We’ve not (yet) done this in Firefox Desktop (at least partially because it complicates ordering amongst other events: the Glean SDK expends a lot of effort to ensure the timestamps between events are reliable. Ping times are client times which are subject to the whims of the user.), so I’m going to get even hand-wavier than before as I try to determine how large each individual data sample will be.

A Glean custom ping without any metrics in it comes to around 500 bytes[3]. When our data platform ingests the ping and turns it into a row in a dataset, we add some metadata which adds another 300 bytes or so (which only affects storage inside the Data Platform and doesn’t add costs to client storage or client bandwidth).

We could go deeper and cost out the network headers, the costs of using TLS to ensure the integrity of the connection… but we’d be here all day. So I’m gonna call that 200 bytes to make it a nice round 1000 bytes per ping.

We’re sending these pings per pageload, so the cost is whatever the cost is of storing and transmitting 200M x 1190 x 1000 bytes = 238 TB.

Rule of Thumb: 50x

There you have it: for each step up the cost ladder you’re adding an extra 50x multiplier to the cost of storing and transmitting the data. The reality’s actually much worse if it’s harder to analyze and reason about the data as it gets more complex (which in most cases it is) because, as you might remember from one of my previous explorations in costing out metrics: it’s the human costs of things (like analysis) that really getcha.

But you have to balance it out. If adding more context and information ensures your analyses only have to look in one place for its data instead of trying to tie together loosely-coupled concepts from multiple locations… if using a custom ping ensures you have everything you need and don’t have to form a committee to resource an engineer to add implementation which needs to be deployed and individually validated… if you’re willing to bet 50x or 250x the cost on getting it right the first time, then that could be a good price to pay.

But is this the case for you and your data?

Well, it depends.

:chutten

[0]: Avid readers of this blog may notice that this isn’t the first time I’ve written on the costs of data. And it likely won’t be the last!

[1]: How often a “metrics” ping is sent is a little more complicated than “once a day”, but it averages out to about that much so I’m sticking with it for this napkin.

[2]: Yes there are some wild and wacky outliers included in the figure “an average of 1190 page loads” that I’m not bothering to clean up. You can Page Loads Georg to your heart’s content.

[3]: This is about how many characters the JSON-encoded ping payload comes to, uncompressed.

This Week in Glean: What If I Want To Collect All The Data?

(“This Week in Glean” is a series of blog posts that the Glean Team at Mozilla is using to try to communicate better about our work. They could be release notes, documentation, hopes, dreams, or whatever: so long as it is inspired by Glean. All “This Week in Glean” blog posts are listed in the TWiG index).

Mozilla’s approach to data is “as little as necessary to get the job done” as espoused in our Firefox Privacy Promise and put in a shape you can import into your own organization in Mozilla’s Lean Data Practices. If you didn’t already know that Glean is a Mozilla project, you’d find out very quickly by using it. All of its systems are designed with the idea that you’ve carefully considered your instrumentation ahead of time, and you’ve done some review to ensure that the collection aligns with your values.

(This happens to have some serious knock-on benefits for data democratization and tooling that allows Mozilla’s small Data Org to offer some seriously-powerful insights on a shoestring budget, which you can learn more about in a talk I gave to Ubisoft at their Data Summit in 2021.)

Less Data, as the saying goes, implies Greater Data and Greatest Data. Or in a less memetic way, Mozilla wants to collect less data… but less than what?

Less than more, certainly. But how much more? How much is too much?

How much is “all”?

Since my brain’s weird I decided to pursue this thought experiment of “What is the _most_ data you could collect from a software project being used?”.

Well, looking at Firefox, every button press and page load and scroll and click and and and… all of that matters. As does the state of Firefox when it’s being clicked and scrolled and so forth. Typing in the urlbar is different if you already have a page loaded. Opening your first tab is different from opening your nine-thousand-two-hundred-and-fiftieth.

And, underneath it all, is the code. How fast is it running? How much memory are we using? All these performance questions that Firefox Telemetry was originally built to answer. Is code on line 123 of file XYZ.cpp running? Is it running well? What do we run next?

For software this means that, to record all of the data, we’d need to know the full state of the program at every expression it runs in every line of code. At every advancement of the Program Counter, we’d need to dump the entire Stack and Heap.

Yikes! That’s gigabytes of data per clock cycle.

Well, maybe we can be cleverer than this. Another one of those projects Mozilla incubated that now has a whole community of contributors and users (like Rust) is a lightweight record-and-replay debugger called rr. The rr debugger collects traces of a running piece of software and can deterministically replay it over and over again (backwards, even!), meaning it has all the information we need in it.

So a decent size estimate for “all the data” might be the size of one of these trace recordings. They’re big, but not “full heap and stack at every program counter” big. A short test run of Firefox was about 2GB for a one minute run (albeit without any user interaction or graphics).

Could Glean collect traces like these? Or bigger ones after, say, a full day’s use? Not easily. Not without modification.

Let’s say we did those modifications. Let’s push this thought experiment further. What does that mean for analysis? Well, we’d have all these recordings we could spin up a VM to replay for us. If we want the number of open tabs, we could replay it and sample that count whenever we wanted.

This would be a seismic shift in how instrumentation interacted with analysis. We’d no longer have to ship code to instrument Firefox, we could “simply” (in quotes because using rr requires you to be a programming nerd) replay existing traces and extract the new data we needed.

It would also be absolutely horrible. We’d have to store every possible metric just in case we wanted _one_ of them. And there’s so much data in these traces that Mozilla doesn’t want to store: pictures you looked at, videos you watched, videos you uploaded… good grief. We don’t want any of that.

(( I’d like to take a second to highlight that this is a thought experiment: Mozilla doesn’t do this. We don’t have plans to do this. In fact, Mozilla’s Data Privacy Principles (specifically “Limited Data”) and Mozilla’s Manifesto (specifically Principle 4 “Individuals’ security and privacy on the internet are fundamental and must not be treated as optional.”) pretty clearly state how we think about data like this. ))

And processing these traces into a useful form for analysis to be performed would take the CPU processing power of a small country, over and over again.

(( And rr introduces a 20% performance penalty which really wouldn’t ingratiate us to our users. And it only works on Linux meaning the data we’d have access to wouldn’t be representative of our user base anyway. ))

And what was the point of this again? Right. We’re here to quantify what “less data” means. But how can we do that, even knowing as we do now what the size of “all data” is? Can we compare the string value of the profile directory’s random portion to the url the user visits the most? Are those both 1 piece of data that we can compare to the N pieces of data we get in a full rr trace? Mozilla doesn’t think they’re the same, since we categorize (and thus treat) these collections differently.

All in all, figuring out the maximum amount of data you could collect in order to contextualize how much less of it you are collecting might not be meaningful.

Oh well.

I guess this means that the only way Mozilla (and you!) can continue to quantify “less data” is by comparing it to “no data” – the least possible amount of data.

:chutten

This Week in Glean: How Long Must I Wait Before I Can See My Data?

(“This Week in Glean” is a series of blog posts that the Glean Team at Mozilla is using to try to communicate better about our work. They could be release notes, documentation, hopes, dreams, or whatever: so long as it is inspired by Glean. All “This Week in Glean” blog posts are listed in the TWiG index).

You’ve heard about this cool Firefox on Glean thing and wish to instrument a part of Firefox Desktop. So you add a metrics.yaml definition for a new Glean metric, commit a piece of instrumentation to mozilla-central, and then Lando lands it for you.

When can you expect the data collected when users’ Firefoxes trigger that instrumentation to show up in a queryable place like BigQuery?

The answer is one of the more common phrases I say when it comes to data: Well, it depends.

In the broadest sense, we’re looking at two days:

1) A new Nightly build will be generated within 12h of your commit.

2) Users will pick up the new Nightly build fairly quickly after that, and start triggering the instrumentation.

3) The following 4am a “metrics” ping containing data from your instrumentation will be submitted (or some time later if Firefox isn’t running at 4am)

4) A new schema generated to include your new metric definition will have been deployed overnight

5) The following 12am UTC a new partition of our user-facing stable views will have the previous day’s submissions available.

And then you commence querying! Easy as that.

Any questions?

The Questions:

What if I added a new metrics.yaml file?

That file needs to land in gecko-dev (the github mirror of mozilla-central) first. Only then can we (and by “we” I here mean the Data Team, by means of a bug you file) update the data pipeline. Then you get to wait until the next weekday’s midnight UTC (ish) for a schema deploy as per Step 4.

Generally this doesn’t add too much delay, but if landing the file happens after the pipeline folks have gone home, we get to wait until the next weekday’s midnight UTC.

The Nightly population is small and weird. How long until we get data from release?

Uptake of code to release takes a while longer. Code lands in mozilla-central, and gets into the next Nightly within 12h. Getting to Beta from Nightly means waiting until the next merge day (these days that’s on the first Monday of the month, or thereabouts). Getting to Release from Beta means waiting until the merge day after that.

If you’re unlucky, you’ll be waiting over two months for your instrumentation to be in a Release build that users can pull down.

And then you get to wait for enough Release users to update that you’re getting a representative sample. (This could take a week or so.)

So… nine weeks?

That sounds really bad! Is there anything we can do?

Why yes.

The first thing we can do is adjust our expectations. There’s a four-week sway from the worst-case to best-case on this slow path. It isn’t likely that you’ll always be landing instrumentation immediately after a merge day and get to wait the whole month until Nightly merges to Beta.

Your average wait for that is only two weeks. And the best case is a matter of a day or two.

So cross your fingers, and hope your luck is with you.

Secondly, instrumentation is (by itself) very low-risk, so you can “uplift” the instrumentation change directly to Beta without waiting for merge day.

This can cut your route to release down to _two weeks_, by (e.g.) landing in Nightly on Monday Nov 22, verifying that it works on Tuesday, requesting uplift on Wednesday, getting uplifted in the last Beta on Thursday Nov 25, then making the merge from Beta to Release on Dec 6.

(You do still get to wait a third week for the release population to update to the latest version.)

Thirdly, what are the chances that your instrumentation is measuring a feature you just built or just turned on? You want that feature to benefit from the slow-roll exposure to the more tolerant audiences of Nightly and Beta before it reaches Release, right? Automated testing is great, but nothing can simulate the wild variety of use cases and hardware combinations your feature will experience in the Real World.

So what point is there in getting your instrumentation into Release before the feature under instrumentation reaches it? Instead of measuring the interval between landing instrumentation and beginning analysis, perhaps measure the interval between the release of the feature you wish to instrument and beginning analysis?

That interval is only a day: gotta wait for that partition in the stable view. Sounds much better, doesn’t it?

Still, can I get data any faster?

The fastest time from Point A) Landing a metric, to Point B) Performing preliminary analysis on a metric, is about 12h:

1) Land your code just before a new Nightly is cut.

2) Hope that the number of Nightly users that update to the latest build over the next twelve hours is enough for your purposes.

If you didn’t luck out and have a schema deploy, you’ll need to dig your data out of the additional_properties JSON column. If you are lucky, you can use the friendly columns instead.

To get to the data before the nightly copy-deduplicate to stable views, you’ll be querying the live tables instead. You need to fully-qualify that table name. You need to realize that we haven’t deduped anything in here. And you need to take narrow slices, because we can’t cluster the data effectively here, so querying can get expensive, fast.
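Put together, a live-table query might look something like the sketch below. The `_live` dataset naming and the JSON layout of additional_properties reflect my understanding of the conventions; double-check both before leaning on them.

SELECT
  client_info.app_display_version AS app_version,
  -- A brand-new metric that hasn't had a schema deploy yet won't have its own
  -- column; it lives inside the additional_properties JSON instead.
  JSON_EXTRACT(additional_properties, '$.metrics.counter') AS counters_not_yet_in_schema
FROM
  -- Fully-qualified live table: not deduplicated, not clustered like the stable views.
  `moz-fx-data-shared-prod.firefox_desktop_live.metrics_v1`
WHERE
  -- Keep the slice narrow; broad scans of live tables get expensive, fast.
  DATE(submission_timestamp) = '2021-11-23'
LIMIT 1000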

Can I get data that quickly from release?

Not yet.

I’ve seen a proposal internally for dynamically-defined metrics which get pushed to running Firefox instances (talk to :esmyth if you’re interested). Though its present form is proposing the process and possibility, not the technology, there’s a version of this I can see that would (for a subset of data collection) take the time from “I wish to instrument this” to “I am performing analysis on data received from (a subset of) the release Firefox population” down to within a business day.

Which is neat! But that speed brings risk, so it’ll take a while to design a system that doesn’t expose our users to that risk.

Don’t expect this for Christmas, is I guess what I mean : )

:chutten

This Week in Glean: The Three Roles of Data Engagements

(“This Week in Glean” is a series of blog posts that the Glean Team at Mozilla is using to try to communicate better about our work. They could be release notes, documentation, hopes, dreams, or whatever: so long as it is inspired by Glean. All “This Week in Glean” blog posts are listed in the TWiG index).

I’ve just recently started my sixth year working at Mozilla on data and data-adjacent things. In those years I’ve started to notice some patterns in how data is approached, so I thought I’d set them down in a TWiG because Glean’s got a role to play in them.

Data Engagements

A Data Engagement is when there’s a question that needs to engage with data to be answered. Something like “How many bookmarks are used by Firefox users?”.

(No one calls these Data Engagements but me, and I only do because I need to call them _something_.)

I’ve noticed three roles in Data Engagements at Mozilla:

  1. Data Consumer: The Question-Asker. The Temperature-Taker. This is the one who knows what questions are important, and is frustrated without an answer until and unless data can be collected and analysed to provide it. “We need to know how many bookmarks are used to see if we should invest more in bookmark R&D.”
  2. Data Analyst: The Answer-Maker. The Stats-Cruncher. This is the one who can use Data to answer a Consumer’s Question. “Bookmarks are used by Canadians more than Mexicans most of the time, but only amongst profiles that have at least one bookmark.”
  3. Data Instrumentor: The Data-Digger. The Code-Implementor. This one can sift through product code and find the correct place to collect the right piece of data. “The Places database holds many things, we’ll need to filter for just bookmarks to count them.”

(diagrams courtesy of :brizental)

It’s through these three working in concert — The Consumer having a question that the Instrumentor instruments to generate data the Analyst can analyse to return an answer back to the Consumer — that a Data Engagement succeeds.

At Mozilla, Data Engagements succeed very frequently in certain circumstances. The Graphics team answers many deeply-technical questions about Firefox running in the wild to determine how well WebRender is working. The Telemetry team examines the health of the data collection system as a whole. Mike Conley’s old Tab Switcher Dashboard helped find and solve performance regressions in (unsurprisingly) Tab Switching. These go well, and there’s a common thread here that I think is the secret of why: 

In these and the other high-success-rate Data Engagements, all three roles (Consumer, Analyst, and Instrumentor) are embodied by the same person.

It’s a common problem in the industry. It’s hard to build anything at all, but it’s least hard to build something for yourself. When you are in yourself the Question-Asker, Answer-Maker, and Data-Digger, you don’t often mistakenly dig the wrong data to create an answer that isn’t to the question you had in mind. And when you accidentally do make a mistake (because, remember, this is hard), you can go back in and change the instrumentation, update the analysis, or reword the question.

But when these three roles are in different parts of the org, or different parts of the planet, things get harder. Each role is now trying to speak the others’ languages and infer enough context to do their jobs independently.

In comes the Data Org at Mozilla, which has had great successes to date on the theme of “Making it easier for anyone to be their own Analyst”. Data Democratization. When you’re your own Analyst, there are fewer situations where the roles are disparate: Instrumentors who are their own Analysts know when data won’t be the right shape to answer their own questions, and Consumers who are their own Analysts know when their questions aren’t well-formed.

Unfortunately we haven’t had as much success in making the other roles more accessible. Everyone can theoretically be their own Consumer: curiosity in a data-rich environment is as common as lanyards at an industry conference[1]. Asking _good_ questions is hard, though. Possible, but hard. You could just about imagine someone in a mature data organization becoming able to tell the difference between questions that are important and questions that are just interesting through self-serve tooling and documentation.

As for being your own Instrumentor… that is something that only a small fraction of folks have the patience to do. I (and Mozilla’s Community Managers) welcome you to try: it is possible to download and build Firefox yourself. It’s possible to find out which part of the codebase controls which pieces of UI. It’s… well, it’s more than possible, it’s actually quite pleasant to add instrumentation using Glean… but on the whole, if you are someone who _can_ Instrument Firefox Desktop you probably already have a copy of the source code on your hard drive. If you check right now and it’s not there, then there’s precious little likelihood that will change.

(Unless you come and work for Mozilla, that is.)

So let’s assume for now that democratizing instrumentation is impossible. Why does it matter? Why should it matter that the Consumer is a separate person from the Instrumentor?

Communication

Each role communicates with each other role with a different language:

  • Consumers talk to Instrumentors and Analysts in units of Questions and Answers. “How many bookmarks are there? We need to know whether people are using bookmarks.”
  • Analysts speak Data, Metadata, and Stats. “The median number of bookmarks is, according to a representative sample of Firefox profiles, twelve (confidence interval 99.5%).”
  • Instrumentors speak Data and Code. “There’s a few ways we delete bookmarks, we should cover them all to make sure the count’s correct when the next ping’s sent”

Some more of the Data Org and Mozilla’s greatest successes involve supplying context at the points in a Data Engagement where they’re most needed. We’ve gotten exceedingly good at loading context about data (metadata) to facilitate communication between Instrumentors and Analysts with tools like Glean Dictionary.

Ah, but once again the weak link appears to be the communication of Questions and Answers between Consumers and Instrumentors. Taking the above example, does the number of bookmarks include folders?

The Consumer knows, but the further away they sit from the Instrumentor, the less likely that the data coming from the product and fueling the analysis will be the “correct” one.

(Either including or excluding folders would be “correct” for different cases. Which one do you think was “more correct”?)

So how do we improve this?

Glean

Well, actually, Glean doesn’t have a solution for this. I don’t actually know what the solutions are. I have some ideas. Maybe we should share more context between Consumers and Instrumentors somehow. Maybe we should formalize the act of question-asking. Maybe we should build into the Glean SDK a high-enough level of metric abstraction that instead of asking questions, Consumers learn to speak a language of metrics.

The one thing I do know is that Glean is absolutely necessary to making any of these solutions possible. Without Glean, we have too many systems that are fractally complex for any context to be relevantly shared. How can we talk about sharing context about bookmark counts when we aren’t even counting things consistently[2]?

Glean brings that consistency. And from there we get to start solving these problems.

Expect me to come back to this realm of Engagements and the Three Roles in future posts. I’ve been thinking about:

  • how tooling affects the languages the roles speak amongst themselves and between each other,
  • how the roles are distributed on the org chart,
  • which teams support each role,
  • how Data Stewardship makes communication easier by adding context and formality,
  • how Telemetry and Glean handle the same situations in different ways, and
  • what roles Users play in all this. No model about data is complete without considering where the data comes from.

I’m not sure how many I’ll actually get to, but at least I have ideas.

:chutten

[1] Other rejected similes include “as common as”: maple syrup on Canadian breakfast tables, frustration in traffic, sense isn’t.

[2] Counting is harder than it looks.

This Week in Glean: Data Reviews are Important, Glean Parser makes them Easy

(“This Week in Glean” is a series of blog posts that the Glean Team at Mozilla is using to try to communicate better about our work. They could be release notes, documentation, hopes, dreams, or whatever: so long as it is inspired by Glean. All “This Week in Glean” blog posts are listed in the TWiG index).

At Mozilla we put a lot of stock in Openness. Source? Open. Bug tracker? Open. Discussion Forums (Fora?)? Open (synchronous and asynchronous).

We also have an open process for determining if a new or expanded data collection in a Mozilla project is in line with our Privacy Principles and Policies: Data Review.

Basically, when a new piece of instrumentation is put up for code review (or before, or after), the instrumentor fills out a form and asks a volunteer Data Steward to review it. If the instrumentation (as explained in the filled-in form) is obviously in line with our privacy commitments to our users, the Data Steward gives it the go-ahead to ship.

(If it isn’t _obviously_ okay then we kick it up to our Trust Team to make the decision. They sit next to Legal, in case you need to find them.)

The Data Review Process and its forms are very generic. They’re designed to work for any instrumentation (tab count, bytes transferred, theme colour) being added to any project (Firefox Desktop, mozilla.org, Focus) and being collected by any data collection system (Firefox Telemetry, Crash Reporter, Glean). This is great for the process as it means we can use it and rely on it anywhere.

It isn’t so great for users _of_ the process. If you only ever write Data Reviews for one system, you’ll find yourself answering the same questions with the same answers every time.

And Glean makes this worse (better?) by including in its metrics definitions almost every piece of information you need in order to answer the review. So now you get to write the answers first in YAML and then in English during Data Review.

But no more! Introducing glean_parser data-review and mach data-review: command-line tools that will generate for you a Data Review Request skeleton with all the easy parts filled in. It works like this:

  1. Write your instrumentation, providing full information in the metrics definition.
  2. Call python -m glean_parser data-review <bug_number> <list of metrics.yaml files> (or mach data-review <bug_number> if you’re adding the instrumentation to Firefox Desktop).
  3. glean_parser will parse the metrics definitions files, pull out only the definitions that were added or changed in <bug_number>, and then output a partially-filled-out form for you.

Here’s an example. Say I’m working on bug 1664461 and add a new piece of instrumentation to Firefox Desktop:

fog.ipc:
  replay_failures:
    type: counter
    description: |
      The number of times the ipc buffer failed to be replayed in the
      parent process.
    bugs:
      - https://bugzilla.mozilla.org/show_bug.cgi?id=1664461
    data_reviews:
      - https://bugzilla.mozilla.org/show_bug.cgi?id=1664461
    data_sensitivity:
      - technical
    notification_emails:
      - chutten@mozilla.com
      - glean-team@mozilla.com
    expires: never

I’m sure to fill in the `bugs` field correctly (because that’s important on its own _and_ it’s what glean_parser data-review uses to find which data I added), and have categorized the data_sensitivity. I also included a helpful description. (The data_reviews field currently points at the bug I’ll attach the Data Review Request for. I’d better remember to come back before I land this code and update it to point at the specific comment…)

Then I can simply use mach data-review 1664461 and it spits out:

!! Reminder: it is your responsibility to complete and check the correctness of
!! this automatically-generated request skeleton before requesting Data
!! Collection Review. See https://wiki.mozilla.org/Data_Collection for details.

DATA REVIEW REQUEST
1. What questions will you answer with this data?

TODO: Fill this in.

2. Why does Mozilla need to answer these questions? Are there benefits for users?
   Do we need this information to address product or business requirements?

TODO: Fill this in.

3. What alternative methods did you consider to answer these questions?
   Why were they not sufficient?

TODO: Fill this in.

4. Can current instrumentation answer these questions?

TODO: Fill this in.

5. List all proposed measurements and indicate the category of data collection for each
   measurement, using the Firefox data collection categories found on the Mozilla wiki.

Measurement Name | Measurement Description | Data Collection Category | Tracking Bug
---------------- | ----------------------- | ------------------------ | ------------
fog_ipc.replay_failures | The number of times the ipc buffer failed to be replayed in the parent process.  | technical | https://bugzilla.mozilla.org/show_bug.cgi?id=1664461


6. Please provide a link to the documentation for this data collection which
   describes the ultimate data set in a public, complete, and accurate way.

This collection is Glean so is documented
[in the Glean Dictionary](https://dictionary.telemetry.mozilla.org).

7. How long will this data be collected?

This collection will be collected permanently.
**TODO: identify at least one individual here** will be responsible for the permanent collections.

8. What populations will you measure?

All channels, countries, and locales. No filters.

9. If this data collection is default on, what is the opt-out mechanism for users?

These collections are Glean. The opt-out can be found in the product's preferences.

10. Please provide a general description of how you will analyze this data.

TODO: Fill this in.

11. Where do you intend to share the results of your analysis?

TODO: Fill this in.

12. Is there a third-party tool (i.e. not Telemetry) that you
    are proposing to use for this data collection?

No.

As you can see, this Data Review Request skeleton comes partially filled out. Everything you previously had to mechanically fill out has been done for you, leaving you more time to focus on only the interesting questions like “Why do we need this?” and “How are you going to use it?”.

Also, this saves you from having to remember the URL to the Data Review Request Form Template each time you need it. We’ve got you covered.

And since this is part of Glean, this means this is already available to every project you can see here. This isn’t just a Firefox Desktop thing. 

Hope this saves you some time! If you can think of other time-saving improvements we could add to Glean once so that every Mozilla project can take advantage of them, please tell us on Matrix.

If you’re interested in how this is implemented, glean_parser’s part of this is over here, while the mach command part is here.

:chutten

This Week in Glean: Firefox Telemetry is to Glean as C++ is to Rust

(“This Week in Glean” is a series of blog posts that the Glean Team at Mozilla is using to try to communicate better about our work. They could be release notes, documentation, hopes, dreams, or whatever: so long as it is inspired by Glean. You can find an index of all TWiG posts online.)

I had this goofy idea that, like Rust, the Glean SDKs (and Ecosystem) aim to bring safety and higher-level thought to their domain. This is in comparison to how, like C++, Firefox Telemetry is built out of flexible primitives that assume you very much know what you’re doing and cannot (will not?) provide any clues in its design as to how to do things properly.

I have these goofy thoughts a lot. I’m a goofy guy. But the more I thought about it, the more the comparison seemed apt.

In Glean wherever we can we intentionally forbid behaviour we cannot guarantee is safe (e.g. we forbid non-commutative operations in FOG IPC, we forbid decrementing counters). And in situations where we need to permit perhaps-unsafe data practices, we do it in tightly-scoped areas that are identified as unsafe (e.g. if a timing_distribution uses accumulate_raw_samples_nanos you know to look at its data with more skepticism).

In Glean we encourage instrumentors to think at a higher level (e.g. memory_distribution instead of a Histogram of unknown buckets and samples) thereby permitting Glean to identify errors early (e.g. you can’t start a timespan twice) and allowing Glean to do clever things about it (e.g. in our tooling we know counter metrics are interesting when summed, but quantity metrics are not). Speaking of those errors, we are able to forbid error-prone behaviour through design and use of language features (e.g. In languages with type systems we can prevent you from collecting the wrong type of data) and when the error is only detectable at runtime we can report it with a high degree of specificity to make it easier to diagnose.

There are more analogues, but the metaphor gets strained. (( I mean, I guess a timing_distribution’s `TimerId` is kinda the closest thing to a borrow checker we have? Maybe? )) So I should probably stop here.

Now, those of you paying attention might have already seen this relationship. After all, as we all know, glean-core (which underpins most of the Glean SDKs regardless of language) is actually written in Rust whereas Firefox Telemetry’s core of Histograms, Scalars, and Events is written in C++. Maybe we shouldn’t be too surprised when the language the system is written in happens to be reflected in the top-level design.

But! glean-core was (for a long time) written in Kotlin from stem to stern. So maybe it’s not due to language determinism and is more to do with thoughtful design, careful change processes, and a list of principles we hold to firmly as the number of supported languages and metric types continues to grow.

I certainly don’t know. I’m just goofing around.

:chutten

Responsible Data Collection is Good, Actually (Ubisoft Data Summit 2021)

In June I was invited to talk at Ubisoft’s Data Summit about how Mozilla does data. I’ve given a short talk on this subject before, but this was an opportunity to update the material, cover more ground, and include more stories. The talk, including questions, comes in at just under an hour and is probably best summarized by the synopsis:

Learn how responsible data collection as practiced at Mozilla makes cataloguing easy, stops instrumentation mistakes before they ship, and allows you to build self-serve analysis tooling that gets everyone invested in data quality. Oh, and it’s cheaper, too.

If you want to skip to the best bits, I included shameless advertising for Mozilla VPN at 3:20 and becoming a Mozilla contributor at 14:04, and I lose my place in my notes at about 29:30.

Many thanks to Mathieu Nayrolles, Sebastien Hinse and the Data Summit committee at Ubisoft for guiding me through the process and organizing a wonderful event.

:chutten

Data Science is Interesting: Why are there so many Canadians in India?

Any time India comes up in the context of Firefox and Data I know it’s going to be an interesting day.

They’re our largest Beta population:

pie chart showing India by far the largest at 33.2%

They’re our second-largest English user base (after the US):

pie chart showing US as largest with 37.8% then India with 10.8%

 

But this is the interesting stuff about India that you just take for granted in Firefox Data. You come across these factoids for the first time and your mind is all blown and you hear the perhaps-apocryphal stories about Indian ISPs distributing Firefox Beta on CDs to their customers back in the Firefox 4 days… and then you move on. But every so often something new comes up and you’re reminded that no matter how much you think you’re prepared, there’s always something new you learn and go “Huh? What? Wait, what?!”

Especially when it’s India.

One of the facts I like to trot out to catch folks’ interest is how, when we first released the Canadian English localization of Firefox, India had more Canadians than Canada. Even today India is, after Canada and the US, the third largest user base of Canadian English Firefox:

pie chart of en-CA using Firefox clients by country. Canada at 75.5%, US at 8.35%, then India at 5.41%
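For the curious, the sort of query behind a breakdown like that is short. A sketch, with the dataset name and date chosen purely for illustration:

SELECT
  country,
  COUNT(DISTINCT client_id) AS clients
FROM
  `mozdata.telemetry.clients_daily`  -- illustrative dataset name
WHERE
  submission_date = '2021-04-01'  -- illustrative date
  AND locale = 'en-CA'
GROUP BY
  country
ORDER BY
  clients DESC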

 

Back in September 2018 Mozilla released the official Canadian English-localized Firefox. You can try it yourself by selecting it from the drop down menu in Firefox’s Preferences/Options in the “Language” section. You may have to click ‘Search for More Languages’ to be able to add it to the list first, but a few clicks later and you’ll be good to go, eh?

(( Or, if you don’t already have Firefox installed, you can select which language and dialect of Firefox you want from this download page. ))

Anyhoo, the Canadian English locale quickly gained a chunk of our install base:

uptake chart for en-CA users in Firefox in September 2018. Shows a sharp uptake followed by a weekly seasonal pattern with weekends lower than week days

…actually, it very quickly gained an overlarge chunk of our install base. Within a week we’d reached over three quarters of the entire Canadian user base?! Say we have one million Canadian users, that first peak in the chart was over 750k!

Now, we Canadian Mozillians suspected that there was some latent demand for the localized edition (they were just too polite to bring it up, y’know)… but not to this order of magnitude.

So back around that time a group of us including :flod, :mconnor, :catlee, :Aryx, :callek (and possibly others) fell down the rabbit hole trying to figure out where these Canadians were coming from. We ran down the obvious possibilities first: errors in data, errors in queries, errors in visualization… who knows, maybe I was counting some clients more than once a day? Maybe I was counting other Englishes (like South African and Great Britain) as well? Nothing panned out.

Then we guessed that maybe Canadians in Canada weren’t the only ones interested in the Canadian English localization. Originally I think we made a joke about how much Canadians love to travel, but then the query finished running and showed us just how many Canadians there must be in India.

We were expecting a fair number of Canadians in the US. It is, after all, home to Firefox’s largest user base. But India? Why would India have so many Canadians? Or, if it’s not Canadians, why would Indians have such a preference for the English spoken in ten provinces and three territories? What is it about one of two official languages spoken from sea to sea to sea that could draw their attention?

Another thing that was puzzling was the raw speed of the uptake. If users were choosing the new localization themselves, we’d have seen a shallow curve with spikes as various news media made announcements or as we started promoting it ourselves. But this was far sharper an incline. This spoke to some automated process.

And the final curiosity (or clue, depending on your point of view) was discovered when we overlaid British English (en-GB) on top of the Canadian English (en-CA) uptake and noticed that (after accounting for some seasonality at the time due to the start of the school year) this suddenly-large number of Canadian English Firefoxes was drawn almost entirely from the number previously using British English:

chart showing use of British and Canadian English in Firefox in September 2018. The rise in use of Canadian English is matched by a fall in the use of British English.

It was all of this, put together that day, that led us to our Best Guess. I’ll give you a little space to make your own guess. If you think yours is a better fit for the evidence, or simply want to help out with Firefox in Canadian English, drop by the Canadian English (en-CA) Localization matrix room and let us know! We’re a fairly quiet bunch who are always happy to have folks help us keep on top of the new strings added or changed in Mozilla projects or just chat about language stuff.

Okay, got your guess made? Here’s ours:

en-CA is alphabetically before en-GB.

Which is to say that the Canadian English Firefox, when put in a list with all the other Firefox builds (like this one which lists all the locales Firefox 88 comes in for Windows 64-bit), comes before the British English Firefox. We assume there is a population of Firefoxes, heavily represented in India (and somewhat in the US and elsewhere), that are installed automatically from a list like this one. This automatic installation is looking for the first English build in this list, and it doesn’t care which dialect. Starting September of 2018, instead of grabbing British English like it’s been doing for who knows how long, it had a new English higher in the list: Canadian English.

But who can say! All I know is that any time India comes up in the data, it’s going to be an interesting day.

:chutten