Ten-Year Moziversary

I’m a few days late publishing this, but this October marks the tenth anniversary of my first day working at Mozilla. I’m on my third hardware refresh (a Dell XPS which I can’t recommend), still just my third CEO, and now 68 reorgs in.

For something as momentous as breaking into two-digit territory, there’s not really much that’s different from last year. I’m still trying to get Firefox Desktop to use Glean instead of Legacy Telemetry and I’m still not blogging nearly as much as I’d like. Though, I did get promoted earlier this year. I am now a Senior Staff Software Engineer, which means I’m continuing on the journey of doing fewer things myself and instead empowering other people to do things.

As for predictions, I was spot on about FOG Migration actually taking off a little — in fact, quite a lot. All data collection in Firefox Desktop now either passes through Glean to get to Legacy Telemetry, has Glean mirroring alongside it, or has been removed. This is thanks in large part to Florian Quèze and his willingness to stop asking when we could start and to just migrate the codebase. Now we’re working on moving the business data calculations onto Glean-sent data, and getting individual teams to change over too. If you’re reading this and were looking for an excuse to remove Legacy Telemetry from your component, this is your excuse.

My prediction that there’d be an All Hands was wrong. Mozilla Leadership has decided that the US is neither a place they want to force people to travel to nor is it a place they want to force people to travel out of (and then need to attempt to return to) in the current political climate. This means that business gatherings of any size are… complicated. Some teams have had simultaneous summits in cities both within and without the US. Some teams have had one or the other side call in virtually from their usual places of work. And our team… well, we’ve not gathered at all. Which is a bummer, since we’ve had a few shuffles in the ranks and it’d be good to get us all in one place. (I will be in Toronto with some fellow senior Data Engineering folks before the end of the year, but that’s the extent of work travel.) I’m broadly in favour of removing the requirement and expectation of travel over the US border — too many people have been disappeared in too many ways. We don’t want to make anyone feel as though they have to risk it. But it seems as though we’re also leaning away from allowing people to risk it if they want to, which is a level of paternalism that I didn’t want to see.

I did have one piece of “work” travel in that I attended CSV Conf in Bologna, Italy. Finally spent my Professional Development budget, and wow what a great investment. I learned so much and had a great time, and that was despite the heat and humidity (goodness, Italy. I was in your North (ish). In September. Why you gotta 30degC me like this?). I’m on the lookout for other great conferences to attend in 2026, so if you know any, get in touch.

My prediction that I’d still be three CEOs in because the search for a new one wouldn’t have completed by now: spot on. Ditto on executing my hardware refresh, though I’m still using a personal monitor at work. I should do something about that.

My prediction that we’d stop putting AI in everything has partially come true. There’s been a noticeable shift away from “Put genAI in it and find a problem for it to (maybe) solve” towards “If you find a problem that genAI can help with, give it a try.” You wouldn’t notice it, necessarily, looking at feature announcements for Firefox, as quite a lot of the integration infrastructure all landed in the past couple of months, making headlines. My feelings on LLMs and genAI have gained layers and nuance since last year. They’re still plagiarism machines that are illegally built by the absolute worst people in ways that worsen the climate catastrophe and entrench existing inequalities. But now they’ve apparently become actually useful in some ways. I’ve read reports from very senior developers about use cases that LLMs have been able to assist with. They are narrow use cases — you must only use it to work on components you understand well, you must only use it on tasks you would do yourself if you had the time and energy — but they’re real. And that means my usual hard line of “And even if you ignore the moral, ethical, environmental, economic, and industry concerns about using LLMs: they don’t even work” no longer applies. And in situations like a for-profit corporation led by people from industry… ignoring the moral, ethical, environmental, economic, and industry concerns is de rigueur.

Add these to the sorta-kinda-okay things LLMs can do like natural language processing and aiding in training and refinement of machine translation models, and it looks as though we’re figuring out the “reheat the leftovers” and “melt butter and chocolate” use cases for these microwave ovens.

It still remains to be seen if, after the bubble pops, these nuclear-powered lake-draining art-stealing microwaves will find a home in many kitchens. I expect the fully-burdened cost will be awfully prohibitive for individuals who just want it to poorly regurgitate Wikipedia articles in a chat interface. It might even be too spicy for enterprises who think (likely erroneously) that they confer some instantaneous and generous productivity multiplier. Who knows.

All I know is that I still don’t like it. But I’ll likely find myself using one before the end of the year. If so, I intend to write up the experience and hopefully address my blogging drought by publishing it here.

Another thing that happened this year that I alluded to in last year’s post was the Google v DOJ ruling in the US. Well, the first two rulings anyway. Still years of appeal to come, but even the existing level of court seemed to agree that the business model that allows Mozilla to receive a bucketload of dollabux from Google for search engine placement in Firefox (aka, the thing that supplies most of my paycheque) should not be illegal at this time. Which is a bit of a relief. One existential threat to the business down… for now.

But mostly? This year has been feeling a little like 2016 again. Instead of The Internet of Things (IoT, where the S stands for Security), it’s genAI. Instead of Mexico and Muslims it’s Antifa and Trans people. The Jays are in the postseason again. Shit’s fucked and getting worse. But in all that, someone still has to rake the leaves and wash the dishes. And if I don’t do it, it won’t get done.

With that bright spot highlighted, what are my predictions for the new year:

  • I will requisition a second work monitor so I stop using personal hardware for work things.
  • FOG Migration (aka the Instrumentation Consolidation Project) will not fully remove all of Legacy Telemetry by this time next year. There’s evidence of cold feet on the “change business metrics to Glean-sent data” front, and even if there weren’t, there’s such a long tail that there’s no doubt something load-bearing that’d delay things to Q4 2026. I _am_ however predicting that FOG Migration will no longer be all-encompassing work — I will have a chance to do something else with my time.
  • I predict that one of the things I will do with that extra time is, since MoCo insists on a user population measurement KPI, push for a sensible user population measurement. Measuring the size of the user population by counting distinct _profiles_ we’ve _received_ a data packet from on a given day (not the day the data was collected)? We can do better.
  • I don’t think there’s going to be an All Hands next year. If there is, I’d expect it to be Summit style: multiple cities simultaneously, with video links. Fingers crossed for Toronto finally getting its chance. Though I suppose if the people of the US rose up and took back their country, or if the current President should die, that could change the odds a little. Other US administrations saw the benefit of freedom of movement, regardless of which side of the aisle.
  • Maybe the genAI bubble will have burst? Timing these things is impossible, even if this weren’t the first time in history that this much of the US’ (and the world’s) economy was inflating it. The sooner it bursts, the better, as it’s only getting bigger. (I suppose an alternative would be for the next shiny thing to happen along and the interest in genAI to dwindle more slowly with no single burst, just a bunch of crashes. Like blockchain/web3/etc. In that case a slower diminishing would be better than a sooner burst.)
  • I predict that a new MoCo CEO will have been found, but not yet sworn in by this time next year. I have no basis for this prediction: vibes only.

To another year of supporting the Mission!

:chutten

Nine-Year Moziversary

On this day (or near it) in 2015, I joined the Mozilla project by starting work as a full-time employee of Mozilla Corporation. I’m two hardware refreshes in (I was bad at doing them on time, leaving my 2017 refresh until 2018 and my 2020 refresh until 2022! (though, admittedly, the 2020 refresh was actually pushed to the end of 2021 by a policy change in early 2020 moving from 2-year to 3-year refreshes)) and facing a third in February. Organizationally, I’m three CEOs and sixty reorgs in.

I’m still working on Data, same as last year. And I’m still trying to move Firefox Desktop to use solely Glean for its data collection system. Some of my predictions from last year’s moziversary post came true: I continued working on client code in Firefox Desktop, I hardly blogged at all, we continue to support collections in all of Legacy Telemetry’s systems (though we’ve excitingly just removed some big APIs), Glean has continued to gain ground in Firefox Desktop (we’re up to 4134 metrics at time of writing), and “FOG Migration” has continued to not happen (one prediction I did miss: that top-down guidance would change — it hasn’t, but interpretations of it sure have). And I’m publishing this moziversary blog post a little ahead of my moziversary instead of after it.

My biggest missed prediction was “We will quietly stop talking about AI so much, in the same way most firms have stopped talking about Web3 this year”. Mozilla, both Corporation and Foundation, seem unable to stop talking about AI (a phrase here meaning “large generative models built on extractive data mining which use chatbot UI”). Which, I mean, fair: it’s consuming basically all the oxygen and money in the industry at the moment. We have to have a position on it, and it’s appropriating “Open” language that Mozilla has a vested interest in protecting (though you’d be excused for forgetting that given how little we’ve tried to work with the FSF and assorted other orgs trying to shepherd the ideas and values of Open Source in the recent past). But we’ve for some reason been building products around these chatbots without interrogating whether that’s a good thing.

And you’d think with all our worry about what a definition of Open Source might mean, we’d make certain to only release products that are Open Source. But no.

I understand why we’re diving into products and trying to release innovative things in product shape… but Mozilla is famously terrible at building products. We’re okay at building services (I’m a fan of both Monitor and Relay). But where we seem to truly excel is in building platforms and infrastructure.

We build Firefox, the only independent browser, a train that runs on the rails of the Web. We build Common Voice, a community and platform for getting underserved languages (where which languages are used is determined by the community) the support they need. We built Rust, a memory-safe systems language that is now succeeding without Mozilla’s help. We built Hubs, a platform for bringing people together in virtual space with nothing but a web browser.

We’re just so much better at platforms and infrastructure. Why we don’t lean more into that, I don’t know.

Well, I _do_ know. Or I can guess. Our golden goose might be cooked.

How can Mozilla make money if our search deal becomes illegal? Maintaining a browser is expensive. Hosting services is expensive. Keeping the tech giants on their toes and compelling them to be better is expensive. We need money, and we’ve learned that there is no world where donations will be enough to fund even just the necessary work let alone any innovations we might try.

How do you monetize a platform? How do you monetize infrastructure?

Governments do it through taxation and funding. But Mozilla Corporation isn’t a government agency. It’s a conventional Silicon Valley private capital corporation (its relationship to Mozilla Foundation is unconventional, true, but I argue that’s irrelevant to how MoCo organizes itself these days). And the only process by which Silicon Valley seems to understand how to extract money to pay off their venture capitalists is products and consumers.

Now, Mozilla Corporation doesn’t have venture capital. You can read in the State of Mozilla that we operate at a profit each and every year with net assets valued at over a billion USD. But the environment in which MoCo operates — the place from which we hire our C-Suite, the place where the people writing the checks live — is saturated in venture capital and the ways of thinking it encourages.

This means Mozilla Corporation acts like its Bay Area peers, even though it’s special. Even though it doesn’t have to.

This means it does layoffs even when it doesn’t need to. Even when there’s no shareholders or fund managers to impress.

This means it increasingly speaks in terms of products and customers instead of projects and users.

This means it quickly loses sight of anything specifically Mozilla-ish about Mozilla (like the community that underpins specific systems crucial to us continuing to exist (support and l10n for two examples) as well as the general systems of word-of-mouth and keeping Mozilla and Firefox relevant enough that tech press keep writing about us and grandpas keep installing us) because it doesn’t fit the patterns of thought that developed while directing leveraged capital.

(( Which I don’t like, if my tone isn’t coming across clearly enough for you to have guessed. ))

Okay, that’s more than enough editorial for a Moziversary post. Let’s get to the predictions for the next year:

  • I still won’t blog as much as I’d like
  • “FOG Migration” might actually happen! We’ve finally managed to convince Firefox folks just how great Glean is and they might actually commit official resources! I predict that we’re still sending Legacy Telemetry by the end of next year, but only bits and pieces. A weak shadow of what we send today.
  • There’ll be an All Hands, but depending on the result of the US federal election in November I might not attend, because its location has been announced as Washington DC and I don’t know if the United States will be in any state next year to be trusted to keep me safe
  • We will stop putting AI in everything and hoping to accidentally make a product that’ll somehow make money, and will instead focus on finding problems Mozilla can solve and only then interrogating whether AI will help
  • The search for the new CEO will not have completed by next October, so I’ll still be three CEOs in, instead of four
  • I will execute on my hardware refresh on time this February, and maybe also get a new monitor so I’m not using my personal one for work.

Let’s see how it goes! Til next time.

:chutten

How to go from “Looks like something changed in a Firefox Desktop version” to “Here is a list of potential culprit bugs”

This will mostly be helpful to Firefox Desktop folks, so if you’re not one of those, please instead enjoy a different blogpost. I recommend this one about the three roles of data engagements.

So you’ve found yourself a plot that looks like this:

A timeseries bar plot that begins as an uptake curve then has a sudden drop around February 22. There is no legend and no y-axis as they are unimportant, and sometimes I like to be cagey about absolute figures.

You suspect this has something to do with a code change because, wouldn’t you know it, the sharp decline starts around Feb 22 and we released Firefox 123 on Feb 20. But where do you go from here? Here’s a step-by-step of how I went from this plot arriving in the Slack #data-help channel to finding the bugfix that most likely caused the change:

1. Ensure this is actually a version-specific change

It’s interesting that the cliff in the plot happened near a release day, and it’s an excellent intuition to consider code releases for these sorts of sea-changes in data volume or character. But we should verify that this is the case by grouping by mozfun.norm.truncate_version(app_version, 'major') AS major_version which in our case gives us:

The same timeseries bar plot as before, but coloured to show groups by major Firefox Desktop version. The cliff happens solely in the colours for Firefox 123 and above.

Sure enough, in this case the volume cliff happens entirely within the Firefox 123+ colours. If this isn’t what you get, then it’s somewhat less likely that this is caused by a client code change and this guide might not help you. But for us this is near-certain confirmation that the change in the data is caused by a code change that landed in Firefox 123… but which one?
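For reference, the grouping query is just a small tweak to whatever produced the first plot. A minimal sketch (the source table, columns, and date filter here are stand-ins for your own):

SELECT
  DATE(submission_timestamp) AS submission_date,
  mozfun.norm.truncate_version(app_version, 'major') AS major_version,
  COUNT(*) AS row_count
FROM my_telemetry_table  -- hypothetical; substitute your plot's source table
WHERE DATE(submission_timestamp) >= '2024-01-01'
GROUP BY submission_date, major_version
ORDER BY submission_date, major_version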

( This is where I spent a little time checking some frequent “gotcha” changes that could’ve happened. I checked: was it because data went from all-channel to pre-release-only? (No, the probe definitions didn’t change and the fall isn’t severe enough for that (would look more like an order of magnitude)) Was it because specific instrumentation within the group happened to expire in Fx123? (No, the first plot is grouped by specific probe, and all of the groups shared the same shape as their sum) Was it an incredibly-successful engagement-boosting experiment that ended? (No, there haven’t been any relevant experiments since last July) )

2. Figure out which Nightly builds are affected

Firefox Desktop releases new software versions twice a day on the Nightly channel. We can look at the numbers reported by these builds to narrow down what specific 12h period the code landed that caused this drastic shift. Or, well, you’d think we could, but when you group by build_id you get:

Another bar plot, but instead of the x-axis being time it is now "build id" which is a timestamp of a sort. The data is all over the place and patchy with no or little clear pattern.

Because our Nightly population isn’t randomly distributed across timezones, there are usage patterns that affect the population who use which build on which day. And sometimes there are “respins” where specific days will have more than 2 nightlies. And since our Nightly population is so small (You Can Help! Download Nightly Today!), and this data is a little sparse to begin with, little changes have big effects.

No, far more commonly the correct thing to do is to look at what I call a “build day”. This is how GLAM makes things useful, and this is how I make patterns visible. So group by SUBSTR(build_id, 1, 8) AS build_day, and you get:

It looks like a timeseries bar plot, but the x-axis is "build day" so it isn't quite. Notably, there's a sudden cliff starting with the nightlies for January 18.

Much better. We can see that the change likely landed in Jan 18’s nightlies. That Jan 18-20 are all of a level suggests to me that it probably ended up in all of Jan 18’s nightly builds (if it only landed in one of the (normally) two nightly builds we’d expect to see a short fall-off where Jan 18 would be more like an average between Jan 17 and 19.).
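In query terms it’s the same sketch as Step 1, swapping the grouping column (and, since we’re on the Nightly channel now, adding a channel filter; column names remain stand-ins):

SELECT
  SUBSTR(build_id, 1, 8) AS build_day,
  COUNT(*) AS row_count
FROM my_telemetry_table  -- hypothetical, as before
WHERE normalized_channel = 'nightly'  -- assumption: limit to Nightly builds
GROUP BY build_day
ORDER BY build_day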

Regardless of when during the day, we’re pretty sure we have this nailed down to only one day’s worth of patches! That’s good… but it could be better.

3. Going from build days to pushlog

Ever since I was the human glue keeping the (now-decommissioned) automated regression detection system “alerts.tmo” working, I’ve had a document on my disk reminding me how to transform build days or build_ids into a “pushlog” of changes that landed in the suspect builds. This is how it works:

  1. Get the hg revisions of the suspect builds by looking through this list of all firefox releases for the suspect builds’ ids. You want the final build of the day before the first suspect build day and the final build of the final suspect build day, which in this case are Jan 17 and Jan 18, so we get f593f07c9772 and 9c0c2aaab123:
A visual excerpt of the firefox releases list https://hg.mozilla.org/mozilla-central/firefoxreleases. For illustration only.
  2. Put them into this template: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange={}&tochange={} — which gives us https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=f593f07c9772&tochange=9c0c2aaab123

This gives you a list of all changes that are in the suspect builds, plus links to the specific code changes and the relevant bugs, with the topic sentence from each commit right there for you. Handy!

4. Going from a pushlog to a culprit

This is where human pattern matching, domain expertise, organizational memory, culture and practices, and institutional conventions all combine… or, to put it another way, I don’t know how to help you get from the list of all code that could have caused your data change to the one (or more) likely suspects. My brain has handily built me a heuristic and not handed me the source code, alas. But I’ve noticed some patterns:

  • Any change that is backed out can be disregarded. Often for reasons of test failures changes will be backed out and relanded later. Sometimes that’s later the same day. Sometimes that’s outside our pushlog. Skip any changes that have been backed out by disregarding any commits from a bug that is mentioned before a commit that says “Backed out N changesets (bug ###)…”.
  • You can often luck out by just text searching for keywords. It is custom at Mozilla to try to be descriptive about the “what” of a change in the commit’s topic, so you could try looking for “telemetry” or “ping” or “glean” to see if there’s anything from the data collection system itself in there. Or, since this particular example had to do with Firefox Relay’s integration with Firefox Desktop, I looked for “relay” (no hits) and then “form” (which hit a few times, like on the word “information”… but also on the culprit, which was in the form detector code).
  • This is a web view on the source code, so you’re not limited to what it gives you. If you have a mozilla-central checkout yourself, you can pull up the commits (if you’re using git-cinnabar you can use its hg2git functionality to change the revs from hg to git) and dump their sum-total changes to a viewer, or pipe it through grep, or turn it into a spreadsheet you can go through row-by-row, or anything you want. I’m lazy so I always try keywording on the pushlog first, but these are always there for when I strike out.

5. Getting it wrong

Even if you found the one and only commit that landed in a suspect build that is at all related, even if that commit’s bug specifically mentions that it fixed a double-counting issue, even if there’s commentary in the code review that explains that they expect to see this exact change you just saw… you might still be wrong.

Do not be brusque in your reporting. Do not cast blame. And for goodness’ sake be kind. Even if you are correct, being the person who caused a change that resulted in this investigation can be a not-fun experience. Ask Me How I Know.

Firefox Desktop is a complex system, and complex systems fail. It’s in their nature.


And that’s it! If you have any comments, questions, or (better yet) improvements, please find me on the #glean:mozilla.org channel on Matrix and I’d love to chat.

:chutten

Eight-Year Moziversary

At the end of my post for my seven-year moziversary, I made some predictions about what was to be and now has been the next year of work. And I got them pretty spot-on:

Predictions for the next year of Moz Work:

  • There’ll be another All Hands
  • Glean will continue to gain ground in Firefox Desktop
  • “FOG Migration” will not happen

There was an all hands. It was in Montreal. It was fun to have folks come to a city I knew a little bit (though I’m still sore we didn’t get June 2020’s Toronto All-Hands). Poutine. Bagels. And a cirque-themed closing party.

Glean continued to gain ground on Firefox Desktop. Last year’s post mentioned over 100 Glean probes in Firefox Desktop. The current count as of time of writing is 368. Now, some of this is due to some projects our team have undertaken, but a lot of it is organic.

This is despite “FOG Migration” not happening. Firefox leadership remained uninterested in the prospect of migrating existing data collections to be sent via Glean. Though in Montreal there were some conversations that suggest that this might be changing.

So, what have I been up to? Well, I discovered in January that a legacy data collection system (PingCentre) was subject to some data loss that was incompatible with how the data was being used (( You can imagine that data loss would be acceptable for certain things like performance measurement or feedback, so long as you could characterize the loss (e.g. you lose stuff randomly? Only the small numbers? Only the feedback from Ottawa?). It’s less acceptable for retention or revenue. )). By March, replacing PingCentre had become a top-level OKR and I was managing the project.

So this year has been spent growing an appreciation for Project Management as a discipline. I now have more opinions about work tracking than I ever dreamed I’d have (though, no, I’ve not set up Kanban or anything else).

I’ve also continued my practice of basically never saying No to someone who had a question for me. As much as I bemoan the new tendency of questions being asked over direct message instead of in a topic channel where anyone can help, it does bring me no little joy to partner in a data exploration, consult on answering awkward privacy/data questions from contributors, or debug someone’s test file “out loud” so they can follow along. It really is the people that make Mozilla special, so helping them feels like a high calling.

Which is why I find our continued focus on “AI” to be so baffling. So much of “AI” we hear about is dragging the humanity out of the Internet that Mozilla is so keen to protect. We seem to be just as bad as the Valley for using “AI” to mean everything from outstanding work on local machine translation (now available in Firefox 118), to LLMs spouting out incorrect answers when you ask them to explain CSS. I hope we provide some clarity about what we mean when we say “AI”, and draw a thick line between what we’re doing and the grifts being peddled all around us at great cost to truth and the environment.

I understand that we need to be in a business to be able to speak about it. It’s why I’m excited that we’re giving social media some attention. I can’t wait to see what those teams create for the world. But the way everything became “AI” so fast sounds like chasing the hype cycle.

As for me, what do I expect to do? First, I expect to finish up the year by migrating Use Counters to Glean. Then… who knows? Maybe the results of the Events Work Week will exceed expectations and require more investment. Maybe I’ll find another data collection system in Firefox Desktop that’s dropping between 2% and 15% of all data that’ll need replacing. Maybe I’ll finally get to rewrite the IPC layer so it leaves the main thread alone. Yeah, okay maybe not.

Predictions for the next year of moz work:

  • I’ll work on client code in Firefox Desktop
  • I’ll not blog as much as I’d like
  • We’ll continue to support existing collections in all of Legacy Telemetry’s core systems
  • There’ll be an All Hands (safe bet, as Dublin was announced in Montreal), and at least one more will be announced
  • Glean will continue to be used more on Firefox Desktop, and not just because Use Counters will juice the numbers (I will no doubt ascribe this increase to be disproportionately due to the (well-received) Glean talk I (finally) gave to a Firefox Front-End Engineering team)
  • “FOG Migration” will not happen, but new top-down guidance will be handed down expanding the circumstances where Glean is explicitly stated to be the data collection system of choice in Firefox Desktop (and not just because it provides the best API to Legacy Telemetry)
  • We will quietly stop talking about AI so much, in the same way most firms have stopped talking about Web3 this year
  • I will publish the moziversary blog post actually _on_ my moziversary, unlike this year

Let’s see how that pans out.

:chutten

Never Look at the Data: Why did we start getting so many pings from Korea?

Something happened on January 5, 2023. All of a sudden, we started receiving a number of pings from Firefox Desktop clients in Korea equal to two times the size of the entire Korean Firefox Desktop population.

What happened? How did we notice it? What did we do about it?

Let’s back up.

I can’t remember where I learned it, but I’d already started reciting as dogma in my first year of University: “The most important part about any feature is the ability to turn it off”. It’s served me well through my studies and my career. I’ve also found it to be especially true for data collection systems where, for whatever reason, as a user you might decide you no longer want the software you’re using to continue to send data. In some places this is even enshrined in laws where you can request the deletion of data that has already been collected.

Law or not, Mozilla has before, does now, and will always make it easy for you to decide whether to send data to Mozilla. We may not understand why you make that choice, and it definitely will make it harder for us to ensure our products meet your needs, but we’ll respect the heck out of your choice in our processes and in our products.

This is why, when Mozilla’s data collection system Glean is told the user went from allowing data upload to forbidding it, we send one final “deletion-request” ping before shutting down. The “deletion-request” ping contains all the internal identifiers we’ve used to longitudinally group data (if we receive ten crash reports it’s important to know whether it’s the same Firefox crashing ten times or if it’s ten Firefoxes crashing once), and we use those identifiers to (well) identify what data we’ve collected that we’re now going to delete.

For the purposes of this story you’ll need to know that there’s two times when Glean notices the product’s gone from “data upload: on” to “data upload: off”: while Glean is running, and during Glean startup. If Glean’s running, then we just handle things – we were told the setting changed from “data upload: on” to “data upload: off” and away we go. But Glean knows that it isn’t always listening to the data upload setting, so if it starts up with “data upload: off” and the last time it shut down we were “data upload: on” we’ll send a specific “at_init”-reason “deletion-request” ping.

We in the Data Org monitor how Glean is behaving. One thing we’ve learned about how Glean behaves is that the number of “deletion-request” pings is roughly constant over time. And the proportion of “deletion-request” pings that have the “at_init” reason should remain fairly fixed as well.

What shouldn’t happen is for Firefox Desktop-sent “at_init”-reason “deletion-request” pings to spike like this on January 5:

time-series plot of ping volumes from December 2022 until mid-January 2023 showing abnormal abrupt increases in volume starting on January 5.

What we do when we notice things like this is file a bug. As the one responsible for Glean’s integration in Firefox Desktop, and as someone with a long history of looking into anomalies, I took a look. At this initial point I was pretty sure it’d be a single actor (a single user, a single company, a single internet cafe) doing something odd… but alas, the evidence was inconclusive:

Evidence consistent with a single actor being responsible for it all:

  • All the pings were coming from the same internet provider. Korea Telecom is responsible for a bare majority of Firefox Desktop data delivery from Korea, but the spikes were entirely from that ISP.
  • The Mozilla Community in Korea could offer no explanation of any wide-spread computer or software event that matched the timeline.
  • “at_init”-reason “deletion-request” pings could be a result of automation changing the files on disk to read “data upload: off” between runs of Firefox Desktop.

Evidence inconsistent with a single actor being responsible for it all:

  • The data came from a mix of Firefox Desktop versions: versions 101.0.1, 104.0, and 108.0.2.
  • The data came from a range of different regions, more or less following the population density of Korea itself.
  • “at_init”-reason “deletion-request” pings could instead be the result of users changing the setting to “data upload: off” early enough during Firefox Desktop startup that Glean hasn’t yet been initialized.

Regardless of why it was happening, it quickly became more important that we learn what we needed to do about it. We spun up an Incident, which is how we organize ourselves when there’s something happening that requires cross-functional collaboration and isn’t getting better on its own. Once there we ascertained that we could respond very quickly and decisively and do

Nothing at all.

The volume of these pings vastly eclipsed any other “deletion-request” pings we would otherwise have received, so you’d be forgiven for thinking that it was terribly expensive to receive, store, and process them all. In reality, we batch these requests. And even before this spike, every batch of requests required editing every partition of every table. Adding another list of identifiers to delete equal in size to two times the peak Firefox Desktop population in Korea just doesn’t matter all that much.
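To see why, here’s the shape of a deletion batch as a conceptual sketch (emphatically not our actual pipeline code; the table and column names are made up). The expensive part is the scan over every partition of every table, not the length of the identifier list:

DELETE FROM `some_dataset.some_table`  -- hypothetical; one such pass per table
WHERE client_id IN (
  SELECT client_id
  FROM `some_dataset.pending_deletion_requests`  -- the batched identifiers
)

Doubling (or quintupling) the number of identifiers in that subquery barely changes what the scan costs.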

The pressure was off. Even if it got worse… which it did:

Time-series plot of "deletion-request" pings isolated to just those from Korea. Spikes begin January 25 and dwarf other reports. A plateau begins March 26 and continues to the right edge of the plot around April 10.

On March 26, when it reached and maintained a peak of five times the volume of the Firefox Desktop population in Korea, it still wasn’t harming our data platform’s ability to serve business needs or costing us all that much in operational spend. We didn’t need to invest effort into running down the source, so we didn’t.

And so I just kept an occasional eye on it until, just as suddenly but not quite as abruptly as it began, on April 12 the ping volumes began to decrease. By April 18, we were back to normal levels.

Time-series plot of "deletion-request" pings isolated to just those from Korea. Very similar to the previous plot, but continues until April 18. Spikes begin January 25 and dwarf other reports. A plateau begins March 26 and stays up there until April 12 when falls away to nothing over the course of five days or so.

We had successfully ignored it until it went away.

So what happened to Korean Firefox Desktop users from Jan 5 to April 12, 2023? We never figured it out. If you know about something happening across those dates in Korea: please get in touch. As little as it needed solving for the sake of business needs, it still needs solving for the sake of my curiosity. 

:chutten

This Week in Glean: Page Load Data, Three Ways (Or, How Expensive Are Events?)

(“This Week in Glean” is a series of blog posts that the Glean Team at Mozilla is using to try to communicate better about our work. They could be release notes, documentation, hopes, dreams, or whatever: so long as it is inspired by Glean. All “This Week in Glean” blog posts are listed in the TWiG index).

At Mozilla we make, among other things, Web Browsers which we tend to call Firefox. The central activity in a Web Browser like Firefox is loading a web page. It gets done a lot by each and every one of our users, and so you can imagine that data about pageloads is of real business interest to us.

But exactly because this is done a lot and by every one of our users, this inspires concerns of scale and cost. How much does it cost us to learn more about pageloads?[0]

As with all things in Data, the answer is the same: “Well, it depends.”

In this case it depends on how you record the data. How you record the data depends on what questions you hope to answer with it. We’re going to stick to the simplest of questions to make this (highly suspect) comparison even remotely fair.

Option 1: Just the Counts, Ma’am

I say page loads are done a lot, but how much is “a lot”? If that’s our only question, maybe the data we need is simply a count of pageloads. Glean already has a metric type for counting things, so it should be fairly quick to implement.

This should be cheap, right? Just a single number? Well, it depends.

Scale 1: Frequency

The count of pageloads is just a single number. One, maybe as many as eight, bytes to record, store, transmit, retain, and analyze. But Firefox has to report it more than once, so we need to first scale our cost of “one, maybe as many as eight, bytes” by the number of times we send this information.

When we first implemented Firefox’s pageload count in Glean, I wanted to send it on the builtin “metrics” ping which is sent once a day from anyone running Firefox that day[1]. In an effort to gain more complete and timely data, we ended up adding it to the builtin “baseline” ping which is sent (on average for Firefox Desktop) 8 or more times per day.

For our frequency scale we thus use 8/day.

Scale 2: Population

These 8 recordings per day are sent by about 200M users over a month. Days and months aren’t easy to scale between as not all users use Firefox every day, and our population gains new users and loses old users at variable rates… so I recalculated the Frequency scale to be in terms of months and found that we get 68 pings per user per month from these roughly 200M users.

So the cost is pretty easy to calculate then? Whatever the cost is of storing and transmitting 200M x 68/month x eight bytes ~= 109 GB?

Not entirely. But until and unless those other costs differ between options, we can just treat them as noise. This cost, rendered in the size of the data, of about 109 GB? It’ll do.

Option 2: What an Event

Page loads are interesting not just in how many of them there are, but also about what type of load they are and how long the load took. The order of a page load in between other events might also be of interest: did it happen before or after some network trouble? Did a bunch of pageloads happen all at once, or spread across the day? We might wish to instrument page loads as Glean events.

Events are each more expensive than a count. They carry a timestamp (eight bytes) and repeat their names each time they’re recorded (some strings, say fifteen bytes).

(We are not counting the load type or how long the load took in our calculations of the size of an individual sample as we’re still trying to compare methods of answering the same “How many page loads are there?” question.)

Scale 3: Page Loads

“Each time they’re recorded”, huh. Guess that means we get to multiply by the number of page loads. Each Firefox Desktop user, over the course of a month, loads on average 1190 pages[2]. This means instead of sending 68 numbers a month, we’re sending 1190 batches of strings a month.

So the comparable cost is whatever the cost is of storing and transmitting 200M x (eight bytes and fifteen bytes) x 1190 ~= 5.47 TB.

We’ve jumped an order of magnitude here. And we’re not done.

Option 3: Custom Pings, and Custom Pings Only

What if the context we wish to record alongside the event of a page load cannot fit inside Glean’s prudent “event” metric type limits? What if the collected pageload data would benefit from a retention limit or access control list different from other counts or events? What if you want to submit this data to be uploaded as soon as it has been recorded? In that case, we could send a pageload as a Glean custom ping.

We’ve not (yet) done this in Firefox Desktop (at least partially because it complicates ordering amongst other events: the Glean SDK expends a lot of effort to ensure the timestamps between events are reliable. Ping times are client times which are subject to the whims of the user.), so I’m going to get even hand-wavier than before as I try to determine how large each individual data sample will be.

A Glean custom ping without any metrics in it comes to around 500 bytes[3]. When our data platform ingests the ping and turns it into a row in a dataset, we add some metadata which adds another 300 bytes or so (which only affects storage inside the Data Platform and doesn’t add costs to client storage or client bandwidth).

We could go deeper and cost out the network headers, the costs of using TLS to ensure the integrity of the connection… but we’d be here all day. So I’m gonna call that 200 bytes to make it a nice round 1000 bytes per ping.

We’re sending these pings per pageload, so the cost is whatever the cost is of storing and transmitting 200M x 1190 x 1000 bytes = 238 TB.
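If you want to check the napkin math all in one place, here it is as a query (SQL, because this is a data blog; the figures are just the ones from above, in bytes per month):

SELECT
  200e6 * 68 * 8      AS option1_count_bytes,  -- ~109 GB: 68 pings/user/month, 8 bytes each
  200e6 * 1190 * 23   AS option2_event_bytes,  -- ~5.47 TB: 1190 events/user/month, 8-byte timestamp + ~15 bytes of names
  200e6 * 1190 * 1000 AS option3_ping_bytes    -- ~238 TB: 1190 pings/user/month, ~1000 bytes each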

Rule of Thumb: 50x

There you have it: for each step up the cost ladder you’re adding an extra 50x multiplier to the cost of storing and transmitting the data. The reality’s actually much worse if it’s harder to analyze and reason about the data as it gets more complex (which it in most cases is) because, as you might remember from one of my previous explorations in costing out metrics: it’s the human costs of things (like analysis) that really getcha.

But you have to balance it out. If adding more context and information ensures your analyses only have to look in one place for its data instead of trying to tie together loosely-coupled concepts from multiple locations… if using a custom ping ensures you have everything you need and don’t have to form a committee to resource an engineer to add implementation which needs to be deployed and individually validated… if you’re willing to bet 50x or 250x the cost on getting it right the first time, then that could be a good price to pay.

But is this the case for you and your data?

Well, it depends.

:chutten

[0]: Avid readers of this blog may notice that this isn’t the first time I’ve written on the costs of data. And it likely won’t be the last!

[1]: How often a “metrics” ping is sent is a little more complicated than “once a day”, but it averages out to about that much so I’m sticking with it for this napkin.

[2]: Yes there are some wild and wacky outliers included in the figure “an average of 1190 page loads” that I’m not bothering to clean up. You can Page Loads Georg to your heart’s content.

[3]: This is about how many characters the JSON-encoded ping payload comes to, uncompressed.

This Week in Glean: What If I Want To Collect All The Data?

(“This Week in Glean” is a series of blog posts that the Glean Team at Mozilla is using to try to communicate better about our work. They could be release notes, documentation, hopes, dreams, or whatever: so long as it is inspired by Glean. All “This Week in Glean” blog posts are listed in the TWiG index).

Mozilla’s approach to data is “as little as necessary to get the job done” as espoused in our Firefox Privacy Promise and put in a shape you can import into your own organization in Mozilla’s Lean Data Practices. If you didn’t already know that Glean is a Mozilla project, you’d find out very quickly by using it. All of its systems are designed with the idea that you’ve carefully considered your instrumentation ahead of time, and you’ve done some review to ensure that the collection aligns with your values.

(This happens to have some serious knock-on benefits for data democratization and tooling that allows Mozilla’s small Data Org to offer some seriously-powerful insights on a shoestring budget, which you can learn more about in a talk I gave to Ubisoft at their Data Summit in 2021.)

Less Data, as the saying goes, implies Greater Data and Greatest Data. Or in a less memetic way, Mozilla wants to collect less data… but less than what?

Less than more, certainly. But how much more? How much is too much?

How much is “all”?

Since my brain’s weird I decided to pursue this thought experiment of “What is the _most_ data you could collect from a software project being used?”.

Well, looking at Firefox, every button press and page load and scroll and click and and and… all of that matters. As does the state of Firefox when it’s being clicked and scrolled and so forth. Typing in the urlbar is different if you already have a page loaded. Opening your first tab is different from opening your nine-thousand-two-hundred-and-fiftieth.

And, underneath it all, is the code. How fast is it running? How much memory are we using? All these performance questions that Firefox Telemetry was originally built to answer. Is code on line 123 of file XYZ.cpp running? Is it running well? What do we run next?

For software this means that, to record all of the data, we’d need to know the full state of the program at every expression it runs in every line of code. At every advancement of the Program Counter, we’d need to dump the entire Stack and Heap.

Yikes! That’s gigabytes of data per clock cycle.

Well, maybe we can be cleverer than this. Another one of those projects Mozilla incubated that now has a whole community of contributors and users (like Rust) is a lightweight record-and-replay debugger called rr. The rr debugger collects traces of a running piece of software and can deterministically replay it over and over again (backwards, even!), meaning it has all the information we need in it.

So a decent size estimate for “all the data” might be the size of one of these trace recordings. They’re big, but not “full heap and stack at every program counter” big. A short test run of Firefox was about 2GB for a one minute run (albeit without any user interaction or graphics).

Could Glean collect traces like these? Or bigger ones after, say, a full day’s use? Not easily. Not without modification.

Let’s say we did those modifications. Let’s push this thought experiment further. What does that mean for analysis? Well, we’d have all these recordings we could spin up a VM to replay for us. If we want the number of open tabs, we could replay it and sample that count whenever we wanted.

This would be a seismic shift in how instrumentation interacted with analysis. We’d no longer have to ship code to instrument Firefox, we could “simply” (in quotes because using rr requires you to be a programming nerd) replay existing traces and extract the new data we needed.

It would also be absolutely horrible. We’d have to store every possible metric just in case we wanted _one_ of them. And there’s so much data in these traces that Mozilla doesn’t want to store: pictures you looked at, videos you watched, videos you uploaded… good grief. We don’t want any of that.

(( I’d like to take a second to highlight that this is a thought experiment: Mozilla doesn’t do this. We don’t have plans to do this. In fact, Mozilla’s Data Privacy Principles (specifically “Limited Data”) and Mozilla’s Manifesto (specifically Principle 4 “Individuals’ security and privacy on the internet are fundamental and must not be treated as optional.”) pretty clearly state how we think about data like this. ))

And processing these traces into a useful form for analysis to be performed would take the CPU processing power of a small country, over and over again.

(( And rr introduces a 20% performance penalty which really wouldn’t ingratiate us with our users. And it only works on Linux, meaning the data we’d have access to wouldn’t be representative of our user base anyway. ))

And what was the point of this again? Right. We’re here to quantify what “less data” means. But how can we do that, even knowing as we do now what the size of “all data” is? Can we compare the string value of the profile directory’s random portion to the url the user visits the most? Are those both 1 piece of data that we can compare to the N pieces of data we get in a full rr trace? Mozilla doesn’t think they’re the same, since we categorize (and thus treat) these collections differently.

All in all maybe figuring out the maximum amount of data you could collect in order to contextualize how much less of it you are collecting might not be meaningful.

Oh well.

I guess this means that the only way Mozilla (and you!) can continue to quantify “less data” is by comparing it to “no data” – the least possible amount of data.

:chutten

This Week in Glean: How Long Must I Wait Before I Can See My Data?

(“This Week in Glean” is a series of blog posts that the Glean Team at Mozilla is using to try to communicate better about our work. They could be release notes, documentation, hopes, dreams, or whatever: so long as it is inspired by Glean. All “This Week in Glean” blog posts are listed in the TWiG index).

You’ve heard about this cool Firefox on Glean thing and wish to instrument a part of Firefox Desktop. So you add a metrics.yaml definition for a new Glean metric, commit a piece of instrumentation to mozilla-central, and then Lando lands it for you.

When can you expect the data collected when users’ Firefoxes trigger that instrumentation to show up in a queryable place like BigQuery?

The answer is one of the more common phrases I say when it comes to data: Well, it depends.

In the broadest sense, we’re looking at two days:

1) A new Nightly build will be generated within 12h of your commit.

2) Users will pick up the new Nightly build fairly quickly after that, and start triggering the instrumentation.

3) The following 4am a “metrics” ping containing data from your instrumentation will be submitted (or some time later if Firefox isn’t running at 4am)

4) A new schema generated to include your new metric definition will have been deployed overnight

5) The following 12am UTC a new partition of our user-facing stable views will have the previous day’s submissions available.

And then you commence querying! Easy as that.

Any questions?

The Questions:

What if I added a new metrics.yaml file?

That file needs to land in gecko-dev (the github mirror of mozilla-central) first. Only then can we (and by “we” I here mean the Data Team, by means of a bug you file) update the data pipeline. Then you get to wait until the next weekday’s midnight UTC (ish) for a schema deploy as per Step 4.

Generally this doesn’t add too much delay, but if landing the file happens after the pipeline folks have gone home, we get to wait until the next weekday’s midnight UTC.

The Nightly population is small and weird. How long until we get data from release?

Uptake of code to release takes a while longer. Code lands in mozilla-central, and gets into the next Nightly within 12h. Getting to Beta from Nightly means waiting until the next merge day (these days that’s on the first Monday of the month, or thereabouts). Getting to Release from Beta means waiting until the merge day after that.

If you’re unlucky, you’ll be waiting over two months for your instrumentation to be in a Release build that users can pull down.

And then you get to wait for enough Release users to update that you’re getting a representative sample. (This could take a week or so.)

So… nine weeks?

That sounds really bad! Is there anything we can do?

Why yes.

The first thing we can do is adjust our expectations. There’s a four-week sway from the worst-case to best-case on this slow path. It isn’t likely that you’ll always be landing instrumentation immediately after a merge day and get to wait the whole month until Nightly merges to Beta.

Your average wait for that is only two weeks. And the best case is a matter of a day or two.

So cross your fingers, and hope your luck is with you.

Secondly, instrumentation is (by itself) very low-risk, so you can “uplift” the instrumentation change directly to Beta without waiting for merge day.

This can cut your route to release down to _two weeks_, by (e.g.) landing in Nightly on Monday Nov 22, verifying that it works on Tuesday, requesting uplift on Wednesday, getting uplifted in the last Beta on Thursday Nov 25, then making the merge from Beta to Release on Dec 6.

(You do still get to wait a third week for the release population to update to the latest version.)

Thirdly, what are the chances that your instrumentation is measuring a feature you just built or just turned on? You want that feature to benefit from the slow-roll exposure to the more tolerant audiences of Nightly and Beta before it reaches Release, right? Automated testing is great, but nothing can simulate the wild variety of use cases and hardware combinations your feature will experience in the Real World.

So what point is there getting your instrumentation into Release before the feature under instrumentation reaches it? Instead of measuring the interval between landing instrumentation and beginning analysis, perhaps measure the interval between the release of the feature you wish to instrument and beginning analysis?

That interval is only a day: gotta wait for that partition in the stable view. Sounds much better, doesn’t it?

Still, can I get data any faster?

The fastest time from Point A) Landing a metric, to Point B) Performing preliminary analysis on a metric, is about 12h:

1) Land your code just before a new Nightly is cut.

2) Hope that the number of Nightly users that update to the latest build over the next twelve hours is enough for your purposes.

If you didn’t luck out and have a schema deploy, you’ll need to dig your data out of the additional_properties JSON column. If you are lucky, you can use the friendly columns instead.

To get to the data before the nightly copy-deduplicate to stable views, you’ll be querying the live tables instead. You need to fully qualify that table name. You need to realize that we haven’t deduped anything in here. And you need to take narrow slices, because we can’t cluster the data effectively here, so querying can get expensive, fast.
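Put together, such a query might look like this sketch (the project, dataset, and table names follow the usual layout of our live tables, but treat them, and my_new_metric, as assumptions to double-check):

SELECT
  COUNT(*) AS pings_with_my_metric
FROM `moz-fx-data-shared-prod.firefox_desktop_live.metrics_v1`  -- fully-qualified live table (assumed name)
WHERE DATE(submission_timestamp) = CURRENT_DATE()  -- narrow slice: nothing here is clustered (or deduped)
  AND JSON_EXTRACT_SCALAR(additional_properties,
        '$.metrics.counter.my_new_metric') IS NOT NULL  -- hypothetical metric, pre-schema-deploy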

Can I get data that quickly from release?

Not yet.

I’ve seen a proposal internally for dynamically-defined metrics which get pushed to running Firefox instances (talk to :esmyth if you’re interested). Though its present form is proposing the process and possibility, not the technology, there’s a version of this I can see that would (for a subset of data collection) take the time from “I wish to instrument this” to “I am performing analysis on data received from (a subset of) the release Firefox population” down to within a business day.

Which is neat! But that speed brings risk, so it’ll take a while to design a system that doesn’t expose our users to that risk.

Don’t expect this for Christmas, is I guess what I mean : )

:chutten

This Week in Glean: Data Reviews are Important, Glean Parser makes them Easy

(“This Week in Glean” is a series of blog posts that the Glean Team at Mozilla is using to try to communicate better about our work. They could be release notes, documentation, hopes, dreams, or whatever: so long as it is inspired by Glean. All “This Week in Glean” blog posts are listed in the TWiG index).

At Mozilla we put a lot of stock in Openness. Source? Open. Bug tracker? Open. Discussion Forums (Fora?)? Open (synchronous and asynchronous).

We also have an open process for determining if a new or expanded data collection in a Mozilla project is in line with our Privacy Principles and Policies: Data Review.

Basically, when a new piece of instrumentation is put up for code review (or before, or after), the instrumentor fills out a form and asks a volunteer Data Steward to review it. If the instrumentation (as explained in the filled-in form) is obviously in line with our privacy commitments to our users, the Data Steward gives it the go-ahead to ship.

(If it isn’t _obviously_ okay then we kick it up to our Trust Team to make the decision. They sit next to Legal, in case you need to find them.)

The Data Review Process and its forms are very generic. They’re designed to work for any instrumentation (tab count, bytes transferred, theme colour) being added to any project (Firefox Desktop, mozilla.org, Focus) and being collected by any data collection system (Firefox Telemetry, Crash Reporter, Glean). This is great for the process as it means we can use it and rely on it anywhere.

It isn’t so great for users _of_ the process. If you only ever write Data Reviews for one system, you’ll find yourself answering the same questions with the same answers every time.

And Glean makes this worse (better?) by including in its metrics definitions almost every piece of information you need in order to answer the review. So now you get to write the answers first in YAML and then in English during Data Review.

But no more! Introducing `glean_parser data-review` and `mach data-review`: command-line tools that will generate for you a Data Review Request skeleton with all the easy parts filled in. It works like this:

  1. Write your instrumentation, providing full information in the metrics definition.
  2. Call `python -m glean_parser data-review <bug_number> <list of metrics.yaml files>` (or `mach data-review <bug_number>` if you’re adding the instrumentation to Firefox Desktop).
  3. glean_parser will parse the metrics definitions files, pull out only the definitions that were added or changed in <bug_number>, and then output a partially-filled-out form for you.

Here’s an example. Say I’m working on bug 1664461 and add a new piece of instrumentation to Firefox Desktop:

fog.ipc:
  replay_failures:
    type: counter
    description: |
      The number of times the ipc buffer failed to be replayed in the
      parent process.
    bugs:
      - https://bugzilla.mozilla.org/show_bug.cgi?id=1664461
    data_reviews:
      - https://bugzilla.mozilla.org/show_bug.cgi?id=1664461
    data_sensitivity:
      - technical
    notification_emails:
      - chutten@mozilla.com
      - glean-team@mozilla.com
    expires: never

I’m sure to fill in the `bugs` field correctly (because that’s important on its own _and_ it’s what glean_parser data-review uses to find which data I added), and have categorized the data_sensitivity. I also included a helpful description. (The data_reviews field currently points at the bug I’ll attach the Data Review Request for. I’d better remember to come back before I land this code and update it to point at the specific comment…)

Then I can simply use `mach data-review 1664461` and it spits out:

!! Reminder: it is your responsibility to complete and check the correctness of
!! this automatically-generated request skeleton before requesting Data
!! Collection Review. See https://wiki.mozilla.org/Data_Collection for details.

DATA REVIEW REQUEST
1. What questions will you answer with this data?

TODO: Fill this in.

2. Why does Mozilla need to answer these questions? Are there benefits for users?
   Do we need this information to address product or business requirements?

TODO: Fill this in.

3. What alternative methods did you consider to answer these questions?
   Why were they not sufficient?

TODO: Fill this in.

4. Can current instrumentation answer these questions?

TODO: Fill this in.

5. List all proposed measurements and indicate the category of data collection for each
   measurement, using the Firefox data collection categories found on the Mozilla wiki.

Measurement Name | Measurement Description | Data Collection Category | Tracking Bug
---------------- | ----------------------- | ------------------------ | ------------
fog_ipc.replay_failures | The number of times the ipc buffer failed to be replayed in the parent process.  | technical | https://bugzilla.mozilla.org/show_bug.cgi?id=1664461


6. Please provide a link to the documentation for this data collection which
   describes the ultimate data set in a public, complete, and accurate way.

This collection is Glean so is documented
[in the Glean Dictionary](https://dictionary.telemetry.mozilla.org).

7. How long will this data be collected?

This collection will be collected permanently.
**TODO: identify at least one individual here** will be responsible for the permanent collections.

8. What populations will you measure?

All channels, countries, and locales. No filters.

9. If this data collection is default on, what is the opt-out mechanism for users?

These collections are Glean. The opt-out can be found in the product's preferences.

10. Please provide a general description of how you will analyze this data.

TODO: Fill this in.

11. Where do you intend to share the results of your analysis?

TODO: Fill this in.

12. Is there a third-party tool (i.e. not Telemetry) that you
    are proposing to use for this data collection?

No.

As you can see, this Data Review Request skeleton comes partially filled out. Everything you previously had to mechanically fill out has been done for you, leaving you more time to focus on the interesting questions, like “Why do we need this?” and “How are you going to use it?”.

Also, this saves you from having to remember the URL to the Data Review Request Form Template each time you need it. We’ve got you covered.

And since this is part of Glean, it’s already available to every project you can see here. This isn’t just a Firefox Desktop thing.

Hope this saves you some time! If you can think of other time-saving improvements we could add to Glean once so that every Mozilla project can take advantage of them, please tell us on Matrix.

If you’re interested in how this is implemented, glean_parser’s part of this is over here, while the mach command part is here.

:chutten

This Week in Glean: Firefox Telemetry is to Glean as C++ is to Rust

(“This Week in Glean” is a series of blog posts that the Glean Team at Mozilla is using to try to communicate better about our work. They could be release notes, documentation, hopes, dreams, or whatever: so long as it is inspired by Glean. You can find an index of all TWiG posts online.)

I had this goofy idea that, like Rust, the Glean SDKs (and Ecosystem) aim to bring safety and higher-level thought to their domain. This is in comparison to how, like C++, Firefox Telemetry is built out of flexible primitives, assumes you very much know what you’re doing, and cannot (will not?) provide any clues in its design as to how to do things properly.

I have these goofy thoughts a lot. I’m a goofy guy. But the more I thought about it, the more the comparison seemed apt.

In Glean, wherever we can, we intentionally forbid behaviour we cannot guarantee is safe (e.g. we forbid non-commutative operations in FOG IPC, we forbid decrementing counters). And in situations where we need to permit perhaps-unsafe data practices, we do it in tightly-scoped areas that are identified as unsafe (e.g. if a timing_distribution uses accumulate_raw_samples_nanos you know to look at its data with more skepticism).
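To make that concrete, here’s a sketch with the Glean Python SDK, assuming a hypothetical counter named playground.clicks defined in your own metrics.yaml:

from glean import load_metrics

# Load your own metric definitions. The "playground.clicks" counter is a
# hypothetical stand-in for whatever you have defined.
metrics = load_metrics("metrics.yaml")

# Counters only go up. There is no subtract() or decrement operation to
# misuse, so the unsafe behaviour isn't even expressible.
metrics.playground.clicks.add(1)
metrics.playground.clicks.add(5)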

In Glean we encourage instrumentors to think at a higher level (e.g. a memory_distribution instead of a Histogram of unknown buckets and samples), thereby permitting Glean to identify errors early (e.g. you can’t start a timespan twice) and to do clever things with that knowledge (e.g. in our tooling we know counter metrics are interesting when summed, but quantity metrics are not). Speaking of those errors, we are able to forbid error-prone behaviour through design and use of language features (e.g. in languages with type systems we can prevent you from collecting the wrong type of data), and when an error is only detectable at runtime we can report it with a high degree of specificity to make it easier to diagnose.
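For example (another Python SDK sketch, with a hypothetical timespan metric this time), starting a timespan that’s already running doesn’t silently clobber the first timer; Glean records an invalid_state error against the metric instead:

from glean import load_metrics

metrics = load_metrics("metrics.yaml")

# "playground.load_time" is a hypothetical timespan metric.
metrics.playground.load_time.start()

# A second start() while the timespan is already running is caught: Glean
# records an invalid_state error for this metric instead of restarting it.
metrics.playground.load_time.start()

metrics.playground.load_time.stop()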

There are more analogues, but the metaphor gets strained. (( I mean, I guess a timing_distribution’s `TimerId` is kinda the closest thing to a borrow checker we have? Maybe? )) So I should probably stop here.

Now, those of you paying attention might have already seen this relationship. After all, as we all know, glean-core (which underpins most of the Glean SDKs regardless of language) is actually written in Rust whereas Firefox Telemetry’s core of Histograms, Scalars, and Events is written in C++. Maybe we shouldn’t be too surprised when the language the system is written in happens to be reflected in the top-level design.

But! glean-core was (for a long time) written in Kotlin from stem to stern. So maybe it’s not due to language determinism and is more to do with thoughtful design, careful change processes, and a list of principles we hold to firmly as the number of supported languages and metric types continues to grow.

I certainly don’t know. I’m just goofing around.

:chutten