Planet Igalia

January 14, 2026

Manuel Rego

Servo 2025 Stats

This is a brief blog post to highlight the growth of the Servo community in recent years, particularly since Igalia took over the project maintenance in 2023.

Note that this doesn't talk about the technical achievements, though there have been tons of them in recent years. A picture is worth a thousand words, so just take a look at this slide from my latest Servo talk, which shows how google.com was rendered with Servo at the beginning of 2023 vs September 2025.

Slide showing screenshots of Servo rendering google.com in January 2023 vs September 2025

PR numbers #

Like we did last year, let's take a look at the PRs merged in the main Servo repository on GitHub since 2018.

                    2018    2019    2020    2021    2022    2023    2024    2025
PRs                1,188     986     669     118      65     776   1,771   3,183
Contributors       27.33   27.17   14.75    4.92    2.83   11.33   26.33   42.42
Contributors ≥ 10   2.58    1.67    1.17    0.08    0.00    1.58    4.67    8.50
  • PRs: total number of PRs merged.
  • Contributors: average number of contributors per month.
  • Contributors ≥ 10: average number of contributors that have merged at least 10 PRs per month.

As a clarification, these numbers don’t include PRs from bots (dependabot and Servo WPT Sync).
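As a rough illustration only (this is not necessarily how the numbers above were gathered, and the bot account names in the query are assumptions), counts like these can be approximated with the GitHub search API:

# Sketch of counting merged PRs per year in servo/servo via the GitHub search
# API, excluding bots. The bot logins here are assumptions, not the exact
# accounts used when preparing the table above.
import requests

def merged_pr_count(year, token=None):
    query = (f"repo:servo/servo is:pr is:merged "
             f"merged:{year}-01-01..{year}-12-31 "
             f"-author:app/dependabot -author:servo-wpt-sync")
    headers = {"Authorization": f"token {token}"} if token else {}
    resp = requests.get("https://api.github.com/search/issues",
                        params={"q": query, "per_page": 1}, headers=headers)
    resp.raise_for_status()
    return resp.json()["total_count"]

print(merged_pr_count(2025))  # should land in the same ballpark as the table above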

Looking at this, we can see that 2025 came close to doubling last year's numbers! The 2025 figures are way bigger than in any previous year (even compared to 2018-2019), showing a healthy community working on Servo.

The next chart is a different view of the same data but split per month, with the number of PRs landed every month, the number of contributors and the number of contributors with more than 10 patches. It shows the evolution over the years and the high activity last year.

Number of contributors #

Now let's focus on the last 3 years, since the project reactivation, and the number of contributors to the Servo project.

               2023       2024       2025
Contributors   54         129        146
≥ 100 PRs      1 (2%)     3 (2%)     8 (5%)
≥ 10 PRs       8 (15%)    29 (22%)   43 (29%)
Only 1 PR      31 (57%)   53 (41%)   55 (38%)

The number of contributors to Servo has almost tripled since 2023, reaching 146 different contributors in 2025.

If we analyze the rest of the data in this table, we can see that the percentage of contributors who make only a single PR to Servo in a year has been shrinking, meaning that Servo contributors now usually make more than one PR to the project.

If we check the number of contributors that have done at least 10 PRs in a year, we see the percentage almost doubling from 15% to 29% over the last 3 years.

And for the top contributors doing at least 100 PRs in a year, we have gone from 1 in 2023 and 3 in 2024 to 8 last year, which represents 5% of the Servo contributors, showing a solid group of very active contributors to the project.

WPT pass-rate #

Let’s take a look at WPT evolution in 2025.

2025                       January 1st            December 31st          Diff
Score %                    48.2%                  61.6%                  +13.4%
Subtests (passed/total)    1,396,647/1,998,146    1,866,247/1,998,146    +469,600
Subtests %                 69.9%                  93.4%                  +23.5%

Evolution of WPT pass rates for Servo in 2025

You can check more information about WPT pass-rates at Servo’s website (where you can also find an explanation of the Score number).

Note that these numbers differ from wpt.fyi because we’re still not running all the WPT tests in Servo, so the total numbers here are smaller.

It’s not easy to extract conclusions from this data, but it shows the Servo project keeps progressing and supporting more web platform features as time passes.

Sometimes these numbers grow artificially as new tests are added to WPT for features that Servo already supports. For example, the biggest jump last year was in October, when 188,281 new subtests started passing without any change in Servo, simply because new tests were added to WPT.

GitHub stars #

Evolution of GitHub stars for Servo from star-history.com

We are about to reach 35,000 stars on GitHub. It’s good to see the project has not stopped growing since the beginning, and the curve has become steeper in recent years.

Other #

If we look at the official project roles, we currently have:

  • 5 administrators
  • 17 TSC members
  • 25 maintainers
  • 18 contributors

We have also started doing Servo releases; we have done 3 so far.

The TSC has also set up sponsorship tiers for donations. We got 4 bronze sponsors in 2025, and we hope to increase the number of sponsorships in 2026.

Regarding donations, we have defined a funding process to request usage of that money. We are currently using it to sponsor Josh Matthews' contributions and to pay for self-hosted runners to speed up CI times.

Servo was present at several events last year; we ended up giving 10 talks around the globe.

Wrap-up #

The idea here was to do a quick recap of the Servo stats in 2025. Taking a look at these numbers every now and then is useful and gives a different perspective on the status of the project, one that is easy to lose sight of during day-to-day tasks.

In general, things have grown a lot in 2025. Who knows what will happen in 2026, but we hope we can at least keep similar numbers, or maybe even keep growing them further. That would be really great news for the Servo project.

Igalia is really proud of what the whole Servo community has achieved together in recent years, and we hope for a bright future for the project going forward.

As a side note, at the end of the month I'll be at FOSDEM talking about Servo; other Servo folks like Delan Azabani and Martin Robinson will also be there. If you are around, don't hesitate to say hi and ask anything about the project.

January 14, 2026 12:00 AM

January 13, 2026

Miyoung Shin

Our Journey to Support Extensions for Embedders

A History of Extensions for Embedders — and Where We’re Heading

Chromium’s Extensions platform has long been a foundational part of the desktop browsing experience. Major Chromium-based browsers—such as Chrome and Microsoft Edge—ship with full support for the Chrome Extensions ecosystem, and user expectations around extension availability and compatibility continue to grow.

In contrast, some Chromium embedders (for instance, products built directly on the //content API without the full //chrome stack) do not naturally have access to Extensions. Similarly, the traditional Chrome for Android app does not support Extensions. While some embedders have attempted to enable limited Extensions functionality by pulling in selected pieces of the //chrome layer, this approach is heavyweight, difficult to maintain, and fundamentally incapable of delivering full feature parity.

At Igalia we have been eager to help with the long-term goal of making Extensions usable on lightweight, //content-based products, without requiring embedders to depend on //chrome. This post outlines the background of that effort, the phases of work so far, the architectural challenges involved, and where the project is headed.

Note: ChromeOS supporting extensions (ChromeOS has announced plans to incorporate more of the Android build stack) is not the same thing as the Chrome Android app supporting extensions. The two codepaths and platform constraints differ significantly. While the traditional Chrome app on Android phones and tablets still does not officially support extensions, recent beta builds of desktop-class Chrome on Android have begun to close this gap by enabling native extension installation and execution.

Tracking bug: https://issues.chromium.org/issues/356905053

Extensions Architecture — Layered View

The following diagram illustrates the architectural evolution of Extensions support for Chromium embedders.

Traditional Chromium Browser Stack

At the top of the stack, Chromium-based browsers such as Chrome and Edge rely on the full //chrome layer. Historically, the Extensions platform has lived deeply inside this layer, tightly coupled with Chrome-specific concepts such as Profile, browser windows, UI surfaces, and Chrome services.

+-----------------------+
|      //chrome         |
|  (UI, Browser, etc.)  |
+-----------------------+
|     //extensions      |
+-----------------------+
|      //content        |
+-----------------------+

This architecture works well for full browsers, but it is problematic for embedders. Products built directly on //content cannot reuse Extensions without pulling in a large portion of //chrome, leading to high integration and maintenance costs.


Phase 1 — Extensions on Android (Downstream Work)

In 2023, a downstream project at Igalia required extension support on a Chromium-based Android application. The scope was limited—we only needed to support a small number of specific extensions—so we implemented:

  • basic installation logic,
  • manifest handling,
  • extension launch/execution flows, and
  • a minimal subset of Extensions APIs that those extensions depended on.

This work demonstrated that Extensions can function in an Android environment. However, it also highlighted a major problem: modifying the Android //chrome codepath is expensive. Rebasing costs are high, upstream alignment is difficult, and the resulting solution is tightly coupled to Chrome-specific abstractions. The approach was viable only because the downstream requirements were narrow and controlled.

I shared this experience in a BlinkOn lightning talk: "Extensions on Android".


Phase 2 — Extensions for Embedders
(//content + //extensions + //components/extensions)

Following Phase 1, we began asking a broader question:

Can we provide a reusable, upstream-friendly Extensions implementation that works for embedders without pulling in the //chrome layer?

Motivation

Many embedders aim to remain as lightweight as possible. Requiring //chrome introduces unnecessary complexity, long build times, and ongoing maintenance costs. Our hypothesis was that large portions of the Extensions stack could be decoupled from Chrome and reused directly by content-based products.

One early idea was to componentize the Extensions code by migrating substantial parts of //chrome/*/extensions into //components/extensions.

+-------------------------+
| //components/extensions |
+-------------------------+
|      //extensions       |
+-------------------------+
|       //content         |
+-------------------------+

Proof of concept: Wolvic

We tested this idea through Wolvic, a VR browser used in several commercial solutions. Wolvic has two implementations:

  • a Gecko-based version, and
  • a Chromium-based version built directly on the //content API.


Originally, Extensions were already supported in Wolvic-Gecko, but not in Wolvic-Chromium. To close that gap, we migrated core pieces of the Extensions machinery into //components/extensions and enabled extension loading and execution in a content-only environment.

By early 2025, this work successfully demonstrated that Extensions could run without the //chrome layer.

Demo video:
https://youtube.com/shorts/JmQnpC-lxR8?si=Xf0uB6q__j4pmlSj

Design document:
https://docs.google.com/document/d/1I5p4B0XpypR7inPqq1ZnGMP4k-IGeOpKGvCFS0EDWHk/edit?usp=sharing

However, this work lived entirely in the Wolvic repository, which is a fork of Chromium. While open source, this meant that other embedders could not easily benefit without additional rebasing and integration work.

This raised an important question:

Why not do this work directly in the Chromium upstream so that all embedders can benefit?


Phase 3 — Extensions for Embedders
(//content + //extensions)

Following discussions with the Extensions owner (rdevlin.cronin@chromium.org), we refined the approach further.

Rather than migrating functionality into //components, the preferred long-term direction is to move Extensions logic directly into the //extensions layer wherever possible.

+-----------------------+
|      Embedder UI      | (minimal interfaces)
+-----------------------+
|      //extensions     |
+-----------------------+
|       //content       |
+-----------------------+

This approach offers several advantages:

  • clearer layering and ownership,
  • fewer architectural violations,
  • reduced duplication between Chrome and embedders,
  • a cleaner API surface for integration.

We aligned on this direction and began upstream work accordingly.

Tracking bug: 🔗 https://issues.chromium.org/issues/358567092

Our goals for Content Shell + //extensions are:

  1. Embedders should only implement a small set of interfaces, primarily for UI surfaces (install prompts, permission dialogs) and optional behaviors.
  2. Full WebExtensions API support
    W3C standard: https://w3c.github.io/webextensions/specification/
  3. Chrome Web Store compatibility
    Embedders should be able to install and run extensions directly from the Chrome Web Store.

Short-term Goal: Installation Support

Our immediate milestone is to make installation work entirely using //content + //extensions.

Current progress:

  • ✅ .zip installation support already lives in //extensions
  • 🚧 Migrating unpacked directory installation from //chrome to //extensions
    (including replacing Profile with BrowserContext abstractions)
  • 🔜 Moving .crx installation code from //chrome to //extensions

As part of this effort, we are introducing clean, well-defined interfaces for install prompts and permission confirmations:

  • Chrome will continue to provide its full-featured UI
  • Embedders can implement minimal, custom UI as needed

What Comes Next:

Once installation is fully supported, we will move on to:

  • Chrome Web Store integration flows
  • Core WebExtensions APIs required by commonly used extensions

Main Engineering Challenge — Detaching from the Chrome Layer

The hardest part of this migration is not moving files—it is breaking long-standing dependencies on the //chrome layer.

The Extensions codebase is large and historically coupled to Chrome-only concepts such as:

  • Profile
  • Browser
  • Chrome-specific WebContents delegates
  • Chrome UI surfaces
  • Chrome services (sync, signin, prefs)

Each migration requires careful refactoring, layering reviews, and close collaboration with component owners. While the process is slow, it has already resulted in meaningful architectural improvements.


What’s Next?

In the next post, we'll demonstrate:

A functioning version of Extensions running on top of //content + //extensions only, capable of installing and running extensions.

On the Igalia side, we continue working on ways to make it easier to integrate Chromium on other platforms. This will mark the first end-to-end, //chrome-free execution path for extensions in content-based browsers.

Stay tuned!

by mshin at January 13, 2026 02:00 AM

January 07, 2026

Alex Bradbury

Per-query energy consumption of LLMs

How much energy is consumed when querying an LLM? We're largely in the dark when it comes to proprietary models, but for open weight models that anyone can host on readily available, albeit eye-wateringly expensive, hardware this is something that can be measured and reported, right? In fact, given other people are doing the hard work of setting up and running benchmarks across all kinds of different hardware and software configurations for common open weight models, can we just re-use that to get a reasonable figure in terms of Watt-hours (Wh) per query?

For the kind of model you can run locally on a consumer GPU then of course there's some value in seeing how low the per-query energy usage might be on a large scale commercial setup. But my main interest is in larger and more capable models, the kind that you wouldn't realistically run locally and end up using in a pay-per-token manner either directly with your host of choice or through an intermediary like OpenRouter. In these cases where models are efficiently served with a minimum of 4-8 GPUs or even multi-node clusters it's not easy to get a feel for the resources you're using. I'm pretty happy that simple back of the envelope maths shows that whether providers are properly amortising the cost of their GPUs or not, it's implausible that they're selling per-token API access for open models at below the cost of electricity. That gives a kind of upper bound on energy usage, and looking at the pennies I spend on such services it's clearly a drop in the ocean compared to my overall energy footprint. But it's not a very tight bound, which means it's hard to assess the impact of increasing my usage.

We can look at things like Google's published figures on energy usage for Gemini, but this doesn't help much. They don't disclose the length of the median prompt and its response, or details of the model used to serve that median query, meaning it's not helpful for estimating either how it might apply to other models or how it might apply to your own usage (which may be far away from this mysterious median query). Mistral released data on the per-query environmental impact (assuming a 400 token query), but the size of the Mistral Large 2 model is not disclosed and they don't calculate a Wh per query figure. CO2 and water per query are very helpful to evaluate a particular deployment, but the actual energy used is a better starting point that can be applied to other providers assuming different levels of carbon intensity. If one of the API providers were to share statistics based on a real world deployment of one of the open models with a much higher degree of transparency (i.e. sharing stats on the number of queries served during the period, statistics on their length, and measured system power draw) that would be a useful source of data. But today we're looking at what we can conclude from the InferenceMAX benchmark suite's published results.

I'd started looking at options for getting good figures, thinking I might have to invest in the hassle and expense of renting a multi-GPU cloud instance to run my own benchmarks, then felt InferenceMAX might make that unnecessary. After writing this up along with all my provisos, I'm perhaps tempted again to try to generate figures myself. Anyway, read on for a more detailed look at that benchmark suite. You can scroll past all the provisos and jump ahead to the tables giving the Wh/query figures implied by the benchmark results across different GPUs, different average input/output sequence lengths, and for gpt-oss 120B and DeepSeek-R1-0528. But I hope you'll feel a bit guilty about it.

If you see any errors, please let me know.

High-level notes on InferenceMAX

The InferenceMAX benchmark suite has the stated goal to "provide benchmarks that both emulate real world applications as much as possible and reflect the continuous pace of software innovation." They differentiate themselves from other benchmarking efforts, noting "Existing performance benchmarks quickly become obsolete because they are static, and participants often game the benchmarks with unrealistic, highly specific configurations."

The question I'm trying to answer is "what is the most 'useful AI' I can expect for a modern GPU cluster in a realistic deployment and how much energy does it consume". Any benchmark is going to show peak throughput higher than you'd expect to achieve on a real workload, and there's naturally a desire to keep it pinned on a specific model for as long as it isn't totally irrelevant, so that hardware and software evolution can be compared against a common point of reference. But although I might make slightly different choices about what gets benchmarked and how, the InferenceMAX setup at first look seems broadly aligned with what I want to achieve.

They benchmark DeepSeek-R1-0528 (both at the native fp8 quantisation and at fp4), a 671B parameter model with 37B active weights released ~7 months ago, which seems fairly representative of a large MoE open weight model. gpt-oss-120b is also benchmarked, providing a point of comparison for a much smaller and more efficient to run model. Different input and output sequence lengths (ISL and OSL - the number of input and output tokens) are tested: 1k/1k, 1k/8k, and 8k/1k, which provides coverage of different query types. There are also tests against a wide range of GPUs (including the 72-GPU GB200 NVL72 cluster), sweeping different settings.

At the time of writing, what you might reasonably consider to be 'InferenceMAX' is split into roughly three pieces:

GitHub Actions is used to orchestrate the runs, ultimately producing a zip file containing JSON with the statistics of each configuration (e.g. here). The benchmark_serving.py script is invoked via the run_benchmark_serving wrapper in benchmark_lib.sh, which hardcodes some options and passes through some others from the workflow YAML. The results logged by benchmark_serving.py are processed in InferenceMAX's process_result.py helper, which will produce JSON in the desired output format. Together, these scripts provide statistics like throughput (input and output tokens), end to end latency, interactivity (output tokens per second), etc.

Further studying the benchmark setup

So, let's look at the benchmarking logic in more detail to look for any surprises or things that might affect the accuracy of the Wh-per-query figure I want to generate. I'll note that InferenceMAX is an ongoing project that is actively being developed. These observations are based on a recent repo checkout, but of course things may have changed since then if you're reading this post some time after it was first published.

Looking through I made the following observations. Some represent potential issues (see the next subheading for a list of the upstream issues I filed), while others are just notes based on aspects of the benchmark I wanted to better understand.

  • One of the required arguments to the benchmark serving script is --random-range-ratio. This is set by default to 0.8 in benchmark-tmpl.yml and in benchmark-multinode-tmpl.yml and is not overridden elsewhere.
    • This argument is ultimately used in sample_random_requests. It uses np.random.randint to sample input/output lengths between the range_ratio * {input,output}_len and {input,output}_len.
    • Taken together, this logic means that for a workload advertised as having 8k input or output tokens (8192), the benchmark will actually run with an average of ~7373 tokens (0.9*num_tokens, since the length is a random number between 0.8*num_tokens and num_tokens).
    • Because the throughput figures are calculated using the actual input and output token lengths, the figure does represent what was observed; it's just that the workload doesn't quite match the description. The reported end-to-end latency, for instance, will be misleadingly lower than what you would get for a workload that actually had the expected input/output sequence lengths.
  • The various request functions in backend_request_func.py will set output.success = False if they don't get a HTTP 200 status code back for a request. There is no logic to retry a refused request, and metrics will be calculated skipping any failed requests. This means an overloaded server will perform better on this benchmark for metrics like E2E latency and TTFT if it refuses requests rather than accepting them and serving them slowly. As the number of failed requests isn't included in the results JSON, it's not easy to tell if this is a factor for any benchmarks.
  • Many of the various scripts in the benchmarks/ subdirectory set a max-model-len parameter or the similar --max_seq_len parameter for trt-llm (e.g. the b200 config, which if I'm not mistaken will ultimately be set from the max_model_len defined in generate_sweep_configs.py). This parameter is documented in vllm and in TensorRT-LLM and controls the maximum supported length of a request, including both the prompt and any generated output. Setting it 20 or 200 tokens above the sum of the benchmarked ISL+OSL to minimise memory use does not seem like a realistic real-world deployment, which seems the wrong choice given the InferenceMAX complaint that in other suites "participants often game the benchmarks with unrealistic, highly specific configurations". Benchmarks naturally show a 'best case', but if you're generating figures like $ per M tokens it's a figure that makes little sense if it reflects a configuration you wouldn't feasibly use/sell.
  • Throughput is calculated in benchmark_serving.py based on the total number of tokens divided by the duration of the benchmark. This is then normalised on a per-GPU basis in process_result.py. No problems here, I just wanted to clarify the source of the figure.
  • In terms of the source of the input tokens themselves, we can see that --dataset-name random is always passed to benchmark_serving.py. This leads to sample_random_requests being called, which will pick random token ids and create a list of tokens of the desired length (mapping these randomly picked IDs to tokens).
    • The --ignore-eos flag is passed to the benchmark_serving.py script which will in turn set this option in the JSON when making the LLM request. backend_request_func.py sets this and also sets max_tokens to the desired output_len which should ensure that the response has that exact desired number of output tokens. ignore_eos means that the LLM server will keep generating tokens even after seeing the end of sequence token.
    • It's interesting that some of the benchmark configurations enable multi-token prediction, and presumably find it beneficial even given the totally random token inputs. Is it possible that such configurations benefit from undesirable looped outputs (due to a combination of random inputs and continuing to sample tokens past the EOS marker) that potentially are very predictable and give an extra boost?
  • The --num-prompts parameter controls the total number of requests that are issued. The benchmark script is written so it will wait for all of these to complete (either successfully or unsuccessfully). This is typically set to the concurrency times 10, but some benchmark setups set it higher (presumably as the default figure finishes too quickly for good results).
  • In terms of how requests are submitted with a certain level of concurrency:
  • There are no tests that the configuration is serving the model with the expected quality currently, but there's an issue tracking at least adding a simple quality benchmark. Although none of the explored settings should impact the quality of output, it's always possible they trigger a bug and in this case it's not interesting to benchmark.
  • It would be helpful for reproducibility if more complete system information for the benchmark runners was released. This is being worked on.
  • You should of course consider whether the tested input and output sequence lengths correspond to a workload you are interested in (thank you to Aaron Zhao for reminding me to mention this). This benchmarking approach also doesn't consider caching. Both factors could be highly relevant if trying to estimate the energy cost of a long context chat or 'agentic' flow. But I'm happy enough with the tested workloads as a starting point, and my main focus here is trying to get a degree of comfort with the reported numbers for the ISL/OSL combinations they've chosen to test.

Filed issues

I ended up filing the following issues upstream:

  • FIXED Token throughput per MW is described as reflecting the generated tokens but is actually processed+generated tokens
    • The companion article introducing InferenceMAX has previously defined throughput as the rate at which the GPU generates tokens yet the figure displayed in the UI was the total number of output and input tokens per second. The definition in the article has now been fixed, and changes to the UI make it more obvious based on context that throughput refers to input+output tokens (as y-axis metric options now exist to show "input token throughput per GPU" and "output token throughput per GPU").
    • This talking head video from Nvidia seems to make the same error, talking about the number of tokens 'generated' per second per GPU when, looking at the relevant results, these seem to be the total throughput (i.e. output plus the much faster to process input tokens).
  • Presented input/output token throughput per GPU for disaggregated setups not usefully comparable to standard multi-gpu setups
    • In disaggregated setups you have some number of GPUs dedicated to prefill (processing input tokens) and some number dedicated to decode (generating output tokens). In this case, the reported input/output throughput figures refer to the input or output throughput per prefill GPU or per decode GPU. It doesn't make sense (IMHO) to plot this figure against the input/output throughput figures for a non-disaggregated setup. To make it comparable, the input/output throughput per GPU should be calculated by averaging across the whole cluster rather than just the GPUs dedicated to prefill or decode respectively.
  • Standard deviation of interactivity (std_intvty) in result json is incorrectly calculated
    • Not a big issue as the figure isn't used anywhere. Interactivity (tokens/second) metrics are calculated from the recorded time per output token. 1000/$tpot_metric is correct for the mean, median, and p99 figures but mathematically incorrect for the standard deviation. e.g. a small standard deviation for time per output token will result in a huge standard deviation being computed for interactivity.
  • FIXED Reference kW figures no longer shown in frontend for each GPU
    • At some point updates to the frontend logic meant that the per-GPU kW figures used in calculating the token throughput per utility MW were no longer displayed. This has now been fixed.
  • How will full workflow run output be retained beyond 90 days
    • The benchmark frontend helpfully links to the GitHub Actions run that generated the displayed results and has a datepicker to view previous results. Clicking through to GitHub means you can download the original .zip of the JSON format benchmark results which is something I take advantage of in the analysis later in this article. According to GitHub docs, the maximum retention period for Actions artifacts and logs is 90 days for a public repo. It would be good to have a mechanism so that this information is backed up rather than lost.
  • Contents of CONFIG_DIR path as used in launch_gb200-nv.sh is undisclosed
    • Most benchmark configuration lives in the main repository, but unfortunately one of the Nvidia DeepSeek R1 configurations relies on a config dir that's not publicly available meaning it can't be audited or reproduced. This is a case where tightening up benchmark rules and review process can hopefully avoid it happening in the future.
  • Reconsider allowing setting max_model_len / max_seq_len to isl+osl+tiny_margin
    • As explained above, a number of benchmarks set max_model_len (or for Nvidia's TensorRT, --max_seq_len) to some figure that is just above ISL+OSL. Although some degree of tuning is expected, to me this goes against the idea that "We want server configs to reflect real world deployments as much as possible" and the stated goal "to provide benchmarks that both emulate real world applications as much as possible and reflect the continuous pace of software innovation". It's hard to imagine a realistic deployment that would configure their serving engine in a way such that it errors if input+output tokens passes ~2k tokens for instance. Looking at the DeepSeek R1 0528 providers on OpenRouter, the vast majority offer greater than 128k context.
    • By my understanding, with PagedAttention the KV cache is dynamically allocated anyway so this setting would largely impact other data structures. Plus vllm at least contains a startup check that there is sufficient VRAM to serve at least one request at the maximum configured context. I would really like to see what impact this setting has on benchmarks.
    • The repository maintainers renamed my issue to a title that doesn't reflect my report. I'm hopeful they will review my recent comment and rename it back.
  • Some reported metrics will be inflated if a serving engine sheds load
    • This covers the observation made above that failed requests are simply skipped. As the number of failed requests isn't tracked, it's not easy to see if a particular configuration may appear better (better E2E latency, lower time to first token) as a result of shedding load rather than queueing.
    • The repository maintainers renamed this issue to "[feature suggestion for vllm/vllm benchmark_serving]" and closed it. I'm hopeful they will read my response and reconsider on the grounds that:
      • The benchmark_serving script isn't doing anything "wrong" necessarily. It is simply making an implementation choice with potential impact on results that the InferenceMAX harness isn't tracking.
      • The script is planned to be added to the repo soon anyway.
  • Benchmarked ISL and OSL average 0.9*target_length, meaning results are over-optimistic.
    • This is the problem mentioned above where the introduced variance in input/output sequence length has an average lower than the headline rate. As noted, this means specifically the end to end latency figure is misleading, but also impacts tokens/second and throughput to the extent that the cost of serving a query doesn't scale with O(n).
    • This will be fixed by PR 339 which upstreams the benchmark_serving.py script and in that modified branch changes sample_random_requests to sample a range with multiplier between 1 - RANGE_RATIO and 1 + RANGE_RATIO.

In the best case, you'd hope to look at the benchmark results, accept that they probably represent a higher degree of efficiency than you'd likely get on a real workload, assume an API provider might achieve 50% of that, and double the effective cost per query to give a very rough upper estimate of per-query cost. But that only really works if the reported benchmark results roughly match the achievable throughput in a setup configured for commercial serving. Given the tuning to specific ISL/OSL values, I'm not at all confident that's the case, and I don't know how wide the gap is.

Generating results

Firstly I wrote a quick script to check some assumptions about the data and look for anything that seems anomalous; a rough sketch of the checks follows the list. Specifically:

  • Check that total throughput per GPU matches what you'd expect based on the input token and output token throughput per GPU, even in the disaggregated case, i.e. the total throughput per GPU averaged over the whole cluster should equal the sum of the input and output throughput per GPU, provided those figures are averaged over the whole cluster.
  • The ratio of input token throughput to output token throughput should be almost equal to the ratio of input to output tokens in the benchmark's workload. If not, there is something surprising that needs investigating.
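A minimal sketch of those two checks (the field names below are illustrative stand-ins, not the exact keys in the InferenceMAX result JSON):

# Sketch of the sanity checks described above; field names are assumptions.
def check_entry(entry, isl, osl, rel_tol=0.05):
    gpus = entry["num_gpus"]  # total GPUs, prefill + decode combined if disaggregated
    total = entry["total_token_throughput"] / gpus
    inp = entry["input_token_throughput"] / gpus
    out = entry["output_token_throughput"] / gpus
    # Check 1: per-GPU input + output throughput should add up to the total.
    assert abs(total - (inp + out)) <= rel_tol * total, "input+output != total throughput"
    # Check 2: the input:output throughput ratio should roughly track the workload's ISL:OSL ratio.
    assert abs((inp / out) / (isl / osl) - 1) <= rel_tol, "token ratio doesn't match the workload"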

Based on the information available in the generated result JSON and the reported all-in power per GPU (based on SemiAnalysis' model), we can calculate the Watt hours per query. First calculate the joules per token (watts per GPU divided by the total throughput per GPU). This gives a weighted average of the joules per token for the measured workload (i.e. reflecting the ratio of isl:osl). Multiplying joules per token by the tokens per query (isl+osl) gives the joules per query, and we can just divide by 3600 to get Wh.

There is some imprecision because we're constructing the figure for e.g. 8192/1024 ISL based on measurements with an average 0.9*8192 input and 0.9*1024 output length. The whole calculation would be much simpler if the benchmark harness recorded the number of queries executed and in what time, meaning we can directly calculate the Wh/query from the Wh for the system over the benchmark duration divided by the number of queries served (and remembering that in the current setup each query is on average 90% of the advertised sequence length).

This logic is wrapped up in a simple script.
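As a minimal worked example of that arithmetic (this isn't the actual script, and the wattage and throughput numbers below are invented purely for illustration):

# Worked example of the Wh/query arithmetic; the inputs here are made up for
# illustration, not taken from the benchmark results.
def wh_per_query(watts_per_gpu, tokens_per_s_per_gpu, isl, osl):
    joules_per_token = watts_per_gpu / tokens_per_s_per_gpu  # W / (tok/s) = J/tok
    joules_per_query = joules_per_token * (isl + osl)
    return joules_per_query / 3600.0  # 1 Wh = 3600 J

# e.g. 1,200 W all-in per GPU, 1,500 tok/s total throughput per GPU, 8k/1k workload
print(round(wh_per_query(1200, 1500, 8192, 1024), 2))  # ~2.05 Wh per query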

There's been a recent change to remove the 'full sweep' workflows in favour of only triggering a subset of runs when there is a relevant change. But I grabbed my results from before this happened, from a December 15th 2025 run. However when finalising this article I spotted Nvidia managed to land some new NVL72 DeepSeek R1 0528 configurations just before Christmas, so I've merged in those results as well, using a run from December 19th. All data and scripts are collected together in this Gist.

Results

As well as giving the calculated Wh per query, the script also gives a comparison point of minutes of PS5 gameplay (according to Sony, "Active Power Consumption" ranges from ~217W to ~197W depending on model - we'll just use 200W). The idea here is to provide some kind of reference point for what a given Wh figure means in real-world terms, rather than focusing solely on the relative differences between different deployments. Comparisons to "minutes of internet streaming" seem popular at the moment, presumably because it's an activity basically everyone does. I'm steering away from that because I'd be comparing one value that's hard to estimate accurately and has many provisos to another figure that's hard to estimate accurately and has many provisos, which just injects more error and uncertainty into this effort to better measure/understand/contextualise energy used for LLM inference.
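The PS5 comparison itself is then just a unit conversion against that assumed 200 W draw, for example:

# Minutes of PS5 gameplay with the same energy, assuming a flat 200 W draw.
def ps5_minutes(wh_per_query, ps5_watts=200.0):
    return wh_per_query / ps5_watts * 60.0

print(round(ps5_minutes(3.74), 2))  # ~1.12 minutes, matching the 3.74 Wh/query row in the first table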

I'm now going to cherry-pick some results for discussion. Firstly, for DeepSeek R1 0528 with 8k/1k ISL/OSL, we see that the reported configurations that give a usable level of interactivity at fp8 report between 0.96-3.74 Wh/query (equivalent to 0.29-1.12 minutes of PS5 gaming). The top row, which is substantially more efficient, is the newer GB200 NVL72 configuration added at the end of last year. It's not totally easy to trace the configuration changes given they're accompanied by a reworking of the associated scripts, but as far as I can see the configuration ultimately used is this file from the dynamo repository. Looking at the JSON, the big gain comes from significantly higher prefill throughput (with output throughput per GPU remaining roughly the same). This indicates the older results (the second row) were bottlenecked waiting for prefill to complete.

Workload              Intvty (tok/s)  E2EL (s)  Details                                                                        Wh/Q  PS5 min
fp8 DS R1 0528 8k/1k  39.5            36.5      gb200 dynamo-sglang (72 GPUs disagg, conc: 2048, pfill_dp_attn, dec_dp_attn)   0.96  0.29
fp8 DS R1 0528 8k/1k  31.3            55.2      gb200 dynamo-sglang (72 GPUs disagg, conc: 1024, pfill_dp_attn, dec_dp_attn)   3.13  0.94
fp8 DS R1 0528 8k/1k  20.9            48.8      h200 trt (8 GPUs, conc: 64, dp_attn)                                           3.32  1.00
fp8 DS R1 0528 8k/1k  19.5            49.6      h200 sglang (8 GPUs, conc: 64)                                                 3.39  1.02
fp8 DS R1 0528 8k/1k  23.9            39.9      b200-trt trt (8 GPUs, conc: 64)                                                3.39  1.02
fp8 DS R1 0528 8k/1k  22.3            44.5      b200 sglang (8 GPUs, conc: 64)                                                 3.74  1.12

Now taking a look at the results for an fp4 quantisation of the same workload, it is significantly cheaper to serve with similar or better interactivity, and the NVL72 setup Nvidia submitted does have a significant advantage over the 4/8 GPU clusters. This time we see 0.63-1.67 Wh/query (equivalent to 0.19-0.50 minutes of PS5 power draw while gaming). Serving at a lower quantisation impacts the quality of results of course, but the improved efficiency, including on smaller 4 GPU setups, helps demonstrate why models like Kimi K2 Thinking are distributed as "native int4", with benchmark results reported at this quantisation and quantisation-aware training used to maintain quality of results.

Workload              Intvty (tok/s)  E2EL (s)  Details                                                                        Wh/Q  PS5 min
fp4 DS R1 0528 8k/1k  41.6            24.6      gb200 dynamo-trt (40 GPUs disagg, conc: 1075, pfill_dp_attn, dec_dp_attn)      0.63  0.19
fp4 DS R1 0528 8k/1k  22.8            43.2      b200-trt trt (4 GPUs, conc: 128, dp_attn)                                      0.93  0.28
fp4 DS R1 0528 8k/1k  18.7            59.3      b200 sglang (4 GPUs, conc: 128)                                                1.25  0.38
fp4 DS R1 0528 8k/1k  30.3            39.4      b200 sglang (4 GPUs, conc: 64)                                                 1.67  0.50

Looking now at the 1k/8k workload (i.e. generating significant output), the cost is 15.0-16.3 Wh/query (equivalent to 4.49-4.89 minutes of PS5 power draw while gaming). As expected, this is significantly higher than the 8k/1k workload, as prefill (processing input tokens) is much cheaper per token than decode (generating output tokens).

Workload              Intvty (tok/s)  E2EL (s)  Details                           Wh/Q  PS5 min
fp8 DS R1 0528 1k/8k  42.5            176.3     b200 sglang (8 GPUs, conc: 64)    15.0  4.49
fp8 DS R1 0528 1k/8k  31.9            232.2     h200 sglang (8 GPUs, conc: 64)    15.9  4.76
fp8 DS R1 0528 1k/8k  31.2            237.9     h200 trt (8 GPUs, conc: 64)       16.3  4.88
fp8 DS R1 0528 1k/8k  39.1            189.5     b200-trt trt (8 GPUs, conc: 64)   16.3  4.89

Again, fp4 has a significant improvement in efficiency:

Workload              Intvty (tok/s)  E2EL (s)  Details                                      Wh/Q  PS5 min
fp4 DS R1 0528 1k/8k  29.7            251.5     b200-trt trt (4 GPUs, conc: 256, dp_attn)    2.73  0.82
fp4 DS R1 0528 1k/8k  37.7            197.5     b200-trt trt (8 GPUs, conc: 256, dp_attn)    4.31  1.29
fp4 DS R1 0528 1k/8k  34.2            221.2     b200 sglang (4 GPUs, conc: 128)              4.75  1.43
fp4 DS R1 0528 1k/8k  33.1            223.1     b200-trt trt (4 GPUs, conc: 128)             4.79  1.44

As you'd expect for a much smaller model at native fp4 quantisation, GPT-OSS-120B is much cheaper to serve. e.g. for 8k/1k:

Workload                Intvty (tok/s)  E2EL (s)  Details                                     Wh/Q  PS5 min
fp4 GPT-OSS 120B 8k/1k  45.8            20.8      b200-trt trt (1 GPU, conc: 128)             0.11  0.03
fp4 GPT-OSS 120B 8k/1k  93.1            10.5      b200-trt trt (2 GPUs, conc: 128, dp_attn)   0.11  0.03
fp4 GPT-OSS 120B 8k/1k  44.3            21.4      b200 vllm (1 GPU, conc: 128)                0.11  0.03
fp4 GPT-OSS 120B 8k/1k  145.7           6.7       b200-trt trt (2 GPUs, conc: 64, dp_attn)    0.14  0.04
fp4 GPT-OSS 120B 8k/1k  103.8           9.2       b200 vllm (2 GPUs, conc: 64)                0.20  0.06

Or for 1k/8k:

Workload                Intvty (tok/s)  E2EL (s)  Details                                     Wh/Q  PS5 min
fp4 GPT-OSS 120B 1k/8k  80.5            91.6      b200-trt trt (1 GPU, conc: 128)             0.49  0.15
fp4 GPT-OSS 120B 1k/8k  72.3            102.0     b200 vllm (1 GPU, conc: 128)                0.55  0.16
fp4 GPT-OSS 120B 1k/8k  144.9           51.1      b200-trt trt (2 GPUs, conc: 128, dp_attn)   0.55  0.17
fp4 GPT-OSS 120B 1k/8k  129.4           57.0      b200-trt trt (2 GPUs, conc: 128)            0.61  0.18

Conclusion

Well, this took rather a lot more work than I thought it would and I'm not yet fully satisfied with the result. Partly we have to accept a degree of fuzziness about marginal energy usage of an individual query - it's going to depend on the overall workload of the system so there's going to be some approximation when you try to cost a single query.

I'm glad that InferenceMAX exists and am especially glad that it's open and publicly developed, which is what has allowed me to dive into its implementation to the extent I have and flag concerns/issues. I feel it's not yet fully living up to its aim of providing results that reflect real world applications, but I hope that will improve with further maturation and better rules for benchmark participants. Of course, it may still make most sense to collect benchmark figures myself, and even then, being able to refer to the benchmarked configurations and get an indication of what performance a given hardware setup can achieve is helpful. Renting a 72-GPU cluster is expensive and, as far as I can see, not typically available for a short time, so any benchmarking run by myself would be limited to 4-8 GPU configurations. If the gap in efficiency is huge for such setups vs the NVL72 then these smaller setups are maybe less interesting.

If I found the time to run benchmarks myself, what would I be testing? I'd move to DeepSeek V3.2. One of the big features of this release was the move to a new attention mechanism which scales much closer to linearly with sequence length. With e.g. Kimi Linear and Qwen3-Next, other labs are moving in a similar direction, at least experimentally. I'd try to set up an 8 GPU configuration with sglang/vllm configured in a way that would be capable of serving a commercial workload with varied input/output sequence lengths, and test that this is the case (Chutes provide their deployed configs, which may be another reference point). I'd want to see how much the effective Wh per million input/output tokens varies depending on the different ISL/OSL workloads. These should be relatively similar given the linear attention mechanism, and if so it's a lot easier to estimate the rough energy cost of a series of your own queries of varied length. I would stick with the random input tokens for the time being.

So where does that leave us? All of this and we've got figures for two particular models, with one benchmark harness, a limited set of input/output sequence lengths, and a range of potential issues that might impact the conclusion. I think this is a useful yardstick / datapoint, though I'd like to get towards something that's even more useful and that I have more faith in.


Article changelog
  • 2026-01-09:
    • Fix broken link.
    • Add note that more complete system info would be helpful for reproducibility.
    • Add note about variety of input/output sequence lengths tested.
  • 2026-01-07: Initial publication date.

January 07, 2026 12:00 PM

January 06, 2026

Brian Kardell

The Secret Life of Custom Elements

The Secret Life of Custom Elements

Twenty years ago last month, Google published an analysis of "slightly over a billion documents," a snapshot of the web that helped shape the early direction of HTML5. It followed a lineage of smaller, more personal studies — individuals poking at the web to answer some narrow question, often with datasets that would easily fit on a thumb drive today. For about half those two decades, I’ve been arguing that we need more study of the web, not less. The platform evolves faster than our understanding of it, and the only way to know what the web actually is — not what we imagine it to be — is to look.

Every month the HTTP Archive quietly captures a snapshot of the web as it actually exists—not the idealized web that we hope for, but the messy, improvised, duct-taped reality of millions of sites in the wild. I've been collecting and studying the non-standard elements in these snapshots for the last six years.

This new dataset is the largest I’ve ever worked with: Billions of pages, hundreds of thousands of distinct non-standard element names, and a long tail that stretches into places no standards body has ever seriously examined. And unlike the Google study, which looked for patterns in class names, this dataset captures the long tail of non‑standard elements — the names people invent for actual elements when the platform doesn’t give them what they need.

What emerges is a portrait of the web as it is lived: messy, inventive, repetitive, global, and full of reinvention. It’s also a mirror held up to the platform itself.

But, it's also much more complex to study than I could have imagined a decade ago, and I really wish that the W3C (and member orgs which include academia) had taken up the charge to begin to figure out how to really study the web and use that information to inform standards work.

What's difficult about it...

One problem is that the dataset itself has some fairly extreme bias. The crawl doesn't hit anything that isn't on the public internet - that means it excludes intranets which are massive. In fact, most of my career was spent working on intranets. The crawl captures only home pages, plus the target of whatever it interprets as the largest link on that page. It also can't get to anything that requires login - which means that for a site like twitter or bluesky or mastodon, you're going to get something very unrepresentative of any of those. So, one challenge I'd love to see us trying to tackle is how to get even better data representation. It's hard to "pave cowpaths" if they're in a country we can't even see into.

Initially I had this idea that we could watch for the adoption of tags - imagining that we'd get some that would become very popular, just like we did with JavaScript libraries and frameworks. However, it turns out that this is not the signal it might first appear to be. An element appearing in tens of thousands or even hundreds of thousands of pages is often simply because it is part of a larger successful system. If Wix or Shopify create some custom elements that work behind their WYSIWYG tooling, and lots of people use that tooling to create their pages, then suddenly those elements get very, very popular - even if they aren't actually particularly good. In fact, we can see shifts in the data where the teams themselves changed their minds and another version supplants the first very quickly, simply because it's internal.

Then I thought that perhaps what we can do with the dataset instead is to squint at it and look a little more abstractly at what people are naming their elements, and see if people are re-solving similar problems. Do we find, for example, multiple non-standard element names that appear to be about tabs? Yes! Clearly that indicates that we need a native element, right? Maybe. It's a bit more nuanced than that. Here are the most commonly re-created/repeated non-standard element themes:

  • Navigation
  • Headers and footers
  • Carousels and sliders
  • Modals
  • Search bars
  • Product cards
  • Login forms
  • Cookie banners
  • Accordions
  • Tabs
  • Toasts
  • Breadcrumbs

While we don't have several of these in standard HTML, we do have native <header>, <footer>, <nav>, <dialog>, and <search> elements, and even accordions via the name attribute of <details>. And yet, the wild still contains hundreds or thousands of custom elements with names like <app-header>, <site-footer>, <main-nav>, <modal-dialog>, <search-box>, and <accordion-panel>.

Native primitives may exist, but not at the same level of abstraction as these. <header> and <footer> in HTML are structural, not behavioral. <dialog> is behavioral, but not styled. <search> exists, but doesn’t solve autocomplete, filtering, or results.

So developers build those - and, if you stop and think about it, not all non-standard elements are equally undesirable. Many of them will be simple decorations or thin wrappers that do use their native counterparts. What is definitely interesting to study is where there is a clear generic need and the platform doesn't provide anything close - tabs, from the list above, for example.

Observations..

Here are many observations from the data, in no particular order of importance.

Forms and Inputs: Tweaked, Wrapped, and Re‑Wrapped

Forms and inputs are a great example of the constant re-invention I just described. Sometimes it's because the native element is insufficient, but that's not necessarily the case. In some cases they're just slight wrappers. Among them are lots and lots of "pickers" and "selectors" that show up...

  • <custom-select>
  • <date-picker>
  • <variant-picker>
  • <quantity-selector>

There is already a lot of ongoing work to make native form elements (including selects) require less code and just be more stylable and flexible, and the data at least suggests that such efforts will be very welcome.

Hidden Machinery

A surprising number of elements aren’t UI components at all. They’re runtime markers:

  • <ng-container>
  • <router-outlet>
  • <astro-island>
  • <ion-router-outlet>
  • <next-route-announcer>

These exist because frameworks need declarative boundaries for hydration, routing, rendering or template expansion. I suppose it is debatable whether these are an indicator of "missing HTML features", or just how much of one.

Carousels (and sliders... and toasts)

I don't love carousels, but it's hard to deny that they are popular. There are dozens of distinct and identifiable carousel/slider elements in the dataset, and they appear a lot. I really dislike a few bits of Google's attempt to make CSS-only carousels possible, but it's pretty clear why they chose to tackle that problem. I guess it is worth stressing again the bias in the dataset here - if there is a page where I most expect to see a carousel, it is exactly the home page the archive crawls. So, while carousels are among the most popular themes in the dataset, I don't know that they are the most popular all-around. You can see why Google winds up with their proposals, though; toasts are on that top list too.

Structural semantics?

There are a few broad categories where the main point seems to be "semantics". That is, very often many of these don't actually do anything beyond providing some hooks, mainly for styling. They aren't actually even custom elements sometimes (or maybe even often) - just non-standard elements.
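To make that distinction concrete (a hyphen in the tag name makes it at least a valid custom element name, while any other unknown tag is merely non-standard), here is a rough sketch of the kind of classification involved; the set of known HTML names below is an abbreviated assumption, not the full spec list:

# Rough sketch of classifying the tag names found in a page. KNOWN_HTML is an
# abbreviated, assumed subset of standard element names, not the real spec list.
from html.parser import HTMLParser

KNOWN_HTML = {"html", "head", "body", "div", "span", "header", "footer", "nav",
              "dialog", "search", "details", "summary", "a", "p", "ul", "li", "img"}

class TagCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.custom_named = set()   # valid custom element names (contain a hyphen)
        self.nonstandard = set()    # unknown tags that can never be custom elements
    def handle_starttag(self, tag, attrs):
        if tag in KNOWN_HTML:
            return
        (self.custom_named if "-" in tag else self.nonstandard).add(tag)

collector = TagCollector()
collector.feed("<product-card><pullquote>hi</pullquote></product-card>")
print(collector.custom_named, collector.nonstandard)  # {'product-card'} {'pullquote'}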

e-commerce

Dozens of these surround e-commerce. There are tens of thousands of sites that use elements with names like the following (and variants of them):

Product & merchandising
  • <product-card>
  • <product-title>
  • <product-price>
  • <product-rating>
  • <product-variant>
  • <product-gallery>
  • <product-description>
  • <product-badge>
Pricing & money
  • <price-money>
  • <sale-price>
  • <compare-at-price>
  • <discount-amount>
  • <currency-display>
Inventory & availability
  • <stock-status>
  • <pickup-availability>
  • <delivery-estimate>
  • <inventory-level>
Cart & checkout
  • <cart-items>
  • <cart-count>
  • <checkout-button>
  • <order-summary>

Very interestingly, they are often used alongside actual machine-readable semantics via JSON-LD in the same markup.

While the vast majority of these elements appear because of common tooling, the fact that there are dozens of variants of similar names appearing on smaller numbers of sites indicates there is something widely interesting here. It's hard to say what it is other than that it would be nice to have a common structural semantic that would work for both purposes.

I guess the biggest surprise here is that, if it's true, why hasn't such a thing arisen already? It is entirely within the community's power to develop such a thing. Perhaps the answer is that there is just so much variance it isn't easily plausible. Maybe templating would somehow allow us to achieve a common pattern that achieves this based on the shared JSON-LD semantics.

Publishing & Editorial Semantics

CMSes and news sites often invent tags for editorial structure, and many of these are sticking around.

Content structure
  • <article-header>
  • <article-summary>
  • <article-author>
  • <article-date>
  • <article-tags>
  • <article-tag>
  • <article-category>
  • <byline>
  • <dateline>
  • <pullquote>
  • <footnote>
Taxonomy
  • <tag-list>
  • <category-label>
  • <topic-header>

These reflect the needs of journalism and long‑form content.

Social & Community Semantics

These show up in comment systems, forums, and social platforms.

User‑generated content
  • <comment>
  • <comment-list>
  • <comment-item>
  • <comment-author>
  • <comment-content>
  • <comment-date>
  • <comment-form>
Identity
  • <user-avatar>
  • <user-name>
  • <profile-card>

These encode relationships and interactions, not UI patterns.

Events
  • <event-date>
  • <event-location>
  • <event-schedule>
  • <event-details>

Again, these are domain objects, not widgets - and they have well established schema.org or microformats as well.

Invoicing
  • <invoice>
  • <invoice-line>
  • <invoice-total>
  • <invoice-summary>

Before the web came along, there were already national and international standards around electronically trading information like invoices - and when XML was sold, invoices were a common example. Here we are again.

"Namespaced" Elements

Several elements like `o:p`, `rdf:rdf`, `dc:format`, `cc:work`, `fb:like`, `g:plusone` appear in the top 100. These were basically anticipating an XHTML future (namespacing) that never really arrived. However, HTML has always allowed the colon - so that's just the tag name. In many ways, it's just as good. Interestingly, these may be some of the better examples of what I'd like to see happen - they are widely understood.

Conversely, while hugely successful, the share buttons are more an indication of a desire than something we could actually standardize in precisely that way. They also point to a desire _in time_. Google Plus doesn't even exist anymore, and `fb:like` is from a time when Facebook was at the top of the list of the most interesting places to be. Maybe one of the things we've learned is that this is way handier to do at the browser/OS level? I suppose the Web Share API was a part of thinking about how we'd deal with this.

The fact that they both still appear so much is also kind of an indication of the age of the pages and the slow replacement of the underlying tools.

Typos, Encoding Errors, and the Weird Stuff

One of the most delightful parts of the dataset is the long tail of what are almost certainly just typos:

  • <prodcut-card>
  • <navgation>
  • <contianer>

The fact that these can appear on tens of thousands of sites because they are part of common tooling helps reinforce that not every non-standard element is a signal. :)

In conclusion...

I wish that I could say "Ah ha - the data says very clearly that these are the specific things we should definitely 'just write down' now" in the way that I imagined a decade ago, but I don't think we're there yet. I guess if I had to give three things I'd like to see happen from here they'd be:

  1. We need lots more effort in thinking about how to study these things. I would love to see real investment in this space. This year, at last, the W3C is hiring someone to study the web. I'm not yet sure what that will look like, but I look forward to discussing it more with them.

  2. We need a real community effort - an Underwriters Labs for custom elements, with participation and funding from orgs with money. We don't necessarily need "the one true tabs" as much as we need a place to find what I expect will be a very few sets of tabs as custom elements which we can trust like we trust native elements. Given a little bit of time, I have faith that this will naturally sort itself into a few 'winners'.

  3. That community effort might also include things which won't ever have native implementations, but which lay down some kind of light semantic meaning or compound styling structure that we all begin to agree on - like product cards or breadcrumbs.

A lot of this is pretty adjacent to the ideas behind OpenUI, and it's possible some of this could just happen there. However, mainly due to limited participation and other constraints, OpenUI has not really produced custom elements or worked to list, grade, and promote them (though we did study them quite a bit in the tabs research). The effort led by Brad Frost to think about a "global design system" in particular might be closer to some of these ideas.

January 06, 2026 05:00 AM

January 05, 2026

Andy Wingo

pre-tenuring in v8

Hey hey happy new year, friends! Today I was going over some V8 code that touched pre-tenuring: allocating objects directly in the old space instead of the nursery. I knew the theory here but I had never looked into the mechanism. Today’s post is a quick overview of how it’s done.

allocation sites

In a JavaScript program, there are a number of source code locations that allocate. Statistically speaking, any given allocation is likely to be short-lived, so generational garbage collection partitions freshly-allocated objects into their own space. In that way, when the system runs out of memory, it can preferentially reclaim memory from the nursery space instead of groveling over the whole heap.

But you know what they say: there are lies, damn lies, and statistics. Some programs are outliers, allocating objects in such a way that they don’t die young, or at least not young enough. In those cases, allocating into the nursery is just overhead, because minor collection won’t reclaim much memory (because too many objects survive), and because of useless copying as the object is scavenged within the nursery or promoted into the old generation. It would have been better to eagerly tenure such allocations into the old generation in the first place. (The more I think about it, the funnier pre-tenuring is as a term; what if some PhD programs could pre-allocate their graduates into named chairs? Is going straight to industry the equivalent of dying young? Does collaborating on a paper with a full professor imply a write barrier? But I digress.)

Among the set of allocation sites in a program, a subset should pre-tenure their objects. How can we know which ones? There is a literature of static techniques, but this is JavaScript, so the answer in general is dynamic: we should observe how many objects survive collection, organized by allocation site, then optimize to assume that the future will be like the past, falling back to a general path if the assumptions fail to hold.

my runtime doth object

The high-level overview of how V8 implements pre-tenuring is based on per-program-point AllocationSite objects, and per-allocation AllocationMemento objects that point back to their corresponding AllocationSite. Initially, V8 doesn’t know what program points would profit from pre-tenuring, and instead allocates everything in the nursery. Here’s a quick picture:

A linear allocation buffer containing objects interleaved with their allocation mementos

Here we show that there are two allocation sites, Site1 and Site2. V8 is currently allocating into a linear allocation buffer (LAB) in the nursery, and has allocated three objects. After each of these objects is an AllocationMemento; in this example, M1 and M3 are AllocationMemento objects that point to Site1 and M2 points to Site2. When V8 allocates an object, it increments the “created” counter on the corresponding AllocationSite (if available; it’s possible an allocation comes from C++ or something where we don’t have an AllocationSite).

When the free space in the LAB is too small for an allocation, V8 gets another LAB, or collects if there are no more LABs in the nursery. When V8 does a minor collection, as the scavenger visits objects, it will look to see if the object is followed by an AllocationMemento. If so, it dereferences the memento to find the AllocationSite, then increments its “found” counter, and adds the AllocationSite to a set. Once an AllocationSite has had 100 allocations, it is enqueued for a pre-tenuring decision; sites with 85% survival get marked for pre-tenuring.
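
To make the bookkeeping concrete, here is a toy model in JavaScript (a sketch of the decision rule described above, not V8's actual C++ code; the 100-allocation and 85% thresholds are the ones quoted in the text):

// Toy model of per-site pre-tenuring feedback; not real V8 code.
class FakeAllocationSite {
  constructor() {
    this.created = 0;          // bumped when this site allocates
    this.found = 0;            // bumped when a survivor's memento points here
    this.pretenure = false;
  }
  onAllocate() { this.created++; }
  onSurvivorFound() { this.found++; }
  // Run for sites collected into the set during a minor collection.
  maybeMarkForPretenuring() {
    if (this.created >= 100 && this.found / this.created >= 0.85) {
      this.pretenure = true;   // future optimized code allocates in old space
    }
  }
}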

If an allocation site is marked as needing pre-tenuring, the code in which it is embedded will get de-optimized; the next time that code is optimized, the code generator arranges to allocate into the old generation instead of the default nursery.

Finally, if a major collection collects more than 90% of the old generation, V8 resets all pre-tenured allocation sites, under the assumption that pre-tenuring was actually premature.

tenure for me but not for thee

What kinds of allocation sites are eligible for pre-tenuring? Sometimes it depends on object kind; wasm memories, for example, are almost always long-lived, so they are always pre-tenured. Sometimes it depends on who is doing the allocation; allocations from the bootstrapper, literals allocated by the parser, and many allocations from C++ go straight to the old generation. And sometimes the compiler has enough information to determine that pre-tenuring might be a good idea, as when it generates a store of a fresh object to a field in a known-old object.

But otherwise I thought that the whole AllocationSite mechanism would apply generally, to any object creation. It turns out, nope: it seems to only apply to object literals, array literals, and new Array. Weird, right? I guess it makes sense in that these are the ways to create objects that also create the field values at creation time, allowing the whole block to be allocated to the same space. If instead you make a pre-tenured object and then initialize it via a sequence of stores, this would likely create old-to-new edges, preventing the new objects from dying young while incurring the penalty of copying and write barriers. Still, I think there is probably some juice to squeeze here for pre-tenuring of class-style allocations, at least in the optimizing compiler or in short inline caches.
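
For reference, the distinction as seen from JavaScript looks roughly like this; the first three forms are the ones that get AllocationSite tracking per the above, while the class below (a made-up example) allocates via a constructor and a sequence of stores:

const literal = { x: 1, y: 2 };   // object literal: tracked per allocation site
const arrLit  = [1, 2, 3];        // array literal: tracked
const arrNew  = new Array(16);    // new Array: tracked

// Class-style allocation, initialized by stores in the constructor;
// this path currently gets no per-site pre-tenuring feedback.
class Point {
  constructor(x, y) {
    this.x = x;
    this.y = y;
  }
}
const p = new Point(1, 2);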

I suspect this state of affairs is somewhat historical, as the AllocationSite mechanism seems to have originated with typed array storage strategies and V8’s “boilerplate” object literal allocators; both of these predate per-AllocationSite pre-tenuring decisions.

fin

Well, that’s adaptive pre-tenuring in V8! I think the “just stick a memento after the object” approach is pleasantly simple, and if you are only bumping creation counters from baseline compilation tiers, it likely amortizes out to a win. But does the restricted application to literals point to a fundamental constraint, or is it just an accident? If you have any insight, let me know :) Until then, happy hacking!

by Andy Wingo at January 05, 2026 03:38 PM

December 30, 2025

December 25, 2025

Igalia WebKit Team

WebKit Igalia Periodical #52

Update on what happened in WebKit in the week from December 16 to December 25.

Right during the holiday season 🎄, the last WIP installment of the year comes packed with new releases, a couple of functions added to the public API, cleanups, better timer handling, and improvements to MathML and WebXR support.

Cross-Port 🐱

Landed support for font-size: math. Now math-depth can automatically control the font size inside of <math> blocks, making scripts and nested content smaller to improve readability and presentation.

Two new functions have been added to the public API:

  • webkit_context_menu_item_get_gaction_target() to obtain the GVariant associated with a context menu item created from a GAction.

  • webkit_context_menu_item_get_title() may be used to obtain the title of a context menu item.

Improved timers by making some of them use the timerfd API. This reduces timer “lateness” (the amount of time elapsed between the configured trigger time and the effective one), which in turn improves the perceived smoothness of animations thanks to steadier frame delivery timings. Systems where the timerfd_create and timerfd_settime functions are not available will continue working as before.

On the WebXR front, support was added for XR_TRACKABLE_TYPE_DEPTH_ANDROID through the XR_ANDROID_trackables extension, which allows reporting depth information for elements that take part in hit testing.

Graphics 🖼️

Landed a change that implements non-composited page rendering in the WPE port. This new mode is disabled by default and may be activated by disabling the AcceleratedCompositing runtime preference. In that case, frames are rendered using a simplified code path that does not involve the internal WebKit compositor, which may offer better performance in some specific cases on constrained embedded devices.

Since version 2.10.2, the FreeType library can be built with direct support for loading fonts in the WOFF2 format. Until now, the WPE and GTK WebKit ports used libwoff2 in an intermediate step to convert those fonts on-the-fly before handing them to FreeType for rendering. The CMake build system will now detect when FreeType supports WOFF2 directly and skip the conversion step. This way, in systems which provide a suitable version of FreeType, libwoff2 will no longer be needed.

WPE WebKit 📟

WPE Platform API 🧩

New, modern platform API that supersedes usage of libwpe and WPE backends.

The legacy libwpe-based API can now be disabled at build time by toggling the ENABLE_WPE_LEGACY_API CMake option. This allows removing unneeded code when an application exclusively uses the new WPEPlatform API.

WPE Android 🤖

Adaptation of WPE WebKit targeting the Android operating system.

AHardwareBuffer is now supported as backing for accelerated graphics surfaces that can be shared across processes. This is the last piece of the puzzle to use WPEPlatform on Android without involving expensive operations to copy rendered frames back-and-forth between GPU and system memory.

Releases 📦️

WebKitGTK 2.50.4 and WPE WebKit 2.50.4 have been released. These stable releases include a number of important patches for security issues, and we urge users and distributors to update to this release if they have not yet done so. An accompanying security advisory, WSA-2025-0010, has been published (GTK, WPE).

Development releases of WebKitGTK 2.51.4 and WPE WebKit 2.51.4 are available as well, and may be used to preview upcoming features. As usual, bug reports are welcome in Bugzilla.

Community & Events 🤝

Paweł Lampe has published a blog post that discusses various pre-rendering techniques useful in the context of using WPE on embedded devices.

That’s all for this week!

by Igalia WebKit Team at December 25, 2025 06:26 PM

December 19, 2025

Eric Meyer

Targeting by Reference in the Shadow DOM

I’ve long made it clear that I don’t particularly care for the whole Shadow DOM thing.  I believe I understand the problems it tries to solve, and I fully acknowledge that those are problems worth solving.  There are just a bunch of things about it that don’t feel right to me, like how it can break accessibility in a number of ways.

One of those things is how it breaks stuff like the commandFor attribute on <button>s, or the popoverTarget attribute, or a variety of ARIA attributes such as aria-labelledby.  This happens because a Shadow DOMmed component creates a whole separate node tree, which creates a barrier (for a lot of things, to be clear; this is just one class of them).

At least, that’s been the case.  There’s now a proposal to fix that, and prototype implementations in both Chrome and Safari!  In Chrome, it’s covered by the Experimental Web Platform features flag in chrome://flags.  In Safari, you open the Develop > Feature Flags… dialog, search for “referenceTarget”, and enable both flags.

(Disclosure: My employer, Igalia, with support from NLnet, did the WebKit implementation, and also a Gecko implementation that’s being reviewed as I write this.)

If you’re familiar with Shadow DOMming, you know that there are attributes for the <template> element like shadowRootClonable that set how the Shadow DOM for that particular component can be used.  The proposal at hand is for a shadowRootReferenceTarget attribute, which is a string used to identify an element within the Shadowed DOM tree that should be the actual target of any references.  This is backed by a ShadowRoot.referenceTarget API feature.

Take this simple setup as a quick example.

<label for="consent">I agree to join your marketing email list for some reason</label>
<sp-checkbox id="consent">
	<template>
		<input id="setting" type="checkbox" aria-checked="indeterminate">
		<span id="box"></span>
	</template>
</sp-checkbox>

Assume there’s some JavaScript to make that stuff inside the Shadow DOM work as intended.  (No, nothing this simple should really be a web component, but let’s assume that someone has created a whole multi-faceted component system for handling rich user interactions or whatever, and someone else has to use it for job-related reasons, and this is one small use of that system.)

The problem is, the <label> element’s for is pointing at consent, which is the ID of the component.  The actual thing that should be targeted is the <input> element with the ID of setting.  We can’t just change the markup to <label for="setting"> because that <input> is trapped in the Shadow tree, where none in the Light beyond may call for it.  So it just plain old doesn’t work.

Under the Reference Target proposal, one way to fix this would look something like this in HTML:

<label for="consent">I agree to join your marketing email list for some reason</label>
<sp-checkbox id="consent">
	<template shadowRootReferenceTarget="setting">
		<input id="setting" type="checkbox" aria-checked="indeterminate">
		<span id="box"></span>
	</template>
</sp-checkbox>

With this markup in place, if someone clicks/taps/otherwise activates the label, it points to the ID consent.  That Shadowed component takes that reference and redirects it to an effective target — the reference target identified in its shadowRootReferenceTarget attribute.

You could also set up the reference with JavaScript instead of an HTML template:

<label for="consent">I agree to join your marketing email list for some reason</label>
<sp-checkbox id="consent"></sp-checkbox>

class SpecialCheckbox extends HTMLElement {
	checked = "mixed";
	constructor() {
		super();
		this.shadowRoot_ = this.attachShadow({
			mode: "open",
			referenceTarget: "setting"
		});

		// lines of code to Make It Go
	}
}

Either way, the effective target is the <input> with the ID of setting.

This can be used in any situation where one element targets another, not just with for.  The form and list attributes on inputs would benefit from this.  So, too, would the relatively new popoverTarget and commandFor button attributes.  And all of the ARIA targeting attributes, like aria-controls and aria-errormessage and aria-owns as well.
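
If you want to gate an experiment on this, one plausible feature check is to look for the ShadowRoot.referenceTarget property mentioned above (a quick sketch, assuming the property is exposed on the prototype):

// Rough feature check for the Reference Target prototype property.
const supportsReferenceTarget =
	typeof ShadowRoot !== "undefined" &&
	"referenceTarget" in ShadowRoot.prototype;

if (!supportsReferenceTarget) {
	// Fall back to keeping the target element in the light DOM, or skip the demo.
	console.warn("Reference Target is not enabled in this browser.");
}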

If reference targets are something you think would be useful in your own work, please give it a try in Chrome or Safari or both, to see if your use cases are being met.  If not, you can leave feedback on issue 1120 to share any problems you run into.  If we’re going to have a Shadow DOM, the least we can do is make it as accessible and useful as possible.


Have something to say to all that? You can add a comment to the post, or email Eric directly.

by Eric Meyer at December 19, 2025 03:04 PM

Pawel Lampe

WPE performance considerations: pre-rendering

This article is a continuation of the series on WPE performance considerations. While the previous article touched upon fairly low-level aspects of the DOM tree overhead, this one focuses on more high-level problems related to managing the application’s workload over time. Similarly to before, the considerations and conclusions made in this blog post are strongly related to web applications in the context of embedded devices, and hence the techniques presented should be used with extra care (and benchmarking) if one would like to apply those on desktop-class devices.

The workload #

Typical web applications on embedded devices have their workloads distributed over time in various ways. In practice, however, the workload distributions can usually be fitted into one of the following categories:

  1. Idle applications with occasional updates - applications that present static content and are updated at very long intervals. As an example, one can think of a dashboard that presents static content and switches the page every, say, 60 seconds - such as a static departures/arrivals board at an airport.
  2. Idle applications with frequent updates - the applications that present static content yet are updated frequently (or are presenting some dynamic content, such as animations occasionally). In that case, one can imagine a similar airport departures/arrivals dashboard, yet with the animated page scrolling happening quite frequently.
  3. Active applications with occasional updates - the applications that present some dynamic content (animations, multimedia, etc.), yet with major updates happening very rarely. An example one can think of in this case is an application playing video along with presenting some metadata about it, and switching between other videos every few minutes.
  4. Active applications with frequent updates - the applications that present some dynamic content and change the surroundings quite often. In this case, one can think of a stock market dashboard continuously animating the charts and updating the presented real-time statistics very frequently.

Such workloads can be well demonstrated on charts plotting the browser’s CPU usage over time:

Typical web application workloads.

As long as the peak workload (due to updates) is small, no negative effects are perceived by the end user. However, when the peak workload is significant, some negative effects may start getting noticeable.

In the case of applications from groups (1) and (2) above, a significant peak workload may not be a problem at all. As long as there are no continuous visual changes and no interaction is allowed during updates, the end user cannot tell that the browser was unresponsive or missed some frames for a period of time. In such cases, the application designer does not need to worry much about the workload.

In other cases, especially those involving applications from groups (3) and (4) above, a significant peak workload may lead to visual stuttering, as any processing that keeps the browser busy for longer than 16.6 milliseconds (one frame at 60 FPS) will lead to dropped frames. In such cases, the workload has to be managed so that the peaks are reduced, either by optimizing them or by distributing them over time.

First step: optimization #

The first step in addressing the peak workload is usually optimization. The modern web platform provides a wide variety of tools to optimize every stage of web application processing done by the browser. The usual optimization process is a two-step cycle: measuring the bottlenecks, then fixing them. In this process, the usual improvements involve:

  • using CSS containment,
  • using shadow DOM,
  • promoting certain parts of the DOM to layers and manipulating them with transforms,
  • parallelizing the work with workers/worklets,
  • using the visibility CSS property to separate painting from layout,
  • optimizing the application itself (JavaScript code, the structure of the DOM, the architecture of the application),
  • etc.

Second step: pre-rendering #

Unfortunately, in practice, it’s not uncommon that even very well optimized applications still have too much peak workload for the constrained embedded devices they’re used on. In such cases, the last-resort solution is pre-rendering. As long as it’s possible from the application’s business-logic perspective, having at least some web page content pre-rendered is very helpful when the workload has to be managed, as pre-rendering allows the web application designer to choose precisely when the content is actually rendered and how. With that, it’s possible to establish a proper trade-off between the reduction in peak workload and the amount of extra memory used for storing the pre-rendered contents.

Pre-rendering techniques #

Nowadays, the web platform provides at least a few widely-adopted APIs that give the application the means to perform various kinds of pre-rendering. Also, due to the way browsers are implemented, some APIs can be purposely misused to obtain pre-rendering techniques not necessarily supported by the specification. However, in the pursuit of good trade-offs, all the possibilities should be taken into account.

Before jumping into particular pre-rendering techniques, it’s necessary to emphasize that the pre-rendering term used in this article refers to the actual rendering being done earlier than it’s visually presented. In that sense, the resource is rasterized to some intermediate form when desired and then just composited by the browser engine’s compositor later.

Pre-rendering offline #

The most basic (and at the same time most limited) pre-rendering technique is rendering offline, i.e. before the browser even starts. In that case, the first limitation is that the content to be rendered must be known beforehand. If that’s the case, the rendering can be done in any way, and the result may be captured as e.g. a raster or vector image (depending on the desired trade-off). However, the other problem is that such rendering is usually outside the scope of the given web application and thus requires extra effort. Moreover, depending on the amount of extra memory used, the longer web application startup (due to loading the pre-rendered resources), and the processing power required to composite a given resource, it may not always be trivial to obtain the desired gains.

Pre-rendering using canvas #

The first group of actual pre-rendering techniques happening during web application runtime is related to Canvas and OffscreenCanvas. Those APIs are really useful, as they offer great flexibility in terms of usage and are usually very performant. However, in this case, the natural downside is the lack of support for rendering the DOM inside the canvas. Moreover, canvas has very limited support for painting text — unlike the DOM, where CSS offers a significant number of features for it. Interestingly, there’s an ongoing proposal called HTML-in-Canvas that could resolve those limitations to some degree. In fact, Blink has a functioning prototype of it already. However, it may take a while before the spec is mature and widely adopted by other browser engines.

When it comes to actual usage of canvas APIs for pre-rendering, the possibilities are numerous, and there are even more of them when combined with processing using workers. The most popular ones are as follows:

  • rendering to an invisible canvas and showing it later,
  • rendering to a canvas detached from the DOM and attaching it later,
  • rendering to an invisible/detached canvas and producing an image out of it to be shown later,
  • rendering to an offscreen canvas and producing an image out of it to be shown later.

When combined with workers, some of the above techniques may be used in worker threads, with the rendered artifacts transferred to the main thread for presentation purposes. In that case, one must be careful with the transfer itself, as some objects may get serialized, which is very costly. To avoid that, it’s recommended to use transferable objects and to always benchmark properly to make sure the transfer does not involve serialization in the particular case.
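
As an illustration of the worker-based variant, here is a minimal sketch (the file name and message shape are made up for the example) that rasterizes into an OffscreenCanvas on a worker thread and transfers the result back to the main thread as an ImageBitmap, so no serialization is involved:

// --- prerender-worker.js (hypothetical file name) ---
self.onmessage = ({ data }) => {
  const canvas = new OffscreenCanvas(data.width, data.height);
  const ctx = canvas.getContext("2d");
  ctx.font = "16px sans-serif";
  ctx.fillText(data.text, 10, 24);               // the expensive drawing happens here
  const bitmap = canvas.transferToImageBitmap();
  self.postMessage({ bitmap }, [bitmap]);        // transferred, not copied
};

// --- main thread ---
const worker = new Worker("prerender-worker.js");
worker.onmessage = ({ data }) => {
  // At the chosen moment, painting the pre-rendered bitmap is cheap
  // compared to re-doing the drawing on the main thread.
  document.querySelector("canvas").getContext("2d").drawImage(data.bitmap, 0, 0);
};
worker.postMessage({ width: 300, height: 40, text: "pre-rendered label" });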

While the use of canvas APIs is usually very straightforward, one must be aware of two extra caveats.

First of all, in the case of many techniques mentioned above, there is no guarantee that the browser will perform actual rasterization at the given point in time. To ensure the rasterization is triggered, it’s usually necessary to enforce it using e.g. a dummy readback (getImageData()).
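
For the detached or invisible canvas variants, the dummy readback suggested above might look like this (the 1×1 readback size is an arbitrary choice for the sketch):

// Pre-render into a canvas that is not attached to the DOM yet.
const canvas = document.createElement("canvas");
canvas.width = 300;
canvas.height = 40;
const ctx = canvas.getContext("2d");
ctx.fillText("pre-rendered label", 10, 24);

// Dummy readback to encourage rasterization now rather than at first paint.
ctx.getImageData(0, 0, 1, 1);

// Later, at the chosen moment, attach the canvas so only compositing remains.
document.body.appendChild(canvas);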

Finally, one should be aware that the usage of canvas comes with some overhead. Therefore, creating many canvases, or creating them often, may lead to performance problems that could outweigh the gains from the pre-rendering itself.

Pre-rendering using eventually-invisible layers #

The second group of pre-rendering techniques happening during web application runtime is limited to DOM rendering and comes out of a combination of purposeful spec misuse and tricking the browser engine into rasterizing on demand. As one can imagine, this group of techniques is very much browser-engine-specific. Therefore, it should always be backed by proper benchmarking of all the use cases on the target browsers and target hardware.

In principle, all the techniques of this kind consist of 3 parts:

  1. Ensuring the content to be pre-rendered is placed on a separate layer backed by an actual buffer internally in the browser,
  2. Tricking the browser’s compositor into thinking that the layer needs to be rasterized right away,
  3. Ensuring the layer won’t be composited eventually.

When all the elements are combined, the browser engine will allocate an internal buffer (e.g. a texture) to back the given DOM fragment, process that fragment (style recalc, layout), and rasterize it right away. It will do so because it does not have enough information to allow delaying the rasterization of the layer (as it does e.g. in the case of display: none). Then, when compositing time comes, the layer will turn out to be invisible in practice due to e.g. being occluded, clipped, etc. This way, the rasterization happens right away, but the results remain invisible until a later time when the layer is made visible.

In practice, the following approaches can be used to trigger the above behavior:

  • for (1), the CSS properties such as will-change: transform, z-index, position: fixed, overflow: hidden etc. can be used depending on the browser engine,
  • for (2) and (3), the CSS properties such as opacity: 0, overflow: hidden, contain: strict etc. can be utilized, again, depending on the browser engine.

The scrolling trick

While the above CSS properties allow for various combinations, in the case of WPE WebKit on embedded devices (tested on an NXP i.MX8M Plus), the combination that has proven to yield the best performance benefits turns out to be a simple approach involving overflow: hidden and scrolling. An example of this approach is explained below.

Suppose the goal of the application is to update a big table with numbers once every N frames — like in the following demo: random-numbers-bursting-in-table.html?cs=20&rs=20&if=59

Bursting numbers demo.

With the number of idle frames (if) set to 59, the idea is that the application does nothing significant for 59 frames, and then on every 60th frame it updates all the numbers in the table.

As one can imagine, on constrained embedded devices such an approach leads to a very heavy workload on every 60th frame, and hence to dropped frames and an unstable application FPS.

As long as the numbers are available earlier than every 60th frame, the above application is a perfect example where pre-rendering could be used to reduce the peak workload.

To simulate that, the 3 variants of the approach involving the scrolling trick were prepared for comparison with the above:

In the above demos, the idea is that each cell with a number becomes a scrollable container that actually holds 2 numbers — one above the other. Because overflow: hidden is set, only one of the numbers is visible while the other is hidden, depending on the current scroll position:

Scrolling trick explained.

With such a setup, it’s possible to update the invisible numbers during idle frames without the user noticing. Due to how WPE WebKit accelerates scrolling, changing the invisible numbers in practice triggers layout and rendering right away. Moreover, the actual rasterization to the buffer backing the scrollable container happens immediately (depending on the tiling settings), and hence the high cost of layout and text rasterization can be distributed. When the time comes and all the numbers need to be updated, the scrollable containers can simply be scrolled, which in this case turns out to be ~2 times faster than updating all the numbers in place.
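
A stripped-down sketch of the trick (the element structure and class names are illustrative, not taken from the demos): each cell is a fixed-height container with overflow: hidden holding two stacked values, the hidden one is updated during idle frames, and the update frame only scrolls:

// Assumes each cell is a fixed-height, overflow: hidden element containing
// two stacked children of the same height: the visible and the hidden value.
function prepareCell(cell, nextValue) {
  // Done during idle frames; with accelerated scrolling this triggers layout
  // and rasterization into the container's backing buffer right away.
  cell.querySelector(".hidden-value").textContent = String(nextValue);
}

function revealAll(cells) {
  // On the update frame, only scroll; no text layout or rasterization is left.
  for (const cell of cells) {
    cell.scrollTop = cell.scrollTop === 0 ? cell.clientHeight : 0;
  }
}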

To better understand the above effect, it’s recommended to compare the mark views from sysprof traces of the random-numbers-bursting-in-table.html?cs=10&rs=10&if=11 and random-numbers-bursting-in-table-prerendered-1.html?cs=10&rs=10&if=11 demos:

Sysprof from basic demo.



Sysprof from pre-rendering demo.

While the first sysprof trace shows very little processing during the 11 idle frames and a big chunk of processing (21 ms) every 12th frame, the second sysprof trace shows how the load gets distributed. In that case, the amount of work during the 11 idle frames is much bigger (yet manageable), but at the same time the formerly big chunk of processing every 12th frame is reduced almost 2 times (to 11 ms). Therefore, the overall frame rate of the application is much better.

Results

The above improvement speaks for itself, but it’s worth summarizing it with the benchmarking results of the above demos, obtained on the NXP i.MX8M Plus and presenting the application’s average frames per second (FPS):

Benchmarking results.

Clearly, the positive impact of pre-rendering can be substantial depending on the conditions. In practice, when the rendered DOM fragment is more complex, a trick such as the above can yield even better results. However, due to how tiling works, the effect can be diminished if the content to be pre-rendered spans multiple tiles; in that case, the browser may defer rasterization until the tiles are actually needed. Therefore, the above needs to be used with care and always with proper benchmarking.

Conclusions #

As demonstrated in the above sections, when it comes to pre-rendering content to distribute the web application workload over time, the web platform offers both official APIs and unofficial means through purposeful misuse of APIs and exploitation of browser engine implementations. While this article hasn’t covered all the possibilities available, the above should serve as a good initial read with some easy-to-try solutions that may yield surprisingly good results. However, as some of the ideas mentioned above are very much browser-engine-specific, they should be used with extra care and with their limitations (lack of portability) in mind.

As the web platform constantly evolves, the pool of pre-rendering techniques and tricks should keep evolving as well. And as more and more web applications are used on embedded devices, more pressure should be put on the specifications, which should yield more APIs targeting low-end devices in the future. With that in mind, readers are encouraged to stay up-to-date with the latest specifications and perhaps even get involved if some interesting use cases are worth introducing new APIs for.

December 19, 2025 12:00 AM

December 18, 2025

Delan Azabani

Web engine CI on a shoestring budget

Servo is a greenfield web browser engine that supports many platforms. Automated testing for the project requires building Servo for all of those platforms, plus several additional configurations, and running nearly two million tests including the entire Web Platform Tests. How do we do all of that in under half an hour, without a hyperscaler budget for compute and an entire team to keep it all running smoothly, and securely enough to run untrusted code from contributors?

We’ve answered these questions by building a CI runner orchestration system for GitHub Actions that we can run on our own servers, using ephemeral virtual machines for security and reproducibility. We also discuss how we implemented graceful fallback from self-hosted runners to GitHub-hosted runners, the lessons we learned in automating image rebuilds, and how we could port the system to other CI platforms like Forgejo Actions.

This is a transcript post for a talk I gave internally at Igalia.


Web engine CI on a shoestring budget
delan azabani (she/her)
azabani.com
November 2025

Let's talk about how Servo can have fast CI for not a lot of money.


Servo's situation
- Servo currently uses GitHub Actions (GHA) quite heavily
    - Many platforms: Linux, Windows, macOS, Android, and OpenHarmony
    - Many tests: Web Platform Tests (50K+ tests, 1.8M+ subtests), WebGPU CTS, devtools, unit tests…
    - Many configurations: MSRV, libservo, linting, release, benchmarking…
- GHA is a frustrating CI service with baffling limitations

Servo is a greenfield web browser engine. And being a web browser engine, it has some pretty demanding requirements for its CI setup.

We have to build and run Servo for many platforms, including three desktop platforms and two mobile platforms.

We have to run many, many tests, the main bulk of which is the entire Web Platform Tests suite, which is almost 2 million subtests. We also have several smaller test suites as well, like the WebGPU tests and the DevTools tests and so on.

We have to build Servo in many different configurations for special needs. So we might build Servo with the oldest version of Rust that we still support, just to make sure that still works. We might build Servo as a library in the same way that it would be consumed by embedders. We have to lint the codebase. We have to build with optimizations for nightly and monthly releases. We have to build with other optimizations for benchmarking work, and so on.

And as you'll see throughout this talk, we do this on GitHub and GitHub Actions, but GitHub Actions is a very frustrating CI service, and it has many baffling limitations. And as time goes on, I think Servo being on GitHub and GitHub Actions will be more for the network effects we had early on than for any particular merits of these platforms.


GitHub-hosted runners
- GitHub provides their own runners
- Essential for glue logic and small workloads
- Painful for building and testing a browser engine
    - The gratis runners are tiny and slow
    - The paid runners are very expensive
    - Need more tools or deps? Install them every time
    - Caching is a joke, not suitable for incremental builds

On GitHub Actions, GitHub provides their own first-party runners.

And these runners are very useful for small workloads, as well as the logic that coordinates workloads. So this would include things like taking a tryjob request for "linux", and turning that into a run that just builds Servo for Linux. Or you might get a try job request for "full", and we'll turn that into a run that builds Servo for all of the platforms and runs all of the tests.

But for a project of our scale, these runners really fall apart once you get into any workload beyond that.

They have runners that are free of charge. They are very tiny, resource constrained, and it seems GitHub tries to cram as many of these runners onto each server as they possibly can.

They have runners that you can pay for, but they're very very expensive. And I believe this is because you not only pay a premium for like hyperscaler cloud compute rates, but you also pay a premium on top of that for the convenience of having these runners where you can essentially just flip a switch and get faster builds. So they really make you pay for this.

Not only that, but using GitHub hosted runners you can't really customize the image that runs on the runners besides being able to pull in containers, which is also kind of slow and not useful all the time. If there are tools and dependencies that you need that aren't in those images, you need to install them every single time you run a job or a workflow, which is such a huge waste of time, energy, and money, no matter whose money it is.

There are also some caching features on GitHub Actions, but they're honestly kind of a joke. The caching performs really poorly, and there's not a lot you can cache with them. So in general, they're not really suitable for doing things like incremental builds.


Alternatives considered
- SaaS providers: expensive, often no Win and/or macOS
- RunsOn: less expensive, AWS-only, no macOS support
- Proxy jobs on gratis runners: still uses up the concurrency quota
- Operating an entire CI service (Jenkins, etc.): no VM orchestration out of the box, needs dedicated personnel

So we have all these slow builds, and they're getting slower, and we want to make them faster. So we considered several alternatives.

The first that comes to mind are these third-party runner providers. These are things like Namespace, Warpbuild, Buildjet. There's so many of them. The caveat with these is that they're almost always... almost as expensive per hour as GitHub's first-party runners. And I think this is because they try to focus on providing features like better caching, that allow you to accumulate less hours on their runners. And in doing so, they don't really have any incentive to also compete on the hourly rate.

There is one exception: there's a system called RunsOn. It's sort of an off-the-shelf thing that you can grab this software and operate it yourself, but you do have to use AWS. So it's very tightly coupled to AWS. And both of these alternatives, they often lack support for certain platforms on their runners. Many of them are missing Windows or missing macOS or missing both. And RunsOn is missing macOS support, and probably won't get macOS support for the foreseeable future.

We considered offloading some of the work that these free GitHub hosted runners do onto our own servers with, let's call them like, "proxy jobs". The idea would be that you'd still use free GitHub hosted runners, but you'd do the work remotely on another server. The trouble with this is that then you're still using these free GitHub hosted runners, which take up your quota of your maximum concurrent free runners.

And it's also tricky to do this without losing access to the broader ecosystem of prebuilt Actions. So these Actions are like steps that you can pull in that will let you do useful things like managing artifacts and installing dependencies and so on. But it's one thing to say, okay, my workload is this shell script, and I'm going to run it remotely now. That's easy enough to do, but it's a lot harder to say, okay, well, I've got this workflow that has a bunch of scripts and a bunch of external Actions made by other people, and I'm going to run all of this remotely. I'm going to make sure that all of these Actions are also compatible with being run remotely. That's a lot trickier. That said, you should probably avoid relying too heavily on this ecosystem anyway. It's a great way to get locked into the platform, and working with YAML honestly really sucks.

So we could set up an entire CI service. There are some CI services like Jenkins, Bamboo... honestly, most CI services nowadays have support for built-in container orchestration. So they can spin up containers for each runner, for each job, autonomously. And while they can do containers, none of them have really solved the problem of virtual machine orchestration out of the box. And this is a problem for us, because we want to use virtual machines for that security and peace of mind, which I'll explain in more detail a little bit later. And seeing as we lack the dedicated personnel to operate an entire CI service and have that be on the critical path — we didn't want to have someone on call for outages — this was probably not the best option for us at the moment.


Self-hosted runners
- Self-hosted runners are better!
- Give the runners as much RAM and CPU as we want
- Custom build environments tailored to the project
    - Bake in whatever tools we want
    - Bake in a prebuilt Servo for incremental builds

What we decided to do was set up some self-hosted runners.

These solve most of our problems, primarily because we can throw as much hardware and resources at these runners as we want. We can define the contention ratios.

And better yet, we can also customize the images that the runners use. By being able to customize the runner images, we can move a lot of steps and a lot of work that used to be done on every single workflow run, and do it only once, or once per build. We only rebuild the runner images, say, once a week. So that significantly saves... it cuts down on the amount of work that we have to do.

This is not just installing tools and dependencies, but it's also enabling incremental builds quite powerfully, which we'll see a little bit later on.


How much faster?
- mach try full workflow: 61m30s → 25m47s (−58%)
- linux-unit-tests job: 34m29s → 3m15s (−90%)
- windows-unit-tests job: 59m14s → 8m4s (−86%)
- lint job: 11m54s → 2m25s (−79%)
- wpt jobs: 25m35s → 20m50s (−18%)
    - But we also went from 20 runners → 3 runners

How much faster is this system with the self-hosted runners? Well, it turns out, quite a lot faster.

We've got this typical workflow that you use when you're testing your commits, when you're making a pull request. And if you kick off a tryjob like this, several of the jobs that we've now offloaded onto self-hosted runners are now taking 70, 80, even 90% less time than they did on GitHub hosted runners, which is excellent.

And even for the web platform tests, we found more modest time savings with the capacity that we've allocated to them. But as a result, even though the savings are a bit more modest, what's worth highlighting here is that we were able to go from twenty runners — we had to parallelise this test suite across twenty runners before — and now we only need three.


What makes our system unique
- Augments the built-in CI service of the forge
- Almost transparent user experience
- Linux, Windows, and macOS runners
- Graceful fallback to GitHub-hosted runners
- Secure enough for a large public project
- Completely self-hosted, so it's dirt cheap^

^ operating expenses, not necessarily labour

Some things that make our system unique and that we're pretty proud of — and these things apply to all versions of our system, which we've been developing over the last 12 to 18 months.

The first is that we build on top of the native CI service of the Git forge. So in this case, it's GitHub and GitHub Actions. It could be Forgejo and Forgejo Actions in the future, and we're working on that already.

We also want to give users more or less a transparent user experience, the idea being that users should not notice any changes in their day-to-day work, besides the fact that their builds have gotten faster. And I think for the most part, we have achieved that.

Our system supports all of the platforms that GitHub Actions supports for their runners, including Linux, Windows, and macOS, and we could even support some of the other platforms that Forgejo Actions supports in the future, including BSD.

We have the ability to set up a job so that it can try to use self-hosted runners if they're available, but fall back to GitHub hosted runners if there's none available for whatever reason, like maybe they're all busy for the foreseeable future, or the servers are down or something like that. We have the ability to fall back to GitHub hosted runners, and this is something that was quite complicated to build, actually. And we have a whole section of this talk explaining how that works, because this is something that's not actually possible with the feature set that GitHub provides in their CI system.

It's secure enough for a large public project like Servo, where we don't necessarily know all of our contributors all that well personally. And this is in large part because we use virtual machines, instead of just containers, for each run and for each job. My understanding is that it is possible to build a system like this securely with containers, but in practice it's a lot harder to get that right as a security boundary than if you had the benefit of a hypervisor.

And of course it is all completely self-hosted, which makes it about as cheap as it gets, because your operating expenses are really just the costs of bare metal compute.


Completely self-hosted
so it's dirt cheap^
- We spend 312 EUR/month on general-purpose runners
- On comparable GitHub runners: 1421–2500 EUR/month
- On comparable third-party runners: 503–1077 EUR/month

^ operating expenses, not necessarily labour

Now, how cheap is that? Well, in our deployment in Servo, we spend about 300 EUR per month on servers that do general-purpose runners, and these handle most of our workload.

If we were to compare that to if we were running on GitHub-hosted runners, there'd be almost an order of magnitude increase in costs, somewhere like 1400 EUR to over 2000 EUR per month if we were doing the same work on GitHub-hosted runners.

There are also significant increases if we went with third-party runner providers as well, although not quite as much.

But in truth, this is actually kind of an unfair comparison, because it assumes that we would need the same amount of hours between if we were running on self-hosted runners or if we were running on GitHub hosted runners. And something that you'll see throughout this talk is that this is very much not the case. We spend so many fewer hours running our jobs, because we have to do so much less work on them. So in reality, the gap between our expenses with these two approaches would actually be a lot wider.


Three ways to use runners
- Mandatory self-hosted: (with natural queueing)
    - runs-on: self-hosted-image:servo-ubuntu2204
- Graceful fallback to GitHub-hosted:
    - Decision job: POST https://monitor/select-runner
    - runs-on: ${{ needs.decision.outputs.label }}
- Graceful fallback plus queueing:
    - Decision job: POST https://queue/select-runner
    - runs-on: ${{ needs.decision.outputs.label }}

There are three ways to use our CI runner system.

The first one is the way that GitHub designed self-hosted runners to be used. The way they intend you to use self-hosted runners is that when you define a job, you use this runs-on setting to declare what kind of runners your job should run on. And you can use labels for GitHub hosted runners, or you can define arbitrary labels for your own runners and select those instead. But you have to choose one or the other in advance. And if you do that, you can't have any kind of fallback, which was a bit of a drawback for us, especially early on. One benefit of this, though, is that you get the natural ability to queue up jobs. Because if, at the time of queuing up a new job, if there's no self-hosted runners available yet, the job will stay in a queued state until a self-hosted runner becomes available. And it's nice that you essentially get that for free. But you have no ability to have this kind of graceful fallback.

So we built some features to allow you to do graceful fallback. And how this works is that each of the servers that operates these runners has a web API as well. And you can hit that web API to check if there are runners available and reserve them. Reserving them is something you have to do if you're doing graceful fallback. But I'll explain that in more detail a bit later on. And in doing so, because you have a job that can check if there are runners available, you can now parameterize the runs-on setting and decide "I want to use a self-hosted runner, or a GitHub hosted runner". It's unclear if this is going to be possible on Forgejo Actions yet, so we may have to add that feature, but it's certainly possible on GitHub Actions.

Now, the downside of this is that you do lose the ability, that natural ability, to queue up jobs, and I'll explain why that is a bit later. But in short, we have a queue API that mitigates this problem, because you can hit the queue API, and it can have a full view of the runner capacity, and either tell you to wait, or forward your request once capacity becomes available.


Faster checkouts
- No repo checkout unless you run actions/checkout
    - But servo/servo has 130K+ files and 6K+ directories
    - This is especially slow on Windows and macOS
- Bake a repo checkout into every runner image
- Replace actions/checkout with our own action:
    git fetch --depth=1 origin $commit
    git switch --detach
    git reset --hard FETCH_HEAD

Some things that you can do with our system, that you can't do with GitHub hosted runners. One of them is check out the repo significantly faster.

Something about GitHub Actions and how it works is that, if you run a job, you run a workflow in a repo, you don't actually get a checkout of the repo. You don't get a clone of the repo out of the box, unless you explicitly add a step that does a checkout. And this is fine for the most part, it works well enough for most users and most repos.

But the Servo repo has over 130,000 files across 6,000 directories, and that's just the tracked files and directories. And as a result, even if we use things like shallow clones, there's just kind of no getting around the fact that cloning this repo and checking it out is just slow. It's unavoidably slow.

And it's especially slow on Windows and macOS, where the filesystem performance is honestly often pretty poor compared to Linux. So we want to make our checkouts faster.

Well, what we can do is, we can actually move the process of cloning and checking out the repo from the build process, and move that into the image build process. So we only do it once, when we're building the runner images.

Then what we can do is go into the jobs that run on self-hosted runners, and switch out the stock checkout action with our own action. And our own action will just use the existing clone of the repo, it will fetch the commit that we actually need to build on, then switch to it, and check it out.

And as a result, we can check out the Servo repo pretty reliably in a couple of seconds, instead of having to check out the entire repo from scratch.


Incremental builds
- Cargo supports incremental builds
- We're now baking a repo checkout into every image
- Why not build Servo and bake that in too?
- Not perfect — some compilation units get false rebuilds
- Probably don't use this for release artifacts

Something that flows on from that, though, is that if we are baking a copy of the Servo repo into our runner images, well, what if we just bake a copy of the built artifacts as well? Like, why don't we just build Servo, and bake that into the image?

And this will allow us to do incremental builds, because Servo is a Rust project, and we use Cargo, and Cargo supports incremental builds. As a result, by doing this, when you run a job on our CI system, most of the time we only have to rebuild a handful of crates that have changed, and not have to rebuild all of Servo from scratch.

Now, this is not perfect. Sometimes we'll have some compilation units that get falsely rebuilt, but this works well enough, for the most part, that it's actually a significant time savings.

I also probably wouldn't trust this for building artifacts that you actually want to release in a finished product, just because of the kinds of bugs that we've seen in Cargo's incremental build support.

But for everything else, just your typical builds where you do like commit checks and such, this is very very helpful.


Servo's deployment
- Five servers on Hetzner^
    - 3x AX102 (Zen 4 16c/32t, 128G RAM) = 312 EUR/month
    - 2x AX42 (Zen 3 8c/16t, 64G RAM) = 92 EUR/month
- NixOS + libvirt + KVM + ZFS
- Custom orchestration

^ not including OpenHarmony runners

Servo has this system deployed, and it's had it deployed for at least the last year or so.

Nowadays we have three servers which are modestly large, and we use these for the vast majority of our workload. We also have two smaller servers that we use for specific benchmarking tasks.

The stack on these servers, if you could call it that, is largely things that I personally was very familiar with, because I built this. So we've got NixOS for config management, the hypervisor is libvirt and Linux KVM, and the storage is backed by ZFS. The process of actually building the images, and managing the lifecycle of the virtual machines though, is done with some custom orchestration tooling that we've written.


How does it work?
- Monitor service orchestrates the runners
    - Rebuilds virtual machine images
    - Spawns virtual machines for runners
    - Registers runners with CI service API
    - Labels runners when asked to reserve them
      (optional, but required for graceful fallback)
- Queue service allows queueing with fallback (optional)

This tooling consists of two services. The monitor service, which runs on every server that operates self-hosted runners, and the queue service, which is single and global.

So the monitor service does the vast majority of the work. It rebuilds virtual machine images, these templates. It clones the templates into virtual machines for each runner, for each job. It registers the runners with the CI service using its API, so that it can receive jobs. And it can also label the runners to tie them to specific jobs when asked to reserve them. This is optional, but it is a required step if we're doing graceful fallback.

Now, with graceful fallback, you do lose the ability to naturally queue up jobs. So what we've put on top of that is a single queue service that sits in front of the cluster, and it essentially acts as a reverse proxy. It's quite thin and simple, and there's a single one of them, not one per server, for the same reason as the general principle of, like, when you go to a supermarket, it's more efficient to have a single large queue, a single long queue of customers that gets dispatched to a bunch of shopkeepers. That's more efficient than having a sort of per-shopkeeper queue, especially when it comes to, like, moving jobs dynamically in response to the availability.


Graceful fallback

So we've got a whole section here about how graceful fallback works. I might cut this from shorter versions of the talk, but let's jump into it.


Graceful fallback
- Every job has to choose a runner label in advance
    runs-on: ubuntu-latest     # GitHub-hosted
    runs-on: self-hosted-image:servo-ubuntu2204
- Once you choose the runner label, there's no turning back
- Borrowing a built-in label does not prioritise self-hosted runners over GitHub-hosted runners
- So there's no way to fall back… or is there?

On GitHub Actions, every job has this runs-on setting, that defines what kind of runner it needs to get assigned to. It has to define this in advance, before the job runs.

And annoyingly, once you choose, "I want to run on a GitHub hosted runner" or "I want to run on a self-hosted runner", once you decide that, there's no turning back. Now, if you've decided that my job needs to run on a self-hosted runner, it can only run on a self-hosted runner. And now the outcome of your job, and the outcome of your workflow, now depends on that job actually getting a self-hosted runner with the matching criteria and succeeding. There's no way for that to become optional anymore.

And you might say, okay, well, maybe I can work around this by spinning up my self-hosted runners, but just giving them the same labels that GitHub assigns to their runners, maybe it'll do something smart? Like maybe... it will run my self-hosted runners if they're available and fall back. But no, the system has no sense of priority and you can't even define any sense of priority of like, I want to try this kind of runner and fall back to another kind of runner. This is simply not possible with the GitHub Action system.

So it may seem like it's impossible to do graceful fallback, but we found a way.


Decision jobs
- Prepend a job that chooses a runner label
    1.  let label = runner available?
           | yes => [self-hosted label]
           | no  => [GitHub-hosted label]
    2.  $ echo "label=${label}" | tee -a $GITHUB_OUTPUT
- Use the step output in runs-on
    runs-on: ${{ needs.decision.outputs.label }}
- But two decisions must not be made concurrently

What we can do is add a decision job, for each workload job that may need to run on self-hosted runners. And we prepend this job, and its job is to choose a runner label.

So how it essentially works is: it checks if a runner is available, and based on that, either chooses a self-hosted label or a GitHub hosted label. And it chooses it and sets it in an output. And this output can get pulled in... in our workload job, the job that actually does our work, because now you can parameterize the runs-on setting, so that it takes the value from this previous decision job.
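As a rough illustration, a minimal pair of jobs might look something like the sketch below. The monitor endpoint, its URL, the query parameters, and the build step are placeholders I've made up for the sketch, not the real servo/ci-runners API or workflow.

jobs:
  decision:
    runs-on: ubuntu-latest
    outputs:
      label: ${{ steps.choose.outputs.label }}
    steps:
      - id: choose
        run: |
          # Ask the runner infrastructure whether a self-hosted runner is free.
          # MONITOR_URL and the query parameters are illustrative placeholders.
          if curl -fsS "$MONITOR_URL/select-runner?profile=servo-ubuntu2204" > /dev/null; then
            label="self-hosted-image:servo-ubuntu2204"
          else
            label="ubuntu-latest"
          fi
          echo "label=${label}" | tee -a "$GITHUB_OUTPUT"

  build:
    needs: decision
    # The workload job runs on whatever label the decision job chose.
    runs-on: ${{ needs.decision.outputs.label }}
    steps:
      - run: echo "building on ${{ runner.name }}"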

Unfortunately, it seems like this may not be possible in Forgejo Actions just yet, so we might have to develop support for that, but it's certainly possible in GitHub Actions today, and it has been for quite a while.

The one caveat with this approach is that you need to be careful to only do this decision process one at a time. You should not do two instances of this process concurrently and interleave them.


Decisions must be serialised
- Stack Overflow and GitHub answers are inherently racy:
    any idle runners? —(TOCTOU)→ commit to self-hosted
- We can reserve a runner by applying a unique label to it!
    - GH API: add custom label: reserved-for:<UUIDv4>
    -   runs-on: ${{ needs.decision.outputs.label }}
        runs-on: 6826776b-5c18-4ef5-8129-4644a698ae59
- Initially do this directly in the decision job (servo#33081)

The reason for this is something you'll see if you think about a lot of the answers that you get on Stack Overflow and the GitHub forums. If you look up, "how do I solve this problem? How do I define a job where I can fall back from self-hosted runners to GitHub hosted runners?"

And most of the answers there have a problem: they check whether there are any runners available, and then they make the decision, committing to either a self-hosted runner or a GitHub-hosted runner. The trouble is that in between, if another decision job comes in and tries to make the same kind of decision, they can end up "earmarking" the same runner for two jobs. But each runner can only run one job, so one of the jobs will get left without a runner.

So we can start to fix this by actually reserving the runners when we're doing graceful fallback. And how we've done it so far, is that we've used the GitHub Actions API to label the runner when we want to reserve it, and we label it with a unique ID. Then the workload job can put that unique ID, that label, in its runs-on setting. Instead of a general runner label, it can tie itself to this specific, uniquely identified runner label.
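For instance, a reservation step along these lines could add the one-off label through GitHub's REST API. RUNNER_ADMIN_TOKEN and the runner-id output are placeholders for a token with runner-administration scope and for whichever runner the decision logic picked; this is a sketch of the idea, not the exact code Servo used.

- id: reserve
  env:
    GH_TOKEN: ${{ secrets.RUNNER_ADMIN_TOKEN }}   # privileged; see the next slide
    RUNNER_ID: ${{ steps.choose.outputs.runner-id }}
  run: |
    unique_id="$(uuidgen)"
    # Attach a unique custom label to the chosen self-hosted runner so that
    # only our workload job can be scheduled onto it.
    curl -fsS -X POST \
      -H "Authorization: Bearer ${GH_TOKEN}" \
      -H "Accept: application/vnd.github+json" \
      "https://api.github.com/repos/${GITHUB_REPOSITORY}/actions/runners/${RUNNER_ID}/labels" \
      -d "{\"labels\":[\"reserved-for:${unique_id}\"]}"
    echo "label=reserved-for:${unique_id}" | tee -a "$GITHUB_OUTPUT"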

And we did it this way initially because it allowed us to do everything inside the decision job. I think in the future we will have to move away from this, because on Forgejo Actions the way runner labels work is quite different. They're not something you can update after the fact; in fact, they're kind of defined by the runner process. So this approach for reserving runners won't work on Forgejo Actions, and we'll probably have to do the reservation internally on the runner servers. But in the meantime, we use labelling.


Decisions must be serialised
- Labelling runners requires a privileged GitHub API token
- Even with reservation, decisions must still be serialised:
    runner not yet reserved? —(TOCTOU)→ label the runner
    - But hey, at least we have job concurrency… right?
    - Wrong: runs will fail under even modest contention :(

There are some problems with this. One of them is that now the decision job has to have a GitHub token with permissions to manage runners and update their labels. That's something we'd like to avoid if possible, because we'd like our jobs to have the least privilege they need to do their work.

And something I didn't mention is that reserving runners in this way doesn't actually solve the problem on its own, because you've just transformed it: now we check whether there's an unreserved runner, and then we label it. But in between, there's a moment where another process doing the same thing could make the same decision. And as a result, we could end up with a situation where one runner gets assigned two unique labels, but it can only fulfill one of them. So we have the same problem again.

So you might say, okay, well, it looks like GitHub Actions has this neat job concurrency feature. I mean, they say you can use this to define a job in a way where only one of them will run at a time, and you can't run them concurrently, so let's try using that to solve this problem.

What you'll find, though, is that if you try to solve the problem with job concurrency, as soon as there's even the slightest bit of contention, you'll just have your runs starting to fail spuriously, and you'll have to keep rerunning your jobs, and it'll be so much more annoying.

And the reason for this is that, if you look more closely at the docs, job concurrency essentially has a queue limited to one job. So at a maximum, you can have one job, one instance of the job that's running, one instance of the job that's queued, and then if another instance of the job comes in while you have one running and one queued, then those extra jobs just get cancelled, and the build fails. So unfortunately, job concurrency does not solve this problem.
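To make the limitation concrete, this is roughly the kind of configuration you might be tempted to write (the group name is chosen arbitrarily for the sketch), and why it isn't enough:

jobs:
  decision:
    runs-on: ubuntu-latest
    # Tempting, but not sufficient: the "queue" behind a concurrency group is
    # only one job deep. One instance can run and one can wait; any further
    # instances end up cancelled rather than queued, so runs start failing
    # as soon as there is even modest contention.
    concurrency:
      group: runner-decision
      cancel-in-progress: false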


Decisions must be serialised
- So move decisions out of the decision jobs (servo#33315)
- But what happens if reserved runners fail to materialise?
- You can limit in_progress time in GHA, but not queued time

So to solve these problems, what we did is we moved that decision and runner reservation process out of the decision jobs, and into the servers that operate the runners themselves. And we do this with an API that runs on the servers.

One smaller problem you might notice though, is that there's still a possibility that you could reserve a runner, but then after you've reserved the runner, it might fail to actually run. And this has become a lot less likely in our system, in our experience nowadays, now that we've ironed out most of the bugs, but it can still happen from time to time, usually due to infrastructure or network connectivity failures.

And we wanted to solve this by setting a time limit on how long a job can be queued for, because if it can't actually get a runner in practice, it will get stuck in that queued state indefinitely. But unfortunately, while we can set a time limit on how long a job can run for once it's been assigned, we can't actually set a time limit on how long a job can be queued for.
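For concreteness, the only time limit GitHub Actions lets you express at the job level looks like the sketch below (job and script names are placeholders); it only counts time spent running, not time spent queued:

  build:
    needs: decision
    runs-on: ${{ needs.decision.outputs.label }}
    # Bounds how long the job may run once a runner has picked it up.
    # There is no equivalent setting for how long it may sit queued.
    timeout-minutes: 90
    steps:
      - run: ./do-the-actual-work.sh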

So we have to rely on the default built-in limit for all jobs, where I think there's like a limit of a day or two? So like, 24 or 48 hours or so, for how long a job can be queued? And this is just a really long time. So as a result, whenever this happens, you essentially have to cancel the job manually, or if you don't have permission to do that, you have to go ask someone who has permission, to go and cancel your job for you, which is really annoying.


Timeout jobs
- Watchdog for your workload job, ensuring it gets a runner
    1.  Wait a short amount of time (e.g. 120 seconds)
    2.  Query the CI service API for the workload job
    3.  If the job is still queued, cancel the run
- Only run this when you actually use a self-hosted runner:
    if: ${{ fromJSON(needs.decision.outputs.is-self-hosted) }}

So we solved that using timeout jobs. A timeout job is a sibling to your workload job; it acts like a watchdog and ensures that the workload job actually got a runner when you expected it to get a self-hosted one.

And how that works is, we wait a short amount of time, just like a minute or two, which should be long enough for the runner to actually start running the job, and then we query the API of the CI service to check if the workload job actually started running, or if it's still in that queued state. If it's still queued after two minutes, we cancel the run.

Unfortunately, we can't just cancel the job run. We do have to cancel the whole workflow run, which is quite annoying. But, you know, it's GitHub, nothing is surprising anymore.

Thankfully, we only have to run this when we actually get a self-hosted runner, so we can make it conditional. But over time, in Servo's deployment, we have actually stopped using these, to free up some of those GitHub-hosted runner resources.


Uniquely identifying jobs
- How do we know the run id of the workload job?
    - Jobs can be instantiated many times via workflow calls
- The only supported job relationship is needs
    - Workload job needs decision job
    - Timeout job needs decision job
    - Timeout job can't needs workload job
- needs relationships are not exposed in the API

One challenge that comes in making these timeout jobs is identifying the workload jobs uniquely, so we can look up whether it's still queued or whether it's started running. There are unique IDs for each job run. These are just like an incrementing number, and you'd think we'd be able to use this number to look up the workload job uniquely and robustly.

Unfortunately, you can't know the run ID of the job [correction 2025-12-24] until it starts, and it may not ever start... or at least you may not know it until the workflow runs, and there can be many instances of the job in the workflow because of workflow calls. Workflow calls are a feature that essentially allows you to inline the contents of a workflow in another as many times as you like. And as a result, you can have multiple copies, multiple instances of a job that run independently within one workflow run. So we definitely need a way to uniquely look up our instance of the workload job.

The trouble is that the only job relationship you can do in GitHub Actions is a needs relationship, and that's inappropriate for our situation here, because we can say that the workload job needs the decision job, we can say the timeout job needs the decision job — and in fact we do both of these, we "need" to do both of these — but we can't say that the timeout job needs the workload job, because of how needs works.

How needs works is that if job A needs job B, then job B has to actually get assigned a runner, and run, and complete its run — it has to finish — before job A can even start. And in this situation, we're making a timeout job to catch situations where the workload job never ends up running, so if we expressed a needs relationship between them, then the timeout job would never run, in these cases at least.

And even if we could express a needs relationship between jobs, like maybe we could walk the job tree, and go from the timeout job, through the decision job via the needs relationship, and then walk back down to the workload job using the same kind of needs relationship... unfortunately, none of these needs relationships are actually exposed in the API for a running workflow. So like, they're used for scheduling, but when you actually go and query the API, you can't tell what job needs what other job. They're just three jobs, and they're unrelated to one another.


Uniquely identifying jobs
- Tie them together by putting the <UUIDv4> in the name:
    name: Linux [${{ needs.decision.outputs.unique-id }}]
    name: Linux [6826776b-5c18-4ef5-8129-4644a698ae59]
- Query the CI service API for all jobs in the workflow run
- Check the status of the job whose name contains
    [${{ needs.decision.outputs.unique-id }}]
- Yes, really, we have to string-match the name :)))

So how we had to end up solving this is, we had to tie these jobs together by generating a unique ID, a UUID, and putting it in the friendly name, like the display name of the job, like this.

And to query the CI service to find out if that job is still queued, we need to query it for the whole workflow run, and just look at all of the jobs, and then find the job whose name contains that unique ID.

Then we can check the status, and see if it's still queued. This is really, really silly. Yes, we really do have to string match the name, which is bananas! But this is GitHub Actions, so this is what we have to do.
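Putting the last few slides together, a timeout job might look roughly like this sketch. The output names (unique-id, is-self-hosted) and the bracketed job name follow the slides; everything else, including the use of the default token, the gh CLI, and the placeholder work script, is an assumption about one way to wire this up rather than the exact Servo implementation.

  build:
    needs: decision
    # Embed the UUID in the display name so the timeout job can find this job.
    name: Linux [${{ needs.decision.outputs.unique-id }}]
    runs-on: ${{ needs.decision.outputs.label }}
    steps:
      - run: ./do-the-actual-work.sh

  timeout:
    needs: decision
    if: ${{ fromJSON(needs.decision.outputs.is-self-hosted) }}
    runs-on: ubuntu-latest
    steps:
      - env:
          GH_TOKEN: ${{ github.token }}   # assumed to have actions: write so it can cancel the run
          UNIQUE_ID: ${{ needs.decision.outputs.unique-id }}
        run: |
          sleep 120
          # List every job in this workflow run and string-match on the name.
          status="$(gh api "repos/${GITHUB_REPOSITORY}/actions/runs/${GITHUB_RUN_ID}/jobs?per_page=100" \
            --jq ".jobs[] | select(.name | contains(\"[${UNIQUE_ID}]\")) | .status")"
          if [ "$status" = "queued" ]; then
            # Individual jobs can't be cancelled, so cancel the whole run.
            gh run cancel "$GITHUB_RUN_ID"
          fi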


Tokenless API
- Monitor API requires access to secrets in the workflow
    - All pull_request_target runs have access to secrets
        - …but you generally don't want to use it anyway
    - Most pull_request runs do not have access to secrets
- How do we prove the request is genuine and authorised, if we can't authenticate with a token?

One thing I didn't mention is that being able to reserve runners needs to be kind of a privileged operation, because we don't just want an arbitrary client on the internet to be able to erroneously or maliciously reserve runners. Even if they may not be able to do a whole lot with those runners, they can still deny service.

So to use the monitor API to do graceful fallback and to request and reserve a runner, normally we would require knowledge of some kind of shared secret, like an API token, and that's what we've done for most of the life of this system.

The trouble with this is that there are many kinds of workflow runs that don't have access to the secrets defined in the repo. A big one is pull_request runs. Most of the time, pull_request runs don't have access to secrets defined in the repo. And there is another kind of run called a pull_request_target run, and those do have access to secrets, but they also have some pretty gnarly security implications that mean that in general, you wanna avoid using these for pull requests anyway.

So if you're stuck with pull_request runs for your pull requests, does that mean that you can't use self-hosted runners? How do we allow pull_request runs to request and reserve runners in a way that it can prove that its request is genuine and authorized?


Tokenless API
- Upload an artifact representing the request!
- Hit the monitor API
    - /select-runner ?unique_id &qualified_repo &run_id
    - (the profile_key is in the artifact)
- Important: delete the artifact, so it can't be reused (and set the minimum auto-delete, in case that fails)

What we do is we use artifacts. We upload a small artifact that encodes the details of the request and publish the artifact against the run. So in the artifact, we'd say, "I want two Ubuntu runners" or "one Windows runner" or something like that.

And then we would hit the monitor API, we hit a different endpoint that just says, go to this repo, go to this run ID, and then check the artifacts! You'll see my request there! And this does not require an API token, it's not a privileged operation. What's privileged is publishing the artifact, and that's unforgeable; the only entity who can publish the artifact is the workflow itself.

All we have to do then, all we have to be careful to do, is to delete the artifact after we reserve the runner, so that it can't be replayed by a malicious client. And in the event that deleting the artifact fails, we can also set the minimum auto-delete period of, I think at the moment it's 24 hours, just in case that fails.
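A sketch of the shape of this, as a fragment of the decision job's steps; the monitor hostname, the request body, and the artifact name are made up for illustration, while the /select-runner parameter names follow the slide.

- name: Write the runner request
  run: |
    # The profile_key travels inside the artifact, not in the URL.
    echo '{"profile_key": "servo-ubuntu2204", "count": 1}' > runner-request.json
    echo "unique_id=$(uuidgen)" >> "$GITHUB_ENV"
- name: Publish it against this run (only the workflow itself can do that)
  uses: actions/upload-artifact@v4
  with:
    name: runner-request-${{ env.unique_id }}
    path: runner-request.json
    retention-days: 1   # minimum auto-delete, in case explicit deletion fails
- name: Point the monitor at the request
  run: |
    curl -fsS "https://monitor.example/select-runner?unique_id=${unique_id}&qualified_repo=${GITHUB_REPOSITORY}&run_id=${GITHUB_RUN_ID}"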


Global queue
- Fallback happens immediately if no runners are available
- But if GitHub-hosted runs take 5x as long as self-hosted, we can wait up to 80% of that time and still win
- Run a queue service that allows jobs to wait for capacity
- Decision jobs hit the queue API instead of the monitor API
- Queue API says "try again later", or forwards the request to a monitor

Graceful fallback normally means that you lose the ability to naturally queue jobs for self-hosted runners. And this happens because when you hit the API, requesting, you know, are there any available runners, please reserve one for me. If there aren't any available at the time of the request, we will immediately fall back to running on a GitHub hosted runner.

But the thing is that our self-hosted runners are generally so much faster! A GitHub-hosted run might take five times as long as the equivalent self-hosted run, and if that's the case, waiting and then running self-hosted still beats going straight to a GitHub-hosted runner as long as the wait stays under 80% of the GitHub-hosted run time, so we'd still probably save time.

So to increase the utilization of our self-hosted runners, what we can do is run a small queue service that sits in front of the monitors, and it essentially acts as a reverse proxy. It will take the same kind of requests for reserving runners as before, but it will have a global view of the availability and the runner capacity at any given moment.

And based on that, it will either respond to the client saying, go away and please try again later, or it will forward the request onto one of the monitors based on the available capacity.


Runner images

We also have some lessons here about how to automate building virtual machine images, essentially. And these are lessons that we've learned over the last year or two.


Runner images
- GitHub uses Packer for their stock runner images
- Our monitor service manages image rebuilds
- Initially kicked off manually, now fully automated (#6)
    - Driven by Rust with reflink copies (#32)
- Mounting images to inject data is no longer viable (#30)
    - macOS has no usable FS with Linux write support
    - Get tools and deps from the monitor's web server (#32)

GitHub Actions uses Packer for their first-party runner images, so they use Packer to build those images. And our monitor service also automates building the images, but we don't use Packer.

Initially, we had a handful of scripts that we just kicked off manually whenever we needed to update our images, but we've now fully automated the process. And we've done this using some modules in our monitor service, so there's some high-level Rust code that drives these image rebuilds, and it even uses reflink copies to take advantage of copy-on-write time savings and space savings with ZFS.

Now, one of the complications we ran into when building images over the past year or so is that we used to pull a sort of clever trick: we would do as much of the configuration of the virtual machine image as possible on the host, rather than doing configuration inside the guest and having to spin up the guest to configure it. We were able to do this by essentially mounting the root file system of the guest image on the host, and then injecting files: tools, scripts, and other things needed by the image and needed to configure it.

But we stopped being able to do this eventually because, well, essentially because of macOS. macOS has no file system that's usable for building Servo that can also be mounted on Linux with write support. Because like, just think about them, right? We've got HFS+, which Linux can write to but only if there's no journaling and you can't install macOS on HFS+ without journaling. There's APFS, which Linux has no support for. There's exFAT, which has no support for symlinks, so a lot of tools like uv will break. There's NTFS, which we thought would be our savior, but when we tried to use it, we ran into all sorts of weird build failures, which we believe are due to some kind of filesystem syncing race condition or something like that, so NTFS was unusable as well.

In the end, what we had to do is, if a guest image needed any tools or dependencies, it had to download them from a web server on the host.


Runner images
- Consistent approach to automating operating systems
    - OS installation: whatever is easiest
    - Config bootstrap: native config management (if any)
        - Use it as little as possible
    - Image config: native scripting
    - Runner boot: native scripting

We eventually settled on a consistent approach to automating the process of installing the OSes and configuring them for each runner template.

And the approach that we used was to install the OS using whatever method is easiest, and to bootstrap the config using any native config management system, if there is one included with the operating system.

But once we've kicked off that process, we then use the native config management as little as possible. We do this because a lot of the config management tools built into these operating systems are quite quirky and annoying, and they're built for needs we don't have, the primary one being managing the configuration of a system over time and keeping it up to date with changes. Each runner image, though, only needs to get built once and configured once; after that it'll be cloned for each runner, and each clone will only run once. For this kind of "one shot" use case, a config management system is overkill.

We can just do it with the usual scripting and automation facilities of the operating system. What does that look like in practice?


Linux runners
- OS installation: prebuilt Ubuntu cloud images
- Config bootstrap: cloud-init config
    - Use it as little as possible (systemd journal to tty7, netplan config, curl and run next stage)
- Image config: bash script
- Runner boot: same bash script

Well, for Linux, we install the OS using pre-built Ubuntu cloud images, which we just download from the mirrors.

We bootstrap the config using cloud-init, which is such a painful... it's so painful to use cloud-init. We use it because it's included with the operating system, so that means it's the fastest possible way to get started.

We use it as little as possible: we just configure the logs to go to a TTY, we configure the network, so we can connect to the network, and then once we're on the network, we just curl and run a bash script which does the rest.
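A minimal cloud-init sketch in that spirit; the host address, file paths, and exact settings here are assumptions for illustration, not the config from servo/ci-runners.

#cloud-config
write_files:
  # Send the systemd journal to tty7 so boot problems are visible on the console.
  - path: /etc/systemd/journald.conf.d/console.conf
    content: |
      [Journal]
      ForwardToConsole=yes
      TTYPath=/dev/tty7
  # Bring up networking with a simple DHCP netplan config.
  - path: /etc/netplan/50-runner.yaml
    content: |
      network:
        version: 2
        ethernets:
          all-en:
            match:
              name: "en*"
            dhcp4: true
runcmd:
  # Everything else happens in a bash script served by the host
  # (192.168.122.1 is just the default libvirt NAT gateway, used here as a placeholder).
  - netplan apply
  - curl -fsS http://192.168.122.1:8000/image-config.sh | bash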


Windows runners
- OS installation: autounattend.xml (generator)
- Config bootstrap: same autounattend.xml
    - Use it as little as possible
    - Create elevated scheduled task to curl and run next stage
    - Install NetKVM driver, do some reg imports, reboot
- Image config: PowerShell script
- Runner boot: same PowerShell script

The same goes for Windows.

We install the operating system using an automated answers file called autounattend.xml. There's a nice little generator here which you can use if you don't want to have to set up a whole Windows server to set up unattended installations. You generate that XML file, and you can also use that XML file to bootstrap the config.

Again we use it as little as possible, because writing automations as XML elements is kind of a pain. So we essentially just set up a scheduled task to run the next stage of the config, we install the network driver, we import some registry settings, and we reboot. That's it. The rest of it is done with a PowerShell script.


macOS runners
- OS installation: by hand :( but only once
- Config bootstrap: curl|sh by hand :( but only once
    - Use it as little as possible (zsh script)
    - Enable SSH, enable autologin, enable sudo NOPASSWD
    - Install a LaunchAgent to curl and run next stage
    - Disable broken session restore feature in Terminal.app
- Image config and runner boot: zsh script

The same goes for macOS.

Now, unfortunately, installing the OS and bootstrapping the config does have to be done by hand. If you want to automate a macOS installation, your options, more or less, are enterprise device management solutions, which cost a lot of money and mean you have to have a macOS server around to control and orchestrate the servers, or what most open projects faced with this problem end up doing, which is to throw OpenCV at it. I've seen several projects use OpenCV to OCR the setup wizard, which is... certainly a bold strategy. It's not really for me.

What I decided to do instead is just install the OS by hand, and pipe curl into sh to kick off the config management process. And this is something that we only really have to do once, because we do it once, and then we take a snapshot of it, and then we never have to do it again, at least until the next version of macOS comes out.

So this bootstrap script just does a handful of minimal things: it enables automatic login, it sets up a LaunchAgent to ensure that we can run our own code on each boot, and then it does a handful of other things which it honestly doesn't really have to do in this script. We could probably do these things in the zsh script which we then curl and run. And that's where the remainder of the work is done.


Future directions
- Decoupling the system from Servo
- macOS arm64 runners (#64)
- Support for Forgejo Actions (#94)
- Support for other CI services?
- Dynamic runner counts / autoscaling
- Hot runners with memory ballooning
- microVM runners?

So looking forward, some things that we'd like to do with this system.

The first is to decouple it from Servo. So we built this CI system quite organically over the past 12 to 18 months, and we built it around Servo's needs specifically, but we think this system could be more broadly useful for other projects. We'll just have to abstract away some of the Servo specific bits, so that it can be more easily used on other projects, and that's something we're looking into now.

Something else that we'll have to do sooner or later is add support for macOS runners on Apple Silicon, that is, arm64. The reason is that macOS 26, the most recent version of macOS, which came out in September, just a couple of months ago, is the last version that will support x86 CPUs. And at the moment, our macOS runners run on x86 CPUs, on the host and in the guest.

This is a little bit complicated because at the moment, our macOS runners actually run on Linux hosts, using a Linux-KVM-based sort of Hackintosh-y kind of solution. And there is no counterpart for this for arm64 hosts and guests, and I'm not sure there ever will be one. So we're going to have to port the system so that it can run on macOS hosts, so we can use actual Mac hardware for this, which is easy enough to do, and that's in progress.

But we're also going to have to port the system so it can run with other hypervisors. And this is because, although libvirt supports macOS hosts, the support for the macOS Hypervisor framework and Virtualization framework is not mature enough to actually run macOS guests in libvirt. And I'm not sure how long that will take to develop, so in the meantime, we've been looking at porting the system so that when you're on a Mac, you can run with UTM instead, and that's been working fairly well so far.

We're also looking at porting the system so that it can run with Forgejo Actions and not just GitHub Actions. So Forgejo Actions is an open alternative to GitHub Actions that tries to be loosely compatible with GitHub Actions, and in our experience, from experimentation so far, we found that it mostly is loosely compatible. We think we'll only have to make some fairly minor changes to our system to make it work on both CI systems.

That said, this CI system could potentially be more broadly applicable to other CI services as well, because virtual machine orchestration is something that we haven't really seen any CI services have a great built-in solution for. So if this interests you and you want to use it on your project on some other CI service, then we'd appreciate knowing about that, because that could be something we would explore next.

The remaining ideas are things that I think we could look into to make our runners more efficient.

The big one is autoscaling. At the moment, when you set up a server to operate some self-hosted runners, you essentially have to statically pre-configure how many runners of each kind you want kept running. This has worked well enough for us for the most part, but it does mean that resources sometimes go to waste, when the moment-to-moment needs of the queued jobs aren't well fitted to the composition of your runner configuration. So if we had the ability to dynamically respond to demand, some kind of autoscaling, I think we could improve our runner utilization rates a little bit and get more out of the same amount of server capacity.

There are a couple of ideas here, also, about reducing boot times for the runners, which can be quite helpful when you have a big backlog of jobs queued up for these servers, because time spent booting up each runner, each virtual machine, is time that cannot be spent doing real work.

So there are two ways we can think of to reduce these boot times. The first is to have hot spares ready to go: if you can spin up more runners than you actually intend to run concurrently, and just have them sitting around, then you can amortize the boot times and get the boot process out of the way in advance. The way you do this is by spinning up a whole bunch of runners, say twenty, even though you only intend to run four of them concurrently.

And what you do is give these runners a token amount of RAM to start with, like one gig instead of 16 or 32. Then, when a job comes in and you actually want to assign the runner out so it can do the work, you dynamically increase the RAM from that token amount to the actual amount, like 16 or 32 gigs. This should be fairly easy to do in practice; it's supported in libvirt using a feature known as memory ballooning. There are some minor caveats: you do lose the ability to do certain optimizations, like backing the memory with huge pages. But for the most part, this should be technically simple to implement.

The second, which could be more interesting in the longer term, is microVMs, things like Firecracker. As I understand it, microVMs take the concept of paravirtualization to its logical extreme: on kernels that support being run as microVMs, you can boot them in one or two seconds instead of 20, 30, or 60. This could save a great deal of time, at least for jobs that run on Linux and BSD. I don't know if I said Linux and macOS, but I meant Linux and BSD.


github.com/servo/ci-runners
Slides: go.daz.cat/3tdhp

So yeah, we now have a system that we use to speed up our builds in Servo's CI, and it works fairly well for us. And we think it's potentially useful for other projects as well.

So if you're interested to find out more, or you're interested to find out how you can use the system in your own projects, go to our GitHub repo at servo/ci-runners, or you can go here for a link to the slides. Thanks!

December 18, 2025 10:00 AM

December 17, 2025

Andy Wingo

in which our protagonist dreams of laurels

I had a dream the other evening, in which I was at a large event full of hackers—funny, that this is the extent of my dreams at the moment; as a parent of three young kids, I don’t get out much—and, there, I was to receive an award and give a speech. (I know, I am a ridiculous man, even when sleeping.) The award was something about free software; it had the trappings of victory, but the vibe among attendees was numbness and bitter loss. Palantir had a booth; they use free software, and isn’t that just great?

My talk was to be about Guile, I think: something technical, something interesting, but, I suspected, something inadequate: in its place and time it would be a delight to go deep on mechanism but the moment seemed to call for something else.

These days are funny. We won, objectively, in the sense of the goals we set in the beginning; most software is available to its users under a free license: Firefox, Chromium, Android, Linux, all the programming languages, you know the list. So why aren’t we happy?

When I reflect back on what inspired me about free software 25 years ago, it was much more political than technical. The idea that we should be able to modify our own means of production and share those modifications was a part of a political project of mutual care: we should be empowered to affect the systems that surround us, to the extent that they affect us.

To give you an idea of the milieu, picture me in 1999. I left my home to study abroad on another continent. When I would go to internet cafés I would do my email and read slashdot and freshmeat as one did back then, but also I would often read Z magazine, Noam Chomsky and Michael Albert and Michael Parenti and Arundhati Roy and Zapatistas and all. I remember reading El País the day after “we” shut down the World Trade Organization meeting in Seattle, seeing front-page pictures of pink-haired kids being beat up by the cops and wishing I were there with them. For me, free software fit with all of this: the notion that a better world was possible, and we could build it together.

I won’t lie and say that the ideals were everything. I think much of my motivation to program is selfish: I like to learn, to find out, to do. But back then I felt the social component more strongly. Among my cohort, though, I think we now do free software because we did free software; the motive sedimented into mechanism. These are the spoils of victory: free is the default. But defaults lack a sense of urgency, of the political.

Nowadays the commons that we built is the feedlot of large language models, and increasingly also its waste pond. The software we make is free, but the system in which it is made is not; Linux Magazine 1, Z magazine 0.

All of this makes me think that free software as a cause has run its course. We were the vanguard, and we won. Our dreams of 25 years ago are today’s table stakes. Specifically for my copyleft comrades, it seems that the role of copyright as a societal lever has much less purchase; taken to its conclusion, we might find ourselves siding with Disney and OpenAI against Google.

If I had to choose an idea from the 90s to keep, I would take “another world is possible” over the four freedoms. For me, software freedom is a strategy within a broader humanist project of liberation. It was clever, in that it could motivate people from a variety of backgrounds in a way that was on the whole positive for the humanist project. It inspired me as a meaningful way in which I could work towards a world of people caring for each other. In that spirit, I would like to invite my comrades to reflect on their own hierarchy of principles; too often I see people arguing the fine points of “is this software free” according to a specific definition without appreciating the ends to which the software freedom definition is a means.

Anyway, it turns out that I did win something, the Award for the Advancement of Free Software, for my work on Guile over the years. My work on Guile has waxed and waned, and in these last few years of parenthood it has been rather the latter, but I am proud of some of the technical hacks; and it has been with a heart-warming, wondrous delight that I have been a spectator to the rise of Guix, a complete operating system built on Guile. Apart from its quite compelling technical contributions, I just love that Guix is a community of people working together to build a shared project. I am going to the Guix days in a month or so and in past years it has been such a pleasure to see so many people there, working to make possible another world.

In my dream, instead of talking about Guile, I gave a rousing and compelling impromptu invective against Palantir and their ilk. I thought it quite articulate; I was asleep. In these waking hours, some days later, I don’t know what I did say, but I think I know what I would like to have said: that if we take the means of free software to be the ends, then we will find ourselves arguing our enemies are our friends. Saying that it’s OK if some software we build on is made by people who facilitate ICE raids. People who build spy software for controlling domestic populations. People who work for empire.

What I would like to say is that free software is a strategy. As a community of people that share some kind of liberatory principles, of which free software has been a part, let us use free software as best we can, among many other strategies. If it fits, great. If you find yourself on the same side of an argument as Palantir, it’s time to back up and try something else.

by Andy Wingo at December 17, 2025 10:42 PM

December 16, 2025

Pablo Saavedra

Verifying ARM vs THUMB2 Instruction Sets in ELF Binaries

When working with embedded Linux systems, in ARM-based boards, sometimes you need to determine whether a binary was compiled with ARM or THUMB2 instruction sets. This is a quick reference guide for checking this without relying on heavy tools like readelf or objdump. The Core Concept ARM uses a clever trick to distinguish between ARM […]

by Pablo Saavedra at December 16, 2025 08:17 AM

December 15, 2025

Igalia WebKit Team

WebKit Igalia Periodical #51

Update on what happened in WebKit in the week from December 8 to December 15.

In this end-of-year special we have a new GMallocString helper that makes management of malloc-based strings more efficient, development releases, and a handful of advancements on JSC's implementation of Temporal, in particular the PlainYearMonth class.

Cross-Port 🐱

Added a GMallocString class to WTF to adopt UTF-8 C strings and make them first-class WebKit citizens efficiently (no copies). Applied it in GStreamer code, together with other improvements, by using CStringView. Fixed two other bugs related to string management.

JavaScriptCore 🐟

The built-in JavaScript/ECMAScript engine for WebKit, also known as JSC or SquirrelFish.

Releases 📦️

Development releases of WebKitGTK 2.51.3 and WPE WebKit 2.51.3 are now available. These include a number of API additions and new features, and are intended to allow interested parties to test those in advance, prior to the next stable release series. As usual, bug reports are welcome in Bugzilla.

That’s all for this week!

by Igalia WebKit Team at December 15, 2025 07:58 PM

December 11, 2025

José Dapena

Maintaining Chromium downstream: how can upstream help?

As I write often, maintaining a downstream of Chromium is not easy. A lot of effort falls on the shoulders of the teams embedding Chromium, or creating products on top of the upstream Chromium project.

We covered this in the previous chapters of my series of blog posts about maintaining Chromium downstreams. Now, this post is going to be a bit different.

I start with a question:

What can upstream Chromium do to help downstreams?

This very same question was discussed in the Web Engines Hackfest breakout session that originated most of these posts. In this blog post, I will share some of the most interesting answers that came up in that session.

Better componentization #

One of the ideas was to move code around more aggressively to make it easier to reuse. Specifically, refactoring to move more and more code from //chrome to //components.

Chromium has gone a long way in that direction. Each of these changes allows downstreams to directly use only the components they need, instead of working on top of //chrome. But there is still room for improvement.

Some parts of //chrome are still not refactored and could be very useful, especially for downstreams shipping a browser. Some examples:

  • Tabs implementation.
  • Profiles.
  • Synchronization.

Improve extensibility #

In the same direction, supporting easier ways to provide alternative implementations, and add custom software components, was considered important.

Some examples:

  • Making it easier to support Chrome extensions without using //chrome would allow implementing new browsers without bundling the Chromium UI.
  • Going further in the direction of what has been done with Ozone: the Chromium platform abstraction layer that helps implement support for a variety of platforms (including Linux and X11). Similar steps could be taken at other levels to improve OS integration (system hardware encryption, accelerated video codecs, system IPC, and so on).

Downstream advocacy #

A very interesting proposal was to create the role of downstream advocates in the Chrome community.

They would act as an entry point for downstream projects wanting to interact with the Chrome community and be an official communication channel for downstreams to report their needs.

This would also increase awareness of the different ways Chromium is used by downstreams.

Today there are two channels that are somewhat similar: the Chromium Embedders mailing list and the #embedders Slack channel.

A two-way problem #

So far, three different problems raised by downstreams have been covered, and they seem like fair requests to the Chromium community.

But there is also work to do on the downstreams side.

Can downstreams contribute more of their work to upstream? Not only in code, but also in all the maintenance activities.

There is also code written for very specific downstream needs that could land upstream, as long as it does not become a burden to the common project. That means ownership and enough work bandwidth need to be in place.

Where are we now? #

There is a major change in the Chromium community: the creation of the Supporters of Chromium Based Browsers. What does it mean for embedders? Could it be a good way to channel requirements from the different downstream projects?

Two years after the Web Engines Hackfest session, we can see some improvements. But the general question is still valid:

What can upstream Chromium do to help downstreams?

The last post #

The next post in this series will be the last one. It will cover some typical problems downstream projects are facing today.

References #

December 11, 2025 12:00 AM

December 09, 2025

Luke Lau

Closing the LLVM RISC-V gap to GCC, part 1

At the time of writing, GCC beats Clang on several SPEC CPU 2017 benchmarks on RISC-V1:

LNT results comparing GCC and Clang

LLVM developers upstream have been working hard on the performance of generated code, in every part of the pipeline from the frontend all the way through to the backend. So when we first saw these results we were naturally a bit surprised. But as it turns out, the GCC developers have been hard at work too.

Sometimes a bit of healthy competition isn’t a bad thing, so this blog post is the first in a series looking at the work going on upstream to improve performance and catch up to GCC.

Please note that this series focuses on RISC-V. Other targets may have more competitive performance but we haven’t measured them yet. We’ll specifically be focusing on the high-performance application processor use case for RISC-V, e.g. compiling for a profile like RVA23. Unfortunately, since we don’t have access to RVA23-compatible hardware just yet we’ll be benchmarking on a SpacemiT-X60 powered Banana Pi BPI-F3 with -march=rva22u64_v. We don’t want to use -mcpu=spacemit-x60 since we want to emulate a portable configuration that an OS distribution might compile packages with. And we want to include the vector extension, as we’ll see in later blog posts that optimizations like auto-vectorization can have a major impact on performance.

Where to start?

It goes without saying that a vague task like “make LLVM faster” is easier said than done. The first thing is to find something to make fast, and while you could read through the couple dozen million lines of code in LLVM until inspiration strikes, it’s generally easier to start the other way around by analyzing the code it generates.

Sometimes you’ll get lucky by just stumbling across something that could be made faster when hacking or poring through generated assembly. But there’s an endless amount of optimizations to be implemented and not all of them are equally impactful. If we really want to make large strides in performance we need to take a step back and triage what’s actually worth spending time on.

LNT, LLVM’s nightly testing infrastructure, is a great tool for this task. It’s both a web server that allows you to analyze benchmark results, and a command line tool to help run the benchmarks and submit the results to said web server.

As the name might imply, it’s normally used for detecting performance regressions by running benchmarks daily with the latest revision of Clang, flagging any tests that may have become slower or faster since the last revision.

But it also allows you to compare benchmark results across arbitrary configurations. You can run experiments to see what effects a flag has, or see the difference in performance on two pieces of hardware.

Moreover, you can pass in different compilers. In our case, we can do two “runs” with Clang and GCC. Here’s how we would kick these off:

for CC in clang riscv64-linux-gnu-gcc
do
  # Cross-compile, then run on another machine over ssh; run each benchmark
  # 3 times pinned to the same core to fight noise, and submit the results
  # to a web server for easy viewing.
  lnt runtest test-suite bpi-f3-rva22u64_v-ReleaseLTO \
    --sandbox /var/lib/lnt/ \
    --test-suite=path/to/llvm-test-suite \
    -DTEST_SUITE_SPEC2017_ROOT=path/to/cpu2017 \
    --cc=$CC \
    --cflags="-O3 -flto -march=rva22u64_v" \
    --cxxflags="-O3 -flto -march=rva22u64_v" \
    --benchmarking-only \
    --build-threads=16 \
    --toolchain=rva22u64_v.cmake \
    --remote-host=bpi-f3 \
    --exec-multisample=3 \
    --run-under="taskset -c 5" \
    --submit=https://mylntserver.com/submitRun
done

This command does a lot of heavy lifting. First off it invokes CMake to configure a new build of llvm-test-suite and SPEC CPU 2017 with -O3 -flto -march=rva22u64_v. But because compiling the benchmarks on the Banana Pi BPI-F3 would be painfully slow, we’ve specified a CMake toolchain file to cross-compile to riscv64-linux-gnu from an x86-64 build machine. Here’s what the toolchain file looks like:

# rva22u64_v.cmake
set(CMAKE_SYSTEM_NAME Linux)
set(CMAKE_C_COMPILER_TARGET riscv64-linux-gnu)
set(CMAKE_CXX_COMPILER_TARGET riscv64-linux-gnu)
set(ARCH riscv64)

If you’ve got your cross toolchain sysroots set up in the right place in /usr/riscv64-linux-gnu/, things should “just work” and CMake will magically build RISC-V binaries. On Debian distros you can install the *-cross packages for this:

$ apt install libc6-dev-riscv64-cross libgcc-14-dev-riscv64-cross \
    libstdc++-12-dev-riscv64-cross

(You could also use mmdebstrap, or see Alex Bradbury’s guide to this)

After the benchmarks are built, it rsyncs the binaries over to the remote machine, and then sshes into it to begin running the benchmarks. It will expect the sandbox path where the binaries are built on the remote to also exist on the host, so something like /var/lib/lnt should work across both. The BPI-F3 can also produce some noisy results, so the --exec-multisample=3 and --run-under="taskset -c 5" tell it to run the benchmarks multiple times and pin them to the same core.

Finally it generates a report.json file and submits it to the web server of choice. Navigate to the web interface and you’ll be shown two “machines”, LNT’s parlance for a specific combination of hardware, compiler and flags. You should see something like: bpif3-rva22u64_v-ReleaseLTO__clang_DEV__riscv64 and bpif3-rva22u64_v-ReleaseLTO__gcc_DEV__riscv64. Clicking into one of these machines will allow you to compare it against the other.

LNT UI for comparing results across two machines

Profiling

Once on the LNT web interface you’ll be presented with a list of benchmarks with a lot of red percentages beside them. We now know what is slower, but next we need to know why they’re slower. We need to profile these benchmarks to see where all the cycles are spent and to figure out what Clang is doing differently from GCC.

LNT makes this easy, all you need to do is add --use-perf=profile to the lnt runtest invocation and it will perform an additional run of each benchmark wrapped in perf record. Profiling impacts run time so it runs it separately to avoid interfering with the final results. If you want to override the default events that are sampled you can specify them with --perf-events=cycles:u,instructions:u,....

LNT will take care of copying back the collected profiles to the host machine and encoding them in the report, and in the web interface you’ll notice a “Profile” button beside the benchmark. Click on that and you’ll be brought to a side by side comparison of the profiles from the two machines:

LNT UI for comparing profiles

From here you can dive in and see where the benchmark spends most of its time. Select a function from the dropdown and choose one with a particularly high percentage: This is how much it makes up overall of whatever counter is active in the top right, like cycles or instructions. Then do the same for the other run and you’ll be presented with the disassemblies side-by-side below. Most importantly, information about the counters is displayed inline with each instruction, much like the output of perf annotate.

You might find the per-instruction counter cycle information to be a bit too fine-grained, so personally I like to use the “Control-Flow Graph” view mode in the top left. This groups the instructions into blocks and lets you see which blocks are the hottest. It also shows the edges between branches and their destinations which makes identifying loops a lot easier.

A real example

Lets take a look at how we can use LNT’s web interface to identify something that GCC does but Clang doesn’t (but should). Going back to the list of SPEC benchmark results we can see 508.namd_r is about 17% slower, so hopefully we should find something to optimize in there.

Jumping into the profile we can see there’s a bunch of functions that all contribute a similar amount to the runtime. We’ll just pick the hottest one at 14.3%, ComputeNonbondedUtil::calc_pair_energy_fullelect(nonbonded*). It’s a pretty big function, but in GCC’s profile 71% of the dynamic instruction count comes from this single, albeit large, block.

A hot block in the profile for GCC's 508.namd_r

Looking at Clang’s profile on the opposite side we see a similar block that accounts for 85% of the function’s instruction count. This slightly higher proportion is a small hint that the block Clang is producing is sub-optimal. If we take the hint and stare at it for long enough, one thing that starts to stand out is that Clang generates a handful of fneg.d instructions which GCC doesn’t:

	fneg.d  fa0, fa0
	fneg.d  ft0, ft0
	fneg.d  ft2, ft2
	fmul.d  fa3, ft5, fa3
	fmul.d  fa0, fa3, fa0
	fmul.d  ft0, fa3, ft0
	fmul.d  fa3, fa3, ft2
	fmadd.d fa2, fa4, fa2, fa0
	fmadd.d ft6, fa4, ft6, ft0
	fmadd.d fa4, fa4, ft1, fa3

fneg.d rd, rs1 negates a double and fmul.d multiplies two doubles. fmadd.d rd, rs1, rs2, rs3 computes (rs1*rs2)+rs3, so here we’re doing some calculation like (a*b)+(c*-d).

These fneg.ds and fmadd.ds are missing on GCC. Instead it emits fmsub.d, which is entirely absent from the Clang code:

	fmul.d fa1,fa4,fa1
	fmul.d ft10,fa4,fa5
	fmsub.d ft10,ft7,fa0,ft10
	fmsub.d fa5,ft7,fa5,fa1
	fmul.d fa1,fa4,fa1
	fmsub.d fa1,ft7,fa0,fa1

fmsub.d rd, rs1, rs2, rs3 computes (rs1*rs2)-rs3, so GCC is instead doing something like (a*b)-(c*d) and in doing so avoids the need for the fneg.d. This sounds like a missed optimization in LLVM, so lets take a look at fixing it.

Writing the (right) fix

The LLVM RISC-V scalar backend is pretty mature at this stage so it’s surprising that we aren’t able to match fmsub.d. But if you take a look in RISCVInstrInfoD.td, you’ll see that the pattern already exists:

// fmsub: rs1 * rs2 - rs3
def : Pat<(any_fma FPR64:$rs1, FPR64:$rs2, (fneg FPR64:$rs3)),
          (FMSUB_D FPR64:$rs1, FPR64:$rs2, FPR64:$rs3, FRM_DYN)>;

We’ll need to figure out why this pattern isn’t getting selected, so lets start by extracting the build commands so we can look under the hood and dump the LLVM IR:

$ cmake -B build -C cmake/caches/ReleaseLTO.cmake --toolchain=...
$ ninja -C build 508.namd_r -t clean
$ ninja -C build 508.namd_r -v
...
[44/45] : && llvm-project/build.release/bin/clang++ --target=riscv64-linux-gnu -march=rva22u64_v -O3 -fomit-frame-pointer -flto -DNDEBUG -fuse-ld=lld ... -o External/SPEC/CFP2017rate/508.namd_r/508.namd_r

This is an LTO build so the code generation step is actually happening during link time. To dump the IR we can copy and paste the link command from the verbose output and append -Wl,--save-temps to it, which in turn tells the Clang driver to pass --save-temps to the linker2.

$ llvm-project/build.release/bin/clang++ -Wl,--save-temps --target=riscv64-linux-gnu -march=rva22u64_v -O3 -fomit-frame-pointer -flto -DNDEBUG -fuse-ld=lld ... -o External/SPEC/CFP2017rate/508.namd_r/508.namd_r 
$ ls External/SPEC/CFP2017rate/508.namd_r/508.namd_r*
External/SPEC/CFP2017rate/508.namd_r/508.namd_r
External/SPEC/CFP2017rate/508.namd_r/508.namd_r.0.0.preopt.bc
External/SPEC/CFP2017rate/508.namd_r/508.namd_r.0.2.internalize.bc
External/SPEC/CFP2017rate/508.namd_r/508.namd_r.0.4.opt.bc
External/SPEC/CFP2017rate/508.namd_r/508.namd_r.0.5.precodegen.bc

The bitcode is dumped at various stages, and 508.namd_r.0.5.precodegen.bc is the particular stage we’re looking for. This is after all the middle-end optimisations have run and is as close as we’ll get before the backend begins. It contains the bitcode for the entire program though, so lets find the symbol for the C++ function and extract just that corresponding LLVM IR function:

$ llvm-objdump -t 508.namd_r | grep calc_pair_energy_fullelect
...
000000000004562e l     F .text	0000000000001c92 _ZN20ComputeNonbondedUtil26calc_pair_energy_fullelectEP9nonbonded
$ llvm-extract -f 508.namd_r.0.5.precodegen.bc --func _ZN20ComputeNonbondedUtil26calc_pair_energy_fullelectEP9nonbonded \
  | llvm-dis > calc_pair_energy_fullelect.precodegen.ll

Now quickly grep the disassembled LLVM IR to see if we can find the source of the fnegs:

  %316 = fneg double %315
  %neg = fmul double %mul922, %316
  %317 = tail call double @llvm.fmuladd.f64(double %mul919, double %314, double %neg)

This looks promising. We have a @llvm.fmuladd that’s being fed by a fmul of a fneg, which is similar to the (a*b)+(c*-d) pattern in the resulting assembly. But looking back to our TableGen pattern for fmsub.d, we want (any_fma $rs1, $rs2, (fneg $rs3)), i.e. a llvm.fmuladd fed by a fneg of a fmul.

One thing about floating point arithmetic is that whilst it’s generally not associative, we can hoist out the fneg from the fmul since all negation does is flip the sign bit. So we can try to teach InstCombine to hoist the fneg outwards like (fmul x, (fneg y)) -> (fneg (fmul x, y)). But if we go to try that out we’ll see that InstCombine already does the exact opposite:

Instruction *InstCombinerImpl::visitFNeg(UnaryOperator &I) {
  Value *Op = I.getOperand(0);
  // ...
  Value *OneUse;
  if (!match(Op, m_OneUse(m_Value(OneUse))))
    return nullptr;

  if (Instruction *R = hoistFNegAboveFMulFDiv(OneUse, I))
    return replaceInstUsesWith(I, R);
  // ...
}

Instruction *InstCombinerImpl::hoistFNegAboveFMulFDiv(Value *FNegOp,
                                                      Instruction &FMFSource) {
  Value *X, *Y;
  if (match(FNegOp, m_FMul(m_Value(X), m_Value(Y)))) {
    // Push into RHS which is more likely to simplify (const or another fneg).
    // FIXME: It would be better to invert the transform.
    return cast<Instruction>(Builder.CreateFMulFMF(
        X, Builder.CreateFNegFMF(Y, &FMFSource), &FMFSource));
  }

InstCombine usually has good reasons for canonicalizing certain IR patterns, so we need to seriously reconsider if we want to change the canonical form. InstCombine affects all targets and it could be the case that some other backends have patterns that match fmul x, (fneg y), in which case we don’t want to disturb them. However for RISC-V we know what our patterns for instruction selection are and what form we want our incoming IR to be in. So a much better place to handle this is in RISCVISelLowering.cpp, which lets us massage it into shape at the SelectionDAG level, in a way that’s localized to just our target. “Un-canonicalizing” the IR is a common task that backends end up performing, and this is what the resulting patch ended up looking like:

--- a/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
+++ b/llvm/lib/Target/RISCV/RISCVISelLowering.cpp
@@ -20248,6 +20248,17 @@ SDValue RISCVTargetLowering::PerformDAGCombine(SDNode *N,
       return V;
     break;
   case ISD::FMUL: {
+    using namespace SDPatternMatch;
+    SDLoc DL(N);
+    EVT VT = N->getValueType(0);
+    SDValue X, Y;
+    // InstCombine canonicalizes fneg (fmul x, y) -> fmul x, (fneg y), see
+    // hoistFNegAboveFMulFDiv.
+    // Undo this and sink the fneg so we match more fmsub/fnmadd patterns.
+    if (sd_match(N, m_FMul(m_Value(X), m_OneUse(m_FNeg(m_Value(Y))))))
+      return DAG.getNode(ISD::FNEG, DL, VT,
+                         DAG.getNode(ISD::FMUL, DL, VT, X, Y));
+

And if we rebuild our benchmark after applying it, we can see the fmsub.ds getting matched, saving a couple of instructions:

@@ -983,18 +983,15 @@
        fld     ft2, 48(a5)
        fld     ft3, 64(a5)
        fld     ft4, 72(a5)
-       fneg.d  fa0, fa0
-       fneg.d  ft0, ft0
-       fneg.d  ft2, ft2
        fmul.d  fa3, ft5, fa3
        fmul.d  fa0, fa3, fa0
        fmul.d  ft0, fa3, ft0
        fmul.d  fa3, fa3, ft2
        fld     ft2, 0(s1)
        fmul.d  fa4, ft5, fa4
-       fmadd.d fa2, fa4, fa2, fa0
-       fmadd.d ft6, fa4, ft6, ft0
-       fmadd.d fa4, fa4, ft1, fa3
+       fmsub.d fa2, fa4, fa2, fa0
+       fmsub.d ft6, fa4, ft6, ft0
+       fmsub.d fa4, fa4, ft1, fa3

All in all this ended up giving a 1.77% improvement in instruction count for the 508.namd_r benchmark. It’s still not nearly as fast as GCC, but we’re a little bit closer than before we started.

What’s next?

Hopefully this has given you an overview of how to identify opportunities for optimization in LLVM, and what a typical fix might look like. The analysis is really the most important part, but if you don’t feel like setting up an LNT instance locally yourself, Igalia runs one at cc-perf.igalia.com3. We run llvm-test-suite and SPEC CPU 2017 nightly, built with Clang and GCC, on a small set of RISC-V hardware4, which we hope to expand in the future. Feel free to use it to investigate some of the differences between Clang and GCC yourself, and maybe you’ll find some inspiration for optimizations.

In the next post in this series I’ll talk about a performance improvement that recently landed related to cost modelling.

  1. Compiled with -march=rva22u64_v -O3 -flto, running the train dataset on a 16GB Banana Pi BPI-F3 (SpacemiT X60), with GCC and Clang from ToT on 2025-11-25. 

  2. LLD in this case, configurable through CMake with -DCMAKE_LINKER_TYPE=LLD

  3. The LLVM foundation is also in the process of rebooting its canonical public server, which should hopefully be up and running in the coming months. 

  4. Currently it consists of a few Banana Pi BPI-F3s and some HiFive Premier P550s, the latter of which were generously donated by RISC-V International. 

December 09, 2025 04:00 PM

December 08, 2025

Igalia WebKit Team

WebKit Igalia Periodical #50

Update on what happened in WebKit in the week from December 1 to December 8.

In this edition of the periodical we have further advancements on the Temporal implementation, support for Vivante super-tiled format, and an adaptation of the DMA-BUF formats code to the Android port.

Cross-Port 🐱

JavaScriptCore 🐟

The built-in JavaScript/ECMAScript engine for WebKit, also known as JSC or SquirrelFish.

Implemented the toString, toJSON, and toLocaleString methods for PlainYearMonth objects in JavaScriptCore's implementation of Temporal.

Graphics 🖼️

BitmapTexture and TextureMapper were prepared to handle textures where the logical size (e.g. 100×100) differs from the allocated size (e.g. 128×128) due to alignment requirements. This made it possible to add support for using memory-mapped GPU buffers in the Vivante super-tiled format available on i.MX platforms. Set WEBKIT_SKIA_USE_VIVANTE_SUPER_TILED_TILE_TEXTURES=1 to activate it at runtime.

WPE WebKit 📟

WPE Platform API 🧩

New, modern platform API that supersedes usage of libwpe and WPE backends.

The WPEBufferDMABufFormats class has been renamed to WPEBufferFormats, as it can be used in situations where mechanisms other than DMA-BUF may be used for buffer sharing—on Android targets AHardwareBuffer is used instead, for example. The naming change involved also WPEBufferFormatsBuilder (renamed from WPEBufferDMABufFormatsBuilder), and methods and signals in other classes that use these types. Other than the renames, there is no change in functionality.

That’s all for this week!

by Igalia WebKit Team at December 08, 2025 08:26 PM

December 05, 2025

Enrique Ocaña

Meow: Process log text files as if you could make cat speak

Some years ago I had mentioned some command line tools I used to analyze and find useful information on GStreamer logs. I’ve been using them consistently along all these years, but some weeks ago I thought about unifying them in a single tool that could provide more flexibility in the mid term, and also as an excuse to unrust my Rust knowledge a bit. That’s how I wrote Meow, a tool to make cat speak (that is, to provide meaningful information).

The idea is that you can cat a file through meow and apply the filters, like this:

cat /tmp/log.txt | meow appsinknewsample n:V0 n:video ht: \
ft:-0:00:21.466607596 's:#([A-Za-z][A-Za-z]*/)*#'

which means “select those lines that contain appsinknewsample (with case insensitive matching), but don’t contain V0 nor video (that is, by exclusion, only those that contain audio, probably because we’ve analyzed both and realized that we should focus on audio for our specific problem), highlight the different thread ids, only show those lines with timestamp lower than 21.46 sec, and change strings like Source/WebCore/platform/graphics/gstreamer/mse/AppendPipeline.cpp to become just AppendPipeline.cpp”, to get an output as shown in this terminal screenshot:

Screenshot of a terminal output showing multiple log lines. Some of them have the word

Cool, isn’t it? After all, I’m convinced that the answer to any GStreamer bug is always hidden in the logs (or will be, as soon as I add “just a couple of log lines more, bro”🤭).

Currently, meow supports this set of manipulation commands:

  • Word filter and highlighting by regular expression (fc:REGEX, or just REGEX): Every expression will highlight its matched words in a different color.
  • Filtering without highlighting (fn:REGEX): Same as fc:, but without highlighting the matched string. This is useful for those times when you want to match lines that have two expressions (E1, E2) but the highlighting would pollute the line too much. In those cases you can use a regex such as E1.*E2 and then highlight the subexpressions manually later with an h: rule.
  • Negative filter (n:REGEX): Selects only the lines that don’t match the regex filter. No highlighting.
  • Highlight with no filter (h:REGEX): Doesn’t discard any line, just highlights the specified regex.
  • Substitution (s:/REGEX/REPLACE): Replaces one pattern with another. Any other delimiter character can be used instead of /, if that’s more convenient for the user (for instance, using # when dealing with expressions to manipulate paths).
  • Time filter (ft:TIME-TIME): Assuming the lines start with a GStreamer log timestamp, this filter selects only the lines between the target start and end time. Any of the time arguments (or both) can be omitted, but the - delimiter must be present. Specifying multiple time filters will generate matches that fit on any of the time ranges, but overlapping ranges can trigger undefined behaviour.
  • Highlight threads (ht:): Assuming a GStreamer log, where the thread id appears as the third word in the line, highlights each thread in a different color.

The REGEX pattern is a regular expression. All the matches are case insensitive. When used for substitutions, capture groups can be defined as (?<CAPTURE_NAME>REGEX).

The REPLACEment string is the text that the REGEX will be replaced by when doing substitutions. Text captured by a named capture group can be referred to by ${CAPTURE_NAME}.

The TIME pattern can be any sequence of digits, : or . characters. Typically, it will be a GStreamer timestamp (eg: 0:01:10.881123150), but it can actually be any other numerical sequence. Times are compared lexicographically, so it’s important that all of them have the same string length.

The filtering algorithm has a custom set of priorities for operations, so that they get executed in an intuitive order. For instance, a sequence of filter matching expressions (fc:, fn:) will have the same priority (that is, any of them will let a text line pass if it matches, not forbidding any of the lines already allowed by sibling expressions), while a negative filter will only be applied on the results left by the sequence of filters before it. Substitutions will be applied at their specific position (not before or after), and will therefore modify the line in a way that can alter the matching of subsequent filters. In general, the user doesn’t have to worry about any of this, because the rules are designed to generate the result that you would expect.
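To make that ordering a little more concrete, here is a rough Python re-imagining of a simplified pipeline (positive filters, then negative filters, then substitutions). It is only an illustration of the semantics described above, not the actual Rust implementation: it ignores highlighting and time filters, and applies rules in fixed phases rather than at their exact positions.

import re
import sys

def meowish(lines, positive, negative, substitutions):
    """Apply a simplified meow-like pipeline to an iterable of lines."""
    pos = [re.compile(p, re.IGNORECASE) for p in positive]
    neg = [re.compile(n, re.IGNORECASE) for n in negative]
    subs = [(re.compile(r, re.IGNORECASE), repl) for r, repl in substitutions]
    for line in lines:
        # A line passes if it matches any of the positive filters (fc:/fn:)...
        if pos and not any(p.search(line) for p in pos):
            continue
        # ...and none of the negative filters (n:).
        if any(n.search(line) for n in neg):
            continue
        # Substitutions (s:) are applied in order; here, after filtering.
        for rx, repl in subs:
            line = rx.sub(repl, line)
        yield line

if __name__ == "__main__":
    # Roughly mirrors the appsinknewsample example from earlier in the post.
    for out in meowish(sys.stdin,
                       positive=[r"appsinknewsample"],
                       negative=[r"V0", r"video"],
                       substitutions=[(r"(?:[A-Za-z]+/)+", "")]):
        sys.stdout.write(out)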

Now some practical examples:

Example 1: Select lines with the word “one”, or the word “orange”, or a number, highlighting each pattern in a different color except the number, which will have no color:

$ cat file.txt | meow one fc:orange 'fn:[0-9][0-9]*'
000 one small orange
005 one big orange

Example 2: Assuming a pictures filename listing, select filenames not ending in “jpg” nor in “jpeg”, and rename the filename to “.bak”, preserving the extension at the end:

$ cat list.txt | meow 'n:jpe?g' \
   's:#^(?<f>[^.]*)(?<e>[.].*)$#${f}.bak${e}'
train.bak.png
sunset.bak.gif

Example 3: Only print the log lines with times between 0:00:24.787450146 and 0:00:24.790741865 or those at 0:00:30.492576587 or after, and highlight every thread in a different color:

$ cat log.txt | meow ft:0:00:24.787450146-0:00:24.790741865 \
  ft:0:00:30.492576587- ht:
0:00:24.787450146 739 0x1ee2320 DEBUG …
0:00:24.790382735 739 0x1f01598 INFO …
0:00:24.790741865 739 0x1ee2320 DEBUG …
0:00:30.492576587 739 0x1f01598 DEBUG …
0:00:31.938743646 739 0x1f01598 ERROR …

This is only the beginning. I have great ideas for this new tool (as time allows), such as support for parentheses (so the expressions can be grouped), or call stack indentation on logs generated by tracers, in a similar way to what Alicia’s gst-log-indent-tracers tool does. I might also predefine some common expressions to use in regular expressions, such as the ones to match paths (so that the user doesn’t have to think about them and reinvent the wheel every time). Anyway, these are only ideas. Only time and hyperfocus slots will tell…

By now, you can find the source code on my github. Meow!

by eocanha at December 05, 2025 11:16 AM

December 04, 2025

Brian Kardell

Standards Queues


The hardest part of web standards isn’t even the technology — it’s the queues. And that’s the real problem I keep coming back to.

Pools, Queues, and Bottlenecks

As programmers, we’re familiar with these kinds of problems: if things enter faster than they leave, they back up. We often need to prioritize among the backlog. The standards process is like several of those queues stacked together.

Ideas enter the system far faster than they leave — and they can come from anywhere. But to progress, you need implementers. They are finite, already busy, and often advancing their own priorities. On top of that, every proposal competes for wide review in privacy, security, architecture, accessibility, and internationalization. Each of those specialties is even more finite, even more busy, and even more backed up.

So an idea lands in hundreds — even thousands — of inboxes, waiting for attention. We might not even notice it as it whips past among all the others. Even if we do, it might just get starred in email or left open in a tab for “later.” Sometimes that idea is a book, or an explainer, or suddenly has 20 replies. Instead of needing five minutes to read and consider, it becomes intimidating.

At some point it just sits. It might wait weeks, months, or even years before someone comments. Why? Because everyone has jobs with other tasks. The queues are full.

And the longer it sits, the more things change around it. The more it unloads from memory. The more intimidating it becomes to return to. It has to get through a whole lot of asynchronous back-and-forth between implementers, spec writers, and test writers before reaching baseline usability.

Along the way, if coordination isn’t strong (and historically it hasn’t been), once work is invested it’s hard to throw away. It’s hard to propose breaking changes or add stop energy.

Real Impacts

This is why something like :focus-visible can take seven years. Realistically, it required only a few days of effective discussion. The tests and development weren’t that hard.

The hard part was agreeing on what it should do and which tests it should pass. Most of that difficulty came from the fact that you couldn’t get everyone to sit down and focus concurrently. Implementations — and thus real focus — were years apart.

Checking Fitness

Getting “unstuck” isn’t just about moving something forward. One major appeal of the standards process is wide review, but for this to work we need ways for things to fail early enough to shift efforts.

Sometimes failure happens after months or even years. That’s frustrating and demoralizing. It’s like standing in a long DMV line, inching forward, only to discover at the end that you’re in the wrong building.

All of this is made worse by the fact that queues keep getting fuller.

Things that help

Interop

The Interop project illustrates the very end of the process, and in many ways the simplest.

Without intervention, each implementer historically built their own priority queue from all possible shortcomings of their browser engine. There’s a huge pool of things to choose from. I’ve written before about how WPT and the dashboard aren’t the best way to view this, but there are currently almost 23k subtests that fail in every browser (or almost 11k that fail and aren’t marked tentative).

Interop coordinates efforts to choose an achievable set of things from this gigantic pool that meet strict criteria: all that’s left is implementation work. It’s been hugely successful because it ensures delivery. It also helps us deliver early when people are excited about common priorities. In those cases, the impact is huge — we can go from mostly words to three interoperable implementations in one year. Amazing.

Still, every year a huge number of things remain in the pool that we can’t afford to take up. The pool keeps growing.

Joint Meetings

The WHATWG has started holding regular joint meetings with groups like OpenUI and CSSWG. This is valuable because it allows the right people to agree on an agenda and discuss directly, rather than leaving issues to sit unnoticed or requiring endless pings for attention.

W3C's TPAC is an annual event with five days of meetings (both W3C and WHATWG), many of them joint with wide-review specialists. These are dedicated times to get a lot of people in the same rooms for extended periods. The availability for hallway conversations also matters a lot: You can corner people there in ways that are much harder when everyone is remote. More progress happens at TPAC than in half the rest of the year combined.

Timely coordination — and investment to make it plausible — is still the single biggest problem we face in standards. I'd love to see us find ways to improve that.

December 04, 2025 05:00 AM

December 02, 2025

Igalia WebKit Team

WebKit Igalia Periodical #49

Update on what happened in WebKit in the week from November 24 to December 1.

The main highlights for this week are the completion of `PlainMonthDay` in Temporal, moving networking access for GstWebRTC to the NetworkProcess, and Xbox Cloud Gaming now working in the GTK and WPE ports.

Cross-Port 🐱

Multimedia 🎥

GStreamer-based multimedia support for WebKit, including (but not limited to) playback, capture, WebAudio, WebCodecs, and WebRTC.

Xbox Cloud Gaming is now usable in WebKitGTK and WPE with the GstWebRTC backend. To make it work, we had to fix the handling of non-spec-compliant ICE candidates and add a WebRTC quirk forcing max-bundle in PeerConnections. Happy cloud gaming!

Support for remote inbound RTP statistics was improved in 303671@main: we now properly report the framesPerSecond and totalDecodeTime metrics, which the Xbox Cloud Gaming service uses to show live stats about the connection and video decoder performance in an overlay.

The GstWebRTC backend now relies on librice for its ICE. The Sans-IO architecture of librice allows us to keep the WebProcess sandboxed and to route WebRTC-related UDP and (eventually) TCP packets using the NetworkProcess. This work landed in 303623@main. The GNOME SDK should also soon ship librice.

Support for seeking in looping videos was fixed in 303539@main.

JavaScriptCore 🐟

The built-in JavaScript/ECMAScript engine for WebKit, also known as JSC or SquirrelFish.

Implemented the valueOf and toPlainDate methods for PlainMonthDay objects. This completes the implementation of Temporal PlainMonthDay objects in JSC!

WebKitGTK 🖥️

The GTK port has gained support for interpreting touch input as pointer events. This matches the behaviour of other browsers by following the corresponding specifications.

WPE WebKit 📟

Fixed an issue that prevented WPE from processing further input events after receiving a secondary mouse button press.

Fixed an issue that caused right mouse button clicks to prevent processing of further pointer events.

WPE Platform API 🧩

New, modern platform API that supersedes usage of libwpe and WPE backends.

We landed a patch to add a new signal in WPEDisplay to notify when the connection to the native display has been lost.

Infrastructure 🏗️

Modernized the CMake modules used to find libtasn1, libsecret, libxkbcommon, libhyphen, and Enchant libraries.

Note that this work removed the support for building against Enchant 1.x, and only version 2 will be supported. The first stable release to require Enchant 2.x will be 2.52.0 due in March 2026. Major Linux and BSD distributions have included Enchant 2 packages for years, and therefore this change is not expected to cause any trouble. The Enchant library is used by the GTK port for spell checking.

Community & Events 🤝

We have published an article detailing our work making MathML interoperable across browser engines! It has live demonstrations and feature tables with our progress on WebKit support.

We have published new blog posts highlighting the most important changes in both WPE WebKit and WebKitGTK 2.50. Enjoy!

That’s all for this week!

by Igalia WebKit Team at December 02, 2025 02:15 PM

Alex Bradbury

QEMU-based instruction execution counting

Although analysing performance by way of instruction counting has obvious limitations, it can be helpful (especially when combined with appropriate analysis scripts) to get rapid feedback on the impact of code generation changes or to explore hypotheses about why code from one compiler might be performing differently from another - for instance, by looking at instruction mix in the most executed translation blocks. In this post we'll look at how to capture the necessary data to perform such an analysis using a QEMU plugin. Future posts will give details of the analysis scripts I've used, and walk through an example or two of putting them to use.

Modifying QEMU

Over the past few years, QEMU's plugin API has developed a fair bit. QEMU includes several plugins, and hotblocks provides almost what we want, but doesn't allow configuring the number of blocks it will print information on. I submitted a small patch series (and submitted it a second time) addressing this and other minor issues found along the way. The series has now been accepted by the maintainer.

To build QEMU with this patch:

git clone https://github.com/qemu/qemu && cd qemu
git checkout v10.1.2
cat - <<'EOF' > hotblocks.patch
index 98404b6885..8ecf033997 100644
--- a/contrib/plugins/hotblocks.c
+++ b/contrib/plugins/hotblocks.c
@@ -73,28 +73,29 @@ static void exec_count_free(gpointer key, gpointer value, gpointer user_data)
 static void plugin_exit(qemu_plugin_id_t id, void *p)
 {
     g_autoptr(GString) report = g_string_new("collected ");
-    GList *counts, *it;
+    GList *counts, *sorted_counts, *it;
     int i;

     g_string_append_printf(report, "%d entries in the hash table\n",
                            g_hash_table_size(hotblocks));
     counts = g_hash_table_get_values(hotblocks);
-    it = g_list_sort_with_data(counts, cmp_exec_count, NULL);
+    sorted_counts = g_list_sort_with_data(counts, cmp_exec_count, NULL);

-    if (it) {
+    if (sorted_counts) {
         g_string_append_printf(report, "pc, tcount, icount, ecount\n");

-        for (i = 0; i < limit && it->next; i++, it = it->next) {
+        for (i = 0, it = sorted_counts; (limit == 0 || i < limit) && it;
+             i++, it = it->next) {
             ExecCount *rec = (ExecCount *) it->data;
             g_string_append_printf(
-                report, "0x%016"PRIx64", %d, %ld, %"PRId64"\n",
+                report, "0x%016"PRIx64", %d, %ld, %"PRIu64"\n",
                 rec->start_addr, rec->trans_count,
                 rec->insns,
                 qemu_plugin_u64_sum(
                     qemu_plugin_scoreboard_u64(rec->exec_count)));
         }

-        g_list_free(it);
+        g_list_free(sorted_counts);
     }

     qemu_plugin_outs(report->str);
@@ -170,6 +171,13 @@ int qemu_plugin_install(qemu_plugin_id_t id, const qemu_info_t *info,
                 fprintf(stderr, "boolean argument parsing failed: %s\n", opt);
                 return -1;
             }
+        } else if (g_strcmp0(tokens[0], "limit") == 0) {
+            char *endptr = NULL;
+            limit = g_ascii_strtoull(tokens[1], &endptr, 10);
+            if (endptr == tokens[1] || *endptr != '\0') {
+                fprintf(stderr, "unsigned integer parsing failed: %s\n", opt);
+                return -1;
+            }
         } else {
             fprintf(stderr, "option parsing failed: %s\n", opt);
             return -1;
diff --git a/docs/about/emulation.rst b/docs/about/emulation.rst
index 4a7d1f4178..e8793b0f9c 100644
--- a/docs/about/emulation.rst
+++ b/docs/about/emulation.rst
@@ -463,6 +463,18 @@ Example::
   0x000000004002b0, 1, 4, 66087
   ...

+Behaviour can be tweaked with the following arguments:
+
+.. list-table:: Hot Blocks plugin arguments
+  :widths: 20 80
+  :header-rows: 1
+
+  * - Option
+    - Description
+  * - inline=true|false
+    - Use faster inline addition of a single counter.
+  * - limit=N
+    - The number of blocks to be printed. (Default: N = 20, use 0 for no limit).

 Hot Pages
 .........
EOF
patch -p1 < hotblocks.patch
./configure --prefix=$(pwd)/inst --target-list="riscv32-linux-user riscv64-linux-user"
make -j$(nproc)
cd ..

Using this plugin to capture statistics from running a binary under qemu-user

Assuming you have an appropriate sysroot, you can run a binary and have the execution information emitted to stderr by doing something like:

QEMUDIR=$HOME/qemu/build
SYSROOT=$HOME/rvsysroot
$QEMUDIR/qemu-riscv64 \
  -L $SYSROOT \
  -plugin $QEMUDIR/contrib/plugins/libhotblocks.so,limit=0,inline=on \
  -d plugin,nochain \
  my_rv64_binary

This produces output like:

collected 2229 entries in the hash table
pc, tcount, icount, ecount
0x00007fffee7012ba, 1, 1, 3737
0x00007fffee7012be, 1, 3, 3737
0x00007ffff741e738, 1, 23, 1074
0x00007fffee71bb38, 1, 5, 884
0x00007ffff741bb2e, 1, 11, 662
...

This listing indicates the address of the translation block, the number of times it's been translated, the number of instructions it contains, and the number of times it was executed. Note that a translation block is not the same as a basic block in the compiler. A translation block can span multiple basic blocks in the case of fallthrough, and this can also mean an instruction may show up in multiple translation blocks.
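As a quick sanity check, this output is already enough to compute a total dynamic instruction count by summing icount × ecount over all blocks. Here's a minimal Python sketch of that (the per-benchmark comparison script later in this post does the same thing across a whole build tree):

import sys

# Sum icount * ecount over every translation block in one hotblocks dump,
# giving the total number of instructions executed. Assumes the
# "pc, tcount, icount, ecount" format shown above.
total = 0
with open(sys.argv[1]) as f:
    for line in f:
        parts = [p.strip() for p in line.split(",")]
        if len(parts) == 4 and parts[2].isdigit() and parts[3].isdigit():
            total += int(parts[2]) * int(parts[3])
print(f"{total} instructions executed")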

At least for my use cases, I need something a bit more involved than this. In order to add collection of these statistics to an existing benchmark harness I need a wrapper script that transparently collects these statistics to a file. It's also helpful to capture the runtime address of executable mappings for loaded libraries, allowing translation blocks to be attributed easily to either the binary itself or libc, libm etc. We have gdb connect to QEMU's gdbserver in order to dump those mappings. Do ensure you're using a recent version of QEMU (the version suggested in the patch application instructions is definitely good) for this as I wasted quite some time running into a bug with file descriptor numbers that caused odd breakage.

This qemu-forwarder.sh script will capture the plugin's output in a .qemu_out file and the mappings in a .map file, both of which can be later consumed by a detailed analysis script.

#!/bin/sh
QEMUDIR=$HOME/qemu/build
SYSROOT=$HOME/rvsysroot
QEMU="$QEMUDIR/qemu-riscv64 \
  -L $SYSROOT \
  -plugin $QEMUDIR/contrib/plugins/libhotblocks.so,limit=0,inline=on \
  -d plugin,nochain"

SUFFIX=""
if [ -e "$1.qemu_out" ]; then
  NUM=1
  while [ -e "$1.qemu_out.$NUM" ]; do
    NUM=$((NUM + 1))
  done
  SUFFIX=".$NUM"
fi

GDB_SOCK=$(mktemp -u)
setarch $(uname -m) -R $QEMU -g $GDB_SOCK -D $1.qemu_out$SUFFIX "$@" &
QEMU_PID=$!

RETRY_COUNT=0
while ! [ -e "$GDB_SOCK" ]; do
  RETRY_COUNT=$((RETRY_COUNT + 1))
  if [ $RETRY_COUNT -eq 10 ]; then
    echo "Timed out waiting for gdb socket to be created"
    exit 1
  fi
  sleep 0.1
  if ! kill -0 $QEMU_PID 2>/dev/null; then
    echo "QEMU process died before gdb socket was created"
    wait $QEMU_PID
    exit $?
  fi
done

gdb -batch \
  -ex "set pagination off" \
  -ex "target remote $GDB_SOCK" \
  -ex "break main" \
  -ex "continue" \
  -ex "set logging file $1.map$SUFFIX" \
  -ex "set logging enabled on" \
  -ex "info proc mappings" \
  -ex "detach" > /dev/null 2>&1
wait $QEMU_PID

The above will work under LLVM's lit, though you will need a recent enough version that doesn't strip HOME from the environment (or else edit the script accordingly). It also produces output in sequentially numbered files, again motivated by the desire to run this script under lit as used by llvm-test-suite's SPEC configuration, which can involve multiple invocations of the same binary for a given benchmark (e.g. 500.perlbench_r).

Analysing the output

A follow-up post will introduce the scripting I've built around this.

Recording and analysing results from running SPEC

Assuming you have qemu-forwarder.sh, in your llvm-test-suite directory:

CONF=clang-head-test
CLANG_BIN_DIR=$HOME/llvm-project/build/release/bin
CFLAGS="-march=rv64gc_zba_zbb_zbs"
cat - <<EOF > $CONF.cmake
set(CMAKE_SYSTEM_NAME Linux)

set(CMAKE_SYSROOT $HOME/rvsysroot)

set(CMAKE_C_COMPILER $CLANG_BIN_DIR/clang)
set(CMAKE_CXX_COMPILER $CLANG_BIN_DIR/clang++)

set(CMAKE_C_COMPILER_TARGET riscv64-linux-gnu)
set(CMAKE_CXX_COMPILER_TARGET riscv64-linux-gnu)
set(CMAKE_C_FLAGS_INIT "$CFLAGS")
set(CMAKE_CXX_FLAGS_INIT "$CFLAGS")

set(CMAKE_LINKER_TYPE LLD)

set(CMAKE_FIND_ROOT_PATH_MODE_PROGRAM NEVER)
set(CMAKE_FIND_ROOT_PATH_MODE_LIBRARY ONLY)
set(CMAKE_FIND_ROOT_PATH_MODE_INCLUDE ONLY)
set(CMAKE_FIND_ROOT_PATH_MODE_PACKAGE ONLY)
EOF
cmake -G Ninja \
  -B build.$CONF \
  --toolchain=$CONF.cmake \
  -DTEST_SUITE_SPEC2017_ROOT=~/cpu2017 \
  -DTEST_SUITE_SUBDIRS=External/SPEC \
  -DTEST_SUITE_COLLECT_CODE_SIZE=OFF \
  -DTEST_SUITE_COLLECT_COMPILE_TIME=OFF \
  -DTEST_SUITE_USER_MODE_EMULATION=ON \
  -DTEST_SUITE_RUN_UNDER=$(pwd)/qemu-forwarder.sh
cmake --build build.$CONF
$CLANG_BIN_DIR/llvm-lit -v --filter-out='.+_s|specrand' build.$CONF

The 526.blender_r test takes twice as long as the others, so you may wish to skip it by instead executing something like:

$CLANG_BIN_DIR/llvm-lit -v --filter-out='.+_s|specrand|blender' build.$CONF

If you want to re-run tests you must delete the previous .qemu_out and .map files, which can be done with:

[ -n "build.$CONF" ] && find "build.$CONF" -type f -name "*.qemu_out*" -exec sh -c '
    for q_file do
        base_path="${q_file%.qemu_out*}"
        rm -f "$q_file" "${base_path}.map"*
    done
' sh {} +

In order to compare two SPEC builds, you can use something like the following hacky script. Using the captured translation block execution data to generate a plain executed instruction count is overkill, as the example tests/tcg/plugin/insn.c can easily dump this for you directly. But by collecting the data upfront, you can dive right into a more detailed analysis when you see a surprising difference in executed instruction counts, without rerunning the binary.

#!/usr/bin/env python3

from pathlib import Path
from collections import defaultdict
import sys

def collect_totals(root_dir):
    totals = defaultdict(int)
    root_path = Path(root_dir)/"External"

    for file_path in root_path.rglob("*.qemu_out*"):
        benchmark_name = file_path.parts[4]

        try:
            with file_path.open("r") as f:
                file_total = 0
                for line in f:
                    parts = line.strip().split(',')
                    # Only sum lines that match the expected format.
                    if len(parts) == 4 and parts[2].strip().isdigit():
                        # icount * ecount.
                        file_total += int(parts[2]) * int(parts[3])
                totals[benchmark_name] += file_total
        except Exception as e:
            print(f"Error reading {file_path}: {e}")

    return totals

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: spec-compare-helper <dir_a> <dir_b>")
        sys.exit(1)

    dir_a, dir_b = sys.argv[1], sys.argv[2]
    totals_a = collect_totals(dir_a)
    totals_b = collect_totals(dir_b)

    benchmarks = sorted(set(totals_a.keys()) | set(totals_b.keys()))

    print(f"{'Benchmark':<20} {'DirA':>15} {'DirB':>15} {'Diff (%)':>10}")
    print("=" * 60)

    for benchmark in benchmarks:
        val_a = totals_a.get(benchmark, 0)
        val_b = totals_b.get(benchmark, 0)
        diff_pct = ((val_b - val_a) / val_a * 100) if val_a else float("inf")

        print(f"{benchmark:<20} {val_a:>15} {val_b:>15} {diff_pct:>9.2f}%")

Which produces output looking something like this:

Benchmark                       DirA            DirB   Diff (%)
============================================================
500.perlbench_r         180245097594    182078714777      1.02%
502.gcc_r               220874510659    219647717585     -0.56%
505.mcf_r               131589945456    134271153130      2.04%
508.namd_r              220648061019    216682202888     -1.80%
510.parest_r            291341820355    291844973715      0.17%
511.povray_r             31911866906     31103201809     -2.53%
519.lbm_r                94166321698     86910581403     -7.71%
520.omnetpp_r           138002605692    137676301622     -0.24%
523.xalancbmk_r         283566182007    284735075518      0.41%
525.x264_r              380165035845    379862173371     -0.08%
526.blender_r           660528270138    659361380750     -0.18%
531.deepsjeng_r         355058534962    349621355155     -1.53%
538.imagick_r           238573643488    238560676372     -0.01%
541.leela_r             421886351310    405423320484     -3.90%
544.nab_r               415595728542    391443973852     -5.81%
557.xz_r                132548718317    130229753780     -1.75%

It's worth highlighting that as we're running this under user-mode emulation, the dynamic instruction count naturally never counts any instructions on the kernel side that you would see if profiling a real system.


Article changelog
  • 2025-12-15: Note that the qemu patches have now been accepted in the maintainer's tree.
  • 2025-12-02: Initial publication date.

December 02, 2025 12:00 PM

Manuel Rego

You can now easily customize find-in-page with the new ::search-text pseudo-element, that is shipping in Chromium 144.0.7547. 🚀

Screenshot of the following CSS as an example of how to customize find-in-page: :root::search-text { background: yellow; } :root::search-text:current { color: white; background: olive; text-decoration: underline; } aside::search-text { background: magenta; } aside::search-text:current { background: darkmagenta; text-decoration: underline; }

Find more details on the blog post by Stephen Chenney. Thanks to Bloomberg for sponsoring this work.

December 02, 2025 12:00 AM

December 01, 2025

Alex Bradbury

Minipost: Olmo 3 training cost

Recently I jotted down some notes on LLM inference vs training costs for DeepSeek and I wanted to add on an additional datapoint for training cost based on the recently released Olmo3 models from the Allen Institute for AI ("Ai2"). The model family has 7B and 32B parameter models, with 'Think' variants available for 7B and 32B but so far only a 7B 'Instruct' non-reasoning version (but watch this space). What's particularly interesting about the Olmo models to me is that beyond providing open weights, the training scripts and datasets are openly available as well.

Going by the reported benchmarks at least it's competitive with less open models at a similar size, and importantly they've increased the supported context length from the rather limiting 4k tokens supported by the Olmo 2 series to a much more usable 64k tokens. Given the relatively small size these models are less capable than relatively chunky models like DeepSeek R1/V3.x or Kimi K2, but I've been impressed by the capability of 32B dense models for basic queries, and from my non-scientific testing both the 32B and 7B Olmo3 variants seem to do a reasonable job of summarising things like discussion threads. You can experiment yourself at playground.allenai.org.

Energy required for training Olmo 3

One of the neat things about this level of openness is that it should act as a floor in terms of performance for future models of this size class assuming they're appropriately funded and don't take too many risks chasing novelty. Rerunning the training process with an updated dataset and some minor tweaks is something you could imagine doing on some regular cadence, ideally as a shared endeavour. Imagining this effort in the future, how much energy is required? The initial version of the detailed Olmo 3 technical report unfortunately has little to say on this. We can get a back of the envelope figure in terms of GPU hours for pre-training based on the reported 7700 tokens per second per GPU for the 7B base model and 1900 tokens per second for the 32B base model and the ~6T token dataset. But even better than that, we can just ask the Ai2 folks (sometimes the internet really does work wonderfully!). After asking on their public Discord I was rapidly furnished with this helpful answer:

For some detailed numbers, we measured power consumption throughout training, along with total GPU hours. We used ~234k H100 hours to pretrain the 7B, and ~1.05m H100 hours to pretrain the 32B. 1900 TPS is generally what our trainer is capable of, but with restarts, evaluations, checkpointing, and occasional network issues, the 32B took 1.05m hours. We measured an average power consumption of ~621W while pretraining the 7B and ~649W while pretraining the 32B, and this means that our GPUs consumed ~146MWh for the 7B and ~681MWh for the 32B. We'll include more detailed GPU hour information in a future version of the paper, including for post-training!

Ai2 Olmo 3 team on their Discord.

So that's 0.681 GWh in GPU power draw for pretraining the 32B model and 0.146 GWh in GPU power draw for pretraining the 7B model. As noted in the quote, this is inclusive of restarts, checkpointing etc. But perhaps won't include previous early stage experimentation. I look forward to an updated technical report with full details, but pretraining should cover the bulk of the compute requirements (as a reference point, today's DeepSeek V3.2 paper found it notable that the post-training compute budget exceeded 10% of the pretraining cost).

The 0.681 GWh figure doesn't account for full system power and cooling cost. I'd love to be corrected, but I believe a 1.5x-2x multiplier would be an assumption towards the upper end. But for the sake of this yardstick comparison let's look at a few comparisons based on the reported number:

  • 0.681 GWh of electricity would cost about £180k at UK residential rates (capped at 26.35p per kWh currently). Substantially less in the USA.
  • A larger leisure centre with a pool consumes ~2.5 GWh of energy per year. I don't know if the idea of a "leisure centre" translates outside of the UK, but basically it's a swimming pool plus gym, squash/tennis courts etc.
    • The linked page claims ~2 GWh of energy in gas and 0.5 GWh in electricity. For the gas, to compare like with like you'd need to consider the source of energy for the electricity used for Olmo training.
  • 0.681 GWh is ~0.11% of LHC's annual 600 GWh energy consumption or ~0.05% of CERN's annual consumption.
  • We can estimate a Boeing 787-9 flying from London Heathrow to SFO consumes jet fuel containing ~0.58 GWh of energy.
    • Calculated with 8638km distance, 5.62kg fuel/km (taking the most economic 787-9 long haul figure from this table on Wikipedia and 11.95kWh/kg specific energy of jet fuel).
    • This is a yardstick rather than a direct comparison. A direct comparison to the GWh of electricity used for the GPU compute of the LLM would depend on the source of the electricity. If it was e.g. gas rather than solar/hydro/wind then you'd want to compare the number of GWh consumed to create that electricity which would of course be higher.
    • As a further point of reference, FlightAware indicates 5 separate direct LHR to SFO flights scheduled per day.

More efficient LLM training

We can hope for new breakthroughs, more efficient hardware, better datasets and so on. But here is some work I noticed in the area. Fair warning: this isn't my field, and we have to recognise applying a research result to a production training run is sure to have challenges even if the research suggests the trade-offs are worthwhile. So consider this vague gesticulating about seemingly interesting work that is going on and find someone who knows what they're talking about to confirm the degree to which it is interesting/viable.

  • Mixture of Experts (MoE) models are substantially cheaper to train which is one reason the industry has moved in that direction. The next Ai2 Olmo model is expected to be MoE. The Qwen blog has a graph comparing the relative training cost in GPU hours of the dense Qwen3-32B vs Qwen3-30B-A3b vs Qwen3-Next-80B-A3B, where the latter makes further architectural changes, reporting a 10.7x reduction. ~2.5x of that is going to come from the reduced corpus size (15T tokens down from 36T), but that still leaves plenty of improvement from other factors.
  • Maybe it will be shown viable to train in lower precision such as MXFP8 or even NVFP4, which would allow much more throughput for a similar energy budget. Nvidia have worked to demonstrate this can be effective for both formats (see also this work from MIT).
  • Also from Nvidia, Nemotron Elastic showed a model architecture that allows deriving smaller models without doing separate pre-training runs.

Finally, the cheapest way to train an LLM from scratch is...to find a way to avoid the need to. For models like Olmo 3 that release the base model and checkpoints, people can apply their own post-training or perform additional pre-training.

Bonus comparison point: Apertus

Apertus is a Swiss project to produce an open LLM, with 70B and 8B models released so far. Their full tech report notes the following "Once a production environment has been set up, we estimate that the model can be realistically trained in approximately 90 days on 4096 GPUs, accounting for overheads. If we assume 560 W power usage per Grace-Hopper module in this period, below the set power limit of 660 W, we can estimate 5 GWh power usage for the compute of the pretraining run."


Article changelog
  • 2025-12-04: Add link to "Four Over Six" NVFP4 training paper.
  • 2025-12-02: Added clarifying note about energy via gas in the leisure centre comparison.
  • 2025-12-01: Initial publication date.

December 01, 2025 12:00 PM

November 30, 2025

Alex Bradbury

Minipost: Benchmarking the Hetzner AX102 vs CCX53

I recently had reason to do a quick comparison of the performance of the Hetzner AX102 dedicated server and the high-end 'dedicated' CCX53 VPS on Hetzner Cloud and thought I may as well write up the results for posterity. I'm incapable of starting a post without some kind of disclaimer so here comes the one for this post: naturally the two products have major differences in terms of flexibility (spin-up/down at will, vs pay a small setup fee and endure a wait time depending on hardware availability). So depending on your use case, your requirements with respect to that flexibility may override any cost differential.

Specs

All costs are exclusive of VAT, assuming the lowest cost data center location, and inclusive of IPv4 address.

AX102:

  • 16 core Ryzen 9 7950X3D (32 threads)
  • 128GB DDR5 RAM
  • 2 x 1.92TB NVMe
  • 104 EUR/month, 39 EUR one-off setup fee.

CCX53

  • Unknown AMD CPU exposing 32vCPU (physical cores? threads?)
  • 128GB RAM
  • 600GB NVMe
  • 192.49 EUR/month maximum charge. 0.3085 EUR per hour (if you keep the same VPS active over the month it won't exceed the monthly price cap, so you effectively get a small discount on the per-hour cost).

Benchmark

Building Clang+LLVM+LLD, everyone's favourite workload! Both systems are running an up to date Arch Linux (more details on setting this up on the CCX53 in the appendix below) with clang 21.1.6. The dedicated machine has the advantage of RAID 0 across the two SSDs, but also has encrypted rootfs configured. I didn't bother to set that up for the CCX53 VPS.

sudo pacman -Syu --needed clang lld cmake ninja wget
LLVM_VER=21.1.6
wget https://github.com/llvm/llvm-project/releases/download/llvmorg-${LLVM_VER}/llvm-project-${LLVM_VER}.src.tar.xz
tar -xvf llvm-project-${LLVM_VER}.src.tar.xz
cd llvm-project-${LLVM_VER}.src

cmake -G Ninja \
  -DLLVM_ENABLE_PROJECTS='clang;lld' \
  -DLLVM_TARGETS_TO_BUILD="all" \
  -DLLVM_CCACHE_BUILD=OFF \
  -DCMAKE_C_COMPILER=clang \
  -DCMAKE_CXX_COMPILER=clang++ \
  -DLLVM_ENABLE_LLD=ON \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLVM_ENABLE_ASSERTIONS=ON \
  -S llvm \
  -B build
time cmake --build build

printf "### Version info ###\n"
clang --version | head -n 1

On both machines, ninja shows 5575 build steps.

Results:

  • AX102
    • 10m27s (627s)
  • CCX53
    • 14m11s (851s, about 1.36x the AX102)

Running the clang and LLVM tests with ./build/bin/llvm-lit -s --order=lexical llvm/test clang/test (which shows 9402 tests) gives:

  • AX102
    • 3m39s (219s)
  • CCX53
    • 4m28s (268s, about 1.24x the AX102)

I ran these multiple times, and in the case of the CCX53 across two different VMs in different regions and saw only a few percentage points variance.

Focusing on the results for building clang/llvm/lld, let's figure out the cost of 1000 from-scratch builds. Not so much because it's a representative workload, but because it gives an easy-to-compare metric that captures both the difference in price and in performance. So, calculating time_per_build_in_hours * 1000 * cost_per_hour (also written out as a small script after the list):

  • AX102
    • (626.6 / 3600) * 1000 * (104/720) = 25.14 EUR
    • Or if you include the setup fee and assume it's amortised over 12 months:
      • (626.6/3600) * 1000 * ((104 + (39/12))/720) = 25.93 EUR
  • CCX53
    • (850.6 / 3600) * 1000 * (192.49/720) = 63.17 EUR
    • Or using the 0.3085 EUR/hr price which you would pay if you didn't run for the whole month:
      • (850.6 / 3600) * 1000 * 0.3085 = 72.89 EUR
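The same calculation as a tiny Python helper, using the build times measured above and the monthly prices divided by 720 hours:

# Cost of 1000 from-scratch clang/llvm/lld builds:
# time_per_build_in_hours * 1000 * cost_per_hour.
def cost_per_1000_builds(build_seconds: float, eur_per_hour: float) -> float:
    return (build_seconds / 3600) * 1000 * eur_per_hour

print(f"AX102:               {cost_per_1000_builds(626.6, 104 / 720):.2f} EUR")
print(f"CCX53 (monthly cap): {cost_per_1000_builds(850.6, 192.49 / 720):.2f} EUR")
print(f"CCX53 (hourly rate): {cost_per_1000_builds(850.6, 0.3085):.2f} EUR")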

Appendix: CCX53 Arch Linux setup

This could be scripted, but I just created the VPS via their web UI. Then after it was provisioned, used that web UI to have it boot into a rescue system. Then do an Arch bootstrap that roughly mirrors the one I use on a dedicated build machine except that we don't bother with encrypting the rootfs. The CCX* server types at least use UEFI so we can keep using efistub for boot.

First get a bootstrap environment and enter it:

wget http://mirror.hetzner.de/archlinux/iso/latest/archlinux-bootstrap-x86_64.tar.zst
tar -xvf archlinux-bootstrap-x86_64.tar.zst --numeric-owner
sed -i '1s;^;Server=https://mirror.hetzner.de/archlinux/$repo/os/$arch\n\n;' root.x86_64/etc/pacman.d/mirrorlist
mount --bind root.x86_64/ root.x86_64/ # See <https://bugs.archlinux.org/task/46169>
printf "About to enter bootstrap chroot\n===============================\n"
./root.x86_64/bin/arch-chroot root.x86_64/

Now set info that will be used throughout the process:

export NEW_HOST_NAME=archvps
export PUBLIC_SSH_KEY="ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIOfpPQ1j+XLsapAhONAQmvu6TZGT5y8jeziM4Vio1NrA asb@plurp"
export NEW_USER=asb

And now proceed to set up the disks, create filesystems, perform an initial bootstrap and chroot into the new rootfs:

pacman-key --init
pacman-key --populate archlinux
pacman -Sy --noconfirm xfsprogs dosfstools

sfdisk /dev/sda <<EOF
label: gpt

start=1MiB, size=255MiB, type=uefi
start=256MiB, type=linux
EOF

mkfs.fat -F32 /dev/sda1
mkfs.xfs /dev/sda2

mount /dev/sda2 /mnt
mkdir /mnt/boot
mount /dev/sda1 /mnt/boot
pacstrap /mnt base linux linux-firmware efibootmgr \
  xfsprogs dosfstools \
   python3 \
  openssh sudo net-tools git man-db man-pages vim
genfstab -U /mnt >> /mnt/etc/fstab

printf "About to enter newrootfs chroot\n===============================\n"
arch-chroot /mnt

Do final configuration from within the chroot:

sed /etc/locale.gen -i -e "s/^\#en_GB.UTF-8 UTF-8.*/en_GB.UTF-8 UTF-8/"
locale-gen
# Ignore "System has not been booted with systemd" and "Failed to connect to bus" error for next command.
systemd-firstboot --locale=en_GB.UTF-8 --timezone=UTC --hostname="$NEW_HOST_NAME"
ln -s /dev/null /etc/udev/rules.d/80-net-setup-link.rules # disable persistent network names

# No longer need to disable large fallback image as Arch stopped generating it
# by default

printf "efibootmgr before changes:\n==========================\n"
efibootmgr -u
# Set up efistub
efibootmgr \
  --disk /dev/sda \
  --part 1 \
  --create \
  --label 'Arch Linux' \
  --loader /vmlinuz-linux \
  --unicode "root=/dev/sda2 rw initrd=\initramfs-linux.img" \
  --verbose
printf "efibootmgr after changes:\n=========================\n"
efibootmgr -u

mkswap --size=8G --file /swapfile
cat - <<EOF > /etc/systemd/system/swapfile.swap
[Unit]
Description=Swap file

[Swap]
What=/swapfile

[Install]
WantedBy=multi-user.target
EOF
systemctl enable swapfile.swap

cat - <<EOF > /etc/systemd/network/10-eth0.network
[Match]
Name=eth0

[Network]
DHCP=yes
Address=$(ip -6 addr show dev eth0 scope global | grep "scope global" | cut -d' ' -f6)
Gateway=$(ip route show | head -n 1 | cut -d' ' -f 3)
Gateway=fe80::1
EOF
systemctl enable systemd-networkd.service systemd-resolved.service systemd-timesyncd.service
printf "PasswordAuthentication no\n" > /etc/ssh/sshd_config.d/20-no-password-auth.conf
systemctl enable sshd.service
useradd -m -g users -G wheel -s /bin/bash "$NEW_USER"
usermod --pass='!' root # disable root login
chmod +w /etc/sudoers
printf "%%wheel ALL=(ALL) ALL\n" >> /etc/sudoers
chmod -w /etc/sudoers
mkdir "/home/$NEW_USER/.ssh"
printf "%s\n" "$PUBLIC_SSH_KEY" > "/home/$NEW_USER/.ssh/authorized_keys"
chmod 700 "/home/$NEW_USER/.ssh"
chmod 600 "/home/$NEW_USER/.ssh/authorized_keys"
chown -R "$NEW_USER:users" "/home/$NEW_USER/.ssh"

Now set password:

passwd "$NEW_USER"

Then ctrl-d twice and set a symlink for resolv.conf:

ln -sf ../run/systemd/resolve/stub-resolv.conf root.x86_64/mnt/etc/resolv.conf

Finally, reboot.

Remember to ssh-keygen -R $THE_IP_ADDRESS so you don't get ssh host verification errors.


Article changelog
  • 2025-11-30: Initial publication date.

November 30, 2025 12:00 PM

November 29, 2025

Alex Bradbury

Minipost: LLM inference vs training costs for DeepSeek

Tl;dr: Based on published data from DeepSeek, we can estimate it takes something like ~70 days of inference traffic (served by DeepSeek themselves, ignoring any other providers) to match the GPU hours used for the final training run for V3 and R1.

Simon Willison recently reshared some figures on inference costs for LLMs. I couldn't agree more with the comment further down that thread "The big AI labs continue to be infuriatingly opaque about the actual figures for their total electricity and water consumption".

A number of responses wonder about the cost of training. If you accept the reported figures for serving a query, what impact does it have if you amortise the energy spent training the model over the served queries? Mistral did this for their lifecycle analysis but they grouped together "training and inference" and kept confidential the ratio of energy for training vs inference by reporting a figure that combined the training cost with 18 months of usage. The thread reminded me of another datapoint available for DeepSeek that seemed worth writing up. I think this gives some helpful intuition for the amortised cost of training for a widely used model of that size, but to state the obvious any attempt to apply that intuition to other models is totally reliant on how widely used it is.

DeepSeek have published figures both on training and on inference for DeepSeek's website and API users. I will attempt to consistently refer to the figure for training as "final run training cost" to reflect the fact the number of GPU hours used in experimentation and failed attempts isn't reported. For final run training for DeepSeek-R1:

Now for inference, back in February DeepSeek wrote up details of their inference system giving details of cost of serving, profit margin, and load over a 24h period. So yes, we're extrapolating from this datapoint and assuming it's representative. Given the worldwide inference of DeepSeek R1/V3 is surely much larger (being openly licensed there are many vendors who are serving it), I'm not overly worried about this aspect. Their reported average inference serving infrastructure occupancy is 226.75 nodes (each node containing 8 H800 GPUs), meaning 43536 H800 GPU hours per day. At that rate, it will take ~67.5 days of traffic for the same number of H800 GPU hours to be used for inference as for the final training run.
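Here's that arithmetic as a tiny Python sketch. The occupancy figure comes from the data above; the final-run training GPU-hours value is not restated in this post, so it's left as a parameter, with ~2.94M GPU-hours used purely as an illustration since that's roughly what the ~67.5 day figure implies:

# Average inference occupancy: 226.75 nodes of 8 H800 GPUs each, over 24 hours.
inference_gpu_hours_per_day = 226.75 * 8 * 24
print(f"~{inference_gpu_hours_per_day:.0f} H800 GPU-hours of inference per day")

def days_to_match(training_gpu_hours: float) -> float:
    """Days of DeepSeek-served inference needed to equal a training budget."""
    return training_gpu_hours / inference_gpu_hours_per_day

# ~2.94M GPU-hours is an illustrative value implied by the ~67.5 days above,
# not a figure reported in this post.
print(f"{days_to_match(2.94e6):.1f} days to match a ~2.94M GPU-hour training run")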

All this to say, for a widely used model of DeepSeek R1 scale when looking at the cost of inference, accounting for the amortised final run training cost is more likely to be a multiplier of 2x or less rather than something much larger. In terms of energy, this does assume that the power draw of the H800 GPUs while running inference is similar to the draw during training. And to underline again, the reported training cost surely doesn't include experimentation, aborted runs etc.


Article changelog
  • 2025-11-29: Initial publication date.

November 29, 2025 12:00 PM

November 27, 2025

Eri Pazos

interop and mathml core


Interoperability makes the web better for everyone, allowing users to have a great experience regardless of their choice of browser. We have been working on making MathML Core interoperable across browser engines as part of an agreement with the Sovereign Tech Fund. There are some exciting developments and new features!

Interoperability makes the web better for everyone, allowing users to have a great experience regardless of their choice of browser. We have many standards that shape how the internet should work, drafted from consensus between different engine makers and third parties. While having specs on how everything should function is great, we still need to align the different browser implementations. This can be tricky as all of them have their peculiarities, and not all browsers agree on what is a priority for them. The goal of the Interop program is to select a few important features that all engines will prioritize, so users and editors can finally benefit from them.

A few months ago I joined Igalia's web platform team (and I'm really happy about it!). Thanks to an agreement with the Sovereign Tech Fund, this year we will be working on MathML and other important Interop areas.

This post contains MathML examples. Each formula is represented twice. Your browser renders the left one from the HTML code, while on the right there is a pre-printed SVG as a reference of how it should look. Keep in mind that most of these features are either experimental or have just landed, so you may need the latest version of a browser to view them correctly.

A bit of history

MathML was first published in 1998, and it grew to be a gigantic project that sought to define how mathematical notation should be rendered. However, due to its complexity, the implementations of the browser engines were wildly different and incomplete. This meant that editors could not rely on it, since users would see very different content depending on what they were browsing with.

<math>
  <msubsup>
    <mo>∫</mo>
    <mn>0</mn>
    <mn>1</mn>
  </msubsup>
  <mrow>
    <msup>
      <mi>x</mi>
      <mn>2</mn>
    </msup>
    <mo>+</mo>
    <mn>1</mn>
  </mrow>
</math>
An integral from 0 to 1 of x squared plus one

This is why MathML Core was born. It is a small subset of MathML 3 that is feasible to implement in browsers. It is based on the parts of the specification that are used in practice, adding important implementation details and testing.

To illustrate why this is important, Chromium had support for some parts of MathML when it was forked from WebKit. However, it proved to be very difficult to maintain and complete, so it was removed in 2013. My colleague Frédéric Wang led the effort to create a new implementation based on MathML Core, which was shipped in 2023, a huge milestone for the standard.

We are at a very exciting moment in MathML's history, since all three major browser engines now have overlapping support. However, there is still work to be done to align the different implementations so they follow the MathML Core specification. The goal is that one could write formulas on a website and have them look the same everywhere (like Wikipedia, which is now transitioning to native MathML instead of prerendered SVGs).

So, what have we been working on?

RTL mirroring

Some scripts are written from right to left, including Arabic. Browsers should be able to correctly render text and math in either direction, making use of the Unicode BiDi specification and the rtlm font feature. However, the existing implementations either didn't support mirroring or had hacky behaviour that didn't work correctly for all cases. Read this explainer that Frédéric made for a great visualization of the differences.

<link rel="stylesheet" href="https://fred-wang.github.io/MathFonts/XITS/mathfonts.css"/>

<math>
  <mrow>
    <mo>{</mo>
    <mfrac>
      <mn>5</mn>
      <mn>6</mn>
    </mfrac>
    <mo>)</mo>
  </mrow>
  <msqrt>
    <mfrac>
      <mn>3</mn>
      <mn>4</mn>
    </mfrac>
  </msqrt>
  <msub displaystyle="true">
    <mo>∲</mo>
    <mi>C</mi>
  </msub>
</math>

<math dir="rtl">
  <mrow>
    <mo>{</mo>
    <mfrac>
      <mn>٥</mn>
      <mn>٦</mn>
    </mfrac>
    <mo>)</mo>
  </mrow>
  <msqrt>
    <mfrac>
      <mn>٣</mn>
      <mn>٤</mn>
    </mfrac>
  </msqrt>
  <msub displaystyle="true">
    <mo>∲</mo>
    <mi>ج</mi>
  </msub>
</math>
{ 5 6 ) 3 4 ∲ C { ٥ ٦ ) ٣ ٤ ∲ ج
A series of math formulas, first from left to right, then from right to left

There are two cases when it comes to mirroring. If there is a corresponding mirrored character (e.g. opening parenthesis to closing parenthesis), it is called character-level mirroring or Unicode BiDi, and the browser just needs to swap one character for the other. Sadly, this doesn't apply to every operator.

Take the clockwise contour integral. If we just mirror the symbol by applying a reflection symmetry about a vertical line, the arrow suddenly points in the other direction, making it counterclockwise. This changes the meaning of the formula!

Three clockwise integrals: left to right, incorrectly mirrored (arrow pointing to the other side), and right to left

To avoid this, the rtlm font feature can use glyph-level mirroring to provide a different set of correctly mirrored glyphs. Glyphs, plural, since a math symbol can have different size variants to accommodate contents of different sizes. Not only that: when the variants are not enough, there are glyphs for assembling arbitrarily long operators.

<link rel="stylesheet" href="https://fred-wang.github.io/MathFonts/XITS/mathfonts.css"/>

<math>
  <msqrt>
    <mspace height="0.8em" width="0.8em" style="background: tomato"></mspace>
  </msqrt>
  <msqrt>
     <mspace height="1.5em" width="0.8em" style="background: gold"></mspace>
  </msqrt>
  <msqrt>
    <mspace height="2.5em" width="0.8em" style="background: mediumseagreen"></mspace>
  </msqrt>
  <msqrt>
    <mspace height="4.5em" width="0.8em" style="background: cornflowerblue"></mspace>
  </msqrt>
</math>

<math dir="rtl">
  <msqrt>
    <mspace height="0.8em" width="0.8em" style="background: tomato"></mspace>
  </msqrt>
  <msqrt>
     <mspace height="1.5em" width="0.8em" style="background: gold"></mspace>
  </msqrt>
  <msqrt>
    <mspace height="2.5em" width="0.8em" style="background: mediumseagreen"></mspace>
  </msqrt>
  <msqrt>
    <mspace height="4.5em" width="0.8em" style="background: cornflowerblue"></mspace>
  </msqrt>
</math>

A series of square roots, each taller than the last. First from left to right, then from right to left

No browser engine supported glyph-level mirroring for MathML operators, so we had to implement it in all of them. Thankfully HarfBuzz, the text shaping library used by Chromium and Firefox, already supported it. WebKit support is still a work in progress, since different ports use different font backends, which adds complexity. As for character-level mirroring, Chromium and WebKit did it right, but Firefox applied reflection symmetry instead of replacing the character with its correct mirrored pair. The changes in Firefox and Chromium are now stable and ready to be used!

Feature Firefox WebKit Chromium
Character-level mirroring (BiDi) ✅✨ ✅ ✅
Glyph-level mirroring (rtlm) ✅✨ 🚧 ✅✨

math-shift and math-depth

Details are important, especially when rendering complex and layered formulas. One may think that a few pixels do not make that much of a difference. However, when you have multiple levels of nesting, offsets, and multiple elements, a slight change can make everything look ugly at best, wrong at worst.

Enter math-shift: compact. Look at this example from the MathML Core spec:

<math display="block">
  <msqrt>
    <msup>
      <mi>x</mi>
      <mn style="color: mediumseagreen">2</mn>
    </msup>
  </msqrt>
  <mo>≠</mo>
  <msup>
    <mi>x</mi>
    <mn style="color: cornflowerblue">2</mn>
  </msup>
</math>
x 2 ≠ x 2
Square root of x squared does not equal x squared. The exponent under the root is lower than the exponent on the right

At first glance, you may not see anything too different. But looking closely, the green "2" on the left is a bit lower than the blue one on the right. It is trying to fit under the square root bar. This is what LaTeX calls cramped mode.

Chromium already supported the definition given by MathML Core, so that left Firefox and WebKit, both of which used hardcoded rules for specific cases in C++ objects. MathML Core takes another approach, and incentivizes using CSS styling rules instead.

Another interesting property is math-depth. It is used to make nested elements, such as those inside fractions, scripts or radicals, a bit smaller. That way, if you have an exponent of an exponent of an exponent (of an exponent...), each one is displayed a bit smaller than the last.

<math display="block">
  <msup>
    <mi>A</mi>
    <msup>
      <mi style="color: cornflowerblue">A</mi>
      <msup>
        <mi style="color: mediumseagreen">A</mi>
        <mi style="color: tomato">A</mi>
      </msup>
    </msup>
  </msup>
  <mo>+</mo>
  <mroot>
    <mi>A</mi>
    <mi style="color: mediumseagreen">A</mi>
  </mroot>
  <mo>+</mo>
  <mfrac>
    <mrow>
      <mi>A</mi>
      <mo>+</mo>
      <mfrac>
        <mi style="color: cornflowerblue">A</mi>
        <mi style="color: cornflowerblue">A</mi>
      </mfrac>
    </mrow>
    <mi>A</mi>
  </mfrac>
</math>
A A A A + A A + A + A A A
A variable with nested exponents, each smaller than the last. A radical with index A, smaller than the value inside the root. A nested fraction, whose variables are also displayed smaller.

In this case, Firefox and Chromium already had compliant implementations, so only WebKit needed to catch up. Support for math-depth and the scriptlevel attribute (which allows modifying this depth) has now landed, while a patch for font-size: math (which sets the size of the element based on its depth) is on the way.

Feature Firefox WebKit Chromium
math-shift: compact ✅✨ ✅✨ ✅
math-depth ✅ ✅✨ ✅
font-size: math ✅ 🚧 ✅
scriptlevel ✅ ✅✨ ✅

Other work

Rendering unknown elements as mrow

MathML 3 defined 195 elements. MathML Core focuses on about 30, leaving the rest to styling or polyfills. This means deprecating some features that were previously implemented in some browsers, like mfenced, semantics, and maction, as it would be too difficult to make them interoperable right now. To prevent breaking existing content too much, they are rendered like an mrow.
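
As a small illustration (using the conventions of the earlier examples, not a snippet from the spec), a legacy element such as mfenced now simply renders its children as if it were an mrow, with its legacy attributes ignored:

<math>
  <!-- Legacy MathML 3 element: MathML Core renders it like an mrow,
       ignoring the open/close/separators attributes. -->
  <mfenced open="[" close="]" separators=";">
    <mi>a</mi>
    <mi>b</mi>
  </mfenced>
</math>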

font-family: math

Selecting a good math font is essential for rendering. Stretchy operators, math symbols, and italics are not available in every font, so without a suitable one they are presented very poorly. font-family: math selects the generic math font family, which tells the browser to use a font suitable for mathematics. Previously browsers had a hardcoded list of CSS fallbacks, but now this has been standardized and implemented.
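
For page authors this is a one-liner. A minimal sketch, where the specific font names are just illustrative fallbacks rather than a recommendation from the post:

/* Prefer a dedicated math font if available, otherwise fall back to the
   generic 'math' family, which resolves to a suitable installed math font. */
math {
  font-family: "Latin Modern Math", "STIX Two Math", math;
}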

Android doesn't come with a math font installed, so it mixes symbols from different fonts, producing a rather unappealing result:

A math formula containing different symbols, all of them with varying font styling and weights as the result of not having a unified math font family

mathvariant and text-transform: math-auto

Single letter identifiers inside an <mi> tag are treated as variables, and so they should be rendered with fancy italics. This is still supported by MathML Core. However, MathML 3 allows a plethora of transformations using mathvariant, from bold to gothic text. The new spec says that while the italic transformation should still happen by default, other styles should use the specific Unicode code points directly, as supporting them all adds too much complexity to browser implementations.

text-transform: math-auto is a CSS declaration applied by default to <mi> elements, enabling the italic transformation for them. Setting the new mathvariant attribute to normal makes the element's text-transform none, removing the italic styling.
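
As a minimal sketch of the new behaviour (following the same conventions as the examples above):

<math>
  <!-- Default: a single-letter identifier gets text-transform: math-auto
       and renders in math italic. -->
  <mi>x</mi>
  <mo>+</mo>
  <!-- mathvariant="normal" switches text-transform back to none,
       so this identifier stays upright. -->
  <mi mathvariant="normal">e</mi>
</math>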

Different stylings of the letter A. Italic, regular, bold italic, bold regular, double struck, script, fraktur, sans serif and monospace

DisplayOperatorMinHeight and Cambria Math

Microsoft made a mistake in Cambria Math, one of the math fonts used on Windows. They swapped the DisplayOperatorMinHeight and DelimitedSubFormulaMinHeight values, so operators weren't being displayed correctly. Some browsers had a workaround for this, but a more general fix was implemented in HarfBuzz, so we removed the workarounds in favour of relying on the upstream library instead.

Animation for math-* properties

When implementing math-shift in Firefox, we noticed that the spec said the new properties were not supposed to be animatable, even though newer CSS specs define most properties as animatable (fun!). After some discussion with the MathML Working Group, we decided to change the spec, and we are now adding this feature to the browser engines.

@keyframes math-anim {
  0%   { color: royalblue;      math-depth: 1; }
  20%  { color: mediumseagreen; }
  40%  { color: gold; }
  60%  { color: tomato;         math-depth: 3; }
  80%  { color: mediumpurple; }
  100% { color: royalblue;      math-depth: 1; }
}
#anim-target { animation: math-anim 5s infinite; }

A live demo: an x² formula whose color and math-depth animate continuously.

Feature Firefox WebKit Chromium
Render unknown elements as mrow ✅✨ ✅✨
font-family: math ✅✨ ✅✨
text-transform: math-auto ✅✨
New mathvariant behaviour 🚧
DisplayOperatorMinHeight fix ✅✨ ✅✨ ✅✨
Animation for math-* properties ✅✨ 🚧 🚧

What's next?

Many of these improvements have already shipped, but our work continues on making mathematics more interoperable in browsers. This includes some exciting new features ahead:

  • Updates to the operator dictionary: MathML Core revamped the existing list of operators and their default layouts. Additionally, there is a new compact form that removes redundancies.
  • More improvements to operator stretching and spacing: There are still some inconsistencies between browsers and some long standing bugs that we would love to tackle.
  • Handling positioned elements and forbidding floats in MathML: Like flex or grid, MathML doesn't create floating children for elements with a math display type. However, they can still have out of flow positioned children. At the moment this isn't consistent across browsers and it is something we want to improve.

Working on MathML is very rewarding, especially because of the people who have helped along the way. I'd like to especially thank my colleague @fredw, reviewers from Mozilla, Apple and Google, and the W3C Math Working Group. Also @delan for reviewing the first draft of this post.

We are very grateful to the Sovereign Tech Fund for supporting this work!

November 27, 2025 12:00 AM

November 24, 2025

Igalia WebKit Team

WebKit Igalia Periodical #48

Update on what happened in WebKit in the week from November 17 to November 24.

In this week's rendition, the WebView snapshot API was enabled on the WPE port, further progress on the Temporal and Trusted Types implementations, and the release of WebKitGTK and WPE WebKit 2.50.2.

Cross-Port 🐱

A WebKitImage-based implementation of WebView snapshots landed this week, enabling this feature on WPE, where it was previously only available in the GTK port. This means you can now use webkit_web_view_get_snapshot (and webkit_web_view_get_snapshot_finish) to get a WebKitImage representation of your snapshot.

WebKitImage implements the GLoadableIcon interface (and therefore GIcon as well), so you can get a PNG-encoded image using g_loadable_icon_load.
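
A minimal sketch of how this could be used, assuming a WebKitWebView already set up; the region/options enum values come from the existing snapshot API and the WebKitImage return type of the finish call is taken from the description above, so double-check against the released headers:

#include <wpe/webkit.h>   /* or the equivalent header for your WebKit port */

static void
on_snapshot_ready (GObject *source, GAsyncResult *result, gpointer user_data)
{
    GError *error = NULL;
    /* Per the description above, the finish call now hands back a WebKitImage. */
    WebKitImage *image = webkit_web_view_get_snapshot_finish (WEBKIT_WEB_VIEW (source), result, &error);
    if (!image) {
        g_warning ("Snapshot failed: %s", error->message);
        g_clear_error (&error);
        return;
    }

    /* WebKitImage implements GLoadableIcon, so the pixels can be read back
     * as a PNG-encoded stream (the size argument is only a hint). */
    char *content_type = NULL;
    GInputStream *stream = g_loadable_icon_load (G_LOADABLE_ICON (image), 0, &content_type, NULL, &error);
    /* Consume the stream as needed by the application, then clean up. */
    g_clear_object (&stream);
    g_free (content_type);
    g_object_unref (image);
}

static void
take_snapshot (WebKitWebView *web_view)
{
    /* Kick off the asynchronous snapshot request. */
    webkit_web_view_get_snapshot (web_view,
                                  WEBKIT_SNAPSHOT_REGION_VISIBLE,
                                  WEBKIT_SNAPSHOT_OPTIONS_NONE,
                                  NULL /* cancellable */,
                                  on_snapshot_ready,
                                  NULL);
}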

Removed an incorrect early return in Trusted Types DOM attribute handling, to align with spec changes.

JavaScriptCore 🐟

The built-in JavaScript/ECMAScript engine for WebKit, also known as JSC or SquirrelFish.

In JavaScriptCore's implementation of Temporal, implemented the with method for PlainMonthDay objects.

In JavaScriptCore's implementation of Temporal, implemented the from and equals methods for PlainMonthDay objects.

Releases 📦️

WebKitGTK 2.50.2 and WPE WebKit 2.50.2 have been released.

These stable releases include a number of patches for security issues, and as such a new security advisory, WSA-2025-0008, has been issued (GTK, WPE).

It is recommended to apply an additional patch that fixes building when the JavaScriptCore “CLoop” interpreter is enabled, which is typical for architectures where JIT compilation is unsupported. Releases after 2.50.2 will include it, and manual patching will no longer be needed.

That’s all for this week!

by Igalia WebKit Team at November 24, 2025 08:12 PM

November 23, 2025

Juan A. Suárez

Major Upgrades to the Raspberry Pi GPU Driver Stack (XDC 2025 Recap)

XDC 2025 happened at the end of September and beginning of October this year in the Kuppelsaal, the historic TU Wien building in Vienna. XDC, the X.Org Developers Conference, is truly the premier gathering for open-source graphics development. The atmosphere was, as always, highly collaborative and packed with experts across the entire stack.

I was thrilled to present, together with my workmate Ella Stanforth, on the progress we have made in enhancing the Raspberry Pi GPU driver stack. Representing the broader Igalia Graphics Team that works on this GPU, Ella and I detailed the strides we have made in the OpenGL driver, though part of the improvements also benefit the Vulkan driver.

The presentation was divided in two parts. In the first one, we talked about the new features that we have implemented, or are still implementing, mainly to bring the driver more closely in line with OpenGL 3.2. Key features explained were 16-bit normalized format support, robust context support, and the seamless cubemap implementation.

Beyond these core OpenGL updates, we also highlighted other features, such as NIR printf support, framebuffer fetch, and dual-source blending, which is important for some game emulators.

The second part focused on specific work done to improve performance. Here, we started with different traces from the popular GFXBench application and explained the main improvements made throughout the year, with a look at how much each of these changes improved the performance of each benchmark (or on average).

In the end, for some benchmarks we nearly doubled the performance compared to last year. I won’t explain each of the changes here, but I encourage the reader to watch the talk, which is already available.

For those that prefer to check the slides instead of the full video, you can view them here:

Outside of the technical track, the venue’s location provided some excellent downtime opportunities to have lunch at different nearby places. I need to highlight one that I really enjoyed: An’s Kitchen Karlsplatz. This cozy Vietnamese street food spot quickly became one of my favourite places, and I went there a couple of times.

On the last day, I also had the opportunity to visit some of the most recommendable sightseeing spots in Vienna. Of course, one needs more than half a day for a proper visit, but at least it helped spark an interest in coming back to pay the city a full visit.

Meanwhile, I would like to thank all the conference organizers, as well as all the attendees, and I look forward to seeing them again.

November 23, 2025 11:00 PM

November 22, 2025

Igalia Compilers Team

Unlocking 15% More Performance: A Case Study in LLVM Optimization for RISC-V

This blog post summarizes a talk given by Mikhail R. Gadelha at the RISC-V Summit North America 2025.

You can also watch the presentation here. And download the full presentation here.

Title slide

Introduction #

In this post, I will walk through the results of a 10-month RISE project we completed in September, focused on improving the performance of the LLVM toolchain for RISC-V.

Since the moment I originally submitted this talk, we have actually squeezed out a bit more performance: what used to be a 15% speed-up is now up to 16% on SPEC CPU® 2017. Small change, but still measurable.

The project targeted the Banana Pi BPI-F3 board, which uses the SpacemiT X60: an in-order, 8-core RISC-V processor supporting the RVA22U64 profile and RVV 1.0, 256-bit vectors.

Our high-level goal was straightforward: to reduce the performance gap between LLVM and GCC for RISC-V. However, there wasn't (and still isn't) one single fix; LLVM is an extremely active codebase where improvements and regressions happen constantly. Instead, we focused on three major contributions:

  1. A full scheduling model for the SpacemiT X60.

  2. Improvements to vectorization across calls.

  3. IPRA support for the RISC-V backend.

Let's walk through each contribution.

1. SpacemiT-X60 Scheduling Model #

By far, our main contribution to the LLVM project was the scheduling model for the SpacemiT X60, but before we delve deeper into the changes, let's understand what a scheduling model is.

Explaining a scheduling model with an example

Instruction scheduling directly impacts performance, especially on in-order processors. Without accurate instruction latencies and resource usage information, the compiler can make poor scheduling decisions.

In our slides, there is an example where:

load -> uses ft0
fadd -> depends on ft0
fmul -> independent

The above is possible naive code generated by the compiler, with a total latency of roughly latency(load) + latency(fadd) + latency(fmul). Thanks to an accurate scheduling model, the compiler can reason that it is better to emit the following code, with a latency of roughly max(latency(load), latency(fmul)) + latency(fadd).

load -> uses ft0
fmul -> independent of preceding load
fadd -> depends on load

This was just an illustrative example, not something LLVM typically emits, but it demonstrates how missing scheduling information leads to unnecessary stalls.

How we got the latencies and throughput of all instructions

The biggest piece of the work was to actually collect the data on every single instruction supported by the board. To build a correct model, we:

  • wrote custom microbenchmarks to measure latency for every instruction.
  • used throughput data from camel-cdr's RVV benchmark results.
  • tracked all combinations of LMUL × SEW, which leads to:
    • 201 scalar instructions.
    • 82 FP instructions.
    • 9185 RVV "instructions" (combinations).

This resulted in a very, very large spreadsheet, and it took some time to analyse and then produce a series of upstream patches to reflect that data in LLVM.
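
To give an idea of what such a microbenchmark can look like, here is a minimal, illustrative sketch (not the actual harness used in the project): it times a chain of dependent fadd.d instructions, so the per-iteration time approximates the instruction's latency once converted with the core's clock frequency.

// Illustrative latency microbenchmark for RISC-V (requires the D extension);
// build natively on the board, e.g.: gcc -O2 -march=rv64gc fadd_latency.c
#include <stdio.h>
#include <time.h>

int main(void)
{
    const long iters = 100000000L;
    double acc = 0.0, one = 1.0;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++) {
        // Each fadd.d depends on the previous result, so the loop is bound
        // by the instruction's latency rather than its throughput.
        __asm__ volatile("fadd.d %0, %0, %1" : "+f"(acc) : "f"(one));
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    // Multiply ns/iteration by the clock frequency (in GHz) to estimate cycles;
    // loop overhead is not subtracted in this simple sketch.
    printf("fadd.d: %.2f ns per dependent iteration (acc=%g)\n", ns / iters, acc);
    return 0;
}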

RVA22U64 SPEC execution time results

With scheduling enabled on RVA22U64 (scalar-only), we got up to 16.8% execution time improvements (538.imagick_r) and no regressions. The combined results show a -4.75% geometric mean improvement on execution time.

RVA22U64_V SPEC execution time results

When we enable vectors (RVA22U64_V), we got up to 16% improvement (508.namd_r) and no regressions. The combined results show a -3.28% geometric mean improvement on execution time.

Comparison between RVA22U64 and RVA22U64_V execution time

One interesting result: scheduling nearly eliminated the gap between scalar and vector configurations on the x60; only one SPEC benchmark (x264) still favored the vectorized build.

This shows there may be more work in improving our ability to find profitable vectorization.

2. Improving Vectorization Across Calls #

During benchmarking, we found strange cases where scalar code was faster than vectorized code. The root cause: register spills, especially around function call boundaries.

Example: 544.nab_r #

Reduced test from 544.nab_r that triggered the issue

In this function, the SLP vectorizer would look only at the basic blocks performing loads/stores, and ignore the in-between blocks containing function calls.

Previous image but showing which blocks were not being analysed

Because those call blocks weren't considered when computing profitability, the vectorizer assumed vectorization was cheap, but in reality, it caused expensive vector register spills.

We modified SLP to walk all blocks in the region and estimate the cost properly. This helped: +9.9% faster on 544.nab_r, but hurt compilation time: +6.9% slower on 502.gcc_r. After discussion, Alexey Bataev (SLP vectorizer maintainer) created a refined version that fixed the issue and avoided compilation slowdowns. This shows how the open-source community is important and that collaboration can take us further.

RVA22U64_V SPEC execution time results

With the refined patch, we got up to 11.9% improvement (544.nab_r), no regressions, with negligible compile-time regressions. The combined results show a -3.28% geometric mean improvement on execution time.

Note that we only show results for RVA22U64_V, because the regression only happened when vectors were enabled. Now, the execution time is on par or better than the scalar-only execution time.

3. IPRA Support (Inter-Procedural Register Allocation) #

IPRA tracks which registers are actually used across call boundaries. Without it, LLVM spills registers conservatively, including registers that aren’t truly live.

IPRA illustrative example

Let's consider the illustrative example above, and assume s0 and s1 are not live in this function. LLVM will still save and restore these registers when IPRA is disabled. By enabling IPRA, we reduce register pressure and create shorter function prologues and epilogues.

RVA22U64 SPEC execution time results

With IPRA enabled on RVA22U64, we got up to 3.2% execution time improvements (538.lbm_r) and no regressions. The combined results show a -0.50% geometric mean improvement on execution time.

RVA22U64_V SPEC execution time results

When we enable vectors (RVA22U64_V), we got up to 3.4% improvement (508.deepsjeng_r) and no regressions. The combined results show a -0.39% geometric mean improvement on execution time.

Small but consistent wins; however, IPRA can’t be enabled by default yet due to an open bug, though that bug does not affect SPEC.

LLVM vs GCC #

This comparison isn’t perfectly apples-to-apples because LLVM has X60-specific scheduling and GCC does not. Still, it's useful to see progress.

LLVM vs GCC on RVA22U64

On RVA22U64, LLVM can be 27% faster on some benchmarks but also 27% slower on others (notably x264).

RVA22U64_V SPEC execution time results

The results when we enable vectors (RVA22U64_V) are similar: LLVM can be 8% faster in some benchmarks but also 9.2% slower in others.

My colleague Luke Lau is currently investigating these results to try to address the cases where we are slower.

What we learned #

  1. Adding a scheduling model can give meaningful performance improvements, especially for in-order cores.

  2. A default in-order scheduling model may be needed. Other backends do this already, and we have a PR open for that at PR#167008.

  3. Many contributions don’t show results until the entire system comes together. When the project started, I spent some time modeling individual instructions, but only when the full model was integrated did we see actual improvements.

  4. Vectorization must be tuned carefully; incorrect cost modeling leads to regressions.

Thank you #

Thank you slide

Thank you for reading!

November 22, 2025 12:00 AM

November 20, 2025

Víctor Jáquez

GStreamer Conference 2025

The GStreamer Conference is an annual gathering that brings together developers, contributors, and users of the GStreamer multimedia framework. It serves as a platform for sharing knowledge, discussing the latest advancements, and fostering collaboration within the open-source multimedia community.

This year’s conference was held in London at the impressive Barbican Center, located within the Barbican Estate, a residential complex rebuilt after World War II in the brutalist architectural style.

Barbican Estate
Barbican Estate in the London City

It was a pleasure to meet in person the colleagues, from different companies and backgrounds, that I usually collaborate with remotely, and to share and discuss their projects and results.

Recently, UbiCast, which generously records and streams the conference, has uploaded all the talks from this year's conference to their site, in a section exclusively for the GStreamer Conference.

In this blog post, I’ll share the talks delivered by my fellow Igalians:

Animate Your Subtimelines in GES #

Direct link

GstVA and GStreamer-VAAPI updates #

Direct link

Time Remapping and GES: Implementation Details and Latest Updates #

Direct link

soothe: a proposal for encoder testing #

Direct link

GstWebRTC in WebKit, current status & plans #

Direct link

VVC/H.266 in GStreamer #

Direct link

Video Reshaping with Skia #

Direct link

Vulkan Video: pipeline update #

Direct link

Following the GStreamer Conference, we hosted our Autumn hackfest at Amazon’s offices in the City of London. This time I worked on GStreamer Vulkan.

This year, two other conferences typically held in the US, FOMS and Demuxed, also took place in London. I attended FOMS, where I discovered the vibrant MOQ project.

Finally, I’d like to thank Centricular for organizing the event, especially Tim-Philipp Müller, and even more particularly Igalia for sponsoring it and allowing me to participate in this project that’s close to my heart.

And that’s all, mates. Cheers!

Pint of Guinness
Pint of Guinness

November 20, 2025 12:00 AM

November 17, 2025

Igalia WebKit Team

WebKit Igalia Periodical #47

Update on what happened in WebKit in the week from November 10 to November 17.

This week's update is composed of a new CStringView internal API, more MathML progress with the implementation of the "scriptlevel" attribute, the removal of the Flatpak-based SDK, and the maintenance update of WPEBackend-fdo.

Cross-Port 🐱

Implemented the MathML scriptlevel attribute using math-depth.

Finished implementing CStringView, which is a wrapper around UTF-8 C strings. It allows you to recover the string without making any copies and to perform string operations safely, taking the encoding into account at compile time.

Releases 📦️

WPEBackend-fdo 1.16.1 has been released. This is a maintenance update which adds compatibility with newer Mesa versions.

Infrastructure 🏗️

Most of the Flatpak-based SDK was removed. Developers are warmly encouraged to use the new SDK for their contributions to the Linux ports; this SDK has been successfully deployed on the EWS and post-commit bots.

That’s all for this week!

by Igalia WebKit Team at November 17, 2025 09:22 PM

November 13, 2025

Andy Wingo

the last couple years in v8's garbage collector

Let’s talk about memory management! Following up on my article about 5 years of developments in V8’s garbage collector, today I’d like to bring that up to date with what went down in V8’s GC over the last couple years.

methodololology

I selected all of the commits to src/heap since my previous roundup. There were 1600 of them, including reverts and relands. I read all of the commit logs, some of the changes, some of the linked bugs, and any design document I could get my hands on. From what I can tell, there have been about 4 FTE from Google over this period, and the commit rate is fairly constant. There are very occasional patches from Igalia, Cloudflare, Intel, and Red Hat, but it’s mostly a Google affair.

Then, by the very rigorous process of, um, just writing things down and thinking about it, I see three big stories for V8’s GC over this time, and I’m going to give them to you with some made-up numbers for how much of the effort was spent on them. Firstly, the effort to improve memory safety via the sandbox: this is around 20% of the time. Secondly, the Oilpan odyssey: maybe 40%. Third, preparation for multiple JavaScript and WebAssembly mutator threads: 20%. Then there are a number of lesser side quests: heuristics wrangling (10%!!!!), and a long list of miscellanea. Let’s take a deeper look at each of these in turn.

the sandbox

There was a nice blog post in June last year summarizing the sandbox effort: basically, the goal is to prevent user-controlled writes from corrupting memory outside the JavaScript heap. We start from the assumption that the user is somehow able to obtain a write-anywhere primitive, and we work to mitigate the effect of such writes. The most fundamental way is to reduce the range of addressable memory, notably by encoding pointers as 32-bit offsets and then ensuring that no host memory is within the addressable virtual memory that an attacker can write. The sandbox also uses some 40-bit offsets for references to larger objects, with similar guarantees. (Yes, a sandbox really does reserve a terabyte of virtual memory).

But there are many, many details. Access to external objects is intermediated via type-checked external pointer tables. Some objects that should never be directly referenced by user code go in a separate “trusted space”, which is outside the sandbox. Then you have read-only spaces, used to allocate data that might be shared between different isolates, you might want multiple cages, there are “shared” variants of the other spaces, for use in shared-memory multi-threading, executable code spaces with embedded object references, and so on and so on. Tweaking, elaborating, and maintaining all of these details has taken a lot of V8 GC developer time.

I think it has paid off, though, because the new development is that V8 has managed to turn on hardware memory protection for the sandbox: sandboxed code is prevented by the hardware from writing memory outside the sandbox.

Leaning into the “attacker can write anything in their address space” threat model has led to some funny patches. For example, sometimes code needs to check flags about the page that an object is on, as part of a write barrier. So some GC-managed metadata needs to be in the sandbox. However, the garbage collector itself, which is outside the sandbox, can’t trust that the metadata is valid. We end up having two copies of state in some cases: in the sandbox, for use by sandboxed code, and outside, for use by the collector.

The best and most amusing instance of this phenomenon is related to integers. Google’s style guide recommends signed integers by default, so you end up with on-heap data structures with int32_t len and such. But if an attacker overwrites a length with a negative number, there are a couple funny things that can happen. The first is a sign-extending conversion to size_t by run-time code, which can lead to sandbox escapes. The other is mistakenly concluding that an object is small, because its length is less than a limit, because it is unexpectedly negative. Good times!

oilpan

It took 10 years for Odysseus to get back from Troy, which is about as long as it has taken for conservative stack scanning to make it from Oilpan into V8 proper. Basically, Oilpan is garbage collection for C++ as used in Blink and Chromium. Sometimes it runs when the stack is empty; then it can be precise. But sometimes it runs when there might be references to GC-managed objects on the stack; in that case it runs conservatively.

Last time I described how V8 would like to add support for generational garbage collection to Oilpan, but that for that, you’d need a way to promote objects to the old generation that is compatible with the ambiguous references visited by conservative stack scanning. I thought V8 had a chance at success with their new mark-sweep nursery, but that seems to have turned out to be a lose relative to the copying nursery. They even tried sticky mark-bit generational collection, but it didn’t work out. Oh well; one good thing about Google is that they seem willing to try projects that have uncertain payoff, though I hope that the hackers involved came through their OKR reviews with their mental health intact.

Instead, V8 added support for pinning to the Scavenger copying nursery implementation. If a page has incoming ambiguous edges, it will be placed in a kind of quarantine area for a while. I am not sure what the difference is between a quarantined page, which logically belongs to the nursery, and a pinned page from the mark-compact old-space; they seem to require similar treatment. In any case, we seem to have settled into a design that was mostly the same as before, but in which any given page can opt out of evacuation-based collection.

What do we get out of all of this? Well, not only can we get generational collection for Oilpan, but also we unlock cheaper, less bug-prone “direct handles” in V8 itself.

The funny thing is that I don’t think any of this is shipping yet; or, if it is, it’s only in a Finch trial to a minority of users or something. I am looking forward in interest to seeing a post from upstream V8 folks; whole doctoral theses have been written on this topic, and it would be a delight to see some actual numbers.

shared-memory multi-threading

JavaScript implementations have had the luxury of a single-threadedness: with just one mutator, garbage collection is a lot simpler. But this is ending. I don’t know what the state of shared-memory multi-threading is in JS, but in WebAssembly it seems to be moving apace, and Wasm uses the JS GC. Maybe I am overstating the effort here—probably it doesn’t come to 20%—but wiring this up has been a whole thing.

I will mention just one patch here that I found to be funny. So with pointer compression, an object’s fields are mostly 32-bit words, with the exception of 64-bit doubles, so we can reduce the alignment on most objects to 4 bytes. V8 has had a bug open forever about alignment of double-holding objects that it mostly ignores via unaligned loads.

Thing is, if you have an object visible to multiple threads, and that object might have a 64-bit field, then the field should be 64-bit aligned to prevent tearing during atomic access, which usually means the object should be 64-bit aligned. That is now the case for Wasm structs and arrays in the shared space.

side quests

Right, we’ve covered what to me are the main stories of V8’s GC over the past couple years. But let me mention a few funny side quests that I saw.

the heuristics two-step

This one I find to be hilariousad. Tragicomical. Anyway I am amused. So any real GC has a bunch of heuristics: when to promote an object or a page, when to kick off incremental marking, how to use background threads, when to grow the heap, how to choose whether to make a minor or major collection, when to aggressively reduce memory, how much virtual address space can you reasonably reserve, what to do on hard out-of-memory situations, how to account for off-heap mallocated memory, how to compute whether concurrent marking is going to finish in time or if you need to pause... and V8 needs to do this all in all its many configurations, with pointer compression off or on, on desktop, high-end Android, low-end Android, iOS where everything is weird, something called Starboard which is apparently part of Cobalt which is apparently a whole new platform that Youtube uses to show videos on set-top boxes, on machines with different memory models and operating systems with different interfaces, and on and on and on. Simply tuning the system appears to involve a dose of science, a dose of flailing around and trying things, and a whole cauldron of witchcraft. There appears to be one person whose full-time job it is to implement and monitor metrics on V8 memory performance and implement appropriate tweaks. Good grief!

mutex mayhem

Toon Verwaest noticed that V8 was exhibiting many more context switches on MacOS than Safari, and identified V8’s use of platform mutexes as the problem. So he rewrote them to use os_unfair_lock on MacOS. Then implemented adaptive locking on all platforms. Then... removed it all and switched to abseil.

Personally, I am delighted to see this patch series, I wouldn’t have thought that there was juice to squeeze in V8’s use of locking. It gives me hope that I will find a place to do the same in one of my projects :)

ta-ta, third-party heap

It used to be that MMTk was trying to get a number of production language virtual machines to support abstract APIs so that MMTk could slot in a garbage collector implementation. Though this seems to work with OpenJDK, with V8 I think the churn rate and laser-like focus on the browser use-case makes an interstitial API abstraction a lose. V8 removed it a little more than a year ago.

fin

So what’s next? I don’t know; it’s been a while since I have been to Munich to drink from the source. That said, shared-memory multithreading and wasm effect handlers will extend the memory management hacker’s full employment act indefinitely, not to mention actually landing and shipping conservative stack scanning. There is a lot to be done in non-browser V8 environments, whether in Node or on the edge, but it is admittedly harder to read the future than the past.

In any case, it was fun taking this look back, and perhaps I will have the opportunity to do this again in a few years. Until then, happy hacking!

by Andy Wingo at November 13, 2025 03:21 PM

November 10, 2025

Igalia WebKit Team

WebKit Igalia Periodical #46

Update on what happened in WebKit in the week from November 3 to November 10.

This week brought a hodgepodge of fixes in Temporal and multimedia, a small addition to the public API in preparation for future work, plus advances in WebExtensions, WebXR, and Android support.

Cross-Port 🐱

The platform-independent part of the WebXR Hit Test Module has been implemented. The rest, including the FakeXRDevice mock implementation used for testing, will be done later.

On the WebExtensions front, parts of the WebExtensionCallbackHandler code have been rewritten to use more C++ constructs and helper functions, in preparation to share more code among the different WebKit ports.

A new WebKitImage utility class landed this week. This image abstraction is one of the steps towards delivering a new improved API for page favicons, and it is also expected to be useful for the WebExtensions work, and to enable the webkit_web_view_get_snapshot() API for the WPE port.

Multimedia 🎥

GStreamer-based multimedia support for WebKit, including (but not limited to) playback, capture, WebAudio, WebCodecs, and WebRTC.

Videos with BT2100-PQ colorspace are now tone-mapped to SDR in WebKit's compositor, ensuring colours do not appear washed out.

Lots of deadlock fixes this week, one among many in the MediaStream GStreamer source element.

Video frame rendering to WebGL was fixed. Another pending improvement is GPU-to-GPU texture copies, which might be coming soon.

JavaScriptCore 🐟

The built-in JavaScript/ECMAScript engine for WebKit, also known as JSC or SquirrelFish.

JavaScriptCore's implementation of Temporal received a number of improvements this week:

  • Fixed a bug that would cause wrong results when adding a duration with a very large microseconds or nanoseconds value to a PlainTime.

  • Fixed a rounding bug of Instant values.

  • Fixed a bug that resulted in incorrect printing of certain Instant values before the Epoch.

  • Fixed a bug that resulted in wrong results instead of exceptions when a date addition operation would result in an out-of-range date.

WPE WebKit 📟

WPE Android 🤖

Adaptation of WPE WebKit targeting the Android operating system.

One of the last pieces needed to have the WPEPlatform API working on Android has been merged: a custom platform EGL display implementation, and enabling the default display as fallback.

Community & Events 🤝

The dates for the next Web Engines Hackfest have been announced: it will take place from Monday, June 15th to Wednesday, June 17th. As has been the case in recent years, it will be possible to attend both on-site and remotely for those who cannot travel to A Coruña.

The video recording for Adrian Pérez's “WPE Android 🤖 State of the Bot” talk from this year's edition of the WebKit Contributors' Meeting has been published. This was an update on what the Igalia WebKit team has been doing during the last year to improve WPE WebKit on Android, and what is coming up next.

That’s all for this week!

by Igalia WebKit Team at November 10, 2025 11:04 PM

November 03, 2025

Melissa Wen

Kworkflow at Kernel Recipes 2025

Franks drawing of Melissa Wen with Kernel Recipes mascots around

This was the first year I attended Kernel Recipes, and I have nothing to say but how much I enjoyed it and how grateful I am for the opportunity to talk more about kworkflow to very experienced kernel developers. What I like most about Kernel Recipes is its intimate format, with only one track and many moments to get closer to experts and to the people you usually only talk to online during the whole year.

In the beginning of this year, I gave the talk Don’t let your motivation go, save time with kworkflow at FOSDEM, introducing kworkflow to a more diversified audience, with different levels of involvement in the Linux kernel development.

At this year’s Kernel Recipes I presented the second talk of the first day: Kworkflow - mix & match kernel recipes end-to-end.

The Kernel Recipes audience is a bit different from FOSDEM's, with mostly long-term kernel developers, so I decided to go straight to the point. I showed kworkflow being part of the daily life of a typical kernel developer, from the local setup and installing a custom kernel on different target machines to sending and applying patches to/from the mailing list. In short, I showed how to mix and match kernel workflow recipes end-to-end.

As I was a bit fast when showing some features during my presentation, in this blog post I explain each slide from my speaker notes. You can see a summary of this presentation in the Kernel Recipe Live Blog Day 1: morning.


Introduction

First slide: Kworkflow by Melissa Wen

Hi, I’m Melissa Wen from Igalia. As we already started sharing kernel recipes and even more is coming in the next three days, in this presentation I’ll talk about kworkflow: a cookbook to mix & match kernel recipes end-to-end.

Second slide: About Melissa Wen, the speaker of this talk

This is my first time attending Kernel Recipes, so lemme introduce myself briefly.

  • As I said, I work for Igalia, I work mostly on kernel GPU drivers in the DRM subsystem.
  • In the past, I co-maintained VKMS and the v3d driver. Nowadays I focus on the AMD display driver, mostly for the Steam Deck.
  • Besides code, I contribute to the Linux kernel by mentoring several newcomers in Outreachy, Google Summer of Code and the Igalia Coding Experience, and also by working on kernel documentation and tooling.

Slide 3: and what's this cookbook called Kworkflow? - with kworkflow logo and KR penguin

And what’s this cookbook called kworkflow?

Kworkflow (kw)

Slide 4: text below

Kworkflow is a tool created by Rodrigo Siqueira, my colleague at Igalia. It’s a single platform that combines software and tools to:

  • optimize your kernel development workflow;
  • reduce time spent in repetitive tasks;
  • standardize best practices;
  • ensure that deployment data flows smoothly and reliably between different kernel workflows;

Slide 5: kworkflow is mostly a voluntary work

It’s mostly done by volunteers, kernel developers using their spare time. Its features cover real use cases according to kernel developer needs.

Slide 6: Mix & Match the daily life of a kernel developer

Basically, it mixes and matches the daily life of a typical kernel developer with kernel workflow recipes and some secret sauces.

First recipe: A good GPU driver for my AMD laptop

Slide 7: Let's prepare our first recipe

So, it’s time to start the first recipe: A good GPU driver for my AMD laptop.

Slide 8: Ingredients and Tools

Before starting any recipe we need to check the necessary ingredients and tools. So, let’s check what you have at home.

With kworkflow, you can use:

Slide 9: kw device and kw remote

  • kw device: to get information about the target machine, such as CPU model, kernel version, distribution, and GPU model

  • kw remote: to set the address of this machine for remote access

Slide 11: kw config

  • kw config: you can configure kw with kw config. With this command you can basically select the tools, flags and preferences that kw will use to build and deploy a custom kernel on a target machine. You can also define the recipients of your patches when sending them with kw send-patch. I’ll explain more about each feature later in this presentation.

Slide 13: kw kernel-config-manager

  • kw kernel-config-manager (or just kw k): to fetch the kernel .config file from a given machine, store multiple .config files, and list and retrieve them according to your needs.

Slide 15: Preparation

Now, with all ingredients and tools selected and well portioned, follow the right steps to prepare your custom kernel!

First step: Mix ingredients with kw build or just kw b

Slide 16: kw build

  • kw b and its options wrap many routines of compiling a custom kernel.
    • You can run kw b -i to check the kernel name and version and the number of modules that will be compiled, and kw b --menu to change kernel configurations.
    • You can also pre-configure compiling preferences in kw config regarding kernel building. For example, target architecture, the name of the generated kernel image, if you need to cross-compile this kernel for a different system and which tool to use for it, setting different warning levels, compiling with CFlags, etc.
    • Then you can just run kw b to compile the custom kernel for a target machine.

Second step: Bake it with kw deploy or just kw d

Slide 18: kw deploy

After compiling the custom kernel, we want to install it on the target machine. Check the name of the custom kernel built (6.17.0-rc6), then use kw s to SSH into the target machine and see that it is still running the kernel from the Debian distribution, 6.16.7+deb14-amd64.

As with building settings, you can also pre-configure some deployment settings, such as compression type, path to device tree binaries, target machine (remote, local, vm), if you want to reboot the target machine just after deploying your custom kernel, and if you want to boot in the custom kernel when restarting the system after deployment.

If you didn’t pre-configure some options, you can still customize them as command options, for example: kw d --reboot will reboot the system after deployment, even if I didn’t set this in my preferences.

By just running kw d --reboot I have installed the kernel on a given target machine and rebooted it, so when accessing the system again I can see it was booted into my custom kernel.

Third step: Time to taste with kw debug

Slide 20: kw debug

  • kw debug wraps many tools for validating a kernel on a target machine. We can log basic dmesg info, but also track events and use ftrace.
    • With kw debug --dmesg --history we can grab the full dmesg log from a remote machine; if you use the --follow option, you will monitor dmesg output as it comes. You can also run a command with kw debug --dmesg --cmd="<my command>" and collect just the dmesg output related to this specific execution period.
    • In the example, I’ll just unload the amdgpu driver. I use kw drm --gui-off to drop the graphical interface and release amdgpu for unloading. So I run kw debug --dmesg --cmd="modprobe -r amdgpu" to unload the amdgpu driver, but it fails and I can’t unload it.

Cooking Problems

Slide 22: kw patch-hub

Oh no! That custom kernel isn’t tasting good. Don’t worry: as in many recipe preparations, we can search on the internet to find suggestions on how to make it tasteful, alternative ingredients, and other flavours according to your taste.

With kw patch-hub you can search the lore kernel mailing list for possible patches that can fix your kernel issue. You can navigate the mailing lists, check series, bookmark them if you find them relevant, and apply them to your local kernel tree, creating a different branch for tasting… oops, for testing. In this example, I’m opening the amd-gfx mailing list where I can find contributions related to the AMD GPU driver, bookmark and/or just apply the series to my work tree, and with kw bd I can compile & install the custom kernel with this possible bug fix in one shot.

As I changed my kw config to reboot after deployment, I just need to wait for the system to boot to try unloading the amdgpu driver again with kw debug --dmesg --cmd="modprobe -r amdgpu". From the dmesg output retrieved by kw for this command, the driver was unloaded: the problem is fixed by this series and the kernel tastes good now.

If I’m satisfied with the solution, I can even use kw patch-hub to access the bookmarked series and mark the checkbox that will reply to the patch thread with a Reviewed-by tag for me.

Second Recipe: Raspberry Pi 4 with Upstream Kernel

Slide 25: Second Recipe RPi 4 with upstream kernel

As in all recipes, we need ingredients and tools, but with kworkflow you can get everything set up as quickly as changing sets in a TV show. We can use kw env to switch to a different environment with all the kw and kernel configuration already set, and also with the latest compiled kernel cached.

I was preparing the first recipe for an x86 AMD laptop, and with kw env --use RPI_64 I keep the same worktree but move to a different kernel workflow, now for a 64-bit Raspberry Pi 4. The previously compiled kernel 6.17.0-rc6-mainline+ is there with 1266 modules, not the 6.17.0-rc6 kernel with 285 modules that I just built & deployed. The kw build settings are also different: now I’m targeting the arm64 architecture, cross-compiling with the aarch64-linux-gnu- toolchain, and my kernel image is now called kernel8.

Slide 27: kw env

If you didn’t plan for this recipe in advance, don’t worry. You can create a new environment with kw env --create RPI_64_V2 and run kw init --template to start preparing your kernel recipe with the mirepoix ready.

I mean, with the basic ingredients already cut…

I mean, with the kw configuration set from a template.

And you can use kw remote to set the IP address of your target machine and kw kernel-config-manager to fetch/retrieve the .config file from your target machine. So just run kw bd to compile and install an upstream kernel for the Raspberry Pi 4.

Third Recipe: The Mainline Kernel Ringing on my Steam Deck (Live Demo)

Slide 30: Third Recipe - The Mainline Kernel Ringing on my Steam Deck

Let me show you how easy it is to build, install and test a custom kernel for the Steam Deck with kworkflow. It’s a live demo, but I also recorded it because I know the risks I’m exposed to and something can go very wrong just because of reasons :)

Report: how was the live demo

For this live demo, I took my OLED Steam Deck to the stage. I explained that, if I boot the mainline kernel on this device, there is no audio. So I turned it on and booted the mainline kernel I had installed beforehand. It was clear that there was no typical Steam Deck startup sound when the system loaded.

Franks drawing of Melissa Wen doing a demo of kworkflow with the Steam Deck

As I started the demo in the kw environment for Raspberry Pi 4, I first moved to another environment previously used for Steam Deck. In this STEAMDECK environment, the mainline kernel was already compiled and cached, and all settings for accessing the target machine, compiling and installing a custom kernel were retrieved automatically.

My live demo followed these steps:

  1. With kw env --use STEAMDECK, switch to a kworkflow environment for Steam Deck kernel development.

  2. With kw b -i, show that kw will compile and install a kernel with 285 modules named 6.17.0-rc6-mainline-for-deck.

  3. Run kw config to show that, in this environment, kw configuration changes to x86 architecture and without cross-compilation.

  4. Run kw device to display information about the Steam Deck device, i.e. the target machine. It also proves that the remote access - user and IP - for this Steam Deck was already configured when using the STEAMDECK environment, as expected.

  5. Using git am, as usual, apply a hot fix on top of the mainline kernel. This hot fix makes the audio play again on Steam Deck.

  6. With kw b, build the kernel with the audio change. It will be fast because we are only compiling the affected files, since everything was previously done and cached. The compiled kernel, kw configuration and kernel configuration are retrieved by just moving to the “STEAMDECK” environment.

  7. Run kw d --force --reboot to deploy the new custom kernel to the target machine. The --force option enables us to install the mainline kernel even if mkinitcpio complains about missing support for downstream packages when generating the initramfs. The --reboot option makes the Steam Deck reboot automatically, just after the deployment completes.

  8. After finishing deployment, the Steam Deck will reboot into the new custom kernel version and make a clear, resonant or vibrating sound. [Hopefully]

Finally, I showed the audience that, if I wanted to send this patch upstream, I just needed to run kw send-patch and kw would automatically add the subsystem maintainers, reviewers and mailing lists for the affected files as recipients, and send the patch for the upstream community's assessment. As I didn’t want to create unnecessary noise, I just did a dry run with kw send-patch -s --simulate to explain how it looks.

What else can kworkflow already mix & match?

In this presentation, I showed that kworkflow supports different kernel development workflows, i.e., multiple distributions, different bootloaders and architectures, different target machines, and different debugging tools, and that it automates best practices for your kernel development routines, from development environment setup and verifying a custom kernel on bare metal to sending contributions upstream following the contribution-by-email process. I exemplified it with three different target machines: my ordinary x86 AMD laptop with Debian, a Raspberry Pi 4 with arm64 Raspbian (cross-compilation) and the Steam Deck with SteamOS (an x86 Arch-based OS). Besides those distributions, kworkflow also supports Ubuntu, Fedora and PopOS.

Now it’s your turn: Do you have any secret recipes to share? Please share with us via kworkflow.


November 03, 2025 09:30 PM

Igalia WebKit Team

WebKit Igalia Periodical #45

Update on what happened in WebKit in the week from October 27 to November 3.

A calmer week this time! This week we have the GTK and WPE ports implementing the RunLoopObserver infrastructure, which enables more sophisticated scheduling in the WebKit Linux ports, as well as more information in webkit://gpu. On the Trusted Types front, the timing of checks was changed to align with spec changes.

Cross-Port 🐱

Implemented the RunLoopObserver infrastructure for GTK and WPE ports, a critical piece of technology previously exclusive to Apple ports that enables sophisticated scheduling features like OpportunisticTaskScheduler for optimal garbage collection timing.

The implementation refactored the GLib run loop to notify clients about activity-state transitions (BeforeWaiting, Entry, Exit, AfterWaiting), then moved from timer-based to observer-based layer flushing for more precise control over rendering updates. Finally, support was added for cross-thread scheduling of RunLoopObservers, allowing the ThreadedCompositor to use them and enabling deterministic composition notifications across thread boundaries.

Changed timing of Trusted Types checks within DOM attribute handling to align with spec changes.

Graphics 🖼️

The webkit://gpu page now shows more information like the list of preferred buffer formats, the list of supported buffer formats, threaded rendering information, number of MSAA samples, view size, and toplevel state.

It is also now possible to make the page auto-refresh every given number of seconds by passing a ?refresh=<seconds> parameter in the URL, for example webkit://gpu?refresh=5.

That’s all for this week!

by Igalia WebKit Team at November 03, 2025 07:16 PM

October 30, 2025

Andy Wingo

wastrel, a profligate implementation of webassembly

Hey hey hey good evening! Tonight a quick note on wastrel, a new WebAssembly implementation.

a wasm-to-native compiler that goes through c

Wastrel compiles Wasm modules to standalone binaries. It does so by emitting C and then compiling that C.

Compiling Wasm to C isn’t new: Ben Smith wrote wasm2c back in the day and these days most people in this space use Bastien Müller‘s w2c2. These are great projects!

Wastrel has two or three minor differences from these projects. Let’s lead with the most important one, despite the fact that it’s as yet vaporware: Wastrel aims to support automatic memory management via WasmGC, by embedding the Whippet garbage collection library. (For the wingolog faithful, you can think of Wastrel as a Whiffle for Wasm.) This is the whole point! But let’s come back to it.

The other differences are minor. Firstly, the CLI is more like wasmtime: instead of privileging the production of C, which you then incorporate into your project, Wastrel also compiles the C (by default), and even runs it, like wasmtime run.

Unlike wasm2c (but like w2c2), Wastrel implements WASI. Specifically, WASI 0.1, sometimes known as “WASI preview 1”. It’s nice to be able to take the wasi-sdk‘s C compiler, compile your program to a binary that uses WASI imports, and then run it directly.

In a past life, I once took a week-long sailing course on a 12-meter yacht. One thing that comes back to me often is the way the instructor would insist on taking in the bumpers immediately as we left port, that to sail with them was no muy marinero, not very seamanlike. Well one thing about Wastrel is that it emits nice C: nice in the sense that it avoids many useless temporaries. It does so with a lightweight effects analysis, in which as temporaries are produced, they record which bits of the world they depend on, in a coarse way: one bit for the contents of all global state (memories, tables, globals), and one bit for each local. When compiling an operation that writes to state, we flush all temporaries that read from that state (but only that state). It’s a small thing, and I am sure it has very little or zero impact after SROA turns locals into SSA values, but we are vessels of the divine, and it is important for vessels to be C worthy.

Finally, w2c2 at least is built in such a way that you can instantiate a module multiple times. Wastrel doesn’t do that: the Wasm instance is statically allocated, once. It’s a restriction, but that’s the use case I’m going for.

on performance

Oh buddy, who knows?!? What is real anyway? I would love to have proper perf tests, but in the meantime, I compiled coremark using my GCC on x86-64 (-O2, no other options), then also compiled it with the current wasi-sdk and then ran with w2c2, wastrel, and wasmtime. I am well aware of the many pitfalls of benchmarking, and so I should not say anything because it is irresponsible to make conclusions from useless microbenchmarks. However, we’re all friends here, and I am a dude with hubris who also believes blogs are better out than in, and so I will give some small indications. Please obtain your own salt.

So on coremark, Wastrel is some 2-5% slower than native, and w2c2 is some 2-5% slower than that. Wasmtime is 30-40% slower than GCC. Voilà.

My conclusion is, Wastrel provides state-of-the-art performance. Like w2c2. It’s no wonder, these are simple translators that use industrial compilers underneath. But it’s neat to see that performance is close to native.

on wasi

OK this is going to sound incredibly arrogant but here it is: writing Wastrel was easy. I have worked on Wasm for a while, and on Firefox’s baseline compiler, and Wastrel is kinda like a baseline compiler in shape: it just has to avoid emitting boneheaded code, and can leave the serious work to someone else (Ion in the case of Firefox, GCC in the case of Wastrel). I just had to use the Wasm libraries I already had and make it emit some C for each instruction. It took 2 days.

WASI, though, took two and a half weeks of agony. Three reasons: One, you can be sloppy when implementing just wasm, but when you do WASI you have to implement an ABI using sticks and glue, but you have no glue, it’s all just i32. Truly excruciating, it makes you doubt everything, and I had to refactor Wastrel to use C’s meager type system to the max. (Basically, structs-as-values to avoid type confusion, but via inline functions to avoid overhead.)

Two, WASI is not huge but not tiny either. Implementing poll_oneoff is annoying. And so on. Wastrel’s WASI implementation is thin but it’s still a couple thousand lines of code.

Three, WASI is underspecified, and in practice what is “conforming” is a function of what the Rust and C toolchains produce. I used wasi-testsuite to burn down most of the issues, but it was a slog. I neglected email and important things but now things pass so it was worth it maybe? Maybe?

on wasi’s filesystem sandboxing

WASI preview 1 has this “rights” interface that associated capabilities with file descriptors. I think it was an attempt at replacing and expanding file permissions with a capabilities-oriented security approach to sandboxing, but it was only a veneer. In practice most WASI implementations effectively implement the sandbox via a permissions layer: for example the process has capabilities to access the parents of preopened directories via .., but the WASI implementation has to actively prevent this capability from leaking to the compiled module via run-time checks.

Wastrel takes a different approach, which is to use Linux’s filesystem namespaces to build a tree in which only the exposed files are accessible. No run-time checks are necessary; the system is secure by construction. He says. It’s very hard to be categorical in this domain but a true capabilities-based approach is the only way I can have any confidence in the results, and that’s what I did.

The upshot is that Wastrel is only for Linux. And honestly, if you are on MacOS or Windows, what are you doing with your life? I get that it’s important to meet users where they are but it’s just gross to build on a corporate-controlled platform.

The current versions of WASI keep a vestigial capabilities-based API, but given that the goal is to compile POSIX programs, I would prefer if wasi-filesystem leaned into the approach of WASI just having access to a filesystem instead of a small set of descriptors plus scoped openat, linkat, and so on APIs. The security properties would be the same, except with fewer bug possibilities and with a more conventional interface.

on wtf

So Wastrel is Wasm to native via C, but with an as-yet-unbuilt GC aim. Why?

This is hard to explain and I am still workshopping it.

Firstly I am annoyed at the WASI working group’s focus on shared-nothing architectures as a principle of composition. Yes, it works, but garbage collection also works; we could be building different, simpler systems if we leaned in to a more capable virtual machine. Many of the problems that WASI is currently addressing are ownership-related, and would be comprehensively avoided with automatic memory management. Nobody is really pushing for GC in this space and I would like for people to be able to build out counterfactuals to the shared-nothing orthodoxy.

Secondly there are quite a number of languages that are targeting WasmGC these days, and it would be nice for them to have a good run-time outside the browser. I know that Wasmtime is working on GC, but it needs competition :)

Finally, and selfishly, I have a GC library! I would love to spend more time on it. One way that can happen is for it to prove itself useful, and maybe a Wasm implementation is a way to do that. Could Wastrel on wasm_of_ocaml output beat ocamlopt? I don’t know but it would be worth it to find out! And I would love to get Guile programs compiled to native, and perhaps with Hoot and Whippet and Wastrel that is a possibility.

Welp, there we go, blog out, dude to bed. Hack at y’all later and wonderful wasming to you all!

by Andy Wingo at October 30, 2025 10:19 PM

October 29, 2025

Eric Meyer

Custom Asidenotes

Previously on meyerweb, I crawled through a way to turn parenthetical comments into sidenotes, which I called “asidenotes”.  As a recap, these are inline asides in parentheses, which is something I like to do.  The constraints are that the text has to start inline, with its enclosing parentheses as part of the static content, so that the parentheses are present if CSS isn’t applied, but should lose those parentheses when turned into asidenotes, while also adding a sentence-terminating period when needed.

At the end of that post, I said I wouldn’t use the technique I developed, because the markup was too cluttered and unwieldy, and there were failure states that CSS alone couldn’t handle.  So what can we do instead?  Extend HTML to do things automatically!

If you’ve read my old post “Blinded By the DOM Light”, you can probably guess how this will go.  Basically, we can write a little bit of JavaScript to take an invented element and Do Things To It™.  What things?  Anything JavaScript makes possible.

So first, we need an element, one with a hyphen in the middle of its name (because all custom elements require an interior hyphen, similar to how all custom properties and most custom identifiers in CSS require two leading dashes).  Something like:

<aside-note>(actual text content)</aside-note>

Okay, great!  Thanks to HTML’s permissive handling of unrecognized elements, this completely new element will be essentially treated like a <span> in older browsers.  In newer browsers, we can massage it.

class asideNote extends HTMLElement {
	connectedCallback() {
		let marker = document.createElement('sup');
		marker.classList.add('asidenote-marker');
		this.after(marker);
	}
}
customElements.define("aside-note",asideNote);

With this in place, whenever a supporting browser encounters an <aside-note> element, it will run the JS above.  Right now, what that does is insert a <sup> element just after the <aside-note>.

“Whoa, wait a minute”, I thought to myself at this point. “There will be browsers (mostly older browser versions) that understand custom elements, but don’t support anchor positioning.  I should only run this JS if the browser can position with anchors, because I don’t want to needlessly clutter the DOM.  I need an @supports query, except in JS!” And wouldn’t you know it, such things do exist.

class asideNote extends HTMLElement {
	connectedCallback() {
		if (CSS.supports('bottom','anchor(top)')) {
			let marker = document.createElement('sup');
			marker.classList.add('asidenote-marker');
			this.after(marker);
		}
	}
}

That will yield the following DOM structure:

<aside-note>(and browser versions)</aside-note><sup></sup>

That’s all we need to generate some markers and do some positioning, as was done in my previous post.  To wit:

@supports (anchor-name: --main) {
	#thoughts {
		anchor-name: --main;
	}
	#thoughts article {
		counter-reset: asidenotes;
	}
	#thoughts article sup {
		font-size: 89%;
		line-height: 0.5;
		color: inherit;
		text-decoration: none;
	}
	#thoughts article aside-note::after,
	#thoughts article aside-note + sup::before {
		content: counter(asidenotes);
	}
	#thoughts article aside-note {
		counter-increment: asidenotes;
		position: absolute;
		anchor-name: --asidenote;
		top: max(anchor(top), calc(anchor(--asidenote bottom, 0px) + 0.67em));
		bottom: auto;
		left: calc(anchor(--main right) + 4em);
		max-width: 23em;
		margin-block: 0.15em 0;
		text-wrap: balance;
		text-indent: 0;
		font-size: 89%;
		line-height: 1.25;
		list-style: none;
	}
	#thoughts article aside-note::before {
		content: counter(asidenotes);
		position: absolute;
		top: -0.4em;
		right: calc(100% + 0.25em);
	}
	#thoughts article aside-note::first-letter {
		text-transform: uppercase;
	}
}

I went through a lot of that CSS in the previous post, so jump over there to get details on what all that means if the above has you agog.  I did add a few bits of text styling like an explicit line height and slight size reduction, and changed all the asidenote classes there to aside-note elements here, but nothing is different with the positioning and such.

Let’s go back to the JavaScript, where we can strip off the leading and trailing parentheses with relative ease.

class asideNote extends HTMLElement {
	connectedCallback() {
		if (CSS.supports('bottom','anchor(top)')) {
			let marker = document.createElement('sup');
			marker.classList.add('asidenote-marker');
			this.after(marker);
			let inner = this.innerText;
			if (inner.slice(0,1) == '(' && inner.slice(-1) == ')') {
				inner = inner.slice(1,inner.length-1);}
			this.innerText = inner;
		}
	}
}

This code looks at the innerText of the asidenote, checks to see if it both begins and ends with parentheses (which all asidenotes should!), and then if so, it strips them out of the text and sets the <aside-note>’s innerText to be that stripped string.  I decided to set it up so that the stripping only happens if there are balanced parentheses because if there aren’t, I’ll see that in the post preview and fix it before publishing.

I still haven’t added the full stop at the end of the asidenotes, nor have I accounted for asidenotes that end in punctuation, so let’s add in a little bit more code to check for and do that:

class asideNote extends HTMLElement {
	connectedCallback() {
		if (CSS.supports('bottom','anchor(top)')) {
			let marker = document.createElement('sup');
			marker.classList.add('asidenote-marker');
			this.after(marker);
			let inner = this.innerText;
			if (inner.slice(0,1) == '(' && inner.slice(-1) == ')') {
				inner = inner.slice(1,inner.length-1);}
			if (!isLastCharSpecial(inner)) {
				inner += '.';}
			this.innerText = inner;
		}
	}
}
function isLastCharSpecial(str) {
	const punctuationRegex = /[!/?/‽/.\\]/;
	return punctuationRegex.test(str.slice(-1));
}

And with that, there is really only one more point of concern: what will happen to my asidenotes in mobile contexts?  Probably be positioned just offscreen, creating a horizontal scrollbar or just cutting off the content completely.  Thus, I don’t just need a supports query in my JS.  I also need a media query.  It’s a good thing those also exist!

class asideNote extends HTMLElement {
	connectedCallback() {
		if (CSS.supports('bottom','anchor(top)') &&
			window.matchMedia('(width >= 65em)').matches) {
			let marker = document.createElement('sup');
			marker.classList.add('asidenote-marker');
			this.after(marker);
			// …the rest of the parenthesis stripping and full-stop handling stays the same…
		}
	}
}

Adding that window.matchMedia to the if statement’s test means all the DOM and content massaging will be done only if the browser understands anchor positioning and the window width is above 65 ems, which is my site’s first mobile media breakpoint that would cause real layout problems.  Otherwise, it will leave the asidenote content embedded and fully parenthetical.  Your breakpoint will very likely differ, but the principle still holds.

The one thing about this JS is that the media query only happens when the custom element is set up, same as the support query.  There are ways to watch for changes to the media environment due to things like window resizes, but I’m not going to use them here.  I probably should, but I’m still not going to.

So: will I use this version of asidenotes on meyerweb?  I might, Rabbit, I might.  I mean, I’m already using them in this post, so it seems like I should just add the JS to my blog templates and the CSS to my stylesheets so I can keep doing this sort of thing going forward.  Any objections?  Let’s hear ’em!


Have something to say to all that? You can add a comment to the post, or email Eric directly.

by Eric Meyer at October 29, 2025 01:19 PM

October 28, 2025

Igalia WebKit Team

WebKit Igalia Periodical #44

Update on what happened in WebKit in the week from October 21 to October 28.

This week has again seen a spike in activity related to WebXR and graphics performance improvements. Additionally, we got in some MathML additions, a fix for hue interpolation, a fix for WebDriver screenshots, development releases, and a blog post about memory profiling.

Cross-Port 🐱

Support for WebXR Layers has seen the very first changes needed to have them working on WebKit. This is expected to take time to complete, but should bring improvements in performance, rendering quality, latency, and power consumption down the road.

Work has started on the WebXR Hit Test Module, which will allow WebXR experiences to check for real world surfaces. The JavaScript API bindings were added, followed by an initial XRRay implementation. More work is needed to actually provide data from device sensors.

Now that the WebXR implementation used for the GTK and WPE ports is closer to the Cocoa ones, it was possible to unify the code used to handle opaque buffers.

Implemented the text-transform: math-auto CSS property, which replaces the legacy mathvariant system and is used to make identifiers italic in MathML Core.

Implemented the math-depth CSS extension from MathML Core.

Graphics 🖼️

The hue interpolation method for gradients has been fixed. This is expected to be part of the upcoming 2.50.2 stable release.

Usage of Multi-Sample Antialiasing (MSAA) has been enabled when using GPU rendering, and then further changed to use dynamic MSAA to improve performance.

Paths that contain a single arc, oval, or line have been changed to use a specialized code path, resulting in improved performance.

WebGL content rendering will be handled by a new isolated process (dubbed “GPU Process”) by default. This is the first step towards moving more graphics processing out of the process that handles Web content (the “Web Process”), which will result in increased resilience against buggy graphics drivers and certain kinds of malicious content.

The internal webkit://gpu page has been improved to also display information about the graphics configuration used in the rendering process.

WPE WebKit 📟

WPE Platform API 🧩

New, modern platform API that supersedes usage of libwpe and WPE backends.

The new WPE Platform, when using Skia (the default), now takes WebDriver screenshots in the UI Process, using the final assembled frame that was sent to the system compositor. This fixes the issues of some operations like 3D CSS animations that were not correctly captured in screenshots.

Releases 📦️

The first development releases for the current development cycle have been published: WebKitGTK 2.51.1 and WPE WebKit 2.51.1. These are intended to let third parties test upcoming features and improvements and as such bug reports for those are particularly welcome in Bugzilla. We are particularly interested in reports related to WebGL, now that it is handled in an isolated process.

Community & Events 🤝

Paweł Lampe has published a blog post that discusses GTK/WPE WebKit memory profiling using industry-standard tools and a built-in "Malloc Heap Breakdown" WebKit feature.

That’s all for this week!

by Igalia WebKit Team at October 28, 2025 02:31 PM

Eric Meyer

Parenthetical Asidenotes

It’s not really a secret I have a thing for sidenotes, and thus for CSS anchored positioning.  But a thing I realized about myself is that most of my sidenotes are likely to be tiny asides commenting on the main throughline of the text, as opposed to bibliographic references or other things that usually become actual footnotes or endnotes.  The things I would sidenote currently get written as parenthetical inline comments (you know, like this).  Asidenotes, if you will.

Once I had realized that, I wondered: could I set up a way to turn those parenthetical asides into asidenotes in supporting browsers, using only HTML and CSS?  As it turns out, yes, though not in a way I would actually use.  In fact, the thing I eventually arrived at is pretty terrible.

Okay, allow me to explain.

To be crystal clear about this, here’s how I would want one of these parenthetical asides to be rendered in browsers that don’t support anchored positioning, and then how to render in those that do (which are, as I write this, recent Chromium browsers and Safari Technology Previews; see the anchor() MDN page for the latest):

A parenthetical sitting inline (top) and turned into an asidenote (bottom).

My thinking is, the parenthetical text should be the base state, with some HTML to flag the bit that’s an asidenote, and then CSS is applied in supporting browsers to lay out the text as an asidenote.  There is a marker pair to allow an unambiguous association between the two, which is tricky, because that marker should not be in the base text, but should appear when styled.

I thought for a minute that I would wrap these little notes in <aside>s, but quickly realized that would probably be a bad idea for accessibility and other reasons.  I mean, I could use CSS to cast the <aside> to an inline box instead of its browser-default block box, but I’d need to label each one separately, be very careful with roles, and so on and so on.  It was just the wrong tool, it seemed to me.  (Feel free to disagree with me in the comments!)

So, I started with this:

<span class="asidenote">(Feel free to disagree with me in the comments!)</span>

That wasn’t going to be enough, though, because I can certainly position this <span>, but there’s nothing available to leave a marker behind when I do!  Given the intended result, then, there needs to be something in the not-positioned text that serves in that role (by which I mean a utility role, not an ARIA role).  Here’s where my mind went:

<span class="asidenote">(by which I mean a utility role, not an ARIA role)</span><sup></sup>

The added <sup> is what will contain the marker text, like 1 or a or whatever.

This seemed like it was the minimum viable structure, so I started writing some styles.  These asidenotes would be used in my posts, and I’d want the marker counters to reset with each blog post, so I built the selectors accordingly:

@supports not (anchor-name: --main) {
	#thoughts article .asidenote + sup {
		display: none;
	}
} 
@supports (anchor-name: --main) {
	#thoughts {
		anchor-name: --main;
	}
	#thoughts article {
		counter-reset: asidenotes;
	}
	#thoughts article .asidenote::before,
	#thoughts article .asidenote + sup::before {
		content: counter(asidenotes);
	}
}

So far, I’ve set a named anchor on the <main> element (which has an id of thoughts) that encloses a page’s content, reset a counter on each <article>, and inserted that counter as the ::before content for both the asidenotes’ <span>s and the <sup>s that follow them.  That done, it’s time to actually position the asidenotes:

	#thoughts article .asidenote {
		counter-increment: asidenotes;
		position: absolute;
		anchor-name: --sidenote;
		top: max(calc(anchor(--sidenote bottom, 0px) + 0.67em), anchor(top));
		bottom: auto;
		left: calc(anchor(--main right) + 4em);
		max-width: 23em;
		margin-block: 0.15em 0;
		text-wrap: balance;
		text-indent: 0;
	}

Here, each class="asidenote" element increments the asidenotes counter by one, and then the asidenote is absolutely positioned so its top sits at the larger of two values: two-thirds of an em below the bottom of the previous asidenote (if any), or the top of its implicit anchor, which, because I didn’t set an explicit named anchor for it in this case, seems to be the place it would have occupied in the normal flow of the text.  This latter bit is long-standing behavior in absolute positioning of inline elements, so it makes sense.  I’m just not sure it fully conforms to the specification, though it’s particularly hard for me to tell in this case.

Moving on!  The left edge of the asidenote is set 4em to the right of the right edge of --main and then some formatting stuff is done to keep it balanced and nicely sized for its context.  Some of you will already have seen what’s going to happen here.

An asidenote with some typographic decoration it definitely should not have in this context.

Yep, the parentheses came right along with the text, and in general the whole thing looks a little odd.  I could certainly argue that these are acceptable design choices, but it’s not what I want to see.  I want the parentheses to go away when laid out as an asidenote, and also to capitalize the first letter if it isn’t already, plus close out the text with a full stop.

And this is where the whole thing tipped over into “I don’t love this” territory.  I can certainly add bits of text before and after an element’s content with pseudo-elements, but I can’t subtract bits of text (not without JavaScript, anyway).  The best I can do is suppress their display, but for that, I need structure.  So I went this route with the markup and CSS:

<span class="asidenote"><span>(</span>by which I mean a utility role, not an ARIA role<span>)</span></span><sup></sup>
	#thoughts article .asidenote span:is(:first-child, :last-child) {
		display: none;
	}

I could have used shorter elements like <b> or <i>, and then styled them to look normal, but nah.  I don’t love the clutter, but <span> makes more sense here.

With those parentheses gone, I can uppercase the first visible letter and full-stop the end of each asidenote like so:

	#thoughts article .asidenote::first-letter {
		text-transform: uppercase;
	}
	#thoughts article .asidenote::after {
		content: ".";
	}

Then I do a little styling of the asidenote’s marker:

	#thoughts article .asidenote::before {
		content: counter(asidenotes);
		position: absolute;
		top: -0.4em;
		right: calc(100% + 0.25em);
	}
} /* closes out the @supports block */

…and that’s more or less it (okay, yes, there are a few other tweaks to the markers and their sizes and line heights and asidenote text size and blah blah blah, but let’s not clutter up the main points by slogging through all that).  With that, I get little asides that are parenthetical in the base text, albeit with a bunch of invisible-to-the-user markup clutter, that will be progressively enhanced into full asidenotes where able.

There’s an extra usage trap here, as well: if I always generate a full stop at the end, it means I should never end my asidenotes with a question mark, exclamation point, interrobang, or other sentence-ending character.  But those are things I like to do!

So, will I use this on meyerweb?  Heck to the no.  The markup clutter is much more annoying than the benefit, it fumbles on some pretty basic use cases, and I don’t really want to go to the lengths of creating weird bespoke text macros  —  or worse, try to fork and extend a local Markdown parser to add some weird bespoke text pattern  —  just to make this work.  If CSS had a character selector that let me turn off the parentheses without needing the extra <span>s, and some kind of outside-the-element generated content, then maybe yes.  Otherwise, no, this is not how I’d do it, at least outside this post.  At the very least, some JavaScript is needed to remove bits of text and decide whether to append the full stop.

Given that JS is needed, how would I do it?  With custom elements and the Light DOM, which I’ll document in the next post.  Stay tuned!


Have something to say to all that? You can add a comment to the post, or email Eric directly.

by Eric Meyer at October 28, 2025 01:04 PM

October 24, 2025

Pawel Lampe

Tracking WebKit's memory allocations with Malloc Heap Breakdown

One of the main constraints that embedded platforms impose on browsers is very limited memory. Combined with the fact that embedded web applications tend to run actively for days, weeks, or even longer, it’s not hard to imagine how important proper memory management within the browser engine is in such use cases. In fact, WebKit, and WPE in particular, receive numerous memory-related fixes and improvements every year. Before making any changes, however, the areas to fix/improve need to be narrowed down first. Like any C++ application, WebKit can have its memory profiled using a variety of industry-standard tools. Although such well-known tools are really useful in the majority of use cases, they have limits that manifest themselves when applied to production-grade embedded systems in conjunction with long-running web applications. In such cases, a very useful tool is a debug-only feature of WebKit itself called malloc heap breakdown, which this article describes.

Industry-standard memory profilers #

When it comes to profiling the memory of applications on Linux systems, the two tools that usually stand out are Massif (Valgrind) and Heaptrack.

Massif (Valgrind) #

Massif is a heap profiler that comes as part of the Valgrind suite. As its documentation states:

It measures how much heap memory your program uses. This includes both the useful space, and the extra bytes allocated for book-keeping and alignment purposes. It can also measure the size of your program’s stack(s), although it does not do so by default.

Using Massif with WebKit is very straightforward and boils down to a single command:

Malloc=1 valgrind --tool=massif --trace-children=yes WebKitBuild/GTK/Debug/bin/MiniBrowser '<URL>'
  • The Malloc=1 environment variable set above is necessary to instruct WebKit to enable debug heaps that use the system malloc allocator.

Given some results are generated, the memory usage over time can be visualized using massif-visualizer utility. An example of such a visualization is presented in the image below:

Example of a massif-visualizer view of memory usage over time.

While Massif has been widely adopted and used for many years now, from the very beginning, it suffered from a few significant downsides.

First of all, the way Massif instruments the profiled application introduces significant overhead that may slow the application down by up to two orders of magnitude. In some cases, such overhead makes it simply unusable.

The other important problem is that Massif is snapshot-based, and hence, the level of detail is not ideal.

Heaptrack #

Heaptrack is a modern heap profiler developed as part of KDE. Below is its description from the Git repository:

Heaptrack traces all memory allocations and annotates these events with stack traces. Dedicated analysis tools then allow you to interpret the heap memory profile to:

  • find hotspots that need to be optimized to reduce the memory footprint of your application
  • find memory leaks, i.e. locations that allocate memory which is never deallocated
  • find allocation hotspots, i.e. code locations that trigger a lot of memory allocation calls
  • find temporary allocations, which are allocations that are directly followed by their deallocation

At first glance, Heaptrack resembles Massif. However, a closer look at its architecture and features shows that it is much more than that: while it’s fair to say the two are a bit similar, Heaptrack is a significant step forward.

Usage of Heaptrack to profile WebKit is also very simple. At the moment of writing, the most suitable way to use it is to attach to a certain running WebKit process using the following command:

heaptrack -p <PID>

while WebKit needs to be run with system malloc, just like in the Massif case:

WEBKIT_DISABLE_SANDBOX_THIS_IS_DANGEROUS=1 Malloc=1 WebKitBuild/GTK/Debug/bin/MiniBrowser '<URL>'
  • If profiling of e.g. the web content process startup is essential, it’s then recommended to also use WEBKIT2_PAUSE_WEB_PROCESS_ON_LAUNCH=1, which adds a 30-second delay to the process startup.

When the profiling session is done, the analysis of the recordings is done using:

heaptrack --analyze <RECORDING>

The utility opened with the above command shows various things, such as the memory consumption over time:

Heaptrack view of memory consumption over time.

flame graphs of memory allocations with respect to certain functions in the code:

Heaptrack flame graph of memory allocations per function.

etc.

As Heaptrack records every allocation and deallocation, the data it gathers is very precise and full of details, especially when accompanied by stack traces arranged into flame graphs. Also, as Heaptrack instruments the application differently than e.g. Massif, it’s usually much faster, in the sense that it slows down the profiled application by only up to one order of magnitude.

Shortcomings on embedded systems #

Although the memory profilers such as above are really great for everyday use, their limitations on embedded platforms are:

  • they significantly slow down the profiled application — especially on low-end devices,
  • they effectively cannot be run for a longer period of time such as days or weeks, due to memory consumption,
  • they are not always provided in the images — and hence require additional setup,
  • they may not be buildable out of the box on certain architectures — thus requiring extra patching.

While the above limitations are not always a problem, usually at least one of them is. What’s worse, usually at least one of them turns into a blocking problem. For example, if the target device is very short on memory, it may be basically impossible to run anything extra beyond the browser. Another example could be a situation where the application slowdown caused by the profiler leads to different application behavior, such that a problem that originally reproduced 100% of the time does not reproduce anymore.

Malloc heap breakdown in WebKit #

Profiling the memory of WebKit while addressing the above problems points towards a solution that does not involve any extra tools, i.e. instrumenting WebKit itself. Normally, adding such instrumentation to a C++ application means a lot of work. Fortunately, in the case of WebKit, all that work is already done and can be easily enabled by using the Malloc heap breakdown.

In a nutshell, Malloc heap breakdown is a debug-only feature that enables memory allocation tracking within WebKit itself. Since it’s built into WebKit, it’s very lightweight and very easy to build, as it’s just a matter of setting the ENABLE_MALLOC_HEAP_BREAKDOWN build option. Internally, when the feature is enabled, WebKit switches to using debug heaps that use system malloc along with the malloc zone API to mark objects of certain classes as belonging to different heap zones, thus allowing one to track the allocation sizes of such zones.
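
For the GTK port, for example, such a build could be configured along the lines of the following command (a hypothetical invocation of the build-webkit helper script; the exact arguments depend on your setup):

Tools/Scripts/build-webkit --gtk --debug --cmakeargs=-DENABLE_MALLOC_HEAP_BREAKDOWN=ON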

As the malloc zone API is specific to BSD-like OSes, the actual implementations (and usages) in WebKit have to be considered separately for Apple and non-Apple ports.

Malloc heap breakdown on Apple ports #

Malloc heap breakdown was originally designed only with Apple ports in mind, with the reason being twofold:

  1. The malloc zone API is provided virtually by all platforms that Apple ports integrate with.
  2. macOS platforms provide a great utility called footprint that allows one to inspect per-zone memory statistics for a given process.

Given the above, usage of malloc heap breakdown with Apple ports is very smooth and as simple as building WebKit with the ENABLE_MALLOC_HEAP_BREAKDOWN build option and running on macOS while using the footprint utility:

Footprint is a macOS-specific tool that allows the developer to check memory usage across regions.

For more details, one should refer to the official documentation page.

Malloc heap breakdown on non-Apple ports #

Since all of the non-Apple WebKit ports are mostly built and run on non-BSD-like systems, it’s safe to assume the malloc zone API is not offered to such ports by the system itself. Because of that, for many years, malloc heap breakdown was only available for Apple ports.

Fortunately, with the changes introduced in 2025, such as: 294667@main (+ fix 294848@main), 301702@main, and improvements such as: 294848@main, 299555@main, 301695@main, 301709@main, 301712@main, 301839@main, 301861@main, the malloc heap breakdown integrates also with non-Apple ports and is stable as of main@a235408c2b4eb12216d519e996f70828b9a45e19.

The idea behind the integration for non-Apple ports is to provide a simple WebKit-internal library with a fake <malloc/malloc.h> header and a simple implementation of the malloc_zone_*() functions as proxy calls to malloc(), calloc(), realloc() etc., along with a tracking mechanism that keeps references to memory chunks. Such an approach gathers all the information needed to be reported later on.

At the moment of writing, the above allows 2 methods of reporting the memory usage statistics periodically:

  • printing to standard output,
  • reporting to sysprof as counters.
Periodic reporting to standard output

By default, when WebKit is built with ENABLE_MALLOC_HEAP_BREAKDOWN, the heap breakdown is printed to the standard output every few seconds for each process. That can be tweaked by setting the WEBKIT_MALLOC_HEAP_BREAKDOWN_LOG_INTERVAL=<SECONDS> environment variable.

The results have a structure similar to the one below:

402339 MHB: | PID | "Zone name" | #chunks | #bytes | {
402339 "ExecutableMemoryHandle" 2 32
402339 "AssemblerData" 1 192
402339 "VectorBuffer" 37 16184
402339 "StringImpl" 103 5146
402339 "WeakPtrImplBase" 17 272
402339 "HashTable" 37 9408
402339 "Vector" 1 16
402339 "EmbeddedFixedVector" 1 32
402339 "BloomFilter" 2 65536
402339 "CStringBuffer" 3 86
402339 "Default Zone" 0 0
402339 } MHB: grand total bytes allocated: 9690

Given the allocation statistics per zone, it’s easy to narrow down unusual usage patterns manually. An example of a successful investigation is presented in the image below:

Example of a successful investigation based on the per-zone allocation statistics.

Moreover, the data presented can be processed, either manually or using scripts, to create memory usage charts that span the entire application lifetime, e.g. hours (20+ like below), days, or even longer:

Per-zone memory usage chart spanning 20+ hours of application lifetime.
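
Since each per-zone log line follows a simple <PID> "<Zone name>" <chunks> <bytes> pattern, a minimal sketch of such a script could look like the following (a hypothetical example, not part of WebKit; it assumes the output format shown above and, for simplicity, ignores the per-process grouping a real script would probably want):

# Sketch: collect per-zone byte counts from the periodic MHB log output.
import re
import sys
from collections import defaultdict

ZONE_LINE = re.compile(r'^(\d+)\s+"([^"]+)"\s+(\d+)\s+(\d+)\s*$')

def parse_mhb_log(path):
    zones = defaultdict(list)  # zone name -> one byte count per report
    with open(path) as log:
        for line in log:
            match = ZONE_LINE.match(line.strip())
            if match:
                _pid, zone, _chunks, nbytes = match.groups()
                zones[zone].append(int(nbytes))
    return zones

if __name__ == "__main__":
    for zone, samples in sorted(parse_mhb_log(sys.argv[1]).items()):
        print(f"{zone}: {len(samples)} samples, last {samples[-1]} bytes")

The resulting per-zone series can then be fed into any plotting tool to produce charts like the one above.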
Periodic reporting to sysprof

The other reporting mechanism currently supported is reporting periodically to sysprof as counters. In short, sysprof is a modern system-wide profiling tool that already integrates with WebKit very well when it comes to non-Apple ports.

The condition for malloc heap breakdown reporting to sysprof is that the WebKit browser needs to be profiled e.g. using:

sysprof-cli -f -- <BROWSER_COMMAND>

and sysprof has to be the latest version possible.

With the above, the memory usage statistics can then be inspected using the sysprof utility, and look like the image below:

Per-zone memory counters displayed in sysprof.

In the case of sysprof, the memory statistics are just a minor addition to other powerful features that were well described in this blog post from Georges.

Caveats #

While malloc heap breakdown is very useful in some use cases — especially on embedded systems — there are a few problems with it.

First of all, compilation with -DENABLE_MALLOC_HEAP_BREAKDOWN=ON is not guarded by any continuous integration bots; therefore, compilation issues are to be expected on the latest WebKit main. Fortunately, fixing such problems is usually straightforward. For a reference on what usually causes compilation problems, one should refer to 299555@main, which contains a full variety of fixes.

The second problem is that malloc heap breakdown uses WebKit’s debug heaps, and hence the memory usage patterns may be different just because system malloc is used.

The third and final problem is that the malloc heap breakdown integration for non-Apple ports introduces some overhead, as the allocations need to lock/unlock a mutex, and as statistics are stored in memory as well.

Opportunities #

Although malloc heap breakdown can be considered fairly constrained, in the case of non-Apple ports, it gives some additional possibilities that are worth mentioning.

Because on non-Apple ports, the custom library is used to track allocations (as mentioned at the beginning of the Malloc heap breakdown on non-Apple ports section), it’s very easy to add more sophisticated tracking/debugging/reporting capabilities. The only file that requires changes in such a case is: Source/WTF/wtf/malloc_heap_breakdown/main.cpp.

Some examples of custom modifications include:

  • adding different reporting mechanisms — e.g. writing to a file, or to some other tool,
  • reporting memory usage with more details — e.g. reporting the per-memory-chunk statistics,
  • dumping raw memory bytes — e.g. when some allocations are suspicious,
  • altering memory in-place — e.g. to simulate memory corruption.

Summary #

While the presented malloc heap breakdown mechanism is a rather poor approximation of what industry-standard tools offer, its main benefit is that it’s built into WebKit, and that in some rare use cases (especially on embedded platforms) it’s the only way to perform any reasonable profiling.

In general, as a rule of thumb, it’s not recommended to use malloc heap breakdown unless all other methods have failed. In that sense, it should be considered a last resort approach. With that in mind, malloc heap breakdown can be seen as a nice mechanism complementing other tools in the toolbox.

October 24, 2025 12:00 AM

October 23, 2025

Jasmine Tang

Building BRT.nvim - an atuin clone for my neovim workflow

Jasmine builds a tool to shorten her development feedback cycle?!

October 23, 2025 12:00 AM

October 22, 2025

Brian Kardell

The Juice

The Juice

What's worth the squeeze?

Every day, for both my physical and mental health, I get away from the computer for a while and go for a walk (or two) and listen to some podcast. Sometimes a title surprises me.

This week's Shop Talk (#687) is a good example. It's titled "Ben Frain on Responsive Design" and to be honest, I wondered if I'd get anything out of it. Responsive Design... Haven't we been doing that for... Well, a really long time? Sure. And they talk about that.

But what interested me most was more of a sidequest: There were some "spicy" but thoughtful takes. There was some discussion about which things were actually valuable. And how valuable? In what ways? There was discussion about whether some of "the juice was worth the squeeze?"

Was @scope really worth it? Is <picture> as valuable as we thought it would be? What about Container Queries? Or @layer? Or are View Transitions too confusing? Or are you actually using :has as much as you imagined? Honestly, I love hearing some of this discussion from developers, because the question isn't "does it have value?" (the answer to that is "yes, of course"). The questions are more like "At what cost?" or "Did we have to give something else up in order to get that?" or "Do we sometimes try to land a 'quick win' and then come back and really solve the problem, and sort of make it more confusing?" They are great questions - and reasonable people can disagree!

Projects like Interop are us trying to do our best to balance lots of inputs and help us make good choices. Choosing to prioritize something from a list is, inherently, not choosing something else. This year my friend Jake Archibald put together this prioritization survey tool which lets you try to participate in what is effectively a step of our process: order them as you'd personally assign priority if you were in charge. It's hard, and this process is over-simplified in the sense that you are not constrained by the sorts of things real engineering teams are. Maybe you sort 10 items to the top, but half of them are very expensive and also involve the same expertise. Maybe that means that realistically we can only pick 2 of those 10, and then we can also pick several more items outside your "top 10".

There are debatable signals. Only a few people will pick OffscreenCanvas but most people deal with canvas via only a few primary libraries - so it doesn't take much to lift performance of lots of sites. MathML isn't going to make you a lot of money - but actually sharing mathematical text is super important. Or, maybe something seems only mildly valuable, but someone is actually willing to pay for the work to get done. That changes the calculus about what kind of standards effort it will require from engine stewards themselves.

And, at the end of the day, in a way, all of us are speculating about perceived value. The truth is that we often just can't know. Sometimes, only time will tell. Sometimes a feature lands flat, only to take off years later. Sometimes the features we were really excited about turn out to be not so great.

Once upon a time I imagined a future where the web community did a lot more of this kind of discussion, a lot earlier. I imagined that we should be trying to give people answers in terms of polyfills and origin trials and including their actual feedback with time to use it before we went forward.

We have done more of that than we used to, but not nearly as much as I'd hoped. I still kind of want us to get there. It means standards might go a little slower sometimes, I guess - but you could get solutions and input faster and have more opportunity for course corrections.

Anyway, fun episode. I'd love to hear your critiques too - send them!

October 22, 2025 04:00 AM

October 20, 2025

Igalia WebKit Team

WebKit Igalia Periodical #43

Update on what happened in WebKit in the week from October 13 to October 20.

This week was calmer than the previous one, but we still had some meaningful updates. We had a Selenium update, improvements to how tile sizes are calculated, and a new Igalian in the list of WebKit committers!

Cross-Port 🐱

Selenium's relative locators are now supported after commit 301445@main. Before, finding elements with locate_with(By.TAG_NAME, "input").above({By.ID: "password"}) could lead to "Unsupported locator strategy" errors.

Graphics 🖼️

A patch landed to compute the layers' tile size using a different strategy depending on whether GPU rendering is enabled, which improved performance for both GPU and CPU rendering modes.

Community & Events 🤝

Our coworker Philip Chimento gained WebKit committer status!

That’s all for this week!

by Igalia WebKit Team at October 20, 2025 08:35 PM

October 13, 2025

Igalia WebKit Team

WebKit Igalia Periodical #42

Update on what happened in WebKit in the week from October 6 to October 13.

Another week with many updates in Temporal, the automated testing infrastructure is now running WebXR API tests; and WebKitGTK gets a fix for the janky Inspector resize while it drops support for libsoup 2. Last but not least, there are fresh releases of both the WPE and GTK ports including a security fix.

Cross-Port 🐱

Multimedia 🎥

GStreamer-based multimedia support for WebKit, including (but not limited to) playback, capture, WebAudio, WebCodecs, and WebRTC.

When using libwebrtc, support has been added to register MDNS addresses of local networks as ICE candidates, to avoid exposing private addresses.

JavaScriptCore 🐟

The built-in JavaScript/ECMAScript engine for WebKit, also known as JSC or SquirrelFish.

JavaScriptCore's implementation of Temporal received a flurry of improvements:

  • Implemented the toString, toJSON, and toLocaleString methods for the PlainMonthDay type.

  • Brought the implementation of the round method on TemporalDuration objects up to spec. This is the last in the series of patches that refactor TemporalDuration methods to use the InternalDuration type, enabling mathematically precise computations on time durations.

  • Implemented basic support for the PlainMonthDay type, without most methods yet.

  • Brought the implementations of the since and until functions on Temporal PlainDate objects up to spec, improving the precision of computations.

WebKitGTK 🖥️

WebKitGTK will no longer support using libsoup 2 for networking starting with version 2.52.0, due in March 2026. An article on the website has more details and migration tips for application developers.

Fixed a bug where the width and height of the docked Web Inspector window would jitter while dragging the resizer.

Releases 📦️

WebKitGTK 2.50.1 and WPE WebKit 2.50.1 have been released. These include a number of small fixes, improved text rendering performance, and a fix for audio playback on Instagram.

A security advisory, WSA-2025-0007 (GTK, WPE), covers one security issue fixed in these releases. As usual, we recommend users and distributors to keep their WPE WebKit and WebKitGTK packages updated.

Infrastructure 🏗️

Updated the API test runner to run monado-service without standard input using XRT_NO_STDIN=TRUE, which allows the WPE and GTK bots to start validating the WebXR API.

Submitted a change that allows relaxing the DMA-BUF requirement when creating an OpenGL display in the OpenXRCoordinator, so that bots can run API tests in headless environments that don't have that extension.

That’s all for this week!

by Igalia WebKit Team at October 13, 2025 08:02 PM

October 12, 2025

Luis Henriques

Kernel Recipes 2025

Kernel Recipes Mascot
KernelRecipes Mascot © 2025 by Emma Tizzoni is licensed under CC BY-NC-ND 4.0

Kernel Recipes is an amazing conference and it is unique in several different ways. First of all because it is community-oriented, and the environment is really open and friendly. And then because it has a single track – i.e. all the talks are in the same room – people don't need to choose which talks to attend: they'll attend all of them. Oh, and there was even a person (Anisse Astier) that was doing live blogging. How awesome is that?

This year I managed to attend this conference for the first time, in its usual place: Paris, in the Cité Internationale campus.

All the sessions were recorded and the videos are available at the conference website so that people can (re)watch them. For this reason, in this post I am not going through all the talks I've watched, but I would like to mention a few of them that I personally (and very subjectively!) found more interesting.

The first two I'd like to mention are those from my Igalian friends, of course! Melissa Wen gave a talk named Kworkflow: Mix & Match Kernel Recipes End-to-end. This talk was about kworkflow, which glues together a set of tools and scripts into a single unified interface for the kernel development workflow. The other talk from Igalia was delivered by Maíra Canal, and was about the evolution of the Rust programming language usage within the DRM subsystem. Her talk was named A Rusty Odyssey: A Timeline of Rust in the DRM subsystem.

As expected, there were plenty of different areas covered by the talks, but the ones I found most exciting were those related to memory management. And there were a few of them. The first one was from Lorenzo Stoakes (the guy that wrote "the book"!). He delivered the talk Where does my memory come from?, explaining this "simple" thing: what exactly happens when a user-space application does malloc()?

The second talk related to memory management was from Matthew Wilcox, touching on different aspects of how reclaiming memory from within the VFS (and in file systems in general) can be tricky. Unsurprisingly, the name of his talk was Filesystem & Memory Reclaim.

The last memory-management-related talk was from Vlastimil Babka, who talked about the contents of the /proc/vmstat file in a talk named Observing the memory mills running.

The last talk I'd like to mention was Alice Ryhl's So you want to write a driver in Rust? It's not that I'm a huge Rust enthusiast myself, or that I actually know how to program in Rust (I do not!). But it was a nice talk for someone looking for a good excuse to start looking into this programming language and maybe get the missing push to start learning it!

Finally, a Huge Thanks to the organisation (and all the sponsors, of course), as they definitely managed to keep up a very high-quality conference in such a friendly environment. Looking forward to Kernel Recipes 2026!

October 12, 2025 11:00 PM

October 10, 2025

Olivier Tilloy

A polite URL handler

Yesterday a colleague of mine was asking around for a way to get their GNOME desktop to always ask which browser to use when a third-party application wants to open a hyperlink. Something like that:

App chooser dialog

If no browser has ever been registered as the default handler for the HTTP/HTTPS schemes, then the first time around that dialog would theoretically pop up. But that’s very unlikely. And as another colleague pointed out, there is no setting to enforce the “always ask” option.

So I came up with a relatively self-contained hack to address this specific use case, and I’m sharing it here in case it’s useful to others (who knows?), to my future self, or for your favourite LLM to ingest, chew and regurgitate upon request.

First, drop a desktop file that invokes the OpenURI portal over D-Bus in ~/.local/share/applications:

📝 ~/.local/share/applications/url-opener-always-ask.desktop

[Desktop Entry]
Name=URL opener - always ask
Exec=busctl call --user org.freedesktop.portal.Desktop /org/freedesktop/portal/desktop org.freedesktop.portal.OpenURI OpenURI ssa{sv} "" %u 1 ask b true
NoDisplay=true
Type=Application

Then, make that wrapper the default scheme handler for HTTP and HTTPS:

$ for scheme in http https; do \
gio mime x-scheme-handler/${scheme} url-opener-always-ask.desktop; \
done

And you’re all set!

Note that a slightly annoying side effect is that your preferred browser will likely complain that it’s not the default any longer.

You can at any time revert to associating these schemes to your preferred browser, e.g.:

$ xdg-settings set default-web-browser firefox.desktop

Note that I mentioned GNOME at the beginning of this post, but this should work in any desktop environment that provides an XDG desktop portal backend for the OpenURI interface.

✏️ EDIT: My colleague Philippe told me about Junction, a dedicated tool that addresses this very use case, with a much broader scope. It appears to be GNOME-specific, and is neatly packaged as a flatpak. An interesting project worth checking out.

October 10, 2025 12:00 AM

October 06, 2025

Igalia WebKit Team

WebKit Igalia Periodical #41

Update on what happened in WebKit in the week from September 29 to October 6.

Another exciting week full of updates: this time we have a number of fixes for MathML, Content Security Policy, and Trusted Types, the public API for WebKitWebExtension has finally been added, and enumeration of speaker devices has been fixed. In addition to that, there's ongoing work to improve compatibility with broken AAC audio streams in MSE, a performance improvement to text rendering with Skia was merged, and multi-plane DMA-BUF handling in WPE has been fixed. Last but not least, the 2026 edition of the Web Engines Hackfest has been announced! It will take place from June 15th to the 17th.

Cross-Port 🐱

Fixed rendering for unknown elements in MathML.

Fixed incorrect parsing of malformed require-trusted-types-for CSP directive.

Aligned reporting of Trusted Types violations with the spec in the case of multiple Content-Security-Policy headers.

Aligned Trusted Types event handler namespace checks with an update to the specification.

Fixed some incorrect handling of null or undefined policy values in Trusted Types.

On the WebExtensions front, the WebKitWebExtension API has finally been added, after porting some more code from Objective-C code to C++.

Improved alignment with MathML Core by making mfenced, semantics and maction render like an mrow, ignoring the subscriptshift/superscriptshift legacy attributes and cleaning the User-Agent stylesheet to more closely match the spec.

Multimedia 🎥

GStreamer-based multimedia support for WebKit, including (but not limited to) playback, capture, WebAudio, WebCodecs, and WebRTC.

Speaker device enumeration has been fixed to properly enumerate ALSA PCM devices, while improving audio output device handling in general.

Improved compatibility for broken AAC audio streams in MSE is currently in review.

JavaScriptCore 🐟

The built-in JavaScript/ECMAScript engine for WebKit, also known as JSC or SquirrelFish.

In JavaScriptCore's implementation of Temporal, improved the precision of addition and subtraction on Durations.

In JavaScriptCore's implementation of Temporal, improved the precision of calculations with the total() function on Durations. This was joint work with Philip Chimento.

In JavaScriptCore's implementation of Temporal, continued refactoring addition for Durations to be closer to the spec.

Graphics 🖼️

Landed a patch to build a SkTextBlob when recording DrawGlyphs operations for the GlyphDisplayListCache, which shows a significant improvement in MotionMark “design” test when using GPU rendering.

WPE WebKit 📟

WPE Platform API 🧩

New, modern platform API that supersedes usage of libwpe and WPE backends.

Improved wpe_buffer_import_to_pixels() to work correctly on non-linear and multi-plane DMA-BUF buffers by taking into account their modifiers when mapping the buffers.

Community & Events 🤝

The 2026 edition of the Web Engines Hackfest has been announced, and it will take place from June 15th to the 17th.

That’s all for this week!

by Igalia WebKit Team at October 06, 2025 08:20 PM

October 03, 2025

Iago Toral

XDC 2025

It has been a while since my last post, I know. Today I just want to thank Igalia for continuing to give me and many other Igalians the opportunity to attend XDC. I had a great time in Vienna where I was able to catch up with other Mesa developers (including Igalians!) I rarely have the opportunity to see face to face. It is amazing to see how Mesa continues to gain traction and interest year after year, seeing more actors and vendors getting involved in one way or another… the push for open source drivers in the industry is real and it is fantastic to see it happening.

I’d also like to thank the organizers. I know all the work that goes into making these things happen, so big thanks to everyone who was involved, and to the speakers: the XDC program is getting better every year.

Looking forward to next year already 🙂

by Iago Toral at October 03, 2025 06:58 AM

October 02, 2025

Seokho Song

Co-Organizing Browser Night, the browser developer meetup in South Korea

The Beginning…

At the beginning of March 2025, I had a chance to meet a few browser folks in South Korea who work or have worked in the browser field.

It was nice to meet them and talk about very enjoyable topics. It was a chill and pleasant time.

One of the topics we talked about was ‘what if we hold a browser engineer meetup’. Yeah. It was one of the lighter topics and we didn’t talk about it too seriously.

So after a while, at the end of March, we chatted about that on KakaoTalk (which is a chat application like LINE or Signal), and we scheduled an initial meeting.

One of us had experience organizing developer meetups, so we got off to a great start.

I was so excited about this because I felt there was a lack of space for browser engineers in South Korea.

Yeah, this journey had begun.

The Planning

So for a few months, we held regular meetings on Mondays at 10 PM. The first, and probably hardest, issue was the naming. Yes. The naming.

We had several candidates… like

  • Blink On Korea
  • Open Web Platform Meetup
  • Web Platform Dev Meetup
  • …and the others

One of us proposed ‘browseRus’ which was inspired by Toys“R”Us.

And the process went smoothly. We decided on, and did, things like:

  1. Making a Google Form for invitations
  2. Making a landing page to describe what it is and what to do
  3. Securing a location; the most important thing, I guess
  4. Goods; the cup
  5. Preparing a feedback form
  6. Defining the event format: deep technical sessions or light topics, etc.
  7. Thinking about how to help people connect. What if a participant struggles to talk?
  8. Separating the event into two sessions: first a lightning talk session, then a networking session
  9. Providing pizza and chicken with beer
  10. Preparing question cards for the second session
  11. Offering a discount for students
  12. and… many things!

The landing page: Browser Night 2025 - browseRus

The time flew by so fast. Before we knew it, it was the end of May.

I also prepared my lightning talk: Chromium, V8 Committer 2관왕 달성기 (My Journey to Becoming a Committer for Both Chromium and V8).

The D-Day

May 29, 2025. I finished my work at 6:00PM and headed to the Open Up Center, the location of our meetup. It was a rainy day.

Many people had already arrived at the center. I greeted the participants at the front desk, checked them in and provided name badges with goods.

In total, 53 people attended this event!

We joked that these people were all of the browser engineers in South Korea.

The lightning talk session:

Speaker Topic
Sangwoo Ko YouTubing with Chromium
Seokho Song My Journey to Becoming a Committer for Both Chromium and V8
Gyuyoung Kim Introduction to Blink for iOS
Gyuyoung Kim Introduction to Blink for Apple tvOS
Hyungwook Lee AI in the Web and Browsers
Hyunjune Kim Machine Unlearning and Recent Updates
Euisang(Amos) Lim My Experience as a Mentor at the Open Source Contribution Academy
최민섭 Open Source Contribution Academy: My First Chromium Contribution Story
Joonghun Park An Introduction to Samsung Internet
최영수 How LG Uses a Web Engine: About LG webOS
최병운 Introducing a Book on Browser Engineering
이상현 The browser that actually does stuff for you
Alan Jinkyu Jang The History of Interaction Tech: Beyond Mouse and Keyboard
Hyojin Song A Story About a Chromium Developer I Met at a W3C Meeting
Dongjun Kim The Failure Story of “JuOSA” (Weekend Open Source Doers)

(Note: To respect the original and accurate spelling, some speakers’ names are written in Korean.)

Here are some pictures from the event:

name badge

introduction

my lightning talk!

cup!

browser folks!

Looking back

So yeah. The first Browser Night was a success.

All the sessions were very exciting, helpful, and fun.

The networking session was very fun, and everyone enjoyed chatting with each other. We had been a little worried about breaking the ice, but that concern was completely unnecessary.

Based on the feedback form, we got a 4.87 out of 5 for overall satisfaction.

And 92.7% of people responded that they would like to attend again. (If we include those who responded 'consider', it would be 100%.)

Thanks to…

I’m so grateful for my co-organizers. Seriously, this would have been impossible without them leading the way.

by Seokho Song (me@seokho.dev) at October 02, 2025 12:00 AM

October 01, 2025

Brian Kardell

Under-investment

Under-investment

A lot more words on a short statement I made last week on social media...

A couple of weeks ago I posted on social media that

It never ceases to amaze me how much stuff there is on the web platform that needs more attention than it gets in practice, despite vendors spending tons already.

Dave Rupert replied asking

could you itemize that list? i'd be curious. seems like new shiny consumes a lot of the efforts.

I said "no" at the time because it's true it would be a very long list and exceptionally time consuming task if exhaustive, but... It is probably worth rattling off a bunch that I know more or less off the top of my head from experience - so here goes (in no particular order)... I'll comment on a few:

Big general areas...

There are certain areas of focus that just always get shoved to the back burner.

Print

It's almost absurd to me that printing and print-related APIs have the problems and concerns that they still do, given that so much of enterprise and government is web based. For example: Will your images be loaded? Who knows! Did you know there is a .print() and it doesn't act the same in several respects as choosing print from the menu? Shouldn't the browser support many of the CSS-based features that print pioneered? Like... paging? Or at least, actually investing in considering it in the browser at the same time could have helped us determine whether those were even good ideas, and helped shape the APIs.

Accessibility

In theory all of the processes are supposed to help create standards and browsers that are accessible - in practice, we miss on this more often than is comfortable to admit. This is mainly because - for whatever reason - so much of this, from reviews to testing to standards work in designing APIs in the first place, is largely done by volunteers or people disconnected from vendors themselves and just trying to keep up. My colleague Alice Boxhall wrote a piece that touches on this, and more.

Internationalization

Probably in better shape than accessibility in many ways, but the same basic sorts of things apply here.

Testing Infrastructure

The amount of things that we are incapable of actually testing is way higher than we should be comfortable with. The web actually spent the first 15 years or so of its life without any actual shared testing like web platform tests. Today, lots and lots of that infrastructure is just Google provided, so not community owned or anything.

Forgotten tech

Then there are certain big, important projects that were developed and have been widely deployed for ten, or even close to twenty, years at this point, but were maybe a little wonky or buggy and then just sort of walked away from.

SVG

After some (like Amelia) doing amazing work to begin to normalize SVG and CSS, the working group effectively disbanded for years with very little investment from vendors.

MathML

From its integration in HTML5 until today, almost none of the work done in browsers has been by the browser vendors themselves. Google is the only vendor who has even joined the working group, and not so much to participate as an org as much as to allow someone interested on their own to participate.

Web Speech

Google and others were so excited to do this back in (checks watch)... 2012. But they did it too early, and in a community group. It's not even a Recommendation Track thing. I can easily see an argument to be made that this is the result of things swinging pretty far in the other direction - this is more than a decade after W3C had put together the W3C Speech Interface Framework with lots of XML. But meanwhile there are simple and obvious bugs and improvements that can and should be made - there is lots to be rethought here and very little invested from then till now.

The "wish list"

There is a long list of things that we, as a larger community, aren't investing in in the sense of wider participation and real funding from browsers, but I think we should... Here are a few of my top ones:

Study of the web (and ability to)

The HTTP Archive and Chrome Status are about the best tools we have, but they're again mainly Google, and even other data sources are biased and incomplete. Until 2019 the study of elements on the web was just running a regexp on home pages in the archive. Until just a year or two ago our study of CSS was kind of similar. It just feels like we should have more here.

Polyfill ability for CSS

A number of us have been saying this for a long time. Even some small things could go a long way here (like, just really exposing what's parsed). After a lot of efforts we got Houdini, which should have helped answer a lot of this. It fizzled out after choosing probably the least interesting first project in my opinion. I don't know that we were looking at it just right, or that we would have done the right things - but I know that not really investing in trying isn't going to get it done either. To be really honest, I'd like a more perfect polyfill story for HTML as well. Once upon a time there was discussion down that road, but when <std-toast>-gate happened, all of the connected discussions died along with it. That's a shame. We are getting there slowly with some important things like custom media queries and so on, but a lot of these things we were starting to pitch a decade ago.

Protocols

The web has thus far been built a very particular way - but there are many ideas (distributed web ideas, for example) which it's very hard for the web to currently adapt toward because it's full of hard problems that really need involvement from vendors. I'd love to see many of those ideas really have an opportunity to take off, but I don't see good evolutionary paths to allow something to really do that. We had some earlier ideas like protocol handlers and content handlers for how this might work. Unfortunately content handlers were removed, and protocol handlers are extremely limited and incomplete. Trying to imagine how a distributed web could work is pretty difficult with the tools we have. Perhaps part of this is related to other items on my list, like powerful features or monetization.

"Web Views" / "Powerful features"

A ton of stuff is built with web technology as apps to get around some of the things that are currently difficult security-wise or privacy-wise in the browser itself. Maybe that's how it should be, maybe it isn't. I'm not here to say "ship all the fugu stuff" or something, but it definitely seems silly that there aren't some efforts to even think "above" the browser engines and standardize some APIs, a bit in the vein of what is now the Winter TC. What people are doing today doesn't seem better. I guess there is a common theme here: I'd like to really invest in finding better ways to let the platform evolve a bit on its own and then pick it up and run with it.

"monetization"

I mean, this is a really tough one for so many reasons, both technical and political, but I just don't see a thing that could have a bigger impact than a way to pay creators that isn't limited to ads, and a way to fund new ideas. It just seems at the very core of a lot of things. I put it in quotes because I don't mean specifically the proposal called web monetization. There are lots of other ideas and a wide range of attempts happening; some of them seem less directly like money and more like ways to express licensing agreements or earn discounts.

Maps

We seem to have mostly just written off maps entirely as something which you just rely on Google Maps or Apple Maps for. That's a shame because there has been interest at several levels - there was a joint OGC/W3C workshop a few years ago, and many ideas. Almost all of them would benefit more than just those few proprietary map systems. There are even simple primitive ideas like adding the concept of pan and zoom to the platform, maybe in CSS. Surely we can do better than where things are right now, but who is going to invest to get it there?

There's a long list

There are way more things we could list here... Drag and drop needs work and improvements. Editing (see Contenteditable/execCommand/EditContext) is terribly hard. Given the importance, you'd think it would be one of the bigger areas of investment, but it's not really. Hit testing is a big area that needs defining. I mean, you can see that this year we got 134 focus area proposals for Interop 2026. Those aren't all areas that are under-invested in, exactly, but whatever we choose to focus on there is time and budget we can't spend on the things in this list...

In the past, I might have said documentation, but I feel like we're just doing a lot better with that. We also now have the collectively funded, transparent and independent openwebdocs.org, which Igalia has helped fund since its inception and which, to my mind, is one of the most positive things. So many things on this list could take a similar approach. It would be great to see.

October 01, 2025 04:00 AM

September 29, 2025

Igalia WebKit Team

WebKit Igalia Periodical #40

Update on what happened in WebKit in the week from September 22 to September 29.

Lots of news this week! We've got a performance improvement in the Vector implementation, a fix that makes an SVG attribute work the same as its HTML counterpart, and further advancements on WebExtension support. We also saw an update to WPE Android, WebXR support in WPE Android, the ability of the test infrastructure to run WebXR tests, and a rather comprehensive blog post about the performance considerations of WPE WebKit with regards to the DOM tree.

Cross-Port 🐱

Performance of Vector copies was improved across the board, and especially for MSE use cases.

Fixed the SVG <a> rel attribute to work the same as HTML <a>'s.

Work on WebExtension support continues with more Objective-C converted to C++, which allows all WebKit ports to reuse the same utility code.

Added handling of the visibilityState value for inline WebXR sessions.

Graphics 🖼️

WPE now supports importing pixels from non-linear DMABuf formats since commit 300687@main. This will help the work to make WPE take screenshots from the UIProcess (WIP) instead of from the WebProcess, so they better match what's actually shown on the screen.

Added support for the WebXR passthroughFullyObscured rendering hint when using the OpenXR backend.

WPE WebKit 📟

WPE Platform API 🧩

New, modern platform API that supersedes usage of libwpe and WPE backends.

The build system will now compile WPEPlatform with warning-as-errors in developer builds. This helps catch potential programming errors earlier.

WPE Android 🤖

Adaptation of WPE WebKit targeting the Android operating system.

WPE-Android is being updated to use WPE WebKit 2.50.0. As usual, the ready-to-use packages will arrive in a few days to the Maven Central repository.

Added support to run WebXR content on Android, by using AHardwareBuffer to share graphics buffers between the main process and the content rendering process. This required coordination to make the WPE-Android runtime glue expose the current JavaVM and Activity in a way that WebKit could then use to initialize the OpenXR platform bindings.

Community & Events 🤝

Paweł Lampe has published in his blog the first post in a series about different aspects of Web engines that affect performance, with a focus on WPE WebKit and interesting comparisons between desktop-class hardware and embedded devices. This first article analyzes how “idle” nodes in the DOM tree render measurable effects on performance (pun intended).

Infrastructure 🏗️

The test infrastructure can now run API tests that need WebXR support, by using a dummy OpenXR compositor provided by the Monado runtime, along with the first tests and an additional one that make use of this.

That’s all for this week!

by Igalia WebKit Team at September 29, 2025 08:34 PM

Alicia Boya

Getting perf to work on ARM32 Linux: Part 2, the ISAs

Welcome to the second part in this series on how to get perf to work on ARM32. If you just arrived here and want to know what is perf and why it would be useful, refer to Part 1—it is very brief. If you’re already familiar with perf, you can skip it.

To put it bluntly, ARM32 is a bit of a mess. Navigating this mess is a significant part of the difficulty in getting perf working. This post will focus on one of these messy parts: the ISAs, plural.

The ISA (Instruction Set Architecture) of a CPU defines the set of instructions and registers available, as well as how they are encoded in machine code. ARM32 CPUs generally have not one but two coexisting ISAs: ARM and Thumb, with significant differences between each other.

Unlike, let’s say, 32-bit x86 and 64-bit x86 executables running in the same operating system, ARM and Thumb can and often do coexist in the same process and have different sets of instructions and—to a certain extent—registers available, all while targeting the same hardware, and with neither ISA being meant as a replacement for the other.

If you’re interested in this series as a tutorial, you can probably skip this one. If, on the other hand, you want to understand these concepts better for when they inevitably pop up in your troubleshooting—like they did in mine—keep reading. This post will explain some consequential features of both ARM and Thumb, and how they are used in Linux.

I highly recommend having a look at old ARM manuals while following this post. As often happens with ISAs, old manuals are much more compact and easier to follow than the current versions, making them a good choice for grasping the fundamentals. They often also have better diagrams, which were only possible when the CPUs were simpler—the manuals for the ARM7TDMI (a very popular ARMv4T design for microcontrollers from the late 90s) are particularly helpful for introducing the architecture.

Some notable features of the ARM ISA

(Recommended introductory reference: ARM7TDMI Manual (1995), Part 4: ARM Instruction Set. 64 pages, including examples.)

The ARM ISA has a fixed instruction size of 32 bits.

A notable feature of it is that the 4 most significant bits of each instruction contain a condition code. When you see mov.ge in assembly for ARM, that is the regular mov instruction with the condition code 1010 (GE: Greater or Equal). The condition code 1110 (AL: Always) is used for non-conditional instructions.

ARM has 16 directly addressable registers, named r0 to r15. Instructions use 4-bit fields to refer to them.

The ABIs give specific purposes to several registers, but as far as the CPU itself goes, there are very few special registers:

  • r15 is the Program Counter (PC): it contains the address of the instruction about to be executed.
  • r14 is meant to be used as Link Register (LR)—it contains the address a function will jump to on return.
    This is used by the bl (Branch with link) instruction, which before branching, will also update r14 (lr) with the value of r15 (pc), and is the main instruction used for function calls in ARM.

All calling conventions I’m aware of use r13 as a full-descending stack. “Full stack” means that the register points to the last item pushed, rather than to the address that will be used by the next push (“open stack”). “Descending stack” means that as items are pushed, the address in the stack register decreases, as opposed to increasing (“ascending stack”). This is the same type of stack used in x86.

The ARM ISA does not make assumptions about what type of stack programs use or what register is used for it, however. For stack manipulation, ARM has a Store Multiple (stm)/Load Multiple (ldm) instruction, which accepts any register as “stack register” and has flags for whether the stack is full or open, ascending or descending and whether the stack register should be updated at all (“writeback”). The “multiple” in the name comes from the fact that instead of having a single register argument, it operates on a 16 bit field representing all 16 registers. It will load or store all set registers, with lower index registers matched to lower addresses in the stack.

push and pop are assembler aliases for stmfd r13! (Store Multiple Full-Descending on r13 with writeback) and ldmfd r13! (Load Multiple Full-Descending on r13 with writeback) respectively—the exclamation mark means writeback in ARM assembly code.

Some notable features of the Thumb ISA

(Recommended introductory reference: ARM7TDMI Manual (1995), Part 5: Thumb Instruction Set. 47 pages, including examples.)

The Thumb-1 ISA has a fixed instruction size of 16 bits. This is meant to reduce code size, improve cache performance and make ARM32 competitive in applications previously reserved for 16-bit processors. Registers are still 32 bit in size.

As you can imagine, having a fixed 16 bit size for instructions greatly limits what functionality is available: Thumb instructions generally have an ARM counterpart, but often not the other way around.

Most instructions—with the notable exception of the branch instruction—lack condition codes. In this regard, it works much more like x86.

The vast majority of instructions only have space for 3 bits for indexing registers. This effectively means Thumb has only 8 registers—so-called low registers—available to most instructions. The remaining registers—referred to as high registers—are only available in special encodings of a few select instructions.

Store Multiple (stm)/Load Multiple (ldm) is largely replaced by push and pop, which here is not an alias but an actual ISA instruction and can only operate on low registers and—as a special case—can push LR and pop PC. The only stack supported is full-descending on r13, and writeback is always performed.

A limited form of Store Multiple (stm)/Load Multiple (ldm) with support for arbitrary low register as base is available, but it can only load/store low registers, writeback is still mandatory, and it only supports one addressing mode (“increment after”). This is not meant for stack manipulation, but for writing several registers to/from memory at once.

Switching between ARM and Thumb

(Recommended reading: ARM7TDMI Manual (1995), Part 2: Programmer’s Model. 3.2 Switching State. It’s just a few paragraphs.)

Instruction addresses in ARM must be 32-bit aligned. Conveniently, this leaves the two least significant bits of code addresses free to be used as flags, and ARM CPUs make use of this.

When branching with the bx (Branch with exchange) instruction, the least significant bit of the register holding the branch address indicates whether the CPU should switch after the jump to ARM mode (0) or Thumb mode (1).

It’s important to note that this bit in the address is just a flag: Thumb instructions lie at even addresses in memory.

As a result, ARM and Thumb code can coexist in the same program, and applications can use libraries compiled in either mode. This is far from an esoteric feature; as an example, buildroot always compiles glibc in ARM mode, even if Thumb is used for the rest of the system.

The Thumb-2 extension

(Recommended reference: ARM Architecture Reference Manual: Thumb-2 Supplement (2005)—This one is already much longer, but it’s nevertheless the documentation for when Thumb-2 was introduced)

Thumb-2 is an extension of the original Thumb ISA. Instructions are no longer fixed 16 bits in size, but instead instructions have variable size (16 or 32 bits).

This makes it possible to reintroduce a lot of functionality that was previously missing in Thumb, while only paying the increased code size for the instructions that require it. For instance, push can now save high registers, but it becomes a 32-bit instruction when doing so.

Just like in Thumb-1, most instructions still lack condition codes. Instead, Thumb-2 introduces a different mechanism for making instructions conditional: the If-Then (it) instruction. it receives a 4 bit condition code (same as in ARM) and a clever 4 bit “mask”. The it instruction makes execution of the following up to 4 instructions conditional on either the condition or its negation. The first instruction is never negated.

An “IT block” is the sequence of instructions made conditional by a previous it instruction.

For instance, the 16-bit instruction ittet ge means: make the next 2 instructions conditional on “greater or equal”, the following instruction conditional on “less than (i.e. not greater or equal)”, and the following instruction conditional on “greater or equal”. ite eq would make the following instruction be conditional on “equal” and the following instruction conditional on “not equal”.

The IT block deprecation mess: Some documentation pages of ARM will state that it instructions followed by 32-bit instructions, or by more than one instruction, are deprecated. According to clang commits from 2022, this decision has since been reverted. The current (2025) version of the ARM reference manual for the A series of ARM CPUs remains vague about this, claiming “Many uses of the IT instruction are deprecated for performance reasons” but not claiming any specific use as deprecated on that same page. Next time you see gcc or GNU Assembler complaining about a certain IT block being “performance deprecated”, this is what that is about.

Assembly code compatibility

Assemblers try to keep ARM and Thumb mutually interchangeable where possible, so that it’s possible to write assembly code that can be assembled as either, as long as you restrict your code to instructions available in both—something much more feasible since Thumb-2.

For instance, you can still use it instructions in code you assemble as ARM. The assembler will do some checks to make sure your IT block would behave the same in Thumb as the equivalent ARM conditional instructions, and then ignore it. Conversely, instructions inside an IT block need to be tagged with the right condition code for the assembler not to complain, even though those conditions are stripped when producing Thumb.

What determines if code gets compiled as ARM or Thumb

If you try to use a buildroot environment, one of the settings you can tweak (Target options/ARM instruction set) is whether ARM or Thumb-2 should be used as default.

When you build gcc from source one of the options you can pass to ./configure is --with-mode=arm (or similarly, --with-mode=thumb). This determines which one is used by default—that is, if the gcc command line does not specify either. In buildroot, when “Toolchain/Toolchain type” is configured to use “Buildroot toolchain”, buildroot builds its own gcc and uses this option.

To specify which ISA to use for a particular file you can use the gcc flags -marm or -mthumb. In buildroot, when “Toolchain/Toolchain type” is configured to use “External toolchain”—in which case the compiler is not compiled from source—either of these flags is added to CFLAGS as a way to make it the default for packages built with buildroot scripts.

The mode can also be overridden on a per-function basis with __attribute__((target("thumb"))). This is not very common, however.
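
As a quick, hypothetical illustration (the file name is made up, and the toolchain prefix is the buildroot one that appears later in this post; adjust it for your setup), you can compile the same file both ways and compare the resulting code size:

$ p=arm-buildroot-linux-gnueabihf-   # cross toolchain prefix; yours may differ
$ "${p}gcc" -O2 -marm   -c foo.c -o foo-arm.o
$ "${p}gcc" -O2 -mthumb -c foo.c -o foo-thumb.o
$ "${p}size" foo-arm.o foo-thumb.o   # the text column is usually smaller for the Thumb build

This is just a sketch, but it is a handy sanity check that your flags are actually taking effect before moving on to whole-system builds.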

GNU Assembler and ARM vs Thumb

In GNU Assembler, ARM or Thumb is selected with the .arm or .thumb directives respectively—alternatively, .code 32 and .code 16 respectively have the same effect.

Each function that starts with Thumb code must be prefaced with the .thumb_func directive. This is necessary so that the symbol for the function includes the Thumb bit, and therefore branching to the function is done in the correct mode.
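
As a minimal sketch (the file name and symbol name are invented for illustration, and $p is the cross toolchain prefix, following the same convention as the readelf helpers further down), a Thumb function would look like this:

$ cat > thumb-demo.s << 'EOF'
    .syntax unified
    .thumb
    .thumb_func          @ mark the next symbol as a Thumb function
    .globl  do_nothing
do_nothing:
    bx      lr
EOF
$ "${p:-}as" -o thumb-demo.o thumb-demo.s
$ "${p:-}readelf" --syms thumb-demo.o    # do_nothing should show an odd value, i.e. the Thumb bit set

Leaving out .thumb_func should make the symbol lose the Thumb bit, which is exactly the situation that leads to branching into the function in the wrong mode.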

ELF object files

There are several ways ELF files can encode the mode of a function, but the most common and most reliable is to check the addresses of the symbols. ELF files use the same “lowest address bit means Thumb” convention as the CPU.

Unfortunately, while tools like objdump need to figure out the mode of functions in order to e.g. disassemble them correctly, I have not found any high-level flag in either objdump or readelf to query this information. Instead, here are a couple of Bash one-liners using readelf.

syms_arm() { "${p:-}readelf" --syms --wide "$@" |grep -E '^\s*[[:digit:]]+: [0-9a-f]*[02468ace]\s+\S+\s+(FUNC|IFUNC)\s+'; }
syms_thumb() { "${p:-}readelf" --syms --wide "$@" |grep -E '^\s*[[:digit:]]+: [0-9a-f]*[13579bdf]\s+\S+\s+(FUNC|IFUNC)\s+|THUMB_FUNC'; }
  1. The regular expression matches on the parity of the address.
  2. $p is an optional variable I assign to my compiler prefix (e.g. /br/output/host/bin/arm-buildroot-linux-gnueabihf-).
    Note however that since the above commands just use readelf, they will work even without a cross-compiling toolchain.
  3. THUMB_FUNC is written by readelf when a symbol has type STT_ARM_TFUNC. This is another mechanism I’m aware of that object files can use for marking functions as Thumb, so I’ve included it for completeness; but I have not found any usages of it in the wild.

If you’re building or assembling debug symbols, ranges of ARM and Thumb code are also marked with $a and $t symbols respectively. You can see them with readelf --syms. This has the advantage—at least in theory—of being able to work even in the presence of ARM and Thumb mixed in the same function.
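
As a hypothetical usage example (the binary path is made up), the helpers above and the mapping symbols can be inspected like this:

$ syms_thumb /tmp/rootfs/usr/bin/some-app | head   # FUNC symbols with the Thumb bit set
$ syms_arm   /tmp/rootfs/usr/bin/some-app | head   # FUNC symbols at even addresses, i.e. ARM
$ "${p:-}readelf" --syms --wide /tmp/rootfs/usr/bin/some-app | grep -E '\$[at]'   # $a/$t mapping symbols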

Closing remarks

I hope someone else finds this mini-introduction to ARM32 useful. Now that we have an understanding of the ARM ISAs, in the next part we will go one layer higher and discuss the ABIs (plural again, tragically!)—that is, what expectations functions have of each other as they call one another.

In particular, we are interested in how the different ABIs handle—or not—frame pointers, which we will need in order for perf to do sampling profiling of large applications on low end devices with acceptable performance.

by aboya at September 29, 2025 10:42 AM

September 26, 2025

Pawel Lampe

WPE performance considerations: DOM tree

Designing performant web applications is not trivial in general. Nowadays, as many companies decide to use the web platform on embedded devices, the problem of designing performant web applications becomes even more complicated. Typical embedded devices are orders of magnitude slower than desktop-class ones. Moreover, the proportion between CPU and GPU power is commonly different as well. This usually results in unexpected performance bottlenecks when web applications designed with desktop-class devices in mind are executed in embedded environments.

In order to help web developers approach the difficulties that the usage of the web platform on embedded devices may bring, this blog post initiates a series of articles covering various performance-related aspects in the context of WPE WebKit usage on embedded devices. The coverage in general will include:

  • introducing the demo web applications dedicated to showcasing use cases of a given aspect,
  • benchmarking and profiling the WPE WebKit performance using the above demos,
  • discussing the causes for the performance measured,
  • inferring some general pieces of advice and rules of thumb based on the results.

This article, in particular, discusses the overhead of nodes in the DOM tree when it comes to layout. It does that primarily by investigating the impact of idle nodes, which introduce the least overhead and hence may serve as a lower bound for any general considerations. With the data presented in this article, it should be clear how the DOM tree size/depth scales in the case of embedded devices.

DOM tree #

Historically, the DOM trees emerging from the usual web page designs were rather limited in size and fairly shallow. This was the case as there were no reasons for them to be excessively large unless the web page itself had a very complex UI. Nowadays, not only are the DOM trees much bigger and deeper, but they also tend to contain idle nodes that artificially increase the size/depth of the tree. The idle nodes are the nodes in the DOM that are active yet do not contribute to any visual effects. Such nodes are usually a side effect of using various frameworks and approaches that conceptualize components or services as nodes, which then participate in various kinds of processing utilizing JavaScript. Other than idle nodes, the DOM trees are usually bigger and deeper nowadays, as there are simply more possibilities that emerged with the introduction of modern APIs such as Shadow DOM, Anchor positioning, Popover, and the like.

In the context of web platform usage on embedded devices, the natural consequence of the above is that web designers require more knowledge on how a particular browser's performance scales with the DOM tree size and shape. Before considering embedded devices, however, it’s worth taking a brief look at how various web engines scale on desktop with the DOM tree growing in depth.

Desktop considerations #

To measure the impact of the DOM tree depth on the performance, the random-number-changing-in-the-tree.html?vr=0&ms=1&dv=0&ns=0 demo can be used to perform a series of experiments with different parameters.

In short, the above demo measures the average duration of a benchmark function run, where the run does the following:

  • changes the text of a single DOM element to a random number,
  • forces a full tree layout.

Moreover, the demo allows one to set 0 or more parent idle nodes for the node holding text, so that the layout must consider those idle nodes as well.

The parameters used in the URL above mean the following:

  • vr=0 — the results are reported to the console. Alternatively (vr=1), at the end of benchmarking (~23 seconds), the result appears on the web page itself.
  • ms=1 — the results are reported in “milliseconds per run”. Alternatively (ms=0), “runs per second” are reported instead.
  • dv=0 — the idle nodes are using <span> tag. Alternatively, (dv=1) <div> tag is used instead.
  • ns=N — the N idle nodes are added.

The idea behind the experiment is to check how much overhead is added as the number of extra idle nodes (ns=N) in the DOM tree increases. Since the browsers used in the experiments are not fair to compare due to various reasons, instead of concrete numbers in milliseconds, the results are presented in relative terms for each browser separately. It means that the benchmarking result for ns=0 serves as a baseline, and other results show the relative duration increase to that baseline result, where, e.g. a 300% increase means 3 times the baseline duration.

The results for a few mainstream browsers/browser engines (WebKit GTK MiniBrowser [09.09.2025], Chromium 140.0.7339.127, and Firefox 142.0) and a few experimental ones (Servo [04.07.2024] and Ladybird [30.06.2024]) are presented in the image below:

Idle nodes overhead on mainstream browsers.

As the results show, trends among all the browsers are very close to linear. It means that the overhead is very easy to assess, as usually N times more idle nodes will result in N times the overhead. Moreover, up until 100-200 extra idle nodes in the tree, the overhead trends are very similar in all the browsers except for experimental Ladybird. That in turn means that even for big web applications, it’s safe to assume the overhead among the browsers will be very much the same. Finally, past the 200 extra idle nodes threshold, the overhead across browsers diverges. It’s very likely due to the fact that the browsers are not optimizing such cases as a result of a lack of real-world use cases.

All in all, the conclusion is that on desktop, only very large / specific web applications should be cautious about the overhead of nodes, as modern web browsers/engines are very well optimized for handling substantial amounts of nodes in the DOM.

Embedded device considerations #

When it comes to the embedded devices, the above conclusions are no longer applicable. To demonstrate that, a minimal browser utilizing WPE WebKit is used to run the demo from the previous section both on desktop and NXP i.MX8M Plus platforms. The latter is a popular choice for embedded applications as it has quite an interesting set of features while still having strong specifications, which may be compared to those of Raspberry Pi 5. The results are presented in the image below:

Idle nodes overhead compared between desktop and embedded devices.

This time, the Y axis presents the duration (in milliseconds) of a single benchmark run, and hence makes it very easy to reason about overhead. As the results show, in the case of the desktop, 100 extra idle nodes in the DOM introduce barely noticeable overhead. On the other hand, on an embedded platform, even without any extra idle nodes, the time to change and layout the text is already taking around 0.6 ms. With 10 extra idle nodes, this duration increases to 0.75 ms — thus yielding 0.15 ms overhead. With 100 extra idle nodes, such overhead grows to 1.3 ms.

One may argue whether 1.3 ms is much, but considering an application that e.g. does 60 FPS rendering, the time at the application's disposal each frame is below 16.67 ms, and 1.3 ms is ~8% of that, thus being very considerable. Similarly, for the application to be perceived as responsive, the input-to-output latency should usually be under 20 ms. Again, 1.3 ms is a significant overhead for such a scenario.

Given the above, it’s safe to state that 20 extra idle nodes should be considered the safe maximum for embedded devices in general. In the case of low-end embedded devices, i.e. ones comparable to Raspberry Pi 1 and 2, the maximum should be even lower, but proper benchmarking is required to come up with concrete numbers.

Inline vs block #

While the previous subsection demonstrated that on embedded devices, adding extra idle nodes as parents must usually be done in a responsible way, it’s worth examining if there are nuances that need to be considered as well.

The first matter that one may wonder about is whether there’s any difference between the overhead of idle nodes being inlines (display: inline) or blocks (display: block). The intuition here may be that, as idle nodes have no visual impact on anything, the overhead should be similar.

To verify the above, the demo from Desktop considerations section can be used with dv parameter used to control whether extra idle nodes should be blocks (1, <div>) or inlines (0, <span>). The results from such experiments — again, executed on NXP i.MX8M Plus — are presented in the image below:

Comparison of overhead of idle nodes being inline or block elements.

While in the safe range of 0-20 extra idle nodes the results are very much similar, it’s evident that in general, the idle nodes of block type are actually introducing more overhead.

The reason for the above is that, for layout purposes, the handling of inline and block elements is very different. The inline elements sharing the same line can be thought of as being flattened within the so-called line box tree. The block elements, on the other hand, have to be represented in a tree.

To show the above visually, it’s interesting to compare sysprof flamegraphs of WPE WebProcess from the scenarios comprising 20 idle nodes and using either <span> or <div> for idle nodes:

idle <span> nodes:
Sysprof flamegraph of WPE WebProcess layouting inline elements.
idle <div> nodes:
Sysprof flamegraph of WPE WebProcess layouting block elements.

The first flamegraph proves that there’s no clear dependency between the call stack and the number of idle nodes. The second one, on the other hand, shows exactly the opposite — each of the extra idle nodes is visible as adding extra calls. Moreover, each of the extra idle block nodes adds some overhead thus making the flamegraph have a pyramidal shape.

Whitespaces #

Another nuance worth exploring is the overhead of text nodes created because of whitespaces.

When the DOM tree is created from the HTML, usually a lot of text nodes are created just because of whitespaces. It’s because the HTML usually looks like:

<span>
<span>
(...)
</span>
</span>

rather than:

<span><span>(...)</span></span>

which makes sense from the readability point of view. From the performance point of view, however, more text nodes naturally mean more overhead. When such redundant text nodes are combined with idle nodes, the net outcome may be that with each extra idle node, some overhead will be added.

To verify the above hypothesis, a demo similar to the one above can be used alongside it to perform a series of experiments comparing the approach with and without redundant whitespace: random-number-changing-in-the-tree-w-whitespaces.html?vr=0&ms=1&dv=0&ns=0. The only difference between the demos is that the w-whitespaces one creates the DOM tree with artificial whitespace, simulating as if it had been written as a formatted document. The comparison results from the experiments run on NXP i.MX8M Plus are presented in the image below:

Overhead of redundant whitespace nodes.

As the numbers suggest, the overhead of redundant text nodes is rather small on a per-idle-node basis. However, as the number of idle nodes scales, so does the overhead. Around 100 extra idle nodes, the overhead is noticeable already. Therefore, a natural conclusion is that the redundant text nodes should rather be avoided — especially as the number of nodes in the tree becomes significant.

Parents vs siblings #

The last topic that deserves a closer look is whether adding idle nodes as siblings is better than adding them as parent nodes. In theory, having extra nodes added as siblings should be better, as the layout engine will have to consider them, yet it won’t mark them with a dirty flag and hence won’t have to lay them out.

As in other cases, the above can be examined using a series of experiments run on NXP i.MX8M Plus using the demo from Desktop considerations section and comparing against either random-number-changing-before-siblings.html?vr=0&ms=1&dv=0&ns=0 or random-number-changing-after-siblings.html?vr=0&ms=1&dv=0&ns=0 demo. As both of those yield similar results, any of them can be used. The results of the comparison are depicted in the image below:

Overhead of idle nodes added as parents vs as siblings.

The experiment results corroborate the theoretical considerations made above — idle nodes added as siblings indeed introduce less layout overhead. The savings are not very large from a single idle node perspective, but at a larger scale they become significant enough to justify DOM tree re-organization (if possible).

Conclusions #

The above experiments mostly emphasized the idle nodes; however, the results can be extrapolated to regular nodes in the DOM tree. With that in mind, the overall conclusion from the experiments done in the former sections is that DOM tree size and shape have a measurable impact on web application performance on embedded devices. Therefore, web developers should try to optimize them as early as possible and follow the general rules of thumb that can be derived from this article:

  1. Nodes are not free, so they should always be added with extra care.
  2. Idle nodes should be limited to ~20 on mid-end and ~10 on low-end embedded devices.
  3. Idle nodes should be inline elements, not block ones.
  4. Redundant whitespaces should be avoided — especially with idle nodes.
  5. Nodes (especially idle ones) should be added as siblings.

Although the above serves as great guidance, for better results, it’s recommended to do the proper browser benchmarking on a given target embedded device — as long as it’s feasible.

Also, the above set of rules is not recommended to follow on desktop-class devices, as in that case, it can be considered a premature optimization. Unless the particular web application yields an exceptionally large DOM tree, the gains won’t be worth the time spent optimizing.

September 26, 2025 12:00 AM

September 22, 2025

Igalia WebKit Team

WebKit Igalia Periodical #39

Update on what happened in WebKit in the week from September 15 to September 22.

The first release in a new stable series is now out! And despite that, the work continues on WebXR, multimedia reliability, and WebExtensions support.

Cross-Port 🐱

Fixed running WebXR tests in the WebKit build infrastructure, and made a few more of them run. This both increases the amount of WebXR code covered during test runs, and helps prevent regressions in the future.

As part of the ongoing work to get WebExtensions support in the GTK and WPE WebKit ports, a number of classes have been converted from Objective-C to C++, in order to share their functionality among all ports.

Multimedia 🎥

GStreamer-based multimedia support for WebKit, including (but not limited to) playback, capture, WebAudio, WebCodecs, and WebRTC.

A number of multimedia-related memory leaks have been plugged. These have been found thanks to the GStreamer leak tracer.

Releases 📦️

WebKitGTK 2.50.0 and WPE WebKit 2.50.0 are now available. These are the first releases of a new stable series, and are the result of the last six months of work. This development cycle focused on rendering performance improvements, improved support for font features, and more. New public API has been added to obtain the theme color declared by Web pages.

For those who need longer to integrate newer releases, which we know can be a lengthy process when targeting embedded devices, we have also published WPE WebKit 2.48.7 with a few stability and security fixes.

Accompanying these releases there is security advisory WSA-2025-0006 (GTK, WPE), with information about solved security issues. As usual, we encourage everybody to use the most recent versions where such issues are known to be fixed.

Bug reports are always welcome at the WebKit Bugzilla.

That’s all for this week!

by Igalia WebKit Team at September 22, 2025 11:12 PM