oci: Parallelize object storage within tar layers #208
jeckersb merged 2 commits into composefs:main
Conversation
520f9b8 to 5fe6e70
OK...so...it's hard to overstate just HOW MUCH BETTER this makes things. I need to inject some precise timings, but before, a fresh install was measured in minutes; now it's measured in seconds. The "writing files" phase takes about 10s here, which is SO SO SO much better than before.
5fe6e70 to 1910366
Add a benchmark test to measure tar archive splitting performance. This creates a 2GB tar archive (10,000 × 200KB files) and measures the time to process it through the split_async/Repository pipeline.

Run with: cargo test --release --lib -p composefs-oci bench_tar_split -- --ignored --nocapture

Assisted-by: OpenCode (Sonnet 4)
Signed-off-by: Colin Walters <walters@verbum.org>
1910366 to 5f1cce7
I need to spend some more hands-on time with this, but at the eyeball level this looks awesome! Definitely want to A/B bootc installs with/without to compare, probably a job for next week at this point though.
allisonkarlitskaya left a comment
Very happy to finally see a working patch for this.
Definitely mark this as fixes #62
My main concern is the O_TMPFILE thing here: we've discussed before that it would be nice to avoid streaming files to disk if we already have that data.... do we know how much of the win is attributed to O_TMPFILE and how much is from not blocking on the write?
Also: I'm kinda curious about why this is such a performance increase and I wonder if it's being conflated with c3677f1
In particular: the reason this issue has been sitting around for so long with nobody paying attention to it is that I kinda assumed that with the ability to parallelize across a large number of layers (bootc does layer splitting, right?), the need to parallelize within a single layer seemed less important...
One thing I'm very happy to see, and I think it's a very important part of this: the shared semaphore idea is really really good. Thanks for that.
This is a very good observation; because the fsync change landed in the same PR as an API break, it's a bit more work to do a precise apples-to-apples comparison. I'll look at this though.
But I think we also don't want to load potentially large files into memory either. Remember, writeback is a thing: the O_TMPFILE is still just writing to memory (page cache) too, but it has the advantage that when we do want to persist, it's ready to go into the filesystem and doesn't need another copy. Currently I would say that when OCI is used in a sane (split-reproducible) way, there are relatively few objects shared across different layers. I think we don't need to optimize for the already-extant object case now - but if we decided to later, it would make sense as a followup; this isn't conflicting.
The https://docs.fedoraproject.org/en-US/bootc/building-from-scratch/ flow results in a single large layer by default right now (and is my main motivation for this), but yes there's some work on a new generic rechunker.
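As an aside, the shared-semaphore idea discussed above can be sketched in plain std Rust. This is an illustrative sketch only: the actual PR uses an async semaphore held on the Repository, while `Semaphore` and `process_limited` below are invented names demonstrating the same backpressure pattern with threads.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Minimal counting semaphore built from Mutex + Condvar
/// (stand-in for the async semaphore used in the real code).
struct Semaphore {
    permits: Mutex<usize>,
    cvar: Condvar,
}

impl Semaphore {
    fn new(permits: usize) -> Self {
        Semaphore { permits: Mutex::new(permits), cvar: Condvar::new() }
    }
    fn acquire(&self) {
        let mut n = self.permits.lock().unwrap();
        while *n == 0 {
            n = self.cvar.wait(n).unwrap();
        }
        *n -= 1;
    }
    fn release(&self) {
        *self.permits.lock().unwrap() += 1;
        self.cvar.notify_one();
    }
}

/// Process `jobs` on worker threads, but never more than `limit` at once.
fn process_limited(jobs: Vec<u64>, limit: usize) -> u64 {
    let sem = Arc::new(Semaphore::new(limit));
    let handles: Vec<_> = jobs
        .into_iter()
        .map(|job| {
            let sem = Arc::clone(&sem);
            thread::spawn(move || {
                sem.acquire();        // backpressure: wait for a permit
                let result = job * 2; // stand-in for "store one object"
                sem.release();
                result
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn main() {
    // 2+4+...+20 = 110, regardless of the concurrency limit
    println!("{}", process_limited((1..=10).collect(), 4));
}
```

Because the semaphore is shared, the same permit pool can bound total in-flight work across multiple layers being imported at once, which is what makes the shared (rather than per-layer) placement important.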
Add parallel object storage when processing tar archives. Large files are
streamed to O_TMPFILE via a channel, and fs-verity digests are computed in
background blocking tasks.
Key components:
- FsVerityHasher: Made public for incremental digest computation from files
on disk without loading them entirely into memory. Added BLOCK_SIZE const.
- Repository:
- create_object_tmpfile(): Create O_TMPFILE for streaming writes
- spawn_finalize_object_tmpfile(): Spawn blocking task to enable verity
and link into objects directory
- finalize_object_tmpfile(): Sync implementation that enables verity
(letting the kernel compute the digest) then measures the digest
- compute_verity_digest(): Userspace fallback for insecure mode
- SplitStreamBuilder: New type that accumulates pending object handles and
resolves ObjectIDs when finalizing the splitstream
- split_async(): Refactored to stream file content through channels to
blocking tasks, with semaphore-based concurrency limiting
Notes from review discussion:
- We're not optimizing for existing objects, as we generally expect
reuse to happen via layers anyways.
- A lot of the performance improvement probably came from avoiding
  fdatasync() per object in the earlier commit, but this still makes
  things 2x faster.
Closes: composefs#62
Assisted-by: OpenCode (Sonnet 4)
Signed-off-by: Colin Walters <walters@verbum.org>
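The channel-to-blocking-task shape this commit message describes can be sketched with plain std channels and threads. Everything here is a stand-in: the real split_async uses async channels, an O_TMPFILE on disk, and a real fs-verity digest, while `toy_digest` and `spawn_object_writer` are invented for illustration.

```rust
use std::sync::mpsc;
use std::thread;

/// Toy digest standing in for fs-verity (FNV-1a, 64-bit).
fn toy_digest(data: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for &b in data {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

/// Spawn a "blocking task" that receives one file's chunks over a channel,
/// accumulates them (stand-in for streaming to an O_TMPFILE), and returns
/// the digest of the complete file once the sender side closes.
fn spawn_object_writer() -> (mpsc::Sender<Vec<u8>>, thread::JoinHandle<u64>) {
    let (tx, rx) = mpsc::channel::<Vec<u8>>();
    let handle = thread::spawn(move || {
        let mut file = Vec::new(); // would be an O_TMPFILE on disk
        for chunk in rx {
            file.extend_from_slice(&chunk);
        }
        toy_digest(&file) // would be the fs-verity digest
    });
    (tx, handle)
}

fn main() {
    // The tar reader streams each large file's content as chunks...
    let (tx, pending) = spawn_object_writer();
    tx.send(b"hello ".to_vec()).unwrap();
    tx.send(b"world".to_vec()).unwrap();
    drop(tx); // end of file: closing the channel finalizes the object

    // ...and the splitstream resolves the object ID when finalizing.
    let digest = pending.join().unwrap();
    assert_eq!(digest, toy_digest(b"hello world"));
    println!("object digest: {digest:016x}");
}
```

The key property is that the tar reader never waits for a digest: it hands chunks to the writer and moves on to the next entry, leaving the handles to be resolved when the splitstream is finalized.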
5f1cce7 to 9a6773d
I had an agent work on this overnight; I didn't do a deep verification, but it smells right to me (and matches your intuition):

Performance Improvement Summary: composefs-rs PR #208
Assisted-by: OpenCode (Opus 4.5)

Benchmark:
| Version | Commit | Import Time | Speedup vs Baseline |
|---|---|---|---|
| Baseline (fdatasync per object) | 54c3c6d | ~580s (~9.7 min) | 1x |
| syncfs optimization | c3677f1 | ~29.4s | ~20x |
| syncfs + parallel | 5f1cce7 + fix | ~16.7s | ~35x |
Root Cause of Baseline Slowness
The baseline called fdatasync() after writing each object file. While this ran inside spawn_blocking (so it didn't block the async runtime directly), the thousands of per-object fdatasync calls forced the filesystem to create a massive number of journal commits, causing severe I/O overhead.
Optimization Breakdown
1. syncfs optimization (Allison Karlitskaya)
- Commit: c3677f1 ("repository: change our sync story")
- Change: Replace per-object fdatasync() with a single syncfs() before linking the splitstream
- Why it helps: Instead of thousands of individual journal commits, there's now one bulk sync at the end
- Time: ~29.4s → ~20x faster
2. Parallel object storage (Colin Walters)
- Commit: 5f1cce7 ("oci: Parallelize object storage within tar layers")
- Change: Process multiple large files concurrently within each tar layer using channels and spawn_blocking tasks with semaphore-based concurrency limiting
- Why it helps: Previously files were processed sequentially (one at a time). Now multiple files stream to tmpfiles in parallel, with fs-verity digests computed concurrently
- Time: ~16.7s → ~2x additional, ~35x total
Bug Fix Required
The parallel implementation had a race condition where forked processes could inherit writable file descriptors, causing enable_verity to fail with "File is opened for writing".
Fix: Changed finalize_object_tmpfile() to use enable_verity_maybe_copy() instead of enable_verity_with_retry(). This creates a copy of the file if verity can't be enabled on the original, providing robustness against fd inheritance races.
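The control flow of that copy fallback can be sketched generically. Everything below is a stand-in: `try_seal` simulates the enable-verity step failing when a writable fd was inherited, and `seal_maybe_copy` mirrors only the shape of enable_verity_maybe_copy, not its real implementation.

```rust
use std::fs;

/// Stand-in for enabling fs-verity: fails if some process still holds the
/// file open for writing (the inherited-fd race described above).
fn try_seal(_path: &str, writable_fd_leaked: bool) -> Result<(), &'static str> {
    if writable_fd_leaked {
        Err("File is opened for writing")
    } else {
        Ok(())
    }
}

/// Try to seal in place; on failure, copy the contents to a fresh file and
/// seal the copy instead. A fresh copy has no inherited writable fds, so
/// sealing it cannot hit the race.
fn seal_maybe_copy(path: &str, writable_fd_leaked: bool) -> std::io::Result<String> {
    if try_seal(path, writable_fd_leaked).is_ok() {
        return Ok(path.to_string());
    }
    let copy_path = format!("{path}.copy");
    fs::copy(path, &copy_path)?;
    try_seal(&copy_path, false).expect("fresh copy is sealable");
    Ok(copy_path)
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("verity-demo.obj");
    fs::write(&path, b"object data")?;
    let p = path.to_str().unwrap();
    // Race hit: falls back to sealing a copy.
    let sealed = seal_maybe_copy(p, true)?;
    assert!(sealed.ends_with(".copy"));
    println!("sealed: {sealed}");
    Ok(())
}
```

The trade-off versus a retry loop is one extra data copy in the rare race case, in exchange for guaranteed forward progress.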
Visual Summary
┌─────────────────────────────────────────────────────────────────┐
│ Composefs Import Performance Improvement │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Baseline ████████████████████ 580s │
│ (fdatasync per object) │
│ │
│ + syncfs optimization █ 29.4s (20x faster) │
│ (single bulk sync) │
│ │
│ + parallel file processing ▌ 16.7s (35x faster) │
│ (concurrent I/O within layers) │
│ │
└─────────────────────────────────────────────────────────────────┘
Test Environment
- Benchmark command: bcvk to-disk --composefs-backend --disk-size=10G localhost/bootc target/bench-disk.img
- Image: CentOS Stream 10 bootc with composefs-sealed uki-sdboot variant
- Multiple runs averaged for each configuration
Glad to see the work we did around getting that right (or at least as right as we can without kernel changes) is paying off!
Add parallel object storage when processing tar archives. Large files are streamed to O_TMPFILE via a channel, and fs-verity digests are computed in background blocking tasks. This avoids blocking the async runtime while allowing multiple files to be processed concurrently.
Key components:
- FsVerityHasher: Made public for incremental digest computation from files on disk without loading them entirely into memory.
- Repository:
- SplitStreamBuilder: Collects inline data and pending external object handles, resolves ObjectIDs at finalization with proper deduplication.
- split_async(): Now takes AsyncBufRead, streams via channel to blocking tasks, uses repository's shared semaphore for backpressure.
For a 2GB tar (10,000 files × 200KB), achieves ~7x speedup: ~980ms → ~140ms (~14 GB/s throughput).
Assisted-by: OpenCode (Opus 4.5)
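The accumulate-then-resolve shape described for SplitStreamBuilder can be sketched with plain threads and a toy digest. The `Entry` type, field names, and digest are invented for illustration; the real builder tracks inline data and async task handles, and resolves real ObjectIDs.

```rust
use std::collections::HashSet;
use std::thread::{self, JoinHandle};

/// One entry in the splitstream: either small data kept inline, or a large
/// file whose object ID is still being computed on a background task.
enum Entry {
    Inline(Vec<u8>),
    Pending(JoinHandle<u64>), // resolves to a (toy) object ID
}

#[derive(Default)]
struct SplitStreamBuilder {
    entries: Vec<Entry>,
}

impl SplitStreamBuilder {
    fn push_inline(&mut self, data: &[u8]) {
        self.entries.push(Entry::Inline(data.to_vec()));
    }

    /// Hand a large file off to a background task and record the handle.
    fn push_external(&mut self, data: Vec<u8>) {
        self.entries.push(Entry::Pending(thread::spawn(move || {
            // stand-in for streaming to O_TMPFILE + fs-verity digest
            data.iter()
                .fold(0u64, |h, &b| h.wrapping_mul(31).wrapping_add(b as u64))
        })));
    }

    /// Resolve all pending object IDs, deduplicating repeated objects.
    fn finalize(self) -> (usize, Vec<u64>) {
        let mut seen = HashSet::new();
        let mut ids = Vec::new();
        let mut inline_bytes = 0;
        for entry in self.entries {
            match entry {
                Entry::Inline(d) => inline_bytes += d.len(),
                Entry::Pending(h) => {
                    let id = h.join().unwrap();
                    if seen.insert(id) {
                        ids.push(id);
                    }
                }
            }
        }
        (inline_bytes, ids)
    }
}

fn main() {
    let mut b = SplitStreamBuilder::default();
    b.push_inline(b"hdr");
    b.push_external(b"bigfile".to_vec()); // same content twice:
    b.push_external(b"bigfile".to_vec()); // deduplicated at finalize
    b.push_external(b"other".to_vec());
    let (inline_bytes, ids) = b.finalize();
    println!("{inline_bytes} inline bytes, {} unique objects", ids.len());
}
```

Deferring resolution to finalize() is what lets the tar reader keep consuming entries while earlier files are still being digested in the background.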