oci: Parallelize object storage within tar layers #208
jeckersb merged 2 commits into composefs:main
Conversation
520f9b8 to 5fe6e70
OK...so...it's hard to overstate just HOW MUCH BETTER this makes things. I need to inject some precise timings, but before, a fresh install was measured in minutes; now it's measured in seconds. The "writing files" phase takes about 10s here, which is SO SO SO much better than before.
5fe6e70 to 1910366
Add a benchmark test to measure tar archive splitting performance. This creates a 2GB tar archive (10,000 × 200KB files) and measures the time to process it through the split_async/Repository pipeline.

Run with: cargo test --release --lib -p composefs-oci bench_tar_split -- --ignored --nocapture

Assisted-by: OpenCode (Sonnet 4)
Signed-off-by: Colin Walters <walters@verbum.org>
1910366 to 5f1cce7
I need to spend some more hands-on time with this, but at the eyeball level this looks awesome! Definitely want to A/B bootc installs with/without to compare, probably a job for next week at this point though.
allisonkarlitskaya left a comment
Very happy to finally see a working patch for this.
Definitely mark this as fixes #62
My main concern is the O_TMPFILE thing here: we've discussed before that it would be nice to avoid streaming files to disk if we already have that data.... do we know how much of the win is attributed to O_TMPFILE and how much is from not blocking on the write?
Also: I'm kinda curious about why this is such a performance increase and I wonder if it's being conflated with c3677f1
In particular: the reason this issue has been sitting around for so long with nobody paying attention to it is that I kinda assumed that with the ability to parallelize across a large number of layers (bootc does layer splitting, right?), the need to parallelize within a single layer seemed less important...
One thing I'm very happy to see, and I think it's a very important part of this: the shared semaphore idea is really really good. Thanks for that.
This is a very good observation; because the fsync change landed in the same PR as an API break, it's a bit more work to do a precise apples-to-apples comparison. I'll look at this though.
But I think we also don't want to load potentially large files into memory either. Remember, writeback is a thing: the O_TMPFILE is still just writing to memory (page cache) too, but it has the advantage that when we do want to persist, it's ready to go into the filesystem and doesn't need another copy. Currently I would say that when OCI is used in a sane (split-reproducible) way, there are relatively few objects shared across different layers. I think we don't need to optimize for the already-extant object case now - but if we decided to later, it would make sense as a followup; this isn't conflicting.
The https://docs.fedoraproject.org/en-US/bootc/building-from-scratch/ flow results in a single large layer by default right now (and is my main motivation for this), but yes there's some work on a new generic rechunker.
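As an aside, the shared-semaphore idea discussed above can be sketched in plain std Rust. This is an illustrative sketch only: the actual PR uses an async semaphore held on the Repository, while `Semaphore` and `process_limited` below are invented names demonstrating the same backpressure pattern with threads.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Minimal counting semaphore built from Mutex + Condvar
/// (stand-in for the async semaphore used in the real code).
struct Semaphore {
    permits: Mutex<usize>,
    cvar: Condvar,
}

impl Semaphore {
    fn new(permits: usize) -> Self {
        Semaphore { permits: Mutex::new(permits), cvar: Condvar::new() }
    }
    fn acquire(&self) {
        let mut n = self.permits.lock().unwrap();
        while *n == 0 {
            n = self.cvar.wait(n).unwrap();
        }
        *n -= 1;
    }
    fn release(&self) {
        *self.permits.lock().unwrap() += 1;
        self.cvar.notify_one();
    }
}

/// Process `jobs` on worker threads, but never more than `limit` at once.
fn process_limited(jobs: Vec<u64>, limit: usize) -> u64 {
    let sem = Arc::new(Semaphore::new(limit));
    let handles: Vec<_> = jobs
        .into_iter()
        .map(|job| {
            let sem = Arc::clone(&sem);
            thread::spawn(move || {
                sem.acquire();        // backpressure: wait for a permit
                let result = job * 2; // stand-in for "store one object"
                sem.release();
                result
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn main() {
    // 2+4+...+20 = 110, regardless of the concurrency limit
    println!("{}", process_limited((1..=10).collect(), 4));
}
```

Because the semaphore is shared, the same permit pool can bound total in-flight work across multiple layers being imported at once, which is what makes the shared (rather than per-layer) placement important.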
Add parallel object storage when processing tar archives. Large files are
streamed to O_TMPFILE via a channel, and fs-verity digests are computed in
background blocking tasks.
Key components:
- FsVerityHasher: Made public for incremental digest computation from files
on disk without loading them entirely into memory. Added BLOCK_SIZE const.
- Repository:
- create_object_tmpfile(): Create O_TMPFILE for streaming writes
- spawn_finalize_object_tmpfile(): Spawn blocking task to enable verity
and link into objects directory
- finalize_object_tmpfile(): Sync implementation that enables verity
(letting the kernel compute the digest) then measures the digest
- compute_verity_digest(): Userspace fallback for insecure mode
- SplitStreamBuilder: New type that accumulates pending object handles and
resolves ObjectIDs when finalizing the splitstream
- split_async(): Refactored to stream file content through channels to
blocking tasks, with semaphore-based concurrency limiting
Notes from review discussion:
- We're not optimizing for existing objects, as we generally expect
reuse to happen via layers anyways.
- A lot of the performance improvement probably came from avoiding
  fdatasync() per object in the earlier commit, but this still makes
  things 2x faster.
Closes: composefs#62
Assisted-by: OpenCode (Sonnet 4)
Signed-off-by: Colin Walters <walters@verbum.org>
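The channel-to-blocking-task shape this commit message describes can be sketched with plain std channels and threads. Everything here is a stand-in: the real split_async uses async channels, an O_TMPFILE on disk, and a real fs-verity digest, while `toy_digest` and `spawn_object_writer` are invented for illustration.

```rust
use std::sync::mpsc;
use std::thread;

/// Toy digest standing in for fs-verity (FNV-1a, 64-bit).
fn toy_digest(data: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for &b in data {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

/// Spawn a "blocking task" that receives one file's chunks over a channel,
/// accumulates them (stand-in for streaming to an O_TMPFILE), and returns
/// the digest of the complete file once the sender side closes.
fn spawn_object_writer() -> (mpsc::Sender<Vec<u8>>, thread::JoinHandle<u64>) {
    let (tx, rx) = mpsc::channel::<Vec<u8>>();
    let handle = thread::spawn(move || {
        let mut file = Vec::new(); // would be an O_TMPFILE on disk
        for chunk in rx {
            file.extend_from_slice(&chunk);
        }
        toy_digest(&file) // would be the fs-verity digest
    });
    (tx, handle)
}

fn main() {
    // The tar reader streams each large file's content as chunks...
    let (tx, pending) = spawn_object_writer();
    tx.send(b"hello ".to_vec()).unwrap();
    tx.send(b"world".to_vec()).unwrap();
    drop(tx); // end of file: closing the channel finalizes the object

    // ...and the splitstream resolves the object ID when finalizing.
    let digest = pending.join().unwrap();
    assert_eq!(digest, toy_digest(b"hello world"));
    println!("object digest: {digest:016x}");
}
```

The key property is that the tar reader never waits for a digest: it hands chunks to the writer and moves on to the next entry, leaving the handles to be resolved when the splitstream is finalized.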
5f1cce7 to 9a6773d
I had an agent work on this overnight; I didn't do a deep verification, but it smells right to me (and matches your intuition):

Performance Improvement Summary: composefs-rs PR #208
Assisted-by: OpenCode (Opus 4.5)

Benchmark:
| Version | Commit | Import Time | Speedup vs Baseline |
|---|---|---|---|
| Baseline (fdatasync per object) | 54c3c6d | ~580s (~9.7 min) | 1x |
| syncfs optimization | c3677f1 | ~29.4s | ~20x |
| syncfs + parallel | 5f1cce7 + fix | ~16.7s | ~35x |
Root Cause of Baseline Slowness
The baseline called fdatasync() after writing each object file. While this ran inside spawn_blocking (so it didn't block the async runtime directly), the thousands of per-object fdatasync calls forced the filesystem to create a massive number of journal commits, causing severe I/O overhead.
Optimization Breakdown
1. syncfs optimization (Allison Karlitskaya)
- Commit: c3677f1 ("repository: change our sync story")
- Change: Replace per-object fdatasync() with a single syncfs() before linking the splitstream
- Why it helps: Instead of thousands of individual journal commits, there's now one bulk sync at the end
- Time: ~29.4s → ~20x faster
2. Parallel object storage (Colin Walters)
- Commit: 5f1cce7 ("oci: Parallelize object storage within tar layers")
- Change: Process multiple large files concurrently within each tar layer using channels and spawn_blocking tasks with semaphore-based concurrency limiting
- Why it helps: Previously files were processed sequentially (one at a time). Now multiple files stream to tmpfiles in parallel, with fs-verity digests computed concurrently
- Time: ~16.7s → ~2x additional, ~35x total
Bug Fix Required
The parallel implementation had a race condition where forked processes could inherit writable file descriptors, causing enable_verity to fail with "File is opened for writing".
Fix: Changed finalize_object_tmpfile() to use enable_verity_maybe_copy() instead of enable_verity_with_retry(). This creates a copy of the file if verity can't be enabled on the original, providing robustness against fd inheritance races.
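The control flow of that copy fallback can be sketched generically. Everything below is a stand-in: `try_seal` simulates the enable-verity step failing when a writable fd was inherited, and `seal_maybe_copy` mirrors only the shape of enable_verity_maybe_copy, not its real implementation.

```rust
use std::fs;

/// Stand-in for enabling fs-verity: fails if some process still holds the
/// file open for writing (the inherited-fd race described above).
fn try_seal(_path: &str, writable_fd_leaked: bool) -> Result<(), &'static str> {
    if writable_fd_leaked {
        Err("File is opened for writing")
    } else {
        Ok(())
    }
}

/// Try to seal in place; on failure, copy the contents to a fresh file and
/// seal the copy instead. A fresh copy has no inherited writable fds, so
/// sealing it cannot hit the race.
fn seal_maybe_copy(path: &str, writable_fd_leaked: bool) -> std::io::Result<String> {
    if try_seal(path, writable_fd_leaked).is_ok() {
        return Ok(path.to_string());
    }
    let copy_path = format!("{path}.copy");
    fs::copy(path, &copy_path)?;
    try_seal(&copy_path, false).expect("fresh copy is sealable");
    Ok(copy_path)
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("verity-demo.obj");
    fs::write(&path, b"object data")?;
    let p = path.to_str().unwrap();
    // Race hit: falls back to sealing a copy.
    let sealed = seal_maybe_copy(p, true)?;
    assert!(sealed.ends_with(".copy"));
    println!("sealed: {sealed}");
    Ok(())
}
```

The trade-off versus a retry loop is one extra data copy in the rare race case, in exchange for guaranteed forward progress.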
Visual Summary
┌─────────────────────────────────────────────────────────────────┐
│ Composefs Import Performance Improvement │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Baseline ████████████████████ 580s │
│ (fdatasync per object) │
│ │
│ + syncfs optimization █ 29.4s (20x faster) │
│ (single bulk sync) │
│ │
│ + parallel file processing ▌ 16.7s (35x faster) │
│ (concurrent I/O within layers) │
│ │
└─────────────────────────────────────────────────────────────────┘
Test Environment
- Benchmark command: bcvk to-disk --composefs-backend --disk-size=10G localhost/bootc target/bench-disk.img
- Image: CentOS Stream 10 bootc with composefs-sealed uki-sdboot variant
- Multiple runs averaged for each configuration
Glad to see the work we did around getting that right (or at least as right as we can without kernel changes) is paying off!
Add parallel object storage when processing tar archives. Large files are streamed to O_TMPFILE via a channel, and fs-verity digests are computed in background blocking tasks. This avoids blocking the async runtime while allowing multiple files to be processed concurrently.
Key components:
- FsVerityHasher: Made public for incremental digest computation from files on disk without loading them entirely into memory.
- Repository:
- SplitStreamBuilder: Collects inline data and pending external object handles, resolves ObjectIDs at finalization with proper deduplication.
- split_async(): Now takes AsyncBufRead, streams via channel to blocking tasks, uses repository's shared semaphore for backpressure.
For a 2GB tar (10,000 files × 200KB), achieves ~7x speedup: ~980ms → ~140ms (~14 GB/s throughput).
Assisted-by: OpenCode (Opus 4.5)
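The accumulate-then-resolve shape described for SplitStreamBuilder can be sketched with plain threads and a toy digest. The `Entry` type, field names, and digest are invented for illustration; the real builder tracks inline data and async task handles, and resolves real ObjectIDs.

```rust
use std::collections::HashSet;
use std::thread::{self, JoinHandle};

/// One entry in the splitstream: either small data kept inline, or a large
/// file whose object ID is still being computed on a background task.
enum Entry {
    Inline(Vec<u8>),
    Pending(JoinHandle<u64>), // resolves to a (toy) object ID
}

#[derive(Default)]
struct SplitStreamBuilder {
    entries: Vec<Entry>,
}

impl SplitStreamBuilder {
    fn push_inline(&mut self, data: &[u8]) {
        self.entries.push(Entry::Inline(data.to_vec()));
    }

    /// Hand a large file off to a background task and record the handle.
    fn push_external(&mut self, data: Vec<u8>) {
        self.entries.push(Entry::Pending(thread::spawn(move || {
            // stand-in for streaming to O_TMPFILE + fs-verity digest
            data.iter()
                .fold(0u64, |h, &b| h.wrapping_mul(31).wrapping_add(b as u64))
        })));
    }

    /// Resolve all pending object IDs, deduplicating repeated objects.
    fn finalize(self) -> (usize, Vec<u64>) {
        let mut seen = HashSet::new();
        let mut ids = Vec::new();
        let mut inline_bytes = 0;
        for entry in self.entries {
            match entry {
                Entry::Inline(d) => inline_bytes += d.len(),
                Entry::Pending(h) => {
                    let id = h.join().unwrap();
                    if seen.insert(id) {
                        ids.push(id);
                    }
                }
            }
        }
        (inline_bytes, ids)
    }
}

fn main() {
    let mut b = SplitStreamBuilder::default();
    b.push_inline(b"hdr");
    b.push_external(b"bigfile".to_vec()); // same content twice:
    b.push_external(b"bigfile".to_vec()); // deduplicated at finalize
    b.push_external(b"other".to_vec());
    let (inline_bytes, ids) = b.finalize();
    println!("{inline_bytes} inline bytes, {} unique objects", ids.len());
}
```

Deferring resolution to finalize() is what lets the tar reader keep consuming entries while earlier files are still being digested in the background.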