Background read-only region creation #1919

Merged: jmpesp merged 3 commits into oxidecomputer:main from jmpesp:concurrent_read_only_clone on Apr 14, 2026

@jmpesp jmpesp commented Apr 9, 2026

When the Crucible Agent is asked to create a read-only region from a remote Downstairs source, region creation currently runs inside the worker loop. This blocks the worker thread, which cannot respond to other state changes while the region is being created.

This commit spawns region creation threads that the main worker thread can send requests to, and sends all read-only region creation requests there.
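The hand-off described above can be sketched with a standard-library channel. This is only an illustration, not the agent's code: `CreateReadOnlyRegion`, `Created`, `spawn_creator`, and `run_worker` are hypothetical names, and a short sleep stands in for the slow clone from the remote source.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical request type; the agent's real types are not shown here.
#[derive(Debug)]
struct CreateReadOnlyRegion {
    region_id: String,
    source: String,
}

/// Hypothetical completion message sent back to the worker.
#[derive(Debug)]
struct Created {
    region_id: String,
}

/// Spawn a creation thread and return the sending half of its request queue.
fn spawn_creator(done_tx: mpsc::Sender<Created>) -> mpsc::Sender<CreateReadOnlyRegion> {
    let (req_tx, req_rx) = mpsc::channel::<CreateReadOnlyRegion>();
    thread::spawn(move || {
        // The loop ends once every request sender has been dropped.
        for req in req_rx {
            // Stand-in for the slow clone from the remote source.
            let _ = &req.source;
            thread::sleep(Duration::from_millis(50));
            if done_tx.send(Created { region_id: req.region_id }).is_err() {
                break; // the worker went away
            }
        }
    });
    req_tx
}

/// Worker-side sketch: queue the slow request, then keep handling other work.
fn run_worker() -> Vec<String> {
    let (done_tx, done_rx) = mpsc::channel();
    let creator = spawn_creator(done_tx);
    let mut events = Vec::new();

    creator
        .send(CreateReadOnlyRegion {
            region_id: "r1".into(),
            source: "remote-downstairs".into(),
        })
        .unwrap();
    events.push("queued r1".to_string());

    // `send` does not wait for the creation, so the worker is free to
    // process other state changes in the meantime.
    events.push("handled other state change".to_string());

    let created = done_rx.recv().unwrap();
    events.push(format!("created {}", created.region_id));
    events
}

fn main() {
    for event in run_worker() {
        println!("{event}");
    }
}
```

The point of the design is that the worker only pays the cost of a channel send; the long-running work happens on a thread that owns nothing but its request queue.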

This builds on the previous work to separate the serialized on-disk types from the in-memory types: a `Creating` state is added to the in-memory type and used while this background creation is occurring.
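One way to picture the split between on-disk and in-memory types is a pair of enums where only the in-memory one has a `Creating` variant. The sketch below is hypothetical: the variant sets, the names `OnDiskState` and `InMemoryState`, and in particular the choice to persist `Creating` as `Requested` are assumptions made for illustration, not taken from the PR.

```rust
/// Hypothetical on-disk state: only what would be serialized.
#[derive(Debug, Clone, Copy, PartialEq)]
enum OnDiskState {
    Requested,
    Created,
    Failed,
}

/// Hypothetical in-memory state: adds `Creating`, which is never persisted.
#[derive(Debug, Clone, Copy, PartialEq)]
enum InMemoryState {
    Requested,
    Creating,
    Created,
    Failed,
}

impl From<OnDiskState> for InMemoryState {
    /// Loading from disk can never produce `Creating`.
    fn from(state: OnDiskState) -> Self {
        match state {
            OnDiskState::Requested => InMemoryState::Requested,
            OnDiskState::Created => InMemoryState::Created,
            OnDiskState::Failed => InMemoryState::Failed,
        }
    }
}

impl InMemoryState {
    /// Assumption for this sketch: an in-progress creation is written out
    /// as `Requested`, so a restart would see it as not yet done.
    fn to_on_disk(self) -> OnDiskState {
        match self {
            InMemoryState::Requested | InMemoryState::Creating => OnDiskState::Requested,
            InMemoryState::Created => OnDiskState::Created,
            InMemoryState::Failed => OnDiskState::Failed,
        }
    }
}

fn main() {
    let in_memory = InMemoryState::Creating;
    println!("{:?} is persisted as {:?}", in_memory, in_memory.to_on_disk());
}
```

Keeping the transient variant out of the serialized type means the on-disk format does not need to change when a new in-memory-only state is introduced.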

@jmpesp jmpesp requested a review from leftwo April 9, 2026 01:10

@leftwo leftwo left a comment

Thanks for the work here. I have some questions for you.

Comment thread agent/src/main.rs
  let log0 = log.new(o!("component" => "worker"));
  let df0 = Arc::clone(&df);
- std::thread::spawn(|| {
+ tokio::spawn(async {

If we are going from a real thread to a tokio task, could a long-running region creation trip us up here? The old way used a thread, which could go off and do whatever for an hour while the rest of the agent continued working. Do we run any risk of that here?

jmpesp (Contributor, Author) replied:

I'm not sure about all the differences between threads and tasks, but I don't think there's a risk. Whether the worker runs in a thread or in a task, read/write region creation occurs separately from the dropshot server and the datafile manipulation logic.

@leftwo leftwo self-requested a review April 14, 2026 16:13
@jmpesp jmpesp merged commit cb62c0b into oxidecomputer:main Apr 14, 2026
17 checks passed
@jmpesp jmpesp deleted the concurrent_read_only_clone branch April 14, 2026 16:21
jmpesp added a commit to oxidecomputer/omicron that referenced this pull request May 26, 2026
Update Crucible from `7103cd3a` to `bd9a0e2a`, picking up the following
PRs:

- Use an explicit rev for oxidecomputer git deps
(oxidecomputer/crucible#1936)
- Add Clone and Deserialize to VolumeInfo et al
(oxidecomputer/crucible#1935)
- Update omicron/oximeter (oxidecomputer/crucible#1933)
- [meta] update to drift 0.1.4 (oxidecomputer/crucible#1932)
- Don't log if there is nothing to log (oxidecomputer/crucible#1930)
- Add VolumeInfo (oxidecomputer/crucible#1928)
- Remove bonus Volume layer (oxidecomputer/crucible#1927)
- Add session and client id to panic messages
(oxidecomputer/crucible#1926)
- [crucible-agent-types] migrate to RFD 619 pattern
(oxidecomputer/crucible#1899)
- Background read-only region creation (oxidecomputer/crucible#1919)
- [crucible-downstairs-repair] switch to RFD 619 pattern
(oxidecomputer/crucible#1901)
- [crucible-pantry] switch to RFD 619 pattern
(oxidecomputer/crucible#1900)
- Use separate in-memory types (oxidecomputer/crucible#1913)
- Remove old field from dtrace action script
(oxidecomputer/crucible#1917)
- Retry data writes that return an IO error
(oxidecomputer/crucible#1915)
- Bump dropshot to 0.17.0 (oxidecomputer/crucible#1909)
- Reject snapshot requests when read-only (oxidecomputer/crucible#1914)
- update ringbuf method, fix clippy lint (oxidecomputer/crucible#1904)
- bump vergen-v9 version too (oxidecomputer/crucible#1903)
- update dropshot to 0.16.7, dropshot-api-manager to 0.5.2
(oxidecomputer/crucible#1851)
- perf-vol.d updates (oxidecomputer/crucible#1898)
- upgrade progenitor to 0.13, reqwest to 0.13
(oxidecomputer/crucible#1854)
- Remove cargo nextest from github workflow, out of space
(oxidecomputer/crucible#1846)
- Add a test for VCR serialize/deserialize (oxidecomputer/crucible#1843)

Update Propolis from `bc489ddf` to `58ab73bd`, picking up the following
PRs:

- Bump crucible to latest, update Omicron, use explicit revs
(oxidecomputer/propolis#1141)
- Add project and silo ids to VM attestation
(oxidecomputer/propolis#1114)
- Update escargot (oxidecomputer/propolis#1139)
- Prefix shebang and mark D scripts as executable
(oxidecomputer/propolis#1140)
- Fix error in propolis-server README (oxidecomputer/propolis#1138)
- [meta] update to drift 0.1.4 (oxidecomputer/propolis#1137)
- Fix Intel CPUID leaf 4 cache topology for SMT
(oxidecomputer/propolis#1002)
- support NVMe Deallocate (oxidecomputer/propolis#1105)
- viona: do not lose used/avail indices (oxidecomputer/propolis#1135)
- viona: multiqueue device should stay multiqueue across migration
(oxidecomputer/propolis#1121)
- Bump crucible rev to latest (oxidecomputer/propolis#1132)
- expand zerocopy IntoBytes/FromBytes use in guest memory accesses
(oxidecomputer/propolis#1130)
- dropshot-api-manager 0.7.1 (oxidecomputer/propolis#1129)
- improve slog component setting (oxidecomputer/propolis#1124)
- wait for viona Poller to run before declaring device running
(oxidecomputer/propolis#1118)
- virtio: tolerate importing queues with adjusted size
(oxidecomputer/propolis#1117)
- Run viona unit tests in CI (oxidecomputer/propolis#1120)
- feature gate Crucible-specific boot digest code
(oxidecomputer/propolis#1119)

Also:

- ran `cargo update -p vergen`

- removed the `reqwest012` dependency

- removed `reqwest012_client` from Nexus

- ran `cargo hakari generate` and `cargo hakari manage-deps`

- replaced use of `ProgenitorOperationRetry` with
`retry_operation_while_indefinitely`

- changed the region replacement drive saga to consume the new
`VolumeInfo` from Propolis and use it to determine when to consider a
replacement done