Reject snapshot requests when read-only #1914

Merged
jmpesp merged 1 commit into oxidecomputer:main from jmpesp:reject_read_only_snapshot
Mar 23, 2026
Conversation

@jmpesp jmpesp commented Mar 23, 2026

Contributor

The root of the problem behind oxidecomputer/omicron#9855 was that the downstairs was retrying a job that would never succeed and, in a hot loop, notifying the upstairs every time it failed. This consumed memory until it was entirely exhausted.

Fix this by rejecting snapshot requests in the Upstairs when it is read-only. Note that we don't need to check this in Volume, because each Upstairs in a Volume should be read-only if any of them are.

Fixes #1856
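
As a rough illustration of the check described above, here is a minimal sketch of rejecting a snapshot request up front when the Upstairs is read-only. Every name in it (`Upstairs`, `Config`, `SnapshotDetails`, `FlushError`, `submit_flush`, `submitted_flushes`) is a simplified, hypothetical stand-in chosen for this example and is not Crucible's actual API; only the `snapshot_details.is_some()` and `self.cfg.read_only` conditions mirror the diff in this PR.

```rust
// Hypothetical, simplified types for illustration only.
// These are NOT Crucible's real definitions.
#[derive(Debug, PartialEq)]
pub enum FlushError {
    /// A snapshot was requested on a read-only Upstairs.
    SnapshotOnReadOnly,
}

pub struct Config {
    pub read_only: bool,
}

pub struct SnapshotDetails {
    pub snapshot_name: String,
}

pub struct Upstairs {
    pub cfg: Config,
    /// Counts flushes that were accepted (stand-in for queuing real work).
    pub submitted_flushes: usize,
}

impl Upstairs {
    /// Accept a flush, optionally carrying a snapshot request.
    ///
    /// A snapshot on a read-only Upstairs can never succeed, so it is
    /// rejected here, before any job is created that something further
    /// down could retry forever.
    pub fn submit_flush(
        &mut self,
        snapshot_details: Option<SnapshotDetails>,
    ) -> Result<(), FlushError> {
        if let Some(details) = &snapshot_details {
            if self.cfg.read_only {
                // This is not expected in normal operation, so log it.
                eprintln!(
                    "rejecting snapshot {:?}: upstairs is read-only",
                    details.snapshot_name
                );
                return Err(FlushError::SnapshotOnReadOnly);
            }
        }

        // Plain flushes, and snapshots on a read-write Upstairs, proceed.
        self.submitted_flushes += 1;
        Ok(())
    }
}
```

In this sketch, a plain flush (`None`) on a read-only Upstairs still returns `Ok(())`, while a flush carrying `Some(SnapshotDetails { .. })` returns `Err(FlushError::SnapshotOnReadOnly)` without being counted. The point is that the caller gets one immediate error instead of a job that fails repeatedly.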

@jmpesp jmpesp requested review from leftwo and mkeeter March 23, 2026 15:44
Comment thread upstairs/src/upstairs.rs
*/

if snapshot_details.is_some() {
if self.cfg.read_only {

Contributor

Do we expect to be calling this if things are behaving as expected?
If yes, then maybe we don't want to log it.

If it's not expected, then it's fine to leave the log.

Contributor Author

It's definitely not expected - if anything it'd be great to package this up as a fault somehow (aka oxidecomputer/omicron#10118)

Contributor

Yeah, we should start making a list of these places in crucible where we want to tell someone about it, but don't know how to.

@jmpesp jmpesp merged commit c32ccfe into oxidecomputer:main Mar 23, 2026
17 checks passed
@jmpesp jmpesp deleted the reject_read_only_snapshot branch March 23, 2026 18:37
jmpesp added a commit to oxidecomputer/omicron that referenced this pull request May 26, 2026
Update Crucible from `7103cd3a` to `bd9a0e2a`, picking up the following
PRs:

- Use an explicit rev for oxidecomputer git deps
(oxidecomputer/crucible#1936)
- Add Clone and Deserialize to VolumeInfo et al
(oxidecomputer/crucible#1935)
- Update omicron/oximeter (oxidecomputer/crucible#1933)
- [meta] update to drift 0.1.4 (oxidecomputer/crucible#1932)
- Don't log if there is nothing to log (oxidecomputer/crucible#1930)
- Add VolumeInfo (oxidecomputer/crucible#1928)
- Remove bonus Volume layer (oxidecomputer/crucible#1927)
- Add session and client id to panic messages
(oxidecomputer/crucible#1926)
- [crucible-agent-types] migrate to RFD 619 pattern
(oxidecomputer/crucible#1899)
- Background read-only region creation (oxidecomputer/crucible#1919)
- [crucible-downstairs-repair] switch to RFD 619 pattern
(oxidecomputer/crucible#1901)
- [crucible-pantry] switch to RFD 619 pattern
(oxidecomputer/crucible#1900)
- Use separate in-memory types (oxidecomputer/crucible#1913)
- Remove old field from dtrace action script
(oxidecomputer/crucible#1917)
- Retry data writes that return an IO error
(oxidecomputer/crucible#1915)
- Bump dropshot to 0.17.0 (oxidecomputer/crucible#1909)
- Reject snapshot requests when read-only (oxidecomputer/crucible#1914)
- update ringbuf method, fix clippy lint (oxidecomputer/crucible#1904)
- bump vergen-v9 version too (oxidecomputer/crucible#1903)
- update dropshot to 0.16.7, dropshot-api-manager to 0.5.2
(oxidecomputer/crucible#1851)
- perf-vol.d updates (oxidecomputer/crucible#1898)
- upgrade progenitor to 0.13, reqwest to 0.13
(oxidecomputer/crucible#1854)
- Remove cargo nextest from github workflow, out of space
(oxidecomputer/crucible#1846)
- Add a test for VCR serialize/deserialize (oxidecomputer/crucible#1843)

Update Propolis from `bc489ddf` to `58ab73bd`, picking up the following
PRs:

- Bump crucible to latest, update Omicron, use explicit revs
(oxidecomputer/propolis#1141)
- Add project and silo ids to VM attestation
(oxidecomputer/propolis#1114)
- Update escargot (oxidecomputer/propolis#1139)
- Prefix shebang and mark D scripts as executable
(oxidecomputer/propolis#1140)
- Fix error in propolis-server README (oxidecomputer/propolis#1138)
- [meta] update to drift 0.1.4 (oxidecomputer/propolis#1137)
- Fix Intel CPUID leaf 4 cache topology for SMT
(oxidecomputer/propolis#1002)
- support NVMe Deallocate (oxidecomputer/propolis#1105)
- viona: do not lose used/avail indices (oxidecomputer/propolis#1135)
- viona: multiqueue device should stay multiqueue across migration
(oxidecomputer/propolis#1121)
- Bump crucible rev to latest (oxidecomputer/propolis#1132)
- expand zerocopy IntoBytes/FromByes use in guest memory accesses
(oxidecomputer/propolis#1130)
- dropshot-api-manager 0.7.1 (oxidecomputer/propolis#1129)
- improve slog component setting (oxidecomputer/propolis#1124)
- wait for viona Poller to run before declaring device running
(oxidecomputer/propolis#1118)
- virtio: tolerate importing queues with adjusted size
(oxidecomputer/propolis#1117)
- Run viona unit tests in CI (oxidecomputer/propolis#1120)
- feature gate Crucible-specific boot digest code
(oxidecomputer/propolis#1119)

Also:

- ran `cargo update -p vergen`

- removed the `reqwest012` dependency

- removed `reqwest012_client` from Nexus

- ran `cargo hakari generate` and `cargo hakari manage-deps`

- replace use of `ProgenitorOperationRetry` with
`retry_operation_while_indefinitely`

- during the region replacement drive saga, consume the new `VolumeInfo`
from Propolis and use that to determine when to consider a replacement
done
Development

Successfully merging this pull request may close these issues.

Return an error from the Upstairs if trying to snapshot a read-only region set
