Issue extent flushes in parallel by jmpesp · Pull Request #801 · oxidecomputer/crucible

jmpesp · 2023-06-14T20:56:19Z

First, move the concept of whether an Extent is opened or closed out of Extent by removing the Option around Mutex:

-    inner: Option<Mutex<Inner>>,
+    pub inner: Mutex<Inner>,

This moves up to Region as a new enum:

-    pub extents: Vec<Extent>,
+    pub extents: Vec<Arc<Mutex<ExtentState>>>,

where ExtentState uses the variant to store if the extent is opened or closed:

+pub enum ExtentState {
+    Opened(Arc<Extent>),
+    Closed,
+}

This is all in service of fn region_flush issuing a tokio::spawn for each extent.flush call, and waiting on each join handle that is returned. This is possible when an Arc<Extent> is used because a clone can be passed to each task.

Using an enum in this way is also a little safer: Region::close_extent actually consumes the old extent while closing it, meaning that the old Extent no longer hangs around. It does however mean that the downstairs will panic if the ExtentState is not expected: both get_opened_extent and close_extent panic if the extent in question is closed. close_extent also panics if you're trying to close an extent where there's an Arc clone somewhere as part of doing some other work.

Note this work required updating to Rust 1.70 (for Arc::into_inner). There's a few unrelated changes due to this.

First, move the concept of whether an Extent is opened or closed *out* of Extent by removing the Option around Mutex<Inner>: - inner: Option<Mutex<Inner>>, + pub inner: Mutex<Inner>, This moves up to Region as a new enum: - pub extents: Vec<Extent>, + pub extents: Vec<Arc<Mutex<ExtentState>>>, where ExtentState uses the variant to store if the extent is opened or closed: +pub enum ExtentState { + Opened(Arc<Extent>), + Closed, +} This is all in service of `fn region_flush` issuing a `tokio::spawn` for each `extent.flush` call, and waiting on each join handle that is returned. This is possible when an `Arc<Extent>` is used because a clone can be passed to each task. Using an enum in this way is also a little safer: `Region::close_extent` actually consumes the old extent while closing it, meaning that the old Extent no longer hangs around. It does however mean that the downstairs will panic if the ExtentState is not expected: both `get_opened_extent` and `close_extent` panic if the extent in question is closed. `close_extent` also panics if you're trying to close an extent where there's an Arc clone somewhere as part of doing some other work. Note this work required updating to Rust 1.70 (for `Arc::into_inner`). There's a few unrelated changes due to this.

leftwo

We should run the ./tools/test_repair.sh and ./tools/test_restart_repair.sh tests, and possibly bump their internal loop counters and let them run for longer.

leftwo · 2023-06-16T16:13:09Z

@@ -1,3 +1,3 @@
 [toolchain]
-channel = "1.66"
+channel = "1.70"


If we bump this, do we need it bumped anywhere else in Omicron/Propolis?
I don't think so, as the downstairs itself is standalone, but not sure about any
other consumers of the upstairs

I'm not sure, leaning towards yes - I can't see how (for example) Propolis would be able to compile this in as a library without support for Arc::into_inner. We'll see when that rev gets bumped I guess :)

We might want to try to bump propolis and see if it compiles. If there is some entanglement
and it turns out to be a huge amount of work, we might consider holding this fix till we can
get propolis ready for us. I'm sure we will have other changes we want in Crucible and I want
to be able to update omicron/propolis frequently over the next few weeks.

i was compiling latest propolis against this commit to run my perf tests fwiw and didnt run into any issues. though i did not like, run the propolis test suite or anything like that. just launched an instance and ran fio in it.

leftwo · 2023-06-17T00:32:18Z

+        assert_eq!(res, ReplaceResult::Started);
+
+        eprintln!("Wait for some repair work to proceed");
+        tokio::time::sleep(tokio::time::Duration::from_secs(4)).await;


I've been pulling out hair tracking down failing CI tests, and the sleep here makes me
nervous. Is there any way to make this deterministic? Like checking for some job progress but
not completed repair yet? On a busy system or a fast system, could this sleep be too little
or too much?

Good point, I changed this to check for the StartedAlready and panic if the repair completed in 9994e79

leftwo · 2023-06-30T20:15:07Z

I ran
tools/test_reconnect.sh <- this test is broken now, and I'll be turning it off soon.
tools/test_restart_repair.sh
tools/test_repaly.sh (new, coming soon replacement for test_reconnect)
tools/test_repair.sh
tools/test_live_repair.sh
tools/hammer_loop.sh

All passed. I upped the loop count for each to get some extended runtime.

leftwo · 2023-06-30T20:41:37Z

+        for join_handle in join_handles {
+            join_handle
+                .await
+                .map_err(|e| CrucibleError::GenericError(e.to_string()))??;


If we error here, where does this bubble out to?

if any of the flushes fails, then it will get returned as a failure here for region_flush. Though, see 53c45fb for the fix for an issue you brought up during a chat: we should wait for all results before returning, instead of (potentially) cancelling these flush tasks in the middle of execution if one of them errors.

leftwo

Just that one question

Bump crucible rev to pick up: - Self assembling zones must set their own MTU - fix build break caused by merging oxidecomputer/crucible#801 - Issue extent flushes in parallel - Add Logger Option to Volume construct method - Update Rust crate libc to 0.2.147 - Update Rust crate percent-encoding to 2.3 - Retry jobs until they succeed - Reorder select arms so pings can't be starved out - Treat a skipped IO like an error IO for ACK results. - Retry pantry requests - Remove panics and asserts in dummy tests - Update Rust crate csv to 1.2.2 - Update Rust crate reedline to 0.21.0 - Set open file resource limit to the max - Update Rust crate ringbuffer to 0.14 - DTrace meet cmon - Widen assert values to u128 to deal with u64::MAX - Change size_to_validate from usize to u64 - Turn on live-repair test in CI - Increase flush_timeout for some tests, collect cores - Update to latest dropshot Bump propolis rev to pick up: - Self assembling zones must set their own MTU - Bump crucible rev to latest - Make the propolis zone self-assembling - Flesh out more PIIX3-PM to suppress log gripes - Bump crucible rev to latest - Restructure PM-timer under PIIX3 device - Fix inventory handling for nested child entities - Centralize vmm-data interface into bhyve_api - Clean up PCI device classes - Update openssl dep to 0.10.55 - Allow propolis-standalone to use VMM reservoir - only allow one request to reboot to be enqueued at a time Nexus is currently setting the MTU inside self-assembling zones, but this goes against the idea that self-assembling zones perform their own configuration. Remove the code in `RunningZone::boot` that performs commands on self-assembling zones, and set the MTU of $DATALINK in the Clickhouse and CockroachDB method scripts. Fixes oxidecomputer#3512

Bump crucible rev to pick up: - Self assembling zones must set their own MTU - fix build break caused by merging oxidecomputer/crucible#801 - Issue extent flushes in parallel - Add Logger Option to Volume construct method - Update Rust crate libc to 0.2.147 - Update Rust crate percent-encoding to 2.3 - Retry jobs until they succeed - Reorder select arms so pings can't be starved out - Treat a skipped IO like an error IO for ACK results. - Retry pantry requests - Remove panics and asserts in dummy tests - Update Rust crate csv to 1.2.2 - Update Rust crate reedline to 0.21.0 - Set open file resource limit to the max - Update Rust crate ringbuffer to 0.14 - DTrace meet cmon - Widen assert values to u128 to deal with u64::MAX - Change size_to_validate from usize to u64 - Turn on live-repair test in CI - Increase flush_timeout for some tests, collect cores - Update to latest dropshot Bump propolis rev to pick up: - Self assembling zones must set their own MTU - Bump crucible rev to latest - Make the propolis zone self-assembling - Flesh out more PIIX3-PM to suppress log gripes - Bump crucible rev to latest - Restructure PM-timer under PIIX3 device - Fix inventory handling for nested child entities - Centralize vmm-data interface into bhyve_api - Clean up PCI device classes - Update openssl dep to 0.10.55 - Allow propolis-standalone to use VMM reservoir - only allow one request to reboot to be enqueued at a time Nexus is currently setting the MTU inside self-assembling zones, but this goes against the idea that self-assembling zones perform their own configuration. Remove the code in `RunningZone::boot` that performs commands on self-assembling zones, and set the MTU of $DATALINK in the Clickhouse and CockroachDB method scripts. Fixes #3512

jmpesp requested review from faithanalog and leftwo June 14, 2023 20:56

leftwo reviewed Jun 16, 2023

View reviewed changes

jmpesp added 2 commits June 16, 2023 17:03

add integration_test_volume_replace_downstairs_then_takeover

4955d17

add back comment

b3484b0

leftwo reviewed Jun 17, 2023

View reviewed changes

no magic sleeps

9994e79

leftwo reviewed Jun 30, 2023

View reviewed changes

leftwo approved these changes Jun 30, 2023

View reviewed changes

jmpesp added 4 commits June 30, 2023 17:48

Merge branch 'main' into parallel_flush

0b6bdc9

collect all results, then check them

53c45fb

Merge branch 'main' into parallel_flush

2c082f2

use rust-version for the downstairs due to Arc::into_inner

ac74d73

jmpesp merged commit 591ed5b into oxidecomputer:main Jul 5, 2023

jmpesp deleted the parallel_flush branch July 5, 2023 14:24

jmpesp added a commit to jmpesp/crucible that referenced this pull request Jul 5, 2023

fix build break caused by merging oxidecomputer#801

f78832d

jmpesp mentioned this pull request Jul 7, 2023

Self assembling zones should set their own MTU oxidecomputer/omicron#3520

Merged

Conversation

jmpesp commented Jun 14, 2023

Uh oh!

leftwo left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

leftwo Jun 16, 2023

Choose a reason for hiding this comment

Uh oh!

jmpesp Jun 16, 2023

Choose a reason for hiding this comment

Uh oh!

leftwo Jun 17, 2023

Choose a reason for hiding this comment

Uh oh!

faithanalog Jun 17, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

leftwo Jun 17, 2023

Choose a reason for hiding this comment

Uh oh!

jmpesp Jun 19, 2023

Choose a reason for hiding this comment

Uh oh!

leftwo commented Jun 30, 2023

Uh oh!

leftwo Jun 30, 2023

Choose a reason for hiding this comment

Uh oh!

jmpesp Jul 4, 2023

Choose a reason for hiding this comment

Uh oh!

leftwo left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

faithanalog Jun 17, 2023 •

edited

Loading