Skip to content

Issue extent flushes in parallel#801

Merged
jmpesp merged 8 commits into
oxidecomputer:mainfrom
jmpesp:parallel_flush
Jul 5, 2023
Merged

Issue extent flushes in parallel#801
jmpesp merged 8 commits into
oxidecomputer:mainfrom
jmpesp:parallel_flush

Conversation

@jmpesp

@jmpesp jmpesp commented Jun 14, 2023

Copy link
Copy Markdown
Contributor

First, move the concept of whether an Extent is opened or closed out of Extent by removing the Option around Mutex:

-    inner: Option<Mutex<Inner>>,
+    pub inner: Mutex<Inner>,

This moves up to Region as a new enum:

-    pub extents: Vec<Extent>,
+    pub extents: Vec<Arc<Mutex<ExtentState>>>,

where ExtentState uses the variant to store if the extent is opened or closed:

+pub enum ExtentState {
+    Opened(Arc<Extent>),
+    Closed,
+}

This is all in service of fn region_flush issuing a tokio::spawn for each extent.flush call, and waiting on each join handle that is returned. This is possible when an Arc<Extent> is used because a clone can be passed to each task.

Using an enum in this way is also a little safer: Region::close_extent actually consumes the old extent while closing it, meaning that the old Extent no longer hangs around. It does however mean that the downstairs will panic if the ExtentState is not expected: both get_opened_extent and close_extent panic if the extent in question is closed. close_extent also panics if you're trying to close an extent where there's an Arc clone somewhere as part of doing some other work.

Note this work required updating to Rust 1.70 (for Arc::into_inner). There's a few unrelated changes due to this.

First, move the concept of whether an Extent is opened or closed *out*
of Extent by removing the Option around Mutex<Inner>:

    -    inner: Option<Mutex<Inner>>,
    +    pub inner: Mutex<Inner>,

This moves up to Region as a new enum:

    -    pub extents: Vec<Extent>,
    +    pub extents: Vec<Arc<Mutex<ExtentState>>>,

where ExtentState uses the variant to store if the extent is opened or
closed:

    +pub enum ExtentState {
    +    Opened(Arc<Extent>),
    +    Closed,
    +}

This is all in service of `fn region_flush` issuing a `tokio::spawn` for
each `extent.flush` call, and waiting on each join handle that is
returned. This is possible when an `Arc<Extent>` is used because a clone
can be passed to each task.

Using an enum in this way is also a little safer: `Region::close_extent`
actually consumes the old extent while closing it, meaning that the old
Extent no longer hangs around. It does however mean that the downstairs
will panic if the ExtentState is not expected: both `get_opened_extent`
and `close_extent` panic if the extent in question is closed.
`close_extent` also panics if you're trying to close an extent where
there's an Arc clone somewhere as part of doing some other work.

Note this work required updating to Rust 1.70 (for `Arc::into_inner`).
There's a few unrelated changes due to this.
@jmpesp jmpesp requested review from faithanalog and leftwo June 14, 2023 20:56

@leftwo leftwo left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should run the ./tools/test_repair.sh and ./tools/test_restart_repair.sh tests, and possibly bump their internal loop counters and let them run for longer.

Comment thread downstairs/src/region.rs
Comment thread downstairs/src/region.rs
Comment thread downstairs/src/repair.rs
Comment thread integration_tests/src/lib.rs
Comment thread rust-toolchain.toml
@@ -1,3 +1,3 @@
[toolchain]
channel = "1.66"
channel = "1.70"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we bump this, do we need it bumped anywhere else in Omicron/Propolis?
I don't think so, as the downstairs itself is standalone, but not sure about any
other consumers of the upstairs

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure, leaning towards yes - I can't see how (for example) Propolis would be able to compile this in as a library without support for Arc::into_inner. We'll see when that rev gets bumped I guess :)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might want to try to bump propolis and see if it compiles. If there is some entanglement
and it turns out to be a huge amount of work, we might consider holding this fix till we can
get propolis ready for us. I'm sure we will have other changes we want in Crucible and I want
to be able to update omicron/propolis frequently over the next few weeks.

@faithanalog faithanalog Jun 17, 2023

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i was compiling latest propolis against this commit to run my perf tests fwiw and didnt run into any issues. though i did not like, run the propolis test suite or anything like that. just launched an instance and ran fio in it.

Comment thread downstairs/src/region.rs
Comment thread integration_tests/src/lib.rs Outdated
assert_eq!(res, ReplaceResult::Started);

eprintln!("Wait for some repair work to proceed");
tokio::time::sleep(tokio::time::Duration::from_secs(4)).await;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been pulling out hair tracking down failing CI tests, and the sleep here makes me
nervous. Is there any way to make this deterministic? Like checking for some job progress but
not completed repair yet? On a busy system or a fast system, could this sleep be too little
or too much?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I changed this to check for the StartedAlready and panic if the repair completed in 9994e79

@leftwo

leftwo commented Jun 30, 2023

Copy link
Copy Markdown
Contributor

I ran
tools/test_reconnect.sh <- this test is broken now, and I'll be turning it off soon.
tools/test_restart_repair.sh
tools/test_repaly.sh (new, coming soon replacement for test_reconnect)
tools/test_repair.sh
tools/test_live_repair.sh
tools/hammer_loop.sh

All passed. I upped the loop count for each to get some extended runtime.

Comment thread downstairs/src/region.rs Outdated
for join_handle in join_handles {
join_handle
.await
.map_err(|e| CrucibleError::GenericError(e.to_string()))??;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we error here, where does this bubble out to?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if any of the flushes fails, then it will get returned as a failure here for region_flush. Though, see 53c45fb for the fix for an issue you brought up during a chat: we should wait for all results before returning, instead of (potentially) cancelling these flush tasks in the middle of execution if one of them errors.

@leftwo leftwo left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just that one question

@jmpesp jmpesp merged commit 591ed5b into oxidecomputer:main Jul 5, 2023
@jmpesp jmpesp deleted the parallel_flush branch July 5, 2023 14:24
jmpesp added a commit to jmpesp/crucible that referenced this pull request Jul 5, 2023
jmpesp added a commit to jmpesp/omicron that referenced this pull request Jul 7, 2023
Bump crucible rev to pick up:

- Self assembling zones must set their own MTU
- fix build break caused by merging oxidecomputer/crucible#801
- Issue extent flushes in parallel
- Add Logger Option to Volume construct method
- Update Rust crate libc to 0.2.147
- Update Rust crate percent-encoding to 2.3
- Retry jobs until they succeed
- Reorder select arms so pings can't be starved out
- Treat a skipped IO like an error IO for ACK results.
- Retry pantry requests
- Remove panics and asserts in dummy tests
- Update Rust crate csv to 1.2.2
- Update Rust crate reedline to 0.21.0
- Set open file resource limit to the max
- Update Rust crate ringbuffer to 0.14
- DTrace meet cmon
- Widen assert values to u128 to deal with u64::MAX
- Change size_to_validate from usize to u64
- Turn on live-repair test in CI
- Increase flush_timeout for some tests, collect cores
- Update to latest dropshot

Bump propolis rev to pick up:

- Self assembling zones must set their own MTU
- Bump crucible rev to latest
- Make the propolis zone self-assembling
- Flesh out more PIIX3-PM to suppress log gripes
- Bump crucible rev to latest
- Restructure PM-timer under PIIX3 device
- Fix inventory handling for nested child entities
- Centralize vmm-data interface into bhyve_api
- Clean up PCI device classes
- Update openssl dep to 0.10.55
- Allow propolis-standalone to use VMM reservoir
- only allow one request to reboot to be enqueued at a time

Nexus is currently setting the MTU inside self-assembling zones, but
this goes against the idea that self-assembling zones perform their own
configuration. Remove the code in `RunningZone::boot` that performs
commands on self-assembling zones, and set the MTU of $DATALINK in the
Clickhouse and CockroachDB method scripts.

Fixes oxidecomputer#3512
jmpesp added a commit to jmpesp/omicron that referenced this pull request Jul 7, 2023
Bump crucible rev to pick up:

- Self assembling zones must set their own MTU
- fix build break caused by merging oxidecomputer/crucible#801
- Issue extent flushes in parallel
- Add Logger Option to Volume construct method
- Update Rust crate libc to 0.2.147
- Update Rust crate percent-encoding to 2.3
- Retry jobs until they succeed
- Reorder select arms so pings can't be starved out
- Treat a skipped IO like an error IO for ACK results.
- Retry pantry requests
- Remove panics and asserts in dummy tests
- Update Rust crate csv to 1.2.2
- Update Rust crate reedline to 0.21.0
- Set open file resource limit to the max
- Update Rust crate ringbuffer to 0.14
- DTrace meet cmon
- Widen assert values to u128 to deal with u64::MAX
- Change size_to_validate from usize to u64
- Turn on live-repair test in CI
- Increase flush_timeout for some tests, collect cores
- Update to latest dropshot

Bump propolis rev to pick up:

- Self assembling zones must set their own MTU
- Bump crucible rev to latest
- Make the propolis zone self-assembling
- Flesh out more PIIX3-PM to suppress log gripes
- Bump crucible rev to latest
- Restructure PM-timer under PIIX3 device
- Fix inventory handling for nested child entities
- Centralize vmm-data interface into bhyve_api
- Clean up PCI device classes
- Update openssl dep to 0.10.55
- Allow propolis-standalone to use VMM reservoir
- only allow one request to reboot to be enqueued at a time

Nexus is currently setting the MTU inside self-assembling zones, but
this goes against the idea that self-assembling zones perform their own
configuration. Remove the code in `RunningZone::boot` that performs
commands on self-assembling zones, and set the MTU of $DATALINK in the
Clickhouse and CockroachDB method scripts.

Fixes oxidecomputer#3512
jmpesp added a commit to oxidecomputer/omicron that referenced this pull request Jul 7, 2023
Bump crucible rev to pick up:

- Self assembling zones must set their own MTU
- fix build break caused by merging oxidecomputer/crucible#801
- Issue extent flushes in parallel
- Add Logger Option to Volume construct method
- Update Rust crate libc to 0.2.147
- Update Rust crate percent-encoding to 2.3
- Retry jobs until they succeed
- Reorder select arms so pings can't be starved out
- Treat a skipped IO like an error IO for ACK results.
- Retry pantry requests
- Remove panics and asserts in dummy tests
- Update Rust crate csv to 1.2.2
- Update Rust crate reedline to 0.21.0
- Set open file resource limit to the max
- Update Rust crate ringbuffer to 0.14
- DTrace meet cmon
- Widen assert values to u128 to deal with u64::MAX
- Change size_to_validate from usize to u64
- Turn on live-repair test in CI
- Increase flush_timeout for some tests, collect cores
- Update to latest dropshot

Bump propolis rev to pick up:

- Self assembling zones must set their own MTU
- Bump crucible rev to latest
- Make the propolis zone self-assembling
- Flesh out more PIIX3-PM to suppress log gripes
- Bump crucible rev to latest
- Restructure PM-timer under PIIX3 device
- Fix inventory handling for nested child entities
- Centralize vmm-data interface into bhyve_api
- Clean up PCI device classes
- Update openssl dep to 0.10.55
- Allow propolis-standalone to use VMM reservoir
- only allow one request to reboot to be enqueued at a time

Nexus is currently setting the MTU inside self-assembling zones, but
this goes against the idea that self-assembling zones perform their own
configuration. Remove the code in `RunningZone::boot` that performs
commands on self-assembling zones, and set the MTU of $DATALINK in the
Clickhouse and CockroachDB method scripts.

Fixes #3512
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants