Avoid task cancellation on service init failure#3707
Merged
Conversation
luqmana
approved these changes
Jul 19, 2023
|
|
||
| // We initialize all the zones we can, but only return the first error. | ||
| let local_existing_zones = Arc::new(Mutex::new(existing_zones)); | ||
| let first_err = Arc::new(Mutex::new(None)); |
Contributor
There was a problem hiding this comment.
Not a blocker, but as written shouldn't this be last_err?
Collaborator
Author
There was a problem hiding this comment.
Done in #3709 , pulled it out to not block the build for folks MUPdating
Contributor
|
The try_ methods seem pretty difficult to use correctly 😬 |
luqmana
added a commit
that referenced
this pull request
Jul 19, 2023
Roughly 3 changes: 1. Skip initializing any zones we already know are running (both from sled-agent's perspective and as reflected via `zoneadm`) 2. Log any errors `service_put` would return on the server-side. 3. Bump file-based zpools to 15G Lack of space was causing service initialization to fail on single-machine deployments using the `create_virtual_hardware` script. (3) should fix this. Because of `ENOSPC` errors, we were running into some future cancellation issues which #3707 addressed. But actually determining the underlying cause was a bit difficult as `RSS` was timing out (*) on the `services_put` call and so never got back the error. (2) should hopefully make this easier to catch in the future, e.g.: ``` 04:12:31.312Z ERRO SledAgent: failed to init services: Failed to install zone 'oxz_clickhouse_32ce0d6f-38fa-41e7-b699-f0af4f4f1127' from '/opt/oxide/clickhouse.tar.gz': Failed to execute zoneadm command ' Install' for zone 'oxz_clickhouse_32ce0d6f-38fa-41e7-b699-f0af4f4f1127': Failed to parse command output: exit code 1 stdout: A ZFS file system has been created for this zone. INFO: omicron: installing zone oxz_clickhouse_32ce0d6f-38fa-41e7-b699-f0af4f4f1127 @ "/pool/ext/24b4dc87-ab46-49fb-a4b4-d361ae214c03/crypt/zone/oxz_clickhouse_32ce0d6f-38fa-41e7-b699-f0af4f4f1127"... INFO: omicron: replicating /usr tree... INFO: omicron: replicating /lib tree... INFO: omicron: replicating /sbin tree... INFO: omicron: pruning SMF manifests... INFO: omicron: pruning global-only files... INFO: omicron: unpacking baseline archive... INFO: omicron: unpacking image "/opt/oxide/clickhouse.tar.gz"... stderr: Error: No space left on device (os error 28) sled_id = 8fd66238-6bb3-4f08-89bf-f88fc2320d83 ``` (*) speaking of which, do we want to increase that timeout from just 60s? it's not like any subsequent requests will work considering they'll be blocked on a lock from the initial request. and in failure cases like this the subsequent requests will timeout as well, leaving a bunch of tasks in sled-agent all waiting on the same lock to try initializing. might be worth switching to the single processing task model rack/sled initialization use. Finally, (1) is to address the case where we try to initialize a zone that's already running. That's sometimes fine as we'll eventually realize the zone already exists. But in the case of a zone with an OPTE port, we'll run into errors like so: ``` error_message_internal = Failed to create OPTE port for service nexus: Failure interacting with the OPTE ioctl(2) interface: command CreateXde failed: MacExists { port: "opte12", vni: Vni { inner: 100 }, mac: MacAddr { inner: A8:40:25:FF:EC:09 } } ``` This happens because the [check](https://github.com/oxidecomputer/omicron/blob/ebd3db27f6bbd08af3194cae84b0efc6b6e7248d/illumos-utils/src/zone.rs#L282C27-L282C27) for "is zone running" comes after we do all the work necessary to create the zone (e.g. creating an opte port). (1) updates `initialize_services_locked` to cross-reference sled-agent and the system's view of what zones are running so that we don't try to do the unnecessary work. --------- Co-authored-by: Sean Klein <sean@oxide.computer>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
We have seen some behavior where, after failing to initialize services, the sled agent enters "split-brain" behavior, and on subsequent requests to initialize services, the sled agent is not aware of zones which it created itself.
I believe this to be a possible cause: While integrating #3676 , I used
try_for_each_concurrentto parallelize zone bringup. This is an operation which can potentially return early on cancellation, dropping all other ongoing zone initialization tasks.This PR mitigates this issue by preventing concurrent drop within service initialization.