ignition: Fix PXE by using mount namespace rather than remounting RO #103

Merged
chewi merged 1 commit into flatcar-master from chewi/ignition-remount
Feb 28, 2025

@chewi
Contributor

@chewi chewi commented Feb 27, 2025

Remounting read-only was failing due to systemd v256 holding a read-write file descriptor on systemd-executor. We don't need to do this inside a namespace. We don't need to unmount the OEM partition either.

Jenkins is all green. I included Equinix Metal on amd64 and arm64.
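
As a rough illustration of the mount-namespace idea (a sketch only, not the change made in this PR): mounts created inside a private mount namespace are invisible to the rest of the system and go away when the last process in the namespace exits, so nothing has to be unmounted afterwards. The helper name `in_private_mounts` is hypothetical; `unshare --mount` comes from util-linux and needs root or CAP_SYS_ADMIN.

```shell
#!/bin/sh
# Hypothetical helper (not taken from the PR): run a command in a new
# mount namespace using unshare(1) from util-linux.
in_private_mounts() {
    unshare --mount -- "$@"
}

# Identity of the mount namespace the current shell lives in.
outer_ns="$(readlink /proc/self/ns/mnt 2>/dev/null || true)"

# Ask a child process for its namespace identity from inside the new namespace.
if inner_ns="$(in_private_mounts readlink /proc/self/ns/mnt 2>/dev/null)" \
    && [ -n "${inner_ns}" ] && [ "${inner_ns}" != "${outer_ns}" ]; then
    ns_status="isolated"
else
    # unshare is missing, not permitted, or /proc is not mounted.
    ns_status="unavailable"
fi
echo "mount namespace: ${ns_status}"
```

Where unshare is permitted, the two namespace identities differ and the script reports `isolated`; without the required privileges it reports `unavailable` instead of failing.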

Remounting read-only was failing due to systemd v256 holding a
read-write file descriptor on systemd-executor. We don't need to do this
inside a namespace. We don't need to unmount the OEM partition either.

Co-authored-by: Jeremi Piotrowski <jpiotrowski@microsoft.com>
Signed-off-by: James Le Cuirot <jlecuirot@microsoft.com>
@chewi chewi requested a review from a team February 27, 2025 17:06
@chewi chewi self-assigned this Feb 27, 2025
@ader1990
Contributor

Hello, can you share the bug link or description that this PR fixes?

@chewi
Contributor Author

chewi commented Feb 28, 2025

It was only discussed on Matrix. The above basically sums it up though. This went unnoticed initially because we don't run tests on EM by default, and it's known to be flaky sometimes.

@ader1990
Contributor

> It was only discussed on Matrix. The above basically sums it up though. This went unnoticed initially because we don't run tests on EM by default, and it's known to be flaky sometimes.

Can you please provide the context and the actual error logs / files for this here? A PR needs to link to an issue or include a clear description of the issue along with the error logs.

@chewi
Contributor Author

chewi commented Feb 28, 2025

Both amd64 and arm64 CI tests totally failed. The logs were full of errors like this:

04:50:13  2025-02-25T04:50:12Z platform/api/equinixmetal: 7987be91-f02c-4a6b-a580-04b280bcbdd5 console session failed: Process exited with status 1
05:38:34  2025-02-25T05:38:27Z platform/machine/equinixmetal: Retrying to provision a machine after error: "timed out waiting for flatcar-install: dial tcp 147.75.203.3:9: i/o timeout"
05:38:42  2025-02-25T05:38:41Z platform/api/equinixmetal: abddd3ff-d46d-4c2b-b314-7b9db5a5c66f console session failed: Process exited with status 1
06:27:02  2025-02-25T06:26:56Z platform/machine/equinixmetal: Retrying to provision a machine after error: "timed out waiting for flatcar-install: dial tcp 147.75.90.1:9: i/o timeout"

Logging in via the serial console showed that it was stuck in the emergency shell having failed to boot. This was due to ignition-setup.service failing, which showed this:

:/root# systemctl --no-pager -o cat status ignition-setup
× ignition-setup.service - Ignition (setup)
     Loaded: loaded (/run/systemd/generator/ignition-setup.service; generated)
     Active: failed (Result: exit-code) since Tue 2025-02-25 16:10:40 UTC; 5min ago
 Invocation: f24e87129324420aa82ca1c79c718092
    Process: 481 ExecStart=/usr/sbin/ignition-setup pxe (code=exited, status=32)
   Main PID: 481 (code=exited, status=32)

Starting ignition-setup.service - Ignition (setup)...
ignition-setup.service: Main process exited, code=exited, status=32/n/a
mount: /usr: mount point is busy.
       dmesg(1) may have more information after failed mount system call.
ignition-setup.service: Failed with result 'exit-code'.
Failed to start ignition-setup.service - Ignition (setup).
ignition-setup.service: Triggering OnFailure= dependencies.

Running lsof via the mounted /sysroot/usr showed a read-write file descriptor within /usr, indicated by the u.

COMMAND   PID USER  FD   TYPE DEVICE SIZE/OFF NODE NAME
systemd     1 root   9u   REG    0,2   137520 1232 /usr/lib/systemd/systemd-executor
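
For reference, the mode letter after the descriptor number in lsof's FD column is `r` for read, `w` for write, and `u` for read and write. A minimal sketch that picks out writable descriptors from lsof-style output, using the line above as sample input (the function name `writable_fds` is made up for illustration, and paths containing spaces are not handled):

```shell
#!/bin/sh
# Sample copied from the lsof output above; on a live system this text
# would come from lsof itself.
sample='COMMAND   PID USER  FD   TYPE DEVICE SIZE/OFF NODE NAME
systemd     1 root   9u   REG    0,2   137520 1232 /usr/lib/systemd/systemd-executor'

# Print "PID FD NAME" for every numeric descriptor whose mode letter is
# u (read and write) or w (write only). The header line is skipped.
writable_fds() {
    awk 'NR > 1 && $4 ~ /^[0-9]+[uw]/ { print $2, $4, $NF }'
}

result="$(printf '%s\n' "${sample}" | writable_fds)"
printf '%s\n' "${result}"
```

For the sample this prints `1 9u /usr/lib/systemd/systemd-executor`, matching the descriptor that kept /usr busy.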

Member

@krnowak krnowak left a comment

That's a neat fix, especially since the unit does not perform any mounts that need to persist after the process has finished.

Aside from that, I'm wondering how exactly this worked in the "normal" case, where the later `trap 'retry-umount "${src}"' EXIT` clobbers the earlier `trap 'mount -o remount,ro /usr' EXIT`. Also, without the clobbering, the same issue would probably arise on other platforms.
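
The clobbering is easy to reproduce in isolation: the shell keeps only one EXIT trap, so a second `trap ... EXIT` replaces the first rather than stacking on top of it. A minimal sketch, with placeholder echo handlers standing in for the real `mount` and `retry-umount` commands:

```shell
#!/bin/sh
# A second EXIT trap replaces the first one; the handlers do not stack.
clobbered="$(
    trap 'echo "first handler"' EXIT
    trap 'echo "second handler"' EXIT
    exit 0
)"
printf 'clobbered: %s\n' "${clobbered}"

# One way to keep both steps: a single handler that runs every cleanup
# action. cleanup() and its echo lines are placeholders, not code from
# ignition-setup.
combined="$(
    cleanup() {
        echo "first handler"
        echo "second handler"
    }
    trap cleanup EXIT
    exit 0
)"
printf 'combined: %s\n' "${combined}"
```

The first subshell prints only `second handler`, which is why the earlier remount trap never ran in the code paths that set a later trap.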

Also, it looks like /usr was remounted read-only, but at a later stage? I don't remember seeing problems with a writable /usr.

@chewi
Contributor Author

chewi commented Feb 28, 2025

Very good points! I had wondered why it only affected PXE, and that totally explains it.
