Skip to content

Allow customization of Parallelstore mounts#3144

Merged
harshthakkar01 merged 1 commit into
GoogleCloudPlatform:developfrom
wiktorn:template_pstore_mount
Dec 24, 2024
Merged

Allow customization of Parallelstore mounts#3144
harshthakkar01 merged 1 commit into
GoogleCloudPlatform:developfrom
wiktorn:template_pstore_mount

Conversation

@wiktorn

@wiktorn wiktorn commented Oct 17, 2024

Copy link
Copy Markdown
Contributor

Add two customization options:

  • daos_agent_config - to allow providing additional configuration in /etc/daos/daos_agent.yml
  • dfuse_environment - to allow providing environment variables for dfuse process

Topics for discussion:

  • whether using templates is proper approach here (it gives a lot of flexibility)
  • support for whitespaces in mount points (currently not supported, not sure if it was supported earlier)
  • handling of /etc/daos/daos_agent.yml - currently anyway everything is commented, and we just uncomment specific lines with sed. The other approach is either to create a new file (as implemnted), or just make sure that everything is commented, and just add configuration at the end. I think this aproach is cleaner in terms of what is expected result / file content.
  • usage of systemctl to start dfuse on first call. I did not implemented as such for now, as I didn't find a good way to fail systemctl start command if dfuse failed to start and this is current behavior
  • Adding Restart=always to systemd unit. With my testing, one needs to run first umount $mount_point, but if this indeed is necessary, this could be wrapped in script together with $mount_command, and this would make sure, that even if dfuse crashes, it will restart

TODO

  • align those changes with modules/file-system/pre-existing-network-storage

This PR also addresses bug in startup-script module,

startup-script: Mon Dec 16 08:44:43 -0500 2024 Info [4589]: === start executing runner: early_run_hotfixes.sh-b383 ===
startup-script: /slurm/custom_scripts/controller.d/ghpc_daos_mount.sh: line 373: .//tmp/tmp.g1yoDyA6T5/early_run_hotfixes.sh: No such file or directory
startup-script: Mon Dec 16 08:44:43 -0500 2024 Info [4589]: === early_run_hotfixes.sh-b383 finished with exit_code=127 ===
startup-script: Mon Dec 16 08:44:43 -0500 2024 Error [4589]: === execution of early_run_hotfixes.sh-b383 failed, exiting ===


If re-running the script, using for example: DEBUG=1 google_metadata_script_runner startup

@harshthakkar01 harshthakkar01 self-assigned this Oct 21, 2024
Comment thread modules/file-system/parallelstore/variables.tf
@harshthakkar01

Copy link
Copy Markdown
Contributor

Adding Restart=always to systemd unit. With my testing, one needs to run first umount $mount_point, but if this indeed is necessary, this could be wrapped in script together with $mount_command, and this would make sure, that even if dfuse crashes, it will restart

In this case, if you enable systemd unit, it will run for every restart (this is the current behavior, wdyt ?). For the case, you mentioned, I also observed you need to umount it first.

@harshthakkar01

Copy link
Copy Markdown
Contributor

usage of systemctl to start dfuse on first call. I did not implemented as such for now, as I didn't find a good way to fail systemctl start command if dfuse failed to start and this is current behavior

Current behavior we dont use systemd unit to mount the instance. We manually mount it using bash and enable the service to be run for every restart.

@harshthakkar01 harshthakkar01 added the release-improvements Added to release notes under the "Improvements" heading. label Nov 12, 2024
@harshthakkar01 harshthakkar01 marked this pull request as ready for review November 12, 2024 19:46
@tpdownes

Copy link
Copy Markdown
Contributor

Adding Restart=always to systemd unit. With my testing, one needs to run first umount $mount_point, but if this indeed is necessary, this could be wrapped in script together with $mount_command, and this would make sure, that even if dfuse crashes, it will restart

In this case, if you enable systemd unit, it will run for every restart (this is the current behavior, wdyt ?). For the case, you mentioned, I also observed you need to umount it first.

I strongly recommend Restart=on-failure for almost any SystemD unit unless always provides some benefit you can clearly articulate. I believe always isn't allowed for oneshots.

@wiktorn

wiktorn commented Nov 13, 2024

Copy link
Copy Markdown
Contributor Author

Adding Restart=always to systemd unit. With my testing, one needs to run first umount $mount_point, but if this indeed is necessary, this could be wrapped in script together with $mount_command, and this would make sure, that even if dfuse crashes, it will restart

In this case, if you enable systemd unit, it will run for every restart (this is the current behavior, wdyt ?). For the case, you mentioned, I also observed you need to umount it first.

I strongly recommend Restart=on-failure for almost any SystemD unit unless always provides some benefit you can clearly articulate. I believe always isn't allowed for oneshots.

My idea was to run dfuse in foreground and do not make this a oneshoot, so systemd can monitor the process and if it crashes (what may happen), and whatever the reason it will be for it to exit, it systemd will restart it.

The only thing to do is to instead call directly the dfuse binary, call a script, which will call unmount if needed.

@wiktorn

wiktorn commented Nov 13, 2024

Copy link
Copy Markdown
Contributor Author

usage of systemctl to start dfuse on first call. I did not implemented as such for now, as I didn't find a good way to fail systemctl start command if dfuse failed to start and this is current behavior

Current behavior we dont use systemd unit to mount the instance. We manually mount it using bash and enable the service to be run for every restart.

Using systemd always would simplify the code, as there would only one way to mount the filesystem, whether this is the first startup of the instance, or the next one. The only drawback is that I did not find a way, to fail systemctl enable --now (or similar command) when service did not start successfully.

I can add a check after the call, with some sleeps to check if either service started successfully or the mountpoint is mounted and fail the startup script if those checks fail. WDYT?

@mr0re1

mr0re1 commented Nov 15, 2024

Copy link
Copy Markdown
Collaborator

/gcbrun

@wiktorn wiktorn marked this pull request as draft November 15, 2024 21:35
@harshthakkar01

harshthakkar01 commented Nov 15, 2024

Copy link
Copy Markdown
Contributor

You may need to rebase this after #3256 (this one was high priorty). I have discussed with Ivan and he will follow up with on this as I am ooo for 3 weeks from today.

Btw, we agreed on moving forward with adding customization for mounts. Thanks for this PR.

@wiktorn wiktorn force-pushed the template_pstore_mount branch from 0d3506a to 87f6475 Compare November 16, 2024 17:44
@wiktorn wiktorn marked this pull request as ready for review November 16, 2024 17:44
@wiktorn

wiktorn commented Nov 16, 2024

Copy link
Copy Markdown
Contributor Author

You may need to rebase this after #3256 (this one was high priorty). I have discussed with Ivan and he will follow up with on this as I am ooo for 3 weeks from today.

Btw, we agreed on moving forward with adding customization for mounts. Thanks for this PR.

Thanks for letting me know. For now, I rebased this PR on top of #3256, once it is in, I can skip a few commits from here.

I aligned now pre-existing-network-storage with parallelstore in terms of scripts and APIs and fixed some smaller issues I faced (like installation failing on RHEL 8, but started to work well after dnf clean all 🤷 )

@harshthakkar01

Copy link
Copy Markdown
Contributor

Please resolve conflicts

@wiktorn wiktorn force-pushed the template_pstore_mount branch from 87f6475 to bcb9a91 Compare December 20, 2024 14:55
@wiktorn

wiktorn commented Dec 20, 2024

Copy link
Copy Markdown
Contributor Author

Conflicts resolved.

@wiktorn wiktorn force-pushed the template_pstore_mount branch from bcb9a91 to 850c370 Compare December 20, 2024 16:08
@harshthakkar01

Copy link
Copy Markdown
Contributor

/gcbrun

1 similar comment
@harshthakkar01

Copy link
Copy Markdown
Contributor

/gcbrun

Comment thread modules/file-system/parallelstore/templates/mount-daos.sh.tftpl Outdated
@wiktorn wiktorn force-pushed the template_pstore_mount branch 2 times, most recently from 1c1a045 to f9a92b2 Compare December 23, 2024 17:27
@wiktorn wiktorn force-pushed the template_pstore_mount branch from f9a92b2 to d76cfe2 Compare December 23, 2024 20:17
@harshthakkar01

Copy link
Copy Markdown
Contributor

/gcbrun

@harshthakkar01 harshthakkar01 merged commit 2933f6f into GoogleCloudPlatform:develop Dec 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

release-improvements Added to release notes under the "Improvements" heading.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants