Add multi-mount parallelstore support #3256

Merged
tpdownes merged 2 commits into GoogleCloudPlatform:develop from harshthakkar01:ps-fix-2
Dec 5, 2024

@harshthakkar01 harshthakkar01 commented Nov 13, 2024

Contributor

This PR:

  • Excludes GPU interfaces in the DAOS config file
  • Adds support for multiple mounts of a single Parallelstore instance (creates a systemd service for each mount)
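
As a rough illustration of the "one systemd service per mount" idea, the sketch below derives a distinct unit name from each local mount path. It is not taken from mount-daos.sh: the `mount-daos` prefix and both function names are hypothetical, and the escaping only approximates what `systemd-escape --path` does for simple absolute paths.

```python
def systemd_escape_path(path: str) -> str:
    """Approximate `systemd-escape --path` for simple absolute paths."""
    # Drop empty components so leading, trailing, and doubled slashes vanish.
    parts = [p for p in path.split("/") if p]
    if not parts:
        return "-"  # the root path escapes to a single dash
    trimmed = "/".join(parts)
    out = []
    for i, ch in enumerate(trimmed):
        if ch == "/":
            out.append("-")
        elif ch.isascii() and (ch.isalnum() or ch in ":_" or (ch == "." and i > 0)):
            out.append(ch)
        else:
            # Everything else (including a literal "-") becomes \xNN per byte.
            out.extend(f"\\x{b:02x}" for b in ch.encode("utf-8"))
    return "".join(out)


def unit_name_for_mount(local_mount: str, prefix: str = "mount-daos") -> str:
    """Build a per-mount unit name; the prefix is a hypothetical choice."""
    return f"{prefix}-{systemd_escape_path(local_mount)}.service"


if __name__ == "__main__":
    for mount in ("/parallelstore/rwx", "/parallelstore/rwo"):
        print(unit_name_for_mount(mount))
```

Keying the unit name on the mount path means two mounts of the same instance never collide on a service name, which is the property the second bullet relies on.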

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

  • Fork your PR branch from the Toolkit "develop" branch (not main)
  • Test all changes with pre-commit in a local branch
  • Confirm that "make tests" passes all tests
  • Add or modify unit tests to cover code changes
  • Ensure that unit test coverage remains above 80%
  • Update all applicable documentation
  • Follow Cluster Toolkit Contribution guidelines

@harshthakkar01 harshthakkar01 added the release-improvements label (Added to release notes under the "Improvements" heading.) Nov 13, 2024
@harshthakkar01 harshthakkar01 force-pushed the ps-fix-2 branch 4 times, most recently from 727d580 to f607f80 November 15, 2024 06:50
@harshthakkar01 harshthakkar01 changed the title from "Update mount parallelstore script to support multiple parallelstore" to "Add multi-mount parallelstore support" Nov 15, 2024
Comment thread modules/file-system/parallelstore/scripts/mount-daos.sh

@tpdownes tpdownes left a comment

Contributor


I would like to know a bit more about the attempt to support multiple Parallelstore instances. In the meantime, the changes I suggest will improve reliability.

Comment thread modules/file-system/parallelstore/scripts/mount-daos.sh Outdated
Comment thread modules/file-system/parallelstore/scripts/mount-daos.sh Outdated
Comment thread modules/file-system/parallelstore/scripts/mount-daos.sh
Comment thread modules/file-system/parallelstore/scripts/mount-daos.sh Outdated
Comment thread modules/file-system/parallelstore/scripts/mount-daos.sh Outdated
@tpdownes tpdownes assigned harshthakkar01 and unassigned tpdownes Nov 18, 2024
@mr0re1 mr0re1 assigned mr0re1 and unassigned harshthakkar01 Nov 19, 2024
@tpdownes tpdownes assigned tpdownes and unassigned mr0re1 Nov 20, 2024
Comment thread modules/file-system/pre-existing-network-storage/scripts/mount-daos.sh Outdated
tpdownes added a commit to harshthakkar01/hpc-toolkit that referenced this pull request Dec 3, 2024
TESTED:
- simple Debian and Ubuntu VMs with one NIC

TODO:
- rewrite find command to address 2 gVNIC?
- fix quoting of ignored interfaces
TESTED:
- simple Debian and Ubuntu VMs with one NIC
- a3-megagpu-8g Ubuntu and HPC Rocky 8
tpdownes commented Dec 5, 2024

Contributor

In addition to the standard tests I tested against this blueprint:

---
blueprint_name: test-ps

vars:
  deployment_name: test-ps
  project_id: hpc-toolkit-gsc
  region: us-central1
  zone: us-central1-c
  parallelstore_ips: "[10.80.175.133,10.80.175.132,10.80.175.130]"

deployment_groups:
- group: primary
  modules:

  - id: network
    source: modules/network/pre-existing-vpc
    settings:
      network_name: a3mega-sys-net
      subnetwork_name: a3mega-sys-subnet

  - id: gpunet
    source: modules/network/pre-existing-vpc
    settings:
      network_name: a3mega-cluster-dev-gpunet-0
      subnetwork_name: a3mega-cluster-dev-gpunet-0-subnet

  - id: parallelstore-rwx
    source: modules/file-system/pre-existing-network-storage
    settings:
      fs_type: daos
      remote_mount: $(vars.parallelstore_ips)
      local_mount: /parallelstore/rwx
      mount_options: disable-caching,thread-count=26,eq-count=13,multi-user

  - id: parallelstore-rwo
    source: modules/file-system/pre-existing-network-storage
    settings:
      fs_type: daos
      remote_mount: $(vars.parallelstore_ips)
      local_mount: /parallelstore/rwo
      mount_options: disable-wb-cache,thread-count=26,eq-count=13,multi-user

  - id: vm
    source: modules/compute/vm-instance
    use:
    - parallelstore-rwo
    - parallelstore-rwx
    settings:
      machine_type: n2-standard-8
      name_prefix: id
      disk_type: pd-ssd
      network_interfaces:
      - network: null
        subnetwork: $(network.subnetwork_self_link)
        subnetwork_project: null
        network_ip: null
        stack_type: null
        access_config: []
        ipv6_access_config: []
        alias_ip_range: []
        queue_count: null
        nic_type: GVNIC
      - network: null
        subnetwork: $(gpunet.subnetwork_self_link)
        subnetwork_project: null
        network_ip: null
        stack_type: null
        access_config: []
        ipv6_access_config: []
        alias_ip_range: []
        queue_count: null
        nic_type: GVNIC

and observed the expected outcome:

exclude_fabric_ifaces: ["lo","eth1"]
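
For readers unfamiliar with the setting, here is a minimal sketch of how such a line could be produced. It assumes the VM exposes `lo`, `eth0`, and `eth1`, and that `eth0` (the interface on the system subnetwork) is the one DAOS should keep; the name `eth0` and the function `exclude_line` are illustrative assumptions, and the actual script may discover interfaces differently.

```python
import json


def exclude_line(all_ifaces, fabric_iface):
    """Render an exclude_fabric_ifaces line listing every interface except one."""
    if fabric_iface not in all_ifaces:
        raise ValueError(f"{fabric_iface!r} is not among {all_ifaces!r}")
    excluded = [name for name in all_ifaces if name != fabric_iface]
    # Compact separators reproduce the no-space list style shown above.
    return "exclude_fabric_ifaces: " + json.dumps(excluded, separators=(",", ":"))


if __name__ == "__main__":
    # Assumed interface set for the two-NIC test VM.
    print(exclude_line(["lo", "eth0", "eth1"], "eth0"))
    # prints: exclude_fabric_ifaces: ["lo","eth1"]
```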

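The two modules in the blueprint differ only in `local_mount` and `mount_options`. As a sketch of how a comma-separated `mount_options` string could map onto dfuse-style long flags, assuming each option is passed through unchanged with a `--` prefix (an assumption about the script, not something confirmed in this thread; `dfuse_flags` is a hypothetical helper):

```python
from typing import List


def dfuse_flags(mount_options: str) -> List[str]:
    """Turn 'a,b=1' into ['--a', '--b=1'], skipping empty entries."""
    return [f"--{opt.strip()}" for opt in mount_options.split(",") if opt.strip()]


if __name__ == "__main__":
    rwx = "disable-caching,thread-count=26,eq-count=13,multi-user"
    rwo = "disable-wb-cache,thread-count=26,eq-count=13,multi-user"
    print(" ".join(dfuse_flags(rwx)))
    print(" ".join(dfuse_flags(rwo)))
```

Under that assumption, the two mounts would differ only in their caching flag (`--disable-caching` versus `--disable-wb-cache`).
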
@tpdownes tpdownes self-requested a review December 5, 2024 05:44
@tpdownes tpdownes merged commit c416381 into GoogleCloudPlatform:develop Dec 5, 2024
tpdownes added a commit to tpdownes/cluster-toolkit that referenced this pull request Dec 5, 2024
tpdownes added a commit to tpdownes/cluster-toolkit that referenced this pull request Dec 6, 2024
tpdownes added a commit to tpdownes/cluster-toolkit that referenced this pull request Dec 6, 2024
tpdownes added a commit to tpdownes/cluster-toolkit that referenced this pull request Dec 6, 2024
cdunbar13 pushed a commit to cdunbar13/cluster-toolkit that referenced this pull request Dec 18, 2024
TESTED:
- simple Debian and Ubuntu VMs with one NIC
- a3-megagpu-8g Ubuntu and HPC Rocky 8
cdunbar13 pushed a commit to cdunbar13/cluster-toolkit that referenced this pull request Dec 18, 2024
@nick-stroud nick-stroud mentioned this pull request Dec 19, 2024
Labels

release-improvements Added to release notes under the "Improvements" heading.

4 participants