Skip to content

Separate requester compute#513

Merged
hevans66 merged 3 commits intomainfrom
ops/separate-requester-compute
Jul 17, 2023
Merged

Separate requester compute#513
hevans66 merged 3 commits intomainfrom
ops/separate-requester-compute

Conversation

@hevans66
Copy link
Copy Markdown
Contributor

  • Separates requester and compute nodes to separate EC2 instances. Currently one requester and one compute instance.

  • Factors out IPFS install steps into taks file.

  • This PR creates a new Requester/Compute node, essentially a new Plex instance - running at ec2-18-208-163-46.compute-1.amazonaws.com. I think if were happy with how this is working we can bounce the private ip to this node, then decommission the old compute node in a separate PR.

@vercel
Copy link
Copy Markdown

vercel bot commented Jul 17, 2023

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
docs ✅ Ready (Inspect) Visit Preview 💬 Add feedback Jul 17, 2023 7:41pm

@hevans66 hevans66 temporarily deployed to ci July 17, 2023 18:41 — with GitHub Actions Inactive
Comment on lines +96 to +108
- name: Install Nvidia Container Tookit
become: yes
ansible.builtin.apt:
pkg:
- nvidia-docker2
notify:
- Restart docker
when: gpu

- name: Ensure Nvidia persitence daemon is started
ansible.builtin.systemd:
name: nvidia-persistenced
when: gpu
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not blocker: the 2 when's can be combined using ansible block

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to combine them, but I had them separated to decouple the Restart docker hook as much as possible. My thinking is since most of the jobs compute is running are docker based we want to avoid restarting docker unnecessarily if possible. Probably a minor optimization in this case though.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in that case, we can use docker live-restore option. https://docs.docker.com/config/containers/live-restore/

Comment on lines +45 to +79
# Nvidia
- name: Get Nvidia drivers apt key
ansible.builtin.get_url:
url: https://developer.download.nvidia.com/compute/cuda/repos/{{ nvidia_distribution }}/x86_64/cuda-keyring_1.0-1_all.deb
dest: /tmp/cuda-keyring.deb
when: gpu

- name: Add Nvidia Keyring
become: yes
ansible.builtin.apt:
deb: /tmp/cuda-keyring.deb
when: gpu

- name: Get Nvidia Container Tookit GPG key
become: yes
ansible.builtin.shell:
cmd: curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --yes --dearmor -o {{ nvidia_container_toolkit_key_path }}
creates: "{{ nvidia_container_toolkit_key_path }}"
when: gpu

- name: Add Nvidia Container Tookit Repository
become: yes
ansible.builtin.apt_repository:
repo: deb [signed-by={{ nvidia_container_toolkit_key_path }}] https://nvidia.github.io/libnvidia-container/stable/ubuntu18.04/$(ARCH) /
state: present
when: gpu

- name: Install required system packages for gpu build
become: yes
ansible.builtin.apt:
pkg:
- cuda-drivers
state: latest
update_cache: true
when: gpu
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not blocker: combining block and when.

@alabdao
Copy link
Copy Markdown
Contributor

alabdao commented Jul 17, 2023

just a general note, again not a blocker. Just good to delete anything that gets downloaded under /tmp/ to avoid filling up disk space. Especially matters when its a long lived node, small disk, multiple versions get installed, can fill up quick.

@alabdao
Copy link
Copy Markdown
Contributor

alabdao commented Jul 17, 2023

since my comments aren't blocking, gonna approve this to move it along.

@hevans66
Copy link
Copy Markdown
Contributor Author

just a general note, again not a blocker. Just good to delete anything that gets downloaded under /tmp/ to avoid filling up disk space. Especially matters when its a long lived node, small disk, multiple versions get installed, can fill up quick.

👍 💯 agree. I'm gonna merge as is. We can make /tmp/ clean up (and probably disk space monitoring in general) a separate project.

@hevans66 hevans66 merged commit c001733 into main Jul 17, 2023
@hevans66 hevans66 deleted the ops/separate-requester-compute branch July 17, 2023 20:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants