alabdao
commented
Oct 17, 2023
- supporting latest prod deployment
- avoid pulling container since they cause slowdown
- setting receptor url
- new prod setup
- removed prod
- Ability to deploy custom file
- install config file at the end after bacalhau repo has been initialized
- add instance id to labels
- [LAB-595] Dynamically determining if GPUs are available and how many
- casting to int
- fixing query
- fixing bacalhau client version determination command
- making limit memory determination dynamic
- gather facts true
- accept networked jobs
- removing quotes
| --limit-job-memory 12gb \ | ||
| {% if gpu %} | ||
| --limit-job-gpu 1 \ | ||
| --limit-job-memory {{ (ansible_memtotal_mb | int * 0.80) | round | int }}Mb \ |
Dynamically determine total memory and set the limit to 80% of it, instead of the hardcoded 12 GB, which is too low for some instances.
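The arithmetic behind the new `--limit-job-memory` expression can be sketched in Python. This is only an illustration of the 80% rule from the template; the function name `job_memory_limit` and the sample value are assumptions, not part of the playbook.

```python
def job_memory_limit(memtotal_mb: int, fraction: float = 0.80) -> str:
    """Mirror the template expression:
    (ansible_memtotal_mb | int * 0.80) | round | int, followed by 'Mb'.

    `memtotal_mb` stands in for the Ansible fact ansible_memtotal_mb.
    """
    limit_mb = int(round(int(memtotal_mb) * fraction))
    return f"{limit_mb}Mb"


if __name__ == "__main__":
    # Hypothetical instance reporting 16000 MB of total memory.
    print(job_memory_limit(16000))
```

With this rule the limit scales with the instance size rather than staying fixed at 12 GB.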
| {% if num_of_gpus | int > 0 %} | ||
| --limit-job-gpu {{ num_of_gpus | int }} \ |
Dynamically set the number of GPUs. This supports non-GPU instances as well as instances with more than one GPU, instead of always hardcoding 1.
| @@ -0,0 +1,67 @@ | |||
| --- | |||
Since the CLI flags don't work anymore, the configuration is set in this file.
| JobExecutionTimeoutClientIDBypassList: [] | ||
| JobNegotiationTimeout: 3m0s | ||
| MinJobExecutionTimeout: 500ms | ||
| MaxJobExecutionTimeout: {{ bacalhau_compute_max_job_execution_timeout | default('24h') }} |
Allow overriding the max execution timeout. It defaults to 24h for now.
| JobSelection: | ||
| Locality: anywhere | ||
| RejectStatelessJobs: false | ||
| AcceptNetworkedJobs: true |
Setting it to true here since the CLI flags don't work at the moment.
| nvidia_distribution: ubuntu2004 | ||
| ipfs_version: "0.18.0" | ||
| ipfs_path: "/opt/ipfs" | ||
| gpu: true |
| # Get GPU info from system | ||
| - name: Get lshw display info | ||
| become: true | ||
| ansible.builtin.command: lshw -c display -json | ||
| changed_when: true | ||
| register: lshw_output | ||
|
|
||
| - name: set number of gpus available | ||
| vars: | ||
| query: "[?vendor=='NVIDIA Corporation']" | ||
| ansible.builtin.set_fact: | ||
| num_of_gpus: "{{ lshw_output.stdout | from_json | json_query(query) | length }}" | ||
|
|
Get the number of GPUs from the output of the lshw command.
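As a rough illustration of what the `json_query` filter computes, the same count can be written in Python. The helper name `count_nvidia_gpus` and the sample JSON are hypothetical; the sketch also assumes that `lshw -json` may emit either a list or a single object, which is handled defensively here.

```python
import json


def count_nvidia_gpus(lshw_stdout: str) -> int:
    """Count display devices whose vendor is exactly 'NVIDIA Corporation',
    matching the query [?vendor=='NVIDIA Corporation'] | length."""
    devices = json.loads(lshw_stdout)
    if isinstance(devices, dict):
        # Assumption: a single device may be reported as one object.
        devices = [devices]
    return sum(1 for d in devices if d.get("vendor") == "NVIDIA Corporation")


if __name__ == "__main__":
    # Hypothetical, heavily trimmed lshw output for illustration only.
    sample = json.dumps([
        {"id": "display", "vendor": "NVIDIA Corporation"},
        {"id": "display", "vendor": "NVIDIA Corporation"},
        {"id": "display", "vendor": "Some Other Vendor"},
    ])
    print(count_nvidia_gpus(sample))
```

A count of zero is what lets the template skip the `--limit-job-gpu` flag on non-GPU instances.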
| # - name: Pull common containers | ||
| # ansible.builtin.include_tasks: tasks/pull_common_containers.yaml |
Pulling containers causes serious load on the system. Disabling this for now until we figure out something better.
Does this mean the compute node will pull the container it needs the first time it runs the Job?
That's correct.
A potential fix would be to use Packer to create a compute image that already has select Convexity containers available.
| # Try running Bacalhau first, to see what version it is. | ||
| - name: Check bacalhau version | ||
| ansible.builtin.command: /usr/local/bin/bacalhau version | ||
| ansible.builtin.command: /usr/local/bin/bacalhau version --client --no-style --hide-header |
Only get the currently installed bacalhau client version, without the pretty-printing.
| - name: Set fact for currently installed version | ||
| ansible.builtin.set_fact: | ||
| bacalhau_installed_version: "{{ existing_bacalhau_version.stdout.split('Server Version: ')[1] }}" | ||
| bacalhau_installed_version: "{{ existing_bacalhau_version.stdout | trim }}" |
The output format has changed and no longer uses the 'Server Version:' prefix.
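To show why the parsing changed, here is a small Python sketch of the two approaches. The version strings are made up for illustration and are not real bacalhau output; the function names are likewise hypothetical.

```python
def parse_old(stdout: str) -> str:
    """Old approach: take whatever follows the 'Server Version: ' marker.
    Raises IndexError when the marker is absent."""
    return stdout.split("Server Version: ")[1]


def parse_new(stdout: str) -> str:
    """New approach: the command prints only the version, so strip
    surrounding whitespace (the equivalent of the Jinja `trim` filter)."""
    return stdout.strip()


if __name__ == "__main__":
    bare_output = "v0.0.0\n"  # hypothetical bare version line
    print(parse_new(bare_output))
    try:
        parse_old(bare_output)
    except IndexError:
        print("old parsing fails when the marker is missing")
```

Trimming avoids depending on a label that may change between releases.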
| - name: Set fact when its non-prod node | ||
| - name: Set fact | ||
| ansible.builtin.set_fact: | ||
| requester_hostname: "requester.{{ ansible_ec2_tags_instance_Env | lower }}.labdao.xyz" | ||
| ipfs_hostname: "ipfs.{{ ansible_ec2_tags_instance_Env | lower }}.labdao.xyz" | ||
| when: ansible_ec2_tags_instance_Env is defined and ansible_ec2_tags_instance_Env | lower != "prod" | ||
| receptor_hostname: "receptor.{{ ansible_ec2_tags_instance_Env | lower }}.labdao.xyz" | ||
| when: ansible_ec2_tags_instance_Env is defined |
All envs now follow the something.<env>.labdao.xyz approach.
| - name: Set receptor url | ||
| ansible.builtin.set_fact: | ||
| receptor_url: "http://{{ receptor_hostname }}:8080/judge" | ||
| when: receptor_hostname is defined | ||
|
|
Determining the receptor URL from the env.
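The hostname and URL derivation in these two tasks can be sketched as a plain Python function. The name `derive_endpoints` and the example tag value are assumptions; the domain, port, and path are taken from the diff.

```python
from typing import Dict, Optional


def derive_endpoints(env_tag: Optional[str]) -> Dict[str, str]:
    """Build per-environment hostnames from the instance's Env tag.

    Returns an empty dict when the tag is undefined, mirroring the
    `when: ansible_ec2_tags_instance_Env is defined` guard.
    """
    if env_tag is None:
        return {}
    env = env_tag.lower()
    receptor_hostname = f"receptor.{env}.labdao.xyz"
    return {
        "requester_hostname": f"requester.{env}.labdao.xyz",
        "ipfs_hostname": f"ipfs.{env}.labdao.xyz",
        "receptor_hostname": receptor_hostname,
        "receptor_url": f"http://{receptor_hostname}:8080/judge",
    }


if __name__ == "__main__":
    # "Stage" is a hypothetical tag value used only for illustration.
    for key, value in derive_endpoints("Stage").items():
        print(f"{key}: {value}")
```

Because prod is no longer special-cased, every environment goes through the same derivation.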
| - name: Ensure path to bacalhau dir exists | ||
| become: true | ||
| ansible.builtin.file: | ||
| path: /home/ubuntu/.bacalhau/ | ||
| state: directory | ||
| mode: "0755" | ||
| owner: ubuntu | ||
| group: ubuntu | ||
|
|
Creating the directory so we can push the config file there.
| - name: Flush handler to ensure Bacalhau is running | ||
| ansible.builtin.meta: flush_handlers | ||
|
|
||
| - name: Deploy config file | ||
| become: true | ||
| ansible.builtin.template: | ||
| src: "files/{{ bacalhau_node_type }}.yaml" | ||
| dest: /home/ubuntu/.bacalhau/config.yaml | ||
| owner: ubuntu | ||
| group: ubuntu | ||
| mode: "0644" | ||
| notify: | ||
| - Restart Bacalhau |
Deploy the custom file.