Costs
This tutorial includes the following chargeable resources:

- NFS server:
  - Compute VM (default: Non-GPU AMD EPYC Genoa, 4vcpu-16gb)
  - Compute disk (default: Network SSD IO M3, 93 GiB)
- Artifacts storage: Object Storage bucket
- Anyscale deployment:
  - Compute VMs as Managed Kubernetes nodes (default: one NVIDIA® H100 NVLink with Intel Sapphire Rapids, 1gpu-16vcpu-200gb VM; one Non-GPU AMD EPYC Genoa, 4vcpu-16gb VM)
  - Compute disks for the VMs (default: two Network SSD disks, 1023 GiB and 128 GiB)
Prerequisites
- Create an Anyscale account.
- Install and configure the following tools:
Steps
Prepare the environment
- Clone the GitHub repository and go to the `anyscale` directory:
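The clone step could look like the following. The repository URL is an assumption based on the Nebius AI Cloud solution library referenced later in this tutorial; substitute the URL from the original instructions if yours differs.

```shell
# Assumed repository URL (the tutorial references the Nebius AI Cloud
# solution library); replace it if the original instructions differ.
git clone https://github.com/nebius/nebius-solution-library.git
cd nebius-solution-library/anyscale
```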
- Edit the `environment.sh` file to add the IDs of your tenant and project and the region of the project to the environment variables at the top of the file.
- Source `environment.sh` to export its environment variables, so they persist for the current shell session:
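Sourcing the file (rather than running it in a subshell) is what makes the exports persist in the current session. A minimal sketch, guarded for the case where the file is not in the current directory:

```shell
# Source (not execute) the file so its exported variables persist
# in the current shell session. Run this from the anyscale directory,
# where environment.sh lives.
if [ -f ./environment.sh ]; then
  source ./environment.sh
fi
```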
- Make a copy of the configuration file:
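A sketch of the copy step. The template file name `default.yaml.tpl` is a hypothetical example; check the repository for the file the tutorial actually ships. The target name `default.yaml` is the file edited in the sections below.

```shell
# default.yaml.tpl is a hypothetical template name; the copy becomes
# the default.yaml file that later steps edit.
if [ -f default.yaml.tpl ]; then
  cp default.yaml.tpl default.yaml
fi
```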
- Create an SSH key for the NFS server and Anyscale nodes:
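One way to create the key, matching the `~/.ssh/id_ed25519.pub` path referenced in the next section (the comment string is arbitrary):

```shell
# Generate an ed25519 key pair with no passphrase; the public key at
# ~/.ssh/id_ed25519.pub is pasted into default.yaml in the next section.
mkdir -p ~/.ssh
if [ ! -f ~/.ssh/id_ed25519 ]; then
  ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519 -N "" -C "anyscale"
fi
```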
Create storage for Anyscale
The Terraform configuration in the `prepare` directory creates an Object Storage bucket to store workload artifacts and an NFS server to store Anyscale workspace data (user code, configuration files, etc.).
- Edit `default.yaml`:
  - In `.ssh_public_key`, paste the contents of the public SSH key (`~/.ssh/id_ed25519.pub`).
  - In `.nfs_server.nfs_size`, set the size of the disk in GiB. The disk is a Network SSD IO M3 disk, so its size must be a multiple of 93 GiB (for example, 1023 GiB).
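As an illustration, the edited fields might end up looking like this. The values are examples only: `ssh-ed25519 AAAA...` stands in for your real public key, and 1023 = 11 × 93 satisfies the multiple-of-93-GiB rule.

```shell
# Print an illustrative fragment of default.yaml after editing
# (example values only).
cat <<'EOF'
ssh_public_key: "ssh-ed25519 AAAA... user@host"
nfs_server:
  nfs_size: 1023
EOF
```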
- Apply the Terraform configuration from the `prepare` directory:
Deploy Anyscale
The Terraform configuration in the `deploy` directory creates a Managed Kubernetes cluster and node groups, and deploys the Anyscale application on the cluster.
- Register the Anyscale cloud:
  Replace `<cloud_name>` with the name for your Anyscale cloud deployment that will be shown in Anyscale console. The output contains a cloud deployment ID that starts with `cldrsrc_`. Save it for the next step.
- Create an Anyscale API key in Anyscale console.
- Edit `default.yaml`:
  - In `.anyscale.cloud_deployment_id`, paste the cloud deployment ID that starts with `cldrsrc_`, which you obtained in the previous step.
  - In `.anyscale.anyscale_cli_token`, paste the Anyscale API key.
  - In `k8s_cluster`, configure the cluster. It should have at least one GPU node group with one node and at least one non-GPU node. The default configuration creates an NVIDIA H100 node group with one single-GPU node and a non-GPU node group with one node. For details about the parameters, see the following articles and resources:
    - `{cpu,gpu}_nodes_{platform,preset}`: Types of virtual machines and GPUs in Nebius AI Cloud
    - `enable_gpu_cluster`, `infiniband_fabric`: Interconnecting GPUs in Managed Service for Kubernetes® clusters using InfiniBand™
    - `gpu_nodes_driverfull_image`: GPU drivers and other components
    - `enable_{prometheus,loki}`: the section about Kubernetes observability in the Nebius AI Cloud solution library
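As an illustration, the relevant part of `default.yaml` might look roughly like this. The field names come from this tutorial, but the nesting and values are assumptions; compare with the file shipped in the repository. The presets match the defaults listed in the Costs section.

```shell
# Print an illustrative fragment of default.yaml for the deploy step.
# Nesting and values are assumptions; placeholders are in angle brackets.
cat <<'EOF'
anyscale:
  cloud_deployment_id: "cldrsrc_..."     # from the cloud registration step
  anyscale_cli_token: "<your_api_key>"
k8s_cluster:
  cpu_nodes_platform: "<platform>"       # non-GPU node platform
  cpu_nodes_preset: "4vcpu-16gb"
  gpu_nodes_platform: "<gpu_platform>"   # e.g. the H100 platform
  gpu_nodes_preset: "1gpu-16vcpu-200gb"
  enable_gpu_cluster: false              # true to interconnect GPUs over InfiniBand
  infiniband_fabric: ""                  # fabric name when enable_gpu_cluster is true
  enable_prometheus: true
  enable_loki: false
EOF
```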
- Apply the Terraform configuration from the `deploy` directory:
(Optional) Configure Anyscale
You can configure how the Anyscale head node and worker nodes are selected. To do this, log in to Anyscale console, go to your workspace and then follow the instructions in the next sections.

Force a non-GPU head node
The Anyscale head node is a Kubernetes Pod that does not use GPUs. However, by default, it can be scheduled on either a GPU node or a non-GPU node. You can force the head node to run on non-GPU nodes to avoid the cost of provisioning GPU nodes for it. To do that, perform the following steps in your workspace in Anyscale console:

- On the Compute resources panel, under Head node, click the edit button.
- In the window that opens, expand Advanced config.
- Under Instance config, paste the node selector specification:
  The value of the `node.kubernetes.io/instance-type` label must match the platform specified in the `.k8s_cluster.cpu_nodes_platform` field of the `anyscale/default.yaml` file.
- Click Save.
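A plausible shape for that node selector specification follows. The exact format comes from the tutorial's omitted snippet, and `<cpu_platform>` is a placeholder you must replace with the value of `.k8s_cluster.cpu_nodes_platform`.

```shell
# Print a plausible node selector fragment; <cpu_platform> stands in
# for the value of .k8s_cluster.cpu_nodes_platform in anyscale/default.yaml.
cat <<'EOF'
nodeSelector:
  node.kubernetes.io/instance-type: <cpu_platform>
EOF
```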
Configure the selection of worker nodes
Anyscale allows automatic and manual modes of selecting worker nodes for your workspaces. To choose between these modes in Anyscale console, on the Compute resources panel, under Worker nodes, select or clear the Auto-select worker nodes checkbox:

- When the checkbox is selected, Anyscale tries to provision worker nodes automatically. This works well when you run Anyscale and other workloads at the same time in the Managed Kubernetes cluster, because Anyscale workloads only reserve as many GPUs as they require. However, since the number of worker nodes is scaled on demand, provisioning new worker nodes can take some time.
- When the checkbox is not selected, you select worker nodes manually. This is recommended for workloads that need to scale up fast, or if you need granular control over GPU usage.
Test the deployment
For details on testing the Anyscale deployment, see Anyscale resources:

How to delete the created resources
Some of the created resources are chargeable. If you do not need them, delete these resources so that Nebius AI Cloud does not charge you for them:

- In Anyscale console, delete all the workloads that use the deployment.
- Delete all objects from the Anyscale bucket. Its name starts with `anyscale-`.
- Delete the Managed Kubernetes cluster, NFS server and bucket by running the following commands in the `anyscale` directory of the cloned repository: