
Change default CHGnet.load(check_cuda_mem: bool) to False#164

Merged
janosh merged 4 commits into main from default-check_cuda_mem-False
Jun 11, 2024
Conversation

@janosh (Collaborator) commented Jun 11, 2024

There's a problem with cuda_devices_sorted_by_free_mem on SLURM clusters:

def cuda_devices_sorted_by_free_mem() -> list[int]:
    """List available CUDA devices sorted by increasing available memory.

    To get the device with the most free memory, use the last list item.
    """
    if not torch.cuda.is_available():
        return []
    free_memories = []
    nvidia_smi.nvmlInit()
    device_count = nvidia_smi.nvmlDeviceGetCount()
    for idx in range(device_count):
        handle = nvidia_smi.nvmlDeviceGetHandleByIndex(idx)
        info = nvidia_smi.nvmlDeviceGetMemoryInfo(handle)
        free_memories.append(info.free)

    return sorted(range(len(free_memories)), key=lambda idx: free_memories[idx])

It returns whichever GPU has the most free memory, so the model tries to use that GPU even if the job was allocated a different one. This results in a cryptic CUDA error:

    return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
RuntimeError: CUDA error: invalid device ordinal

Process finished with exit code 1

Given that CHGNet will often be used on queued HPC infrastructure where this error can occur, and the error message is not obvious to debug, @BowenD-UCB and I agreed to change the default from True to False.

janosh added the ux (User experience), breaking (Breaking change), and hardware (Running on accelerated hardware) labels on Jun 11, 2024
janosh changed the title from "Change default check_cuda_mem: bool to False" to "Change default CHGnet.load(check_cuda_mem: bool) to False" on Jun 11, 2024
janosh merged commit d3f1b30 into main on Jun 11, 2024
janosh deleted the default-check_cuda_mem-False branch on June 11, 2024 18:29
