
Change default CHGnet.load(check_cuda_mem: bool) to False#164

Merged
janosh merged 4 commits into main from default-check_cuda_mem-False
Jun 11, 2024
Conversation

@janosh (Collaborator) commented Jun 11, 2024

There's a problem with cuda_devices_sorted_by_free_mem on SLURM clusters:

def cuda_devices_sorted_by_free_mem() -> list[int]:
    """List available CUDA devices sorted by increasing available memory.

    To get the device with the most free memory, use the last list item.
    """
    if not torch.cuda.is_available():
        return []
    free_memories = []
    nvidia_smi.nvmlInit()
    device_count = nvidia_smi.nvmlDeviceGetCount()
    for idx in range(device_count):
        handle = nvidia_smi.nvmlDeviceGetHandleByIndex(idx)
        info = nvidia_smi.nvmlDeviceGetMemoryInfo(handle)
        free_memories.append(info.free)

    return sorted(range(len(free_memories)), key=lambda idx: free_memories[idx])

It returns whichever GPU has the most free memory, so the model tries to use that GPU even if the job was allocated a different one. This results in a cryptic CUDA error:

    return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
RuntimeError: CUDA error: invalid device ordinal

Process finished with exit code 1

Given that CHGNet will often be used on queued HPC infrastructure where this error can occur, and the error message is not obvious to debug, @BowenD-UCB and I agreed to change the default from True to False.

janosh added the ux (User experience), breaking (Breaking change), and hardware (Running on accelerated hardware) labels on Jun 11, 2024
janosh changed the title from "Change default check_cuda_mem: bool to False" to "Change default CHGnet.load(check_cuda_mem: bool) to False" on Jun 11, 2024
janosh merged commit d3f1b30 into main on Jun 11, 2024
janosh deleted the default-check_cuda_mem-False branch on June 11, 2024 18:29
