### 🐛 Describe the bug
This issue appears to be related to #95823, but also occurs on smaller tensors. Although #95823 is closed, the underlying problem still exists.

When pinning a tensor that is slightly larger than 128 MB, PyTorch appears to round the allocation up to the next power of two (256 MB), nearly doubling the expected memory usage.
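If pinned allocations are indeed rounded up to the next power of two, the arithmetic works out as sketched below (`next_pow2` is a helper written for this illustration, not a PyTorch API; the shape matches the example that follows):

```python
# Sketch of the suspected rounding, assuming pinned allocations are rounded
# up to the next power of two. next_pow2 is a helper for this illustration,
# not a PyTorch API.
def next_pow2(n: int) -> int:
    return 1 << (n - 1).bit_length()

nbytes = 18944 * 3584 * 2        # qwen2.5 7b up_proj in float16: ~129.5 MiB
rounded = next_pow2(nbytes)      # 268435456 bytes = 256 MiB
print(f"requested {nbytes / 2**20:.1f} MiB, "
      f"rounded to {rounded / 2**20:.1f} MiB "
      f"(+{(rounded - nbytes) / 2**20:.1f} MiB overhead)")
```

For the 129.5 MiB tensor this predicts a 256 MiB allocation, i.e. ~126.5 MiB of overhead, which matches the measurement below.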
### Minimal Example
```python
import torch

def get_free():
    """Report used/shared system memory in MB, parsed from `free -m`."""
    import subprocess
    r = subprocess.run(["free", "-m"], capture_output=True)
    d = r.stdout.decode('utf-8')
    # Fields of the "Mem:" row: total, used, free, shared, buff/cache, available.
    s = d.split(':')[1].split()
    return f"[used={s[1]:7}, shared={s[3]:7}] "

model_weight = torch.randn(18944, 3584, dtype=torch.float16, device='cpu')  # 129.5MB (qwen2.5 7b, up_proj)
# model_weight = torch.randn(14336, 4096, dtype=torch.float16, device='cpu')  # 112.0MB (llama3.1 8b, up_proj)
print("weight memory usage:", model_weight.element_size() * model_weight.nelement() / (1024 ** 2), "MB")

# Pinning memory
print(get_free() + "Before pin")
model_weight = model_weight.pin_memory()
print(get_free() + "After pin")
```
### Observed Behavior
Pinning qwen2.5 7b's up_proj allocates almost double the expected memory (expected: 129.5 MB, actually used: 264 MB, visible in the `shared` column):

```
weight memory usage: 129.5 MB
[used=9306   , shared=108    ] Before pin
[used=9334   , shared=372    ] After pin
```
Pinning llama3.1 8b's up_proj (expected: 112.0 MB, actually used: 136 MB):

```
weight memory usage: 112.0 MB
[used=9280   , shared=108    ] Before pin
[used=9321   , shared=244    ] After pin
```
Although the extra memory is less noticeable when pinning a single tensor, the overhead scales with the number of pinned tensors and can significantly inflate DRAM usage. For instance, it results in approximately 12 GB of extra memory overhead when pinning the weights of the Qwen2.5 7B model.
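As a back-of-the-envelope check, here is how the per-tensor overhead accumulates. The 28-layer count and the assumption that gate_proj, up_proj, and down_proj each hold 18944 × 3584 float16 elements are taken from Qwen2.5 7B's published config, not measured here:

```python
# Rough estimate of the aggregate rounding overhead for Qwen2.5 7B's MLP
# weights alone. Assumptions (from the published config, not measured):
# 28 decoder layers, each with gate_proj, up_proj, and down_proj of
# 18944 * 3584 float16 elements.
def next_pow2(n: int) -> int:
    return 1 << (n - 1).bit_length()

nbytes = 18944 * 3584 * 2                  # ~129.5 MiB per projection
per_tensor = next_pow2(nbytes) - nbytes    # ~126.5 MiB wasted per tensor
total = 28 * 3 * per_tensor                # 84 MLP projections in total
print(f"~{total / 2**30:.1f} GiB of estimated overhead from MLP weights alone")
```

The MLP projections alone account for roughly 10 GiB of overhead; the remaining weight tensors plausibly make up the rest of the ~12 GB observed.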
### Versions
PyTorch version: 2.6.0+cu126
cc @ptrblck @msaroufim @eqy