Description
I'm not sure if I should ask this here or in the mlx-swift repo. Let me know if I should move it.
I'm trying to understand the memory usage. I have a 4-billion-parameter, 4-bit-quantized Qwen2 model that I'm using for inference with the code in LLMEval. Setting breakpoints in the code and tracking memory usage, it seems that just after load it uses around ~500 MB. After the first inference this balloons to over 10 GB (~10.3 GB) and then doesn't reduce again, even after inference is complete.
Can someone explain why this is? Is it just a case of the model lazily allocating things it needs? If so, is there a way to reset this so the model can drop back to its pre-inference, post-weight-load size, reducing the app's memory requirements while the LLM is sitting idle?
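For context, here's roughly how I've been poking at it. This is only a sketch based on my reading of the MLX Swift API; `GPU.snapshot()`, `GPU.clearCache()`, and `GPU.set(cacheLimit:)` are the names I believe exist, but they may differ in the version you're using. The idea is to distinguish active memory (live tensors) from cache memory (MLX's reusable buffer pool), since the latter would explain growth that persists after inference finishes:

```swift
import MLX

// Log active vs. cache vs. peak memory at a given point.
// GPU.snapshot() is assumed from the mlx-swift API; field names
// (activeMemory, cacheMemory, peakMemory) may vary by version.
func logMemory(_ label: String) {
    let snap = GPU.snapshot()
    print("\(label): active \(snap.activeMemory / (1 << 20)) MB, "
        + "cache \(snap.cacheMemory / (1 << 20)) MB, "
        + "peak \(snap.peakMemory / (1 << 20)) MB")
}

logMemory("after weight load")
// ... run one inference here ...
logMemory("after first inference")

// If most of the growth shows up as cacheMemory, it's MLX's buffer
// cache rather than leaked tensors, and dropping it should shrink
// the footprint while the model is idle:
GPU.clearCache()

// Alternatively, cap the cache up front so it never grows past a
// fixed size (value here is arbitrary for illustration):
GPU.set(cacheLimit: 1024 * 1024 * 1024) // 1 GB

logMemory("after clearing cache")
```

If someone can confirm whether the post-inference growth I'm seeing is this cache (and whether clearing or capping it is the intended way to release it), that would answer my question.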