Description
I'm not sure if I should ask this here or in the mlx-swift repo. Let me know if I should move it.
I'm trying to understand the memory usage. I have a 4-billion-parameter, 4-bit-quantized Qwen2 model that I'm using for inference with the code in LLMEval. Setting breakpoints in the code and tracking memory usage, it seems that just after load it uses around ~500 MB. After the first inference this balloons to over 10 GB (~10.3 GB) and then doesn't reduce again, even after inference is complete.
Can someone explain why this is? Is it just a case of the model lazily allocating things it needs? If so, is there a way to reset this so the model can drop back to its pre-inference, post-weight-load size, reducing the app's memory requirements while the LLM is sitting idle?
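For context, here's roughly how I've been poking at it. This is only a sketch based on my reading of the MLX Swift API; `GPU.snapshot()`, `GPU.clearCache()`, and `GPU.set(cacheLimit:)` are the names I believe exist, but they may differ in the version you're using. The idea is to distinguish active memory (live tensors) from cache memory (MLX's reusable buffer pool), since the latter would explain growth that persists after inference finishes:

```swift
import MLX

// Log active vs. cache vs. peak memory at a given point.
// GPU.snapshot() is assumed from the mlx-swift API; field names
// (activeMemory, cacheMemory, peakMemory) may vary by version.
func logMemory(_ label: String) {
    let snap = GPU.snapshot()
    print("\(label): active \(snap.activeMemory / (1 << 20)) MB, "
        + "cache \(snap.cacheMemory / (1 << 20)) MB, "
        + "peak \(snap.peakMemory / (1 << 20)) MB")
}

logMemory("after weight load")
// ... run one inference here ...
logMemory("after first inference")

// If most of the growth shows up as cacheMemory, it's MLX's buffer
// cache rather than leaked tensors, and dropping it should shrink
// the footprint while the model is idle:
GPU.clearCache()

// Alternatively, cap the cache up front so it never grows past a
// fixed size (value here is arbitrary for illustration):
GPU.set(cacheLimit: 1024 * 1024 * 1024) // 1 GB

logMemory("after clearing cache")
```

If someone can confirm whether the post-inference growth I'm seeing is this cache (and whether clearing or capping it is the intended way to release it), that would answer my question.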