LLMEval: Memory usage #17

@ptliddle

Description

I'm not sure if I should ask this here or in the mlx-swift repo. Let me know if I should move it.

I'm trying to understand the memory usage. I have a 4-billion-parameter, 4-bit-quantized Qwen2 model that I'm using for inference with the code in LLMEval. Setting breakpoints in the code and tracking memory usage, I see around ~500 MB just after load. After the first inference this balloons to over 10 GB (~10.3 GB) and then never drops back down, even after inference is complete.

Can someone explain why this is? Is it just a case of the model loading things it needs lazily? If so, is there a way to reset this so the model can drop back to its pre-inference, post-weight-load size, to reduce the memory requirements of the app while the LLM is sitting idle?
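For anyone hitting the same question, a minimal sketch of the knobs involved, assuming the `MLX.GPU` API in mlx-swift (check the current API docs; names here are based on what LLMEval itself uses at startup). Much of the post-inference footprint is typically MLX's Metal buffer cache, which is kept around so later evaluations can reuse allocations rather than re-allocate:

```swift
import MLX

// Bound the buffer cache before running inference (value is in bytes).
// LLMEval sets a cache limit like this at startup.
MLX.GPU.set(cacheLimit: 20 * 1024 * 1024)

// After generation finishes, release cached buffers so the process can
// drop back toward its post-weight-load size while the model sits idle.
MLX.GPU.clearCache()

// Inspect how much memory is actively held vs. merely cached.
print("active: \(MLX.GPU.activeMemory), cached: \(MLX.GPU.cacheMemory)")
```

Note the KV cache built up during generation is separate from this buffer cache: it is freed when the generation's arrays go out of scope, not by `clearCache()`.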
