After importing the model and running `apr chat`, the model appears to load correctly on the GPU but never returns an answer. GPU usage cycles between 0% and 70% a few times over about 2 minutes, and then the response comes back empty.
If I try again, the same cycle repeats. The whole interaction below took about 6 minutes:
```
apr chat qwen2.5-1.5b-instruct-q4_k_m.apr
=== Model Chat (APR Format) ===
Using APR v2 format with mmap (Native Library Mandate)
Model: qwen2.5-1.5b-instruct-q4_k_m.apr
Chat Template: ChatML
Temperature: 0.7
Top-P: 0.9
Max Tokens: 512
Commands:
/quit Exit the chat
/clear Clear conversation history
/system Set system prompt
/help Show help
════════════════════════════════════════════════════════════
Loading model...
Loaded APR format in 0.15s (1113.2 MB)
Loaded tokenizer: tokenizer.json (151936 tokens)
Detected Raw chat template
You: hey
[AprV2ModelCuda] Pre-cached 5596 MB of weights on GPU (28 layers, 0 quantized, 308 F32 tensors)
[AprV2ModelCuda] Cached embedding table: 125 MB
[APR CUDA: NVIDIA GeForce RTX 4090 (24077 MB VRAM)]
Assistant:
You: hey
[AprV2ModelCuda] Pre-cached 5596 MB of weights on GPU (28 layers, 0 quantized, 308 F32 tensors)
[AprV2ModelCuda] Cached embedding table: 125 MB
[APR CUDA: NVIDIA GeForce RTX 4090 (24077 MB VRAM)]
Assistant:
You:
```
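One thing that may be worth checking in the log above: the file is 1113.2 MB on disk, but 5596 MB of weights are pre-cached on the GPU, and the same line reports "0 quantized, 308 F32 tensors". The rough arithmetic below is only a sketch. The parameter count of about 1.5e9 is an assumption read off the "1.5b" in the file name, not a value reported by `apr`, and MB is treated as 10^6 bytes.

```python
# Sizes copied from the log output above.
FILE_SIZE_MB = 1113.2   # "Loaded APR format in 0.15s (1113.2 MB)"
GPU_CACHED_MB = 5596.0  # "Pre-cached 5596 MB of weights on GPU"

# Assumption: parameter count inferred from "1.5b" in the file name.
APPROX_PARAMS = 1.5e9


def expansion_ratio(cached_mb: float, file_mb: float) -> float:
    """How many times larger the GPU cache is than the file on disk."""
    return cached_mb / file_mb


def bytes_per_param(size_mb: float, params: float) -> float:
    """Average bytes stored per parameter, treating 1 MB as 10^6 bytes."""
    return size_mb * 1e6 / params


ratio = expansion_ratio(GPU_CACHED_MB, FILE_SIZE_MB)
disk_bpp = bytes_per_param(FILE_SIZE_MB, APPROX_PARAMS)
gpu_bpp = bytes_per_param(GPU_CACHED_MB, APPROX_PARAMS)

print(f"GPU cache is {ratio:.2f}x the file size")
print(f"on disk: {disk_bpp:.2f} bytes/param, on GPU: {gpu_bpp:.2f} bytes/param")
```

Under those assumptions the GPU copy is roughly five times the file size and close to 4 bytes per parameter, which would be consistent with the Q4_K_M weights being expanded to F32 on the GPU path rather than kept quantized. I can't say whether that is related to the empty responses.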