Hi, I'm not sure if this is issue-worthy. I'm trying to run Qwen3.6 27B as follows:

Run command:

These settings just barely fit, but I get the following message on startup:

The prompt processing (PP) rate is between 100 and 400 t/s, and output is at 9-16 t/s in this setup. Even though GPU utilization is at 99%, I'm wondering whether the missing GET_ROWS and the CPU offload are the bottleneck. Unfortunately, when I tried the Vulkan backend, PP performance never exceeded 200 t/s. This is kind of painful with OpenCode, which likes huge contexts.

Is the issue in the quantization, which triggers the offload to a single CPU? Is there an option to parallelize the min-k on the CPU, or is this a driver issue? And would adding a second B60 help, given that at least part of the workload is being moved to the CPU?

Thanks :)

Startup log below:
Replies: 2 comments 2 replies
@cwriter Thank you for reporting this!
Thank you!
@cwriter
Fixed by PR: #23710.
Thank you!