#10133 changed get_rows from false to true in offload_op. I've detected a big performance regression for quantizations that support get_rows (llama3 Q8_0, for example).
@uniartisan Could you share more information about the device you used for offloading (where you saw increased performance)? Or did this only improve your testing?
Returning true for GGML_OP_GET_ROWS in offload_op causes the token embeddings to be copied to VRAM, which is almost never worth it: this is a big tensor, and the op can be run very cheaply on the CPU. I imagine that RWKV uses get_rows in some way that makes copying the weight to VRAM worthwhile in that case, which is why @uniartisan saw a speedup, but it needs to be done in a more selective way.
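To illustrate the idea of a more selective offload_op, here is a minimal C++ sketch. All names here (`tensor_info`, `should_offload`, the 64 MiB threshold) are hypothetical and not the actual llama.cpp/ggml API; the point is only the shape of the decision: offload GET_ROWS only when the source weight is small enough that the one-time copy to VRAM can pay for itself.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical op kinds and tensor description (not the real ggml types).
enum op_kind { OP_GET_ROWS, OP_OTHER };

struct tensor_info {
    op_kind op;
    size_t  src_nbytes; // size in bytes of the weight/source tensor
};

// Assumed threshold: above this, copying the embedding matrix to VRAM is
// unlikely to beat just running GET_ROWS on the CPU.
static const size_t k_get_rows_offload_max = 64u * 1024u * 1024u; // 64 MiB

// Decide whether an op is worth offloading to the GPU.
bool should_offload(const tensor_info & t) {
    if (t.op == OP_GET_ROWS) {
        // Token-embedding lookup: only offload small weights.
        return t.src_nbytes <= k_get_rows_offload_max;
    }
    // Other ops: offload as before.
    return true;
}
```

With a rule like this, a 1 GiB Q8_0 embedding table would stay on the CPU (avoiding the regression described above), while a model whose get_rows usage involves small tensors could still benefit from offloading.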
I've tested multiple GPUs. The description has data for an NVIDIA A100, but I also tested on an Arc 770 and a Data Center GPU Max 1100. On both of these I see a performance regression as well, though I'm using Meta-Llama-3.1-8B-Instruct-Q8_0.gguf.
@NeoZhangJianyu I assure you, this is a significant performance problem and needs to be fixed as soon as possible. It's hard to tell why you cannot reproduce this without more details about how you are testing.
@NeoZhangJianyu you mentioned testing with Meta-Llama-3-8B.Q8_0.gguf, while we are using Meta-Llama-3.1-8B-Instruct-Q8_0.gguf. Could that explain why you are not seeing the same performance drop?
An example of the regression:
build: fab5d30 (4143)
With this revert:
build: f4c4ce3