Fix GPU Layer Limitation in llamafile #534
Merged
jart merged 1 commit into mozilla-ai:main on Nov 2, 2024
Conversation
Collaborator
Following up on this: this is the same issue mentioned in DM @jart. Removing this line should be sufficient, as far as I can tell.
#533
In the current implementation, the line `n_gpu_layers = std::min(n_gpu_layers, (int)hparams.n_layer);` caps `n_gpu_layers` at `hparams.n_layer`. However, in the llama.cpp project, within the `static void llm_load_hparams` function, `hparams.n_layer` is derived from `ml.get_key(LLM_KV_BLOCK_COUNT, hparams.n_layer);`, which counts only the layers that require key-value (KV) attention and does not include other offloadable layers, such as the output layer. This cap can therefore cause performance degradation, as observed in token generation speed and GPU utilization.
By either commenting out this line or loosening the cap to
`hparams.n_layer + 10`, the issue can be mitigated, ensuring all necessary layers are offloaded to the GPU and improving overall performance.