context : perform output reorder lazily upon access after sync #14853
Conversation
Can we avoid the reorder entirely if we drop …
Hm, I think …
Ah, got it. Let me see.
Yes, we can "redirect" the index on-the-fly using the …

Do it in a follow-up PR, so we can patch the current issue for now?
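Presumably the redirection would look something like this (a hedged sketch with hypothetical names, not the actual patch): the accessor translates the requested output index through the stored permutation instead of ever moving the data.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of the "redirect the index on-the-fly" idea (hypothetical names,
// not the actual patch): instead of ever moving the logits, translate the
// requested output index through the stored permutation at access time.
static const float * logits_ith(const std::vector<float> &   logits,   // rows of n_vocab floats
                                const std::vector<int32_t> & out_ids,  // permutation; empty = identity
                                int n_vocab, int32_t i) {
    const int32_t row = out_ids.empty() ? i : out_ids[i]; // redirect, no data swap
    return &logits[(size_t) row * n_vocab];
}
```

This would avoid the copy entirely, though any caller that reads a contiguous range of logits through a single pointer (see the perplexity.cpp discussion below) would still need the data to be physically in order.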
Only because of llama.cpp/src/llama-context.cpp lines 1221 to 1225 in e4868d1: if (… (Right, this comment is redundant with #14853 (comment) and #14853 (comment))
Actually, because of how llama.cpp/tools/perplexity/perplexity.cpp (line 612 in e4868d1) works: the above assumes the logits after the i-th one are in the correct order. There might need to be an API for logits ranges.
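For context, the access pattern in question looks roughly like this (a simplified sketch, not the exact perplexity.cpp code): a pointer obtained for one output index is also used to read the logits of the outputs that follow it, which is only valid if the whole buffer is already in batch order.

```cpp
#include <cstddef>
#include <cstdint>
#include "llama.h"

// Simplified sketch of the range-style access pattern (not the exact
// perplexity.cpp code). The pointer returned for output `first` is also
// used to read the logits of the outputs that follow it, so this is valid
// only if the backing buffer is contiguous and already in batch order.
static void process_logits_range(llama_context * ctx, int32_t first, int32_t count, int n_vocab) {
    const float * logits = llama_get_logits_ith(ctx, first); // row of output `first`

    for (int32_t j = 0; j < count; ++j) {
        // assumes outputs first .. first+count-1 are stored consecutively
        const float * logits_j = logits + (size_t) j * n_vocab;
        (void) logits_j; // ... compute the log-prob of the observed token here ...
    }
}
```

This is why a lazy data swap still works for such a caller, while pure index redirection would not: the redirected rows would no longer be contiguous in memory.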
This usage of …
It is indirectly stated in the comment that says …
context : perform output reorder lazily upon access after sync (ggml-org#14853)
* context : perform output reorder lazily upon access after sync
ggml-ci
* cont : add TODO
* origin/master:
  docs : update HOWTO-add-model.md for ModelBase and new model classes (ggml-org#14874)
  ggml : remove invalid portPos specifiers from dot files (ggml-org#14838)
  context : restore preemptive sched reset when LLAMA_SET_ROWS=0 (ggml-org#14870)
  mtmd : fix 32-bit narrowing issue in export-lora and mtmd clip (ggml-org#14503)
  rpc : check for null buffers in get/set/copy tensor endpoints (ggml-org#14868)
  sched : fix multiple evaluations of the same graph with pipeline parallelism (ggml-org#14855)
  musa: upgrade musa sdk to rc4.2.0 (ggml-org#14498)
  sync : ggml
  cmake : fix usage issues (ggml/1257)
  ggml-cpu : remove stdlib include from repack.cpp (ggml/1276)
  context : perform output reorder lazily upon access after sync (ggml-org#14853)
  chat : fix kimi-k2 chat template (ggml-org#14852)
  sycl: fixed semantics of block offset calculation (ggml-org#14814)
  llama : fix MiniCPM inference after Granite Four changes (ggml-org#14850)
  docs: add libcurl-dev install hint for Linux distros (ggml-org#14801)
  metal : fix fusion across different encoders (ggml-org#14849)
  sycl: fix undefined variable in work group size check (ggml-org#14843)
  convert : text-only support for GLM-4.1V-9B-Thinking (ggml-org#14823)
  CUDA: fix overflow in FA, tune performance (ggml-org#14840)
  CUDA: fix compilation with GGML_CUDA_F16 (ggml-org#14837)
ref #14795 (comment)
After processing a batch, remember the indices that have to be swapped, and apply the data swap (of logits and embeddings) lazily, upon access via llama_get_...
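A minimal sketch of that mechanism (hypothetical names; the real code is in src/llama-context.cpp): decode records only the permutation, and the buffers are physically reordered at most once, on first access.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Minimal sketch of the lazy reorder (hypothetical names; the real code is
// in src/llama-context.cpp). Decode records only the permutation; the
// buffers are physically reordered at most once, on first access.
struct output_state {
    std::vector<float>   logits;   // n_outputs rows of n_vocab floats
    std::vector<float>   embd;     // n_outputs rows of n_embd  floats
    std::vector<int32_t> out_ids;  // out_ids[i] = row currently holding output i; empty = in order
    int n_vocab = 0;
    int n_embd  = 0;

    // deferred reorder, applied at most once before any data access
    void apply_reorder() {
        if (out_ids.empty()) {
            return; // nothing pending
        }
        std::vector<float> tmp(logits.size());
        for (size_t i = 0; i < out_ids.size(); ++i) {
            std::memcpy(&tmp[i * n_vocab],
                        &logits[(size_t) out_ids[i] * n_vocab],
                        n_vocab * sizeof(float));
        }
        logits.swap(tmp);
        // ... same gather for embd ...
        out_ids.clear(); // mark as done so the swap happens only once
    }

    const float * get_logits_ith(int32_t i) {
        apply_reorder(); // lazy: the cost is paid only when data is read
        return &logits[(size_t) i * n_vocab];
    }
};
```

With this arrangement, callers that never read the outputs (or read them in an order-insensitive way) never pay for the reorder, while llama_get_...-style accessors see the same batch-ordered data as before.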