server: propagate task index to response objects for batch requests #1303

Merged
ikawrakow merged 1 commit into ikawrakow:main from jcjolley:fix/batch-response-index
Feb 24, 2026

Conversation

@jcjolley jcjolley commented Feb 22, 2026

Summary

When multiple prompts are sent in a single /v1/completions request (batch mode), each response object should carry the correct index so the client can match results to their corresponding input prompts.

Currently, res->index is not set in send_partial_response, send_final_response, or send_embedding, causing all responses in a batch to report index: 0.

This PR sets res->index = slot.task->index in all three code paths.
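As a rough illustration of the change described above, the sketch below copies a batch index from a task onto the result built for it. The types `task_sketch`, `slot_sketch`, and `result_sketch` and the helpers `make_result` and `batch_indices` are hypothetical stand-ins, not the server's actual definitions; only the assignment `res->index = slot.task->index` is taken from this PR.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical, simplified stand-ins for the server's task, slot and result types.
struct task_sketch {
    int id    = -1;
    int index = 0;   // position of this prompt within the batch request
};

struct slot_sketch {
    std::unique_ptr<task_sketch> task;
};

struct result_sketch {
    int id    = -1;
    int index = 0;   // stays 0 unless it is copied from the task
};

// Mirrors the shape of the fix: copy the task's batch index onto the result.
inline std::unique_ptr<result_sketch> make_result(const slot_sketch & slot) {
    assert(slot.task && "slot must have an active task");
    auto res = std::make_unique<result_sketch>();
    res->id    = slot.task->id;
    res->index = slot.task->index;   // the assignment this PR adds
    return res;
}

// Build one result per prompt in a batch of n prompts and collect the indices.
inline std::vector<int> batch_indices(int n) {
    std::vector<int> out;
    for (int i = 0; i < n; ++i) {
        slot_sketch slot;
        slot.task        = std::make_unique<task_sketch>();
        slot.task->id    = 100 + i;   // arbitrary illustrative task id
        slot.task->index = i;
        out.push_back(make_result(slot)->index);
    }
    return out;
}
```

Without the marked assignment, every `result_sketch` would keep its default `index` of 0, which matches the symptom reported here.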

Reproduction

curl -s http://127.0.0.1:8083/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt":["Hello world","Goodbye world","Test prompt"],"max_tokens":1,"temperature":0,"logprobs":5}'

Before this fix, all three response objects had index: 0. After, they correctly report index: 0, index: 1, index: 2.
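To show why the index matters on the client side, here is a minimal sketch of matching results back to prompts. It assumes the client has already parsed each response object into a hypothetical `choice_sketch` holding the `index` and the generated text; JSON parsing is left out, and `order_by_index` is an illustrative helper, not part of any client library.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical parsed response object: only the fields needed for matching.
struct choice_sketch {
    int         index;
    std::string text;
};

// Return texts ordered so that element i belongs to prompt i.
// Entries with an out-of-range index are ignored; later entries with the
// same index overwrite earlier ones.
inline std::vector<std::string> order_by_index(std::vector<choice_sketch> choices) {
    std::vector<std::string> ordered(choices.size());
    for (auto & c : choices) {
        if (c.index >= 0 && static_cast<std::size_t>(c.index) < ordered.size()) {
            ordered[static_cast<std::size_t>(c.index)] = std::move(c.text);
        }
    }
    return ordered;
}
```

With correct indices the three results land in three separate positions. If every result reports `index: 0`, they all collide in the first position and the client cannot tell which text belongs to which prompt.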

Test plan

  • Batch /v1/completions with 3 prompts returns correct indices
  • Streaming mode returns correct indices on partial responses
  • /v1/embeddings batch returns correct indices
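The non-streaming checks in this plan could be automated along the following lines. This is only a sketch: `indices_cover_batch` is a hypothetical helper that takes the `index` values collected from the final response objects and the number of prompts sent, and it does not apply as-is to streaming, where each prompt produces several partial responses with the same index.

```cpp
#include <cstddef>
#include <vector>

// True if the indices are exactly 0..n_prompts-1, each appearing once, in any order.
inline bool indices_cover_batch(const std::vector<int> & indices, std::size_t n_prompts) {
    if (indices.size() != n_prompts) {
        return false;
    }
    std::vector<bool> seen(n_prompts, false);
    for (int idx : indices) {
        if (idx < 0 || static_cast<std::size_t>(idx) >= n_prompts) {
            return false;   // index outside the batch
        }
        if (seen[static_cast<std::size_t>(idx)]) {
            return false;   // duplicate index, e.g. every response reporting 0
        }
        seen[static_cast<std::size_t>(idx)] = true;
    }
    return true;
}
```

For the three-prompt reproduction, the pre-fix indices `{0, 0, 0}` would fail this check and the post-fix indices `{0, 1, 2}` would pass.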

Generated with Devin

When multiple prompts are sent in a single /v1/completions request,
each response needs to carry the correct index so the client can
match results to their corresponding prompts. The index field was
not being set on partial responses, final responses, or embedding
responses, causing batch results to all report index 0.

Set res->index = slot.task->index in send_partial_response,
send_final_response, and send_embedding.

Generated with [Devin](https://cli.devin.ai/docs)

Co-Authored-By: Devin <noreply@cognition.ai>
@ikawrakow ikawrakow merged commit 68431b0 into ikawrakow:main Feb 24, 2026
Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request Feb 26, 2026
Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request Feb 26, 2026
abc-nix pushed a commit to abc-nix/ik_llama.cpp that referenced this pull request Feb 26, 2026
…kawrakow#1303)

Co-authored-by: Joshua Jolley <jjolley@clearwateranalytics.com>
Co-authored-by: Devin <noreply@cognition.ai>
abc-nix pushed a commit to abc-nix/ik_llama.cpp that referenced this pull request Feb 26, 2026
* Better estimate for max. number of compute nodes

* Just in case

server: fix crash from adaptive p (ikawrakow#1304)

Co-authored-by: firecoperana <firecoperana>

Fix tool call for Qwen3.5 (ikawrakow#1300)

* Fix tool call for Qwen3.5

Loosely based on mainline changes from:
* ggml-org/llama.cpp#19635
* ggml-org/llama.cpp#19765

Also need to change the grammar to allow the model to make multiple
tool calls in a row. This was likely broken for Qwen3 Coder prior to
this commit.

* Fix the grammar for the subsequent parameters after the first one

Graph parallel for Qwen3-Next (ikawrakow#1292)

* WIP

* This works, but is slower than split mode layer

Fix llm_arch_is_hybrid (ikawrakow#1305)

Fix max nodes (again) (ikawrakow#1306)

Fix typo in merge-up-gate-experts argument (ikawrakow#1311)

llama-quantize: --dry-run option (ikawrakow#1309)

Slightly better graph parallel for Qwen3-Next (ikawrakow#1307)

* Make sure we pick the reduced tensor from the right GPU

* Minor

Minor delta-net tweak (ikawrakow#1308)

* Make sure we pick the reduced tensor from the right GPU

* Minor

* Minor delta-net tweak

adaptive p: collect probability before logit bias (ikawrakow#1314)

server: propagate task index to response objects for batch requests (ikawrakow#1303)

When multiple prompts are sent in a single /v1/completions request,
each response needs to carry the correct index so the client can
match results to their corresponding prompts. The index field was
not being set on partial responses, final responses, or embedding
responses, causing batch results to all report index 0.

Set res->index = slot.task->index in send_partial_response,
send_final_response, and send_embedding.

Generated with [Devin](https://cli.devin.ai/docs)

Co-authored-by: Joshua Jolley <jjolley@clearwateranalytics.com>
Co-authored-by: Devin <noreply@cognition.ai>

Llama-quantize: Partial requant feature (ikawrakow#1313)

* Partial Requant feature for llama-quantize

- Inspired by the recently portcopied --dry-run feature.
- Allows partially requantizing a split quantized .gguf by requantizing only the splits missing from the destination directory.
- Works for GGUFs that are split tensor by tensor as well as by groups of several tensors (though the latter is not well tested beyond 2 tensors per split).
- Vibe coded.

* Create output directory if it doesn't exist in llama-quantize

* Create output directory if it doesn't exist in gguf-split

* Add exit when directory fails to be created on Windows

* Use std::filesystem

* cleanup

Display the size of the tensors overridden during the tensor loading (ikawrakow#1318)

* Display the size of the tensors overridden during the tensor loading

Ex:

`Tensor blk.60.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.60.ffn_up_exps.weight buffer type overriden to CPU`

become

`Tensor blk.60.ffn_up_exps.weight (size = 668467200 bytes) buffer type overriden to CPU
Tensor blk.60.ffn_gate_exps.weight (size = 668467200 bytes) buffer type overriden to CPU`

And move the size of the unnamed buffer overrides, which is displayed later, to debug output.

Ex : `llm_load_tensors:        CPU buffer size =   XXX.XX MiB`

That double display is cluttering the screen without being very informative.

* change bytes display to MiB.

Co-authored-by: Kawrakow <iwankawrakow@gmail.com>

---------

Co-authored-by: Kawrakow <iwankawrakow@gmail.com>

Fused delta-net (ikawrakow#1315)

* Revive fused delta-net

* Add command line argument for fused delta net

* Simplify/improve CUDA delta-net

* Add -fdn to llama-bench

* More CUDA fused delta net optimizations

* CPU optimizations

* Much faster fused delta-net on the CPU

It seems it is faster than the chunked implementation!

* Change meaning of fdn from bool flag to threshold value

* Use eps = 1e-6

* Give some nodes a name

Fix KT quantization yet again (ikawrakow#1321)

* Fix KT quantization yet again

* Add same 1e-16f check for all quants in iqk_quantize.cpp

* Fixes for k-quants

* Also this one

server: enable checkpoint for recurrent models (ikawrakow#1310)

* server: enable checkpoint for recurrent models

create checkpoint after cancel

fix ban string and rm context during rewind

add checkpoint interval

only save recurrent cache

* save checkpoint during pp

---------

Co-authored-by: firecoperana <firecoperana>

Faster quantization for MoE models with many experts (ikawrakow#1322)

Fused delta net 2 (ikawrakow#1320)

* Revive fused delta-net

* Add command line argument for fused delta net

* Simplify/improve CUDA delta-net

* Add -fdn to llama-bench

* More CUDA fused delta net optimizations

* CPU optimizations

* Much faster fused delta-net on the CPU

It seems it is faster than the chunked implementation!

* Change meaning of fdn from bool flag to threshold value

* Use eps = 1e-6

* Give some nodes a name

* Don't re-apply L2 norm - it has already been done

* This seems quite a bit better

* More tweaks

* Restore per context buffer size log

Not everybody uses models split in 2000 parts, and those who do,
actually want to see the buffer sizes.

Adding support for dense Qwen-3.5 models (ikawrakow#1326)

add directio to llama-bench
Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request Feb 26, 2026
Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request Feb 26, 2026
Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request Feb 27, 2026
@jcjolley jcjolley deleted the fix/batch-response-index branch February 28, 2026 01:44
Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request Mar 1, 2026
Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request Mar 1, 2026
Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request Mar 2, 2026
Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request Mar 2, 2026
Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request Mar 3, 2026
Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request Mar 5, 2026
Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request Mar 5, 2026
Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request Mar 5, 2026
Nexesenex added a commit to Nexesenex/ik_llama.cpp.nxs that referenced this pull request Mar 7, 2026

3 participants