server: reset cache tokens after pp stops halfway #1787

Merged
ikawrakow merged 1 commit into main from fcp/fix_pp_stop on May 13, 2026

Conversation

@firecoperana

Collaborator

While working on #1722, I noticed that when prompt processing (PP) stops halfway, the server treats the current batch as if it had completed, which is wrong. I don't know how to find where PP stopped inside the batch, so the fix is to reset the token status and the KV cache to the start of the batch.

This also relaxes the condition for resetting the llama decode flag. Could fix #1673 (comment).
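The rollback described above can be illustrated with a minimal sketch. This is not the code from this PR: `ToySlot`, `BatchStart`, `begin_batch`, `add_to_batch`, and `rollback_batch` are hypothetical names, and the KV cache is reduced to a single position counter. It only shows the idea of snapshotting the slot state at the start of a batch and restoring it if prompt processing is interrupted.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a server slot; not the real server types.
struct ToySlot {
    std::vector<int> cache_tokens; // tokens the server believes are in the KV cache
    std::size_t      kv_size = 0;  // positions held by the (toy) KV cache
};

// Snapshot of the slot taken before a batch is queued.
struct BatchStart {
    std::size_t n_cache_tokens;
    std::size_t kv_size;
};

BatchStart begin_batch(const ToySlot & slot) {
    return { slot.cache_tokens.size(), slot.kv_size };
}

// Optimistically record the batch as cached, as if decode were going to succeed.
void add_to_batch(ToySlot & slot, const std::vector<int> & tokens) {
    slot.cache_tokens.insert(slot.cache_tokens.end(), tokens.begin(), tokens.end());
    slot.kv_size += tokens.size();
}

// If decode was interrupted, discard everything recorded since the snapshot.
void rollback_batch(ToySlot & slot, const BatchStart & start) {
    assert(start.n_cache_tokens <= slot.cache_tokens.size());
    slot.cache_tokens.resize(start.n_cache_tokens);
    slot.kv_size = start.kv_size;
}
```

The trade-off is that any u-batches that did finish before the interruption are thrown away and must be recomputed on the next request, which is the price of not knowing where inside the batch processing stopped.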

@ikawrakow

Owner

How can PP stop halfway? Via the user sending a stop command?

@firecoperana firecoperana changed the title server: reset cache tokens after pp stops server: reset cache tokens after pp stops halfway May 12, 2026
@firecoperana

Collaborator Author

yes

@ikawrakow

Owner

The batch is processed in u-batches. Checks for cancellation are only done after each u-batch has been processed. If you call llama_kv_cache_seq_pos_max with the sequence id of the slot, it should give you the last position that has been processed. Or not?

Scratch that. I see that the check for cancellation is done after the sequence has been added to the KV cache cells, but before the u-batch has actually been computed. This is kind of stupid. I guess we need to move the cancellation check to either the beginning or the end of the loop over u-batches, so that the system is in a consistent state when a computation is cancelled. The same inconsistent state can also occur during token generation (TG), but TG is much faster, so the probability of the cancellation arriving after the KV cache cells have been manipulated but before the token has been computed is much smaller.
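To make the inconsistency concrete, here is a toy model of the u-batch loop. It is a sketch under stated assumptions, not the actual `llama_decode` internals: `ToyContext`, `CheckPlacement`, and `toy_decode` are hypothetical, the KV cache and the compute step are reduced to counters, and the cancellation is simulated as arriving once a given number of u-batches have been registered.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the decode context; counters only.
struct ToyContext {
    uint32_t kv_cells_used   = 0; // positions registered in the KV cache
    uint32_t tokens_computed = 0; // positions whose u-batch was actually computed
    uint32_t cancel_after_reservations = 0; // simulated stop request; 0 = never
};

enum class CheckPlacement { AfterKvUpdate, EndOfLoop };

// Returns true if all tokens were processed, false if the run was cancelled.
bool toy_decode(ToyContext & ctx, uint32_t n_tokens_all, uint32_t n_ubatch, CheckPlacement where) {
    assert(n_ubatch > 0);
    uint32_t n_reserved = 0;
    for (uint32_t cur_token = 0; cur_token < n_tokens_all; ) {
        const uint32_t n_tokens = std::min(n_ubatch, n_tokens_all - cur_token);

        // Step 1: register the u-batch positions in the KV cache.
        ctx.kv_cells_used += n_tokens;
        ++n_reserved;
        const bool stop_requested = ctx.cancel_after_reservations != 0 &&
                                    n_reserved >= ctx.cancel_after_reservations;

        // Checking here leaves the KV cache ahead of the computed tokens.
        if (where == CheckPlacement::AfterKvUpdate && stop_requested) {
            return false;
        }

        // Step 2: compute the u-batch.
        ctx.tokens_computed += n_tokens;
        cur_token += n_tokens;

        // Checking here stops on a u-batch boundary, so both counters agree.
        if (where == CheckPlacement::EndOfLoop && stop_requested) {
            return false;
        }
    }
    return true;
}
```

With the check placed after the KV update, a cancellation leaves `kv_cells_used` larger than `tokens_computed`. With the check at the end of the loop, the two counters match when the function returns, so the caller can trust the KV cache contents up to the last registered position.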

@firecoperana

Collaborator Author

Yeah, I didn't realize that, since I'm not familiar with the internals of llama_decode. Can you show me where I should do the cancellation check?

@ikawrakow

Owner

Either at the beginning of this loop

for (uint32_t cur_token = 0; cur_token < n_tokens_all; ) {

or at the end of it.

Instead of checking here:

if (stop_internal_decode) {

@firecoperana

firecoperana commented May 13, 2026

Collaborator Author

Thanks! The stop signal works reliably if I put it at the end of the loop, but not at the beginning of the loop for some reason.

@ikawrakow ikawrakow merged commit cdc288b into main May 13, 2026
@firecoperana firecoperana deleted the fcp/fix_pp_stop branch May 31, 2026 14:03

2 participants