server: reset cache tokens after pp stops halfway #1787
|
How can PP stop halfway? Via the user sending a stop command? |
|
yes |
|
The batch is processed in u-batches. Checks for cancellation are only done after each u-batch has been processed. Scratch that. I see that the check for cancellation is done after the sequence has been added to the KV cache cells, but before the u-batch has actually been computed. This is kind of stupid. I guess we need to move the cancellation check to either the beginning or the end of the loop over u-batches, so that the system is in a consistent state when a computation is cancelled. The same inconsistent state can also be reached during TG, but TG is much faster, so the probability of the cancellation arriving after the KV cache cells have been manipulated but before the token has been computed is much smaller. |
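To make the ordering concrete, here is a rough sketch of a u-batch loop with the cancellation check at the end of each iteration. It is only a toy model: `ToyKvCache`, `decode_batch` and the `should_cancel` callback are hypothetical names made up for illustration, not the actual llama_decode code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>

// Hypothetical stand-in for the KV cache: it only counts cells.
struct ToyKvCache {
    std::size_t cells_used     = 0;  // cells reserved for the sequence
    std::size_t cells_computed = 0;  // cells whose contents were actually computed
};

// Processes n_tokens in chunks of n_ubatch and returns how many tokens were
// fully processed. The cancellation check sits at the end of each iteration,
// so reserving cells and computing them are never separated by an early exit.
std::size_t decode_batch(ToyKvCache & kv,
                         std::size_t n_tokens,
                         std::size_t n_ubatch,
                         const std::function<bool()> & should_cancel) {
    assert(n_ubatch > 0);
    std::size_t done = 0;
    while (done < n_tokens) {
        const std::size_t n = std::min(n_ubatch, n_tokens - done);

        kv.cells_used     += n;  // 1. add the sequence to the KV cache cells
        kv.cells_computed += n;  // 2. compute the u-batch (placeholder for real work)
        done              += n;

        if (should_cancel()) {   // 3. only now look at the stop request
            break;
        }
    }
    return done;
}
```

With the check in this position, a cancelled call always leaves `cells_used == cells_computed`, and the return value tells the caller how many tokens really made it into the cache.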
|
Yeah, I did not realize that, since I'm not familiar with the internals of llama_decode. Can you show me where I should do the cancellation check? |
abc2429 to 7ff12d6
|
Thanks! The stop signal works reliably if I put the check at the end of the loop, but not at the beginning of the loop, for some reason. |
When I was working on #1722, I noticed that when PP stops halfway, the current batch is treated as if it had been completed, which is wrong. I don't know how to find where PP stopped inside the batch, so the fix is to reset the state of the tokens and the KV cache to the start of the batch.
Also relax the condition for resetting the llama decode flag. This could fix #1673 (comment).
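As a rough illustration of the "reset to the start of the batch" idea, the sketch below snapshots the bookkeeping before a batch and restores it if decoding reports an interruption. `ToySlot`, its fields and `process_batch` are hypothetical names for this example only and do not mirror the server's real data structures.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-slot bookkeeping, for illustration only.
struct ToySlot {
    std::vector<int> cache_tokens;  // tokens believed to be in the KV cache
    std::size_t      n_past = 0;    // number of positions considered processed
};

// Appends `batch` to the slot, runs `decode`, and rolls the bookkeeping back
// to the start of the batch when `decode` reports that it was interrupted.
// Returns true when the batch completed.
template <typename DecodeFn>
bool process_batch(ToySlot & slot, const std::vector<int> & batch, DecodeFn decode) {
    const std::size_t start = slot.cache_tokens.size();  // snapshot at batch start
    assert(start == slot.n_past);

    slot.cache_tokens.insert(slot.cache_tokens.end(), batch.begin(), batch.end());
    slot.n_past += batch.size();

    const bool completed = decode(batch);
    if (!completed) {
        // We do not know where inside the batch processing stopped,
        // so the whole batch is discarded rather than a part of it.
        slot.cache_tokens.resize(start);
        slot.n_past = start;
        // A real server would also have to drop the matching KV cache entries here.
    }
    return completed;
}
```

Discarding the whole batch costs some recomputation on the next request, but it avoids having to locate the exact stopping point inside the batch.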