Commit eac5ab3

fix: Prevent crash on full prompt cache hit (100% match)
When a repeated prompt matched 100% of the cached tokens, the remaining token slice was empty (0 tokens). Passing this empty slice to the model caused a '[reshape] Cannot infer the shape of an empty array' fatal error. Fix: replay the last cached token (trimming the KV cache back by 1) so the model always receives at least one token and can produce next-token logits.
1 parent: 32dd183

1 file changed

Sources/SwiftLM/Server.swift (9 additions, 1 deletion)
@@ -961,7 +961,15 @@ func handleChatCompletion(
     if let cachedCount = await promptCache.restore(newTokens: promptTokens, into: cache) {
         // Cache hit: KV state is pre-populated up to cachedCount tokens.
         // Only compute the remaining (new) tokens.
-        let remainingTokens = lmInput.text.tokens[cachedCount...]
+        var startIndex = cachedCount
+        if startIndex >= lmInput.text.tokens.count {
+            // Full match: all tokens are cached. We still need to feed at least
+            // the last token so the model can produce next-token logits.
+            startIndex = lmInput.text.tokens.count - 1
+            // Trim the KV cache back by 1 to avoid double-counting the replayed token.
+            for layer in cache { layer.trim(1) }
+        }
+        let remainingTokens = lmInput.text.tokens[startIndex...]
         let trimmedInput = LMInput(tokens: remainingTokens)
         return try MLXLMCommon.generate(
             input: trimmedInput, cache: cache, parameters: params, context: context
