Summary
qmd query crashes during reranking when the combined input (query + document chunk + Qwen3 template overhead) exceeds RERANK_CONTEXT_SIZE = 2048. The error is deterministic and reproducible.
Environment
- QMD version: 1.1.0 (also reproduced on 1.0.7)
- OS: Rocky Linux 9 (x86_64)
- Node.js: v22.22.0
- GPU: NVIDIA RTX 3090 (24GB VRAM)
- Content: ~345 markdown files, primarily CJK (Chinese) text
- Index: 1386 chunks from 338 documents
Error
$ qmd query "test" --json
├─ test
├─ lex: test examples
├─ lex: test code
├─ vec: practical code examples for test
├─ vec: code examples for common patterns of test
└─ hyde: Here are some practical examples of test in action...
Searching 6 queries...
Reranking 40 chunks...
Error: The input lengths of some of the given documents exceed the context size.
Try to increase the context size to at least 2099 or use another model
that supports longer contexts.
at LlamaRankingContext.rankAll (.../node-llama-cpp/dist/evaluator/LlamaRankingContext.js:50:19)
at LlamaCpp.rerank (.../dist/llm.js:751:82)
Root Cause
In src/llm.ts:
static RERANK_CONTEXT_SIZE = 2048;
The reranker input is: query tokens + chunk tokens + Qwen3 template overhead (~200 tokens).
The comment says chunks are capped at ~800 tokens, so 800 + 200 + query ≈ 1100 should fit. However:
- CJK text tokenizes differently from English: a chunk estimated at ~900 tokens under an English-oriented tokenization may come out considerably longer under the Qwen3 tokenizer.
- Query expansion generates HyDE documents that can be 100+ tokens, pushing the total past 2048.
- The error requests "at least 2099" — only 51 tokens over the limit.
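The budget arithmetic above can be written out as a small sketch. The function names are hypothetical and do not come from the QMD source. The 200-token overhead is the approximate figure quoted in this report, and the 1799/100 split is only an illustrative combination that reproduces the reported total of 2099, not a measured breakdown.

```typescript
// Hypothetical helpers for illustration; not part of QMD or node-llama-cpp.
const RERANK_CONTEXT_SIZE = 2048; // value quoted from src/llm.ts
const TEMPLATE_OVERHEAD = 200;    // approximate Qwen3 template overhead (from this report)

// Total tokens the reranker sees for one (query, chunk) pair.
function rerankInputTokens(queryTokens: number, chunkTokens: number): number {
  return queryTokens + chunkTokens + TEMPLATE_OVERHEAD;
}

// How many tokens a pair exceeds the context by (0 if it fits).
function overflowTokens(queryTokens: number, chunkTokens: number): number {
  return Math.max(0, rerankInputTokens(queryTokens, chunkTokens) - RERANK_CONTEXT_SIZE);
}

// Expected case from the code comment: 800-token chunk, 100-token query.
const expectedTotal = rerankInputTokens(100, 800);

// Illustrative failing case: a chunk that tokenizes to 1799 tokens
// with a 100-token expanded query (assumed numbers, not measured).
const failingTotal = rerankInputTokens(100, 1799);
const failingOverflow = overflowTokens(100, 1799);

console.log({ expectedTotal, failingTotal, failingOverflow });
```

With these assumed inputs, the expected case stays well under 2048, while the illustrative CJK case lands 51 tokens over, matching the size requested by the error message.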
Workaround
Manually changing RERANK_CONTEXT_SIZE to 4096 in the installed dist/llm.js resolves the issue. VRAM impact is modest (~2× per reranking context), well within RTX 3090 capacity.
Suggested Fix
Any one of:
1. Increase the default to 4096 (safest, modest VRAM cost).
2. Dynamic sizing: compute the required context from the longest actual (query + chunk) pair before creating the ranking context, capped at the model's maximum context.
3. Graceful fallback: if a chunk exceeds the context size, skip it during reranking instead of crashing (log a warning and use its retrieval score).
Option 3 is the most robust since it handles arbitrarily long inputs without VRAM growth.
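A minimal sketch of how dynamic sizing and the graceful fallback could fit together, assuming token counts per chunk are already available. The `Candidate` type and both function names are hypothetical and not taken from the QMD source. The 200-token overhead is the approximate figure from this report, and a real implementation would probably want a small safety margin on top of it.

```typescript
// Hypothetical types and helpers; not QMD's or node-llama-cpp's API.
interface Candidate {
  id: string;
  tokens: number;         // chunk length under the reranker's tokenizer (assumed precomputed)
  retrievalScore: number; // score from lexical/vector retrieval
}

const TEMPLATE_OVERHEAD = 200; // approximate figure from this report

// Dynamic sizing: size the context for the longest pair, capped at the model maximum.
function chooseContextSize(
  queryTokens: number,
  candidates: Candidate[],
  modelMaxContext: number,
): number {
  const longest = candidates.reduce((max, c) => Math.max(max, c.tokens), 0);
  return Math.min(queryTokens + longest + TEMPLATE_OVERHEAD, modelMaxContext);
}

// Graceful fallback: split candidates into those that fit and those to skip.
// Skipped candidates would keep their retrieval score instead of a rerank score.
function partitionForRerank(
  queryTokens: number,
  candidates: Candidate[],
  contextSize: number,
): { fits: Candidate[]; skipped: Candidate[] } {
  const fits: Candidate[] = [];
  const skipped: Candidate[] = [];
  for (const c of candidates) {
    const total = queryTokens + c.tokens + TEMPLATE_OVERHEAD;
    if (total <= contextSize) {
      fits.push(c);
    } else {
      console.warn(`Skipping rerank for ${c.id}: ${total} tokens > context ${contextSize}`);
      skipped.push(c);
    }
  }
  return { fits, skipped };
}

// Illustrative data (assumed, not measured).
const candidates: Candidate[] = [
  { id: "a", tokens: 800, retrievalScore: 0.5 },
  { id: "b", tokens: 1799, retrievalScore: 0.4 },
];
const dynamicSize = chooseContextSize(100, candidates, 40960);
const cappedSize = chooseContextSize(100, candidates, 2048);
const result = partitionForRerank(100, candidates, cappedSize);
```

The two pieces complement each other: dynamic sizing avoids allocating more context than a query needs, and the partition step guarantees that a chunk which still exceeds the cap degrades to its retrieval score rather than aborting the whole query.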
Related
- Changelog note in v1.0.0: "right-sized reranker context (40960 → 2048 tokens, 17x less memory)"
- The reduction from 40960 to 2048 was too aggressive for CJK content with long query expansions
Thank you for building QMD — it's excellent! 🙏