RERANK_CONTEXT_SIZE (2048) too small — qmd query crashes on CJK content

### Summary

`qmd query` crashes during reranking when the combined input (query + document chunk + Qwen3 template overhead) exceeds `RERANK_CONTEXT_SIZE = 2048`. The error is deterministic and reproducible.

### Environment

- QMD version: 1.1.0 (also reproduced on 1.0.7)
- OS: Rocky Linux 9 (x86_64)
- Node.js: v22.22.0
- GPU: NVIDIA RTX 3090 (24GB VRAM)
- Content: ~345 markdown files, primarily CJK (Chinese) text
- Index: 1386 chunks from 338 documents

### Error

```
$ qmd query "test" --json
├─ test
├─ lex: test examples
├─ lex: test code
├─ vec: practical code examples for test
├─ vec: code examples for common patterns of test
└─ hyde: Here are some practical examples of test in action...
Searching 6 queries...
Reranking 40 chunks...

Error: The input lengths of some of the given documents exceed the context size.
Try to increase the context size to at least 2099 or use another model
that supports longer contexts.
    at LlamaRankingContext.rankAll (.../node-llama-cpp/dist/evaluator/LlamaRankingContext.js:50:19)
    at LlamaCpp.rerank (.../dist/llm.js:751:82)
```

### Root Cause

In `src/llm.ts`:

```typescript
static RERANK_CONTEXT_SIZE = 2048;
```

The reranker input is: **query tokens + chunk tokens + Qwen3 template overhead (~200 tokens)**.

The comment says chunks are capped at ~800 tokens, so `800 + 200 + query ≈ 1100` should fit comfortably. In practice it does not:

1. **CJK tokenization** produces different token counts than English: a chunk that measures ~900 tokens under an English-oriented estimate may come out longer under the Qwen3 tokenizer.
2. **Query expansion** generates HyDE documents that can run to 100+ tokens, which helps push the total past 2048.
3. The error asks for "at least 2099", which is only 51 tokens over the limit.
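To make the budget arithmetic concrete, here is a minimal sketch. The helper names are hypothetical (they are not from QMD's source), the 200-token overhead is the rough estimate from this issue, and the 100-token query length is only an illustrative value.

```typescript
// Hypothetical helpers for illustration; none of these names come from QMD.
const RERANK_CONTEXT_SIZE = 2048; // value quoted from src/llm.ts in this issue
const TEMPLATE_OVERHEAD = 200;    // rough Qwen3 template estimate, an assumption

// Total tokens the reranker sees for one (query, chunk) pair.
function requiredContext(
  queryTokens: number,
  chunkTokens: number,
  overhead: number = TEMPLATE_OVERHEAD,
): number {
  return queryTokens + chunkTokens + overhead;
}

// Largest chunk (in reranker tokens) that still fits for a given query length.
function maxChunkTokens(
  queryTokens: number,
  contextSize: number = RERANK_CONTEXT_SIZE,
  overhead: number = TEMPLATE_OVERHEAD,
): number {
  return Math.max(0, contextSize - overhead - queryTokens);
}

// How far a required size overshoots the configured context (0 if it fits).
function overflowTokens(
  required: number,
  contextSize: number = RERANK_CONTEXT_SIZE,
): number {
  return Math.max(0, required - contextSize);
}

console.log(requiredContext(100, 800)); // 1100: the case the comment expects
console.log(maxChunkTokens(100));       // 1748: chunk budget for a 100-token query
console.log(overflowTokens(2099));      // 51: the overshoot reported in the error
```

Under these assumptions, a chunk only has to reach roughly 1750 tokens in the reranker's tokenizer to trigger the crash, which is about twice the nominal ~800-token cap.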

### Workaround

Manually changing `RERANK_CONTEXT_SIZE` to `4096` in the installed `dist/llm.js` resolves the issue. VRAM impact is modest (~2× per reranking context), well within RTX 3090 capacity.

### Suggested Fix

Either:

1. **Increase the default** to 4096 (safest, modest VRAM cost)
2. **Dynamic sizing**: compute the required context from the actual longest (query + chunk) pair before creating the ranking context, with a cap at the model's max context
3. **Graceful fallback**: if a chunk exceeds the context size, skip it during reranking rather than crashing (log a warning, use the retrieval score instead)
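
A rough sketch of how option 2 might look. Everything here is an assumption for illustration: the `RerankPair` shape, the `computeContextSize` name, the 256-token rounding step, and the idea that token counts are already available from the reranker's tokenizer. The `modelMax` value of 40960 in the usage lines is simply the pre-1.0.0 context size mentioned in the changelog, not a confirmed model limit.

```typescript
// Hypothetical types and names; not QMD's actual API.
interface RerankPair {
  queryTokens: number; // query length in reranker tokens
  chunkTokens: number; // chunk length in reranker tokens
}

interface SizingOptions {
  overhead: number; // template overhead per pair (estimated ~200 in this issue)
  minSize: number;  // never go below this context size
  modelMax: number; // hard cap: the model's maximum context
  step?: number;    // round up to a multiple of this for headroom
}

// Pick a context size large enough for the longest pair, clamped to
// [minSize, modelMax].
function computeContextSize(pairs: RerankPair[], opts: SizingOptions): number {
  const step = opts.step ?? 256;
  let longest = 0;
  for (const p of pairs) {
    longest = Math.max(longest, p.queryTokens + p.chunkTokens + opts.overhead);
  }
  const rounded = Math.ceil(longest / step) * step;
  return Math.min(Math.max(rounded, opts.minSize), opts.modelMax);
}

const opts: SizingOptions = { overhead: 200, minSize: 2048, modelMax: 40960 };

// Typical case stays at the current default.
console.log(computeContextSize([{ queryTokens: 100, chunkTokens: 800 }], opts)); // 2048

// A pair needing 2099 tokens gets the next 256-token step.
console.log(computeContextSize([{ queryTokens: 100, chunkTokens: 1799 }], opts)); // 2304
```

Pairs that exceed `modelMax` would still not fit after clamping, so dynamic sizing alone does not remove the need to truncate or skip oversized inputs.
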

Option 3 is the most robust since it handles arbitrarily long inputs without VRAM growth.
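
The fallback could be sketched roughly as follows. This is not QMD's code: `Candidate`, `Scored`, `rerankWithFallback`, and the `rankFn` callback are hypothetical names, and `rankFn` stands in for whatever call actually runs the reranker (which in practice would likely be async). Token counts are assumed to come from the reranker's own tokenizer.

```typescript
// Hypothetical shapes for illustration only.
interface Candidate {
  id: string;
  tokens: number;         // chunk length in reranker tokens
  retrievalScore: number; // score from the lexical/vector search stage
}

interface Scored {
  id: string;
  score: number;
  reranked: boolean; // false when the retrieval score was used as a fallback
}

function rerankWithFallback(
  queryTokens: number,
  candidates: Candidate[],
  contextSize: number,
  overhead: number,
  rankFn: (fitting: Candidate[]) => number[], // stand-in for the real reranker
  warn: (msg: string) => void = console.warn,
): Scored[] {
  // Only send pairs that fit in the ranking context.
  const fitting = candidates.filter(
    (c) => queryTokens + c.tokens + overhead <= contextSize,
  );
  const scores = rankFn(fitting);

  const byId = new Map<string, number>();
  fitting.forEach((c, i) => {
    const s = scores[i];
    if (s !== undefined) byId.set(c.id, s);
  });

  // Preserve input order; oversized chunks keep their retrieval score.
  return candidates.map((c) => {
    const s = byId.get(c.id);
    if (s === undefined) {
      warn(`rerank: skipping chunk ${c.id} (${c.tokens} tokens exceeds context)`);
      return { id: c.id, score: c.retrievalScore, reranked: false };
    }
    return { id: c.id, score: s, reranked: true };
  });
}
```

One caveat with this design: retrieval scores and reranker scores are not necessarily on the same scale, so a real implementation would probably need to normalize them or place skipped chunks after the reranked ones.
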

### Related

- Changelog note in v1.0.0: "right-sized reranker context (40960 → 2048 tokens, 17x less memory)"
- The reduction from 40960 to 2048 was too aggressive for CJK content with long query expansions

Thank you for building QMD — it's excellent! 🙏
