## Description

When ingesting repositories containing large source files, embedding fails with a `ContextWindowExceededError` because the full file content is sent to the embedding model without any truncation or chunking.
## Error

```
litellm.ContextWindowExceededError: litellm.BadRequestError: AzureException ContextWindowExceededError - This model's maximum context length is 8192 tokens, however you requested 14298 tokens (14298 in your prompt; 0 for the completion). Please reduce your prompt; or completion length.
model=openai/text-embedding-3-large
```
## Root Cause

The code parser preserves files without chunking (`code.py:56` — "Preserves directory structure without chunking"). When these files are vectorized, `TextEmbeddingHandler.on_dequeue()` in `collection_schemas.py:254` passes the raw text directly to the embedder:

```python
result: EmbedResult = await asyncio.to_thread(
    self._embedder.embed, embedding_msg.message
)
```

There is no truncation of the input text before the embedding call. The `OpenAIDenseEmbedder` also does not implement any input token limit check, unlike `GeminiEmbedder`, which has a `_token_limit` field.
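For reference, the failure can be reproduced outside the pipeline by embedding any oversized string directly through LiteLLM (a minimal sketch; the repeated snippet is illustrative and assumes the same gateway routing as above):

```python
import litellm

# ~5000 repetitions of this snippet is far past the 8192-token
# context window of text-embedding-3-large.
big_text = "def handler(event, context):\n    return event\n" * 5000

# Raises litellm.ContextWindowExceededError, matching the log above.
litellm.embedding(
    model="openai/text-embedding-3-large",
    input=[big_text],
)
```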
The markdown parser has `max_section_size` (1000 tokens) and `max_section_chars` (6000 chars) guardrails that split documents into smaller chunks, but the code parser has no equivalent — source files are stored and embedded whole.
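A comparable guardrail for code could be as simple as fixed-size chunking with overlap. This is a hypothetical sketch, not the markdown parser's actual algorithm; the 6000-char limit mirrors its `max_section_chars` default:

```python
def chunk_source(text: str, max_chars: int = 6000, overlap: int = 200) -> list[str]:
    """Split a source file into character-bounded chunks with a small overlap.

    Character limits are a cheap proxy for token limits: at roughly
    4 chars/token, 6000 chars stays well under an 8192-token window.
    """
    if len(text) <= max_chars:
        return [text]
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        # Prefer to break on a line boundary so chunks stay readable.
        newline = text.rfind("\n", start, end)
        if newline > start and end < len(text):
            end = newline + 1
        chunks.append(text[start:end])
        if end >= len(text):
            break
        # Overlap the next chunk slightly so no context is lost at the seam.
        start = max(end - overlap, start + 1)
    return chunks
```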
## Impact

- Any code file exceeding the embedding model's context window (~8192 tokens for `text-embedding-3-large`) fails vectorization silently (logged as error, skipped)
- The file still exists in the virtual filesystem and its L0/L1 summaries may be embedded successfully, but the L2 (full content) is not vector-searchable
- For large backend repos this can affect a significant number of files
## Suggested Fix

Add input truncation or chunking before the embedding call, via any of the following (a sketch of the first two options follows the list):

- In `TextEmbeddingHandler`: truncate `embedding_msg.message` to a configurable token limit before calling `self._embedder.embed()`
- In `OpenAIDenseEmbedder.embed()`: add a `_token_limit` similar to `GeminiEmbedder` and truncate the input
- In the code parser: chunk large files, similar to how the markdown parser splits by sections
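A minimal sketch of token-limit truncation using tiktoken. The `truncate_to_token_limit` name and its placement are assumptions for illustration, not the repo's actual code; only the `_token_limit` field name is borrowed from `GeminiEmbedder`:

```python
import tiktoken

# cl100k_base is the tokenizer used by text-embedding-3-large.
_ENCODING = tiktoken.get_encoding("cl100k_base")


def truncate_to_token_limit(text: str, token_limit: int = 8192) -> str:
    """Truncate text so it fits within the embedding model's context window.

    Hypothetical helper: could be called in TextEmbeddingHandler.on_dequeue()
    before self._embedder.embed(), or inside OpenAIDenseEmbedder.embed()
    guarded by a _token_limit field like GeminiEmbedder's.
    """
    tokens = _ENCODING.encode(text)
    if len(tokens) <= token_limit:
        return text
    return _ENCODING.decode(tokens[:token_limit])
```

Truncation is the smallest change but drops the tail of large files from the index; chunking in the code parser (the third option) keeps the whole file vector-searchable.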
## Environment

- OpenViking: 0.1.18 (pip install)
- Embedding model: `text-embedding-3-large` (8192-token limit) via LiteLLM gateway
- OS: macOS