
Large code files fail embedding: no input truncation before embedding API call #931

@mertyldrm

Description

When ingesting repositories containing large source files, embedding fails with a ContextWindowExceededError because the full file content is sent to the embedding model without any truncation or chunking.

Error

litellm.ContextWindowExceededError: litellm.BadRequestError: AzureException ContextWindowExceededError - This model's maximum context length is 8192 tokens, however you requested 14298 tokens (14298 in your prompt; 0 for the completion). Please reduce your prompt; or completion length.
model=openai/text-embedding-3-large

Root Cause

The code parser preserves files without chunking (code.py:56 — "Preserves directory structure without chunking"). When these files are vectorized, TextEmbeddingHandler.on_dequeue() in collection_schemas.py:254 passes the raw text directly to the embedder:

result: EmbedResult = await asyncio.to_thread(
    self._embedder.embed, embedding_msg.message
)

There is no truncation of the input text before the embedding call. The OpenAIDenseEmbedder likewise performs no input token-limit check, unlike GeminiEmbedder, which has a _token_limit field.

The markdown parser has max_section_size (1000 tokens) and max_section_chars (6000 chars) guardrails that split documents into smaller chunks, but the code parser has no equivalent — source files are stored and embedded whole.
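
For illustration, an equivalent guardrail for the code parser could look roughly like the sketch below. This is not OpenViking code: chunk_code and the line-based splitting strategy are made up for this issue, the budgets simply reuse the markdown parser's values, and cl100k_base is assumed because it is the tokenizer used by text-embedding-3-large.

import tiktoken

MAX_SECTION_TOKENS = 1000  # reuses the markdown parser's max_section_size
MAX_SECTION_CHARS = 6000   # reuses the markdown parser's max_section_chars

def chunk_code(source: str) -> list[str]:
    """Split a source file into chunks that respect both budgets."""
    enc = tiktoken.get_encoding("cl100k_base")
    chunks: list[str] = []
    current: list[str] = []
    tokens = chars = 0
    for line in source.splitlines(keepends=True):
        line_tokens = len(enc.encode(line))
        # Close the current chunk before either budget would be exceeded.
        # (A single line larger than a budget still becomes its own chunk.)
        if current and (tokens + line_tokens > MAX_SECTION_TOKENS
                        or chars + len(line) > MAX_SECTION_CHARS):
            chunks.append("".join(current))
            current, tokens, chars = [], 0, 0
        current.append(line)
        tokens += line_tokens
        chars += len(line)
    if current:
        chunks.append("".join(current))
    return chunks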

Impact

  • Any code file exceeding the embedding model's context window (~8192 tokens for text-embedding-3-large) fails vectorization; the error is only logged and the file is skipped, so the failure is easy to miss
  • The file still exists in the virtual filesystem and its L0/L1 summaries may be embedded successfully, but the L2 (full content) is not vector-searchable
  • For large backend repositories this can affect many files

Suggested Fix

Add input truncation or chunking before the embedding call, in one of the following places:

  1. In TextEmbeddingHandler: truncate embedding_msg.message to a configurable token limit before calling self._embedder.embed() (see the sketch after this list)
  2. In OpenAIDenseEmbedder.embed(): add a _token_limit similar to GeminiEmbedder and truncate input
  3. In the code parser: chunk large files similar to how the markdown parser splits by sections
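
A minimal sketch of options 1 and 2, assuming the cl100k_base encoding (the tokenizer used by text-embedding-3-large); truncate_to_token_limit and TOKEN_LIMIT are hypothetical names, not existing OpenViking APIs:

import tiktoken

TOKEN_LIMIT = 8192  # context window of text-embedding-3-large

def truncate_to_token_limit(text: str, limit: int = TOKEN_LIMIT) -> str:
    """Return text cut down to at most `limit` tokens."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    if len(tokens) <= limit:
        return text
    return enc.decode(tokens[:limit])

The call site in TextEmbeddingHandler.on_dequeue() could then look like:

result: EmbedResult = await asyncio.to_thread(
    self._embedder.embed, truncate_to_token_limit(embedding_msg.message)
)

Truncation is the smallest change but loses the tail of the file; option 3 (chunking) preserves full-content search at the cost of more vectors per file.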

Environment

  • OpenViking: 0.1.18 (pip install)
  • Embedding model: text-embedding-3-large (8192 token limit) via LiteLLM gateway
  • OS: macOS
