
Large code files fail embedding: no input truncation before embedding API call #931

@mertyldrm

Description

When ingesting repositories containing large source files, embedding fails with a ContextWindowExceededError because the full file content is sent to the embedding model without any truncation or chunking.

Error

litellm.ContextWindowExceededError: litellm.BadRequestError: AzureException ContextWindowExceededError - This model's maximum context length is 8192 tokens, however you requested 14298 tokens (14298 in your prompt; 0 for the completion). Please reduce your prompt; or completion length.
model=openai/text-embedding-3-large

Root Cause

The code parser preserves files without chunking (code.py:56 — "Preserves directory structure without chunking"). When these files are vectorized, TextEmbeddingHandler.on_dequeue() in collection_schemas.py:254 passes the raw text directly to the embedder:

result: EmbedResult = await asyncio.to_thread(
    self._embedder.embed, embedding_msg.message
)

There is no truncation of the input text before the embedding call. The OpenAIDenseEmbedder likewise performs no input token-limit check, unlike GeminiEmbedder, which has a _token_limit field.

The markdown parser has max_section_size (1000 tokens) and max_section_chars (6000 chars) guardrails that split documents into smaller chunks, but the code parser has no equivalent — source files are stored and embedded whole.
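
For illustration, an equivalent guardrail for the code parser could look roughly like the sketch below. This is not OpenViking code: chunk_code and the line-based splitting strategy are made up for this issue, the budgets simply reuse the markdown parser's values, and cl100k_base is assumed because it is the tokenizer used by text-embedding-3-large.

import tiktoken

MAX_SECTION_TOKENS = 1000  # reuses the markdown parser's max_section_size
MAX_SECTION_CHARS = 6000   # reuses the markdown parser's max_section_chars

def chunk_code(source: str) -> list[str]:
    """Split a source file into chunks that respect both budgets."""
    enc = tiktoken.get_encoding("cl100k_base")
    chunks: list[str] = []
    current: list[str] = []
    tokens = chars = 0
    for line in source.splitlines(keepends=True):
        line_tokens = len(enc.encode(line))
        # Close the current chunk before either budget would be exceeded.
        # (A single line larger than a budget still becomes its own chunk.)
        if current and (tokens + line_tokens > MAX_SECTION_TOKENS
                        or chars + len(line) > MAX_SECTION_CHARS):
            chunks.append("".join(current))
            current, tokens, chars = [], 0, 0
        current.append(line)
        tokens += line_tokens
        chars += len(line)
    if current:
        chunks.append("".join(current))
    return chunks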

Impact

  • Any code file exceeding the embedding model's context window (~8192 tokens for text-embedding-3-large) fails vectorization; the error is only logged and the file is skipped, so the failure is easy to miss
  • The file still exists in the virtual filesystem and its L0/L1 summaries may be embedded successfully, but the L2 (full content) is not vector-searchable
  • For large backend repositories this can affect many files

Suggested Fix

Add input truncation or chunking before the embedding call, in one of the following places:

  1. In TextEmbeddingHandler: truncate embedding_msg.message to a configurable token limit before calling self._embedder.embed() (see the sketch after this list)
  2. In OpenAIDenseEmbedder.embed(): add a _token_limit similar to GeminiEmbedder and truncate input
  3. In the code parser: chunk large files similar to how the markdown parser splits by sections
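
A minimal sketch of options 1 and 2, assuming the cl100k_base encoding (the tokenizer used by text-embedding-3-large); truncate_to_token_limit and TOKEN_LIMIT are hypothetical names, not existing OpenViking APIs:

import tiktoken

TOKEN_LIMIT = 8192  # context window of text-embedding-3-large

def truncate_to_token_limit(text: str, limit: int = TOKEN_LIMIT) -> str:
    """Return text cut down to at most `limit` tokens."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    if len(tokens) <= limit:
        return text
    return enc.decode(tokens[:limit])

The call site in TextEmbeddingHandler.on_dequeue() could then look like:

result: EmbedResult = await asyncio.to_thread(
    self._embedder.embed, truncate_to_token_limit(embedding_msg.message)
)

Truncation is the smallest change but loses the tail of the file; option 3 (chunking) preserves full-content search at the cost of more vectors per file.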

Environment

  • OpenViking: 0.1.18 (pip install)
  • Embedding model: text-embedding-3-large (8192 token limit) via LiteLLM gateway
  • OS: macOS
