fix(embed): add text chunking for oversized embedding inputs (#616)#642

Merged
MaojiaSheng merged 1 commit into volcengine:main from lgYanami:dev-616
Mar 16, 2026

Conversation

@lgYanami
Contributor

## Description

- Add _estimate_tokens, _chunk_text, _chunk_and_embed to EmbedderBase
- Override embed()/embed_batch() in OpenAIDenseEmbedder with auto-chunking
- Use tiktoken for accurate token counting, fallback to char estimation
- Merge chunk vectors via token-weighted average + L2 normalization
- Make max_tokens configurable via ov.conf (embedding.dense.max_tokens)
- Default 8000 for OpenAI models, users can override for other models
- Add 22 unit tests covering chunking, fallback, batch, config
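
The token-weighted merge mentioned above could look roughly like the sketch below (hypothetical standalone helper `merge_chunk_vectors`; the PR implements this inside `_chunk_and_embed` on `EmbedderBase`, so names and signatures here are illustrative only):

```python
import math

def merge_chunk_vectors(vectors, token_counts):
    """Merge per-chunk embeddings into a single vector.

    vectors: list of equal-length float lists, one per chunk
    token_counts: estimated token count of each chunk, used as weights
    """
    total = sum(token_counts)
    dim = len(vectors[0])
    # Token-weighted average: longer chunks contribute proportionally more.
    merged = [0.0] * dim
    for vec, weight in zip(vectors, token_counts):
        for i, v in enumerate(vec):
            merged[i] += v * (weight / total)
    # L2-normalize so the merged vector is comparable to unchunked embeddings.
    norm = math.sqrt(sum(v * v for v in merged)) or 1.0
    return [v / norm for v in merged]
```

A chunk carrying three times the tokens of another pulls the merged vector three times as hard toward its direction, and the final normalization keeps cosine-similarity search well-behaved.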

## Related Issue

https://github.com/volcengine/OpenViking/issues/616
Fixes #616

## Type of Change

- [x] Bug fix (non-breaking change that fixes an issue)
- [ ] New feature (non-breaking change that adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
- [ ] Documentation update
- [x] Refactoring (no functional changes)
- [x] Test update

## Changes Made

- `openviking/models/embedder/base.py`: Added `max_tokens` property, `_estimate_tokens()`, `_chunk_text()` (paragraph > sentence > fixed-length splitting), `_merge_segments()`, `_fixed_length_split()`, `_chunk_and_embed()` (token-weighted average + L2 normalization), and `_embed_single()` to EmbedderBase
- `openviking/models/embedder/openai_embedders.py`: Refactored the core request logic of `embed()` into `_embed_single()`, added auto-chunking in `embed()` and `embed_batch()`, initialized the tiktoken encoder at construction time, made `max_tokens` configurable via a constructor parameter
- `openviking_cli/utils/config/embedding_config.py`: Added `max_tokens` field to `EmbeddingModelConfig`, passed through to OpenAIDenseEmbedder in factory registry
- `tests/unit/test_openai_embedder_chunking.py`: 22 unit tests covering short text (no chunking), oversized text, very long text (100x limit), empty text, batch with mixed lengths, L2 normalization, paragraph/sentence/fixed-length splitting, tiktoken fallback, and configurable max_tokens
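
The paragraph > sentence > fixed-length cascade in `_chunk_text()` could be sketched as follows (hypothetical simplification: it omits the `_merge_segments()` step that recombines small segments, and hard-codes the `len(text) // 3` estimator as the default):

```python
import re

def chunk_text(text, max_tokens, estimate_tokens=lambda t: len(t) // 3):
    """Split text so each chunk fits within max_tokens (rough sketch)."""
    def split_fixed(piece):
        # Fixed-length fallback: a character budget derived from max_tokens,
        # the inverse of the len(text) // 3 estimate.
        budget = max_tokens * 3
        return [piece[i:i + budget] for i in range(0, len(piece), budget)]

    chunks = []
    for para in text.split("\n\n"):                      # 1) paragraph boundaries
        if estimate_tokens(para) <= max_tokens:
            if para.strip():
                chunks.append(para)
            continue
        for sent in re.split(r"(?<=[.!?])\s+", para):    # 2) sentence boundaries
            if estimate_tokens(sent) <= max_tokens:
                chunks.append(sent)
            else:
                chunks.extend(split_fixed(sent))         # 3) fixed-length fallback
    return chunks
```

Splitting at the coarsest boundary that still fits keeps each chunk semantically coherent; the fixed-length cut only fires for pathological inputs like a single unbroken run of characters.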

## Testing

- [x] I have added tests that prove my fix is effective or that my feature works
- [x] New and existing unit tests pass locally with my changes
- [x] I have tested this on the following platforms:
  - [x] Linux
  - [ ] macOS
  - [ ] Windows

## Checklist

- [x] My code follows the project's coding style
- [x] I have performed a self-review of my code
- [x] I have commented my code, particularly in hard-to-understand areas
- [x] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [ ] Any dependent changes have been merged and published

## Screenshots (if applicable)

N/A

## Additional Notes

- `tiktoken` is an optional dependency: when unavailable (e.g., non-OpenAI models via compatible API), token estimation falls back to `len(text) // 3`
- `max_tokens` can be configured in `ov.conf` under `embedding.dense.max_tokens`, allowing users to set appropriate limits for different models (e.g., 32000 for Qwen3-Embedding-4B, 8000 default for OpenAI models)
- The chunking logic is implemented at the `EmbedderBase` level so other embedders can benefit in the future, but only `OpenAIDenseEmbedder` activates it in this PR
- `embed_batch()` intelligently routes: short texts are batched via the native OpenAI batch API for efficiency, while only oversized texts are individually chunked
- Verified locally with `max_tokens=100` against SiliconFlow API (Qwen3-Embedding-4B): no 400 errors, semantic artifacts generated correctly
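
The optional-tiktoken fallback described in the first note could be sketched like this (hypothetical helper name; the PR's `_estimate_tokens()` on `EmbedderBase` is the real implementation):

```python
def estimate_tokens(text, model="text-embedding-3-small"):
    """Prefer exact tiktoken counts; fall back to a character heuristic."""
    try:
        import tiktoken
        enc = tiktoken.encoding_for_model(model)
        return len(enc.encode(text))
    except Exception:
        # tiktoken not installed, or an unknown (non-OpenAI) model name:
        # approximate with ~3 characters per token.
        return len(text) // 3
```

The heuristic deliberately underestimates characters-per-token (3 rather than the ~4 typical for English) so that the fallback errs toward producing smaller chunks, not oversized requests.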

@CLAassistant

CLAassistant commented Mar 15, 2026

CLA assistant check
All committers have signed the CLA.

@MaojiaSheng MaojiaSheng merged commit cbf8bef into volcengine:main Mar 16, 2026
1 check passed
@github-project-automation github-project-automation bot moved this from Backlog to Done in OpenViking project Mar 16, 2026
@qin-ctx
Collaborator

qin-ctx commented Mar 18, 2026

We plan to roll back this PR.

The root cause of #616 is not that the embedder lacks chunking. It is that `vectorize_file()` in `embedding_utils.py` reads the raw full text of TEXT-type files straight from VikingFS and vectorizes it, skipping the summary the semantic processor has already generated.

Other resource types (Markdown, PDF, etc.) are sliced by structure at the parser stage, so every L2 slice fits within the embedding window. Code repository files are the only type that skips parser slicing and feeds the raw full text directly into the embedder (`CodeRepositoryParser` uploads via `upload_directory` as-is, without slicing).

The weighted-average scheme in `_chunk_and_embed` degrades embedding quality: averaging multiple chunk vectors blurs their semantics, so at retrieval time the result matches no topic well. The correct fix is to make `CodeRepositoryParser` slice files the way the Markdown/PDF parsers do (splitting functions/classes by AST structure), guaranteeing at the source that the L2 content written to VikingFS is bounded, with no need for a chunking fallback at the embedder layer.

Plan:

  1. Revert this PR
  2. Fix `CodeRepositoryParser` to slice code files by AST structure before uploading
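
For Python sources, the AST-structured slicing proposed in step 2 could be sketched as below (hypothetical; a real `CodeRepositoryParser` fix would need per-language parsers and handling for module-level code between definitions):

```python
import ast

def split_python_source(source):
    """Emit one slice per top-level function/class definition."""
    tree = ast.parse(source)
    lines = source.splitlines()
    slices = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno/end_lineno are 1-based and inclusive.
            slices.append("\n".join(lines[node.lineno - 1:node.end_lineno]))
    return slices
```

Because each slice is a whole function or class, its size is bounded by how large a single definition can be, which is what keeps L2 content within the embedding window without any embedder-side chunking.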
