fix(embed): add text chunking for oversized embedding inputs (#616)#642

Merged
MaojiaSheng merged 1 commit into volcengine:main from lgYanami:dev-616
Mar 16, 2026

Conversation

@lgYanami
Contributor

## Description

- Add _estimate_tokens, _chunk_text, _chunk_and_embed to EmbedderBase
- Override embed()/embed_batch() in OpenAIDenseEmbedder with auto-chunking
- Use tiktoken for accurate token counting, fallback to char estimation
- Merge chunk vectors via token-weighted average + L2 normalization
- Make max_tokens configurable via ov.conf (embedding.dense.max_tokens)
- Default 8000 for OpenAI models, users can override for other models
- Add 22 unit tests covering chunking, fallback, batch, config
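
The token-weighted merge mentioned above could look roughly like the sketch below (hypothetical standalone helper `merge_chunk_vectors`; the PR implements this inside `_chunk_and_embed` on `EmbedderBase`, so names and signatures here are illustrative only):

```python
import math

def merge_chunk_vectors(vectors, token_counts):
    """Merge per-chunk embeddings into a single vector.

    vectors: list of equal-length float lists, one per chunk
    token_counts: estimated token count of each chunk, used as weights
    """
    total = sum(token_counts)
    dim = len(vectors[0])
    # Token-weighted average: longer chunks contribute proportionally more.
    merged = [0.0] * dim
    for vec, weight in zip(vectors, token_counts):
        for i, v in enumerate(vec):
            merged[i] += v * (weight / total)
    # L2-normalize so the merged vector is comparable to unchunked embeddings.
    norm = math.sqrt(sum(v * v for v in merged)) or 1.0
    return [v / norm for v in merged]
```

A chunk carrying three times the tokens of another pulls the merged vector three times as hard toward its direction, and the final normalization keeps cosine-similarity search well-behaved.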

## Related Issue

https://github.com/volcengine/OpenViking/issues/616
Fixes #616

## Type of Change

- [x] Bug fix (non-breaking change that fixes an issue)
- [ ] New feature (non-breaking change that adds functionality)
- [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
- [ ] Documentation update
- [x] Refactoring (no functional changes)
- [x] Test update

## Changes Made

- `openviking/models/embedder/base.py`: Added `max_tokens` property, `_estimate_tokens()`, `_chunk_text()` (paragraph > sentence > fixed-length splitting), `_merge_segments()`, `_fixed_length_split()`, `_chunk_and_embed()` (token-weighted average + L2 normalization), and `_embed_single()` to EmbedderBase
- `openviking/models/embedder/openai_embedders.py`: Refactored the core request logic of `embed()` into `_embed_single()`, added auto-chunking in `embed()` and `embed_batch()`, initialized the tiktoken encoder at construction time, made `max_tokens` configurable via a constructor parameter
- `openviking_cli/utils/config/embedding_config.py`: Added `max_tokens` field to `EmbeddingModelConfig`, passed through to OpenAIDenseEmbedder in factory registry
- `tests/unit/test_openai_embedder_chunking.py`: 22 unit tests covering short text (no chunking), oversized text, very long text (100x limit), empty text, batch with mixed lengths, L2 normalization, paragraph/sentence/fixed-length splitting, tiktoken fallback, and configurable max_tokens
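
The paragraph > sentence > fixed-length cascade in `_chunk_text()` could be sketched as follows (hypothetical simplification: it omits the `_merge_segments()` step that recombines small segments, and hard-codes the `len(text) // 3` estimator as the default):

```python
import re

def chunk_text(text, max_tokens, estimate_tokens=lambda t: len(t) // 3):
    """Split text so each chunk fits within max_tokens (rough sketch)."""
    def split_fixed(piece):
        # Fixed-length fallback: a character budget derived from max_tokens,
        # the inverse of the len(text) // 3 estimate.
        budget = max_tokens * 3
        return [piece[i:i + budget] for i in range(0, len(piece), budget)]

    chunks = []
    for para in text.split("\n\n"):                      # 1) paragraph boundaries
        if estimate_tokens(para) <= max_tokens:
            if para.strip():
                chunks.append(para)
            continue
        for sent in re.split(r"(?<=[.!?])\s+", para):    # 2) sentence boundaries
            if estimate_tokens(sent) <= max_tokens:
                chunks.append(sent)
            else:
                chunks.extend(split_fixed(sent))         # 3) fixed-length fallback
    return chunks
```

Splitting at the coarsest boundary that still fits keeps each chunk semantically coherent; the fixed-length cut only fires for pathological inputs like a single unbroken run of characters.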

## Testing

- [x] I have added tests that prove my fix is effective or that my feature works
- [x] New and existing unit tests pass locally with my changes
- [x] I have tested this on the following platforms:
  - [x] Linux
  - [ ] macOS
  - [ ] Windows

## Checklist

- [x] My code follows the project's coding style
- [x] I have performed a self-review of my code
- [x] I have commented my code, particularly in hard-to-understand areas
- [x] I have made corresponding changes to the documentation
- [x] My changes generate no new warnings
- [ ] Any dependent changes have been merged and published

## Screenshots (if applicable)

N/A

## Additional Notes

- `tiktoken` is an optional dependency: when unavailable (e.g., non-OpenAI models via compatible API), token estimation falls back to `len(text) // 3`
- `max_tokens` can be configured in `ov.conf` under `embedding.dense.max_tokens`, allowing users to set appropriate limits for different models (e.g., 32000 for Qwen3-Embedding-4B, 8000 default for OpenAI models)
- The chunking logic is implemented at the `EmbedderBase` level so other embedders can benefit in the future, but only `OpenAIDenseEmbedder` activates it in this PR
- `embed_batch()` intelligently routes: short texts are batched via the native OpenAI batch API for efficiency, while only oversized texts are individually chunked
- Verified locally with `max_tokens=100` against SiliconFlow API (Qwen3-Embedding-4B): no 400 errors, semantic artifacts generated correctly
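
The optional-tiktoken fallback described in the first note could be sketched like this (hypothetical helper name; the PR's `_estimate_tokens()` on `EmbedderBase` is the real implementation):

```python
def estimate_tokens(text, model="text-embedding-3-small"):
    """Prefer exact tiktoken counts; fall back to a character heuristic."""
    try:
        import tiktoken
        enc = tiktoken.encoding_for_model(model)
        return len(enc.encode(text))
    except Exception:
        # tiktoken not installed, or an unknown (non-OpenAI) model name:
        # approximate with ~3 characters per token.
        return len(text) // 3
```

The heuristic deliberately underestimates characters-per-token (3 rather than the ~4 typical for English) so that the fallback errs toward producing smaller chunks, not oversized requests.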

@CLAassistant

CLAassistant commented Mar 15, 2026

CLA assistant check
All committers have signed the CLA.

@MaojiaSheng MaojiaSheng merged commit cbf8bef into volcengine:main Mar 16, 2026
1 check passed
@github-project-automation github-project-automation bot moved this from Backlog to Done in OpenViking project Mar 16, 2026
@qin-ctx
Collaborator

qin-ctx commented Mar 18, 2026

We plan to roll back this PR.

The root cause of #616 is not that the embedder lacks chunking. It is that `vectorize_file()` in `embedding_utils.py` reads the raw full text of TEXT-type files straight from VikingFS and vectorizes it, skipping the summary the semantic processor has already generated.

Other resource types (Markdown, PDF, etc.) are sliced by structure at the parser stage, so every L2 slice fits within the embedding window. Code repository files are the only type that skips parser slicing and feeds the raw full text directly into the embedder (`CodeRepositoryParser` uploads via `upload_directory` as-is, without slicing).

The weighted-average scheme in `_chunk_and_embed` degrades embedding quality: averaging multiple chunk vectors blurs their semantics, so at retrieval time the result matches no topic well. The correct fix is to make `CodeRepositoryParser` slice files the way the Markdown/PDF parsers do (splitting functions/classes by AST structure), guaranteeing at the source that the L2 content written to VikingFS is bounded, with no need for a chunking fallback at the embedder layer.

Plan:

  1. Revert this PR
  2. Fix `CodeRepositoryParser` to slice code files by AST structure before uploading
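
For Python sources, the AST-structured slicing proposed in step 2 could be sketched as below (hypothetical; a real `CodeRepositoryParser` fix would need per-language parsers and handling for module-level code between definitions):

```python
import ast

def split_python_source(source):
    """Emit one slice per top-level function/class definition."""
    tree = ast.parse(source)
    lines = source.splitlines()
    slices = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno/end_lineno are 1-based and inclusive.
            slices.append("\n".join(lines[node.lineno - 1:node.end_lineno]))
    return slices
```

Because each slice is a whole function or class, its size is bounded by how large a single definition can be, which is what keeps L2 content within the embedding window without any embedder-side chunking.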
