
revert-embedding-chunking#741

Merged
chenjw merged 1 commit into main from revert-embedding-chunking
Mar 18, 2026

Conversation

Collaborator

@qin-ctx qin-ctx commented Mar 18, 2026

Revert "fix(embed): add text chunking for oversized embedding inputs (#616) (#642)"

This reverts commit cbf8bef.


CLAassistant commented Mar 18, 2026

CLA assistant check
All committers have signed the CLA.

@github-actions

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 1 🔵⚪⚪⚪⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ No major issues detected

@github-actions

PR Code Suggestions ✨

Explore these optional code suggestions:

Category: Possible issue
Restore dimension parameter in single embed

Add back the code to include the 'dimensions' parameter in the API call when
self.dimension is set. This was present in the previous _embed_single method but was
inadvertently removed when merging that method into embed().

openviking/models/embedder/openai_embedders.py [180-182]

```diff
 kwargs: Dict[str, Any] = {"input": text, "model": self.model_name}
+if self.dimension:
+    kwargs["dimensions"] = self.dimension

 extra_body = self._build_extra_body(is_query=is_query)
```
Suggestion importance[1-10]: 8


Why: This fixes a bug where the dimensions parameter is not passed to the OpenAI API in the single embed() method, even though the equivalent logic exists in both the old _embed_single and the new embed_batch methods. Without it, single embeddings silently ignore the requested dimension.

Impact: Medium
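As a sanity check, the restored logic can be sketched standalone. The helper name below is hypothetical; only the kwargs shape and the `dimensions` guard come from the suggestion above.

```python
from typing import Any, Dict, Optional


def build_embed_kwargs(
    text: str, model_name: str, dimension: Optional[int]
) -> Dict[str, Any]:
    # Hypothetical helper mirroring the suggested fix: only pass
    # "dimensions" when a target dimension is configured, since some
    # OpenAI-compatible embedding endpoints reject the parameter.
    kwargs: Dict[str, Any] = {"input": text, "model": model_name}
    if dimension:
        kwargs["dimensions"] = dimension
    return kwargs
```

With a configured dimension the request carries `dimensions`; without one, the key is omitted entirely rather than sent as `None`.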
Category: General
Move math import to top of file

Move the 'import math' statement to the top of the file to follow standard Python
import practices and avoid repeated imports on each function call.

openviking/models/embedder/base.py [22-27]

```diff
-if not dimension or len(embedding) <= dimension:
-    return embedding
+import math
+import random
+import time
+from abc import ABC, abstractmethod
+from dataclasses import dataclass
+from typing import Any, Callable, Dict, List, Optional, TypeVar

-import math
-
-embedding = embedding[:dimension]
-
```
Suggestion importance[1-10]: 4


Why: This improves code style by moving the math import to the top of the file, following standard Python conventions and avoiding repeated imports on each function call, though it has minimal functional impact.

Impact: Low
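The function-local `import math` in the diff hints that the surrounding code renormalizes after truncating. A standalone sketch of such a helper follows; the function name and the renormalization step are assumptions for illustration, not the repo's actual code.

```python
import math
from typing import List, Optional


def truncate_embedding(
    embedding: List[float], dimension: Optional[int]
) -> List[float]:
    # Hypothetical sketch: slice the vector to the requested dimension,
    # then L2-renormalize so cosine similarity stays meaningful
    # (matryoshka-style truncation).
    if not dimension or len(embedding) <= dimension:
        return embedding
    truncated = embedding[:dimension]
    norm = math.sqrt(sum(x * x for x in truncated))
    return [x / norm for x in truncated] if norm else truncated
```

Hoisting the import to the top of the file, as the suggestion says, changes nothing functionally; it just avoids re-executing the import statement on every call.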

Contributor

@qin-ptr qin-ptr left a comment


Review Summary

This PR reverts PR #642 which fixed issue #616 (embedding inputs exceeding token limits causing 400 errors). However, the PR description does not explain why this revert is necessary.

Critical Issues

1. [Design] (blocking) — Missing revert rationale

The PR body only states "Revert" without explaining:

  • Why is this revert needed?
  • What problem was discovered in the original implementation?
  • Is there an issue or user feedback documenting the problem?
  • What is the alternative solution?

2. [Bug] (blocking) — Known bug will regress

Issue #616 reported a real, reproducible bug where importing large repositories causes "This model's maximum context length is 8192 tokens, however you requested 13327 tokens" errors, breaking semantic artifact generation.

After this revert, users will encounter this bug again. The original fix (PR #642) was well-designed with proper chunking, token estimation, and weighted averaging.
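The chunk-and-pool approach the reverted fix used can be sketched as follows. This is a hypothetical reconstruction from the review's description (chunking, token estimation approximated here by character counts, and weighted averaging), not the actual PR #642 code.

```python
import math
from typing import Callable, List


def embed_with_chunking(
    text: str,
    embed_fn: Callable[[str], List[float]],
    max_chars: int,
) -> List[float]:
    # Split oversized input into chunks that fit the model's limit.
    chunks = [text[i : i + max_chars] for i in range(0, len(text), max_chars)] or [""]
    vectors = [embed_fn(c) for c in chunks]
    # Weight each chunk's vector by its length, so short tail chunks
    # do not dominate the pooled embedding.
    weights = [len(c) for c in chunks]
    total = sum(weights) or 1
    dim = len(vectors[0])
    pooled = [
        sum(w * v[i] for w, v in zip(weights, vectors)) / total
        for i in range(dim)
    ]
    # L2-normalize so the pooled vector is comparable via cosine similarity.
    norm = math.sqrt(sum(x * x for x in pooled)) or 1.0
    return [x / norm for x in pooled]
```

After the revert, any input past the model's context limit once again reaches the API unsplit and fails with a 400, until the #734 one-URI-to-many-embeddings mapping covers the case.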

3. [Design] (blocking) — Configuration breaking change

The max_tokens field removal breaks existing configurations. Since EmbeddingModelConfig uses extra="forbid", any user with embedding.dense.max_tokens in their ov.conf will experience configuration loading failures.

4. [Design] (blocking) — Missing related issue

This revert PR should be linked to an issue documenting:

  • What problem was found in PR #642
  • Reproduction steps
  • Impact analysis
  • Proposed alternative

Recommendations

Before merging:

  1. Add detailed rationale to PR description
  2. Create and link an issue documenting the discovered problem
  3. Provide mitigation for issue #616 regression
  4. Document the breaking change and provide migration guidance
  5. Consider fixing the specific problem instead of reverting the entire feature

🤖 I am a bot owned by @qin-ctx.

```python
**({"query_param": cfg.query_param} if cfg.query_param else {}),
**({"document_param": cfg.document_param} if cfg.document_param else {}),
"max_tokens": cfg.max_tokens,
**({"extra_headers": cfg.extra_headers} if cfg.extra_headers else {}),
```
Contributor


[Design] (blocking)

The removal of "max_tokens": cfg.max_tokens from the factory parameters creates a configuration compatibility issue.

Users who previously configured embedding.dense.max_tokens in their ov.conf will find:

  1. The config field still exists in their file
  2. Due to extra="forbid" in EmbeddingModelConfig, loading will fail with a validation error
  3. No migration path or deprecation warning is provided

Recommendation: If this revert is necessary, document it as a breaking change and provide clear migration instructions in release notes.
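The failure mode can be reproduced with a minimal stand-in. The dataclass below merely mimics pydantic's extra="forbid" rejection of unknown keys; it is not the real EmbeddingModelConfig, whose fields differ.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Optional


@dataclass
class EmbeddingModelConfig:
    # Simplified stand-in; the real class is a pydantic model
    # with model_config extra="forbid".
    name: str
    dimension: Optional[int] = None

    @classmethod
    def from_dict(cls, raw: Dict[str, Any]) -> "EmbeddingModelConfig":
        allowed = {f.name for f in fields(cls)}
        extra = set(raw) - allowed
        if extra:
            # Mirrors pydantic's extra="forbid" behavior: unknown
            # keys abort loading instead of being silently dropped.
            raise ValueError(f"extra fields not permitted: {sorted(extra)}")
        return cls(**raw)
```

A config dict still carrying the removed key, e.g. `{"name": "m", "max_tokens": 8192}`, raises on load, which is exactly what a user with a stale `ov.conf` would hit.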

```diff
@@ -1,532 +0,0 @@
-# Copyright (c) 2026 Beijing Volcano Engine Technology Co., Ltd.
```
Contributor


[Suggestion] (non-blocking)

Deleting the entire test file (532 lines) reduces test coverage for the embedder module. Even if the chunking feature is removed, some tests remain valuable:

  • Basic embedding functionality
  • Batch embedding
  • Dimension detection
  • Error handling

Consider preserving the core embedder tests even after reverting the chunking feature.

Collaborator

@chenjw chenjw left a comment


#734 implements the mapping of one Viking URI to multiple embeddings. As a result, embed pooling is no longer required, so we will roll back this commit.

@chenjw chenjw merged commit d30d96c into main Mar 18, 2026
5 of 6 checks passed
@chenjw chenjw deleted the revert-embedding-chunking branch March 18, 2026 12:20
@github-project-automation github-project-automation bot moved this from Backlog to Done in OpenViking project Mar 18, 2026

Projects

Status: Done


4 participants