feat(embedder): add non-symmetric embedding support for query/document by CHW0n9 · Pull Request #608 · volcengine/OpenViking

CHW0n9 · 2026-03-14T20:26:15Z

Summary

Implement nonsymmetric embedding support so queries and documents can use different embedding strategies/parameters. This enables using OpenAI's input_type or Jina's task settings differently for retrieval queries vs. document/passage embeddings.

What changed

OpenAIDenseEmbedder: Added input_type parameter support (accepts "query", "document", "passage"). Parameters are passed via extra_body to the OpenAI client.
EmbeddingConfig: Added contextual embedder helpers:
- get_query_embedder(): Returns an embedder configured specifically for query embeddings.
- get_document_embedder(): Returns an embedder configured for document/passage embeddings.
Service & Session: Updated service/core.py and memory_deduplicator.py to use the query-specific embedder.
Storage: Updated collection_schemas.py to use the document-specific embedder for collection initialization.
Configuration: Improved default-value handling and added validation for embedding providers.
Linting: Fixed import ordering and removed unused imports in affected files.

Rationale

Modern embedding models (like OpenAI text-embedding-3 or Jina v5) perform better when the API is told whether the input is a short query or a long document. This PR introduces the architectural plumbing to distinguish these two paths in the OpenViking engine, allowing for higher retrieval accuracy.

Files included in this PR

openviking/models/embedder/openai_embedders.py
openviking_cli/utils/config/embedding_config.py
openviking/service/core.py
openviking/session/memory_deduplicator.py
openviking/storage/collection_schemas.py
openviking_cli/utils/config/vlm_config.py

(Note: Dockerfile optimizations and local test files were excluded from this PR to keep the focus on the embedding logic.)

Testing

Linting: Verified with ruff format and ruff check.
Type Checking: Verified with mypy (minor warnings in external stubs reviewed as benign).
Unit Testing: New logic was verified with local test suites (test_embedding_input_type.py and test_openai_embedder.py).

Checklist

Code follows project style guidelines (Ruff/Mypy)
Contextual embedder logic implemented for OpenAI/Jina
CI pipeline passing (requires upstream build)

- Add input_type parameter to OpenAIDenseEmbedder - Add get_query_embedder() and get_document_embedder() to EmbeddingConfig - Support different embedding strategies for queries and documents - Improve error handling and default value processing - Fix import ordering and remove unused imports

CLAassistant · 2026-03-14T20:26:21Z

All committers have signed the CLA.

openviking/service/core.py

MaojiaSheng · 2026-03-15T03:05:35Z

Thanks but a few comments need to be resolved before merge

ZaynJarvis · 2026-03-15T11:25:25Z

@CHW0n9 Great work on adding non-symmetric embedding support!

I notice the current approach in _get_contextual_embedder() uses provider-specific if/else logic. As more providers like Gemini Embedding 2 add task type support, this pattern might become difficult to maintain.

Suggested refactor for better scalability:
Consider moving the input type/task selection logic into each specific EmbeddingModel class:

Add a method like get_contextual_config(context: str) -> dict to each embedder class
Let each embedder handle its own provider-specific logic (OpenAI uses input_type, Jina uses task, etc.)
Replace the provider if/else chain with polymorphic calls

This would:

Keep provider-specific logic encapsulated
Make adding new providers simpler
Follow better OOP principles
Eliminate the growing if/elif chain

The current implementation works well, but this refactor could improve long-term maintainability. What do you think?

Add first-class query/document dense embedding configuration and wire it through retrieval and indexing. - add dense_query and dense_document partial overrides on top of the base dense config - encapsulate provider-specific query/document defaults in embedder classes via contextual init hooks - use query embedders for retrieval paths and document embedders for schema/indexing paths - document the new config shape and add focused coverage for config merging and Jina defaults Follow-up to volcengine#608.

kfiramar · 2026-03-15T13:35:03Z

Opened a dedicated follow-up PR for the long-term query/document embedding architecture here: #636.

This follow-up keeps the change scoped to the cross-cutting design discussed in this thread:

separate query-time and document/index-time dense embedder config
provider-specific contextual behavior implemented in embedder classes
retrieval wired to the query embedder and indexing/schema paths wired to the document embedder
no service-layer provider branching

That lets the architectural change land independently from provider-specific feature work.

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-opencode) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>

openviking/models/embedder/openai_embedders.py

openviking/service/core.py

openviking/storage/collection_schemas.py

ZaynJarvis · 2026-03-16T07:48:33Z

openviking_cli/utils/config/embedding_config.py

+                api_base=self.dense.api_base,
+                dimension=self.dense.dimension,
+                input_type=input_type,
+            )
+
+        if provider == "jina":
+            if self.dense.task_document:
+                task = "retrieval.query" if context == "query" else self.dense.task_document
+            else:
+                task = None
+
+            if not self.dense.model:
+                raise ValueError("Embedding model name is required")
+
+            return JinaDenseEmbedder(
+                model_name=self.dense.model,
+                api_key=self.dense.api_key,
+                api_base=self.dense.api_base,
+                dimension=self.dense.dimension,


return OpenAIDenseEmbedder( model_name=self.dense.model, api_key=self.dense.api_key, api_base=self.dense.api_base, dimension=self.dense.dimension, context=context, ) ... return JinaDenseEmbedder( model_name=self.dense.model, api_key=self.dense.api_key, api_base=self.dense.api_base, dimension=self.dense.dimension, context=context, )

in XXXEmbedder impl:

class OpenAIDenseEmbedder: ... input_type = "x" if context == "query" else "y"

class JinaDenseEmbedder: ... task = "a" if context == "query" else "b"

would this be better?

I've moved the context parameter handling into embedder classes:

OpenAIDenseEmbedder:

def __init__(self, ..., context: Optional[str] = None, query_param: Optional[str] = None, document_param: Optional[str] = None, ...): non_symmetric = query_param is not None or document_param is not None if not non_symmetric: self.input_type = None elif context == "query": self.input_type = query_param if query_param is not None else "query" elif context == "document": self.input_type = document_param if document_param is not None else "passage"

JinaDenseEmbedder:

def __init__(self, ..., context: Optional[str] = None, query_param: str = "retrieval.query", document_param: str = "retrieval.passage", ...): if context == "query": self.task = query_param elif context == "document": self.task = document_param

Each embedder class now decides how to handle the context parameter - OpenAI uses input_type, Jina uses task. The _get_contextual_embedder() now simply passes the context parameter to the embedder class without provider-specific if/else logic.

ZaynJarvis · 2026-03-16T07:53:27Z

openviking_cli/utils/config/embedding_config.py

+    task_document: Optional[str] = Field(
+        default=None,
+        description="Jina task for document embeddings (e.g. 'retrieval.passage')",
+    )


too many config fields here.

input_type_query, task_query refers to the same thing, should use same config name

i don't think query / document / retrieval.query / retrieval.passage need to be set in config. static value should be decided by each Embedder for each context.

I've consolidated the config fields!

Removed 4 old fields:

input_type_query

input_type_document

task_query

task_document

Added 2 unified fields:

query_param - works for all providers (OpenAI → input_type, Jina → task)

document_param - works for all providers

This makes it easier to add new providers in the future - no need to add new config fields, just use the existing query_param/document_param. The embedder class decides the default values:

OpenAI: query_param defaults to "query", document_param defaults to "passage"

Jina: query_param defaults to "retrieval.query", document_param defaults to "retrieval.passage"

OpenAI embedder is set default as symmetric models (without query_param and document_param in config). For OpenAI-Compatible embedding providers, if the user set one of the querry_param or document_param, the non-symmetric mode will be on. Users can specify the query_param and document_param to override the default value.

…ngDBManager init

ZaynJarvis · 2026-03-16T08:37:07Z

openviking_cli/utils/config/embedding_config.py

+                api_key=self.dense.api_key,
+                api_base=self.dense.api_base,
+                dimension=self.dense.dimension,
+                input_type=input_type,


btw, just to confirm, OpenAI itself doesn't support input_type, but OpenAI compatible embedding models might require this in extra body to set query / document type?

how do we determine what's the key in extraBody ? e.g. input_type, task_type etc.

embedder = OpenAIDenseEmbedder( model_name="text-embedding-3-small", api_key="test-api-key", input_type="document", ) embedder.embed_batch(["Hello", "World"]) call_kwargs = mock_client.embeddings.create.call_args[1] assert "extra_body" in call_kwargs assert call_kwargs["extra_body"] == {"input_type": "document"}

yes, OpenAI models are symmetric, this param is designed for OpenAI compatible embedding APIs.

Official OpenAI models:

Symmetric, do not support input_type parameter

Examples: text-embedding-3-small, text-embedding-3-large

OpenAI-compatible third-party models:

May require input_type parameter for non-symmetric embeddings

Examples: Jina, Cohere, Voyage

How to determine the key in extra_body:

Each embedder class decides which key to use

OpenAIDenseEmbedder uses input_type

JinaDenseEmbedder uses task

This is documented in the class docstrings:

"""OpenAI-Compatible Dense Embedder Implementation Note: Official OpenAI models are symmetric and do not support the input_type parameter. Non-symmetric mode (context='query'/'document') is only supported by OpenAI-compatible third-party models (e.g., BGE-M3, Jina, Cohere, etc.) that implement the input_type parameter. """

Per reviewer feedback, each embedder now decides its own context-specific parameter mapping (input_type for OpenAI, task for Jina) based on a 'context' arg ('query'/'document'/None), instead of having EmbeddingConfig pass pre-computed values. - Unified config fields: replaced 4 separate fields (input_type_query, input_type_document, task_query, task_document) with 2 unified fields (query_param, document_param) that work for all providers - OpenAI is symmetric by default; non-symmetric mode is activated implicitly when query_param or document_param is set - Jina is non-symmetric by default; task is always sent unless context=None - Updated tests to use new API Note: Official OpenAI models are symmetric and do not support input_type. Non-symmetric mode is only supported by OpenAI-compatible third-party models (e.g., BGE-M3, Jina, Cohere).

ZaynJarvis

lgtm, thx :)

… PR volcengine#608) This enhancement builds on the merged PR volcengine#608 to add flexible string parsing for multiple parameters while maintaining 100% backward compatibility. Key features: - Backward compatible with existing PR volcengine#608 usage - Enhanced string parsing: 'input_type=query,task=search,domain=finance' - Mixed format support: 'query,task=search' - Multiple parameter support beyond just input_type - Clean integration with existing context/query_param/document_param architecture Supported formats: - 'query' -> {'input_type': 'query'} (PR volcengine#608 style) - 'input_type=query' -> {'input_type': 'query'} (explicit) - 'input_type=query,task=search' -> {'input_type': 'query', 'task': 'search'} (multiple) - 'query,task=search' -> {'input_type': 'query', 'task': 'search'} (mixed) This addresses the need for flexible OpenAI-compatible parameter passing while preserving the simple, elegant API from PR volcengine#608.

Minimal enhancement that adds support for multiple parameters in query_param and document_param using key=value format. Changes: - Add _parse_param_string() method for key=value parsing - Enhance _build_extra_body() to support parsed parameters - Store raw parameters for parsing - Update docstrings and examples Fixes: - Remove mixed mode optimization (redundant) - Remove USAGE_EXAMPLES.md (requested) - Clean integration with existing config fields from main Usage: - Simple: query_param='query' (works as before) - Enhanced: query_param='input_type=query,task=search' Builds on merged PR volcengine#608 without breaking changes.

github-project-automation bot added this to OpenViking project Mar 14, 2026

github-project-automation bot moved this to Backlog in OpenViking project Mar 14, 2026

chore: resolve merge conflict in core.py

ea535ac

MaojiaSheng reviewed Mar 15, 2026

View reviewed changes

openviking/service/core.py Outdated Show resolved Hide resolved

MaojiaSheng reviewed Mar 15, 2026

View reviewed changes

openviking/service/core.py Outdated Show resolved Hide resolved

ZaynJarvis mentioned this pull request Mar 15, 2026

feat(embedding): add Voyage text embedding support #614

Closed

ZaynJarvis mentioned this pull request Mar 15, 2026

feat(embedder): minimax embeding #624

Merged

19 tasks

kfiramar mentioned this pull request Mar 15, 2026

feat(embedding): split query and document embedders #636

Closed

CHW0n9 force-pushed the feature/nonsymmetric-embedding branch 2 times, most recently from 308dbb1 to ea535ac Compare March 15, 2026 13:59

fix(service): implement PR feedback for core.py

3246ae3

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-opencode) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>

CHW0n9 force-pushed the feature/nonsymmetric-embedding branch from 7930e19 to 3246ae3 Compare March 15, 2026 14:01

CHW0n9 requested a review from MaojiaSheng March 15, 2026 14:36

Merge branch 'main' into feature/nonsymmetric-embedding

cf640c8