feat(embedder): add non-symmetric embedding support for query/document#608
Conversation
- Add input_type parameter to OpenAIDenseEmbedder - Add get_query_embedder() and get_document_embedder() to EmbeddingConfig - Support different embedding strategies for queries and documents - Improve error handling and default value processing - Fix import ordering and remove unused imports
|
Thanks but a few comments need to be resolved before merge |
|
@CHW0n9 Great work on adding non-symmetric embedding support! I notice the current approach in Suggested refactor for better scalability:
This would:
The current implementation works well, but this refactor could improve long-term maintainability. What do you think? |
Add first-class query/document dense embedding configuration and wire it through retrieval and indexing. - add dense_query and dense_document partial overrides on top of the base dense config - encapsulate provider-specific query/document defaults in embedder classes via contextual init hooks - use query embedders for retrieval paths and document embedders for schema/indexing paths - document the new config shape and add focused coverage for config merging and Jina defaults Follow-up to volcengine#608.
|
Opened a dedicated follow-up PR for the long-term query/document embedding architecture here: #636. This follow-up keeps the change scoped to the cross-cutting design discussed in this thread:
That lets the architectural change land independently from provider-specific feature work. |
308dbb1 to
ea535ac
Compare
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-opencode) Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
7930e19 to
3246ae3
Compare
| api_base=self.dense.api_base, | ||
| dimension=self.dense.dimension, | ||
| input_type=input_type, | ||
| ) | ||
|
|
||
| if provider == "jina": | ||
| if self.dense.task_document: | ||
| task = "retrieval.query" if context == "query" else self.dense.task_document | ||
| else: | ||
| task = None | ||
|
|
||
| if not self.dense.model: | ||
| raise ValueError("Embedding model name is required") | ||
|
|
||
| return JinaDenseEmbedder( | ||
| model_name=self.dense.model, | ||
| api_key=self.dense.api_key, | ||
| api_base=self.dense.api_base, | ||
| dimension=self.dense.dimension, |
There was a problem hiding this comment.
return OpenAIDenseEmbedder(
model_name=self.dense.model,
api_key=self.dense.api_key,
api_base=self.dense.api_base,
dimension=self.dense.dimension,
context=context,
)
...
return JinaDenseEmbedder(
model_name=self.dense.model,
api_key=self.dense.api_key,
api_base=self.dense.api_base,
dimension=self.dense.dimension,
context=context,
)
in XXXEmbedder impl:
class OpenAIDenseEmbedder:
...
input_type = "x" if context == "query" else "y"
class JinaDenseEmbedder:
...
task = "a" if context == "query" else "b"
would this be better?
There was a problem hiding this comment.
I've moved the context parameter handling into embedder classes:
OpenAIDenseEmbedder:
def __init__(self, ..., context: Optional[str] = None, query_param: Optional[str] = None, document_param: Optional[str] = None, ...):
non_symmetric = query_param is not None or document_param is not None
if not non_symmetric:
self.input_type = None
elif context == "query":
self.input_type = query_param if query_param is not None else "query"
elif context == "document":
self.input_type = document_param if document_param is not None else "passage"JinaDenseEmbedder:
def __init__(self, ..., context: Optional[str] = None, query_param: str = "retrieval.query", document_param: str = "retrieval.passage", ...):
if context == "query":
self.task = query_param
elif context == "document":
self.task = document_paramEach embedder class now decides how to handle the context parameter - OpenAI uses input_type, Jina uses task. The _get_contextual_embedder() now simply passes the context parameter to the embedder class without provider-specific if/else logic.
| task_document: Optional[str] = Field( | ||
| default=None, | ||
| description="Jina task for document embeddings (e.g. 'retrieval.passage')", | ||
| ) |
There was a problem hiding this comment.
too many config fields here.
- input_type_query, task_query refers to the same thing, should use same config name
- i don't think query / document / retrieval.query / retrieval.passage need to be set in config. static value should be decided by each Embedder for each context.
There was a problem hiding this comment.
I've consolidated the config fields!
Removed 4 old fields:
input_type_queryinput_type_documenttask_querytask_document
Added 2 unified fields:
query_param- works for all providers (OpenAI → input_type, Jina → task)document_param- works for all providers
This makes it easier to add new providers in the future - no need to add new config fields, just use the existing query_param/document_param. The embedder class decides the default values:
- OpenAI:
query_paramdefaults to"query",document_paramdefaults to"passage" - Jina:
query_paramdefaults to"retrieval.query",document_paramdefaults to"retrieval.passage"
OpenAI embedder is set default as symmetric models (without query_param and document_param in config). For OpenAI-Compatible embedding providers, if the user set one of the querry_param or document_param, the non-symmetric mode will be on. Users can specify the query_param and document_param to override the default value.
5de6f7b to
87f983f
Compare
87f983f to
673d6db
Compare
2bffae2 to
673d6db
Compare
| api_key=self.dense.api_key, | ||
| api_base=self.dense.api_base, | ||
| dimension=self.dense.dimension, | ||
| input_type=input_type, |
There was a problem hiding this comment.
btw, just to confirm, OpenAI itself doesn't support input_type, but OpenAI compatible embedding models might require this in extra body to set query / document type?
how do we determine what's the key in extraBody ? e.g. input_type, task_type etc.
There was a problem hiding this comment.
embedder = OpenAIDenseEmbedder(
model_name="text-embedding-3-small",
api_key="test-api-key",
input_type="document",
)
embedder.embed_batch(["Hello", "World"])
call_kwargs = mock_client.embeddings.create.call_args[1]
assert "extra_body" in call_kwargs
assert call_kwargs["extra_body"] == {"input_type": "document"}
There was a problem hiding this comment.
yes, OpenAI models are symmetric, this param is designed for OpenAI compatible embedding APIs.
Official OpenAI models:
- Symmetric, do not support
input_typeparameter - Examples:
text-embedding-3-small,text-embedding-3-large
OpenAI-compatible third-party models:
- May require
input_typeparameter for non-symmetric embeddings - Examples: Jina, Cohere, Voyage
How to determine the key in extra_body:
- Each embedder class decides which key to use
OpenAIDenseEmbedderusesinput_typeJinaDenseEmbedderusestask
This is documented in the class docstrings:
"""OpenAI-Compatible Dense Embedder Implementation
Note: Official OpenAI models are symmetric and do not support the input_type parameter.
Non-symmetric mode (context='query'/'document') is only supported by OpenAI-compatible
third-party models (e.g., BGE-M3, Jina, Cohere, etc.) that implement the input_type parameter.
"""Per reviewer feedback, each embedder now decides its own context-specific
parameter mapping (input_type for OpenAI, task for Jina) based on a
'context' arg ('query'/'document'/None), instead of having EmbeddingConfig
pass pre-computed values.
- Unified config fields: replaced 4 separate fields (input_type_query,
input_type_document, task_query, task_document) with 2 unified fields
(query_param, document_param) that work for all providers
- OpenAI is symmetric by default; non-symmetric mode is activated
implicitly when query_param or document_param is set
- Jina is non-symmetric by default; task is always sent unless context=None
- Updated tests to use new API
Note: Official OpenAI models are symmetric and do not support input_type.
Non-symmetric mode is only supported by OpenAI-compatible third-party
models (e.g., BGE-M3, Jina, Cohere).
… PR volcengine#608) This enhancement builds on the merged PR volcengine#608 to add flexible string parsing for multiple parameters while maintaining 100% backward compatibility. Key features: - Backward compatible with existing PR volcengine#608 usage - Enhanced string parsing: 'input_type=query,task=search,domain=finance' - Mixed format support: 'query,task=search' - Multiple parameter support beyond just input_type - Clean integration with existing context/query_param/document_param architecture Supported formats: - 'query' -> {'input_type': 'query'} (PR volcengine#608 style) - 'input_type=query' -> {'input_type': 'query'} (explicit) - 'input_type=query,task=search' -> {'input_type': 'query', 'task': 'search'} (multiple) - 'query,task=search' -> {'input_type': 'query', 'task': 'search'} (mixed) This addresses the need for flexible OpenAI-compatible parameter passing while preserving the simple, elegant API from PR volcengine#608.
Minimal enhancement that adds support for multiple parameters in query_param and document_param using key=value format. Changes: - Add _parse_param_string() method for key=value parsing - Enhance _build_extra_body() to support parsed parameters - Store raw parameters for parsing - Update docstrings and examples Fixes: - Remove mixed mode optimization (redundant) - Remove USAGE_EXAMPLES.md (requested) - Clean integration with existing config fields from main Usage: - Simple: query_param='query' (works as before) - Enhanced: query_param='input_type=query,task=search' Builds on merged PR volcengine#608 without breaking changes.
Summary
Implement nonsymmetric embedding support so queries and documents can use different embedding strategies/parameters. This enables using OpenAI's
input_typeor Jina'stasksettings differently for retrieval queries vs. document/passage embeddings.What changed
input_typeparameter support (accepts "query", "document", "passage"). Parameters are passed viaextra_bodyto the OpenAI client.get_query_embedder(): Returns an embedder configured specifically for query embeddings.get_document_embedder(): Returns an embedder configured for document/passage embeddings.service/core.pyandmemory_deduplicator.pyto use the query-specific embedder.collection_schemas.pyto use the document-specific embedder for collection initialization.Rationale
Modern embedding models (like OpenAI
text-embedding-3or Jinav5) perform better when the API is told whether the input is a short query or a long document. This PR introduces the architectural plumbing to distinguish these two paths in the OpenViking engine, allowing for higher retrieval accuracy.Files included in this PR
openviking/models/embedder/openai_embedders.pyopenviking_cli/utils/config/embedding_config.pyopenviking/service/core.pyopenviking/session/memory_deduplicator.pyopenviking/storage/collection_schemas.pyopenviking_cli/utils/config/vlm_config.py(Note: Dockerfile optimizations and local test files were excluded from this PR to keep the focus on the embedding logic.)
Testing
ruff formatandruff check.mypy(minor warnings in external stubs reviewed as benign).test_embedding_input_type.pyandtest_openai_embedder.py).Checklist