Skip to content

feat(embedder): add non-symmetric embedding support for query/document#608

Merged
ZaynJarvis merged 7 commits intovolcengine:mainfrom
CHW0n9:feature/nonsymmetric-embedding
Mar 16, 2026
Merged

feat(embedder): add non-symmetric embedding support for query/document#608
ZaynJarvis merged 7 commits intovolcengine:mainfrom
CHW0n9:feature/nonsymmetric-embedding

Conversation

@CHW0n9
Copy link
Copy Markdown
Contributor

@CHW0n9 CHW0n9 commented Mar 14, 2026

Summary

Implement nonsymmetric embedding support so queries and documents can use different embedding strategies/parameters. This enables using OpenAI's input_type or Jina's task settings differently for retrieval queries vs. document/passage embeddings.

What changed

  • OpenAIDenseEmbedder: Added input_type parameter support (accepts "query", "document", "passage"). Parameters are passed via extra_body to the OpenAI client.
  • EmbeddingConfig: Added contextual embedder helpers:
    • get_query_embedder(): Returns an embedder configured specifically for query embeddings.
    • get_document_embedder(): Returns an embedder configured for document/passage embeddings.
  • Service & Session: Updated service/core.py and memory_deduplicator.py to use the query-specific embedder.
  • Storage: Updated collection_schemas.py to use the document-specific embedder for collection initialization.
  • Configuration: Improved default-value handling and added validation for embedding providers.
  • Linting: Fixed import ordering and removed unused imports in affected files.

Rationale

Modern embedding models (like OpenAI text-embedding-3 or Jina v5) perform better when the API is told whether the input is a short query or a long document. This PR introduces the architectural plumbing to distinguish these two paths in the OpenViking engine, allowing for higher retrieval accuracy.

Files included in this PR

  • openviking/models/embedder/openai_embedders.py
  • openviking_cli/utils/config/embedding_config.py
  • openviking/service/core.py
  • openviking/session/memory_deduplicator.py
  • openviking/storage/collection_schemas.py
  • openviking_cli/utils/config/vlm_config.py

(Note: Dockerfile optimizations and local test files were excluded from this PR to keep the focus on the embedding logic.)

Testing

  • Linting: Verified with ruff format and ruff check.
  • Type Checking: Verified with mypy (minor warnings in external stubs reviewed as benign).
  • Unit Testing: New logic was verified with local test suites (test_embedding_input_type.py and test_openai_embedder.py).

Checklist

  • Code follows project style guidelines (Ruff/Mypy)
  • Contextual embedder logic implemented for OpenAI/Jina
  • CI pipeline passing (requires upstream build)

- Add input_type parameter to OpenAIDenseEmbedder
- Add get_query_embedder() and get_document_embedder() to EmbeddingConfig
- Support different embedding strategies for queries and documents
- Improve error handling and default value processing
- Fix import ordering and remove unused imports
@CLAassistant
Copy link
Copy Markdown

CLAassistant commented Mar 14, 2026

CLA assistant check
All committers have signed the CLA.

@MaojiaSheng
Copy link
Copy Markdown
Collaborator

Thanks but a few comments need to be resolved before merge

@ZaynJarvis
Copy link
Copy Markdown
Collaborator

ZaynJarvis commented Mar 15, 2026

@CHW0n9 Great work on adding non-symmetric embedding support!

I notice the current approach in _get_contextual_embedder() uses provider-specific if/else logic. As more providers like Gemini Embedding 2 add task type support, this pattern might become difficult to maintain.

Suggested refactor for better scalability:
Consider moving the input type/task selection logic into each specific EmbeddingModel class:

  1. Add a method like get_contextual_config(context: str) -> dict to each embedder class
  2. Let each embedder handle its own provider-specific logic (OpenAI uses input_type, Jina uses task, etc.)
  3. Replace the provider if/else chain with polymorphic calls

This would:

  • Keep provider-specific logic encapsulated
  • Make adding new providers simpler
  • Follow better OOP principles
  • Eliminate the growing if/elif chain

The current implementation works well, but this refactor could improve long-term maintainability. What do you think?

@ZaynJarvis ZaynJarvis mentioned this pull request Mar 15, 2026
19 tasks
kfiramar added a commit to kfiramar/OpenViking that referenced this pull request Mar 15, 2026
Add first-class query/document dense embedding configuration and wire it through retrieval and indexing.

- add dense_query and dense_document partial overrides on top of the base dense config
- encapsulate provider-specific query/document defaults in embedder classes via contextual init hooks
- use query embedders for retrieval paths and document embedders for schema/indexing paths
- document the new config shape and add focused coverage for config merging and Jina defaults

Follow-up to volcengine#608.
@kfiramar
Copy link
Copy Markdown
Contributor

Opened a dedicated follow-up PR for the long-term query/document embedding architecture here: #636.

This follow-up keeps the change scoped to the cross-cutting design discussed in this thread:

  • separate query-time and document/index-time dense embedder config
  • provider-specific contextual behavior implemented in embedder classes
  • retrieval wired to the query embedder and indexing/schema paths wired to the document embedder
  • no service-layer provider branching

That lets the architectural change land independently from provider-specific feature work.

@CHW0n9 CHW0n9 force-pushed the feature/nonsymmetric-embedding branch 2 times, most recently from 308dbb1 to ea535ac Compare March 15, 2026 13:59
Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-opencode)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
@CHW0n9 CHW0n9 force-pushed the feature/nonsymmetric-embedding branch from 7930e19 to 3246ae3 Compare March 15, 2026 14:01
@CHW0n9 CHW0n9 requested a review from MaojiaSheng March 15, 2026 14:36
Comment on lines +346 to +364
api_base=self.dense.api_base,
dimension=self.dense.dimension,
input_type=input_type,
)

if provider == "jina":
if self.dense.task_document:
task = "retrieval.query" if context == "query" else self.dense.task_document
else:
task = None

if not self.dense.model:
raise ValueError("Embedding model name is required")

return JinaDenseEmbedder(
model_name=self.dense.model,
api_key=self.dense.api_key,
api_base=self.dense.api_base,
dimension=self.dense.dimension,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

        return OpenAIDenseEmbedder(
            model_name=self.dense.model,
            api_key=self.dense.api_key,
            api_base=self.dense.api_base,
            dimension=self.dense.dimension,
            context=context,
        )
        ...
        return JinaDenseEmbedder(
            model_name=self.dense.model,
            api_key=self.dense.api_key,
            api_base=self.dense.api_base,
            dimension=self.dense.dimension,
            context=context,
        )

in XXXEmbedder impl:

  class OpenAIDenseEmbedder:
    ...
    input_type = "x" if context == "query" else "y"
  class JinaDenseEmbedder:
    ...
    task  = "a" if context == "query" else "b"

would this be better?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've moved the context parameter handling into embedder classes:

OpenAIDenseEmbedder:

def __init__(self, ..., context: Optional[str] = None, query_param: Optional[str] = None, document_param: Optional[str] = None, ...):
    non_symmetric = query_param is not None or document_param is not None
    if not non_symmetric:
        self.input_type = None
    elif context == "query":
        self.input_type = query_param if query_param is not None else "query"
    elif context == "document":
        self.input_type = document_param if document_param is not None else "passage"

JinaDenseEmbedder:

def __init__(self, ..., context: Optional[str] = None, query_param: str = "retrieval.query", document_param: str = "retrieval.passage", ...):
    if context == "query":
        self.task = query_param
    elif context == "document":
        self.task = document_param

Each embedder class now decides how to handle the context parameter - OpenAI uses input_type, Jina uses task. The _get_contextual_embedder() now simply passes the context parameter to the embedder class without provider-specific if/else logic.

task_document: Optional[str] = Field(
default=None,
description="Jina task for document embeddings (e.g. 'retrieval.passage')",
)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

too many config fields here.

  1. input_type_query, task_query refers to the same thing, should use same config name
  2. i don't think query / document / retrieval.query / retrieval.passage need to be set in config. static value should be decided by each Embedder for each context.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've consolidated the config fields!

Removed 4 old fields:

  • input_type_query
  • input_type_document
  • task_query
  • task_document

Added 2 unified fields:

  • query_param - works for all providers (OpenAI → input_type, Jina → task)
  • document_param - works for all providers

This makes it easier to add new providers in the future - no need to add new config fields, just use the existing query_param/document_param. The embedder class decides the default values:

  • OpenAI: query_param defaults to "query", document_param defaults to "passage"
  • Jina: query_param defaults to "retrieval.query", document_param defaults to "retrieval.passage"

OpenAI embedder is set default as symmetric models (without query_param and document_param in config). For OpenAI-Compatible embedding providers, if the user set one of the querry_param or document_param, the non-symmetric mode will be on. Users can specify the query_param and document_param to override the default value.

@CHW0n9 CHW0n9 force-pushed the feature/nonsymmetric-embedding branch 2 times, most recently from 5de6f7b to 87f983f Compare March 16, 2026 07:57
@CHW0n9 CHW0n9 force-pushed the feature/nonsymmetric-embedding branch from 87f983f to 673d6db Compare March 16, 2026 08:05
@CHW0n9 CHW0n9 force-pushed the feature/nonsymmetric-embedding branch from 2bffae2 to 673d6db Compare March 16, 2026 08:25
api_key=self.dense.api_key,
api_base=self.dense.api_base,
dimension=self.dense.dimension,
input_type=input_type,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

btw, just to confirm, OpenAI itself doesn't support input_type, but OpenAI compatible embedding models might require this in extra body to set query / document type?

how do we determine what's the key in extraBody ? e.g. input_type, task_type etc.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

    embedder = OpenAIDenseEmbedder(
        model_name="text-embedding-3-small",
        api_key="test-api-key",
        input_type="document",
    )

    embedder.embed_batch(["Hello", "World"])

    call_kwargs = mock_client.embeddings.create.call_args[1]
    assert "extra_body" in call_kwargs
    assert call_kwargs["extra_body"] == {"input_type": "document"}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, OpenAI models are symmetric, this param is designed for OpenAI compatible embedding APIs.

Official OpenAI models:

  • Symmetric, do not support input_type parameter
  • Examples: text-embedding-3-small, text-embedding-3-large

OpenAI-compatible third-party models:

  • May require input_type parameter for non-symmetric embeddings
  • Examples: Jina, Cohere, Voyage

How to determine the key in extra_body:

  • Each embedder class decides which key to use
  • OpenAIDenseEmbedder uses input_type
  • JinaDenseEmbedder uses task

This is documented in the class docstrings:

"""OpenAI-Compatible Dense Embedder Implementation

Note: Official OpenAI models are symmetric and do not support the input_type parameter.
Non-symmetric mode (context='query'/'document') is only supported by OpenAI-compatible
third-party models (e.g., BGE-M3, Jina, Cohere, etc.) that implement the input_type parameter.
"""

CHW0n9 and others added 2 commits March 16, 2026 18:14
Per reviewer feedback, each embedder now decides its own context-specific
parameter mapping (input_type for OpenAI, task for Jina) based on a
'context' arg ('query'/'document'/None), instead of having EmbeddingConfig
pass pre-computed values.

- Unified config fields: replaced 4 separate fields (input_type_query,
  input_type_document, task_query, task_document) with 2 unified fields
  (query_param, document_param) that work for all providers
- OpenAI is symmetric by default; non-symmetric mode is activated
  implicitly when query_param or document_param is set
- Jina is non-symmetric by default; task is always sent unless context=None
- Updated tests to use new API

Note: Official OpenAI models are symmetric and do not support input_type.
Non-symmetric mode is only supported by OpenAI-compatible third-party
models (e.g., BGE-M3, Jina, Cohere).
Copy link
Copy Markdown
Collaborator

@ZaynJarvis ZaynJarvis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, thx :)

@ZaynJarvis ZaynJarvis merged commit 5260b81 into volcengine:main Mar 16, 2026
1 check passed
@github-project-automation github-project-automation bot moved this from Backlog to Done in OpenViking project Mar 16, 2026
ZaynJarvis added a commit to ZaynJarvis/OpenViking that referenced this pull request Mar 17, 2026
… PR volcengine#608)

This enhancement builds on the merged PR volcengine#608 to add flexible string parsing
for multiple parameters while maintaining 100% backward compatibility.

Key features:
- Backward compatible with existing PR volcengine#608 usage
- Enhanced string parsing: 'input_type=query,task=search,domain=finance'
- Mixed format support: 'query,task=search'
- Multiple parameter support beyond just input_type
- Clean integration with existing context/query_param/document_param architecture

Supported formats:
- 'query' -> {'input_type': 'query'} (PR volcengine#608 style)
- 'input_type=query' -> {'input_type': 'query'} (explicit)
- 'input_type=query,task=search' -> {'input_type': 'query', 'task': 'search'} (multiple)
- 'query,task=search' -> {'input_type': 'query', 'task': 'search'} (mixed)

This addresses the need for flexible OpenAI-compatible parameter passing
while preserving the simple, elegant API from PR volcengine#608.
ZaynJarvis added a commit to ZaynJarvis/OpenViking that referenced this pull request Mar 17, 2026
Minimal enhancement that adds support for multiple parameters in
query_param and document_param using key=value format.

Changes:
- Add _parse_param_string() method for key=value parsing
- Enhance _build_extra_body() to support parsed parameters
- Store raw parameters for parsing
- Update docstrings and examples

Fixes:
- Remove mixed mode optimization (redundant)
- Remove USAGE_EXAMPLES.md (requested)
- Clean integration with existing config fields from main

Usage:
- Simple: query_param='query' (works as before)
- Enhanced: query_param='input_type=query,task=search'

Builds on merged PR volcengine#608 without breaking changes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

5 participants