Skip to content

feat(embedding): split query and document embedders#636

Closed
kfiramar wants to merge 2 commits intovolcengine:mainfrom
kfiramar:feat/query-document-embedders
Closed

feat(embedding): split query and document embedders#636
kfiramar wants to merge 2 commits intovolcengine:mainfrom
kfiramar:feat/query-document-embedders

Conversation

@kfiramar
Copy link
Copy Markdown
Contributor

@kfiramar kfiramar commented Mar 15, 2026

Summary

This PR introduces first-class query-time and document-time dense embedding configuration in OpenViking.

The goal is to solve the architectural problem raised during review of #608: some embedding providers need different request semantics for retrieval queries versus indexed documents, but that behavior should not be modeled as provider-specific branching in the service layer or as provider-only fields in the shared base config.

This change makes the query/document split explicit in configuration and wiring while keeping provider-specific contextual behavior inside embedder implementations.

Why

Modern retrieval models often distinguish between short search queries and document or passage embeddings. That distinction matters for retrieval quality, but the current single dense-embedder configuration does not model it cleanly.

This PR provides a long-term structure for that split by:

  • making query and document embedders explicit in config
  • using the correct embedder at each call site
  • keeping provider-specific defaults encapsulated in embedders
  • preserving backward compatibility for existing configurations

What Changed

  • add embedding.dense_query and embedding.dense_document as optional partial overrides on top of the existing embedding.dense config
  • keep embedding.dense as the shared base so common settings do not need to be duplicated
  • add EmbeddingConfig.get_query_embedder()
  • add EmbeddingConfig.get_document_embedder()
  • keep EmbeddingConfig.get_embedder() as a backward-compatible alias for the query embedder
  • route retrieval paths through the query embedder
  • route indexing, schema initialization, and persisted-vector dimension selection through the document embedder
  • add an embedder-level contextual initialization hook so providers can apply query/document defaults without service-layer conditionals
  • use that hook in JinaDenseEmbedder to map query embeddings to retrieval.query and document embeddings to retrieval.passage
  • document the new configuration shape and add focused regression coverage

Design Notes

This PR intentionally addresses the architecture, not a single provider.

Important design constraints:

  • provider-specific behavior stays in embedder implementations
  • the shared config is not expanded with provider-only fields
  • existing configs remain valid because dense_query and dense_document are optional
  • the document-side embedder determines stored vector shape, which is the correct source for collection/schema dimension

This keeps the feature lean while giving providers room to support contextual embedding behavior cleanly.

Backward Compatibility

Existing configurations continue to work unchanged.

If embedding.dense_query and embedding.dense_document are omitted:

  • OpenViking still uses embedding.dense
  • get_embedder() continues to behave as before
  • query and document paths resolve to the same effective embedder configuration

The new config keys are additive overrides, not a breaking replacement.

Example Configuration

embedding:
  dense:
    provider: jina
    model: jina-embeddings-v4
    api_key: ${JINA_API_KEY}
    dimension: 1024
  dense_query:
    config:
      task: retrieval.query
  dense_document:
    config:
      task: retrieval.passage

Providers that do not need different query/document behavior can continue using only embedding.dense.

Validation

Automated validation performed for this PR:

  • ruff check on touched source and test files
  • focused unit tests for config override merging
  • focused unit tests for contextual embedder defaults
  • service-level regression test covering query embedder wiring into VikingFS
  • storage and session regression tests for updated embedder call sites
  • quick-start integration smoke coverage for initialization behavior

Manual checks:

  • verified base config plus query/document overrides merge as expected
  • verified stored vector dimension follows the document embedder configuration
  • verified Jina query/document embedders receive the expected default task values
  • verified unchanged configs still resolve successfully through the backward-compatible path

Scope

This PR is intentionally limited to query/document dense embedder architecture.

It does not:

  • add provider-specific config trees for every provider
  • introduce provider-specific branching into service code
  • bundle unrelated provider feature work into the same review

Related

Add first-class query/document dense embedding configuration and wire it through retrieval and indexing.

- add dense_query and dense_document partial overrides on top of the base dense config
- encapsulate provider-specific query/document defaults in embedder classes via contextual init hooks
- use query embedders for retrieval paths and document embedders for schema/indexing paths
- document the new config shape and add focused coverage for config merging and Jina defaults

Follow-up to volcengine#608.
Add a service-level regression test for the new query/document embedding split.

The test verifies OpenVikingService initializes both embedders and passes the query embedder into VikingFS during runtime initialization.
@ZaynJarvis
Copy link
Copy Markdown
Collaborator

referencing #608 review, two things needs confirmation

  1. is query & document embedder used in the correct place
  2. i don't think task type should be set in config, rather, it should be assigned in Embedder based on context.

@ZaynJarvis
Copy link
Copy Markdown
Collaborator

ZaynJarvis commented Mar 16, 2026

#608 is merged, this PR might be redundant. if no further update will be closed in future.

@ZaynJarvis ZaynJarvis closed this Mar 17, 2026
@github-project-automation github-project-automation bot moved this from Backlog to Done in OpenViking project Mar 17, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

2 participants