feat(embedding): add Voyage text embedding support #614

Closed

kfiramar wants to merge 1 commit into volcengine:main from kfiramar:feat/voyage-text-embeddings

Conversation

@kfiramar (Contributor)

Summary

  • add first-class voyage dense embedding support
  • resolve Voyage model defaults when deriving embedding dimension from config
  • document the Voyage configuration surface and add focused tests

Problem

OpenViking's embedding configuration currently supports a fixed set of providers and does not expose Voyage AI as a first-class dense embedding backend.

That creates two concrete issues for Voyage usage:

  • there is no supported provider path for Voyage text embedding models
  • when no explicit dimension is configured, schema creation falls back to a generic default instead of the Voyage model's actual output dimension

For Voyage, that second point is important because collection schema creation happens from configuration before embeddings are written. If the configured dimension does not match the vectors returned by the provider, writes fail later with a vector dimension mismatch.
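The failure mode can be sketched with a minimal check (the helper name here is hypothetical, not OpenViking code): the schema dimension is fixed at collection-creation time, so a vector of a different length is rejected at write time.

```python
# Hypothetical helper illustrating the failure mode: the collection
# schema is created from config first, so a later write with a vector
# of a different dimension is rejected.
def check_vector_dimension(schema_dim: int, vector: list[float]) -> None:
    """Raise if a vector does not match the collection schema dimension."""
    if len(vector) != schema_dim:
        raise ValueError(
            f"vector dimension mismatch: schema expects {schema_dim}, "
            f"got {len(vector)}"
        )
```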

What This PR Changes

1. Add a dedicated Voyage dense embedder

This PR adds `VoyageDenseEmbedder` and wires it into the embedding factory as a new `provider: "voyage"` option.

The embedder uses Voyage's embeddings API shape and supports provider-specific request fields:

  • `input_type`
  • `output_dtype`
  • `output_dimension` (mapped from `dimension` in config)
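The field mapping above can be sketched as a payload builder (a minimal sketch; the class and function names are illustrative, not the actual OpenViking implementation):

```python
# Hypothetical sketch of how a Voyage dense embedder might assemble
# its request body; field names follow Voyage's embeddings API.
from dataclasses import dataclass
from typing import Optional


@dataclass
class VoyageSettings:
    model: str = "voyage-3"
    dimension: Optional[int] = None      # mapped to Voyage's output_dimension
    input_type: Optional[str] = None     # e.g. "query" or "document"
    output_dtype: Optional[str] = None   # e.g. "float"


def build_voyage_payload(cfg: VoyageSettings, texts: list[str]) -> dict:
    """Assemble the JSON body for a Voyage embeddings request."""
    payload = {"model": cfg.model, "input": texts}
    # Provider-specific fields are only sent when configured, so the
    # request otherwise falls back to Voyage's API defaults.
    if cfg.input_type is not None:
        payload["input_type"] = cfg.input_type
    if cfg.output_dtype is not None:
        payload["output_dtype"] = cfg.output_dtype
    if cfg.dimension is not None:
        payload["output_dimension"] = cfg.dimension
    return payload
```

Sending only the configured fields keeps the payload valid for models that do not accept every optional parameter.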

2. Resolve Voyage dimensions from config

This PR adds provider-specific effective dimension resolution for Voyage so that:

  • an explicit dimension is honored when provided
  • otherwise, the configured Voyage model's default output dimension is used

That keeps vector schema creation aligned with the actual vectors returned by Voyage models.
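The resolution order above can be sketched as follows (a hedged sketch: the function name and the default-dimension table are assumptions for illustration, not the PR's actual lookup):

```python
# Hypothetical sketch of effective-dimension resolution for Voyage.
# The defaults listed here are illustrative, not an exhaustive table.
from typing import Optional

VOYAGE_DEFAULT_DIMENSIONS = {
    "voyage-3": 1024,
    "voyage-3-lite": 512,
}


def resolve_voyage_dimension(model: str, configured: Optional[int]) -> int:
    """Honor an explicit dimension; otherwise use the model's default."""
    if configured is not None:
        return configured
    try:
        return VOYAGE_DEFAULT_DIMENSIONS[model]
    except KeyError:
        raise ValueError(f"unknown Voyage model: {model}") from None
```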

3. Document the supported config shape

The configuration guide now includes:

  • provider: "voyage"
  • Voyage-specific embedding fields
  • a concrete dense embedding example
  • supported Voyage text embedding models
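A dense embedding config of the documented shape might look like the following (key names are assumed from this PR's description, not quoted from the merged guide):

```python
# Hypothetical Voyage dense embedding config; key names follow the
# fields discussed in this PR.
embedding_config = {
    "provider": "voyage",
    "model": "voyage-3",
    "dimension": 1024,          # mapped internally to output_dimension
    "input_type": "document",   # provider-specific text input mode
}
```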

Why This Design

This PR keeps the change narrow and provider-specific.

Instead of broadening existing providers, it introduces Voyage support only where Voyage behavior is materially different:

  • provider validation
  • embedder construction
  • Voyage-specific request payload fields
  • Voyage-specific default dimension resolution

That keeps existing provider behavior unchanged while making Voyage configuration explicit and maintainable.

Verification

Ran:

.venv/bin/python -m pytest tests/unit/test_voyage_embedder.py tests/unit/test_embedding_config_voyage.py --noconftest -q

Result:

  • 14 passed

Coverage of the added tests includes:

  • provider validation for voyage
  • default and explicit dimension handling
  • embedder factory wiring
  • Voyage request payload construction
  • batch behavior and API error handling

Scope

This PR is intentionally limited to Voyage text embedding support.
It does not change sparse embeddings, hybrid embeddings, reranking, or non-Voyage providers.

Add first-class Voyage dense embedding support.

This adds a dedicated Voyage dense embedder, validates the new
provider in embedding config, and ensures Voyage model defaults are
used when deriving the vector schema dimension from configuration.

The implementation also exposes Voyage-specific embedding options
such as input_type, output_dtype, and output_dimension in a
provider-specific way, with focused documentation and tests.

Verification:
- .venv/bin/python -m pytest tests/unit/test_voyage_embedder.py tests/unit/test_embedding_config_voyage.py --noconftest -q
@CLAassistant

CLAassistant commented Mar 14, 2026

CLA assistant check
All committers have signed the CLA.

| Field | Type | Description |
| --- | --- | --- |
| `dimension` | int | Vector dimension. For Voyage, this maps to `output_dimension` |
| `input` | str | Input type: `"text"` or `"multimodal"` |
| `input_type` | str | Provider-specific text input mode such as `"query"` or `"document"` (Voyage) |
| `output_dtype` | str | Provider-specific output dtype (Voyage) |
Collaborator

Should `input_type` be set in config? I think it should be `document` for the indexing scenario and `query` for the query scenario.

#608

@ZaynJarvis (Collaborator)

ZaynJarvis commented Mar 15, 2026

@kfiramar Thanks for adding Voyage support! However, I have concerns about adding output_dtype to the base EmbeddingModelConfig:

Problem: output_dtype is Voyage-specific and doesn't apply to other providers (OpenAI, Jina, etc.). Adding provider-specific parameters to the base config creates architectural issues.

Suggestions:

  1. Short-term: Use environment variable VOYAGE_OUTPUT_DTYPE instead of config field, since this is likely a deployment-level preference rather than per-model config

  2. Long-term: Refactor to support provider-specific configs:

This approach would solve the architectural issue for all providers, not just Voyage.
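One possible shape for such a provider-specific config split is sketched below (a hypothetical illustration of the idea, not the reviewer's actual proposal or OpenViking code):

```python
# Hypothetical shape for the long-term refactor: shared fields stay in
# the base config, provider-specific knobs live in a nested sub-config.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EmbeddingModelConfig:
    provider: str
    model: str
    dimension: Optional[int] = None  # shared across all providers


@dataclass
class VoyageOptions:
    input_type: Optional[str] = None    # Voyage-only
    output_dtype: Optional[str] = None  # Voyage-only


@dataclass
class VoyageEmbeddingConfig(EmbeddingModelConfig):
    voyage: VoyageOptions = field(default_factory=VoyageOptions)
```

With this shape, the base config never grows provider-specific fields; each provider's embedder reads only its own sub-config.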

What do you think? Would environment variable work for your use case?

@kfiramar (Contributor, Author)

Replacement is #635.

That branch keeps first-class Voyage dense embedding support, but narrows the config surface to match OpenViking's current runtime:

  • no output_dtype in public config
  • no query/document config knob while indexing and retrieval still share one dense embedder configuration
  • dimension is mapped internally to Voyage's output_dimension

Using one clean replacement PR is the least confusing way to review the final merge candidate.

@kfiramar (Contributor, Author)

Opened a dedicated follow-up PR for the larger query/document embedding architecture here: #636.

I kept that work separate so this Voyage support thread does not have to absorb the broader cross-cutting config/runtime change.

The follow-up PR covers the long-term direction raised in review:

  • separate query/document dense embedder config
  • provider-specific contextual behavior encapsulated in embedder implementations
  • retrieval/indexing wired to the correct embedder roles
  • no provider-specific expansion of the shared base embedding config
