feat(embedding): add Voyage text embedding support #614

Closed

kfiramar wants to merge 1 commit into volcengine:main from kfiramar:feat/voyage-text-embeddings

Conversation

@kfiramar (Contributor)

Summary

  • add first-class voyage dense embedding support
  • resolve Voyage model defaults when deriving embedding dimension from config
  • document the Voyage configuration surface and add focused tests

Problem

OpenViking's embedding configuration currently supports a fixed set of providers and does not expose Voyage AI as a first-class dense embedding backend.

That creates two concrete issues for Voyage usage:

  • there is no supported provider path for Voyage text embedding models
  • when no explicit dimension is configured, schema creation falls back to a generic default instead of the Voyage model's actual output dimension

For Voyage, that second point is important because collection schema creation happens from configuration before embeddings are written. If the configured dimension does not match the vectors returned by the provider, writes fail later with a vector dimension mismatch.
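The failure mode can be sketched with a minimal check (the helper name here is hypothetical, not OpenViking code): the schema dimension is fixed at collection-creation time, so a vector of a different length is rejected at write time.

```python
# Hypothetical helper illustrating the failure mode: the collection
# schema is created from config first, so a later write with a vector
# of a different dimension is rejected.
def check_vector_dimension(schema_dim: int, vector: list[float]) -> None:
    """Raise if a vector does not match the collection schema dimension."""
    if len(vector) != schema_dim:
        raise ValueError(
            f"vector dimension mismatch: schema expects {schema_dim}, "
            f"got {len(vector)}"
        )
```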

What This PR Changes

1. Add a dedicated Voyage dense embedder

This PR adds `VoyageDenseEmbedder` and wires it into the embedding factory as a new `provider: "voyage"` option.

The embedder uses Voyage's embeddings API shape and supports provider-specific request fields:

  • `input_type`
  • `output_dtype`
  • `output_dimension` (mapped from `dimension` in config)
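The field mapping above can be sketched as a payload builder (a minimal sketch; the class and function names are illustrative, not the actual OpenViking implementation):

```python
# Hypothetical sketch of how a Voyage dense embedder might assemble
# its request body; field names follow Voyage's embeddings API.
from dataclasses import dataclass
from typing import Optional


@dataclass
class VoyageSettings:
    model: str = "voyage-3"
    dimension: Optional[int] = None      # mapped to Voyage's output_dimension
    input_type: Optional[str] = None     # e.g. "query" or "document"
    output_dtype: Optional[str] = None   # e.g. "float"


def build_voyage_payload(cfg: VoyageSettings, texts: list[str]) -> dict:
    """Assemble the JSON body for a Voyage embeddings request."""
    payload = {"model": cfg.model, "input": texts}
    # Provider-specific fields are only sent when configured, so the
    # request otherwise falls back to Voyage's API defaults.
    if cfg.input_type is not None:
        payload["input_type"] = cfg.input_type
    if cfg.output_dtype is not None:
        payload["output_dtype"] = cfg.output_dtype
    if cfg.dimension is not None:
        payload["output_dimension"] = cfg.dimension
    return payload
```

Sending only the configured fields keeps the payload valid for models that do not accept every optional parameter.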

2. Resolve Voyage dimensions from config

This PR adds provider-specific effective dimension resolution for Voyage so that:

  • an explicit dimension is honored when provided
  • otherwise, the configured Voyage model's default output dimension is used

That keeps vector schema creation aligned with the actual vectors returned by Voyage models.
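The resolution order above can be sketched as follows (a hedged sketch: the function name and the default-dimension table are assumptions for illustration, not the PR's actual lookup):

```python
# Hypothetical sketch of effective-dimension resolution for Voyage.
# The defaults listed here are illustrative, not an exhaustive table.
from typing import Optional

VOYAGE_DEFAULT_DIMENSIONS = {
    "voyage-3": 1024,
    "voyage-3-lite": 512,
}


def resolve_voyage_dimension(model: str, configured: Optional[int]) -> int:
    """Honor an explicit dimension; otherwise use the model's default."""
    if configured is not None:
        return configured
    try:
        return VOYAGE_DEFAULT_DIMENSIONS[model]
    except KeyError:
        raise ValueError(f"unknown Voyage model: {model}") from None
```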

3. Document the supported config shape

The configuration guide now includes:

  • provider: "voyage"
  • Voyage-specific embedding fields
  • a concrete dense embedding example
  • supported Voyage text embedding models
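A dense embedding config of the documented shape might look like the following (key names are assumed from this PR's description, not quoted from the merged guide):

```python
# Hypothetical Voyage dense embedding config; key names follow the
# fields discussed in this PR.
embedding_config = {
    "provider": "voyage",
    "model": "voyage-3",
    "dimension": 1024,          # mapped internally to output_dimension
    "input_type": "document",   # provider-specific text input mode
}
```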

Why This Design

This PR keeps the change narrow and provider-specific.

Instead of broadening existing providers, it introduces Voyage support only where Voyage behavior is materially different:

  • provider validation
  • embedder construction
  • Voyage-specific request payload fields
  • Voyage-specific default dimension resolution

That keeps existing provider behavior unchanged while making Voyage configuration explicit and maintainable.

Verification

Ran:

.venv/bin/python -m pytest tests/unit/test_voyage_embedder.py tests/unit/test_embedding_config_voyage.py --noconftest -q

Result:

  • 14 passed

Coverage of the added tests includes:

  • provider validation for voyage
  • default and explicit dimension handling
  • embedder factory wiring
  • Voyage request payload construction
  • batch behavior and API error handling

Scope

This PR is intentionally limited to Voyage text embedding support.
It does not change sparse embeddings, hybrid embeddings, reranking, or non-Voyage providers.

Add first-class Voyage dense embedding support.

This adds a dedicated Voyage dense embedder, validates the new
provider in embedding config, and ensures Voyage model defaults are
used when deriving the vector schema dimension from configuration.

The implementation also exposes Voyage-specific embedding options
such as input_type, output_dtype, and output_dimension in a
provider-specific way, with focused documentation and tests.

Verification:
- .venv/bin/python -m pytest tests/unit/test_voyage_embedder.py tests/unit/test_embedding_config_voyage.py --noconftest -q
@CLAassistant

CLAassistant commented Mar 14, 2026

CLA assistant check
All committers have signed the CLA.

| Field | Type | Description |
| --- | --- | --- |
| `dimension` | int | Vector dimension. For Voyage, this maps to `output_dimension` |
| `input` | str | Input type: `"text"` or `"multimodal"` |
| `input_type` | str | Provider-specific text input mode such as `"query"` or `"document"` (Voyage) |
| `output_dtype` | str | Provider-specific output dtype (Voyage) |
Collaborator

Should `input_type` be set in config? I think it should be `document` for the indexing scenario and `query` for the query scenario.

#608

@ZaynJarvis (Collaborator)

ZaynJarvis commented Mar 15, 2026

@kfiramar Thanks for adding Voyage support! However, I have concerns about adding output_dtype to the base EmbeddingModelConfig:

Problem: output_dtype is Voyage-specific and doesn't apply to other providers (OpenAI, Jina, etc.). Adding provider-specific parameters to the base config creates architectural issues.

Suggestions:

  1. Short-term: Use environment variable VOYAGE_OUTPUT_DTYPE instead of config field, since this is likely a deployment-level preference rather than per-model config

  2. Long-term: Refactor to support provider-specific configs:

This approach would solve the architectural issue for all providers, not just Voyage.
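One possible shape for such a provider-specific config split is sketched below (a hypothetical illustration of the idea, not the reviewer's actual proposal or OpenViking code):

```python
# Hypothetical shape for the long-term refactor: shared fields stay in
# the base config, provider-specific knobs live in a nested sub-config.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EmbeddingModelConfig:
    provider: str
    model: str
    dimension: Optional[int] = None  # shared across all providers


@dataclass
class VoyageOptions:
    input_type: Optional[str] = None    # Voyage-only
    output_dtype: Optional[str] = None  # Voyage-only


@dataclass
class VoyageEmbeddingConfig(EmbeddingModelConfig):
    voyage: VoyageOptions = field(default_factory=VoyageOptions)
```

With this shape, the base config never grows provider-specific fields; each provider's embedder reads only its own sub-config.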

What do you think? Would environment variable work for your use case?

@kfiramar (Contributor, Author)

Replacement is #635.

That branch keeps first-class Voyage dense embedding support, but narrows the config surface to match OpenViking's current runtime:

  • no output_dtype in public config
  • no query/document config knob while indexing and retrieval still share one dense embedder configuration
  • dimension is mapped internally to Voyage's output_dimension

Using one clean replacement PR is the least confusing way to review the final merge candidate.

@kfiramar (Contributor, Author)

Opened a dedicated follow-up PR for the larger query/document embedding architecture here: #636.

I kept that work separate so this Voyage support thread does not have to absorb the broader cross-cutting config/runtime change.

The follow-up PR covers the long-term direction raised in review:

  • separate query/document dense embedder config
  • provider-specific contextual behavior encapsulated in embedder implementations
  • retrieval/indexing wired to the correct embedder roles
  • no provider-specific expansion of the shared base embedding config
