feat(embedding): add Voyage text embedding support #614
kfiramar wants to merge 1 commit into volcengine:main
Add first-class Voyage dense embedding support. This adds a dedicated Voyage dense embedder, validates the new provider in embedding config, and ensures Voyage model defaults are used when deriving the vector schema dimension from configuration. The implementation also exposes Voyage-specific embedding options such as `input_type`, `output_dtype`, and `output_dimension` in a provider-specific way, with focused documentation and tests.
Verification:
- `.venv/bin/python -m pytest tests/unit/test_voyage_embedder.py tests/unit/test_embedding_config_voyage.py --noconftest -q`
| Parameter | Type | Description |
|-----------|------|-------------|
| `dimension` | int | Vector dimension. For Voyage, this maps to `output_dimension` |
| `input` | str | Input type: `"text"` or `"multimodal"` |
| `input_type` | str | Provider-specific text input mode such as `"query"` or `"document"` (Voyage) |
| `output_dtype` | str | Provider-specific output dtype (Voyage) |
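As a rough illustration of how the fields in this table might fit together, here is a config fragment written as a Python dict with a small validator. Only the field names and the `"text"`/`"multimodal"` and `"query"`/`"document"` values come from the table. The model name, the `output_dtype` value, and the `validate_voyage_config` helper are assumptions made for the example, not OpenViking's actual code.

```python
# Field names come from the table above; every value is an illustrative assumption.
ALLOWED_INPUT = {"text", "multimodal"}
ALLOWED_INPUT_TYPE = {None, "query", "document"}

voyage_embedding_config = {
    "provider": "voyage",
    "model": "voyage-3",        # assumed model name, not taken from this PR
    "input": "text",
    "dimension": 1024,          # sent to Voyage as output_dimension
    "input_type": "document",
    "output_dtype": "float",    # assumed value
}


def validate_voyage_config(cfg: dict) -> None:
    """Hypothetical validator: raises ValueError on an unsupported value."""
    if cfg.get("input", "text") not in ALLOWED_INPUT:
        raise ValueError(f"unsupported input: {cfg['input']!r}")
    if cfg.get("input_type") not in ALLOWED_INPUT_TYPE:
        raise ValueError(f"unsupported input_type: {cfg['input_type']!r}")
    dim = cfg.get("dimension")
    if dim is not None and (not isinstance(dim, int) or dim <= 0):
        raise ValueError("dimension must be a positive integer")


# The example config passes validation without raising.
validate_voyage_config(voyage_embedding_config)
```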
Should `input_type` be set in config? I think it should be `document` for the indexing scenario and `query` for the query scenario.
@kfiramar Thanks for adding Voyage support! However, I have concerns about adding
Problem:
Suggestions:
This approach would solve the architectural issue for all providers, not just Voyage. What do you think? Would an environment variable work for your use case?
Replacement is #635. That branch keeps first-class Voyage dense embedding support, but narrows the config surface to match OpenViking's current runtime:
Using one clean replacement PR is the least confusing way to review the final merge candidate.
Opened a dedicated follow-up PR for the larger query/document embedding architecture here: #636. I kept that work separate so this Voyage support thread does not have to absorb the broader cross-cutting config/runtime change. The follow-up PR covers the long-term direction raised in review:
Summary
- `voyage` dense embedding support
Problem
OpenViking's embedding configuration currently supports a fixed set of providers and does not expose Voyage AI as a first-class dense embedding backend.
That creates two concrete issues for Voyage usage:
- when no `dimension` is configured, schema creation falls back to a generic default instead of the Voyage model's actual output dimension
For Voyage, that second point is important because collection schema creation happens from configuration before embeddings are written. If the configured dimension does not match the vectors returned by the provider, writes fail later with a vector dimension mismatch.
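The failure mode can be sketched with a toy collection whose dimension is fixed at schema creation time. `ToyCollection` is a hypothetical stand-in, not OpenViking's storage layer, and the numbers 2048 and 1024 are chosen only to make the mismatch visible.

```python
class ToyCollection:
    """Stand-in for a vector collection whose schema fixes the dimension up front."""

    def __init__(self, dimension: int) -> None:
        self.dimension = dimension
        self.rows = []

    def write(self, vector) -> None:
        # The schema was created before any embedding existed, so the check happens here.
        if len(vector) != self.dimension:
            raise ValueError(
                f"vector dimension mismatch: schema expects {self.dimension}, "
                f"got {len(vector)}"
            )
        self.rows.append(list(vector))


# Schema created from a generic default (hypothetical value).
collection = ToyCollection(dimension=2048)

# The provider later returns a shorter vector (hypothetical model output size).
provider_vector = [0.0] * 1024

try:
    collection.write(provider_vector)
    write_error = None
except ValueError as exc:
    write_error = str(exc)
```

The error surfaces only at write time, which is why resolving the right dimension while the schema is being built matters.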
What This PR Changes
1. Add a dedicated Voyage dense embedder
This PR adds `VoyageDenseEmbedder` and wires it into the embedding factory as a new `provider: "voyage"` option.
The embedder uses Voyage's embeddings API shape and supports provider-specific request fields:
- `input_type`
- `output_dtype`
- `output_dimension` (mapped from `dimension` in config)
2. Resolve Voyage dimensions from config
This PR adds provider-specific effective dimension resolution for Voyage so that:
- `dimension` is honored when provided
That keeps vector schema creation aligned with the actual vectors returned by Voyage models.
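A minimal sketch of this kind of resolution, assuming a lookup table of per-model defaults. The function name, the table contents, and the generic fallback value are all hypothetical. The two model dimensions reflect Voyage's published defaults as far as I know and should be checked against Voyage's documentation.

```python
from typing import Optional

# Assumed per-model default output dimensions; verify against Voyage's docs.
VOYAGE_MODEL_DIMENSIONS = {
    "voyage-3": 1024,
    "voyage-3-lite": 512,
}

# Hypothetical generic fallback used when nothing more specific is known.
GENERIC_DEFAULT_DIMENSION = 2048


def resolve_effective_dimension(
    provider: str, model: str, dimension: Optional[int] = None
) -> int:
    """Pick the dimension the vector schema should be created with."""
    # An explicitly configured dimension always wins.
    if dimension is not None:
        return dimension
    # Otherwise prefer the Voyage model's own default.
    if provider == "voyage" and model in VOYAGE_MODEL_DIMENSIONS:
        return VOYAGE_MODEL_DIMENSIONS[model]
    # Unknown provider or model: fall back to the generic default.
    return GENERIC_DEFAULT_DIMENSION
```

For example, `resolve_effective_dimension("voyage", "voyage-3-lite")` would yield the model default from the table, while passing `dimension=256` would return 256 regardless of the model.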
3. Document the supported config shape
The configuration guide now includes:
- `provider: "voyage"`
Why This Design
This PR keeps the change narrow and provider-specific.
Instead of broadening existing providers, it introduces Voyage support only where Voyage behavior is materially different:
That keeps existing provider behavior unchanged while making Voyage configuration explicit and maintainable.
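To illustrate how the Voyage-specific options can stay isolated from other providers, here is a sketch of a request-body builder. The helper name `build_voyage_request` is hypothetical and this is not the PR's `VoyageDenseEmbedder`. The body keys (`input`, `model`, `input_type`, `output_dtype`, `output_dimension`) follow Voyage's embeddings API as I understand it, and no network call is made.

```python
from typing import Any, Dict, Optional, Sequence


def build_voyage_request(
    texts: Sequence[str],
    model: str,
    input_type: Optional[str] = None,
    output_dtype: Optional[str] = None,
    dimension: Optional[int] = None,
) -> Dict[str, Any]:
    """Build a JSON body for a Voyage embeddings request (hypothetical helper)."""
    payload: Dict[str, Any] = {"input": list(texts), "model": model}
    # Optional fields are added only when configured, so unset options
    # never reach the provider.
    if input_type is not None:
        payload["input_type"] = input_type
    if output_dtype is not None:
        payload["output_dtype"] = output_dtype
    if dimension is not None:
        # The config-level `dimension` maps to Voyage's `output_dimension`.
        payload["output_dimension"] = dimension
    return payload


request_body = build_voyage_request(
    ["hello world"], model="voyage-3", input_type="document", dimension=512
)
```

Because the mapping lives in one provider-specific function, other providers never see `input_type` or `output_dtype`.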
Verification
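Ran:
- `.venv/bin/python -m pytest tests/unit/test_voyage_embedder.py tests/unit/test_embedding_config_voyage.py --noconftest -q`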
Result:
14 passed
Coverage of the added tests includes:
- `voyage`
Scope
This PR is intentionally limited to Voyage text embedding support.
It does not change sparse embeddings, hybrid embeddings, reranking, or non-Voyage providers.