Description
Problem Statement
Summary
I would like to request first-class support for Gemini Embedding 2 in OpenViking.
This feels like a strong fit for OpenViking because the project is positioned as a context database for AI agents, not just a text vector store. Gemini Embedding 2 is especially interesting because it can embed multiple modalities into a shared semantic space, which matches OpenViking’s broader context and retrieval goals.
What matters most
The key point is that this should be treated as a true multimodal retrieval backend, not just as another text embedding option.
To be valuable in OpenViking, support should preserve the multimodal nature of the model through the retrieval pipeline. In particular:
- it should not reduce everything to plain text before embedding
- it should preserve modality where possible during ingestion and indexing
- it should support meaningful cross-modal retrieval use cases, not only text-to-text search
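As a toy illustration of what the shared-space idea buys you (every vector below is a hand-made 3-d stand-in, not a real Gemini Embedding 2 output), cross-modal retrieval reduces to nearest-neighbor search over embeddings that keep their source modality as metadata instead of flattening it away:

```python
import math

# Toy shared embedding space: hand-made vectors standing in for real
# model outputs (illustration only, no real embedding calls are made).
index = [
    ("report.pdf",  "document", [0.9, 0.1, 0.0]),
    ("diagram.png", "image",    [0.5, 0.5, 0.1]),
    ("memo.wav",    "audio",    [0.0, 0.1, 0.9]),
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def search(query_vec, top_k=1):
    # Modality rides along through ranking as first-class metadata.
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[2]),
                    reverse=True)
    return ranked[:top_k]

# A query vector that happens to land near the document's region of the
# space retrieves the document, regardless of what modality produced it.
hits = search([0.85, 0.15, 0.05])
print(hits[0][0], hits[0][1])  # report.pdf document
```

The point of the sketch is only that nothing in the retrieval path depends on the query and the result sharing a modality; that is the property the bullets above ask OpenViking to preserve.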
Why this matters
A lot of systems already support standard text embeddings well. What could make OpenViking stand out is stronger retrieval across mixed context such as documents, images, audio, video, and other agent resources.
Gemini Embedding 2 seems like a very natural match for that direction.
Expected outcome
From a user perspective, the important result is:
- first-class support for Gemini Embedding 2
- multimodal inputs handled as multimodal, not flattened by default
- retrieval that benefits from a shared embedding space across resource types
- clear documentation of capabilities and limitations
I am intentionally not proposing a specific implementation path here. The main request is that, if added, Gemini Embedding 2 should be integrated in a way that reflects its value as a multimodal retrieval model.
Thanks for considering this.
Proposed Solution
Just use Codex ;)
Alternatives Considered
Feature Area
Model Integration
Use Case
OpenViking is meant to serve as a context layer for AI agents that work with more than plain text. A useful next step would be support for Gemini Embedding 2 so OpenViking can better handle retrieval across mixed resource types such as documents, images, audio, video, and screenshots. The main use case is multimodal retrieval in a shared embedding space, for example finding the right document from an image query, retrieving notes from an audio clip, or matching a text query to relevant visual or document context. This would make OpenViking more useful for real agent workflows where context is heterogeneous and should not always be flattened into text before retrieval.
Example API (Optional)
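One hypothetical shape such a backend could take. Every name here (`Resource`, `MultimodalIndex`, `add`, `query`) is invented for illustration and does not come from OpenViking's actual API; a minimal in-memory stub is used so the sketch is self-contained and runnable:

```python
from dataclasses import dataclass, field

# Hypothetical API sketch; none of these names are OpenViking's real API.
@dataclass
class Resource:
    uri: str
    modality: str        # e.g. "text" | "image" | "audio" | "video" | "document"
    embedding: list      # would come from Gemini Embedding 2 in practice

@dataclass
class MultimodalIndex:
    resources: list = field(default_factory=list)

    def add(self, resource: Resource) -> None:
        # Ingestion keeps the original modality as first-class metadata
        # rather than flattening everything to text first.
        self.resources.append(resource)

    def query(self, embedding, modalities=None, top_k=3):
        # Cross-modal by default; an optional modality filter narrows results.
        candidates = [
            r for r in self.resources
            if modalities is None or r.modality in modalities
        ]
        return sorted(
            candidates,
            key=lambda r: sum(q * v for q, v in zip(embedding, r.embedding)),
            reverse=True,
        )[:top_k]

idx = MultimodalIndex()
idx.add(Resource("notes/meeting.md", "text", [1.0, 0.0]))
idx.add(Resource("shots/login.png", "image", [0.9, 0.4]))
result = idx.query([0.8, 0.5], modalities={"image"}, top_k=1)
print(result[0].uri)  # shots/login.png
```

The design choice being illustrated: the modality filter is opt-in, so text-to-image, image-to-document, and similar cross-modal queries are the default behavior rather than a special case.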
Additional Context
No response
Contribution
- I am willing to contribute to implementing this feature