model: Add VIRTUE multimodal embedding models (Sony VIRTUE-2B/7B-SCaR) #4822
Can you try running the ViDoRe v1 & v2 tasks to reproduce the scores?
Sure, I'll run ViDoRe v1 & v2 with both checkpoints and post the scores here.
Re: ViDoRe — I'll need to queue this on GPU and will post scores in a follow-up. Re: CI — the test and 3.13 failures don't reproduce locally (both latest and lowest deps pass). Could you share the failure logs or re-trigger the runs?
This is fine. It's just a flaky test. I think you can also check it yourself.
ViDoRe (v1 & v2) results for both VIRTUE checkpoints:
|
|
Hm, it seems they don't report results on ViDoRe, even though they evaluated on MMEB, which has ViDoRe v1 & v2 among its subtasks. I think we can merge then.
Adds the Sony VIRTUE universal text-image embedders (VIRTUE-2B-SCaR and VIRTUE-7B-SCaR), built on Qwen2-VL. The wrapper uses left-padding last-token pooling with L2 normalization and supports text-only, image-only, and fused image+text inputs, matching the no-visual-prompt path of the reference implementation. A smoke evaluation on AILAStatutes ran successfully with finite scores. Fixes #4517.
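
For reviewers less familiar with the pooling scheme named above, here is a minimal sketch of last-token pooling followed by L2 normalization. It is not the wrapper's code: the helper names `last_token_pool` and `l2_normalize` are hypothetical, and plain Python lists stand in for the tensors a real model would produce, so the example stays dependency-free.

```python
import math


def last_token_pool(hidden_states, attention_mask):
    """Return the hidden state of the last real token of each sequence.

    hidden_states: one list of per-token vectors per sequence.
    attention_mask: one list of 0/1 flags per sequence (1 = real token).
    Both names are illustrative assumptions, not the wrapper's API.
    """
    pooled = []
    for states, mask in zip(hidden_states, attention_mask):
        # Index of the last position whose mask is 1. With left padding the
        # real tokens sit at the end, so this is always the final position.
        last = max(i for i, flag in enumerate(mask) if flag == 1)
        pooled.append(states[last])
    return pooled


def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit Euclidean length (eps guards against zero)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / max(norm, eps) for x in vector]


# Toy batch: two sequences of length 2 with 2-dimensional hidden states.
hidden = [
    [[0.0, 0.0], [3.0, 4.0]],  # left-padded: [pad, token]
    [[1.0, 0.0], [0.0, 2.0]],  # no padding: [token, token]
]
mask = [
    [0, 1],
    [1, 1],
]

embeddings = [l2_normalize(v) for v in last_token_pool(hidden, mask)]
```

Left padding is what makes the last position a safe choice: every sequence in the batch ends on a real token, so the pooled vector never comes from a pad position. After normalization each embedding has unit length, so a dot product between two embeddings equals their cosine similarity.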