Note on ViCA2 vs. ViCA:
This repository primarily focuses on ViCA2, including its codebase, training scripts, and our novel model architecture. For all details pertaining to ViCA2, please refer to the information within this repository.
For information regarding the original ViCA model, including its introduction and model weight downloads, please consult this link. ViCA was fine-tuned using the LLaVA-NeXT framework (our sincere thanks to the LLaVA-NeXT team), and its model architecture matches LLaVA-NeXT. Although you could clone the LLaVA-NeXT repository for the base structure, this ViCA2 repository already includes the LLaVA-NeXT codebase, as well as example scripts used for training the original ViCA. If you intend to fine-tune your own ViCA-like model on custom data, cloning this repository is therefore also a viable option.
This repository contains the official implementation and resources for ViCA2 (Visuospatial Cognitive Assistant 2), a novel Multimodal Large Language Model (MLLM) designed to significantly enhance visuospatial cognition. ViCA2 excels at reasoning about spatial layouts, relations, and dynamics in both image and video settings.
We also release ViCA-322K, a new large-scale dataset with over 322,000 spatially grounded question-answer pairs for targeted instruction tuning.
ViCA2-7B achieves state-of-the-art performance on the challenging VSI-Bench benchmark, significantly outperforming larger open-source models and leading proprietary models.
Figure 1: Average performance comparison on VSI-Bench. ViCA2-7B (56.8) surpasses models like LLaVA-NeXT-Video-72B (40.9) and Gemini-1.5 Pro (45.4).
- Motivation
- Key Contributions
- ViCA2 Architecture
- Specialized Datasets for Visuospatial Cognition
- Training Strategy
- Results
- Model Zoo: ViCA2 Checkpoints
- Setup & Usage
- ViCA2-Thinking Variant
- Limitations
- Future Work
While current Multimodal Large Language Models (MLLMs) have shown remarkable progress in general vision-language tasks (e.g., image captioning, VQA), they still struggle significantly with visuospatial cognition. This critical aspect of visual intelligence involves fine-grained understanding of:
- Object layouts and spatial relations (e.g., "Is object A to the left of object B?")
- Temporal order of events (e.g., "Did the red ball appear before the blue cube?")
- Geometric attributes and measurements (e.g., "How far is the chair from the table?", "Estimate the room size.")
These capabilities are crucial for a wide range of downstream applications, including indoor navigation, assistive robotics, video summarization, and embodied AI.
The VSI-Bench benchmark (vision-x-nyu/thinking-in-space) was specifically designed to evaluate such visuospatial reasoning abilities in MLLMs. Analyses conducted with VSI-Bench revealed that even many state-of-the-art MLLMs, including large-scale proprietary models, exhibit relatively low performance on these spatially grounded tasks. This highlights a significant gap in current MLLM capabilities.
Existing MLLMs often lack:
- Architectural components optimized for capturing high-resolution spatial details and structured spatial knowledge. Single semantic vision encoders, common in many MLLMs, often fail to preserve fine-grained layouts.
- Specialized training data targeting diverse spatial reasoning tasks, which is necessary for models to internalize and generalize these complex skills.
ViCA2 is designed to address these fundamental limitations by introducing a novel architecture and a tailored dataset to substantially improve visuospatial understanding in MLLMs.
- Novel Dual Vision Encoder Architecture: ViCA2 integrates:
  - SigLIP: For robust global semantic understanding.
  - Hiera (from SAM2): For capturing fine-grained spatial structures and object-centric details. This combination allows for joint reasoning over both semantics and precise spatial cues.
- Token Ratio Control Mechanism: An efficient mechanism to balance the information flow from the two encoders, managing the trade-off between semantic abstraction, spatial detail, and computational memory constraints.
- ViCA-322K Dataset: A new, large-scale instruction-tuning dataset with over 322,000 spatially grounded question-answer pairs derived from real indoor videos, specifically curated to enhance spatial reasoning.
- ViCA-Thinking-2.68K Dataset: A specialized, lightweight dataset (2,680 examples) designed to fine-tune models to produce explicit intermediate reasoning steps ("Thoughts") before their final answers. Each entry features a unique video and was generated using a sophisticated pipeline involving Gemini 2.5 Pro for quality control. This dataset is also publicly released.
- State-of-the-Art Performance: ViCA2-7B achieves an average score of 56.8 on VSI-Bench, significantly outperforming much larger models (e.g., LLaVA-NeXT-Video-72B: 40.9) and proprietary models (e.g., Gemini-1.5 Pro: 45.4), demonstrating strong visuospatial intelligence with a compact model.
- Open-Source Release: We release the ViCA2 model weights, codebase, and both the ViCA-322K and ViCA-Thinking-2.68K datasets to facilitate further research in visuospatial cognition and explicit reasoning.
ViCA2 builds upon the LLaVA-NeXT framework and introduces key architectural innovations.
Figure 2: Overview of the ViCA2 architecture, integrating SigLIP and Hiera with token ratio control.
ViCA2 leverages a dual vision encoder strategy to capture both global semantics and fine-grained spatial details:
- SigLIP (Semantic Encoder): We employ `google/siglip-so400m-patch14-384` for its strong global semantic feature extraction. This encoder processes all sampled video frames (e.g., 64 frames) and is crucial for general scene understanding and aligning visual concepts with language.
- Hiera (Spatial Encoder): To enhance spatial reasoning, we incorporate the `sam2.1_hiera_b+` variant from Meta AI's `facebook/sam2.1-hiera-base-plus`. Hiera's hierarchical, multi-stage architecture excels at modeling fine-grained spatial structures and multiscale visual features. This is essential for tasks requiring precise localization, object-centric understanding, and detailed layout comprehension. In ViCA2, Hiera processes a subset of frames (e.g., 32 frames) to balance detailed spatial analysis with computational efficiency. Features are typically extracted from a deeper stage (e.g., Stage 4 of Hiera) and then spatially pooled.
The outputs from both SigLIP and Hiera are individually projected into the language model's embedding space using separate linear projectors. These projected features are then concatenated along the sequence dimension before being fed to the language decoder (Qwen2-7B), allowing the model to jointly reason over semantic and spatial information.
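To make the fusion step concrete, here is a minimal sketch of the projection-and-concatenation described above. It is illustrative only: the class name, the hidden sizes (1152 for SigLIP-SO400M, 896 for Hiera-B+ Stage 4, 3584 for Qwen2-7B), and the single-layer projectors are assumptions, not the actual ViCA2 code.

```python
import torch
import torch.nn as nn

class DualProjectorFusion(nn.Module):
    """Illustrative sketch: project SigLIP and Hiera features into the LLM
    embedding space with separate projectors, then concatenate along the
    token (sequence) dimension before the language decoder."""

    def __init__(self, siglip_dim=1152, hiera_dim=896, llm_dim=3584):
        super().__init__()
        self.siglip_proj = nn.Linear(siglip_dim, llm_dim)  # semantic pathway
        self.hiera_proj = nn.Linear(hiera_dim, llm_dim)    # spatial pathway

    def forward(self, siglip_tokens, hiera_tokens):
        # siglip_tokens: (B, T_siglip, siglip_dim); hiera_tokens: (B, T_hiera, hiera_dim)
        sem = self.siglip_proj(siglip_tokens)
        spa = self.hiera_proj(hiera_tokens)
        return torch.cat([sem, spa], dim=1)  # (B, T_siglip + T_hiera, llm_dim)

# Token counts reported for the (32, 4, 2) configuration described below
fusion = DualProjectorFusion()
prefix = fusion(torch.randn(1, 13440, 1152), torch.randn(1, 8704, 896))
print(prefix.shape)  # torch.Size([1, 22144, 3584])
```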
A Note on Hiera Integration:
Our model implementation utilizes the Hugging Face transformers library, which offers excellent integration with tools like DeepSpeed. However, the original SAM2.1 model released by Meta AI, including its Hiera backbone, was not natively integrated into the Hugging Face ecosystem at the time of our development. To facilitate seamless usage, we extracted the Hiera weights from Meta's official SAM2.1 release and wrapped them with Hugging Face compatibility layers. This allows for easy, one-click instantiation and use of the Hiera encoder within a standard Hugging Face workflow.
While our primary ViCA2 models utilize the Hiera-Base-Plus variant (sam2.1_hiera_b+), we have also prepared and provide a Hugging Face compatible version of Hiera-Large (sam2.1_hiera_l) for research purposes, allowing exploration with a larger spatial encoder. Our adapted Hiera modules and the integration code are publicly available on Hugging Face as part of the ViCA2 release
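For illustration, the snippet below shows how such a Hugging Face compatible wrapper could be instantiated in a standard workflow. The repository id is a placeholder (check the ViCA2 Hugging Face release for the actual one), and the exact class exposed by the wrapper may differ.

```python
from transformers import AutoModel

# Placeholder repository id: replace with the actual HF-wrapped Hiera repo from the ViCA2 release.
HIERA_REPO = "your-org/hiera-sam2.1-hf"

# trust_remote_code allows transformers to load the custom wrapper code shipped with the repo.
hiera = AutoModel.from_pretrained(HIERA_REPO, trust_remote_code=True)
hiera.eval()
```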
To effectively manage the balance between the rich semantic information from SigLIP and the detailed spatial cues from Hiera, especially under memory constraints, ViCA2 employs a token ratio control strategy. This strategy is governed by a configurable triplet: (N_hiera, S_stage, s_pool), where:
- N_hiera: The number of video frames processed by the Hiera encoder.
- S_stage: The specific stage within Hiera from which features are extracted (Hiera has a multi-stage hierarchical structure).
- s_pool: The stride used for spatial pooling applied to the Hiera features, which helps in downsampling tokens while preserving spatial continuity.
This mechanism allows for fine-grained control over the representational budget allocated to semantic versus spatial features. While an exhaustive search for the absolute optimal ratio was beyond the scope of available computational resources, our design process led us to the configuration (32, 4, 2) – 32 frames for Hiera, features from Stage 4, and a pooling stride of 2.
This configuration yields T_siglip = 13,440 tokens from SigLIP (processing 64 frames) and T_hiera = 8,704 tokens from Hiera. The resulting token ratio of T_siglip : T_hiera ≈ 1.54 was found to offer a favorable trade-off, providing a strong balance between global semantic understanding and fine-grained spatial precision, which proved effective for the visuospatial reasoning tasks targeted by ViCA2 within our operational hardware limitations. This strategic allocation enhances the model's ability to perform spatially grounded reasoning without exceeding computational budgets.
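The reported budget can be sanity-checked with simple arithmetic; the per-frame counts below are just back-derived from the stated totals (13,440 SigLIP tokens over 64 frames; 8,704 Hiera tokens over 32 frames), not read from the implementation.

```python
# Back-of-the-envelope check of the (32, 4, 2) token budget described above.
siglip_frames, hiera_frames = 64, 32
siglip_tokens, hiera_tokens = 13_440, 8_704

print(siglip_tokens / siglip_frames)           # 210.0 SigLIP tokens per frame
print(hiera_tokens / hiera_frames)             # 272.0 Hiera tokens per frame
print(round(siglip_tokens / hiera_tokens, 2))  # 1.54, the semantic-to-spatial token ratio
```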
To effectively train models for advanced visuospatial reasoning, we developed and publicly released two specialized instruction-tuning datasets. These datasets were instrumental in the development of both the original ViCA and the enhanced ViCA2 models:
- ViCA-322K:
  - A large-scale dataset featuring over 322,000 spatially grounded question-answer pairs.
  - Derived from real indoor videos, it covers a diverse range of spatial reasoning tasks including object counting, distance and size estimation, temporal ordering, and complex relational understanding.
  - Designed for comprehensive instruction tuning to enhance core spatial perception and reasoning, forming a key training corpus for both ViCA and ViCA2.
  - Details & Access: ViCA-322K on Hugging Face Datasets
- ViCA-Thinking-2.68K:
  - A targeted, lightweight dataset comprising 2,680 examples, each from a unique video.
  - Specifically curated to fine-tune models (such as the `ViCA-7B-Thinking` and `ViCA2-7B-Thinking` variants) to generate explicit intermediate reasoning steps ("Thoughts") before providing a final answer, fostering more interpretable AI.
  - Generated using a sophisticated pipeline involving Gemini 2.5 Pro for high-quality "Thought" generation.
  - Details & Access: ViCA-Thinking-2.68K on Hugging Face Datasets
These datasets are crucial components of our research, enabling targeted training for enhanced spatial intelligence and explicit reasoning. We encourage researchers to explore and utilize them for advancing visuospatial understanding in MLLMs.
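As a quick-start sketch (not an official loader), the datasets can be pulled from the Hub with the `datasets` library, assuming their file layout is directly parseable; otherwise clone them with git-lfs as shown in the Setup section. The repository id comes from the dataset page linked above; split and field names should be checked on the dataset card.

```python
from datasets import load_dataset

# Load ViCA-322K from the Hugging Face Hub (repo id taken from the dataset page above).
vica_322k = load_dataset("nkkbr/ViCA-322K")
print(vica_322k)  # available splits and their sizes

# Peek at one example from the first split to inspect its fields.
first_split = next(iter(vica_322k))
print(vica_322k[first_split][0])
```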
ViCA2 is trained in multiple stages, building upon the lmms-lab/LLaVA-Video-7B-Qwen2 checkpoint:
- Stage 1: Hiera Projector Warm-up (Alignment)
  - Dataset: `liuhaotian/LLaVA-CC3M-Pretrain-595K` (image-text pairs).
  - Trainable: Only the randomly initialized Hiera projector.
  - Goal: Establish coarse alignment between Hiera features and the language model's input space.
- Stage 2: Full Model Re-alignment
  - Dataset: A subset (10%) of `lmms-lab/LLaVA-OneVision-Data`.
  - Trainable: SigLIP, both vision projectors, and the language model. Hiera backbone frozen.
  - Goal: Re-align the Hiera pathway with SigLIP and the LLM after projector initialization, restoring general multimodal capabilities.
- Stage 3: Targeted Spatial Fine-tuning
  - Dataset: Our ViCA-322K.
  - Trainable: SigLIP, both vision projectors, and the language model. Hiera backbone frozen.
  - Goal: Enhance fine-grained spatial perception and reasoning abilities.
DeepSpeed ZeRO-2 and ZeRO-3 are used for efficient training.
(For ViCA2-7B-Thinking, an additional Stage 4 fine-tunes on ViCA-Thinking-2.68K data.)
| | Stage 1 | Stage 2 | Stage 3 | Thinking |
|---|---|---|---|---|
| Dataset | LLaVA-CC3M-Pretrain-595K | LLaVA-OneVision-Data | ViCA-322K | ViCA-Thinking-2.68K |
| Training Data Samples | 595,375 | 279,353 | 322,003 | 2,680 |
| Trainable Module | Hiera Projection Layer | Full Model (excluding Hiera Module) | Full Model (excluding Hiera Module) | Full Model (excluding Hiera Module) |
| Trainable Parameters | 16M | 8.04B | 8.04B | 8.04B |
| Learning Rate | 1e-3 | 1e-5 | 1e-5 | 1e-5 |
| Epochs | 1 | 1 | 1 | 1 |
| DeepSpeed Stage | ZeRO-2 | ZeRO-3 | ZeRO-3 | ZeRO-3 |
Table 1: Training configuration across all four stages of our hierarchical vision-language model. Stage 1 pretrains only the Hiera projection layer, while stages 2–4 fine-tune the full model excluding the Hiera module.
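As a hedged illustration of the trainable-module split in Table 1 (not the actual training scripts), the helper below shows the standard PyTorch pattern for freezing the Hiera backbone in every stage while training only the Hiera projector in Stage 1 and everything else in Stages 2-4; the parameter-name substrings are assumptions.

```python
import torch.nn as nn

def configure_stage(model: nn.Module, stage: int) -> None:
    """Set requires_grad per parameter, mirroring Table 1 (module names are illustrative)."""
    for name, param in model.named_parameters():
        if "hiera_backbone" in name:
            # The Hiera spatial encoder stays frozen in all stages.
            param.requires_grad = False
        elif stage == 1:
            # Stage 1: warm up only the randomly initialized Hiera projector.
            param.requires_grad = "hiera_projector" in name
        else:
            # Stages 2-4: SigLIP, both projectors, and the language model are trainable.
            param.requires_grad = True
```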
ViCA2-7B sets a new state-of-the-art on the VSI-Bench, a challenging benchmark for visuospatial intelligence in videos.
| Method | Average | Obj. Count | Abs. Dist. | Obj. Size | Room Size | Rel. Dist. | Rel. Dir. | Route Plan | Appr. Order |
|---|---|---|---|---|---|---|---|---|---|
| Proprietary Models (API) | |||||||||
| GPT-4o | 34.0 | 46.2 | 5.3 | 43.8 | 38.2 | 37.0 | 41.3 | 31.5 | 28.5 |
| Gemini-1.5 Flash | 42.1 | 49.8 | 30.8 | 53.5 | 54.4 | 37.7 | 41.0 | 31.5 | 37.8 |
| Gemini-1.5 Pro | 45.4 | 56.2 | 30.9 | 64.1 | 43.6 | 51.3 | 46.3 | 36.0 | 34.6 |
| Open-source Models | |||||||||
| InternVL2-8B | 34.6 | 23.1 | 28.7 | 48.2 | 39.8 | 36.7 | 30.7 | 29.9 | 39.6 |
| InternVL2-40B | 36.0 | 34.9 | 26.9 | 46.5 | 31.8 | 42.1 | 32.2 | 34.0 | 39.6 |
| VILA-1.5-8B | 28.9 | 17.4 | 21.8 | 50.3 | 18.8 | 32.1 | 34.8 | 31.0 | 24.8 |
| VILA-1.5-40B | 31.2 | 22.4 | 24.8 | 48.7 | 22.7 | 40.5 | 25.7 | 31.5 | 32.9 |
| LLaVA-NeXT-Video-7B | 35.6 | 48.5 | 14.0 | 47.8 | 24.2 | 43.5 | 42.4 | 34.0 | 30.6 |
| LLaVA-NeXT-Video-72B | 40.9 | 48.9 | 22.8 | 57.4 | 35.3 | 42.4 | 36.7 | 35.0 | 48.6 |
| LLaVA-OneVision-7B | 32.4 | 47.7 | 20.2 | 47.4 | 12.3 | 42.5 | 35.2 | 29.4 | 24.4 |
| LLaVA-OneVision-72B | 40.2 | 43.5 | 23.9 | 57.6 | 37.5 | 42.5 | 39.9 | 32.5 | 44.6 |
| ViCA2-7B (ours) | 56.8(+11.4) | 65.7(+9.5) | 51.0(+20.1) | 75.5(+11.4) | 71.4(+17.0) | 51.6(+0.3) | 34.6 | 38.1(+2.1) | 66.5(+17.9) |
Table 2: Comparison of different models on VSI-Bench. Obj. Count, Abs. Dist., Obj. Size, and Room Size are numerical-answer tasks; Rel. Dist., Rel. Dir., Route Plan, and Appr. Order are multiple-choice tasks. Our ViCA2-7B achieves the best performance on most metrics.
Figure 3: Performance of ViCA2-7B on VSI-Bench scales consistently with the percentage of the ViCA-322K dataset used for training.
Our experiments demonstrate a clear and positive correlation between the amount of training data from ViCA-322K and the performance of ViCA2-7B on the VSI-Bench. As shown in the figure above, model accuracy improves consistently as the training data percentage increases.
Notably, ViCA2's performance continues to show improvement even when trained on the entirety of the ViCA-322K dataset (322,003 samples). This upward trend, particularly in the later stages of data scaling (e.g., from 95% to 100% of the data), suggests that ViCA2 has not yet reached its performance ceiling with the current dataset size.
Given the increased architectural complexity of ViCA2—specifically its dual-encoder design and token balancing mechanism—it possesses a strong representational capacity. We posit that this architecture is capable of leveraging even larger and more diverse spatial instruction datasets. Therefore, while ViCA2 achieves state-of-the-art results, its full potential is likely yet to be unlocked. Further scaling of high-quality, spatially-grounded training data could lead to significant additional performance gains, pushing the boundaries of visuospatial intelligence in MLLMs of this scale. This underscores a promising direction for future research and dataset curation efforts.
To facilitate reproducibility, further research, and community contributions, we provide access to various checkpoints from the ViCA2 training pipeline. These allow for a deeper understanding of the model's development and can serve as starting points for custom fine-tuning or analysis.
The following model is a variant fine-tuned for explicit reasoning:
| Checkpoint Name | Description | Hugging Face Link |
|---|---|---|
| ViCA2-7B-Thinking | ViCA2-7B further fine-tuned on ViCA-Thinking-2.68K for explicit reasoning capabilities. | |
We believe that releasing these intermediate and final checkpoints will be valuable for the community, enabling researchers to build upon our work, explore different stages of multimodal alignment and spatial tuning, and contribute to the ongoing advancement of visuospatial intelligence in MLLMs.
Furthermore, to facilitate research and reproducibility, we are also publicly releasing checkpoints for our ViCA2-7B and ViCA2-7B-Thinking models. These checkpoints were trained with incremental 5% portions of the ViCA-322K dataset, as detailed in the table below.
| ViCA-322K Data (%) | ViCA2-7B Checkpoint | ViCA2-7B-Thinking Checkpoint |
|---|---|---|
| 5% | | |
| 10% | | |
| 15% | | |
| 20% | | |
| 25% | | |
| 30% | | |
| 35% | | |
| 40% | | |
| 45% | | |
| 50% | | |
| 55% | | |
| 60% | | |
| 65% | | |
| 70% | | |
| 75% | | |
| 80% | | |
| 85% | | |
| 90% | | |
| 95% | | |
Our experiments were conducted using the following GPUs under CUDA 12.1:
- NVIDIA H100 SXM
- NVIDIA H200 SXM
- NVIDIA RTX A6000 (PCIe interface)
Due to hardware availability, different stages of training may have used different GPUs.
```bash
git clone https://github.com/nkkbr/ViCA.git
cd ViCA

conda create -n vica2 python=3.10 -y
conda activate vica2

# Install dependencies (with CUDA 12.1 support)
pip install --extra-index-url https://download.pytorch.org/whl/cu121 -e .

# FlashAttention is required and may need to be installed separately
pip install flash-attn==2.5.7
```

You may need to install `git-lfs` before downloading large files:

```bash
# Uncomment and run the following if git-lfs is not already installed
# sudo apt-get update
# sudo apt-get install git-lfs
# git lfs install
```

Then download our datasets:

```bash
mkdir vica-data
cd vica-data

# Main dataset
git clone https://huggingface.co/datasets/nkkbr/ViCA-322K

# Dataset used in spatial reasoning tasks
git clone https://huggingface.co/datasets/nkkbr/ViCA-thinking-2.68k
```

For Stage 2 of training, you'll also need:

```bash
git clone https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data
```

Please make sure to update file paths in the relevant scripts under `./scripts`.
Note: ViCA and ViCA2 use different model architectures. Please make sure to use the corresponding code for inference.
If needed, please execute the following commands beforehand:

```bash
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=0
conda activate vica2
```

```python
from vica2.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX
from llava.conversation import conv_templates, SeparatorStyle
from PIL import Image
import requests
import copy
import torch
import sys
import warnings
from decord import VideoReader, cpu
import numpy as np
warnings.filterwarnings("ignore")
def load_video(video_path, max_frames_num,fps=1,force_sample=False):
if max_frames_num == 0:
return np.zeros((1, 336, 336, 3))
vr = VideoReader(video_path, ctx=cpu(0),num_threads=1)
total_frame_num = len(vr)
video_time = total_frame_num / vr.get_avg_fps()
fps = round(vr.get_avg_fps()/fps)
frame_idx = [i for i in range(0, len(vr), fps)]
frame_time = [i/fps for i in frame_idx]
if len(frame_idx) > max_frames_num or force_sample:
sample_fps = max_frames_num
uniform_sampled_frames = np.linspace(0, total_frame_num - 1, sample_fps, dtype=int)
frame_idx = uniform_sampled_frames.tolist()
frame_time = [i/vr.get_avg_fps() for i in frame_idx]
frame_time = ",".join([f"{i:.2f}s" for i in frame_time])
spare_frames = vr.get_batch(frame_idx).asnumpy()
# import pdb;pdb.set_trace()
return spare_frames,frame_time,video_time
pretrained = "nkkbr/ViCA2"
model_name = "vica_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, image_processor_for_sam, max_length = load_pretrained_model(pretrained, None, model_name, torch_dtype="bfloat16", device_map=device_map) # Add any other thing you want to pass in llava_model_args
model.eval()
from datasets import load_dataset
vsi_bench = load_dataset("nyu-visionx/VSI-Bench")
vsi_bench = vsi_bench['test']
data_curr = vsi_bench[90]
video_path = f"[VIDEO PATH]"
max_frames_num = 64
video,frame_time,video_time = load_video(video_path, max_frames_num, 1, force_sample=True)
video1= image_processor.preprocess(video, return_tensors="pt")["pixel_values"].cuda().bfloat16()
video1 = [video1]
video2 = image_processor_for_sam.preprocess(video, return_tensors="pt")["pixel_values"].cuda().bfloat16()
video2 = [video2]
conv_template = "qwen_1_5"
# time_instruction = f"The video lasts for {video_time:.2f} seconds, and {len(video[0])} frames are uniformly sampled from it. These frames are located at {frame_time}. Please answer the following questions related to this video."
time_instruction = ""
question = DEFAULT_IMAGE_TOKEN + f"\n{time_instruction}\n\n"
question += f"These are frames of a video.\n\n"
question += f"Question: {data_curr['question']}\n"
if data_curr['options'] is not None:
question += '\n'.join(data_curr['options']) + "\n"
question += f"Answer with the option’s letter from the given choices directly.\n"
else:
question += f"Please answer the question using a single word or phrase.\n"
print(f"Prompt:\n{question}")
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()
input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
cont = model.generate(
input_ids,
images=video1,
images_for_sam=video2,
modalities= ["video"],
do_sample=False,
temperature=0,
max_new_tokens=1024,
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)[0].strip()
print(repr(text_outputs))
```

ViCA is built upon the codebase of LLaVA-NeXT. Since our repository includes the necessary components from LLaVA-NeXT, you can directly use our repo to run inference with ViCA.

```python
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX
from llava.conversation import conv_templates, SeparatorStyle
from PIL import Image
import requests
import copy
import torch
import sys
import warnings
from decord import VideoReader, cpu
import numpy as np
import json
from tqdm import tqdm
import os
warnings.filterwarnings("ignore")
def load_video(video_path, max_frames_num,fps=1,force_sample=False):
if max_frames_num == 0:
return np.zeros((1, 336, 336, 3))
vr = VideoReader(video_path, ctx=cpu(0),num_threads=1)
total_frame_num = len(vr)
video_time = total_frame_num / vr.get_avg_fps()
fps = round(vr.get_avg_fps()/fps)
frame_idx = [i for i in range(0, len(vr), fps)]
frame_time = [i/fps for i in frame_idx]
if len(frame_idx) > max_frames_num or force_sample:
sample_fps = max_frames_num
uniform_sampled_frames = np.linspace(0, total_frame_num - 1, sample_fps, dtype=int)
frame_idx = uniform_sampled_frames.tolist()
frame_time = [i/vr.get_avg_fps() for i in frame_idx]
frame_time = ",".join([f"{i:.2f}s" for i in frame_time])
spare_frames = vr.get_batch(frame_idx).asnumpy()
# import pdb;pdb.set_trace()
return spare_frames,frame_time,video_time
pretrained = 'nkkbr/ViCA'
model_name = "llava_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, torch_dtype="bfloat16", device_map=device_map) # Add any other thing you want to pass in llava_model_args
model.eval()
from datasets import load_dataset
vsi_bench = load_dataset("nyu-visionx/VSI-Bench")
vsi_bench = vsi_bench['test']
data_curr = vsi_bench[1000]
video_path = f"[VIDEO PATH]"
max_frames_num = 64
video,frame_time,video_time = load_video(video_path, max_frames_num, 1, force_sample=True)
video = image_processor.preprocess(video, return_tensors="pt")["pixel_values"].cuda().to(torch.bfloat16)
video = [video]
conv_template = "qwen_1_5"
# time_instruction = f"The video lasts for {video_time:.2f} seconds, and {len(video[0])} frames are uniformly sampled from it. These frames are located at {frame_time}. Please answer the following questions related to this video."
time_instruction = ""
question = DEFAULT_IMAGE_TOKEN + f"\n{time_instruction}\n\n"
question += f"These are frames of a video.\n\n"
question += f"Question: {data_curr['question']}\n"
if data_curr['options'] is not None:
question += '\n'.join(data_curr['options']) + "\n"
question += f"Answer with the option’s letter from the given choices directly.\n"
else:
question += f"Please answer the question using a single word or phrase.\n"
print(f"Prompt:\n{question}")
conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()
input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
cont = model.generate(
input_ids,
images=video,
modalities= ["video"],
do_sample=False,
temperature=0,
max_new_tokens=1024,
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)[0].strip()
print(repr(text_outputs))
```

We also explored a ViCA2-7B-Thinking model, fine-tuned on the ViCA-Thinking-2.68K dataset. This variant is designed to produce an intermediate "Thoughts" segment before the final answer, aiming to make its reasoning process more explicit.
- Observation: While ViCA2-7B-Thinking can generate these reasoning steps, its performance on VSI-Bench (numerical/multiple-choice accuracy) is lower than the standard ViCA2-7B. This aligns with findings from the original ViCA paper.
- Hypothesis: Generating coherent "Thoughts" demands higher general text generation capabilities. The specialized Stage 3 fine-tuning on ViCA-322K (focused on structured QA) might have inadvertently shifted the model towards more egocentric and templated descriptions, impacting its open-ended generation fluency required for high-quality "Thoughts."
- Potential Solution: Joint training on general vision-language data (like LLaVA-OneVision-Data) and spatial data (ViCA-322K) from earlier stages, rather than sequential fine-tuning, might better preserve general generation quality while instilling spatial reasoning.
- Dataset Coverage Gaps: While ViCA-322K is extensive, it has limitations. For example, ViCA2's performance on "Relative Direction" tasks is relatively lower, likely due to fewer explicit training examples for this specific concept in ViCA-322K.
- Frozen Hiera Backbone: Due to computational constraints, the Hiera vision encoder backbone was kept frozen during the main fine-tuning stages (Stages 2 and 3). Fine-tuning Hiera might unlock further performance gains.
- Domain Specificity: ViCA-322K predominantly features indoor scenes. Generalization to complex outdoor environments or other specialized visual domains might require additional domain-specific fine-tuning.
- Fusion Mechanism: The current architecture uses simple concatenation to fuse features from the dual encoders. More sophisticated, learnable fusion mechanisms could offer improvements.
- ViCA2-Thinking Performance: The "Thinking" variant currently trades off structured QA accuracy for explicit reasoning steps, indicating a need for refined training strategies for this capability.
- Expanding ViCA-322K with more diverse spatial reasoning tasks (e.g., complex relative directions, route planning in unseen layouts).
- Exploring more sophisticated fusion mechanisms for the dual-encoder outputs.
- Investigating the scalability of ViCA2 to larger language models.
- Fine-tuning the Hiera backbone for deeper adaptation.
- Extending ViCA2's application to embodied AI tasks like navigation and robotic interaction.
If you find our work helpful, please consider citing the following papers:
```bibtex
@misc{feng2025vica2,
  title={Towards Visuospatial Cognition via Hierarchical Fusion of Visual Experts},
  author={Feng, Qi},
  publisher={arXiv:2505.12363},
  year={2025},
}

@misc{feng2025vica,
  title={Visuospatial Cognitive Assistant},
  author={Feng, Qi},
  publisher={arXiv:2505.12312},
  year={2025},
}
```