---
license: apache-2.0
tags:
- multimodal
- vision-language
- video understanding
- visuospatial cognition
- spatial reasoning
- vlm
- llava
- qwen
- siglip
- hiera
- sam2
- dual-encoder
datasets:
- liuhaotian/LLaVA-CC3M-Pretrain-595K
- lmms-lab/LLaVA-OneVision-Data
- nkkbr/ViCA-322K
language:
- en
library_name: transformers
pipeline_tag: video-text-to-text
model_name: ViCA2-7B
model_description: |
  ViCA2 (Visuospatial Cognitive Assistant 2) is a state-of-the-art large multimodal model tailored for fine-grained visuospatial reasoning in indoor video and image environments. It builds upon the LLaVA-OneVision framework and introduces a novel dual vision encoder architecture that integrates:

  - **SigLIP** for high-level semantic abstraction, and
  - **Hiera** (from SAM2) for detailed spatial structure modeling.

  This dual-stream design enables robust performance in tasks involving object layouts, relative positioning, temporal order, and geometric reasoning. Trained with a multi-stage strategy on over **322K video-based QA pairs**, ViCA2 significantly surpasses LLaVA-NeXT-Video and Gemini-1.5 Pro.

  ViCA2 is built with modularity and efficiency in mind, leveraging:

  - Token ratio control for balancing semantic and spatial token contributions
  - Hiera stage-specific sampling and projection
  - Multi-stage DeepSpeed fine-tuning with frozen vision backbones
model-index:
- name: ViCA2-7B
  results:
  - task:
      type: visual-question-answering
    dataset:
      name: VSI-Bench
      type: vsi-bench
    metrics:
    - type: score
      value: 56.81
      name: Average
      verified: false
    - type: MRA
      value: 65.73
      name: Object Count
    - type: MRA
      value: 50.98
      name: Absolute Distance
    - type: MRA
      value: 75.54
      name: Object Size
    - type: MRA
      value: 71.42
      name: Room Size
    - type: accuracy
      value: 51.55
      name: Relative Distance
    - type: accuracy
      value: 34.61
      name: Relative Direction
    - type: accuracy
      value: 38.14
      name: Route Plan
    - type: accuracy
      value: 66.50
      name: Appearance Order
---

## Usage and Full Documentation

For the detailed model description, training setup, datasets, evaluation results, and inference code, **please refer to the following link**:

[![GitHub](https://img.shields.io/badge/GitHub-ViCA2-181717?logo=github&logoColor=white)](https://github.com/nkkbr/ViCA)

> You may also be interested in our other project, the original **ViCA**. Please refer to the following link:
> [![Hugging Face Models](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-ViCA-blue)](https://huggingface.co/nkkbr/ViCA)

## Installation

```bash
git clone https://github.com/nkkbr/ViCA.git
cd ViCA

conda create -n vica2 python=3.10 -y
conda activate vica2

# Install dependencies (with CUDA 12.1 support)
pip install --extra-index-url https://download.pytorch.org/whl/cu121 -e .

# FlashAttention is required and may need to be installed separately
pip install flash-attn==2.5.7
```

## Inference

*Here is a runnable example using ViCA2-7B on a VSI-Bench question.*

> **Note**: ViCA and ViCA2 use different model architectures. Please make sure to use the corresponding code for inference.
```python
# This inference script is adapted from:
# https://huggingface.co/lmms-lab/LLaVA-Video-7B-Qwen2

from vica2.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX
from llava.conversation import conv_templates, SeparatorStyle
from PIL import Image
import requests
import copy
import torch
import sys
import warnings
from decord import VideoReader, cpu
import numpy as np

warnings.filterwarnings("ignore")


def load_video(video_path, max_frames_num, fps=1, force_sample=False):
    """Uniformly sample up to `max_frames_num` frames from a video file."""
    if max_frames_num == 0:
        return np.zeros((1, 336, 336, 3))
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    total_frame_num = len(vr)
    video_time = total_frame_num / vr.get_avg_fps()
    fps = round(vr.get_avg_fps() / fps)
    frame_idx = [i for i in range(0, len(vr), fps)]
    frame_time = [i / fps for i in frame_idx]
    if len(frame_idx) > max_frames_num or force_sample:
        sample_fps = max_frames_num
        uniform_sampled_frames = np.linspace(0, total_frame_num - 1, sample_fps, dtype=int)
        frame_idx = uniform_sampled_frames.tolist()
        frame_time = [i / vr.get_avg_fps() for i in frame_idx]
    frame_time = ",".join([f"{i:.2f}s" for i in frame_time])
    spare_frames = vr.get_batch(frame_idx).asnumpy()
    return spare_frames, frame_time, video_time


# Load ViCA2. The builder returns two image processors: one for the SigLIP stream
# and one for the SAM2 (Hiera) stream.
pretrained = "nkkbr/ViCA2-stage2-onevision-ft"
model_name = "vica_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, image_processor_for_sam, max_length = load_pretrained_model(
    pretrained, None, model_name, torch_dtype="bfloat16", device_map=device_map
)
model.eval()

# Pick one VSI-Bench question.
from datasets import load_dataset

vsi_bench = load_dataset("nyu-visionx/VSI-Bench")
vsi_bench = vsi_bench['test']
data_curr = vsi_bench[90]

# Sample frames and preprocess them twice: video1 feeds the SigLIP encoder,
# video2 feeds the SAM2/Hiera encoder.
video_path = f"[VIDEO PATH]"
max_frames_num = 64
video, frame_time, video_time = load_video(video_path, max_frames_num, 1, force_sample=True)
video1 = image_processor.preprocess(video, return_tensors="pt")["pixel_values"].cuda().bfloat16()
video1 = [video1]
video2 = image_processor_for_sam.preprocess(video, return_tensors="pt")["pixel_values"].cuda().bfloat16()
video2 = [video2]

# Build the prompt in the Qwen conversation template.
conv_template = "qwen_1_5"
# time_instruction = f"The video lasts for {video_time:.2f} seconds, and {len(video[0])} frames are uniformly sampled from it. These frames are located at {frame_time}. Please answer the following questions related to this video."
time_instruction = ""
question = DEFAULT_IMAGE_TOKEN + f"\n{time_instruction}\n\n"
question += f"These are frames of a video.\n\n"
question += f"Question: {data_curr['question']}\n"
if data_curr['options'] is not None:
    question += '\n'.join(data_curr['options']) + "\n"
    question += f"Answer with the option’s letter from the given choices directly.\n"
else:
    question += f"Please answer the question using a single word or phrase.\n"
print(f"Prompt:\n{question}")

conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()

input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)

# Greedy decoding; the SAM2 stream is passed via `images_for_sam`.
cont = model.generate(
    input_ids,
    images=video1,
    images_for_sam=video2,
    modalities=["video"],
    do_sample=False,
    temperature=0,
    max_new_tokens=1024,
)

text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)[0].strip()
print(repr(text_outputs))
```
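
The scores in the model-index above are exact-match accuracy for multiple-choice questions and Mean Relative Accuracy (MRA) for numerical ones. The snippet below is a minimal, illustrative sketch of how a single prediction could be scored against the VSI-Bench ground truth, not the official evaluation code; it assumes the dataset exposes `options` and `ground_truth` fields and that MRA averages thresholded relative accuracy over thresholds {0.50, 0.55, ..., 0.95}. Adjust the field names if the actual schema differs.

```python
import re

# Illustrative scoring sketch (not the official VSI-Bench evaluation code).
# Assumptions: "options"/"ground_truth" field names and the MRA threshold set.

def mra(pred: float, gt: float, thresholds=tuple(0.5 + 0.05 * i for i in range(10))) -> float:
    """Mean Relative Accuracy: fraction of thresholds theta for which the
    relative error |pred - gt| / |gt| stays below 1 - theta."""
    rel_err = abs(pred - gt) / abs(gt)
    return sum(rel_err < 1 - t for t in thresholds) / len(thresholds)

def score_prediction(prediction: str, example: dict) -> float:
    if example["options"] is not None:
        # Multiple-choice: compare the first option letter found in the output
        # with the ground-truth letter (exact match, 0 or 1).
        match = re.search(r"\b[A-D]\b", prediction.upper())
        pred_letter = match.group(0) if match else ""
        return float(pred_letter == str(example["ground_truth"]).strip().upper())
    # Numerical question: extract the first number and score it with MRA.
    match = re.search(r"-?\d+\.?\d*", prediction)
    if match is None:
        return 0.0
    return mra(float(match.group(0)), float(example["ground_truth"]))

# Usage with the variables from the inference script above:
# print(score_prediction(text_outputs, data_curr))
```

For the full evaluation setup and complete benchmark results, please refer to the GitHub repository linked above.

---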