---
license: apache-2.0
tags:
  - multimodal
  - vision-language
  - video understanding
  - visuospatial cognition
  - spatial reasoning
  - vlm
  - llava
  - qwen
  - siglip
  - hiera
  - sam2
  - dual-encoder
datasets:
  - liuhaotian/LLaVA-CC3M-Pretrain-595K
  - lmms-lab/LLaVA-OneVision-Data
  - nkkbr/ViCA-322K
  - nkkbr/ViCA-thinking-2.68k
language:
  - en
library_name: transformers
pipeline_tag: video-text-to-text
model_name: ViCA2-7B
model_description: >
  ViCA2 (Visuospatial Cognitive Assistant 2) is a state-of-the-art large
  multimodal model tailored for fine-grained visuospatial reasoning in indoor
  video and image environments.

  It builds upon the LLaVA-OneVision framework and introduces a novel dual
  vision encoder architecture that integrates:
    - **SigLIP** for high-level semantic abstraction, and
    - **Hiera** (from SAM2) for detailed spatial structure modeling.

  This dual-stream design enables robust performance in tasks involving object
  layouts, relative positioning, temporal order, and geometric reasoning.

  Trained with a multi-stage strategy on over **322K video-based QA pairs**,
  ViCA2 significantly surpasses LLaVA-NeXT-Video and Gemini-1.5 Pro.


  ViCA2 is built with modularity and efficiency in mind, leveraging:
    - Token ratio control for balancing semantic and spatial token contributions
    - Hiera stage-specific sampling and projection
    - Multi-stage DeepSpeed fine-tuning with frozen vision backbones
model-index:
  - name: ViCA2-7B
    results:
      - task:
          type: visual-question-answering
        dataset:
          name: VSI-Bench
          type: vsi-bench
        metrics:
          - type: score
            value: 56.81
            name: Average
            verified: false
          - type: MRA
            value: 65.73
            name: Object Count
          - type: MRA
            value: 50.98
            name: Absolute Distance
          - type: MRA
            value: 75.54
            name: Object Size
          - type: MRA
            value: 71.42
            name: Room Size
          - type: accuracy
            value: 51.55
            name: Relative Distance
          - type: accuracy
            value: 34.61
            name: Relative Direction
          - type: accuracy
            value: 38.14
            name: Route Plan
          - type: accuracy
            value: 66.5
            name: Appearance Order
---

Currently under editing.
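
The dual-encoder design summarized in the metadata (SigLIP for high-level semantics, Hiera from SAM2 for spatial structure, with token-ratio control) can be pictured as two token streams that are separately projected and then concatenated before reaching the language model. The sketch below is purely illustrative and is not the ViCA2 implementation; module names, dimensions, and the subsampling scheme are assumptions:

```python
import torch
import torch.nn as nn

class DualStreamFusion(nn.Module):
    """Toy illustration of a dual-encoder token path: project semantic (SigLIP-like)
    and spatial (Hiera-like) visual tokens into the LLM embedding space, keeping only
    a fraction of the spatial tokens. Dimensions are placeholders, not ViCA2's."""

    def __init__(self, sem_dim=1152, spa_dim=1024, llm_dim=3584, spatial_ratio=0.5):
        super().__init__()
        self.sem_proj = nn.Linear(sem_dim, llm_dim)  # projector for semantic tokens
        self.spa_proj = nn.Linear(spa_dim, llm_dim)  # projector for spatial tokens
        self.spatial_ratio = spatial_ratio           # fraction of spatial tokens to keep

    def forward(self, sem_tokens, spa_tokens):
        # sem_tokens: (B, N_sem, sem_dim); spa_tokens: (B, N_spa, spa_dim)
        keep = max(1, int(spa_tokens.shape[1] * self.spatial_ratio))
        idx = torch.linspace(0, spa_tokens.shape[1] - 1, keep).long()
        spa_kept = spa_tokens[:, idx]                # uniformly subsample spatial tokens
        return torch.cat([self.sem_proj(sem_tokens), self.spa_proj(spa_kept)], dim=1)

# e.g. fuse 196 semantic tokens with 256 spatial tokens into one visual sequence
fused = DualStreamFusion()(torch.randn(1, 196, 1152), torch.randn(1, 256, 1024))
print(fused.shape)  # torch.Size([1, 324, 3584]) with the default 0.5 ratio
```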

## Installation

```bash
git clone https://github.com/nkkbr/ViCA.git
cd ViCA

conda create -n vica2 python=3.10 -y
conda activate vica2

# Install dependencies (with CUDA 12.1 support)
pip install --extra-index-url https://download.pytorch.org/whl/cu121 -e .

# FlashAttention is required and may need to be installed separately
pip install flash-attn==2.5.7
```
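
A quick, optional sanity check of the environment (not part of the original instructions; it assumes a CUDA-capable GPU and that `flash-attn` built correctly):

```python
import torch
import flash_attn  # raises ImportError if the wheel failed to build/install

print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("flash-attn:", flash_attn.__version__)
```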

## Inference

Here is a runnable example using ViCA2-7B on a VSI-Bench question.

Note: ViCA and ViCA2 use different model architectures. Please make sure to use the corresponding code for inference.

```python
# This inference script is adapted from:
# https://huggingface.co/lmms-lab/LLaVA-Video-7B-Qwen2

from vica2.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IGNORE_INDEX
from llava.conversation import conv_templates, SeparatorStyle
from PIL import Image
import requests
import copy
import torch
import sys
import warnings
from decord import VideoReader, cpu
import numpy as np

warnings.filterwarnings("ignore")


def load_video(video_path, max_frames_num, fps=1, force_sample=False):
    """Sample frames at roughly `fps` frames/second; fall back to uniformly
    sampling `max_frames_num` frames when the video is too long (or forced)."""
    if max_frames_num == 0:
        return np.zeros((1, 336, 336, 3))
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    total_frame_num = len(vr)
    video_time = total_frame_num / vr.get_avg_fps()
    fps = round(vr.get_avg_fps() / fps)
    frame_idx = [i for i in range(0, len(vr), fps)]
    frame_time = [i / fps for i in frame_idx]
    if len(frame_idx) > max_frames_num or force_sample:
        sample_fps = max_frames_num
        uniform_sampled_frames = np.linspace(0, total_frame_num - 1, sample_fps, dtype=int)
        frame_idx = uniform_sampled_frames.tolist()
        frame_time = [i / vr.get_avg_fps() for i in frame_idx]
    frame_time = ",".join([f"{i:.2f}s" for i in frame_time])
    spare_frames = vr.get_batch(frame_idx).asnumpy()
    return spare_frames, frame_time, video_time

pretrained = "nkkbr/ViCA2-stage2-onevision-ft"
model_name = "vica_qwen"
device = "cuda"
device_map = "auto"
tokenizer, model, image_processor, image_processor_for_sam, max_length = load_pretrained_model(pretrained, None, model_name, torch_dtype="bfloat16", device_map=device_map)  
model.eval()


from datasets import load_dataset
vsi_bench = load_dataset("nyu-visionx/VSI-Bench")
vsi_bench = vsi_bench['test']

data_curr = vsi_bench[90]  # pick one sample from the benchmark for this demo

video_path = "[VIDEO PATH]"  # set this to the local path of the video for this sample
max_frames_num = 64
video, frame_time, video_time = load_video(video_path, max_frames_num, 1, force_sample=True)

# Preprocess the sampled frames twice: once for the SigLIP branch (`images`)
# and once for the Hiera/SAM2 branch (`images_for_sam`).
video1 = image_processor.preprocess(video, return_tensors="pt")["pixel_values"].cuda().bfloat16()
video1 = [video1]
video2 = image_processor_for_sam.preprocess(video, return_tensors="pt")["pixel_values"].cuda().bfloat16()
video2 = [video2]
conv_template = "qwen_1_5"
# time_instruction = f"The video lasts for {video_time:.2f} seconds, and {len(video[0])} frames are uniformly sampled from it. These frames are located at {frame_time}. Please answer the following questions related to this video."
time_instruction = ""
question = DEFAULT_IMAGE_TOKEN + f"\n{time_instruction}\n\n"
question += "These are frames of a video.\n\n"
question += f"Question: {data_curr['question']}\n"
if data_curr['options'] is not None:
    question += '\n'.join(data_curr['options']) + "\n"
    question += "Answer with the option's letter from the given choices directly.\n"
else:
    question += "Please answer the question using a single word or phrase.\n"
print(f"Prompt:\n{question}")

conv = copy.deepcopy(conv_templates[conv_template])
conv.append_message(conv.roles[0], question)
conv.append_message(conv.roles[1], None)
prompt_question = conv.get_prompt()
input_ids = tokenizer_image_token(prompt_question, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt").unsqueeze(0).to(device)
cont = model.generate(
    input_ids,
    images=video1,
    images_for_sam=video2,
    modalities=["video"],
    do_sample=False,
    temperature=0,
    max_new_tokens=1024,
)
text_outputs = tokenizer.batch_decode(cont, skip_special_tokens=True)[0].strip()
print(repr(text_outputs))
```
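
To evaluate more than one question, the per-sample steps above can be wrapped into a helper that reuses the model, tokenizer, and processors already loaded. A rough sketch (the function name and the commented example loop are illustrative, not part of the original script):

```python
@torch.inference_mode()
def answer_question(sample, video_path, max_frames_num=64):
    """Run ViCA2 on one VSI-Bench sample; mirrors the single-sample script above."""
    video, frame_time, video_time = load_video(video_path, max_frames_num, 1, force_sample=True)
    frames_siglip = [image_processor.preprocess(video, return_tensors="pt")["pixel_values"].cuda().bfloat16()]
    frames_sam = [image_processor_for_sam.preprocess(video, return_tensors="pt")["pixel_values"].cuda().bfloat16()]

    question = DEFAULT_IMAGE_TOKEN + "\n\nThese are frames of a video.\n\n"
    question += f"Question: {sample['question']}\n"
    if sample['options'] is not None:
        question += '\n'.join(sample['options']) + "\n"
        question += "Answer with the option's letter from the given choices directly.\n"
    else:
        question += "Please answer the question using a single word or phrase.\n"

    conv = copy.deepcopy(conv_templates["qwen_1_5"])
    conv.append_message(conv.roles[0], question)
    conv.append_message(conv.roles[1], None)
    input_ids = tokenizer_image_token(conv.get_prompt(), tokenizer, IMAGE_TOKEN_INDEX,
                                      return_tensors="pt").unsqueeze(0).to(device)
    out = model.generate(input_ids, images=frames_siglip, images_for_sam=frames_sam,
                         modalities=["video"], do_sample=False, temperature=0, max_new_tokens=1024)
    return tokenizer.batch_decode(out, skip_special_tokens=True)[0].strip()

# Example: loop over a few samples, supplying the matching local video for each one.
# for i in [90, 91, 92]:
#     print(i, answer_question(vsi_bench[i], "[VIDEO PATH]"))
```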