---
license: cc-by-nc-4.0
library_name: transformers
pipeline_tag: video-text-to-text
---

# Slow-Fast Architecture for Video Multi-Modal Large Language Models

This repository contains the model presented in the paper [Slow-Fast Architecture for Video Multi-Modal Large Language Models](https://huggingface.co/papers/2504.01328).

Code: https://github.com/SHI-Labs/Slow-Fast-Video-Multimodal-LLM

## Introduction

This model uses a novel slow-fast architecture to balance temporal resolution and spatial detail in video understanding, overcoming the sequence-length limitations of traditional LLMs. It employs a dual-token strategy: compact "fast" tokens give the LLM a quick overview of the whole video, while "slow" tokens support instruction-aware extraction of spatial detail via cross-attention.
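The sketch below illustrates this dual-token idea at a high level. It is only a conceptual illustration: the module names, token counts, and dimensions (`vis_dim`, `llm_dim`, `fast_tokens_per_frame`) are assumptions made for the example, not the released architecture; refer to the paper and the GitHub repository for the actual design.

```python
# Conceptual sketch only: layer names, token counts, and dimensions are
# illustrative assumptions, not the released implementation.
import torch
import torch.nn as nn


class SlowFastSketch(nn.Module):
    def __init__(self, vis_dim=1024, llm_dim=3584, num_heads=8, fast_tokens_per_frame=4):
        super().__init__()
        # Fast pathway: aggressively pool each frame's patch features into a few
        # "fast" tokens that are fed directly into the LLM input sequence.
        self.fast_pool = nn.AdaptiveAvgPool1d(fast_tokens_per_frame)
        self.fast_proj = nn.Linear(vis_dim, llm_dim)
        # Slow pathway: keep high-resolution frame features and let the LLM query
        # them with cross-attention, so detail extraction is instruction-aware.
        self.slow_proj = nn.Linear(vis_dim, llm_dim)
        self.cross_attn = nn.MultiheadAttention(llm_dim, num_heads, batch_first=True)

    def forward(self, frame_feats, text_hidden):
        # frame_feats: (B, T, N, vis_dim) patch features for T sampled frames
        # text_hidden: (B, L, llm_dim) hidden states of the instruction tokens
        B, T, N, C = frame_feats.shape
        # Fast tokens: (B, T * fast_tokens_per_frame, llm_dim), a compact overview
        fast = self.fast_pool(frame_feats.reshape(B * T, N, C).transpose(1, 2))
        fast = self.fast_proj(fast.transpose(1, 2)).reshape(B, -1, self.fast_proj.out_features)
        # Slow tokens: full spatial resolution, queried by the instruction tokens
        slow = self.slow_proj(frame_feats.reshape(B, T * N, C))
        detail, _ = self.cross_attn(query=text_hidden, key=slow, value=slow)
        # Fast tokens join the LLM input; cross-attended detail refines text states
        return fast, text_hidden + detail
```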
## Usage

```python
import os

import numpy as np
import torch
from decord import VideoReader, cpu

from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
from llava.conversation import conv_templates, SeparatorStyle
from llava.model.builder import load_pretrained_model
from llava.mm_utils import tokenizer_image_token, get_model_name_from_path
from llava.utils import disable_torch_init


def load_video(video_path, max_frames_num):
    """Uniformly sample `max_frames_num` frames from the video."""
    vr = VideoReader(video_path, num_threads=4)
    frame_idx = np.linspace(0, len(vr) - 1, max_frames_num, dtype=int).tolist()
    return vr.get_batch(frame_idx).asnumpy()


# Model
# Ensure you have cloned the code repository:
# git clone https://github.com/SHI-Labs/Slow-Fast-Video-Multimodal-LLM.git
model_path = "shi-labs/slowfast-video-mllm-qwen2-7b-convnext-576-frame64-s1t4"  # or another checkpoint
video_path = "Slow-Fast-Video-Multimodal-LLM/assets/catinterrupt.mp4"  # example video from the cloned repo
question = "Please describe this video in detail."
max_frames = 64  # set according to the specific checkpoint

disable_torch_init()
model_path = os.path.expanduser(model_path)
model_name = get_model_name_from_path(model_path)
# Make sure to pass trust_remote_code=True
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path, None, model_name, use_flash_attn=True, trust_remote_code=True
)

# Build the prompt
if model.config.mm_use_im_start_end:
    prompt = DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN + DEFAULT_IM_END_TOKEN + " " + question
else:
    prompt = DEFAULT_IMAGE_TOKEN + " " + question

conv = conv_templates["qwen_1_5"].copy()
conv.append_message(conv.roles[0], prompt)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

# Read and preprocess the video
video = load_video(video_path, max_frames_num=max_frames)
video_tensor = image_processor.preprocess(video, return_tensors="pt")["pixel_values"].half().cuda()
videos = [video_tensor]

# Tokenize the prompt and run generation
input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt")
input_ids = input_ids.to(device="cuda", non_blocking=True).unsqueeze(dim=0)

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=videos,
        do_sample=True,
        max_new_tokens=1024,
        num_beams=1,
        temperature=0.2,
        top_p=1.0,
        use_cache=True,
    )

outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(f"User input: {question}")
print(outputs)
```

## Citation

```bibtex
@misc{wang2025slowfast,
      title={Slow-Fast Architecture for Video Multi-Modal Large Language Models},
      author={Haotian Wang and Zhengyuan Yang and Yue Zhao and Bin Lin and Zhe Chen and Yue Cao and Hongxia Yang},
      year={2025},
      eprint={2504.01328},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2504.01328v1},
}
```
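As a side note, the `load_video` helper in the usage example samples frames uniformly across the clip. If you prefer sampling at roughly one frame per second and only falling back to uniform sampling for long clips, a variant along the following lines could be used. This is a sketch combining the two sampling strategies hinted at in the original snippet, not part of the official repository.

```python
import numpy as np
from decord import VideoReader


def load_video_fps(video_path, max_frames_num, target_fps=1):
    """Sample roughly `target_fps` frames per second, capped at `max_frames_num` frames.

    Sketch only: the helper in the usage example uses plain uniform sampling.
    """
    vr = VideoReader(video_path, num_threads=4)
    step = max(1, round(vr.get_avg_fps() / target_fps))
    frame_idx = list(range(0, len(vr), step))
    if len(frame_idx) > max_frames_num:
        # Too many frames for the checkpoint's budget: fall back to uniform sampling.
        frame_idx = np.linspace(0, len(vr) - 1, max_frames_num, dtype=int).tolist()
    return vr.get_batch(frame_idx).asnumpy()
```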