Add model card for Slow-Fast Video MLLM (Qwen2-7B, 64 Frames)

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +202 -3
README.md CHANGED
@@ -1,3 +1,202 @@
- ---
- license: cc-by-nc-4.0
- ---
---
license: cc-by-nc-4.0
library_name: transformers
pipeline_tag: video-text-to-text
tags:
- llava
- qwen2
- slow-fast
---

# Slow-Fast Architecture for Video Multi-Modal Large Language Models (Qwen2-7B, 64 Frames)

This repository contains the **Slow-Fast Video MLLM (Qwen2-7B, ConvNeXt-576, 64 frames, stride 1/4)** model presented in the paper [Slow-Fast Architecture for Video Multi-Modal Large Language Models](https://huggingface.co/papers/2504.01328).

[Code Repository](https://github.com/SHI-Labs/Slow-Fast-Video-Multimodal-LLM) | [HuggingFace Collection](https://huggingface.co/collections/shi-labs/slow-fast-video-mllm-67ef347a28772734c15a78b5)

## Model Description

This model uses a slow-fast architecture to balance temporal resolution and spatial detail in video multi-modal large language models (MLLMs) under a limited compute budget. Existing methods typically compress the video representation irreversibly and lose detail in the process.

Inspired by how humans first skim a video before focusing on the relevant parts, the slow-fast design uses a dual-token strategy:
1. **"Fast" visual tokens:** a compact set of compressed video features fed into the LLM (Qwen2-7B-Instruct) alongside the text embeddings to give a quick overview.
2. **"Slow" visual tokens:** uncompressed video features that the text embeddings cross-attend to through specially designed hybrid decoder layers, enabling instruction-aware extraction of relevant visual detail with linear complexity.

This design allows the model to process more input frames (64 for this checkpoint) while preserving spatial detail, which yields significant gains on video understanding benchmarks over self-attention-only baselines. This checkpoint pairs a Qwen2-7B-Instruct base LLM with a ConvNeXt-576 vision tower.
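To make the dual-token idea concrete, here is a minimal, self-contained sketch of a hybrid decoder layer. It is an illustration, not the repository's implementation: the dimensions, module layout, and use of plain PyTorch `MultiheadAttention` are assumptions, and norms and causal masking are omitted.

```python
# Conceptual sketch only (not the actual Slow-Fast implementation):
# self-attention runs over text + compressed "fast" tokens, while a
# cross-attention step lets those hidden states query the much longer,
# uncompressed "slow" token sequence.
import torch
import torch.nn as nn

class HybridDecoderLayerSketch(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, hidden, slow_tokens):
        # hidden:      [B, T_text + T_fast, D]  (text embeddings + compressed fast tokens)
        # slow_tokens: [B, T_slow, D]           (uncompressed video features, T_slow >> T_fast)
        hidden = hidden + self.self_attn(hidden, hidden, hidden, need_weights=False)[0]
        # text/fast hidden states query the full-resolution slow tokens
        hidden = hidden + self.cross_attn(hidden, slow_tokens, slow_tokens, need_weights=False)[0]
        return hidden + self.mlp(hidden)

B, D = 1, 1024
text = torch.randn(B, 32, D)           # text embeddings
fast = torch.randn(B, 64, D)           # heavily pooled "fast" video tokens (made-up size)
slow = torch.randn(B, 64 * 144, D)     # e.g. 64 frames x 144 patches, uncompressed (made-up size)
layer = HybridDecoderLayerSketch(D)
out = layer(torch.cat([text, fast], dim=1), slow)
print(out.shape)  # torch.Size([1, 96, 1024])
```

Because the slow tokens only ever appear as keys and values in cross-attention, the cost of attending to them grows linearly with their number rather than quadratically, which is what lets the model keep many uncompressed frames in view.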

<div align="center">
<img src="https://huggingface.co/shi-labs/slowfast-video-mllm-qwen2-7b-convnext-576-frame64-s1t4/resolve/main/assets/images/fig-teaser.png" width="45%">
</div>

## Usage

**Note:** This model relies on custom code (`LlavaQwenSlowFastForCausalLM`) that is not part of the standard `transformers` release. Either install the packages from the [official repository](https://github.com/SHI-Labs/Slow-Fast-Video-Multimodal-LLM) or pass `trust_remote_code=True` when loading the model.

First, clone the repository and install its requirements if you are running locally:
```bash
git clone https://github.com/SHI-Labs/Slow-Fast-Video-Multimodal-LLM.git
cd Slow-Fast-Video-Multimodal-LLM
pip install --upgrade pip
pip install -r requirements.txt
# Make the cloned repo importable, e.g. by adding it to PYTHONPATH
export PYTHONPATH=$(pwd):$PYTHONPATH
```
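Before running the full example, you can verify that the environment is ready with a quick check like the one below (the module names are the ones used by the repository and this card; adjust if your setup differs):

```python
# Minimal environment sanity check: confirms that the repo's `llava` package
# and the main dependencies are importable from the current PYTHONPATH.
import importlib.util

for module in ("llava", "torch", "decord", "transformers"):
    status = "ok" if importlib.util.find_spec(module) is not None else "MISSING"
    print(f"{module:12s} {status}")
```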

Then, use the following Python script:
```python
import os

import numpy as np
import requests  # used to download the demo video
import torch
from decord import VideoReader

# These helpers come from the cloned repository; make sure it is on your
# PYTHONPATH (see above). The model class itself is loaded via trust_remote_code.
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
from llava.conversation import conv_templates
from llava.model.builder import load_pretrained_model
from llava.mm_utils import tokenizer_image_token, get_model_name_from_path
from llava.utils import disable_torch_init


def load_video(video_path, max_frames_num):
    """Load `max_frames_num` frames from a video as a (T, H, W, C) uint8 array."""
    vr = VideoReader(video_path, num_threads=4)
    total_frames = len(vr)

    if total_frames >= max_frames_num:
        # Uniformly sample frames across the whole video
        uniform_sampled_frames = np.linspace(0, total_frames - 1, max_frames_num, dtype=int)
        frame_idx = uniform_sampled_frames.tolist()
    else:
        # If the video is shorter than max_frames_num, take every frame and repeat the last one
        frame_idx = list(range(total_frames))
        frame_idx.extend([total_frames - 1] * (max_frames_num - total_frames))

    try:
        spare_frames = vr.get_batch(frame_idx).asnumpy()
    except Exception as e:
        # Re-raise so the caller can decide how to handle an unreadable video
        print(f"Error loading video frames: {e}")
        raise

    return spare_frames


# Model configuration
model_path = "shi-labs/slowfast-video-mllm-qwen2-7b-convnext-576-frame64-s1t4"
video_url = "https://huggingface.co/shi-labs/slowfast-video-mllm-qwen2-7b-convnext-576-frame64-s1t4/resolve/main/assets/catinterrupt.mp4"
video_local_path = "catinterrupt.mp4"
question = "Please describe this video in detail."
max_frames = 64  # this checkpoint was trained with 64 frames

# Download the demo video if it is not present
if not os.path.exists(video_local_path):
    print(f"Downloading video from {video_url}...")
    response = requests.get(video_url, stream=True)
    response.raise_for_status()  # raise an exception for bad status codes
    with open(video_local_path, "wb") as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)
    print("Download complete.")

# Load the model and processor
disable_torch_init()
model_name = get_model_name_from_path(model_path)

# trust_remote_code=True loads the custom LlavaQwenSlowFastForCausalLM architecture
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path,
    None,
    model_name,
    use_flash_attn=True,         # use Flash Attention if available
    device_map="auto",           # distribute the model across available GPUs/CPU
    torch_dtype=torch.bfloat16,  # bfloat16 for efficiency
    trust_remote_code=True,
)

# Prepare the prompt
if model.config.mm_use_im_start_end:
    prompt = DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN + DEFAULT_IM_END_TOKEN + "\n" + question
else:
    prompt = DEFAULT_IMAGE_TOKEN + "\n" + question

conv = conv_templates["qwen_1_5"].copy()  # conversation template used by this checkpoint
conv.append_message(conv.roles[0], prompt)
conv.append_message(conv.roles[1], None)
prompt_final = conv.get_prompt()

# Load video frames
print("Loading video...")
video_frames = load_video(video_local_path, max_frames_num=max_frames)
print(f"Video loaded, shape: {video_frames.shape}")

# Preprocess video frames; the input must have shape (T, H, W, C)
print("Preprocessing video...")
video_tensor = image_processor.preprocess(video_frames, return_tensors="pt")["pixel_values"]
video_tensor = video_tensor.to(model.device, dtype=torch.bfloat16)
videos = [video_tensor]  # the model expects a list of video tensors
print(f"Video tensor processed, shape: {videos[0].shape}")

# Tokenize the prompt
input_ids = tokenizer_image_token(prompt_final, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt")
input_ids = input_ids.to(device=model.device, non_blocking=True)
# Add a batch dimension if the tokenizer returned a 1-D tensor
if input_ids.ndim == 1:
    input_ids = input_ids.unsqueeze(0)
print(f"Input IDs processed, shape: {input_ids.shape}")

# Generate the response
print("Generating response...")
with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=videos,  # pass the processed video tensor list
        do_sample=True,
        temperature=0.2,
        top_p=1.0,
        num_beams=1,
        max_new_tokens=1024,
        use_cache=True,
    )

# Decode and print the output
outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(f"\nUser input: {question}\n")
print(f"Model output:\n{outputs}")
```
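The script above runs a single question end to end. If you want to ask several questions about the same clip without reloading anything, the per-question steps can be folded into a small helper. This is a convenience sketch that simply reuses the `model`, `tokenizer`, `videos`, and conversation utilities created above; it is not part of the official example, and it switches to greedy decoding.

```python
# Convenience wrapper around the per-question steps above. Assumes `model`,
# `tokenizer`, `videos`, `conv_templates`, and the llava constants are already
# defined by the previous script; each call starts a fresh single-turn conversation.
def ask(question: str, max_new_tokens: int = 512) -> str:
    if model.config.mm_use_im_start_end:
        prompt = DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN + DEFAULT_IM_END_TOKEN + "\n" + question
    else:
        prompt = DEFAULT_IMAGE_TOKEN + "\n" + question

    conv = conv_templates["qwen_1_5"].copy()
    conv.append_message(conv.roles[0], prompt)
    conv.append_message(conv.roles[1], None)

    input_ids = tokenizer_image_token(conv.get_prompt(), tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt")
    if input_ids.ndim == 1:
        input_ids = input_ids.unsqueeze(0)
    input_ids = input_ids.to(model.device)

    with torch.inference_mode():
        output_ids = model.generate(
            input_ids,
            images=videos,
            do_sample=False,  # greedy decoding for more deterministic answers
            max_new_tokens=max_new_tokens,
            use_cache=True,
        )
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()

print(ask("How many animals appear in the video?"))
```

Each call starts a new conversation, so answers do not carry over context from previous questions.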

## License

The model weights are released under the [CC-BY-NC-4.0 license](LICENSE).
The code is released under the Apache 2.0 license.
Users must comply with all terms and conditions of the original licenses, including the license of the base language model ([Qwen2 License](https://huggingface.co/Qwen/Qwen2-7B-Instruct/blob/main/LICENSE)).

## Citation

If you find this work useful, please consider citing the paper:

```bibtex
@misc{zhou2025slowfast,
  title={Slow-Fast Architecture for Video Multi-Modal Large Language Models},
  author={Yifei Zhou and Jiaming Zuo and Chen Change Loy and Chongyang Zhong and Xin Wang and Qi Wu and Weidong Cai and Xiaodong He and Qingzhong Wang and Lei Zhang and Marcelo H. Ang Jr and Boyang Li and Yanfeng Wang and Qinghai He and Fengbei Liu and Liangchen Luo and Jingdong Wang and Conghui He and Wenhai Wang},
  year={2025},
  eprint={2504.01328},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```

*(Note: please verify the author list against the arXiv entry for [2504.01328](https://arxiv.org/abs/2504.01328) before citing.)*