Add model card for Slow-Fast Video MLLM (Qwen2-7B, 64 Frames)

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +202 -3
README.md CHANGED
@@ -1,3 +1,202 @@
- ---
- license: cc-by-nc-4.0
- ---
---
license: cc-by-nc-4.0
library_name: transformers
pipeline_tag: video-text-to-text
tags:
- llava
- qwen2
- slow-fast
---

# Slow-Fast Architecture for Video Multi-Modal Large Language Models (Qwen2-7B, 64 Frames)

This repository contains the **Slow-Fast Video MLLM (Qwen2-7B, ConvNeXt-576, 64 frames, stride 1/4)** model presented in the paper [Slow-Fast Architecture for Video Multi-Modal Large Language Models](https://huggingface.co/papers/2504.01328).

[Code Repository](https://github.com/SHI-Labs/Slow-Fast-Video-Multimodal-LLM) | [HuggingFace Collection](https://huggingface.co/collections/shi-labs/slow-fast-video-mllm-67ef347a28772734c15a78b5)

## Model Description

This model uses a slow-fast architecture to balance temporal resolution and spatial detail in video multi-modal large language models (MLLMs) under a limited compute budget. Existing methods typically compress the video representation irreversibly and lose detail in the process.

Inspired by how humans first skim a video before focusing on the relevant parts, the slow-fast design uses a dual-token strategy:
1. **"Fast" visual tokens:** a compact set of compressed video features fed into the LLM (Qwen2-7B-Instruct) alongside the text embeddings to give a quick overview.
2. **"Slow" visual tokens:** uncompressed video features that the text embeddings cross-attend to through specially designed hybrid decoder layers, enabling instruction-aware extraction of relevant visual detail with linear complexity.

This design allows the model to process more input frames (64 for this checkpoint) while preserving spatial detail, which yields significant gains on video understanding benchmarks over self-attention-only baselines. This checkpoint pairs a Qwen2-7B-Instruct base LLM with a ConvNeXt-576 vision tower.
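To make the dual-token idea concrete, here is a minimal, self-contained sketch of a hybrid decoder layer. It is an illustration, not the repository's implementation: the dimensions, module layout, and use of plain PyTorch `MultiheadAttention` are assumptions, and norms and causal masking are omitted.

```python
# Conceptual sketch only (not the actual Slow-Fast implementation):
# self-attention runs over text + compressed "fast" tokens, while a
# cross-attention step lets those hidden states query the much longer,
# uncompressed "slow" token sequence.
import torch
import torch.nn as nn

class HybridDecoderLayerSketch(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, hidden, slow_tokens):
        # hidden:      [B, T_text + T_fast, D]  (text embeddings + compressed fast tokens)
        # slow_tokens: [B, T_slow, D]           (uncompressed video features, T_slow >> T_fast)
        hidden = hidden + self.self_attn(hidden, hidden, hidden, need_weights=False)[0]
        # text/fast hidden states query the full-resolution slow tokens
        hidden = hidden + self.cross_attn(hidden, slow_tokens, slow_tokens, need_weights=False)[0]
        return hidden + self.mlp(hidden)

B, D = 1, 1024
text = torch.randn(B, 32, D)           # text embeddings
fast = torch.randn(B, 64, D)           # heavily pooled "fast" video tokens (made-up size)
slow = torch.randn(B, 64 * 144, D)     # e.g. 64 frames x 144 patches, uncompressed (made-up size)
layer = HybridDecoderLayerSketch(D)
out = layer(torch.cat([text, fast], dim=1), slow)
print(out.shape)  # torch.Size([1, 96, 1024])
```

Because the slow tokens only ever appear as keys and values in cross-attention, the cost of attending to them grows linearly with their number rather than quadratically, which is what lets the model keep many uncompressed frames in view.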

<div align="center">
<img src="https://huggingface.co/shi-labs/slowfast-video-mllm-qwen2-7b-convnext-576-frame64-s1t4/resolve/main/assets/images/fig-teaser.png" width="45%">
</div>

## Usage

**Note:** This model relies on custom code (`LlavaQwenSlowFastForCausalLM`) that is not part of the standard `transformers` release. Either install the packages from the [official repository](https://github.com/SHI-Labs/Slow-Fast-Video-Multimodal-LLM) or pass `trust_remote_code=True` when loading the model.

First, clone the repository and install its requirements if you are running locally:
```bash
git clone https://github.com/SHI-Labs/Slow-Fast-Video-Multimodal-LLM.git
cd Slow-Fast-Video-Multimodal-LLM
pip install --upgrade pip
pip install -r requirements.txt
# Make the cloned repo importable, e.g. by adding it to PYTHONPATH
export PYTHONPATH=$(pwd):$PYTHONPATH
```
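Before running the full example, you can verify that the environment is ready with a quick check like the one below (the module names are the ones used by the repository and this card; adjust if your setup differs):

```python
# Minimal environment sanity check: confirms that the repo's `llava` package
# and the main dependencies are importable from the current PYTHONPATH.
import importlib.util

for module in ("llava", "torch", "decord", "transformers"):
    status = "ok" if importlib.util.find_spec(module) is not None else "MISSING"
    print(f"{module:12s} {status}")
```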

Then, use the following Python script:
```python
import os

import numpy as np
import requests  # used to download the demo video
import torch
from decord import VideoReader

# These helpers come from the cloned repository; make sure it is on your
# PYTHONPATH (see above). The model class itself is loaded via trust_remote_code.
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
from llava.conversation import conv_templates
from llava.model.builder import load_pretrained_model
from llava.mm_utils import tokenizer_image_token, get_model_name_from_path
from llava.utils import disable_torch_init


def load_video(video_path, max_frames_num):
    """Load `max_frames_num` frames from a video as a (T, H, W, C) uint8 array."""
    vr = VideoReader(video_path, num_threads=4)
    total_frames = len(vr)

    if total_frames >= max_frames_num:
        # Uniformly sample frames across the whole video
        uniform_sampled_frames = np.linspace(0, total_frames - 1, max_frames_num, dtype=int)
        frame_idx = uniform_sampled_frames.tolist()
    else:
        # If the video is shorter than max_frames_num, take every frame and repeat the last one
        frame_idx = list(range(total_frames))
        frame_idx.extend([total_frames - 1] * (max_frames_num - total_frames))

    try:
        spare_frames = vr.get_batch(frame_idx).asnumpy()
    except Exception as e:
        # Re-raise so the caller can decide how to handle an unreadable video
        print(f"Error loading video frames: {e}")
        raise

    return spare_frames


# Model configuration
model_path = "shi-labs/slowfast-video-mllm-qwen2-7b-convnext-576-frame64-s1t4"
video_url = "https://huggingface.co/shi-labs/slowfast-video-mllm-qwen2-7b-convnext-576-frame64-s1t4/resolve/main/assets/catinterrupt.mp4"
video_local_path = "catinterrupt.mp4"
question = "Please describe this video in detail."
max_frames = 64  # this checkpoint was trained with 64 frames

# Download the demo video if it is not present
if not os.path.exists(video_local_path):
    print(f"Downloading video from {video_url}...")
    response = requests.get(video_url, stream=True)
    response.raise_for_status()  # raise an exception for bad status codes
    with open(video_local_path, "wb") as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)
    print("Download complete.")

# Load the model and processor
disable_torch_init()
model_name = get_model_name_from_path(model_path)

# trust_remote_code=True loads the custom LlavaQwenSlowFastForCausalLM architecture
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path,
    None,
    model_name,
    use_flash_attn=True,         # use Flash Attention if available
    device_map="auto",           # distribute the model across available GPUs/CPU
    torch_dtype=torch.bfloat16,  # bfloat16 for efficiency
    trust_remote_code=True,
)

# Prepare the prompt
if model.config.mm_use_im_start_end:
    prompt = DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN + DEFAULT_IM_END_TOKEN + "\n" + question
else:
    prompt = DEFAULT_IMAGE_TOKEN + "\n" + question

conv = conv_templates["qwen_1_5"].copy()  # conversation template used by this checkpoint
conv.append_message(conv.roles[0], prompt)
conv.append_message(conv.roles[1], None)
prompt_final = conv.get_prompt()

# Load video frames
print("Loading video...")
video_frames = load_video(video_local_path, max_frames_num=max_frames)
print(f"Video loaded, shape: {video_frames.shape}")

# Preprocess video frames; the input must have shape (T, H, W, C)
print("Preprocessing video...")
video_tensor = image_processor.preprocess(video_frames, return_tensors="pt")["pixel_values"]
video_tensor = video_tensor.to(model.device, dtype=torch.bfloat16)
videos = [video_tensor]  # the model expects a list of video tensors
print(f"Video tensor processed, shape: {videos[0].shape}")

# Tokenize the prompt
input_ids = tokenizer_image_token(prompt_final, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt")
input_ids = input_ids.to(device=model.device, non_blocking=True)
# Add a batch dimension if the tokenizer returned a 1-D tensor
if input_ids.ndim == 1:
    input_ids = input_ids.unsqueeze(0)
print(f"Input IDs processed, shape: {input_ids.shape}")

# Generate the response
print("Generating response...")
with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=videos,  # pass the processed video tensor list
        do_sample=True,
        temperature=0.2,
        top_p=1.0,
        num_beams=1,
        max_new_tokens=1024,
        use_cache=True,
    )

# Decode and print the output
outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(f"\nUser input: {question}\n")
print(f"Model output:\n{outputs}")
```
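The script above runs a single question end to end. If you want to ask several questions about the same clip without reloading anything, the per-question steps can be folded into a small helper. This is a convenience sketch that simply reuses the `model`, `tokenizer`, `videos`, and conversation utilities created above; it is not part of the official example, and it switches to greedy decoding.

```python
# Convenience wrapper around the per-question steps above. Assumes `model`,
# `tokenizer`, `videos`, `conv_templates`, and the llava constants are already
# defined by the previous script; each call starts a fresh single-turn conversation.
def ask(question: str, max_new_tokens: int = 512) -> str:
    if model.config.mm_use_im_start_end:
        prompt = DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN + DEFAULT_IM_END_TOKEN + "\n" + question
    else:
        prompt = DEFAULT_IMAGE_TOKEN + "\n" + question

    conv = conv_templates["qwen_1_5"].copy()
    conv.append_message(conv.roles[0], prompt)
    conv.append_message(conv.roles[1], None)

    input_ids = tokenizer_image_token(conv.get_prompt(), tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt")
    if input_ids.ndim == 1:
        input_ids = input_ids.unsqueeze(0)
    input_ids = input_ids.to(model.device)

    with torch.inference_mode():
        output_ids = model.generate(
            input_ids,
            images=videos,
            do_sample=False,  # greedy decoding for more deterministic answers
            max_new_tokens=max_new_tokens,
            use_cache=True,
        )
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()

print(ask("How many animals appear in the video?"))
```

Each call starts a new conversation, so answers do not carry over context from previous questions.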

## License

The model weights are released under the [CC-BY-NC-4.0 license](LICENSE).
The code is released under the Apache 2.0 license.
Users must comply with all terms and conditions of the original licenses, including the license of the base language model ([Qwen2 License](https://huggingface.co/Qwen/Qwen2-7B-Instruct/blob/main/LICENSE)).

## Citation

If you find this work useful, please consider citing the paper:

```bibtex
@misc{zhou2025slowfast,
  title={Slow-Fast Architecture for Video Multi-Modal Large Language Models},
  author={Yifei Zhou and Jiaming Zuo and Chen Change Loy and Chongyang Zhong and Xin Wang and Qi Wu and Weidong Cai and Xiaodong He and Qingzhong Wang and Lei Zhang and Marcelo H. Ang Jr and Boyang Li and Yanfeng Wang and Qinghai He and Fengbei Liu and Liangchen Luo and Jingdong Wang and Conghui He and Wenhai Wang},
  year={2025},
  eprint={2504.01328},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```

*(Note: please verify the author list against the arXiv entry for [2504.01328](https://arxiv.org/abs/2504.01328) before citing.)*