ynhe committed 579a6ca (verified) · Parent: 1905130

Update README.md

Files changed (1): README.md (+259, -3)

---
license: mit
pipeline_tag: video-text-to-text
extra_gated_prompt: >-
  You agree to not use the model to conduct experiments that cause harm to human
  subjects.
extra_gated_fields:
  Name: text
  Company/Organization: text
  Country: text
  E-Mail: text
---

# InternVideo2-Chat-8B-HD-f16

[\[📂 GitHub\]](https://github.com/OpenGVLab/InternVideo/tree/main/InternVideo2) [\[📜 Tech Report\]](https://arxiv.org/abs/2403.15377)
<!-- [\[🗨️ Chat Demo\]](https://vchat.opengvlab.com/) -->

To further enrich the semantics embedded in **InternVideo2** and improve its user-friendliness in human communication, we tune InternVideo2 by incorporating it into a VideoLLM together with an LLM and a video BLIP. We employ the progressive learning scheme from [VideoChat](https://arxiv.org/abs/2311.17005), using InternVideo2 as the video encoder and training a video BLIP to communicate with an open-sourced LLM. During training, the video encoder is updated. Detailed training recipes are in [VideoChat](https://arxiv.org/abs/2311.17005). This model has undergone HD training.

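For intuition only, here is a heavily simplified, hypothetical sketch of the data flow described above. The function and argument names are illustrative and are not the API of the code shipped with this repository:

```python
# Conceptual sketch only: illustrative names, NOT the implementation shipped
# with this repository via trust_remote_code.
import torch

def toy_videollm_forward(video_frames, text_embeds, video_encoder, video_blip, llm):
    """Compose the pieces described above: InternVideo2 encoder -> video BLIP -> LLM."""
    vis_feats = video_encoder(video_frames)   # spatiotemporal features (the encoder is tuned too)
    vis_tokens = video_blip(vis_feats)        # BLIP-style bridge into the LLM embedding space
    # Visual tokens are placed in the LLM context together with the text embeddings.
    return llm(inputs_embeds=torch.cat([vis_tokens, text_embeds], dim=1))
```
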
The base LLM of this model is Mistral-7B. **Before using it, please ensure that you have obtained access to Mistral-7B.** If you have not yet done so, please go to [Mistral-7B](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3) to request access, and add your `HF_TOKEN` to the environment variables.

## 📈 Performance
| Model | MVBench | VideoMME (w/o sub) |
| --- | --- | --- |
| [InternVideo2-Chat-8B](https://huggingface.co/OpenGVLab/InternVideo2-Chat-8B) | 60.3 | 41.9 |
| [InternVideo2-Chat-8B-HD](https://huggingface.co/OpenGVLab/InternVideo2_chat_8B_HD) | 65.4 | 46.1 |
| [InternVideo2-Chat-8B-HD-F16](https://huggingface.co/OpenGVLab/InternVideo2_chat_8B_HD_F16) | **67.5** | **49.4** |
| [InternVideo2-Chat-8B-InternLM](https://huggingface.co/OpenGVLab/InternVideo2_Chat_8B_InternLM2_5) | 61.9 | 49.1 |

## 🚀 How to use the model

1. Apply for access to this repository and to the base LLM.

2. Set your HF user access token as an environment variable:

```shell
export HF_TOKEN=hf_....
```
If you don't know how to obtain a token starting with "hf_", please refer to: [How to Get an HF User Access Token](https://huggingface.co/docs/hub/security-tokens#user-access-tokens)

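If you prefer to authenticate from Python instead of (or in addition to) the environment variable, a minimal sketch using `huggingface_hub` (assuming `HF_TOKEN` is already set) is:

```python
# Optional: programmatic login with huggingface_hub (sketch; the environment
# variable alone is sufficient for the example below).
import os
from huggingface_hub import login

login(token=os.environ["HF_TOKEN"])  # registers the token for this session
```
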
3. Make sure you have `transformers >= 4.38.0`.

Install the requisite Python packages from [pip_requirements](https://huggingface.co/OpenGVLab/InternVideo2_chat_8B_HD/blob/main/requirements.txt) (for example, download the file and run `pip install -r requirements.txt`).

4. Run inference with a video input:

```python
import os
token = os.environ['HF_TOKEN']
import torch

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained(
    'OpenGVLab/InternVideo2_chat_8B_HD_F16',
    trust_remote_code=True,
    use_fast=False,
    token=token)
# Load the model with its custom code; bfloat16 keeps memory usage manageable.
if torch.cuda.is_available():
    model = AutoModel.from_pretrained(
        'OpenGVLab/InternVideo2_chat_8B_HD_F16',
        torch_dtype=torch.bfloat16,
        trust_remote_code=True).cuda()
else:
    model = AutoModel.from_pretrained(
        'OpenGVLab/InternVideo2_chat_8B_HD_F16',
        torch_dtype=torch.bfloat16,
        trust_remote_code=True)


import numpy as np
import decord
from decord import VideoReader, cpu
from PIL import Image
import torch.nn.functional as F
import torchvision.transforms as T
from torchvision.transforms import PILToTensor
from torchvision import transforms
from torchvision.transforms.functional import InterpolationMode
decord.bridge.set_bridge("torch")

def get_index(num_frames, num_segments):
    # Uniformly sample num_segments frame indices, one from the middle of each segment.
    seg_size = float(num_frames - 1) / num_segments
    start = int(seg_size / 2)
    offsets = np.array([
        start + int(np.round(seg_size * idx)) for idx in range(num_segments)
    ])
    return offsets
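
# Worked example (added for clarity, not in the original snippet):
#   get_index(num_frames=161, num_segments=8)
#   -> seg_size = 160 / 8 = 20.0, start = 10
#   -> offsets  = [10, 30, 50, 70, 90, 110, 130, 150]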

def load_video(video_path, num_segments=8, return_msg=False, resolution=224, hd_num=4, padding=False):
    vr = VideoReader(video_path, ctx=cpu(0), num_threads=1)
    num_frames = len(vr)
    frame_indices = get_index(num_frames, num_segments)

    mean = (0.485, 0.456, 0.406)
    std = (0.229, 0.224, 0.225)

    transform = transforms.Compose([
        transforms.Lambda(lambda x: x.float().div(255.0)),
        transforms.Normalize(mean, std)
    ])

    frames = vr.get_batch(frame_indices)
    frames = frames.permute(0, 3, 1, 2)

    if padding:
        frames = HD_transform_padding(frames.float(), image_size=resolution, hd_num=hd_num)
    else:
        frames = HD_transform_no_padding(frames.float(), image_size=resolution, hd_num=hd_num)

    frames = transform(frames)
    # print(frames.shape)
    T_, C, H, W = frames.shape

    sub_img = frames.reshape(
        1, T_, 3, H//resolution, resolution, W//resolution, resolution
    ).permute(0, 3, 5, 1, 2, 4, 6).reshape(-1, T_, 3, resolution, resolution).contiguous()

    glb_img = F.interpolate(
        frames.float(), size=(resolution, resolution), mode='bicubic', align_corners=False
    ).to(sub_img.dtype).unsqueeze(0)

    frames = torch.cat([sub_img, glb_img]).unsqueeze(0)

    if return_msg:
        fps = float(vr.get_avg_fps())
        sec = ", ".join([str(round(f / fps, 1)) for f in frame_indices])
        # a leading and trailing space can be added around this message if needed
        msg = f"The video contains {len(frame_indices)} frames sampled at {sec} seconds."
        return frames, msg
    else:
        return frames
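
# Descriptive note added here (not in the original snippet):
#   load_video returns a tensor of shape [1, N_local + 1, T, 3, resolution, resolution],
#   where the N_local "sub-images" are resolution x resolution tiles cut from each
#   HD-transformed frame and the final entry is a globally resized view of the whole
#   frame. With return_msg=True it also returns a short human-readable sampling summary.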

def HD_transform_padding(frames, image_size=224, hd_num=6):
    def _padding_224(frames):
        _, _, H, W = frames.shape
        tar = int(np.ceil(H / 224) * 224)
        top_padding = (tar - H) // 2
        bottom_padding = tar - H - top_padding
        left_padding = 0
        right_padding = 0

        padded_frames = F.pad(
            frames,
            pad=[left_padding, right_padding, top_padding, bottom_padding],
            mode='constant', value=255
        )
        return padded_frames

    _, _, H, W = frames.shape
    trans = False
    if W < H:
        frames = frames.flip(-2, -1)
        trans = True
        width, height = H, W
    else:
        width, height = W, H

    ratio = width / height
    scale = 1
    while scale * np.ceil(scale / ratio) <= hd_num:
        scale += 1
    scale -= 1
    new_w = int(scale * image_size)
    new_h = int(new_w / ratio)

    resized_frames = F.interpolate(
        frames, size=(new_h, new_w),
        mode='bicubic',
        align_corners=False
    )
    padded_frames = _padding_224(resized_frames)

    if trans:
        padded_frames = padded_frames.flip(-2, -1)

    return padded_frames

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio


def HD_transform_no_padding(frames, image_size=224, hd_num=6, fix_ratio=(2,1)):
    min_num = 1
    max_num = hd_num
    _, _, orig_height, orig_width = frames.shape
    aspect_ratio = orig_width / orig_height

    # calculate the existing video aspect ratio
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    if fix_ratio:
        target_aspect_ratio = fix_ratio
    else:
        target_aspect_ratio = find_closest_aspect_ratio(
            aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the frames
    resized_frame = F.interpolate(
        frames, size=(target_height, target_width),
        mode='bicubic', align_corners=False
    )
    return resized_frame
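
# Note added for clarity (not in the original snippet): with the default
# fix_ratio=(2, 1), HD_transform_no_padding simply resizes every frame to
# a 2:1 grid of image_size tiles (e.g. 224 x 448 for resolution=224), so
# load_video slices each frame into 2 local tiles plus 1 global view.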

video_path = "yoga.mp4"
# uniformly sample 16 frames from the video
video_tensor = load_video(video_path, num_segments=16, return_msg=False, resolution=224, hd_num=6)
video_tensor = video_tensor.to(model.device)

chat_history = []
response, chat_history = model.chat(
    tokenizer, '', 'Describe the action step by step.',
    media_type='video', media_tensor=video_tensor, chat_history=chat_history,
    return_history=True, generation_config={'do_sample': False})
print(response)

response, chat_history = model.chat(
    tokenizer, '', 'What is she wearing?',
    media_type='video', media_tensor=video_tensor, chat_history=chat_history,
    return_history=True, generation_config={'do_sample': False})
print(response)
```
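
The conversation can be continued by reusing the same `model.chat` call with the accumulated `chat_history`. As a small usage sketch (same signature as above; `return_msg=True` is used here only to print how the frames were sampled, and is not part of the official example):

```python
# Sketch: inspect how frames were sampled, then ask a follow-up question.
video_tensor, msg = load_video(video_path, num_segments=16, return_msg=True,
                               resolution=224, hd_num=6)
print(msg)  # e.g. "The video contains 16 frames sampled at ... seconds."
video_tensor = video_tensor.to(model.device)

response, chat_history = model.chat(
    tokenizer, '', 'Summarize the video in one sentence.',
    media_type='video', media_tensor=video_tensor, chat_history=chat_history,
    return_history=True, generation_config={'do_sample': False})
print(response)
```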

## ✏️ Citation
If this work is helpful for your research, please consider citing InternVideo and VideoChat.

```bibtex
@article{wang2024internvideo2,
  title={Internvideo2: Scaling video foundation models for multimodal video understanding},
  author={Wang, Yi and Li, Kunchang and Li, Xinhao and Yu, Jiashuo and He, Yinan and Wang, Chenting and Chen, Guo and Pei, Baoqi and Zheng, Rongkun and Xu, Jilan and Wang, Zun and others},
  journal={arXiv preprint arXiv:2403.15377},
  year={2024}
}

@article{li2023videochat,
  title={Videochat: Chat-centric video understanding},
  author={Li, KunChang and He, Yinan and Wang, Yi and Li, Yizhuo and Wang, Wenhai and Luo, Ping and Wang, Yali and Wang, Limin and Qiao, Yu},
  journal={arXiv preprint arXiv:2305.06355},
  year={2023}
}
```