Update README.md
Browse files
README.md
CHANGED
@@ -54,3 +54,68 @@ The following hyperparameters were used during training:
|
|
54 |
- Pytorch 2.5.1+cu124
|
55 |
- Datasets 2.18.0
|
56 |
- Tokenizers 0.21.0
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
54 |
- Pytorch 2.5.1+cu124
|
55 |
- Datasets 2.18.0
|
56 |
- Tokenizers 0.21.0
|
57 |
+
|
58 |
+
|
59 |
+
```python
|
60 |
+
|
61 |
+
from PIL import Image
|
62 |
+
import requests
|
63 |
+
from transformers import AutoProcessor, AutoModel
|
64 |
+
from mantis.models.siglip_video import SiglipVideoModel
|
65 |
+
import torch
|
66 |
+
import numpy as np
|
67 |
+
import av
|
68 |
+
def read_video_pyav(container, indices):
    '''
    Decode the video with PyAV decoder.

    Args:
        container (av.container.input.InputContainer): PyAV container.
        indices (List[int]): List of frame indices to decode.

    Returns:
        np.ndarray: np array of decoded frames of shape (num_frames, height, width, 3).

    Raises:
        ValueError: if no requested frame could be decoded at all
            (np.stack rejects an empty sequence).
    '''
    frames = []
    # Rewind so repeated calls on the same container start at frame 0.
    container.seek(0)
    if len(indices) == 0:
        # Fall back to the first frame instead of failing on an empty request.
        indices = [0]
        print("No indices to decode, might be an empty video please check")
    start_index = indices[0]
    end_index = indices[-1]
    # Hoist an O(1)-membership set: testing `i in indices` against a
    # list/ndarray is a linear scan per decoded frame (quadratic overall).
    wanted = {int(idx) for idx in indices}
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            # Every requested index has been passed; stop decoding early.
            break
        if i >= start_index and i in wanted:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])
|
93 |
+
|
94 |
+
# Example: score a video clip against a text description with SiglipVideoModel.
# model = SiglipVideoModel.from_pretrained("google/siglip-so400m-patch14-384")
model = SiglipVideoModel.from_pretrained("Mantis-VL/siglip-video_16384_2fps_128").to("cuda:2")  # NOTE(review): adjust the device id to your machine
processor = AutoProcessor.from_pretrained("google/siglip-so400m-patch14-384")


container = av.open("../mochi.mp4")
# container = av.open("/home/dongfu/WorkSpace/Mantis/data/llava-video/data/0_30_s_youtube_v0_1/videos/liwei_youtube_videos/videos/youtube_video_2024/ytb_F-FpE2GWW84.mp4")
total_frames = container.streams.video[0].frames
sample_fps = 2  # frames sampled per second of video
ori_fps = container.streams.video[0].average_rate
# max(..., 1) guards the stride: if the source frame rate is below
# sample_fps, int(ori_fps / sample_fps) is 0 and np.arange raises
# "step cannot be zero".
indices = np.arange(0, total_frames, max(int(ori_fps / sample_fps), 1))
frames = read_video_pyav(container, indices)

text = "Close-up of a chameleon's eye, with its scaly skin changing color. Ultra high resolution 4k."
# text = "The video showcases a group of individuals dressed in matching military-style uniforms, consisting of long, light-colored tunics and dark vests, marching in unison. They are carrying large, black, shoulder-mounted weapons, and the background appears to be an open area, possibly a parade ground or a military base, with a clear sky overhead. The text overlay in English reads, 'Talabani won victory over America with an impossible weapon,' suggesting a narrative of triumph using unconventional means. The individuals are seen marching in a coordinated manner, emphasizing discipline and uniformity. As the video progresses, the group continues their synchronized march, maintaining the same background setting. The text overlay, 'Talabani won victory over America with an impossible weapon,' reappears, reinforcing the narrative of triumph. One individual in the foreground is prominently holding a rifle, adding to the display of military prowess. The video emphasizes the themes of discipline, coordination, and military strength."

print(frames.shape)  # (num_frames, height, width, 3)
inputs = processor(text=[text], images=frames, padding="max_length", return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}
# The model takes a list of per-video pixel tensors, one entry per video.
inputs['pixel_values'] = [inputs['pixel_values']]
with torch.no_grad():
    outputs = model(**inputs)
logits_per_video = outputs.logits_per_video
print(logits_per_video)
probs = torch.sigmoid(logits_per_video) # these are the probabilities
print(f"{probs[0][0]:.1%} the video contains the text: '{text}'")
|
120 |
+
|
121 |
+
```
|