---
language:
- en
library_name: transformers
license: apache-2.0
metrics:
- accuracy
tags:
- multimodal
pipeline_tag: video-text-to-text
base_model: Qwen/Qwen2.5-VL-7B-Instruct
---

# 💡 VideoChat-R1_7B

[\[📂 GitHub\]](https://github.com/OpenGVLab/VideoChat-R1) [\[📜 Tech Report\]](https://arxiv.org/pdf/2504.06958)

## 🚀 How to use the model

We provide a simple installation example below:

```
pip install transformers
pip install qwen_vl_utils
```

Then you can use our model:

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_path = "OpenGVLab/VideoChat-R1_7B"

# Default: load the model on the available device(s).
# Note: attn_implementation="flash_attention_2" requires the flash-attn package;
# drop this argument to fall back to the default attention implementation.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
    attn_implementation="flash_attention_2"
)

# Default processor
processor = AutoProcessor.from_pretrained(model_path)

video_path = "your_video.mp4"
question = "Where is the final cup containing the object?"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": video_path,
                "max_pixels": 460800,
                "nframes": 32
            },
            {"type": "text", "text": f"""{question} Provide your final answer within the <answer> </answer> tags."""},
        ],
    }
]

# In Qwen2.5-VL, frame rate information is also fed to the model to align with absolute time.
# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs,
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=512)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```

A short sketch for extracting the final answer from the decoded output is given at the end of this card.

## ✏️ Citation

```bibtex
@article{li2025videochatr1,
  title={VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning},
  author={Li, Xinhao and Yan, Ziang and Meng, Desen and Dong, Lu and Zeng, Xiangyu and He, Yinan and Wang, Yali and Qiao, Yu and Wang, Yi and Wang, Limin},
  journal={arXiv preprint arXiv:2504.06958},
  year={2025}
}
```
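
## 🔍 Parsing the model output (sketch)

If the model follows the prompt and wraps its final answer in `<answer> </answer>` tags, the answer can be pulled out of the decoded string with a small helper. This is a minimal sketch, not part of the model's API: it reuses `output_text` from the example above, and `extract_answer` is a hypothetical helper that assumes the tag format requested in the prompt.

```python
import re

def extract_answer(decoded: str) -> str:
    """Return the text inside the last <answer> ... </answer> pair,
    or the full decoded string if no tags are found (assumed output format)."""
    matches = re.findall(r"<answer>(.*?)</answer>", decoded, flags=re.DOTALL)
    return matches[-1].strip() if matches else decoded.strip()

# output_text is the list returned by processor.batch_decode above
print(extract_answer(output_text[0]))
```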