---
license_name: qwen-research
license_link: https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct/blob/main/LICENSE
language:
- en
pipeline_tag: image-text-to-text
tags:
- multimodal
- tracking
- lora
library_name: transformers
---

# Qwen2.5-VL-3B-TrackAnyObject-LoRa-v1

## Introduction

Qwen2.5-VL was not originally trained for object tracking. While it can perform object detection on individual frames or on video input, processing N frames sequentially produces identical predictions for each frame, so the model cannot maintain consistent object IDs across predictions.

We provide a LoRA adapter for Qwen2.5-VL-3B that adds object tracking capabilities.

### 🚀 Key Enhancement:
* **Object tracking** - Supports frame-by-frame tracking of arbitrary objects.

### 📝 Training info

The LoRA for Qwen2.5-VL-3B was trained with the following parameters:

- Max video side: 896
- Max frames count: 16
- LoRA rank: 64
- LoRA alpha: 128
- Epochs: 30 (about 9k steps)
- Training dataset: TAO-Amodal

#### Output format:
```json
{
    {frame_id:str}: {
        {object_name:str}: {
            {object_id:str}: [x_top:int, y_top:int, x_bottom:int, y_bottom:int],
            ...
        },
        ...
    },
    ...
}
```

##### Tokens

The input token count depends on the video size and the number of frames.

Example: video of shape (16, 504, 896) with a single object that is present on every frame.
- Input tokens count: 4759
- Output tokens count: 492

You can reuse the last N predicted frames as memory for the next iteration, which lets you process videos of any length (see the sketch in "Process longer videos" below).

## 🛠️ How to

### Requirements
```bash
pip install -U torch transformers peft denku
```

### imports
```python
import torch
from peft import PeftModel
from transformers import Qwen2_5_VLForConditionalGeneration, Qwen2_5_VLProcessor
from denku import read_video
```

### Define prompt templates
````python
system_prompt = """
You are professional video assistant. You get a video consisting of N frames.
Track objects in a video for each frame. Do prediction frame by frame.
For each object in user request output unique ID of each object and coordinates of bounding box.
Provide the result in json format.
Output format:
```json
{
    {frame_id:str}: {
        {object_name:str}: {
            {object_id:str}: [x_top, y_top, x_bottom, y_bottom],
        }
    }
}\n```
Use additional parameters and instructions from user request.
"""

user_prompt_template = """
Extract information from the video. Video consists of {N_FRAMES} frames.
Track objects in this video: [{OBJECTS_FOR_TRACKING}].
"""
````

### Load model and LoRA
```python
model_name = "Qwen/Qwen2.5-VL-3B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
processor = Qwen2_5_VLProcessor.from_pretrained(model_name, use_fast=False)

adapter_name = "TheDenk/Qwen2.5-VL-3B-TrackAnyObject-LoRa-v1"
model = PeftModel.from_pretrained(model, adapter_name, is_trainable=False)
```

### Prepare video
```python
device = "cuda"
objects_for_tracking = "person"  ## "person, cat", "person, cat, dog"

## Load video and convert to numpy array of shape (num_frames, height, width, channels)
video, fps = read_video(video_path="path to video.mp4", start_frame=0, frames_count=16, max_side=896)
```
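If you prefer not to install `denku`, the snippet below is a minimal OpenCV stand-in with the same call signature as the `read_video` call above. It is an illustrative assumption, not denku's actual implementation: the RGB conversion and the downscaling of the longest side to `max_side` may differ slightly from the preprocessing the adapter saw during training.

```python
import cv2
import numpy as np

def read_video_cv2(video_path: str, start_frame: int = 0, frames_count: int = 16, max_side: int = 896):
    """Hypothetical stand-in for denku.read_video: returns (frames, fps), where frames is a
    uint8 RGB array of shape (num_frames, height, width, channels)."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    cap.set(cv2.CAP_PROP_POS_FRAMES, start_frame)
    frames = []
    while len(frames) < frames_count:
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        h, w = frame.shape[:2]
        scale = max_side / max(h, w)
        if scale < 1.0:  # only downscale, keep aspect ratio
            frame = cv2.resize(frame, (int(w * scale), int(h * scale)))
        frames.append(frame)
    cap.release()
    return np.stack(frames), fps

# video, fps = read_video_cv2("path to video.mp4", start_frame=0, frames_count=16, max_side=896)
```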
""" ``` ### Load model and LoRa ```python model_name = "Qwen/Qwen2.5-VL-3B-Instruct" model = Qwen2_5_VLForConditionalGeneration.from_pretrained( model_name, torch_dtype=torch.bfloat16, device_map="auto" ) processor = Qwen2_5_VLProcessor.from_pretrained(model_name, use_fast=False) adapter_name = "TheDenk/Qwen2.5-VL-3B-TrackAnyObject-LoRa-v1" model = PeftModel.from_pretrained(model, adapter_name, is_trainable=False) ``` ### Prepare video ```python device = "cuda" objects_for_tracking = "person" ## "person, cat", "person, cat, dog" ## Load video and convert to numpy array of shape (num_frames, height, width, channels) video, fps = read_video(video_path="path to video.mp4", start_frame=0, frames_count=16, max_side=896) ``` ### Run inference ```python user_prompt = user_prompt_template.replace("{N_FRAMES}", f"{video.shape[0]}").replace("{OBJECTS_FOR_TRACKING}", objects_for_tracking) conversation = [ { "role": "system", "content": [ {"type": "text", "text": system_prompt} ] }, { "role": "user", "content": [ {"type": "video", "fps": 16}, {"type": "text", "text": user_prompt} ] } ] prompts = processor.apply_chat_template(conversation=conversation, add_generation_prompt=True) inputs = processor( text=[prompts], videos=[video], return_tensors="pt" ) inputs = inputs.to(device) outputs = model.generate(**inputs, do_sample=True, temperature=0.9, top_k=5, top_p=1.0, max_new_tokens=1024 * 2) print(f"[ TOKENS COUNT ] [INPUT: {inputs.input_ids.shape[1]} | OUTPUT: {outputs[0][inputs.input_ids.shape[1]:].shape[0]}") output_text = processor.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True) ``` #### Output example: ```python """ ```json {"0": {"person": {"0": [423, 113, 481, 275]}}, "1": {"person": {"0": [425, 115, 481, 275]}}, ... \n``` """ ``` ## 🤝 Acknowledgements Original code and models [Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct). ## Contacts

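#### Process longer videos

A minimal sketch of the memory idea mentioned in the Tokens section: the video is processed in chunks of up to 16 frames, and the predictions for the last few frames of the previous chunk are appended to the next user prompt as plain JSON context. This card does not prescribe an exact memory format, so the extra prompt line, the `MEMORY_FRAMES` value, and the chunking loop are illustrative assumptions; the sketch reuses `model`, `processor`, the prompt templates, `read_video`, and `parse_tracking_response` from the sections above.

```python
import json

CHUNK_FRAMES = 16   # the adapter was trained with at most 16 frames per request
MEMORY_FRAMES = 4   # assumption: how many previous predicted frames to carry over as memory
TOTAL_FRAMES = 64   # example video length, set to your own value

def track_chunk(video_chunk, memory_predictions=None):
    """Run one tracking request, optionally conditioning on previous predictions."""
    user_prompt = (
        user_prompt_template
        .replace("{N_FRAMES}", f"{video_chunk.shape[0]}")
        .replace("{OBJECTS_FOR_TRACKING}", objects_for_tracking)
    )
    if memory_predictions:
        # Assumption: earlier predictions are passed as plain JSON text so the model
        # can keep object IDs consistent between chunks.
        user_prompt += f"\nPrevious frames predictions: {json.dumps(memory_predictions)}"
    conversation = [
        {"role": "system", "content": [{"type": "text", "text": system_prompt}]},
        {"role": "user", "content": [{"type": "video", "fps": 16}, {"type": "text", "text": user_prompt}]},
    ]
    prompts = processor.apply_chat_template(conversation=conversation, add_generation_prompt=True)
    inputs = processor(text=[prompts], videos=[video_chunk], return_tensors="pt").to(device)
    outputs = model.generate(**inputs, do_sample=True, temperature=0.9, top_k=5, top_p=1.0, max_new_tokens=1024 * 2)
    output_text = processor.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
    return parse_tracking_response(output_text)

all_predictions = {}
memory = None
for start_frame in range(0, TOTAL_FRAMES, CHUNK_FRAMES):
    chunk, fps = read_video(video_path="path to video.mp4", start_frame=start_frame,
                            frames_count=CHUNK_FRAMES, max_side=896)
    chunk_predictions = track_chunk(chunk, memory_predictions=memory)
    # Re-index chunk-local frame ids to global frame ids.
    for frame_id, objects in chunk_predictions.items():
        all_predictions[str(start_frame + int(frame_id))] = objects
    # Keep only the last MEMORY_FRAMES predicted frames as memory for the next chunk.
    last_ids = sorted(chunk_predictions, key=int)[-MEMORY_FRAMES:]
    memory = {frame_id: chunk_predictions[frame_id] for frame_id in last_ids}
```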
## 🤝 Acknowledgements

Original code and models: [Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct).

## Contacts

Issues should be raised directly in the repository. For professional support and recommendations, please contact welcomedenk@gmail.com.