Qwen2.5-VL-3B-TrackAnyObject-LoRa-v1
Introduction
Qwen2.5-VL was not originally trained for object tracking tasks. While it can perform object detection on individual frames or across video inputs, processing N frames sequentially results in identical predictions for each frame. Consequently, the model cannot maintain consistent object IDs across predictions.
We provide a LoRA adapter for Qwen2.5-VL-3B that enables object tracking capabilities.
Key Enhancement:
- Object tracking: supports frame-by-frame tracking of arbitrary objects with consistent object IDs across frames.
Training info
The LoRA adapter for Qwen2.5-VL-3B was trained with the following parameters (see the configuration sketch below the list):
- Max video side: 896
- Max frame count: 16
- LoRA rank: 64
- LoRA alpha: 128
- Epochs: 30 (about 9k steps)
- Training dataset: TAO-Amodal
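For reference, here is a minimal sketch of a matching PEFT configuration. Only the rank and alpha come from the list above; the target modules, dropout, and other settings are assumptions, since the exact training setup is not published here.
```python
from peft import LoraConfig

## Sketch only: r and lora_alpha match the values above, the rest are assumed.
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  ## assumed attention projections
    lora_dropout=0.05,  ## assumed value
    task_type="CAUSAL_LM",
)
```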
Output format:
{
    {frame_id:str}: {
        {object_name:str}: {
            {object_id:str}: [x_top:int, y_top:int, x_bottom:int, y_bottom:int],
            ...
        },
        ...
    },
    ...
}
Tokens
The input token count depends on the video size and frame count. Example: a video of shape (16, 504, 896) with a single tracked object that is present in every frame.
- Input token count: 4759
- Output token count: 492
You can reuse the last N predicted frames as memory for the next iteration; this way you can process videos of arbitrary length. A rough sketch of this sliding-window idea is shown below.
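The snippet below is only a sketch of that sliding-window idea, not a published recipe. `run_tracking(frames, memory)` is a hypothetical helper that wraps the inference code from the How to section, and `MEMORY_FRAMES` is an assumed value.
```python
CHUNK_SIZE = 16     ## matches the max frame count used during training
MEMORY_FRAMES = 4   ## assumed number of predicted frames to carry over as memory

def track_long_video(video, run_tracking):
    ## `video` is a numpy array of shape (num_frames, height, width, channels).
    ## `run_tracking(frames, memory)` is a hypothetical user-defined helper that runs
    ## the inference code from the How to section and returns the per-frame predictions dict.
    all_predictions = []
    memory = None
    step = CHUNK_SIZE - MEMORY_FRAMES
    for start in range(0, len(video), step):
        chunk = video[start:start + CHUNK_SIZE]
        predictions = run_tracking(chunk, memory)
        all_predictions.append((start, predictions))
        ## Reuse the last MEMORY_FRAMES predicted frames as memory for the next window.
        memory = {k: predictions[k] for k in list(predictions)[-MEMORY_FRAMES:]}
    return all_predictions
```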
How to
Requirements
pip install -U torch transformers peft denku
Imports
import torch
from peft import PeftModel
from transformers import Qwen2_5_VLForConditionalGeneration, Qwen2_5_VLProcessor
from denku import read_video
Define prompt templates
system_prompt = """
You are professional video assistant.
You get a video consisting of N frames. Track objects in a video for each frame. Do prediction frame by frame.
For each object in user request output unique ID of each object and coordinates of bounding box.
Provide the result in json format. Output format:
```json
{
{frame_id:str}: {
{object_name:str}: {
{object_id:str}: [x_top, y_top, x_bottom, y_bottom],
}
}
}\n```
Use additional parameters and instructions from user request.
"""
user_prompt_template = """
Extract information from the video.
Video consists of {N_FRAMES} frames.
Track objects in this video: [{OBJECTS_FOR_TRACKING}].
"""
Load model and LoRA adapter
model_name = "Qwen/Qwen2.5-VL-3B-Instruct"
## Load the base model in bfloat16 with automatic device placement
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = Qwen2_5_VLProcessor.from_pretrained(model_name, use_fast=False)

adapter_name = "TheDenk/Qwen2.5-VL-3B-TrackAnyObject-LoRa-v1"
## Attach the LoRA adapter in inference-only mode
model = PeftModel.from_pretrained(model, adapter_name, is_trainable=False)
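Optionally, the adapter weights can be merged into the base model for slightly faster inference. This uses the standard PEFT `merge_and_unload` call; it is not required and is not part of the original example.
```python
## Optional: merge the LoRA weights into the base model (standard PEFT API).
model = model.merge_and_unload()
model.eval()
```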
Prepare video
device = "cuda"
objects_for_tracking = "person" ## "person, cat", "person, cat, dog"
## Load video and convert to numpy array of shape (num_frames, height, width, channels)
video, fps = read_video(video_path="path to video.mp4", start_frame=0, frames_count=16, max_side=896)
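If you prefer not to depend on denku, any loader that returns an RGB uint8 array of shape (num_frames, height, width, 3) plus the fps should work. Below is a minimal OpenCV-based sketch with simplified resizing, offered as an assumed equivalent rather than a drop-in replacement.
```python
import cv2
import numpy as np

def read_video_cv2(video_path, start_frame=0, frames_count=16, max_side=896):
    ## Minimal alternative loader returning (frames, fps), mirroring how read_video is used above.
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    cap.set(cv2.CAP_PROP_POS_FRAMES, start_frame)
    frames = []
    while len(frames) < frames_count:
        ok, frame = cap.read()
        if not ok:
            break
        height, width = frame.shape[:2]
        scale = max_side / max(height, width)
        if scale < 1.0:
            frame = cv2.resize(frame, (int(width * scale), int(height * scale)))
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return np.stack(frames), fps
```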
Run inference
user_prompt = user_prompt_template.replace("{N_FRAMES}", f"{video.shape[0]}").replace("{OBJECTS_FOR_TRACKING}", objects_for_tracking)
conversation = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": system_prompt}
        ]
    },
    {
        "role": "user",
        "content": [
            {"type": "video", "fps": 16},
            {"type": "text", "text": user_prompt}
        ]
    }
]
prompts = processor.apply_chat_template(conversation=conversation, add_generation_prompt=True)
inputs = processor(
    text=[prompts],
    videos=[video],
    return_tensors="pt"
)
inputs = inputs.to(device)
## Generate tracking predictions
outputs = model.generate(**inputs, do_sample=True, temperature=0.9, top_k=5, top_p=1.0, max_new_tokens=1024 * 2)
print(f"[ TOKENS COUNT ] [INPUT: {inputs.input_ids.shape[1]} | OUTPUT: {outputs[0][inputs.input_ids.shape[1]:].shape[0]}]")
## Decode only the newly generated tokens
output_text = processor.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
Output example:
"""
```json
{"0": {"person": {"0": [423, 113, 481, 275]}}, "1": {"person": {"0": [425, 115, 481, 275]}}, ... \n```
"""
Acknowledgements
Original code and model: Qwen2.5-VL-3B-Instruct.
Contacts
Issues should be raised directly in the repository. For professional support and recommendations, please contact [email protected].