Qwen2.5-VL-3B-TrackAnyObject-LoRa-v1
Introduction
Qwen2.5-VL was not originally trained for object tracking tasks. While it can perform object detection on individual frames or across video inputs, processing N frames sequentially results in identical predictions for each frame. Consequently, the model cannot maintain consistent object IDs across predictions.
We provide a LoRA adapter for Qwen2.5-VL-3B that enables object tracking capabilities.
Key Enhancement:
- Object tracking: supports frame-by-frame tracking of arbitrary objects with consistent object IDs across frames.
Training info
The LoRA adapter for Qwen2.5-VL-3B was trained with the following parameters (see the configuration sketch below the list):
- Max video side: 896
- Max frame count: 16
- LoRA rank: 64
- LoRA alpha: 128
- Epochs: 30 (about 9k steps)
- Training dataset: TAO-Amodal
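For reference, here is a minimal sketch of a matching PEFT configuration. Only the rank and alpha come from the list above; the target modules, dropout, and other settings are assumptions, since the exact training setup is not published here.
```python
from peft import LoraConfig

## Sketch only: r and lora_alpha match the values above, the rest are assumed.
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  ## assumed attention projections
    lora_dropout=0.05,  ## assumed value
    task_type="CAUSAL_LM",
)
```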
Output format:
{
    {frame_id:str}: {
        {object_name:str}: {
            {object_id:str}: [x_top:int, y_top:int, x_bottom:int, y_bottom:int],
            ...
        },
        ...
    },
    ...
}
Tokens
The input token count depends on the video size and frame count. Example: a video of shape (16, 504, 896) with a single tracked object that is present in every frame.
- Input token count: 4759
- Output token count: 492
You can reuse the last N predicted frames as memory for the next iteration; this way you can process videos of arbitrary length. A rough sketch of this sliding-window idea is shown below.
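The snippet below is only a sketch of that sliding-window idea, not a published recipe. `run_tracking(frames, memory)` is a hypothetical helper that wraps the inference code from the How to section, and `MEMORY_FRAMES` is an assumed value.
```python
CHUNK_SIZE = 16     ## matches the max frame count used during training
MEMORY_FRAMES = 4   ## assumed number of predicted frames to carry over as memory

def track_long_video(video, run_tracking):
    ## `video` is a numpy array of shape (num_frames, height, width, channels).
    ## `run_tracking(frames, memory)` is a hypothetical user-defined helper that runs
    ## the inference code from the How to section and returns the per-frame predictions dict.
    all_predictions = []
    memory = None
    step = CHUNK_SIZE - MEMORY_FRAMES
    for start in range(0, len(video), step):
        chunk = video[start:start + CHUNK_SIZE]
        predictions = run_tracking(chunk, memory)
        all_predictions.append((start, predictions))
        ## Reuse the last MEMORY_FRAMES predicted frames as memory for the next window.
        memory = {k: predictions[k] for k in list(predictions)[-MEMORY_FRAMES:]}
    return all_predictions
```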
How to
Requirements
pip install -U torch transformers peft denku
Imports
import torch
from peft import PeftModel
from transformers import Qwen2_5_VLForConditionalGeneration, Qwen2_5_VLProcessor
from denku import read_video
Define prompt templates
system_prompt = """
You are professional video assistant.
You get a video consisting of N frames. Track objects in a video for each frame. Do prediction frame by frame.
For each object in user request output unique ID of each object and coordinates of bounding box.
Provide the result in json format. Output format:
```json
{
{frame_id:str}: {
{object_name:str}: {
{object_id:str}: [x_top, y_top, x_bottom, y_bottom],
}
}
}\n```
Use additional parameters and instructions from user request.
"""
user_prompt_template = """
Extract information from the video.
Video consists of {N_FRAMES} frames.
Track objects in this video: [{OBJECTS_FOR_TRACKING}].
"""
Load model and LoRA adapter
model_name = "Qwen/Qwen2.5-VL-3B-Instruct"
## Load the base model in bfloat16 with automatic device placement
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = Qwen2_5_VLProcessor.from_pretrained(model_name, use_fast=False)

adapter_name = "TheDenk/Qwen2.5-VL-3B-TrackAnyObject-LoRa-v1"
## Attach the LoRA adapter in inference-only mode
model = PeftModel.from_pretrained(model, adapter_name, is_trainable=False)
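Optionally, the adapter weights can be merged into the base model for slightly faster inference. This uses the standard PEFT `merge_and_unload` call; it is not required and is not part of the original example.
```python
## Optional: merge the LoRA weights into the base model (standard PEFT API).
model = model.merge_and_unload()
model.eval()
```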
Prepare video
device = "cuda"
objects_for_tracking = "person" ## "person, cat", "person, cat, dog"
## Load video and convert to numpy array of shape (num_frames, height, width, channels)
video, fps = read_video(video_path="path to video.mp4", start_frame=0, frames_count=16, max_side=896)
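If you prefer not to depend on denku, any loader that returns an RGB uint8 array of shape (num_frames, height, width, 3) plus the fps should work. Below is a minimal OpenCV-based sketch with simplified resizing, offered as an assumed equivalent rather than a drop-in replacement.
```python
import cv2
import numpy as np

def read_video_cv2(video_path, start_frame=0, frames_count=16, max_side=896):
    ## Minimal alternative loader returning (frames, fps), mirroring how read_video is used above.
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    cap.set(cv2.CAP_PROP_POS_FRAMES, start_frame)
    frames = []
    while len(frames) < frames_count:
        ok, frame = cap.read()
        if not ok:
            break
        height, width = frame.shape[:2]
        scale = max_side / max(height, width)
        if scale < 1.0:
            frame = cv2.resize(frame, (int(width * scale), int(height * scale)))
        frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return np.stack(frames), fps
```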
Run inference
user_prompt = user_prompt_template.replace("{N_FRAMES}", f"{video.shape[0]}").replace("{OBJECTS_FOR_TRACKING}", objects_for_tracking)
conversation = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": system_prompt}
        ]
    },
    {
        "role": "user",
        "content": [
            {"type": "video", "fps": 16},
            {"type": "text", "text": user_prompt}
        ]
    }
]
prompts = processor.apply_chat_template(conversation=conversation, add_generation_prompt=True)
inputs = processor(
    text=[prompts],
    videos=[video],
    return_tensors="pt"
)
inputs = inputs.to(device)
## Generate tracking predictions
outputs = model.generate(**inputs, do_sample=True, temperature=0.9, top_k=5, top_p=1.0, max_new_tokens=1024 * 2)
print(f"[ TOKENS COUNT ] [INPUT: {inputs.input_ids.shape[1]} | OUTPUT: {outputs[0][inputs.input_ids.shape[1]:].shape[0]}]")
## Decode only the newly generated tokens
output_text = processor.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
Output example:
"""
```json
{"0": {"person": {"0": [423, 113, 481, 275]}}, "1": {"person": {"0": [425, 115, 481, 275]}}, ... \n```
"""
Acknowledgements
Original code and model: Qwen2.5-VL-3B-Instruct.
Contacts
Issues should be raised directly in the repository. For professional support and recommendations, please contact [email protected].