---
license_name: qwen-research
license_link: https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct/blob/main/LICENSE
language:
- en
pipeline_tag: image-text-to-text
tags:
- multimodal
- tracking
- lora
library_name: transformers
---

# Qwen2.5-VL-3B-TrackAnyObject-LoRa-v1


<video controls autoplay src="https://cdn-uploads.huggingface.co/production/uploads/63fde49f6315a264aba6a7ed/cPo3S-tuu3UgV9_aIOhU1.mp4"></video>

## Introduction
Qwen2.5-VL was not originally trained for object tracking tasks. While it can perform object detection on individual frames or across video inputs, processing N frames sequentially results in identical predictions for each frame. Consequently, the model cannot maintain consistent object IDs across predictions.     
We provide a LoRA adapter for Qwen2.5-VL-3B that enables object tracking capabilities.

### 🚀 Key Enhancement:
* **Possibility of object tracking** - Supports frame-by-frame tracking of arbitrary objects.


### 📝 Training info
LoRa for Qwen2.5-VL-3B was trained with next parameters:  
Max video side: 896  
Max frames count: 16  
LoRa rank: 64   
LoRa alpha: 128    
Epochs: 30 (about 9k steps)    

Dataset for training: <a href="https://huggingface.co/datasets/chengyenhsieh/TAO-Amodal">TAO-Amodal</a>.  

#### Output format:
```json
{
    {frame_id:str}: {
        {object_name:str}: {
            {object_id:str}: [x_top:int, y_top:int, x_bottom:int, y_bottom:int],
            ...
        },
        ...
    },
    ...
}
```

##### Tokens
Input Tokens count depends on video size and frames count.
Example:
Video size (16, 504, 896) and single object (object exists on each frame). 
- Input tokens count: 4759
- Output tokens count: 492


`You can reuse last N predicted frames as memory for next iterations. This way you can process videos of any length.`

## 🛠️ How to

### Requirements
```bash
pip install -U torch transformers peft denku
```

### imports
```python
import torch
from peft import PeftModel
from transformers import Qwen2_5_VLForConditionalGeneration, Qwen2_5_VLProcessor
from denku import read_video
```

### Define prompt templates
```python
system_prompt = """
You are professional video assistant.
You get a video consisting of N frames. Track objects in a video for each frame. Do prediction frame by frame.
For each object in user request output unique ID of each object and coordinates of bounding box.
Provide the result in json format. Output format: 

```json
{
    {frame_id:str}: {
        {object_name:str}: {
            {object_id:str}: [x_top, y_top, x_bottom, y_bottom],
        }
    }
}\n```

Use additional parameters and instructions from user request.
"""

user_prompt_template = """
Extract information from the video.
Video consists of {N_FRAMES} frames.
Track objects in this video: [{OBJECTS_FOR_TRACKING}].
"""

```

### Load model and LoRa
```python
model_name = "Qwen/Qwen2.5-VL-3B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = Qwen2_5_VLProcessor.from_pretrained(model_name, use_fast=False)

adapter_name = "TheDenk/Qwen2.5-VL-3B-TrackAnyObject-LoRa-v1"
model = PeftModel.from_pretrained(model, adapter_name, is_trainable=False)
```

### Prepare video
```python
device = "cuda"
objects_for_tracking = "person"  ## "person, cat", "person, cat, dog"

## Load video and convert to numpy array of shape (num_frames, height, width, channels)
video, fps = read_video(video_path="path to video.mp4", start_frame=0, frames_count=16, max_side=896)
```

### Run inference
```python
user_prompt = user_prompt_template.replace("{N_FRAMES}", f"{video.shape[0]}").replace("{OBJECTS_FOR_TRACKING}", objects_for_tracking)
conversation = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": system_prompt}
        ]
    },
    {
        "role": "user",
        "content": [
            {"type": "video", "fps": 16},
            {"type": "text", "text": user_prompt}
        ]
    }
]
prompts = processor.apply_chat_template(conversation=conversation, add_generation_prompt=True) 
inputs = processor(
    text=[prompts], 
    videos=[video], 
    return_tensors="pt"
)
inputs = inputs.to(device)

outputs = model.generate(**inputs, do_sample=True, temperature=0.9, top_k=5, top_p=1.0, max_new_tokens=1024 * 2)
print(f"[ TOKENS COUNT ] [INPUT: {inputs.input_ids.shape[1]} | OUTPUT: {outputs[0][inputs.input_ids.shape[1]:].shape[0]}")

output_text = processor.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
```

#### Output example:
```python
"""
```json
{"0": {"person": {"0": [423, 113, 481, 275]}}, "1": {"person": {"0": [425, 115, 481, 275]}}, ... \n```
"""
```

## 🤝 Acknowledgements
Original code and models [Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct).  

## Contacts
<p>Issues should be raised directly in the repository. For professional support and recommendations please <a>welcomedenk@gmail.com</a>.</p>