---
pipeline_tag: video-text-to-text
library_name: transformers
---

# TimeZero: Temporal Video Grounding with Reasoning-Guided LVLM

<div style='display:flex; gap: 0.25rem; '>
<a href='./TimeZero_TechReport.pdf'><img src='https://img.shields.io/badge/Paper-PDF-red'></a>
<a href='https://huggingface.co/wwwyyy/TimeZero-Charades-7B'><img src='https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Checkpoint-blue'></a>
</div>

### Updates

- 2025-03-17: TimeZero initial release! Code and evaluation scripts are now available.
- 2025-03-17: TimeZero achieves SOTA performance on Charades-STA!

### Overview

TimeZero is a reasoning-guided Large Vision-Language Model (LVLM) for Temporal Video Grounding (TVG): given a natural language query, it identifies the corresponding temporal segment in a video. TimeZero is trained entirely with reinforcement learning, which leads the model to reason explicitly about video-language relationships *during inference*.

Key Features:

* **Reinforcement Learning Training:** TimeZero is trained *entirely* with reinforcement learning, enhancing its ability to generate accurate temporal boundaries.
* **Test-Time Reasoning:** The model exhibits emergent reasoning at inference time, generating a chain of thought to justify its segment predictions.
* **SOTA Performance:** TimeZero sets a new SOTA on the Charades-STA benchmark.

This README provides an overview of TimeZero, including setup instructions, the training process, and evaluation guidelines.

**Example:**

![image](https://github.com/user-attachments/assets/f5ac9e6b-58f5-41e9-878d-a5ae5045b155)

**Training Visualization:**

![0a466a4bca3bb8d9b2a2af0f15890b4](https://github.com/user-attachments/assets/df1c35f5-8c30-400b-bce6-14e1f766752c)

## Setup

```bash
# Option 1: build the environment directly from the provided spec.
conda env create -f environment.yml
conda activate timezero

# Option 2: create a bare Python 3.11 environment and install the dependencies yourself.
# conda create -n timezero python=3.11
```
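
If you only want to try the released checkpoint, the sketch below shows minimal inference with 🤗 Transformers. It assumes `wwwyyy/TimeZero-Charades-7B` keeps the standard Qwen2.5-VL interface (`Qwen2_5_VLForConditionalGeneration` plus `qwen_vl_utils`); the video path and the query prompt are illustrative placeholders, not the exact prompt template used in training.

```python
# Minimal inference sketch; assumes the checkpoint follows the standard Qwen2.5-VL API.
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

model_id = "wwwyyy/TimeZero-Charades-7B"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Illustrative request: localize a text query in a local video (placeholder path and prompt).
messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "file:///path/to/video.mp4"},
        {"type": "text", "text": "When does the following event happen in the video? Query: a person puts a book on a shelf."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```
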
## Training

TimeZero training involves the following steps:

1. **Data Preprocessing:**

Download the [Charades-STA](https://github.com/jiyanggao/TALL#charades-sta-anno-download) and [ActivityNet](https://cs.stanford.edu/people/ranjaykrishna/densevid/) datasets.

Before training, preprocess the video data:

```bash
bash preprocess_video.sh
```

The script needs the path to the Charades-STA data (video files, annotations, etc.); edit it to match where you placed the dataset.
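
For orientation, the paths hard-coded in `run_grpo_video.sh` below suggest a layout along these lines (inferred from the script arguments, so adjust to your own setup):

```
Charades/
├── Charades_v1/                          # raw videos            (--video_folder)
└── charades_annotation/
    ├── train.json                        # --train_data_path
    └── val.json                          # --eval_data_path
Charades_preprocessed_data_maxpix_3584/   # presumably written by preprocess_video.sh (--preprocessed_data_path)
```
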

2. **GRPO Training:**

```bash
cd scripts
bash run_grpo_video.sh
```

**`run_grpo_video.sh`**

```bash
#!/bin/bash

export DEBUG_MODE="false" # Set to "true" for verbose logging during training.
export LOG_PATH="./debug_log.txt"

# OUTDIR (checkpoint output directory) and WANDB_NAME (Weights & Biases run name)
# are referenced below and must be set in your environment before launching.
torchrun --nproc_per_node="4" \
    --nnodes="1" \
    --node_rank="0" \
    --master_addr="127.0.0.1" \
    --master_port="12361" \
    src/open_r1/grpo_video.py \
    --deepspeed scripts/zero3_offload.json \
    --output_dir $OUTDIR \
    --model_name_or_path mllm/Qwen2.5-VL-7B-Instruct \
    --preprocessed_data_path ./Charades_preprocessed_data_maxpix_3584 \
    --train_data_path ./Charades/charades_annotation/train.json \
    --eval_data_path ./Charades/charades_annotation/val.json \
    --video_folder ./Charades/Charades_v1 \
    --dataset_name xxx \
    --max_prompt_length 8192 \
    --max_completion_length 1024 \
    --num_generations 8 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 2 \
    --logging_steps 1 \
    --bf16 \
    --torch_dtype bfloat16 \
    --data_seed 42 \
    --gradient_checkpointing true \
    --attn_implementation flash_attention_2 \
    --num_train_epochs 2 \
    --run_name $WANDB_NAME \
    --report_to wandb \
    --save_steps 50 \
    --save_only_model true
```

## Evaluation

After training, evaluate your model's performance:

```bash
bash scripts/evaluate.sh
```

**`evaluate.sh`**

```bash
python evaluate.py --model_base <path_to_your_trained_model> --dataset <charades or activitynet>
```

> The evaluation script (`evaluate.py`) needs to be implemented to load your model, process the test data, and calculate the relevant metrics (R1@0.3, R1@0.5, R1@0.7, etc.).
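
As a reference for those metrics, R1@m is the fraction of queries whose top-1 predicted segment reaches a temporal IoU of at least m with the ground-truth segment. A minimal sketch (illustrative names, not the repository's actual `evaluate.py`):

```python
# Sketch of R1@{0.3, 0.5, 0.7} for temporal grounding; not the repository's evaluate.py.
def temporal_iou(pred, gt):
    """IoU between two (start, end) segments given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_1(preds, gts, thresholds=(0.3, 0.5, 0.7)):
    """preds / gts: lists of (start, end) pairs, one prediction and one ground truth per query."""
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    return {f"R1@{t}": sum(iou >= t for iou in ious) / len(ious) for t in thresholds}

# Toy example: the first prediction overlaps well, the second misses entirely.
print(recall_at_1([(2.0, 7.5), (0.0, 3.0)], [(2.5, 8.0), (10.0, 15.0)]))
# -> {'R1@0.3': 0.5, 'R1@0.5': 0.5, 'R1@0.7': 0.5}
```
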

## Results

- **Charades-STA (Finetuned)**

TimeZero outperforms previous state-of-the-art methods by a large margin.

| Method                | Type | R1@0.3 | R1@0.5 | R1@0.7 |
| --------------------- | ---- | ------ | ------ | ------ |
| EaTR (VLP sota)       | VLP  | -      | 68.4   | 44.9   |
| TimeSuite (LVLM sota) | SFT  | 79.4   | 67.1   | 43.0   |
| TimeZero (ours)       | RL   | 83.3   | 72.5   | 47.9   |

- **ActivityNet (Finetuned)**

TimeZero surpasses previous state-of-the-art LVLMs.

| Method            | Type | R1@0.3 | R1@0.5 | R1@0.7 |
| ----------------- | ---- | ------ | ------ | ------ |
| EaTR (VLP sota)   | VLP  | -      | 58.18  | 37.64  |
| TRACE (LVLM sota) | SFT  | 54.0   | 37.7   | 24.0   |
| TimeZero (ours)   | RL   | 68.6   | 47.3   | 26.9   |

## Acknowledgements

We thank the authors of the following projects for their contributions:

* [TRACE](https://github.com/gyxxyg/TRACE)
* [R1-V](https://github.com/Deep-Agent/R1-V)
* [Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL)

## Citation

```bibtex
@article{wang2025timezero,
  title={TimeZero: Temporal Video Grounding with Reasoning-Guided LVLM},
  author={Wang, Ye and Xu, Boshen and Yue, Zihao and Xiao, Zihan and Wang, Ziheng and Zhang, Liang and Yang, Dingyi and Wang, Wenxuan and Jin, Qin},
  journal={arXiv preprint},
  year={2025}
}
```