Add link to paper and code
The model card is already extensively documented. This PR adds a link to the paper page so that the model can be found at https://huggingface.co/papers/2502.13923. It also adds links to the official code repository and the project page.
README.md (CHANGED)
@@ -1,13 +1,13 @@
 ---
-license: apache-2.0
+base_model:
+- Qwen/Qwen2.5-VL-7B-Instruct
 language:
 - en
+library_name: transformers
+license: apache-2.0
 pipeline_tag: image-text-to-text
 tags:
 - multimodal
-library_name: transformers
-base_model:
-- Qwen/Qwen2.5-VL-7B-Instruct
 ---
 
 # Qwen2.5-VL-7B-Instruct-AWQ
@@ -30,7 +30,6 @@ In the past five months since Qwen2-VL’s release, numerous developers have bui
 
 * **Generating structured outputs**: for data like scans of invoices, forms, tables, etc. Qwen2.5-VL supports structured outputs of their contents, benefiting usages in finance, commerce, etc.
 
-
 #### Model Architecture Updates:
 
 * **Dynamic Resolution and Frame Rate Training for Video Understanding**:
@@ -46,15 +45,10 @@ We extend dynamic resolution to the temporal dimension by adopting dynamic FPS s
 
 We enhance both training and inference speeds by strategically implementing window attention into the ViT. The ViT architecture is further optimized with SwiGLU and RMSNorm, aligning it with the structure of the Qwen2.5 LLM.
 
-
-We have three models with 3, 7 and 72 billion parameters. This repo contains the instruction-tuned 7B Qwen2.5-VL model with AWQ. For more information, visit our [Blog](https://qwenlm.github.io/blog/qwen2.5-vl/) and [GitHub](https://github.com/QwenLM/Qwen2.5-VL).
-
-
+We have three models with 3, 7 and 72 billion parameters. This repo contains the instruction-tuned 7B Qwen2.5-VL model with AWQ. For more information, visit our [Blog](https://qwenlm.github.io/blog/qwen2.5-vl/), [GitHub](https://github.com/QwenLM/Qwen2.5-VL/) and the project page at [Qwen Chat](https://chat.qwenlm.ai/) and the technical report available at https://huggingface.co/papers/2502.13923.
 
 ## Evaluation
 
-
-
 ## Requirements
 The code of Qwen2.5-VL has been in the latest Hugging face transformers and we advise you to build from source with command:
 ```
@@ -65,7 +59,6 @@ or you might encounter the following error:
 KeyError: 'qwen2_5_vl'
 ```
 
-
 ## Quickstart
 
 Below, we provide simple examples to show how to use Qwen2.5-VL with 🤖 ModelScope and 🤗 Transformers.
@@ -79,11 +72,10 @@ or you might encounter the following error:
 KeyError: 'qwen2_5_vl'
 ```
 
-
 We offer a toolkit to help you handle various types of visual input more conveniently, as if you were using an API. This includes base64, URLs, and interleaved images and videos. You can install it using the following command:
 
 ```bash
-# It's highly
+# It's highly recommended to use `[decord]` feature for faster video loading.
 pip install qwen-vl-utils[decord]==0.0.8
 ```
 
@@ -94,7 +86,7 @@ If you are not using Linux, you might not be able to install `decord` from PyPI.
 Here we show a code snippet to show you how to use the chat model with `transformers` and `qwen_vl_utils`:
 
 ```python
-from transformers import Qwen2_5_VLForConditionalGeneration,
+from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
 from qwen_vl_utils import process_vision_info
 
 # default: Load the model on the available device(s)
@@ -156,12 +148,189 @@ output_text = processor.batch_decode(
 )
 print(output_text)
 ```
+<details>
+<summary>Multi image inference</summary>
+
+```python
+# Messages containing multiple images and a text query
+messages = [
+    {
+        "role": "user",
+        "content": [
+            {"type": "image", "image": "file:///path/to/image1.jpg"},
+            {"type": "image", "image": "file:///path/to/image2.jpg"},
+            {"type": "text", "text": "What are the common elements in these pictures?"},
+        ],
+    }
+]
+
+# Preparation for inference
+text = processor.apply_chat_template(
+    messages, tokenize=False, add_generation_prompt=True
+)
+image_inputs, video_inputs = process_vision_info(messages)
+inputs = processor(
+    text=[text],
+    images=image_inputs,
+    videos=video_inputs,
+    padding=True,
+    return_tensors="pt",
+)
+inputs = inputs.to("cuda")
+
+# Inference
+generated_ids = model.generate(**inputs, max_new_tokens=128)
+generated_ids_trimmed = [
+    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
+]
+output_text = processor.batch_decode(
+    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
+)
+print(output_text)
+```
+</details>
+
+<details>
+<summary>Video inference</summary>
+
+```python
+# Messages containing a images list as a video and a text query
+messages = [
+    {
+        "role": "user",
+        "content": [
+            {
+                "type": "video",
+                "video": [
+                    "file:///path/to/frame1.jpg",
+                    "file:///path/to/frame2.jpg",
+                    "file:///path/to/frame3.jpg",
+                    "file:///path/to/frame4.jpg",
+                ],
+            },
+            {"type": "text", "text": "Describe this video."},
+        ],
+    }
+]
+
+# Messages containing a local video path and a text query
+messages = [
+    {
+        "role": "user",
+        "content": [
+            {
+                "type": "video",
+                "video": "file:///path/to/video1.mp4",
+                "max_pixels": 360 * 420,
+                "fps": 1.0,
+            },
+            {"type": "text", "text": "Describe this video."},
+        ],
+    }
+]
+
+# Messages containing a video url and a text query
+messages = [
+    {
+        "role": "user",
+        "content": [
+            {
+                "type": "video",
+                "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-VL/space_woaudio.mp4",
+            },
+            {"type": "text", "text": "Describe this video."},
+        ],
+    }
+]
+
+# Preparation for inference
+text = processor.apply_chat_template(
+    messages, tokenize=False, add_generation_prompt=True
+)
+image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
+inputs = processor(
+    text=[text],
+    images=image_inputs,
+    videos=video_inputs,
+    fps=fps,
+    padding=True,
+    return_tensors="pt",
+    **video_kwargs,
+)
+inputs = inputs.to("cuda")
+
+# Inference
+generated_ids = model.generate(**inputs, max_new_tokens=128)
+generated_ids_trimmed = [
+    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
+]
+output_text = processor.batch_decode(
+    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
+)
+print(output_text)
+```
+
+Video URL compatibility largely depends on the third-party library version. The details are in the table below. change the backend by `FORCE_QWENVL_VIDEO_READER=torchvision` or `FORCE_QWENVL_VIDEO_READER=decord` if you prefer not to use the default one.
+
+| Backend | HTTP | HTTPS |
+|-------------|------|-------|
+| torchvision >= 0.19.0 | ✅ | ✅ |
+| torchvision < 0.19.0 | ❌ | ❌ |
+| decord | ✅ | ❌ |
+</details>
+
+<details>
+<summary>Batch inference</summary>
+
+```python
+# Sample messages for batch inference
+messages1 = [
+    {
+        "role": "user",
+        "content": [
+            {"type": "image", "image": "file:///path/to/image1.jpg"},
+            {"type": "image", "image": "file:///path/to/image2.jpg"},
+            {"type": "text", "text": "What are the common elements in these pictures?"},
+        ],
+    }
+]
+messages2 = [
+    {"role": "system", "content": "You are a helpful assistant."},
+    {"role": "user", "content": "Who are you?"},
+]
+# Combine messages for batch processing
+messages = [messages1, messages2]
+
+# Preparation for batch inference
+texts = [
+    processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
+    for msg in messages
+]
+image_inputs, video_inputs = process_vision_info(messages)
+inputs = processor(
+    text=texts,
+    images=image_inputs,
+    videos=video_inputs,
+    padding=True,
+    return_tensors="pt",
+)
+inputs = inputs.to("cuda")
+
+# Batch Inference
+generated_ids = model.generate(**inputs, max_new_tokens=128)
+generated_ids_trimmed = [
+    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
+]
+output_texts = processor.batch_decode(
+    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
+)
+print(output_texts)
+```
+</details>
 
 ### 🤖 ModelScope
 We strongly advise users especially those in mainland China to use ModelScope. `snapshot_download` can help you solve issues concerning downloading checkpoints.
 
-
 ### More Usage Tips
 
 For input images, we support local files, base64, and URLs. For videos, we currently only support local files.
@@ -207,18 +376,18 @@ The model supports a wide range of resolution inputs. By default, it uses the na
 min_pixels = 256 * 28 * 28
 max_pixels = 1280 * 28 * 28
 processor = AutoProcessor.from_pretrained(
-    "Qwen/Qwen2.5-VL-7B-Instruct
+    "Qwen/Qwen2.5-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
 )
 ```
 
 Besides, We provide two methods for fine-grained control over the image size input to the model:
 
-1.
+1. Specify exact dimensions: Directly set `resized_height` and `resized_width`. These values will be rounded to the nearest multiple of 28.
 
-2.
+2. Define min_pixels and max_pixels: Images will be resized to maintain their aspect ratio within the range of min_pixels and max_pixels.
 
 ```python
-#
+# resized_height and resized_width
 messages = [
     {
         "role": "user",
@@ -233,7 +402,7 @@ messages = [
         ],
     }
 ]
-#
+# min_pixels and max_pixels
 messages = [
     {
         "role": "user",
@@ -250,55 +419,115 @@ messages =
 ]
 ```
 
-
-
-
-
-{
-    ...,
-    "type": "yarn",
-    "mrope_section": [
-        16,
-        24,
-        24
-    ],
-    "factor": 4,
-    "original_max_position_embeddings": 32768
-}
-
-
-
-
-
-
-- DocVQA_VAL (Accuracy)
-- MMBench_DEV_EN (Accuracy)
-- MathVista_MINI (Accuracy)
-
-
-
-| Qwen2.5-VL-72B-Instruct | BF16<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct)[🤖](https://modelscope.cn/models/qwen/Qwen2.5-VL-72B-Instruct)) | 70.0 | 96.1 | 88.2 | 75.3 |
-| | AWQ<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct-AWQ)[🤖](https://modelscope.cn/models/qwen/Qwen2.5-VL-72B-Instruct-AWQ)) | 69.1 | 96.0 | 87.9 | 73.8 |
-| Qwen2.5-VL-7B-Instruct | BF16<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)[🤖](https://modelscope.cn/models/qwen/Qwen2.5-VL-7B-Instruct)) | 58.4 | 94.9 | 84.1 | 67.9 |
-| | AWQ<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct-AWQ)[🤖](https://modelscope.cn/models/qwen/Qwen2.5-VL-7B-Instruct-AWQ)) | 55.6 | 94.6 | 84.2 | 64.7 |
-| Qwen2.5-VL-3B-Instruct | BF16<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct)[🤖](https://modelscope.cn/models/qwen/Qwen2.5-VL-3B-Instruct)) | 51.7 | 93.0 | 79.8 | 61.4 |
-| | AWQ<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct-AWQ)[🤖](https://modelscope.cn/models/qwen/Qwen2.5-VL-3B-Instruct-AWQ)) | 49.1 | 91.8 | 78.0 | 58.8 |
+#### Add ids for Multiple Image Inputs
+By default, images and video content are directly included in the conversation. When handling multiple images, it's helpful to add labels to the images and videos for better reference. Users can control this behavior with the following settings:
+<details>
+<summary>Add vision ids</summary>
+
+```python
+conversation = [
+    {
+        "role": "user",
+        "content": [{"type": "image"}, {"type": "text", "text": "Hello, how are you?"}],
+    },
+    {
+        "role": "assistant",
+        "content": "I'm doing well, thank you for asking. How can I assist you today?",
+    },
+    {
+        "role": "user",
+        "content": [
+            {"type": "text", "text": "Can you describe these images and video?"},
+            {"type": "image"},
+            {"type": "image"},
+            {"type": "video"},
+            {"type": "text", "text": "These are from my vacation."},
+        ],
+    },
+    {
+        "role": "assistant",
+        "content": "I'd be happy to describe the images and video for you. Could you please provide more context about your vacation?",
+    },
+    {
+        "role": "user",
+        "content": "It was a trip to the mountains. Can you see the details in the images and video?",
+    },
+]
+
+# default:
+prompt_without_id = processor.apply_chat_template(
+    conversation, add_generation_prompt=True
+)
+# Excepted output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Hello, how are you?<|im_end|>\n<|im_start|>assistant\nI'm doing well, thank you for asking. How can I assist you today?<|im_end|>\n<|im_start|>user\nCan you describe these images and video?<|vision_start|><|image_pad|><|vision_end|><|vision_start|><|image_pad|><|vision_end|><|vision_start|><|video_pad|><|vision_end|>These are from my vacation.<|im_end|>\n<|im_start|>assistant\nI'd be happy to describe the images and video for you. Could you please provide more context about your vacation?<|im_end|>\n<|im_start|>user\nIt was a trip to the mountains. Can you see the details in the images and video?<|im_end|>\n<|im_start|>assistant\n'
+
+# add ids
+prompt_with_id = processor.apply_chat_template(
+    conversation, add_generation_prompt=True, add_vision_id=True
+)
+# Excepted output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nPicture 1: <|vision_start|><|image_pad|><|vision_end|>Hello, how are you?<|im_end|>\n<|im_start|>assistant\nI'm doing well, thank you for asking. How can I assist you today?<|im_end|>\n<|im_start|>user\nCan you describe these images and video?Picture 2: <|vision_start|><|image_pad|><|vision_end|>Picture 3: <|vision_start|><|image_pad|><|vision_end|>Video 1: <|vision_start|><|video_pad|><|vision_end|>These are from my vacation.<|im_end|>\n<|im_start|>assistant\nI'd be happy to describe the images and video for you. Could you please provide more context about your vacation?<|im_end|>\n<|im_start|>user\nIt was a trip to the mountains. Can you see the details in the images and video?<|im_end|>\n<|im_start|>assistant\n'
+```
+</details>
+
+#### Flash-Attention 2 to speed up generation
+
+First, make sure to install the latest version of Flash Attention 2:
+
+```bash
+pip install -U flash-attn --no-build-isolation
+```
+
+Also, you should have a hardware that is compatible with Flash-Attention 2. Read more about it in the official documentation of the [flash attention repository](https://github.com/Dao-AILab/flash-attention). FlashAttention-2 can only be used when a model is loaded in `torch.float16` or `torch.bfloat16`.
+
+To load and run a model using Flash Attention-2, simply add `attn_implementation="flash_attention_2"` when loading the model as follows:
+
+```python
+from transformers import Qwen2_5_VLForConditionalGeneration
+
+model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
+    "Qwen/Qwen2.5-VL-7B-Instruct",
+    torch_dtype=torch.bfloat16,
+    attn_implementation="flash_attention_2",
+)
+```
+
+### Try Qwen2.5-VL-72B with API!
+
+To explore Qwen2.5-VL-72B, a more fascinating multimodal model, we encourage you to test our cutting-edge API service. Let's start the exciting journey right now!
+
+#### Installation
+```bash
+pip install dashscope
+```
+
+#### Examples
+```python
+import dashscope
+
+
+dashscope.api_key = "your_api_key"
+
+messages = [{
+    'role': 'user',
+    'content': [
+        {
+            'image': "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"
+        },
+        {
+            'text': 'What are in the image?'
+        },
+    ]
+}]
+
+response = dashscope.MultiModalConversation.call(model='qwen2.5-vl-72b-instruct', messages=messages)
+print(response)
+```
+
+For more usage, please refer to the tutorial at [aliyun](https://help.aliyun.com/zh/model-studio/developer-reference/qwen-vl-api).
 
 ## Citation
 
-If you find our work helpful, feel free to give us a cite.
+If you find our paper and code useful in your research, please consider giving a star :star: and citation :pencil: :)
 
 ```
 @misc{qwen2.5-VL,
@@ -322,4 +551,8 @@ If you find our work helpful, feel free to give us a cite.
   journal={arXiv preprint arXiv:2308.12966},
   year={2023}
 }
-```
+```
+
+<br>
+
+Code: https://github.com/QwenLM/Qwen2.5-VL