nielsr (HF staff) committed
Commit 5e5daec · verified
1 Parent(s): 7fc93a5

Add link to paper and code


The model card is already extensively documented. This PR adds a link to the paper page, ensuring the model can be found at: https://huggingface.co/papers/2502.13923
It also adds a link to the official code repository and the project page.

Files changed (1)
  1. README.md +289 -56
README.md CHANGED
@@ -1,13 +1,13 @@
1
  ---
2
- license: apache-2.0
 
3
  language:
4
  - en
 
 
5
  pipeline_tag: image-text-to-text
6
  tags:
7
  - multimodal
8
- library_name: transformers
9
- base_model:
10
- - Qwen/Qwen2.5-VL-7B-Instruct
11
  ---
12
 
13
  # Qwen2.5-VL-7B-Instruct-AWQ
@@ -30,7 +30,6 @@ In the past five months since Qwen2-VL’s release, numerous developers have bui
30
 
31
* **Generating structured outputs**: for data like scans of invoices, forms, and tables, Qwen2.5-VL supports structured outputs of their contents, benefiting use cases in finance, commerce, and more.
32
 
33
-
34
  #### Model Architecture Updates:
35
 
36
  * **Dynamic Resolution and Frame Rate Training for Video Understanding**:
@@ -46,15 +45,10 @@ We extend dynamic resolution to the temporal dimension by adopting dynamic FPS s
46
 
47
  We enhance both training and inference speeds by strategically implementing window attention into the ViT. The ViT architecture is further optimized with SwiGLU and RMSNorm, aligning it with the structure of the Qwen2.5 LLM.
48
 
49
-
50
- We have three models with 3, 7 and 72 billion parameters. This repo contains the instruction-tuned 7B Qwen2.5-VL model with AWQ. For more information, visit our [Blog](https://qwenlm.github.io/blog/qwen2.5-vl/) and [GitHub](https://github.com/QwenLM/Qwen2.5-VL).
51
-
52
-
53
 
54
  ## Evaluation
55
 
56
-
57
-
58
  ## Requirements
59
The code of Qwen2.5-VL is in the latest Hugging Face Transformers, and we advise you to build from source with the following command:
60
  ```
@@ -65,7 +59,6 @@ or you might encounter the following error:
65
  KeyError: 'qwen2_5_vl'
66
  ```
67
 
68
-
69
  ## Quickstart
70
 
71
  Below, we provide simple examples to show how to use Qwen2.5-VL with 🤖 ModelScope and 🤗 Transformers.
@@ -79,11 +72,10 @@ or you might encounter the following error:
79
  KeyError: 'qwen2_5_vl'
80
  ```
81
 
82
-
83
  We offer a toolkit to help you handle various types of visual input more conveniently, as if you were using an API. This includes base64, URLs, and interleaved images and videos. You can install it using the following command:
84
 
85
  ```bash
86
- # It's highly recommanded to use `[decord]` feature for faster video loading.
87
  pip install qwen-vl-utils[decord]==0.0.8
88
  ```
89
 
@@ -94,7 +86,7 @@ If you are not using Linux, you might not be able to install `decord` from PyPI.
94
Here is a code snippet showing how to use the chat model with `transformers` and `qwen_vl_utils`:
95
 
96
  ```python
97
- from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
98
  from qwen_vl_utils import process_vision_info
99
 
100
  # default: Load the model on the available device(s)
@@ -156,12 +148,189 @@ output_text = processor.batch_decode(
156
  )
157
  print(output_text)
158
  ```
159
160
 
161
  ### 🤖 ModelScope
162
We strongly advise users, especially those in mainland China, to use ModelScope. `snapshot_download` can help you solve issues concerning downloading checkpoints.
163
 
164
-
165
  ### More Usage Tips
166
 
167
  For input images, we support local files, base64, and URLs. For videos, we currently only support local files.
@@ -207,18 +376,18 @@ The model supports a wide range of resolution inputs. By default, it uses the na
207
  min_pixels = 256 * 28 * 28
208
  max_pixels = 1280 * 28 * 28
209
  processor = AutoProcessor.from_pretrained(
210
- "Qwen/Qwen2.5-VL-7B-Instruct-AWQ", min_pixels=min_pixels, max_pixels=max_pixels
211
  )
212
  ```
213
 
214
In addition, we provide two methods for fine-grained control over the image size input to the model:
215
 
216
- 1. Define min_pixels and max_pixels: Images will be resized to maintain their aspect ratio within the range of min_pixels and max_pixels.
217
 
218
- 2. Specify exact dimensions: Directly set `resized_height` and `resized_width`. These values will be rounded to the nearest multiple of 28.
219
 
220
  ```python
221
- # min_pixels and max_pixels
222
  messages = [
223
  {
224
  "role": "user",
@@ -233,7 +402,7 @@ messages = [
233
  ],
234
  }
235
  ]
236
- # resized_height and resized_width
237
  messages = [
238
  {
239
  "role": "user",
@@ -250,55 +419,115 @@ messages = [
250
  ]
251
  ```
252
 
253
- ### Processing Long Texts
254
 
255
- The current `config.json` is set for context length up to 32,768 tokens.
256
- To handle extensive inputs exceeding 32,768 tokens, we utilize [YaRN](https://arxiv.org/abs/2309.00071), a technique for enhancing model length extrapolation, ensuring optimal performance on lengthy texts.
257
 
258
- For supported frameworks, you could add the following to `config.json` to enable YaRN:
259
 
260
- {
261
- ...,
262
- "type": "yarn",
263
- "mrope_section": [
264
- 16,
265
- 24,
266
- 24
267
- ],
268
- "factor": 4,
269
- "original_max_position_embeddings": 32768
270
- }
271
 
272
- However, it should be noted that this method has a significant impact on the performance of temporal and spatial localization tasks, and is therefore not recommended for use.
273
 
274
- At the same time, for long video inputs, since MRoPE itself is more economical with ids, the max_position_embeddings can be directly modified to a larger value, such as 64k.
275
 
 
276
 
277
- ### Benchmark
278
- #### Performance of Quantized Models
279
- This section reports the generation performance of quantized models (including GPTQ and AWQ) of the Qwen2.5-VL series. Specifically, we report:
280
 
281
- - MMMU_VAL (Accuracy)
282
- - DocVQA_VAL (Accuracy)
283
- - MMBench_DEV_EN (Accuracy)
284
- - MathVista_MINI (Accuracy)
285
 
286
- We use [VLMEvalkit](https://github.com/open-compass/VLMEvalKit) to evaluate all models.
287
 
288
- | Model Size | Quantization | MMMU_VAL | DocVQA_VAL | MMBench_EDV_EN | MathVista_MINI |
289
- | --- | --- | --- | --- | --- | --- |
290
- | Qwen2.5-VL-72B-Instruct | BF16<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct)[🤖](https://modelscope.cn/models/qwen/Qwen2.5-VL-72B-Instruct)) | 70.0 | 96.1 | 88.2 | 75.3 |
291
- | | AWQ<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct-AWQ)[🤖](https://modelscope.cn/models/qwen/Qwen2.5-VL-72B-Instruct-AWQ)) | 69.1 | 96.0 | 87.9 | 73.8 |
292
- | Qwen2.5-VL-7B-Instruct | BF16<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)[🤖](https://modelscope.cn/models/qwen/Qwen2.5-VL-7B-Instruct)) | 58.4 | 94.9 | 84.1 | 67.9 |
293
- | | AWQ<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct-AWQ)[🤖](https://modelscope.cn/models/qwen/Qwen2.5-VL-7B-Instruct-AWQ)) | 55.6 | 94.6 | 84.2 | 64.7 |
294
- | Qwen2.5-VL-3B-Instruct | BF16<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct)[🤖](https://modelscope.cn/models/qwen/Qwen2.5-VL-3B-Instruct)) | 51.7 | 93.0 | 79.8 | 61.4 |
295
- | | AWQ<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct-AWQ)[🤖](https://modelscope.cn/models/qwen/Qwen2.5-VL-3B-Instruct-AWQ)) | 49.1 | 91.8 | 78.0 | 58.8 |
296
 
297
 
298
 
299
  ## Citation
300
 
301
- If you find our work helpful, feel free to give us a cite.
302
 
303
  ```
304
  @misc{qwen2.5-VL,
@@ -322,4 +551,8 @@ If you find our work helpful, feel free to give us a cite.
322
  journal={arXiv preprint arXiv:2308.12966},
323
  year={2023}
324
  }
325
- ```
1
  ---
2
+ base_model:
3
+ - Qwen/Qwen2.5-VL-7B-Instruct
4
  language:
5
  - en
6
+ library_name: transformers
7
+ license: apache-2.0
8
  pipeline_tag: image-text-to-text
9
  tags:
10
  - multimodal
11
  ---
12
 
13
  # Qwen2.5-VL-7B-Instruct-AWQ
 
30
 
31
* **Generating structured outputs**: for data like scans of invoices, forms, and tables, Qwen2.5-VL supports structured outputs of their contents, benefiting use cases in finance, commerce, and more.
32
 
 
33
  #### Model Architecture Updates:
34
 
35
  * **Dynamic Resolution and Frame Rate Training for Video Understanding**:
 
45
 
46
  We enhance both training and inference speeds by strategically implementing window attention into the ViT. The ViT architecture is further optimized with SwiGLU and RMSNorm, aligning it with the structure of the Qwen2.5 LLM.
47
 
48
+ We have three models with 3, 7 and 72 billion parameters. This repo contains the instruction-tuned 7B Qwen2.5-VL model with AWQ quantization. For more information, visit our [Blog](https://qwenlm.github.io/blog/qwen2.5-vl/), the [GitHub](https://github.com/QwenLM/Qwen2.5-VL/) repository, the project page at [Qwen Chat](https://chat.qwenlm.ai/), and the technical report at https://huggingface.co/papers/2502.13923.
49
 
50
  ## Evaluation
51
 
52
  ## Requirements
53
The code of Qwen2.5-VL is in the latest Hugging Face Transformers, and we advise you to build from source with the following command:
54
  ```
 
59
  KeyError: 'qwen2_5_vl'
60
  ```
61
 
 
62
  ## Quickstart
63
 
64
  Below, we provide simple examples to show how to use Qwen2.5-VL with 🤖 ModelScope and 🤗 Transformers.
 
72
  KeyError: 'qwen2_5_vl'
73
  ```
74
 
 
75
  We offer a toolkit to help you handle various types of visual input more conveniently, as if you were using an API. This includes base64, URLs, and interleaved images and videos. You can install it using the following command:
76
 
77
  ```bash
78
+ # It's highly recommended to use the `[decord]` feature for faster video loading.
79
  pip install qwen-vl-utils[decord]==0.0.8
80
  ```
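If `decord` cannot be installed on your platform (the diff context above notes this can happen outside Linux), a plain install is a reasonable fallback; the line below is a sketch that assumes `qwen-vl-utils` then falls back to torchvision for video decoding.

```bash
# Fallback install without the [decord] extra (assumes torchvision is then used for video decoding)
pip install qwen-vl-utils==0.0.8
```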
81
 
 
86
Here is a code snippet showing how to use the chat model with `transformers` and `qwen_vl_utils`:
87
 
88
  ```python
89
+ from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
90
  from qwen_vl_utils import process_vision_info
91
 
92
  # default: Load the model on the available device(s)
 
148
  )
149
  print(output_text)
150
  ```
151
+ <details>
152
+ <summary>Multi image inference</summary>
153
+
154
+ ```python
155
+ # Messages containing multiple images and a text query
156
+ messages = [
157
+ {
158
+ "role": "user",
159
+ "content": [
160
+ {"type": "image", "image": "file:///path/to/image1.jpg"},
161
+ {"type": "image", "image": "file:///path/to/image2.jpg"},
162
+ {"type": "text", "text": "What are the common elements in these pictures?"},
163
+ ],
164
+ }
165
+ ]
166
+
167
+ # Preparation for inference
168
+ text = processor.apply_chat_template(
169
+ messages, tokenize=False, add_generation_prompt=True
170
+ )
171
+ image_inputs, video_inputs = process_vision_info(messages)
172
+ inputs = processor(
173
+ text=[text],
174
+ images=image_inputs,
175
+ videos=video_inputs,
176
+ padding=True,
177
+ return_tensors="pt",
178
+ )
179
+ inputs = inputs.to("cuda")
180
+
181
+ # Inference
182
+ generated_ids = model.generate(**inputs, max_new_tokens=128)
183
+ generated_ids_trimmed = [
184
+ out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
185
+ ]
186
+ output_text = processor.batch_decode(
187
+ generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
188
+ )
189
+ print(output_text)
190
+ ```
191
+ </details>
192
+
193
+ <details>
194
+ <summary>Video inference</summary>
195
+
196
+ ```python
197
+ # Messages containing an image list (treated as a video) and a text query
198
+ messages = [
199
+ {
200
+ "role": "user",
201
+ "content": [
202
+ {
203
+ "type": "video",
204
+ "video": [
205
+ "file:///path/to/frame1.jpg",
206
+ "file:///path/to/frame2.jpg",
207
+ "file:///path/to/frame3.jpg",
208
+ "file:///path/to/frame4.jpg",
209
+ ],
210
+ },
211
+ {"type": "text", "text": "Describe this video."},
212
+ ],
213
+ }
214
+ ]
215
+
216
+ # Messages containing a local video path and a text query
217
+ messages = [
218
+ {
219
+ "role": "user",
220
+ "content": [
221
+ {
222
+ "type": "video",
223
+ "video": "file:///path/to/video1.mp4",
224
+ "max_pixels": 360 * 420,
225
+ "fps": 1.0,
226
+ },
227
+ {"type": "text", "text": "Describe this video."},
228
+ ],
229
+ }
230
+ ]
231
 
232
+ # Messages containing a video url and a text query
233
+ messages = [
234
+ {
235
+ "role": "user",
236
+ "content": [
237
+ {
238
+ "type": "video",
239
+ "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-VL/space_woaudio.mp4",
240
+ },
241
+ {"type": "text", "text": "Describe this video."},
242
+ ],
243
+ }
244
+ ]
245
+
246
+ # Preparation for inference
247
+ text = processor.apply_chat_template(
248
+ messages, tokenize=False, add_generation_prompt=True
249
+ )
250
+ image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
251
+ inputs = processor(
252
+ text=[text],
253
+ images=image_inputs,
254
+ videos=video_inputs,
255
+ # fps is supplied through **video_kwargs returned by process_vision_info
256
+ padding=True,
257
+ return_tensors="pt",
258
+ **video_kwargs,
259
+ )
260
+ inputs = inputs.to("cuda")
261
+
262
+ # Inference
263
+ generated_ids = model.generate(**inputs, max_new_tokens=128)
264
+ generated_ids_trimmed = [
265
+ out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
266
+ ]
267
+ output_text = processor.batch_decode(
268
+ generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
269
+ )
270
+ print(output_text)
271
+ ```
272
+
273
+ Video URL compatibility largely depends on the third-party library version; the details are in the table below. Change the backend with `FORCE_QWENVL_VIDEO_READER=torchvision` or `FORCE_QWENVL_VIDEO_READER=decord` if you prefer not to use the default one (a usage sketch follows this section).
274
+
275
+ | Backend | HTTP | HTTPS |
276
+ |-------------|------|-------|
277
+ | torchvision >= 0.19.0 | ✅ | ✅ |
278
+ | torchvision < 0.19.0 | ❌ | ❌ |
279
+ | decord | ✅ | ❌ |
280
+ </details>
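As a usage sketch for the backend switch described in the table above, the environment variable can be exported before launching a script; `your_script.py` is a placeholder.

```bash
# Force the torchvision backend for qwen-vl-utils video loading (use decord instead if preferred)
FORCE_QWENVL_VIDEO_READER=torchvision python your_script.py
```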
281
+
282
+ <details>
283
+ <summary>Batch inference</summary>
284
+
285
+ ```python
286
+ # Sample messages for batch inference
287
+ messages1 = [
288
+ {
289
+ "role": "user",
290
+ "content": [
291
+ {"type": "image", "image": "file:///path/to/image1.jpg"},
292
+ {"type": "image", "image": "file:///path/to/image2.jpg"},
293
+ {"type": "text", "text": "What are the common elements in these pictures?"},
294
+ ],
295
+ }
296
+ ]
297
+ messages2 = [
298
+ {"role": "system", "content": "You are a helpful assistant."},
299
+ {"role": "user", "content": "Who are you?"},
300
+ ]
301
+ # Combine messages for batch processing
302
+ messages = [messages1, messages2]
303
+
304
+ # Preparation for batch inference
305
+ texts = [
306
+ processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
307
+ for msg in messages
308
+ ]
309
+ image_inputs, video_inputs = process_vision_info(messages)
310
+ inputs = processor(
311
+ text=texts,
312
+ images=image_inputs,
313
+ videos=video_inputs,
314
+ padding=True,
315
+ return_tensors="pt",
316
+ )
317
+ inputs = inputs.to("cuda")
318
+
319
+ # Batch Inference
320
+ generated_ids = model.generate(**inputs, max_new_tokens=128)
321
+ generated_ids_trimmed = [
322
+ out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
323
+ ]
324
+ output_texts = processor.batch_decode(
325
+ generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
326
+ )
327
+ print(output_texts)
328
+ ```
329
+ </details>
330
 
331
  ### 🤖 ModelScope
332
We strongly advise users, especially those in mainland China, to use ModelScope. `snapshot_download` can help you solve issues concerning downloading checkpoints.
333
 
 
334
  ### More Usage Tips
335
 
336
  For input images, we support local files, base64, and URLs. For videos, we currently only support local files.
 
376
  min_pixels = 256 * 28 * 28
377
  max_pixels = 1280 * 28 * 28
378
  processor = AutoProcessor.from_pretrained(
379
+ "Qwen/Qwen2.5-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
380
  )
381
  ```
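For intuition, each 28x28 pixel block corresponds to roughly one visual token, so the pixel budget above is effectively a token budget (256 to 1280 tokens here). The helper below is a hypothetical sketch of that arithmetic, not part of the official API.

```python
# Hypothetical helper: approximate visual-token budget implied by a pixel budget,
# assuming one visual token per 28x28 pixel block as in the min_pixels/max_pixels values above.
def approx_visual_tokens(num_pixels: int, block: int = 28) -> int:
    return num_pixels // (block * block)

print(approx_visual_tokens(256 * 28 * 28))   # 256  (min_pixels above)
print(approx_visual_tokens(1280 * 28 * 28))  # 1280 (max_pixels above)
```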
382
 
383
In addition, we provide two methods for fine-grained control over the image size input to the model:
384
 
385
+ 1. Specify exact dimensions: Directly set `resized_height` and `resized_width`. These values will be rounded to the nearest multiple of 28.
386
 
387
+ 2. Define min_pixels and max_pixels: Images will be resized to maintain their aspect ratio within the range of min_pixels and max_pixels.
388
 
389
  ```python
390
+ # resized_height and resized_width
391
  messages = [
392
  {
393
  "role": "user",
 
402
  ],
403
  }
404
  ]
405
+ # min_pixels and max_pixels
406
  messages = [
407
  {
408
  "role": "user",
 
419
  ]
420
  ```
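The rounding behaviour described for `resized_height` and `resized_width` can be illustrated with the hypothetical helper below; it is not part of `qwen_vl_utils`, just a sketch of the rule.

```python
# Hypothetical helper: snap a requested dimension to the nearest multiple of 28,
# mirroring how resized_height / resized_width are handled according to the text above.
def round_to_multiple(value: int, multiple: int = 28) -> int:
    return max(multiple, round(value / multiple) * multiple)

print(round_to_multiple(280))  # 280 (already a multiple of 28)
print(round_to_multiple(300))  # 308 (nearest multiple of 28)
```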
421
 
422
+ #### Add ids for Multiple Image Inputs
423
+ By default, images and video content are directly included in the conversation. When handling multiple images, it's helpful to add labels to the images and videos for better reference. Users can control this behavior with the `add_vision_id` argument shown below:
424
+ <details>
425
+ <summary>Add vision ids</summary>
426
 
427
+ ```python
428
+ conversation = [
429
+ {
430
+ "role": "user",
431
+ "content": [{"type": "image"}, {"type": "text", "text": "Hello, how are you?"}],
432
+ },
433
+ {
434
+ "role": "assistant",
435
+ "content": "I'm doing well, thank you for asking. How can I assist you today?",
436
+ },
437
+ {
438
+ "role": "user",
439
+ "content": [
440
+ {"type": "text", "text": "Can you describe these images and video?"},
441
+ {"type": "image"},
442
+ {"type": "image"},
443
+ {"type": "video"},
444
+ {"type": "text", "text": "These are from my vacation."},
445
+ ],
446
+ },
447
+ {
448
+ "role": "assistant",
449
+ "content": "I'd be happy to describe the images and video for you. Could you please provide more context about your vacation?",
450
+ },
451
+ {
452
+ "role": "user",
453
+ "content": "It was a trip to the mountains. Can you see the details in the images and video?",
454
+ },
455
+ ]
456
 
457
+ # default:
458
+ prompt_without_id = processor.apply_chat_template(
459
+ conversation, add_generation_prompt=True
460
+ )
461
+ # Expected output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Hello, how are you?<|im_end|>\n<|im_start|>assistant\nI'm doing well, thank you for asking. How can I assist you today?<|im_end|>\n<|im_start|>user\nCan you describe these images and video?<|vision_start|><|image_pad|><|vision_end|><|vision_start|><|image_pad|><|vision_end|><|vision_start|><|video_pad|><|vision_end|>These are from my vacation.<|im_end|>\n<|im_start|>assistant\nI'd be happy to describe the images and video for you. Could you please provide more context about your vacation?<|im_end|>\n<|im_start|>user\nIt was a trip to the mountains. Can you see the details in the images and video?<|im_end|>\n<|im_start|>assistant\n'
462
 
463
 
464
+ # add ids
465
+ prompt_with_id = processor.apply_chat_template(
466
+ conversation, add_generation_prompt=True, add_vision_id=True
467
+ )
468
+ # Expected output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nPicture 1: <|vision_start|><|image_pad|><|vision_end|>Hello, how are you?<|im_end|>\n<|im_start|>assistant\nI'm doing well, thank you for asking. How can I assist you today?<|im_end|>\n<|im_start|>user\nCan you describe these images and video?Picture 2: <|vision_start|><|image_pad|><|vision_end|>Picture 3: <|vision_start|><|image_pad|><|vision_end|>Video 1: <|vision_start|><|video_pad|><|vision_end|>These are from my vacation.<|im_end|>\n<|im_start|>assistant\nI'd be happy to describe the images and video for you. Could you please provide more context about your vacation?<|im_end|>\n<|im_start|>user\nIt was a trip to the mountains. Can you see the details in the images and video?<|im_end|>\n<|im_start|>assistant\n'
469
+ ```
470
+ </details>
471
 
472
+ #### Flash-Attention 2 to speed up generation
473
 
474
+ First, make sure to install the latest version of Flash Attention 2:
475
 
476
+ ```bash
477
+ pip install -U flash-attn --no-build-isolation
478
+ ```
479
 
480
+ Also, you should have hardware that is compatible with FlashAttention-2. Read more about it in the official documentation of the [flash attention repository](https://github.com/Dao-AILab/flash-attention). FlashAttention-2 can only be used when a model is loaded in `torch.float16` or `torch.bfloat16`.
481
 
482
+ To load and run a model using Flash Attention-2, simply add `attn_implementation="flash_attention_2"` when loading the model as follows:
483
 
484
+ ```python
485
+ import torch
+ from transformers import Qwen2_5_VLForConditionalGeneration
486
 
487
+ model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
488
+ "Qwen/Qwen2.5-VL-7B-Instruct",
489
+ torch_dtype=torch.bfloat16,
490
+ attn_implementation="flash_attention_2",
491
+ )
492
+ ```
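If the same script must also run on machines without `flash-attn`, a guarded fallback such as the sketch below (our assumption, not part of the official instructions) keeps the loading code working with the default SDPA attention.

```python
import importlib.util

import torch
from transformers import Qwen2_5_VLForConditionalGeneration

# Enable FlashAttention-2 only when the flash_attn package is importable; otherwise fall back to SDPA.
attn_impl = "flash_attention_2" if importlib.util.find_spec("flash_attn") else "sdpa"

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation=attn_impl,
)
```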
493
 
494
+ ### Try Qwen2.5-VL-72B with API!
495
+
496
+ To explore Qwen2.5-VL-72B, the largest and most capable model in the series, we encourage you to try our API service.
497
+
498
+ #### Installation
499
+ ```bash
500
+ pip install dashscope
501
+ ```
502
+
503
+ #### Examples
504
+ ```python
505
+ import dashscope
506
+
507
+
508
+ dashscope.api_key = "your_api_key"
509
+
510
+ messages = [{
511
+ 'role': 'user',
512
+ 'content': [
513
+ {
514
+ 'image': "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"
515
+ },
516
+ {
517
+ 'text': 'What are in the image?'
518
+ },
519
+ ]
520
+ }]
521
+
522
+ response = dashscope.MultiModalConversation.call(model='qwen2.5-vl-72b-instruct', messages=messages)
523
+ print(response)
524
+ ```
525
+
526
+ For more usage, please refer to the tutorial at [aliyun](https://help.aliyun.com/zh/model-studio/developer-reference/qwen-vl-api).
527
 
528
  ## Citation
529
 
530
+ If you find our paper and code useful in your research, please consider giving a star :star: and citation :pencil: :)
531
 
532
  ```
533
  @misc{qwen2.5-VL,
 
551
  journal={arXiv preprint arXiv:2308.12966},
552
  year={2023}
553
  }
554
+ ```
555
+
556
+ <br>
557
+
558
+ Code: https://github.com/QwenLM/Qwen2.5-VL