Add link to paper and code
The model card is already extensively documented. This PR adds a link to the paper page so that the model can be found at https://huggingface.co/papers/2502.13923. It also adds links to the official code repository and the project page.
README.md (CHANGED)
@@ -1,13 +1,13 @@
 ---
-license: apache-2.0
+base_model:
+- Qwen/Qwen2.5-VL-7B-Instruct
 language:
 - en
+library_name: transformers
+license: apache-2.0
 pipeline_tag: image-text-to-text
 tags:
 - multimodal
-library_name: transformers
-base_model:
-- Qwen/Qwen2.5-VL-7B-Instruct
 ---
 
 # Qwen2.5-VL-7B-Instruct-AWQ
@@ -30,7 +30,6 @@ In the past five months since Qwen2-VL’s release, numerous developers have bui
 
 * **Generating structured outputs**: for data like scans of invoices, forms, tables, etc. Qwen2.5-VL supports structured outputs of their contents, benefiting usages in finance, commerce, etc.
 
-
 #### Model Architecture Updates:
 
 * **Dynamic Resolution and Frame Rate Training for Video Understanding**:
@@ -46,15 +45,10 @@ We extend dynamic resolution to the temporal dimension by adopting dynamic FPS s
 
 We enhance both training and inference speeds by strategically implementing window attention into the ViT. The ViT architecture is further optimized with SwiGLU and RMSNorm, aligning it with the structure of the Qwen2.5 LLM.
 
-
-We have three models with 3, 7 and 72 billion parameters. This repo contains the instruction-tuned 7B Qwen2.5-VL model with AWQ. For more information, visit our [Blog](https://qwenlm.github.io/blog/qwen2.5-vl/) and [GitHub](https://github.com/QwenLM/Qwen2.5-VL).
-
-
+We have three models with 3, 7 and 72 billion parameters. This repo contains the instruction-tuned 7B Qwen2.5-VL model with AWQ. For more information, visit our [Blog](https://qwenlm.github.io/blog/qwen2.5-vl/), [GitHub](https://github.com/QwenLM/Qwen2.5-VL/) and the project page at [Qwen Chat](https://chat.qwenlm.ai/) and the technical report available at https://huggingface.co/papers/2502.13923.
 
 ## Evaluation
 
-
-
 ## Requirements
 The code of Qwen2.5-VL has been in the latest Hugging face transformers and we advise you to build from source with command:
 ```
@@ -65,7 +59,6 @@ or you might encounter the following error:
 KeyError: 'qwen2_5_vl'
 ```
 
-
 ## Quickstart
 
 Below, we provide simple examples to show how to use Qwen2.5-VL with 🤖 ModelScope and 🤗 Transformers.
@@ -79,11 +72,10 @@ or you might encounter the following error:
 KeyError: 'qwen2_5_vl'
 ```
 
-
 We offer a toolkit to help you handle various types of visual input more conveniently, as if you were using an API. This includes base64, URLs, and interleaved images and videos. You can install it using the following command:
 
 ```bash
-# It's highly
+# It's highly recommended to use `[decord]` feature for faster video loading.
 pip install qwen-vl-utils[decord]==0.0.8
 ```
 
@@ -94,7 +86,7 @@ If you are not using Linux, you might not be able to install `decord` from PyPI.
 Here we show a code snippet to show you how to use the chat model with `transformers` and `qwen_vl_utils`:
 
 ```python
-from transformers import Qwen2_5_VLForConditionalGeneration,
+from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
 from qwen_vl_utils import process_vision_info
 
 # default: Load the model on the available device(s)
@@ -156,12 +148,189 @@ output_text = processor.batch_decode(
 )
 print(output_text)
 ```
+<details>
+<summary>Multi image inference</summary>
+
+```python
+# Messages containing multiple images and a text query
+messages = [
+    {
+        "role": "user",
+        "content": [
+            {"type": "image", "image": "file:///path/to/image1.jpg"},
+            {"type": "image", "image": "file:///path/to/image2.jpg"},
+            {"type": "text", "text": "What are the common elements in these pictures?"},
+        ],
+    }
+]
+
+# Preparation for inference
+text = processor.apply_chat_template(
+    messages, tokenize=False, add_generation_prompt=True
+)
+image_inputs, video_inputs = process_vision_info(messages)
+inputs = processor(
+    text=[text],
+    images=image_inputs,
+    videos=video_inputs,
+    padding=True,
+    return_tensors="pt",
+)
+inputs = inputs.to("cuda")
+
+# Inference
+generated_ids = model.generate(**inputs, max_new_tokens=128)
+generated_ids_trimmed = [
+    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
+]
+output_text = processor.batch_decode(
+    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
+)
+print(output_text)
+```
+</details>
+
+<details>
+<summary>Video inference</summary>
+
+```python
+# Messages containing a images list as a video and a text query
+messages = [
+    {
+        "role": "user",
+        "content": [
+            {
+                "type": "video",
+                "video": [
+                    "file:///path/to/frame1.jpg",
+                    "file:///path/to/frame2.jpg",
+                    "file:///path/to/frame3.jpg",
+                    "file:///path/to/frame4.jpg",
+                ],
+            },
+            {"type": "text", "text": "Describe this video."},
+        ],
+    }
+]
+
+# Messages containing a local video path and a text query
+messages = [
+    {
+        "role": "user",
+        "content": [
+            {
+                "type": "video",
+                "video": "file:///path/to/video1.mp4",
+                "max_pixels": 360 * 420,
+                "fps": 1.0,
+            },
+            {"type": "text", "text": "Describe this video."},
+        ],
+    }
+]
+
+# Messages containing a video url and a text query
+messages = [
+    {
+        "role": "user",
+        "content": [
+            {
+                "type": "video",
+                "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-VL/space_woaudio.mp4",
+            },
+            {"type": "text", "text": "Describe this video."},
+        ],
+    }
+]
+
+# Preparation for inference
+text = processor.apply_chat_template(
+    messages, tokenize=False, add_generation_prompt=True
+)
+image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
+inputs = processor(
+    text=[text],
+    images=image_inputs,
+    videos=video_inputs,
+    fps=fps,
+    padding=True,
+    return_tensors="pt",
+    **video_kwargs,
+)
+inputs = inputs.to("cuda")
+
+# Inference
+generated_ids = model.generate(**inputs, max_new_tokens=128)
+generated_ids_trimmed = [
+    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
+]
+output_text = processor.batch_decode(
+    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
+)
+print(output_text)
+```
+
+Video URL compatibility largely depends on the third-party library version. The details are in the table below. change the backend by `FORCE_QWENVL_VIDEO_READER=torchvision` or `FORCE_QWENVL_VIDEO_READER=decord` if you prefer not to use the default one.
+
+| Backend | HTTP | HTTPS |
+|-------------|------|-------|
+| torchvision >= 0.19.0 | ✅ | ✅ |
+| torchvision < 0.19.0 | ❌ | ❌ |
+| decord | ✅ | ❌ |
+</details>
+
+<details>
+<summary>Batch inference</summary>
+
+```python
+# Sample messages for batch inference
+messages1 = [
+    {
+        "role": "user",
+        "content": [
+            {"type": "image", "image": "file:///path/to/image1.jpg"},
+            {"type": "image", "image": "file:///path/to/image2.jpg"},
+            {"type": "text", "text": "What are the common elements in these pictures?"},
+        ],
+    }
+]
+messages2 = [
+    {"role": "system", "content": "You are a helpful assistant."},
+    {"role": "user", "content": "Who are you?"},
+]
+# Combine messages for batch processing
+messages = [messages1, messages2]
+
+# Preparation for batch inference
+texts = [
+    processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
+    for msg in messages
+]
+image_inputs, video_inputs = process_vision_info(messages)
+inputs = processor(
+    text=texts,
+    images=image_inputs,
+    videos=video_inputs,
+    padding=True,
+    return_tensors="pt",
+)
+inputs = inputs.to("cuda")
+
+# Batch Inference
+generated_ids = model.generate(**inputs, max_new_tokens=128)
+generated_ids_trimmed = [
+    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
+]
+output_texts = processor.batch_decode(
+    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
+)
+print(output_texts)
+```
+</details>
 
 ### 🤖 ModelScope
 We strongly advise users especially those in mainland China to use ModelScope. `snapshot_download` can help you solve issues concerning downloading checkpoints.
 
-
 ### More Usage Tips
 
 For input images, we support local files, base64, and URLs. For videos, we currently only support local files.
@@ -207,18 +376,18 @@ The model supports a wide range of resolution inputs. By default, it uses the na
 min_pixels = 256 * 28 * 28
 max_pixels = 1280 * 28 * 28
 processor = AutoProcessor.from_pretrained(
-    "Qwen/Qwen2.5-VL-7B-Instruct
+    "Qwen/Qwen2.5-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
 )
 ```
 
 Besides, We provide two methods for fine-grained control over the image size input to the model:
 
-1.
+1. Specify exact dimensions: Directly set `resized_height` and `resized_width`. These values will be rounded to the nearest multiple of 28.
 
-2.
+2. Define min_pixels and max_pixels: Images will be resized to maintain their aspect ratio within the range of min_pixels and max_pixels.
 
 ```python
-#
+# resized_height and resized_width
 messages = [
     {
         "role": "user",
@@ -233,7 +402,7 @@ messages = [
         ],
     }
 ]
-#
+# min_pixels and max_pixels
 messages = [
     {
         "role": "user",
@@ -250,55 +419,115 @@ messages =
 ]
 ```
 
-
-
-
-
-{
-    ...,
-    "type": "yarn",
-    "mrope_section": [
-        16,
-        24,
-        24
-    ],
-    "factor": 4,
-    "original_max_position_embeddings": 32768
-}
-
-
-
-
-
-
-- DocVQA_VAL (Accuracy)
-- MMBench_DEV_EN (Accuracy)
-- MathVista_MINI (Accuracy)
-
-
-
-| Qwen2.5-VL-72B-Instruct | BF16<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct)[🤖](https://modelscope.cn/models/qwen/Qwen2.5-VL-72B-Instruct)) | 70.0 | 96.1 | 88.2 | 75.3 |
-| | AWQ<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct-AWQ)[🤖](https://modelscope.cn/models/qwen/Qwen2.5-VL-72B-Instruct-AWQ)) | 69.1 | 96.0 | 87.9 | 73.8 |
-| Qwen2.5-VL-7B-Instruct | BF16<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)[🤖](https://modelscope.cn/models/qwen/Qwen2.5-VL-7B-Instruct)) | 58.4 | 94.9 | 84.1 | 67.9 |
-| | AWQ<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct-AWQ)[🤖](https://modelscope.cn/models/qwen/Qwen2.5-VL-7B-Instruct-AWQ)) | 55.6 | 94.6 | 84.2 | 64.7 |
-| Qwen2.5-VL-3B-Instruct | BF16<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct)[🤖](https://modelscope.cn/models/qwen/Qwen2.5-VL-3B-Instruct)) | 51.7 | 93.0 | 79.8 | 61.4 |
-| | AWQ<br><sup>([🤗](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct-AWQ)[🤖](https://modelscope.cn/models/qwen/Qwen2.5-VL-3B-Instruct-AWQ)) | 49.1 | 91.8 | 78.0 | 58.8 |
+#### Add ids for Multiple Image Inputs
+By default, images and video content are directly included in the conversation. When handling multiple images, it's helpful to add labels to the images and videos for better reference. Users can control this behavior with the following settings:
+<details>
+<summary>Add vision ids</summary>
+
+```python
+conversation = [
+    {
+        "role": "user",
+        "content": [{"type": "image"}, {"type": "text", "text": "Hello, how are you?"}],
+    },
+    {
+        "role": "assistant",
+        "content": "I'm doing well, thank you for asking. How can I assist you today?",
+    },
+    {
+        "role": "user",
+        "content": [
+            {"type": "text", "text": "Can you describe these images and video?"},
+            {"type": "image"},
+            {"type": "image"},
+            {"type": "video"},
+            {"type": "text", "text": "These are from my vacation."},
+        ],
+    },
+    {
+        "role": "assistant",
+        "content": "I'd be happy to describe the images and video for you. Could you please provide more context about your vacation?",
+    },
+    {
+        "role": "user",
+        "content": "It was a trip to the mountains. Can you see the details in the images and video?",
+    },
+]
+
+# default:
+prompt_without_id = processor.apply_chat_template(
+    conversation, add_generation_prompt=True
+)
+# Excepted output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Hello, how are you?<|im_end|>\n<|im_start|>assistant\nI'm doing well, thank you for asking. How can I assist you today?<|im_end|>\n<|im_start|>user\nCan you describe these images and video?<|vision_start|><|image_pad|><|vision_end|><|vision_start|><|image_pad|><|vision_end|><|vision_start|><|video_pad|><|vision_end|>These are from my vacation.<|im_end|>\n<|im_start|>assistant\nI'd be happy to describe the images and video for you. Could you please provide more context about your vacation?<|im_end|>\n<|im_start|>user\nIt was a trip to the mountains. Can you see the details in the images and video?<|im_end|>\n<|im_start|>assistant\n'
+
+# add ids
+prompt_with_id = processor.apply_chat_template(
+    conversation, add_generation_prompt=True, add_vision_id=True
+)
+# Excepted output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nPicture 1: <|vision_start|><|image_pad|><|vision_end|>Hello, how are you?<|im_end|>\n<|im_start|>assistant\nI'm doing well, thank you for asking. How can I assist you today?<|im_end|>\n<|im_start|>user\nCan you describe these images and video?Picture 2: <|vision_start|><|image_pad|><|vision_end|>Picture 3: <|vision_start|><|image_pad|><|vision_end|>Video 1: <|vision_start|><|video_pad|><|vision_end|>These are from my vacation.<|im_end|>\n<|im_start|>assistant\nI'd be happy to describe the images and video for you. Could you please provide more context about your vacation?<|im_end|>\n<|im_start|>user\nIt was a trip to the mountains. Can you see the details in the images and video?<|im_end|>\n<|im_start|>assistant\n'
+```
+</details>
+
+#### Flash-Attention 2 to speed up generation
+
+First, make sure to install the latest version of Flash Attention 2:
+
+```bash
+pip install -U flash-attn --no-build-isolation
+```
+
+Also, you should have a hardware that is compatible with Flash-Attention 2. Read more about it in the official documentation of the [flash attention repository](https://github.com/Dao-AILab/flash-attention). FlashAttention-2 can only be used when a model is loaded in `torch.float16` or `torch.bfloat16`.
+
+To load and run a model using Flash Attention-2, simply add `attn_implementation="flash_attention_2"` when loading the model as follows:
+
+```python
+from transformers import Qwen2_5_VLForConditionalGeneration
+
+model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
+    "Qwen/Qwen2.5-VL-7B-Instruct",
+    torch_dtype=torch.bfloat16,
+    attn_implementation="flash_attention_2",
+)
+```
+
+### Try Qwen2.5-VL-72B with API!
+
+To explore Qwen2.5-VL-72B, a more fascinating multimodal model, we encourage you to test our cutting-edge API service. Let's start the exciting journey right now!
+
+#### Installation
+```bash
+pip install dashscope
+```
+
+#### Examples
+```python
+import dashscope
+
+
+dashscope.api_key = "your_api_key"
+
+messages = [{
+    'role': 'user',
+    'content': [
+        {
+            'image': "https://dashscope.oss-cn-beijing.aliyuncs.com/images/dog_and_girl.jpeg"
+        },
+        {
+            'text': 'What are in the image?'
+        },
+    ]
+}]
+
+response = dashscope.MultiModalConversation.call(model='qwen2.5-vl-72b-instruct', messages=messages)
+print(response)
+```
+
+For more usage, please refer to the tutorial at [aliyun](https://help.aliyun.com/zh/model-studio/developer-reference/qwen-vl-api).
 
 ## Citation
 
-If you find our work helpful, feel free to give us a cite.
+If you find our paper and code useful in your research, please consider giving a star :star: and citation :pencil: :)
 
 ```
 @misc{qwen2.5-VL,
@@ -322,4 +551,8 @@ If you find our work helpful, feel free to give us a cite.
   journal={arXiv preprint arXiv:2308.12966},
   year={2023}
 }
-```
+```
+
+<br>
+
+Code: https://github.com/QwenLM/Qwen2.5-VL