lbourdois committed
Commit 93819b0 · verified · 1 Parent(s): 3ab0a53

Improve language tag


Hi! As the model is multilingual, this PR adds languages other than English to the language tag to improve referencing. Note that 29 languages are announced in the README, but only 13 are explicitly listed, so I was only able to add those 13 languages.

Files changed (1)
  1. README.md +392 -380
README.md CHANGED
@@ -1,381 +1,393 @@
- ---
- license: cc-by-nc-4.0
- pipeline_tag: image-text-to-text
- library_name: transformers
- base_model:
- - google/paligemma-3b-mix-448
- - Qwen/Qwen2.5-1.5B-Instruct
- - google/siglip-so400m-patch14-384
- base_model_relation: merge
- language:
- - multilingual
- tags:
- - eagle
- - VLM
- ---
+ ---
+ license: cc-by-nc-4.0
+ pipeline_tag: image-text-to-text
+ library_name: transformers
+ base_model:
+ - google/paligemma-3b-mix-448
+ - Qwen/Qwen2.5-1.5B-Instruct
+ - google/siglip-so400m-patch14-384
+ base_model_relation: merge
+ language:
+ - zho
+ - eng
+ - fra
+ - spa
+ - por
+ - deu
+ - ita
+ - rus
+ - jpn
+ - kor
+ - vie
+ - tha
+ - ara
+ tags:
+ - eagle
+ - VLM
+ ---

# Eagle-2

[\[📂 GitHub\]](https://github.com/NVlabs/EAGLE) [\[📜 Eagle2 Tech Report\]](http://arxiv.org/abs/2501.14818)
[\[🤗 HF Demo\]](https://huggingface.co/spaces/nvidia/Eagle2-Demo)

## News
- We have updated the model architecture to `eagle_2_5_vl` to support the `generate` feature.

## Introduction

We are thrilled to release our latest Eagle2 series of Vision-Language Models. Open-source Vision-Language Models (VLMs) have made significant strides in narrowing the gap with proprietary models. However, critical details about data strategies and implementation are often missing, limiting reproducibility and innovation. In this project, we focus on VLM post-training from a data-centric perspective, sharing insights into building effective data strategies from scratch. By combining these strategies with robust training recipes and model design, we introduce Eagle2, a family of performant VLMs. Our work aims to empower the open-source community to develop competitive VLMs with transparent processes.

In this repo, we are open-sourcing Eagle2-2B, a lightweight model that achieves remarkable efficiency and speed while maintaining solid performance.

## Model Zoo
We provide the following models:

| Model name | LLM | Vision | Max Length | HF Link |
| ----------- | ------- | --------- | - | - |
| Eagle2-1B | [Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) | Siglip | 16K | [🤗 link](https://huggingface.co/NVIDIA/Eagle2-1B) |
| Eagle2-2B | [Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) | Siglip | 16K | [🤗 link](https://huggingface.co/NVIDIA/Eagle2-2B) |
| Eagle2-9B | [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) | Siglip+ConvNext | 16K | [🤗 link](https://huggingface.co/NVIDIA/Eagle2-9B) |
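
The Quick Start examples below load `nvidia/Eagle2-1B`; any other entry in the table is loaded the same way, with only the repository id changed. A minimal loading sketch (repo id taken from the table above; this snippet is illustrative and not part of the original card):

```python
from transformers import AutoModel, AutoProcessor
import torch

# Swap in any repo id from the Model Zoo table, e.g. the 2B variant this card describes.
repo_id = "nvidia/Eagle2-2B"
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained(repo_id, trust_remote_code=True, use_fast=True)
```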

## Benchmark Results
| Benchmark | InternVL2-2B | InternVL2.5-2B | InternVL2-4B | Qwen2-VL-2B | Eagle2-2B |
| :--------------------------: | :------------------: | :----------------: | :----------: | :----------: | :----------: |
| DocVQA<sub>test</sub> | 86.9 | 88.7 | 89.2 | 90.1 | 88.0 |
| ChartQA<sub>test</sub> | 76.2 | 79.2 | 81.5 | 73.0 | 82.0 |
| InfoVQA<sub>test</sub> | 58.9 | 60.9 | 67.0 | 65.5 | 65.8 |
| TextVQA<sub>val</sub> | 73.4 | 74.3 | 74.4 | 79.7 | 79.1 |
| OCRBench | 784 | 804 | 788 | 809 | 818 |
| MME<sub>sum</sub> | 1876.8 | 2138.2 | 2059.8 | 1872.0 | 2109.8 |
| RealWorldQA | 57.3 | 60.1 | 60.7 | 62.6 | 63.1 |
| AI2D<sub>test</sub> | 74.1 | 74.9 | 74.7 | 78.9 | 79.3 |
| MMMU<sub>val</sub> | 36.3 | 43.6 | 47.9 | 41.1 | 43.1 |
| MMVet<sub>GPT-4-Turbo</sub> | 39.5 | 60.8 | 51.0 | 49.5 | 53.8 |
| HallBench<sub>avg</sub> | 37.9 | 42.6 | 41.9 | 41.7 | 45.8 |
| MathVista<sub>testmini</sub> | 46.3 | 51.3 | 58.6 | 43.0 | 54.7 |
| MMstar | 50.1 | 53.7 | 54.3 | 48.0 | 56.4 |

## Quick Start

We provide an [inference script](./demo.py) to help you quickly start using the model. We support different input types:
- pure text input (see the text-only sketch below)
- single image input
- multiple image input
- video input

### Install the dependencies

```bash
pip install transformers
pip install flash-attn
```
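
### pure text

The input-type list above mentions pure text, but the sections below only cover images and videos. Here is a minimal text-only sketch (not from the original card), assuming the processor accepts a prompt with the `images` and `videos` arguments simply omitted:

```python
from transformers import AutoProcessor, AutoModel
import torch

model = AutoModel.from_pretrained("nvidia/Eagle2-1B", trust_remote_code=True, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained("nvidia/Eagle2-1B", trust_remote_code=True, use_fast=True)
processor.tokenizer.padding_side = "left"

# Text-only conversation: no "image" or "video" entries in the content list.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Give a short introduction to vision-language models."},
        ],
    }
]

text_list = [processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)]
# Assumption: with no visual content, the images/videos arguments can be left out.
inputs = processor(text=text_list, return_tensors="pt", padding=True)
inputs = inputs.to("cuda")
model = model.to("cuda")
generated_ids = model.generate(**inputs, max_new_tokens=1024)
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```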

### single image

```python
from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch

model = AutoModel.from_pretrained("nvidia/Eagle2-1B", trust_remote_code=True, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained("nvidia/Eagle2-1B", trust_remote_code=True, use_fast=True)
processor.tokenizer.padding_side = "left"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://www.ilankelman.org/stopsigns/australia.jpg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

text_list = [processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)]
image_inputs, video_inputs = processor.process_vision_info(messages)
inputs = processor(text=text_list, images=image_inputs, videos=video_inputs, return_tensors="pt", padding=True)
inputs = inputs.to("cuda")
model = model.to("cuda")
generated_ids = model.generate(**inputs, max_new_tokens=1024)
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```

### stream generation

```python
from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel, AutoTokenizer
from transformers import TextIteratorStreamer
import torch
import threading

model = AutoModel.from_pretrained("nvidia/Eagle2-1B", trust_remote_code=True, attn_implementation='flash_attention_2', torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("nvidia/Eagle2-1B", trust_remote_code=True, use_fast=True)
processor = AutoProcessor.from_pretrained("nvidia/Eagle2-1B", trust_remote_code=True, use_fast=True)
processor.tokenizer.padding_side = "left"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://www.ilankelman.org/stopsigns/australia.jpg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

text_list = [processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)]
image_inputs, video_inputs = processor.process_vision_info(messages)
inputs = processor(text=text_list, images=image_inputs, videos=video_inputs, return_tensors="pt", padding=True)
inputs = inputs.to("cuda")
model = model.to("cuda")

streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

generation_kwargs = dict(
    **inputs,
    streamer=streamer,
    max_new_tokens=1024,
    do_sample=True,
    top_p=0.95,
    temperature=0.8
)
thread = threading.Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()

for new_text in streamer:
    print(new_text, end="", flush=True)
```

### multiple images

```python
from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch

model = AutoModel.from_pretrained("nvidia/Eagle2-1B", trust_remote_code=True, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained("nvidia/Eagle2-1B", trust_remote_code=True, use_fast=True)
processor.tokenizer.padding_side = "left"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://www.ilankelman.org/stopsigns/australia.jpg",
            },
            {
                "type": "image",
                "image": "https://www.nvidia.com/content/dam/en-zz/Solutions/about-nvidia/logo-and-brand/[email protected]",
            },
            {"type": "text", "text": "Describe these two images."},
        ],
    }
]

text_list = [processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)]
image_inputs, video_inputs = processor.process_vision_info(messages)
inputs = processor(text=text_list, images=image_inputs, videos=video_inputs, return_tensors="pt", padding=True)
inputs = inputs.to("cuda")
model = model.to("cuda")
generated_ids = model.generate(**inputs, max_new_tokens=1024)
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```

### single video

```python
from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch

model = AutoModel.from_pretrained("nvidia/Eagle2-1B", trust_remote_code=True, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained("nvidia/Eagle2-1B", trust_remote_code=True, use_fast=True)
processor.tokenizer.padding_side = "left"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "../Eagle2-8B/space_woaudio.mp4",
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

text_list = [processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)]
image_inputs, video_inputs, video_kwargs = processor.process_vision_info(messages, return_video_kwargs=True)

inputs = processor(text=text_list, images=image_inputs, videos=video_inputs, return_tensors="pt", padding=True, videos_kwargs=video_kwargs)
inputs = inputs.to("cuda")
model = model.to("cuda")
generated_ids = model.generate(**inputs, max_new_tokens=1024)
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```

### multiple videos

```python
from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch

model = AutoModel.from_pretrained("nvidia/Eagle2-1B", trust_remote_code=True, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained("nvidia/Eagle2-1B", trust_remote_code=True, use_fast=True)
processor.tokenizer.padding_side = "left"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "../Eagle2-8B/space_woaudio.mp4",
                "nframes": 10,
            },
            {
                "type": "video",
                "video": "../Eagle2-8B/video_ocr.mp4",
                "nframes": 10,
            },
            {"type": "text", "text": "Describe these two videos respectively."},
        ],
    }
]

text_list = [processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)]
image_inputs, video_inputs, video_kwargs = processor.process_vision_info(messages, return_video_kwargs=True)
inputs = processor(text=text_list, images=image_inputs, videos=video_inputs, return_tensors="pt", padding=True, videos_kwargs=video_kwargs)
inputs = inputs.to("cuda")
model = model.to("cuda")
generated_ids = model.generate(**inputs, max_new_tokens=1024)
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```

### batch inference

```python
from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch

model = AutoModel.from_pretrained("nvidia/Eagle2-1B", trust_remote_code=True, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained("nvidia/Eagle2-1B", trust_remote_code=True, use_fast=True)
processor.tokenizer.padding_side = "left"

messages1 = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://www.ilankelman.org/stopsigns/australia.jpg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

messages2 = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://www.nvidia.com/content/dam/en-zz/Solutions/about-nvidia/logo-and-brand/[email protected]",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

text_list = [processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
) for messages in [messages1, messages2]]
image_inputs, video_inputs = processor.process_vision_info([messages1, messages2])
inputs = processor(text=text_list, images=image_inputs, videos=video_inputs, return_tensors="pt", padding=True)
inputs = inputs.to("cuda")
model = model.to("cuda")
generated_ids = model.generate(**inputs, max_new_tokens=1024)
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```

## TODO
- [ ] Support vLLM Inference
- [ ] Provide AWQ Quantization Weights
- [ ] Provide fine-tuning scripts

## License/Terms of Use
- The code is released under the Apache 2.0 license as found in the [LICENSE](https://huggingface.co/NVEagle/Eagle-X5-13B-Chat/blob/main/LICENSE) file.
- The pretrained model weights are released under the [Creative Commons Attribution-NonCommercial 4.0 International](https://spdx.org/licenses/CC-BY-NC-4.0) license.
- The service is a research preview intended for non-commercial use only and is subject to the following licenses and terms:
  - Model License of Qwen2.5-1.5B-Instruct: [Apache-2.0](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct/blob/main/LICENSE)
  - Model License of PaliGemma: [Gemma license](https://ai.google.dev/gemma/terms)

## Citation

## Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets the requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).