Improve language tag

#5
by lbourdois - opened
Files changed (1)
  1. README.md +253 -242
README.md CHANGED
@@ -1,243 +1,254 @@
1
- ---
2
- language:
3
- - en
4
- - ko
5
- license: cc-by-nc-4.0
6
- tags:
7
- - multimodal
8
- - conversational
9
- - ncsoft
10
- - varco
11
- base_model:
12
- - Qwen/Qwen2.5-14B-Instruct
13
- - google/siglip-so400m-patch14-384
14
- library_name: transformers
15
- pipeline_tag: image-text-to-text
16
- ---
17
-
18
- # VARCO-VISION-14B
19
-
20
- ## About the Model
21
-
22
- **VARCO-VISION-14B** is a powerful English-Korean Vision-Language Model (VLM). The training pipeline of VARCO-VISION consists of four stages: Feature Alignment Pre-training, Basic Supervised Fine-tuning, Advanced Supervised Fine-tuning, and Preference Optimization. On both multimodal and text-only benchmarks, VARCO-VISION-14B not only surpasses other models of similar size but also achieves scores comparable to those of proprietary models. The model currently accepts a single image and a text prompt as input and generates a text output. It supports grounding, referring, and OCR (Optical Character Recognition).
23
-
24
- - **Developed by:** NC Research, Multimodal Generation Team
25
- - **Technical Report:** [VARCO-VISION: Expanding Frontiers in Korean Vision-Language Models](https://arxiv.org/pdf/2411.19103)
26
- - **Blog (Korean):** [VARCO-VISION Technical Report Summary](https://ncsoft.github.io/ncresearch/95ad8712e60063e9ac97538504ac3eea0ac530af)
27
- - **Demo Page:** *The demo page is no longer available.*
28
- - **Languages:** Korean, English
29
- - **License:** CC BY-NC 4.0
30
- - **Architecture:** VARCO-VISION-14B follows the architecture of [LLaVA-OneVision](https://arxiv.org/abs/2408.03326).
31
- - **Base Model:**
32
- - **Language Model:** [Qwen/Qwen2.5-14B-Instruct](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct)
33
- - **Vision Encoder:** [google/siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384)
34
- - **Huggingface Version Model:** [NCSOFT/VARCO-VISION-14B-HF](https://huggingface.co/NCSOFT/VARCO-VISION-14B-HF)
35
- - **Korean VLM Benchmarks:**
36
- - You can use the following benchmark datasets with the [LMMs-Eval toolkit](https://github.com/EvolvingLMMs-Lab/lmms-eval).
37
- - [NCSOFT/K-MMBench](https://huggingface.co/datasets/NCSOFT/K-MMBench)
38
- - [NCSOFT/K-SEED](https://huggingface.co/datasets/NCSOFT/K-SEED)
39
- - [NCSOFT/K-MMStar](https://huggingface.co/datasets/NCSOFT/K-MMStar)
40
- - [NCSOFT/K-DTCBench](https://huggingface.co/datasets/NCSOFT/K-DTCBench)
41
- - [NCSOFT/K-LLaVA-W](https://huggingface.co/datasets/NCSOFT/K-LLaVA-W)
42
-
43
- - **You can also evaluate VARCO-VISION-14B with [VLMEvalKit](https://github.com/open-compass/VLMEvalKit)**.
44
- - **This model is for research purposes only. Commercial use is prohibited.**
45
-
46
- ## Uses
47
-
48
- ### Direct Use
49
-
50
- To load VARCO-VISION-14B, start by cloning and installing **LLaVA-NeXT**:
51
-
52
- ```bash
53
- git clone https://github.com/LLaVA-VL/LLaVA-NeXT
54
- cd LLaVA-NeXT
55
- pip install -e ".[train]"
56
- ```
57
-
58
- After installing **LLaVA-NeXT**, you can load VARCO-VISION-14B using the following code:
59
-
60
- ```python
61
- import torch
62
- from transformers import AutoTokenizer
63
- from llava.model.language_model.llava_qwen import LlavaQwenForCausalLM
64
- from llava.mm_utils import tokenizer_image_token, process_images
65
-
66
- model_name = "NCSOFT/VARCO-VISION-14B"
67
- tokenizer = AutoTokenizer.from_pretrained(model_name)
68
- model = LlavaQwenForCausalLM.from_pretrained(
69
- model_name,
70
- torch_dtype=torch.float16,
71
- attn_implementation="flash_attention_2",
72
- low_cpu_mem_usage=True,
73
- device_map="auto"
74
- )
75
-
76
- vision_tower = model.get_vision_tower()
77
- image_processor = vision_tower.image_processor
78
- ```
79
-
80
- Prepare an image and a text input: preprocess the image, tokenize the text, and pass the processed inputs to the model to generate a prediction.
81
-
82
- ```python
83
- import requests
84
- from PIL import Image
85
-
86
- # Define a chat history and use `apply_chat_template` to get correctly formatted prompt
87
- # Each value in "content" has to be a list of dicts with types ("text", "image")
88
- conversation = [
89
- {
90
- "role": "user",
91
- "content": [
92
- {"type": "text", "text": "Describe this image."},
93
- {"type": "image"},
94
- ],
95
- },
96
- ]
97
- prompt = tokenizer.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
98
-
99
- IMAGE_TOKEN_INDEX = -200
100
- EOS_TOKEN = "<|im_end|>"
101
- input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt")
102
- input_ids = input_ids.unsqueeze(0).to(model.device)
103
-
104
- image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
105
- raw_image = Image.open(requests.get(image_url, stream=True).raw)
106
- image_tensors = process_images([raw_image], image_processor, model.config)
107
- image_tensors = [image_tensor.half().to(model.device) for image_tensor in image_tensors]
108
- image_sizes = [raw_image.size]
109
-
110
- with torch.inference_mode():
111
- output_ids = model.generate(
112
- input_ids,
113
- images=image_tensors,
114
- image_sizes=image_sizes,
115
- do_sample=False,
116
- max_new_tokens=1024,
117
- use_cache=True,
118
- )
119
-
120
- outputs = tokenizer.batch_decode(output_ids)[0]
121
- if outputs.endswith(EOS_TOKEN):
122
- outputs = outputs[: -len(EOS_TOKEN)]
123
-
124
- outputs = outputs.strip()
125
- print(outputs)
126
- ```
127
-
128
- ### Specialized Features
129
-
130
- If a question is based on bounding boxes or requires bounding boxes as output, include the special tokens in the input text.
131
-
132
- The following special tokens are used to define specific tasks, inputs, and outputs for the model:
133
-
134
- - `<gro>`: Indicates that the model's response should include bounding box information.
135
- - `<ocr>`: Specifies OCR tasks for recognizing text within an image.
136
- - `<char>` and `</char>`: Used to mark a text phrase.
137
- - `<obj>` and `</obj>`: Used to indicate an object.
138
- - `<bbox>` and `</bbox>`: Used to represent a bounding box.
139
- - `<delim>`: Represents multiple location points for a single object or text.
140
-
141
- #### Grounding
142
- Grounding refers to a task where the model needs to identify specific locations within an image to provide an appropriate answer. To perform grounding, prepend the special token `<gro>` to the question.
143
-
144
- ```python
145
- conversation = [
146
- {
147
- "role": "user",
148
- "content": [
149
- {"type": "text", "text": "<gro>\nDescribe the image in detail."},
150
- {"type": "image"},
151
- ],
152
- },
153
- ]
154
- ```
155
-
156
- **Expected Output Example:**
157
- ```html
158
- The image shows <obj>two cats</obj><bbox>0.521, 0.049, 0.997, 0.783<delim>0.016, 0.108, 0.512, 0.99</bbox> lying on <obj>a pink blanket</obj><bbox>0.002, 0.231, 0.999, 0.999</bbox>. The cat on the left is lying on its side with its head resting on the blanket and its body stretched out. The cat on the right is lying on its back with its paws stretched out and its head turned to the side. Both cats appear relaxed and comfortable. There are also <obj>two remote controls</obj><bbox>0.039, 0.138, 0.283, 0.257<delim>0.508, 0.166, 0.581, 0.295</bbox> placed near the cats, one on each side of them.
159
- ```
160
-
161
- <img src="assets/grounding.png" alt="Grounding Example" width="400"/>
162
-
163
- #### Referring
164
-
165
- VARCO-VISION-14B can handle location-specific questions using bounding boxes. For referring tasks, build the conversation so that the object of interest is wrapped in `<obj>` and `</obj>` tags and its location is specified with `<bbox>` and `</bbox>` tags. This allows the model to understand the context and focus on the object at the specified location. A bbox is represented as (x1, y1, x2, y2): the first two values are the top-left corner and the latter two are the bottom-right corner, normalized to the image width and height.
166
-
167
- ```python
168
- conversation = [
169
- {
170
- "role": "user",
171
- "content": [
172
- {
173
- "type": "text",
174
- "text": "<obj>์ด ๋ฌผ๊ฑด</obj><bbox>0.039, 0.138, 0.283, 0.257</bbox>์€ ์–ด๋–ป๊ฒŒ ์“ฐ๋Š”๊ฑฐ์•ผ?",
175
- },
176
- {"type": "image"},
177
- ],
178
- },
179
- ]
180
- ```
181
-
182
- **Expected Output Example:**
183
- ```
184
- **์ด ๋ฌผ๊ฑด**์€ ๋ฆฌ๋ชจ์ปจ์œผ๋กœ, ์ฃผ๋กœ ํ…”๋ ˆ๋น„์ „์ด๋‚˜ ๋‹ค๋ฅธ ์ „์ž ๊ธฐ๊ธฐ๋ฅผ ์›๊ฒฉ์œผ๋กœ ์กฐ์ž‘ํ•˜๋Š” ๋ฐ ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค. ๋ฒ„ํŠผ์„ ๋ˆ„๋ฅด๋ฉด ์ฑ„๋„ ๋ณ€๊ฒฝ, ๋ณผ๋ฅจ ์กฐ์ ˆ, ์ „์› ์ผœ๊ธฐ/๋„๊ธฐ ๋“ฑ์˜ ๊ธฐ๋Šฅ์„ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋ฆฌ๋ชจ์ปจ์˜ ๋ฒ„ํŠผ์—๋Š” ์ผ๋ฐ˜์ ์œผ๋กœ ์ˆซ์ž, ๋ฉ”๋‰ด, ์„ค์ •, ์žฌ์ƒ/์ผ์‹œ์ •์ง€ ๋“ฑ์˜ ๊ธฐ๋Šฅ์ด ํฌํ•จ๋˜์–ด ์žˆ์œผ๋ฉฐ, ์‚ฌ์šฉ์ž๋Š” ์ด๋ฅผ ํ†ตํ•ด ์†์‰ฝ๊ฒŒ ๊ธฐ๊ธฐ๋ฅผ ์ œ์–ดํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
185
- ```
186
-
187
- #### OCR
188
-
189
- To perform Optical Character Recognition (OCR), use the `<ocr>` token.
190
-
191
- ```python
192
- image_file = "./assets/ocr_1.png"
193
- raw_image = Image.open(image_file)
194
-
195
- conversation = [
196
- {
197
- "role": "user",
198
- "content": [
199
- {"type": "text", "text": "<ocr>"},
200
- {"type": "image"},
201
- ],
202
- },
203
- ]
204
- ```
205
-
206
- **Expected Output Example:**
207
-
208
- ```
209
- <char>๋ฐฑ๋ฒ”๋กœ</char><bbox>0.172, 0.265, 0.328, 0.34</bbox>
210
- <char>124๋ฒˆ๊ธธ</char><bbox>0.349, 0.265, 0.512, 0.34</bbox>
211
- <char>Baekbeom-ro</char><bbox>0.171, 0.335, 0.432, 0.391</bbox>
212
- <char>124</char><bbox>0.444, 0.34, 0.508, 0.391</bbox>
213
- <char>๋งŒ์ˆ˜์ฃผ๊ณต์•„ํŒŒํŠธ</char><bbox>0.109, 0.528, 0.335, 0.594</bbox>
214
- <char>์‹œํฅ</char><bbox>0.443, 0.516, 0.522, 0.578</bbox>
215
- <char>์‹œ์ฒญ</char><bbox>0.711, 0.521, 0.811, 0.594</bbox>
216
- <char>Mansu</char><bbox>0.103, 0.601, 0.181, 0.647</bbox>
217
- <char>Jugong</char><bbox>0.186, 0.601, 0.273, 0.658</bbox>
218
- <char>Apt</char><bbox>0.281, 0.601, 0.327, 0.651</bbox>
219
- <char>42</char><bbox>0.377, 0.601, 0.416, 0.647</bbox>
220
- <char>Shieung</char><bbox>0.445, 0.578, 0.53, 0.623</bbox>
221
- <char>์ธ์ฒœ๋Œ€๊ณต์›</char><bbox>0.431, 0.623, 0.609, 0.684</bbox>
222
- <char>๋ชจ๋ž˜๋‚ด์‹œ์žฅ์—ญ</char><bbox>0.651, 0.591, 0.873, 0.664</bbox>
223
- <char>IncheonGrand</char><bbox>0.433, 0.684, 0.561, 0.723</bbox>
224
- <char>Park</char><bbox>0.564, 0.684, 0.611, 0.723</bbox>
225
- ```
226
-
227
- <img src="assets/ocr_2.jpg" alt="OCR Example" width="350"/>
228
-
229
- ## Citing the Model
230
-
231
- If you use VARCO-VISION-14B in your research, please cite the following:
232
-
233
- ```bibtex
234
- @misc{ju2024varcovisionexpandingfrontierskorean,
235
- title={VARCO-VISION: Expanding Frontiers in Korean Vision-Language Models},
236
- author={Jeongho Ju and Daeyoung Kim and SunYoung Park and Youngjune Kim},
237
- year={2024},
238
- eprint={2411.19103},
239
- archivePrefix={arXiv},
240
- primaryClass={cs.CV},
241
- url={https://arxiv.org/abs/2411.19103},
242
- }
243
  ```
 
1
+ ---
2
+ language:
3
+ - zho
4
+ - eng
5
+ - fra
6
+ - spa
7
+ - por
8
+ - deu
9
+ - ita
10
+ - rus
11
+ - jpn
12
+ - kor
13
+ - vie
14
+ - tha
15
+ - ara
16
+ license: cc-by-nc-4.0
17
+ tags:
18
+ - multimodal
19
+ - conversational
20
+ - ncsoft
21
+ - varco
22
+ base_model:
23
+ - Qwen/Qwen2.5-14B-Instruct
24
+ - google/siglip-so400m-patch14-384
25
+ library_name: transformers
26
+ pipeline_tag: image-text-to-text
27
+ ---
28
+
29
+ # VARCO-VISION-14B
30
+
31
+ ## About the Model
32
+
33
+ **VARCO-VISION-14B** is a powerful English-Korean Vision-Language Model (VLM). The training pipeline of VARCO-VISION consists of four stages: Feature Alignment Pre-training, Basic Supervised Fine-tuning, Advanced Supervised Fine-tuning, and Preference Optimization. On both multimodal and text-only benchmarks, VARCO-VISION-14B not only surpasses other models of similar size but also achieves scores comparable to those of proprietary models. The model currently accepts a single image and a text prompt as input and generates a text output. It supports grounding, referring, and OCR (Optical Character Recognition).
34
+
35
+ - **Developed by:** NC Research, Multimodal Generation Team
36
+ - **Technical Report:** [VARCO-VISION: Expanding Frontiers in Korean Vision-Language Models](https://arxiv.org/pdf/2411.19103)
37
+ - **Blog (Korean):** [VARCO-VISION Technical Report Summary](https://ncsoft.github.io/ncresearch/95ad8712e60063e9ac97538504ac3eea0ac530af)
38
+ - **Demo Page:** *The demo page is no longer available.*
39
+ - **Languages:** Korean, English
40
+ - **License:** CC BY-NC 4.0
41
+ - **Architecture:** VARCO-VISION-14B follows the architecture of [LLaVA-OneVision](https://arxiv.org/abs/2408.03326).
42
+ - **Base Model:**
43
+ - **Language Model:** [Qwen/Qwen2.5-14B-Instruct](https://huggingface.co/Qwen/Qwen2.5-14B-Instruct)
44
+ - **Vision Encoder:** [google/siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384)
45
+ - **Huggingface Version Model:** [NCSOFT/VARCO-VISION-14B-HF](https://huggingface.co/NCSOFT/VARCO-VISION-14B-HF)
46
+ - **Korean VLM Benchmarks:**
47
+ - You can use the following benchmark datasets with the [LMMs-Eval toolkit](https://github.com/EvolvingLMMs-Lab/lmms-eval); a minimal loading sketch appears after this list.
48
+ - [NCSOFT/K-MMBench](https://huggingface.co/datasets/NCSOFT/K-MMBench)
49
+ - [NCSOFT/K-SEED](https://huggingface.co/datasets/NCSOFT/K-SEED)
50
+ - [NCSOFT/K-MMStar](https://huggingface.co/datasets/NCSOFT/K-MMStar)
51
+ - [NCSOFT/K-DTCBench](https://huggingface.co/datasets/NCSOFT/K-DTCBench)
52
+ - [NCSOFT/K-LLaVA-W](https://huggingface.co/datasets/NCSOFT/K-LLaVA-W)
53
+
54
+ - **You can also evaluate VARCO-VISION-14B with [VLMEvalKit](https://github.com/open-compass/VLMEvalKit)**.
55
+ - **This model is for research purposes only. Commercial use is prohibited.**
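+
+ The Korean benchmarks above are hosted as Hugging Face datasets, so they can be inspected directly before being wired into an evaluation harness. The snippet below is a minimal sketch, assuming the dataset loads with its default configuration; check each dataset card for the exact splits and fields.
+
+ ```python
+ from datasets import load_dataset
+
+ # Inspect one of the Korean VLM benchmarks (split names vary per dataset).
+ benchmark = load_dataset("NCSOFT/K-MMBench")
+ print(benchmark)  # available splits and their sizes
+
+ first_split = next(iter(benchmark))
+ print(benchmark[first_split][0].keys())  # fields of a single example
+ ```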
56
+
57
+ ## Uses
58
+
59
+ ### Direct Use
60
+
61
+ To load VARCO-VISION-14B, start by cloning and installing **LLaVA-NeXT**:
62
+
63
+ ```bash
64
+ git clone https://github.com/LLaVA-VL/LLaVA-NeXT
65
+ cd LLaVA-NeXT
66
+ pip install -e ".[train]"
67
+ ```
68
+
69
+ After installing **LLaVA-NeXT**, you can load VARCO-VISION-14B using the following code:
70
+
71
+ ```python
72
+ import torch
73
+ from transformers import AutoTokenizer
74
+ from llava.model.language_model.llava_qwen import LlavaQwenForCausalLM
75
+ from llava.mm_utils import tokenizer_image_token, process_images
76
+
77
+ model_name = "NCSOFT/VARCO-VISION-14B"
78
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
79
+ model = LlavaQwenForCausalLM.from_pretrained(
80
+ model_name,
81
+ torch_dtype=torch.float16,
82
+ attn_implementation="flash_attention_2",
83
+ low_cpu_mem_usage=True,
84
+ device_map="auto"
85
+ )
86
+
87
+ vision_tower = model.get_vision_tower()
88
+ image_processor = vision_tower.image_processor
89
+ ```
90
+
91
+ Prepare an image and a text input: preprocess the image, tokenize the text, and pass the processed inputs to the model to generate a prediction.
92
+
93
+ ```python
94
+ import requests
95
+ from PIL import Image
96
+
97
+ # Define a chat history and use `apply_chat_template` to get correctly formatted prompt
98
+ # Each value in "content" has to be a list of dicts with types ("text", "image")
99
+ conversation = [
100
+ {
101
+ "role": "user",
102
+ "content": [
103
+ {"type": "text", "text": "Describe this image."},
104
+ {"type": "image"},
105
+ ],
106
+ },
107
+ ]
108
+ prompt = tokenizer.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
109
+
110
+ IMAGE_TOKEN_INDEX = -200
111
+ EOS_TOKEN = "<|im_end|>"
112
+ input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt")
113
+ input_ids = input_ids.unsqueeze(0).to(model.device)
114
+
115
+ image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
116
+ raw_image = Image.open(requests.get(image_url, stream=True).raw)
117
+ image_tensors = process_images([raw_image], image_processor, model.config)
118
+ image_tensors = [image_tensor.half().to(model.device) for image_tensor in image_tensors]
119
+ image_sizes = [raw_image.size]
120
+
121
+ with torch.inference_mode():
122
+ output_ids = model.generate(
123
+ input_ids,
124
+ images=image_tensors,
125
+ image_sizes=image_sizes,
126
+ do_sample=False,
127
+ max_new_tokens=1024,
128
+ use_cache=True,
129
+ )
130
+
131
+ outputs = tokenizer.batch_decode(output_ids)[0]
132
+ if outputs.endswith(EOS_TOKEN):
133
+ outputs = outputs[: -len(EOS_TOKEN)]
134
+
135
+ outputs = outputs.strip()
136
+ print(outputs)
137
+ ```
138
+
139
+ ### Specialized Features
140
+
141
+ If a question is based on bounding boxes or requires bounding boxes as output, include the special tokens in the input text.
142
+
143
+ The following special tokens are used to define specific tasks, inputs, and outputs for the model:
144
+
145
+ - `<gro>`: Indicates that the model's response should include bounding box information.
146
+ - `<ocr>`: Specifies OCR tasks for recognizing text within an image.
147
+ - `<char>` and `</char>`: Used to mark a text phrase.
148
+ - `<obj>` and `</obj>`: Used to indicate an object.
149
+ - `<bbox>` and `</bbox>`: Used to represent a bounding box.
150
+ - `<delim>`: Represents multiple location points for a single object or text.
151
+
152
+ #### Grounding
153
+ Grounding refers to a task where the model needs to identify specific locations within an image to provide an appropriate answer. To perform grounding, prepend the special token `<gro>` to the question.
154
+
155
+ ```python
156
+ conversation = [
157
+ {
158
+ "role": "user",
159
+ "content": [
160
+ {"type": "text", "text": "<gro>\nDescribe the image in detail."},
161
+ {"type": "image"},
162
+ ],
163
+ },
164
+ ]
165
+ ```
166
+
167
+ **Expected Output Example:**
168
+ ```html
169
+ The image shows <obj>two cats</obj><bbox>0.521, 0.049, 0.997, 0.783<delim>0.016, 0.108, 0.512, 0.99</bbox> lying on <obj>a pink blanket</obj><bbox>0.002, 0.231, 0.999, 0.999</bbox>. The cat on the left is lying on its side with its head resting on the blanket and its body stretched out. The cat on the right is lying on its back with its paws stretched out and its head turned to the side. Both cats appear relaxed and comfortable. There are also <obj>two remote controls</obj><bbox>0.039, 0.138, 0.283, 0.257<delim>0.508, 0.166, 0.581, 0.295</bbox> placed near the cats, one on each side of them.
170
+ ```
171
+
172
+ <img src="assets/grounding.png" alt="Grounding Example" width="400"/>
173
+
174
+ #### Referring
175
+
176
+ VARCO-VISION-14B can handle location-specific questions using bounding boxes. For referring tasks, build the conversation so that the object of interest is wrapped in `<obj>` and `</obj>` tags and its location is specified with `<bbox>` and `</bbox>` tags. This allows the model to understand the context and focus on the object at the specified location. A bbox is represented as (x1, y1, x2, y2): the first two values are the top-left corner and the latter two are the bottom-right corner, normalized to the image width and height.
177
+
178
+ ```python
179
+ conversation = [
180
+ {
181
+ "role": "user",
182
+ "content": [
183
+ {
184
+ "type": "text",
185
+ "text": "<obj>์ด ๋ฌผ๊ฑด</obj><bbox>0.039, 0.138, 0.283, 0.257</bbox>์€ ์–ด๋–ป๊ฒŒ ์“ฐ๋Š”๊ฑฐ์•ผ?",
186
+ },
187
+ {"type": "image"},
188
+ ],
189
+ },
190
+ ]
191
+ ```
192
+
193
+ **Expected Output Example:**
194
+ ```
195
+ **์ด ๋ฌผ๊ฑด**์€ ๋ฆฌ๋ชจ์ปจ์œผ๋กœ, ์ฃผ๋กœ ํ…”๋ ˆ๋น„์ „์ด๋‚˜ ๋‹ค๋ฅธ ์ „์ž ๊ธฐ๊ธฐ๋ฅผ ์›๊ฒฉ์œผ๋กœ ์กฐ์ž‘ํ•˜๋Š” ๋ฐ ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค. ๋ฒ„ํŠผ์„ ๋ˆ„๋ฅด๋ฉด ์ฑ„๋„ ๋ณ€๊ฒฝ, ๋ณผ๋ฅจ ์กฐ์ ˆ, ์ „์› ์ผœ๊ธฐ/๋„๊ธฐ ๋“ฑ์˜ ๊ธฐ๋Šฅ์„ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ๋ฆฌ๋ชจ์ปจ์˜ ๋ฒ„ํŠผ์—๋Š” ์ผ๋ฐ˜์ ์œผ๋กœ ์ˆซ์ž, ๋ฉ”๋‰ด, ์„ค์ •, ์žฌ์ƒ/์ผ์‹œ์ •์ง€ ๋“ฑ์˜ ๊ธฐ๋Šฅ์ด ํฌํ•จ๋˜์–ด ์žˆ์œผ๋ฉฐ, ์‚ฌ์šฉ์ž๋Š” ์ด๋ฅผ ํ†ตํ•ด ์†์‰ฝ๊ฒŒ ๊ธฐ๊ธฐ๋ฅผ ์ œ์–ดํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
196
+ ```
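+
+ If the region of interest is only known in pixel coordinates, the reference string can be assembled by normalizing against the image size. The helper below is a minimal sketch, not part of the model's API; the label, box, and image size are made up for illustration.
+
+ ```python
+ def make_reference(label: str, box_px, image_size) -> str:
+     """Format an <obj>/<bbox> reference from a pixel-space box (x1, y1, x2, y2)."""
+     width, height = image_size  # PIL's Image.size is (width, height)
+     x1, y1, x2, y2 = box_px
+     normalized = (x1 / width, y1 / height, x2 / width, y2 / height)
+     return f"<obj>{label}</obj><bbox>" + ", ".join(f"{v:.3f}" for v in normalized) + "</bbox>"
+
+ # Example: reference an object covering pixels (100, 120, 300, 360) in a 640x480 image.
+ reference = make_reference("this object", (100, 120, 300, 360), (640, 480))
+ question = reference + " How is this object used?"
+ print(question)
+ ```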
197
+
198
+ #### OCR
199
+
200
+ To perform Optical Character Recognition (OCR), use the `<ocr>` token.
201
+
202
+ ```python
203
+ image_file = "./assets/ocr_1.png"
204
+ raw_image = Image.open(image_file)
205
+
206
+ conversation = [
207
+ {
208
+ "role": "user",
209
+ "content": [
210
+ {"type": "text", "text": "<ocr>"},
211
+ {"type": "image"},
212
+ ],
213
+ },
214
+ ]
215
+ ```
216
+
217
+ **Expected Output Example:**
218
+
219
+ ```
220
+ <char>๋ฐฑ๋ฒ”๋กœ</char><bbox>0.172, 0.265, 0.328, 0.34</bbox>
221
+ <char>124๋ฒˆ๊ธธ</char><bbox>0.349, 0.265, 0.512, 0.34</bbox>
222
+ <char>Baekbeom-ro</char><bbox>0.171, 0.335, 0.432, 0.391</bbox>
223
+ <char>124</char><bbox>0.444, 0.34, 0.508, 0.391</bbox>
224
+ <char>๋งŒ์ˆ˜์ฃผ๊ณต์•„ํŒŒํŠธ</char><bbox>0.109, 0.528, 0.335, 0.594</bbox>
225
+ <char>์‹œํฅ</char><bbox>0.443, 0.516, 0.522, 0.578</bbox>
226
+ <char>์‹œ์ฒญ</char><bbox>0.711, 0.521, 0.811, 0.594</bbox>
227
+ <char>Mansu</char><bbox>0.103, 0.601, 0.181, 0.647</bbox>
228
+ <char>Jugong</char><bbox>0.186, 0.601, 0.273, 0.658</bbox>
229
+ <char>Apt</char><bbox>0.281, 0.601, 0.327, 0.651</bbox>
230
+ <char>42</char><bbox>0.377, 0.601, 0.416, 0.647</bbox>
231
+ <char>Shieung</char><bbox>0.445, 0.578, 0.53, 0.623</bbox>
232
+ <char>์ธ์ฒœ๋Œ€๊ณต์›</char><bbox>0.431, 0.623, 0.609, 0.684</bbox>
233
+ <char>๋ชจ๋ž˜๋‚ด์‹œ์žฅ์—ญ</char><bbox>0.651, 0.591, 0.873, 0.664</bbox>
234
+ <char>IncheonGrand</char><bbox>0.433, 0.684, 0.561, 0.723</bbox>
235
+ <char>Park</char><bbox>0.564, 0.684, 0.611, 0.723</bbox>
236
+ ```
237
+
238
+ <img src="assets/ocr_2.jpg" alt="OCR Example" width="350"/>
239
+
240
+ ## Citing the Model
241
+
242
+ If you use VARCO-VISION-14B in your research, please cite the following:
243
+
244
+ ```bibtex
245
+ @misc{ju2024varcovisionexpandingfrontierskorean,
246
+ title={VARCO-VISION: Expanding Frontiers in Korean Vision-Language Models},
247
+ author={Jeongho Ju and Daeyoung Kim and SunYoung Park and Youngjune Kim},
248
+ year={2024},
249
+ eprint={2411.19103},
250
+ archivePrefix={arXiv},
251
+ primaryClass={cs.CV},
252
+ url={https://arxiv.org/abs/2411.19103},
253
+ }
254
  ```