EraX-VL-7B-V1.5
Introduction
Hot on the heels of the popular EraX-VL-7B-V1.0 model, we proudly present EraX-VL-7B-V1.5, another robust multimodal model for OCR (optical character recognition) and VQA (visual question answering) that excels in various languages, with a particular focus on Vietnamese 🇻🇳. This model stands out for its precise recognition across a range of documents, including medical forms 🩺, invoices 🧾, bills of sale, quotes, and medical records. This functionality is expected to be highly beneficial for hospitals 🏥, clinics, insurance companies 🛡️, and similar applications. Built on the solid foundation of Qwen/Qwen2-VL-7B-Instruct [1], which we found to be of high quality and fluent in Vietnamese, EraX-VL-7B-V1.5 has been fine-tuned to further enhance its performance. We plan to keep improving the model and releasing new versions for free, along with performance benchmarks, in the near future.
One standout feature of EraX-VL-7B-V1.5 is its ability to handle multi-turn Q&A with impressive reasoning!
NOTA BENE:
- EraX-VL-7B-V1.5 is NOT a typical OCR-only tool like Tesseract; it is a multimodal LLM-based model. To use it effectively, you may have to tailor your prompt carefully to the task at hand (see the sketch after this list).
- This model was NOT fine-tuned on medical (X-ray) datasets or car-accident imagery (yet). Stay tuned for an updated version coming in early 2025.
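For instance, an extraction task, a captioning task, and a VQA task each tend to respond best to different instructions. Below is a minimal sketch of task-specific prompts; only the first wording is taken from the Quickstart further down, the others are illustrative and should be tuned for your own documents:

```python
# Illustrative task-specific prompts (Vietnamese, with English glosses).
TASK_PROMPTS = {
    # "Extract the information from the provided image." (used in the Quickstart)
    "ocr_extraction": "Trích xuất thông tin nội dung từ hình ảnh được cung cấp.",
    # "Describe the content of the image in detail." (hypothetical wording)
    "captioning": "Mô tả chi tiết nội dung của hình ảnh.",
    # "Based on the image, answer the following question." (hypothetical wording)
    "vqa": "Dựa vào hình ảnh, hãy trả lời câu hỏi sau: {question}",
}

prompt = TASK_PROMPTS["vqa"].format(question="Hãy cho biết thông tin bệnh nhân")
```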
EraX-VL-7B-V1.5 is a young member of EraX's LànhGPT collection of LLM models.
- Developed by:
- Nguyễn Anh Nguyên ([email protected])
- Nguyễn Hồ Nam (BCG)
- Phạm Huỳnh Nhật ([email protected])
- Phạm Đình Thục ([email protected])
- Funded by: Bamboo Capital Group and EraX
- Model type: Multimodal Transformer with over 7B parameters
- Languages (NLP): Primarily Vietnamese with multilingual capabilities
- License: Apache 2.0
- Fine-tuned from: Qwen/Qwen2-VL-7B-Instruct
- Prompt examples: Some popular prompt examples are available on GitHub.
Benchmarks
Leaderboard
EraX-VL-7B-V1.5 achieves exceptionally high performance compared with models of the same size, and even with models ten times larger, and it is open-source! You can re-run the benchmark at any time.
| Models | Open-Source | VI-MTVQA |
|---|---|---|
| EraX-VL-7B-V1.5 🥇 | ✅ | 47.2 |
| Qwen2-VL 72B 🥈 | ✅ | 41.6 |
| ViGPT-VL 🥉 | ❌ | 39.1 |
| EraX-VL-2B-V1.5 | ✅ | 38.2 |
| EraX-VL-7B-V1 | ✅ | 37.6 |
| Vintern-1B-V2 | ✅ | 37.4 |
| Qwen2-VL 7B | ✅ | 30.0 |
| Claude3 Opus | ❌ | 29.1 |
| GPT-4o mini | ❌ | 29.1 |
| GPT-4V | ❌ | 28.9 |
| Gemini Ultra | ❌ | 28.6 |
| InternVL2 76B | ✅ | 26.9 |
| QwenVL Max | ❌ | 23.5 |
| Claude3 Sonnet | ❌ | 20.8 |
| QwenVL Plus | ❌ | 18.1 |
| MiniCPM-V2.5 | ✅ | 15.3 |
The test code for evaluating the models above can be found at: EraX-JS-Company/EraX-MTVQA-Benchmark
API trial
Please contact [email protected] for API access inquiries.
Examples 🧩
1. OCR - Optical Character Recognition for Multiple Images
Example 01: Citizen Identification Card

Front View

Back View
Source: Google Support
{
  "Số thẻ": "037094012351",
  "Họ và tên": "TRỊNH QUANG DUY",
  "Ngày sinh": "04/09/1994",
  "Giới tính": "Nam",
  "Quốc tịch": "Việt Nam",
  "Quê quán": "Tân Thành, Kim Sơn, Ninh Bình",
  "Nơi thường trú": "Xóm 6\nTân Thành, Kim Sơn, Ninh Bình",
  "Có giá trị đến": "04/09/2034",
  "Đặc điểm nhân dạng": "sẹo chấm c. 1cm trên đuôi mắt trái",
  "Nơi cấp": "CỤC TRƯỞNG CỤC CẢNH SÁT\nQUẢN LÝ HÀNH CHÍNH VỀ TRẬT TỰ XÃ HỘI",
  "Ngày cấp": "10/12/2022",
  "Cán bộ ký tên": "Nguyễn Quốc Hùng",
  "Mã định danh": "IDVNM0940123513037094012351"
}
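The model usually answers extraction prompts with a JSON object like the one above, but it may occasionally wrap it in extra prose. A minimal best-effort parsing sketch; `parse_model_json` is a hypothetical helper of our own, and `output_text` refers to the decoded answer from the Quickstart below:

```python
import json

def parse_model_json(answer: str) -> dict:
    """Cut out the outermost {...} span of the answer and parse it as JSON."""
    start, end = answer.find("{"), answer.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in model output")
    return json.loads(answer[start:end + 1])

fields = parse_model_json(output_text[0])
print(fields["Số thẻ"])  # e.g. "037094012351"
```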
Example 02: Driver's License

Front View

Back View
Source: Báo Pháp luật
{
  "No.": "400116012313",
  "Fullname": "NGUYỄN VĂN DŨNG",
  "Date_of_birth": "08/06/1979",
  "Nationality": "VIỆT NAM",
  "Address": "X. Quỳnh Hầu, H. Quỳnh Lưu, T. Nghệ An\nNghệ An, ngày/date 23 tháng/month 04 năm/year 2022",
  "Hang_Class": "FC",
  "Expires": "23/04/2027",
  "Place_of_issue": "Nghệ An",
  "Date_of_issue": "ngày/date 23 tháng/month 04 năm/year 2022",
  "Signer": "Trần Anh Tuấn",
  "Các loại xe được phép": "Ô tô hạng C kéo rơmoóc, đầu kéo kéo sơmi rơmoóc và xe hạng B1, B2, C, FB2 (Motor vehicle of class C with a trailer, semi-trailer truck and vehicles of classes B1, B2, C, FB2)",
  "Mã số": ""
}
Example 03: Vehicle Registration Certificate

Source: Báo Vietnamnet
{
  "Tên chủ xe": "NGUYỄN TÔN NHUẬN",
  "Địa chỉ": "KE27 Kp3 P.TTTây Q7",
  "Nhãn hiệu": "HONDA",
  "Số loại": "DYLAN",
  "Màu sơn": "Trắng",
  "Năm sản xuất": "2012",
  "Số máy": "F03E-0057735",
  "Số khung": "SA04F-070410",
  "Dung tích": "152",
  "Số chỗ ngồi": "02",
  "Biển số đăng ký": "59V1-498.89",
  "Đăng ký lần đầu ngày": "08/06/2004",
  "Chức vụ": "Thượng tá",
  "Người ký": "Trần Văn Hiệu"
}
Example 04: Vehicle Inspection Certificate

Source: https://llumar.com.vn
{
  "vehicle": {
    "registration_number": "30A-072.36",
    "vehicle_inspection_number": "2903V-093515",
    "type": "ô tô con",
    "mark": "MERCEDES-BENZ",
    "model_code": "C300 W204",
    "engine_number": "27294732096079",
    "chassis_number": "RLMGF5EX3DV005333",
    "manufactured_year_and_country": "2013, Việt Nam",
    "life_time_limit_to": "",
    "commercial_use": "",
    "modification": ""
  },
  "specifications": {
    "wheel_formula": "4x2",
    "wheel_tread": "1521/1512 (mm)",
    "overall_dimension": "4650 x 1770 x 1429 (mm)",
    "largest_luggage_container_dimension": "",
    "wheelbase": "2760 (mm)",
    "kerb_mass": "1575 (kg)",
    "design_authorized_pay_load": "",
    "design_authorized_total_mass": "2090/2090 (kg)",
    "design_authorized_towed_mass": "",
    "permissible_number_of_pers_carried": "5 chỗ ngồi, 0 chỗ đứng, 0 chỗ nằm",
    "type_of_fuel_used": "Xăng",
    "engine_displacement": "2996 (cm3)",
    "max_output_per_rpm": "170(kW)/6000vph",
    "number": "KC-1292285"
  },
  "inspection_report_number": "2905V-20953/16",
  "valid_until": "31/01/2018",
  "place_date_of_issue": "Hà Nội, ngày 1 tháng 8 năm 2016",
  "inspection_center": "ĐƠN VỊ KIỂM ĐỊNH XE CƠ GIỚI",
  "signature": "Ngọc Tuấn",
  "equipped_with_tachograph": "",
  "inspection_stamp_was_not_issued": "",
  "notes": "Biển đăng ký nền trắng"
}
Example 05: Receipt

Source: https://tintucketoan.com/
{
  'Mẫu số': '01GKTKT3/001',
  'Ký hiệu': 'TC/18P',
  'Số': '0000030',
  'Họ tên người mua hàng': None,
  'Tên đơn vị': 'Công Ty TNHH Kế Toán Hà Nội',
  'Mã số thuế': '0106235869',
  'Địa chỉ': 'Số 49 Ngõ 322 Lê Trọng Tấn, phường Khương Mai, quận Thanh Xuân, Hà Nội',
  'Hình thức thanh toán': 'TM',
  'STT': None,
  'Tên hàng hóa, dịch vụ': 'Tra cứu phần mềm thư viện pháp luật trực tuyến',
  'Đơn vị tính': None,
  'Số lượng': None,
  'Đơn giá': '168.000',
  'Thành tiền': '2.016.000',
  'Thuế suất GTGT': None,
  'Tiền thuế GTGT': None,
  'Tổng cộng tiền thanh toán': '2.016.000',
  'Số tiền viết bằng chữ': 'Hai triệu, không trăm mười sáu nghìn đồng',
  'Người bán hàng': 'Bùi Văn Hùng',
  'Chức vụ người bán hàng': 'TRƯỞNG CHI NHÁNH'
}
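Note that this receipt example comes back as a Python dict literal (single quotes, None) rather than strict JSON. If you receive output in this shape, a small sketch like the following can normalize it, assuming the answer parses as a literal; `output_text` again refers to the decoded answer from the Quickstart below:

```python
import ast
import json

# ast.literal_eval accepts single quotes and None, unlike json.loads.
record = ast.literal_eval(output_text[0])
# Re-serialize as strict JSON, keeping Vietnamese characters readable.
print(json.dumps(record, ensure_ascii=False, indent=2))
```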
2.1 Image Captioning

The image is a BMI-for-age chart showing the relationship between body mass index (BMI) and a child's age. The chart is divided into color-coded zones corresponding to different BMI ranges:
- Red zone: BMI above 25, indicating obesity.
- Yellow zone: BMI between 18 and 25, indicating a risk of obesity.
- Light green zone: BMI between 16 and 18, indicating good nutritional health.
- Dark green zone: BMI below 16, indicating underweight.
The vertical axis shows BMI and the horizontal axis shows age (in years). The dark gray curves are the BMI-for-age reference lines; they show how BMI changes over time and help assess a child's weight development. For example, in children under 3, BMI usually falls within the underweight or good-nutritional-health zones. As children grow older, BMI tends to increase gradually, though the rate of growth varies from child to child. The chart also displays percentiles, which indicate the share of children of the same age with a lower or higher BMI. This allows a child's BMI to be compared against international standards.
2.2 Image Captioning
The photo captures a moving and tragic scene, apparently set in an impoverished area, possibly a settlement or slum. At the center of the image is a cart pulled by a donkey. On the cart is a group of people: an older man who appears to be leading the group, a woman in a black cloak, a young woman in gray, a small girl whose face is covered by a headscarf, and a boy in a green shirt. They appear to be moving from a devastated area to somewhere else. Around them are debris, rubbish, and ruined buildings, suggesting a harsh and deprived living environment. Behind them, several other people are walking or cycling, some carrying luggage. The sky is clear blue, but the overall mood of the image is heavy and sorrowful. This may be a scene of migration, of refugees, or of some catastrophic event.
3. Multi-turn Visual Question Answering and Reasoning

Source: https://kenh14.vn
User: Please provide the patient's information.
Assistant: Name: NGUYỄN THỊ THÁI TRINH; Date of birth: 09/09/1993; Gender: Female; Address: TÂN ĐỊNH - Q1 - TPHCM; Health-insurance card code: 19169080; Examination number: 2569501; Area code (K1/K2/K3): 2
User: What disease was she diagnosed with?
Assistant: Bacterial intestinal infection, unspecified species
User: Does the patient have any medical history? If so, what is it?
Assistant: Calcium deficiency due to diet
User: Which laboratory tests did the patient undergo?
Assistant: [NQ286] Complete peripheral blood count (laser counter), [NQ286] Electrolyte panel (Na, K, Cl) [Blood], [NQ286] ALT (GPT) activity [Blood], [NQ286] AST (GOT) activity [Blood], [NQ286] Total calcium [Blood], [NQ286] Creatinine [Blood], [NQ286] Glucose [Blood], [NQ286] Blood urea [Blood], [NQ286] Abdominal ultrasound (liver and biliary tract, pancreas, spleen, kidneys, bladder)
User: How much does the patient have to pay for medical supplies?
Assistant: 19,708
User: Which infusion drugs does the patient have to pay for, and what is the total amount?
Assistant: Atropin Sulfat (Syv2019), Bicefzidim 1g (Syv2019), GONCAL (Syv2019), Lactated Ringer's-500ml (Syv2019), Nước cất pha tiêm 5ml (Syv2019), Sodium Chloride 0.9% -500ml (Syv2019), Vincomid (Syv2019), Vinopa (Syv2019); total: 45,234 VND
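To reproduce a multi-turn exchange like the one above, append each assistant answer back into the message list so the model keeps the full context. A minimal sketch, reusing `model`, `processor`, `process_vision_info`, and `base64_data` from the Quickstart below; the `ask` helper is our own, not part of the model's API:

```python
def ask(messages, max_new_tokens=1024):
    """Run one generation step over the accumulated conversation."""
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                       padding=True, return_tensors="pt").to(model.device)
    generated = model.generate(**inputs, max_new_tokens=max_new_tokens)
    trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated)]
    return processor.batch_decode(trimmed, skip_special_tokens=True)[0]

messages = [{"role": "user", "content": [
    {"type": "image", "image": base64_data},
    {"type": "text", "text": "Hãy cho biết thông tin bệnh nhân"},  # first question
]}]
answer = ask(messages)

# Feed the answer back before asking the follow-up question.
messages.append({"role": "assistant", "content": [{"type": "text", "text": answer}]})
messages.append({"role": "user", "content": [
    {"type": "text", "text": "Họ được chẩn đoán bị bệnh gì?"},  # follow-up question
]})
print(ask(messages))
```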
Quickstart 🔮
Install the necessary packages:
python -m pip install git+https://github.com/huggingface/transformers accelerate
python -m pip install qwen-vl-utils
python -m pip install flash-attn --no-build-isolation  # optional, needed only for flash_attention_2
Then you can use EraX-VL-7B-V1.5 like this:
import base64
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
model_path = "erax/EraX-VL-7B-V1.5"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="eager",  # use "flash_attention_2" on Ampere or newer GPUs
    device_map="auto",
)
# Bound the visual token budget: fewer image tokens saves memory,
# more image tokens preserves detail.
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    model_path,
    min_pixels=min_pixels,
    max_pixels=max_pixels,
)
# Encode the input image as a base64 data URI
image_path = "image.jpg"
with open(image_path, "rb") as f:
    encoded_image = base64.b64encode(f.read())
decoded_image_text = encoded_image.decode("utf-8")
base64_data = f"data:image;base64,{decoded_image_text}"
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": base64_data,
            },
            {
                "type": "text",
                # "Extract the information from the provided image."
                "text": "Trích xuất thông tin nội dung từ hình ảnh được cung cấp.",
            },
        ],
    }
]
# Prepare prompt
tokenized_text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[tokenized_text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")
# Generation configs
generation_config = model.generation_config
generation_config.do_sample = True
generation_config.temperature = 1.0
generation_config.top_k = 1
generation_config.top_p = 0.9
generation_config.min_p = 0.1
generation_config.max_new_tokens = 2048
generation_config.repetition_penalty = 1.06
# Inference
generated_ids = model.generate(**inputs, generation_config=generation_config)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
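Section 1 shows extraction over both sides of a document, while the Quickstart sends a single image. Qwen2-VL-style message lists accept several image entries per turn; a minimal sketch, assuming two local files front.jpg and back.jpg (hypothetical names) and an illustrative prompt:

```python
def to_base64_uri(path):
    """Read a local image file and wrap it as a base64 data URI."""
    with open(path, "rb") as f:
        return "data:image;base64," + base64.b64encode(f.read()).decode("utf-8")

messages = [{"role": "user", "content": [
    {"type": "image", "image": to_base64_uri("front.jpg")},
    {"type": "image", "image": to_base64_uri("back.jpg")},
    # "Extract the information from both sides of the document." (illustrative)
    {"type": "text", "text": "Trích xuất thông tin nội dung từ hai mặt của giấy tờ."},
]}]
# Then reuse the same apply_chat_template / process_vision_info /
# generate / batch_decode steps shown above.
```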
References
[1] Qwen team. Qwen2-VL. 2024.
[2] Bai, Jinze, et al. "Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond." arXiv preprint arXiv:2308.12966 (2023).
[3] Yang, An, et al. "Qwen2 Technical Report." arXiv preprint arXiv:2407.10671 (2024).
[4] Chen, Zhe, et al. "InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.
[5] Chen, Zhe, et al. "How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites." arXiv preprint arXiv:2404.16821 (2024).
[6] Tran, Chi, and Huong Le Thanh. "LaVy: Vietnamese Multimodal Large Language Model." arXiv preprint arXiv:2404.07922 (2024).
Contact 🤝
- For correspondence regarding this work or inquiries about API trials, please contact Nguyễn Anh Nguyên at [email protected].
- Follow us on EraX GitHub