EraX-VL-7B-V1.5

Introduction 🎉

Hot on the heels of the popular EraX-VL-7B-V1.0 model, we proudly present EraX-VL-7B-V1.5, another robust multimodal model for OCR (optical character recognition) and VQA (visual question answering) that excels in various languages 🌍, with a particular focus on Vietnamese 🇻🇳. This model stands out for its precise recognition capabilities across a range of documents 📝, including medical forms 🩺, invoices 🧾, bills of sale 💳, quotes 📄, and medical records 💊. This functionality is expected to be highly beneficial for hospitals 🏥, clinics 💉, insurance companies 🛡️, and other similar applications 📋. Built on the solid foundation of Qwen/Qwen2-VL-7B-Instruct [1], which we found to be of high quality and fluent in Vietnamese, EraX-VL-7B-V1.5 has been fine-tuned to enhance its performance. We plan to continue improving and releasing new versions for free, along with sharing performance benchmarks, in the near future.

One standout feature of EraX-VL-7B-V1.5 is its ability to handle multi-turn Q&A with impressive reasoning!

NOTA BENE:

  • EraX-VL-7B-V1.5 is NOT a typical OCR-only tool like Tesseract; it is a multimodal LLM-based model. To use it effectively, you may have to tailor your prompt carefully to your task (see the illustrative prompts after this list).
  • This model has NOT been fine-tuned on medical (X-ray) or car-accident datasets (yet). Stay tuned for an updated version coming in early 2025.
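
For illustration only, here are the kinds of task-specific prompts used in the examples below. The exact wording is ours and should be adapted freely to your task; these are not fixed, model-specific commands:

# Illustrative prompts (Vietnamese) -- adapt freely to your task.

# OCR / structured extraction: "Extract the information content from the provided image."
prompt_ocr = "Trích xuất thông tin nội dung từ hình ảnh được cung cấp."

# Image captioning: "Describe the content of the image in detail."
prompt_caption = "Hãy mô tả chi tiết nội dung của hình ảnh."

# Visual question answering: "What disease has the patient been diagnosed with?"
prompt_vqa = "Bệnh nhân được chẩn đoán bị bệnh gì?"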

EraX-VL-7B-V1.5 is a young member of EraX's LànhGPT collection of LLM models.

Benchmarks 📊

๐Ÿ† LeaderBoard

EraX-VL-7B-V1.5 achieves exceptionally high performance compared with other models of the same size, and even with models 10 times larger, and it is open-source! You can re-run the benchmark at any time.

Models                  Open-Source   VI-MTVQA
EraX-VL-7B-V1.5 🥇      ✅            47.2
Qwen2-VL 72B 🥈         ✘             41.6
ViGPT-VL 🥉             ✘             39.1
EraX-VL-2B-V1.5         ✅            38.2
EraX-VL-7B-V1           ✅            37.6
Vintern-1B-V2           ✅            37.4
Qwen2-VL 7B             ✅            30.0
Claude3 Opus            ✘             29.1
GPT-4o mini             ✘             29.1
GPT-4V                  ✘             28.9
Gemini Ultra            ✘             28.6
InternVL2 76B           ✅            26.9
QwenVL Max              ✘             23.5
Claude3 Sonnet          ✘             20.8
QwenVL Plus             ✘             18.1
MiniCPM-V2.5            ✅            15.3

The test code used to evaluate these models can be found at: EraX-JS-Company/EraX-MTVQA-Benchmark

API trial 🎉

Please contact [email protected] for API access inquiries.

Examples 🧩

1. OCR - Optical Character Recognition for Multiple Images

Example 01: Citizen identification card

Front View

Back View

Source: Google Support

{
    "Số thẻ": "037094012351",
    "Họ và tên": "TRỊNH QUANG DUY",
    "Ngày sinh": "04/09/1994",
    "Giới tính": "Nam",
    "Quốc tịch": "Việt Nam",
    "Quê quán": "Tân Thành, Kim Sơn, Ninh Bình",
    "Nơi thường trú": "Xóm 6\nTân Thành, Kim Sơn, Ninh Bình",
    "Có giá trị đến": "04/09/2034",
    "Đặc điểm nhân dạng": "sẹo chấm c. 1cm trên đuôi mắt trái",
    "Nơi cấp": "CỤC TRƯỞNG CỤC CẢNH SÁT\nQUẢN LÝ HÀNH CHÍNH VỀ TRẬT TỰ XÃ HỘI",
    "Ngày cấp": "10/12/2022",
    "Cán bộ ký tên": "Nguyễn Quốc Hùng",
    "Mã định danh": "IDVNM0940123513037094012351"
}

Example 02: Driver's License

Front View

Back View

Source: Báo Pháp luật

{
  "No.": "400116012313",
  "Fullname": "NGUYỄN VĂN DŨNG",
  "Date_of_birth": "08/06/1979",
  "Nationality": "VIỆT NAM",
  "Address": "X. Quỳnh Hầu, H. Quỳnh Lưu, T. Nghệ An\nNghệ An, ngày/date 23 tháng/month 04 năm/year 2022",
  "Hang_Class": "FC",
  "Expires": "23/04/2027",
  "Place_of_issue": "Nghệ An",
  "Date_of_issue": "ngày/date 23 tháng/month 04 năm/year 2022",
  "Signer": "Trần Anh Tuấn",
  "Các loại xe được phép": "Ô tô hạng C kéo rơmoóc, đầu kéo kéo sơmi rơmoóc và xe hạng B1, B2, C, FB2 (Motor vehicle of class C with a trailer, semi-trailer truck and vehicles of classes B1, B2, C, FB2)",
  "Mã số": ""
}

Example 03: Vehicle Registration Certificate

Source: Báo Vietnamnet

{
    "Tên chủ xe": "NGUYỄN TÔN NHUẬN",
    "Địa chỉ": "KE27 Kp3 P.TTTây Q7",
    "Nhãn hiệu": "HONDA",
    "Số loại": "DYLAN",
    "Màu sơn": "Trắng",
    "Năm sản xuất": "2012",
    "Số máy": "F03E-0057735",
    "Số khung": "SA04F-070410",
    "Dung tích": "152",
    "Số chỗ ngồi": "02",
    "Biển số đăng ký": "59V1-498.89",
    "Đăng ký lần đầu ngày": "08/06/2004",
    "Chức vụ": "Thượng tá",
    "Người ký": "Trần Văn Hiểu"
}

Example 04: Vehicle Registration

Source: https://llumar.com.vn

{
    "vehicle": {
        "registration_number": "30A-072.36",
        "vehicle_inspection_number": "2903V-093515",
        "type": "รด tรด con",
        "mark": "MERCEDES-BENZ",
        "model_code": "C300 W204",
        "engine_number": "27294732096079",
        "chassis_number": "RLMGF5EX3DV005333",
        "manufactured_year_and_country": "2013, Viแป‡t Nam",
        "life_time_limit_to": "",
        "commercial_use": "",
        "modification": ""
    },
    "specifications": {
        "wheel_formula": "4x2",
        "wheel_tread": "1521/1512 (mm)",
        "overall_dimension": "4650 x 1770 x 1429 (mm)",
        "largest_luggage_container_dimension": "",
        "wheelbase": "2760 (mm)",
        "kerb_mass": "1575 (kg)",
        "design_authorized_pay_load": "",
        "design_authorized_total_mass": "2090/2090 (kg)",
        "design_authorized_towed_mass": "",
        "permissible_number_of_pers_carried": "5 chแป— ngแป“i, 0 chแป— ฤ‘แปฉng, 0 chแป— nแบฑm",
        "type_of_fuel_used": "Xฤƒng",
        "engine_displacement": "2996 (cm3)",
        "max_output_per_rpm": "170(kW)/6000vph",
        "number": "KC-1292285"
    },
    "inspection_report_number": "2905V-20953/16",
    "valid_until": "31/01/2018",
    "place_date_of_issue": "Hร  Nแป™i, ngร y 1 thรกng 8 nฤƒm 2016",
    "inspection_center": "ฤฦ N VแปŠ KIแป‚M ฤแปŠNH XE Cฦ  GIแปšI",
    "signature": "Ngแปc Tuแบฅn",
    "equipped_with_tachograph": "",
    "inspection_stamp_was_not_issued": "",
    "notes": "Biแปƒn ฤ‘ฤƒng kรฝ nแปn trแบฏng"
}

Example 05: Receipt

Source: https://tintucketoan.com/

{
    'Mẫu số': '01GKTKT3/001',
    'Ký hiệu': 'TC/18P',
    'Số': '0000030',
    'Họ tên người mua hàng': None,
    'Tên đơn vị': 'Công Ty TNHH Kế Toán Hà Nội',
    'Mã số thuế': '0106235869',
    'Địa chỉ': 'Số 49 Ngõ 322 Lê Trọng Tấn, phường Khương Mai, quận Thanh Xuân, Hà Nội',
    'Hình thức thanh toán': 'TM',
    'STT': None,
    'Tên hàng hóa, dịch vụ': 'Tra cứu phần mềm thư viện pháp luật trực tuyến',
    'Đơn vị tính': None,
    'Số lượng': None,
    'Đơn giá': '168.000',
    'Thành tiền': '2.016.000',
    'Thuế suất GTGT': None,
    'Tiền thuế GTGT': None,
    'Tổng cộng tiền thanh toán': '2.016.000',
    'Số tiền viết bằng chữ': 'Hai triệu, không trăm mười sáu nghìn đồng',
    'Người bán hàng': 'Bùi Văn Hùng',
    'Chức vụ người bán hàng': 'TRƯỞNG CHI NHÁNH'
}
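
The extraction examples above come back as plain text that is JSON-like but not always strictly valid JSON; Example 05, for instance, is in Python-dict style with single quotes and None. Below is a minimal post-processing sketch; the parse_model_output helper is our own illustration, not part of the model or its libraries:

import ast
import json

def parse_model_output(text: str):
    """Best-effort parsing of the model's JSON-like output.

    Try strict JSON first, then fall back to ast.literal_eval for
    Python-style dicts (single quotes, None), as in Example 05.
    """
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return ast.literal_eval(text)

# A fragment of the receipt output above:
fields = parse_model_output("{'Mẫu số': '01GKTKT3/001', 'Số lượng': None}")
print(fields["Mẫu số"])  # -> 01GKTKT3/001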

2.1 Image Captioning

Hình ảnh là biểu đồ BMI theo tuổi, thể hiện mối quan hệ giữa chỉ số khối cơ thể (BMI) và độ tuổi của trẻ em. Biểu đồ được chia thành các vùng màu khác nhau tương ứng với các mức BMI khác nhau:

  • Vùng màu đỏ: Chỉ số BMI cao hơn 25, cho thấy tình trạng béo phì.
  • Vùng màu vàng: Chỉ số BMI nằm trong khoảng từ 18 đến 25, cho thấy nguy cơ béo phì.
  • Vùng màu xanh lá cây nhạt: Chỉ số BMI nằm trong khoảng từ 16 đến 18, cho thấy sức khỏe dinh dưỡng tốt.
  • Vùng màu xanh lá cây đậm: Chỉ số BMI thấp hơn 16, cho thấy tình trạng thiếu cân.

Trục tung biểu diễn chỉ số BMI, trục hoành biểu diễn tuổi (tính bằng năm). Đường cong màu xám đậm thể hiện đường chuẩn BMI theo tuổi. Các đường cong này cho thấy sự thay đổi BMI theo thời gian, giúp đánh giá sự phát triển cân nặng của trẻ em. Ví dụ, ở trẻ em dưới 3 tuổi, BMI thường dao động trong vùng thiếu cân hoặc sức khỏe dinh dưỡng tốt. Khi trẻ lớn lên, BMI có xu hướng tăng dần, nhưng tốc độ tăng trưởng có thể khác nhau tùy thuộc vào từng cá nhân. Biểu đồ cũng hiển thị các phần trăm phân vị (Percentile), cho biết tỷ lệ phần trăm trẻ em có BMI thấp hơn hoặc cao hơn so với một nhóm trẻ em cùng độ tuổi. Điều này giúp so sánh BMI của trẻ em với tiêu chuẩn quốc tế.

2.2 Image Captioning

Hình ảnh chụp một cảnh tượng đầy xúc động và bi thảm, dường như diễn ra ở một khu vực nghèo khó, có thể là một khu định cư hoặc khu ổ chuột. Trung tâm của bức ảnh là một chiếc xe đẩy được kéo bởi một con lừa. Trên xe đẩy có một nhóm người, bao gồm một người đàn ông lớn tuổi có vẻ như là người hướng dẫn, một phụ nữ mặc áo choàng đen, một phụ nữ trẻ mặc áo xám, một bé gái nhỏ được che mặt bằng khăn trùm đầu, và một cậu bé mặc áo xanh lá cây. Họ có vẻ như đang di chuyển từ một khu vực bị tàn phá sang một nơi khác. Xung quanh họ là những mảnh vụn, rác thải và những tòa nhà đổ nát, cho thấy một môi trường sống khó khăn và thiếu thốn. Phía sau họ, có một số người khác đang đi bộ hoặc đi xe đạp, một số người đang mang theo hành lý. Bầu trời trong xanh, nhưng không khí chung của bức ảnh là nặng nề và buồn bã. Có thể đây là một cảnh tượng di cư, tị nạn, hoặc một sự kiện thảm khốc nào đó.

3. Multi-turn Visual Question Answering and Reasoning

Source: https://kenh14.vn

  • User: Hãy cho biết thông tin bệnh nhân

  • Assistant: Tên: NGUYỄN THỊ THÁI TRINH, Ngày sinh: 09/09/1993, Giới tính: Nữ, Địa chỉ: TÂN ĐỊNH - Q1 - TPHCM, Mã thẻ BHYT: 19169080, Số khám bệnh: 2569501, Mã khu vực (K1/K2/K3): 2

  • User: Họ được chẩn đoán bị bệnh gì?

  • Assistant: Nhiễm trùng đường ruột do vi khuẩn, không xác định loài

  • User: Người bệnh có tiền sử bệnh không? Nếu có thì là gì?

  • Assistant: Thiếu Calci do chế độ ăn

  • User: Bệnh nhân đã thực hiện những xét nghiệm gì?

  • Assistant: [NQ286] Tổng phân tích tế bào máu ngoại vi (bằng máy đếm laser), [NQ286] Điện giải đồ (Na, K, Cl) [Máu], [NQ286] Đo hoạt độ ALT (GPT) [Máu], [NQ286] Đo hoạt độ AST (GOT) [Máu], [NQ286] Định lượng Calci toàn phần [Máu], [NQ286] Định lượng Creatinin [Máu], [NQ286] Định lượng Glucose [Máu], [NQ286] Định lượng Urê máu [Máu], [NQ286] Siêu âm ổ bụng (gan mật, tụy, lách, thận, bàng quang)

  • User: Số tiền Vật tư y tế mà người bệnh phải trả là bao nhiêu?

  • Assistant: 19,708

  • User: Các loại thuốc dịch truyền người bệnh phải trả là gì, tổng số tiền là bao nhiêu?

  • Assistant: Atropin Sulfat (Syv2019), Bicefzidim 1g (Syv2019), GONCAL (Syv2019), Lactated Ringer's-500ml (Syv2019), Nước cất pha tiêm 5ml (Syv2019), Sodium Chloride 0.9% -500ml (Syv2019), Vincomid (Syv2019), Vinopa (Syv2019), tổng cộng 45,234 đồng
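
The same chat format drives multi-turn use: append the model's reply and your follow-up question to the messages list and generate again. Below is a minimal sketch that assumes the model, processor and process_vision_info setup from the Quickstart section below, with base64_data holding the document image; the ask helper is our own wrapper, not part of the library:

# First turn carries the image; later turns only add text.
messages = [
    {"role": "user", "content": [
        {"type": "image", "image": base64_data},
        {"type": "text", "text": "Hãy cho biết thông tin bệnh nhân"},
    ]},
]

def ask(messages):
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                       padding=True, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=1024)
    trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
    return processor.batch_decode(trimmed, skip_special_tokens=True)[0]

first_answer = ask(messages)

# Feed the answer back, then ask the follow-up question from the dialogue above.
messages.append({"role": "assistant", "content": [{"type": "text", "text": first_answer}]})
messages.append({"role": "user", "content": [{"type": "text", "text": "Họ được chẩn đoán bị bệnh gì?"}]})
second_answer = ask(messages)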

Quickstart 🎮

Install the necessary packages:

python -m pip install git+https://github.com/huggingface/transformers accelerate
python -m pip install qwen-vl-utils
python -m pip install flash-attn --no-build-isolation

Then you can use EraX-VL-7B-V1.5 like this:

import base64

import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_path = "erax/EraX-VL-7B-V1.5"

model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="eager", # replace with "flash_attention_2" if your GPU is Ampere or newer
    device_map="auto"
)

# The processor handles both text tokenization and image preprocessing
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    model_path,
    min_pixels=min_pixels,
    max_pixels=max_pixels,
)

image_path = "image.jpg"

# Encode the image as a base64 data URI so it can be passed inline in the message
with open(image_path, "rb") as f:
    encoded_image = base64.b64encode(f.read())
decoded_image_text = encoded_image.decode('utf-8')
base64_data = f"data:image;base64,{decoded_image_text}"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": base64_data,
            },
            {
                "type": "text",
                "text": "Trรญch xuแบฅt thรดng tin nแป™i dung tแปซ hรฌnh แบฃnh ฤ‘ฦฐแปฃc cung cแบฅp."
            },
        ],
    }
]

# Prepare prompt
tokenized_text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

image_inputs, video_inputs = process_vision_info(messages)

inputs = processor(
    text=[tokenized_text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Generation configs
generation_config =  model.generation_config
generation_config.do_sample   = True
generation_config.temperature = 1.0
generation_config.top_k       = 1
generation_config.top_p       = 0.9
generation_config.min_p       = 0.1
generation_config.best_of     = 5
generation_config.max_new_tokens     = 2048
generation_config.repetition_penalty = 1.06

# Inference
generated_ids = model.generate(**inputs, generation_config=generation_config)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)

print(output_text[0])
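
The tasks in section 1 often involve several related images (for example the front and back of an ID card). A single user message can carry multiple image entries, and the rest of the pipeline stays the same. A small sketch, reusing the setup above, with placeholder file names and an illustrative prompt of our own:

# Put several images in one message to OCR related pages/sides together.
def to_data_uri(path):
    with open(path, "rb") as f:
        return "data:image;base64," + base64.b64encode(f.read()).decode("utf-8")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": to_data_uri("id_front.jpg")},  # placeholder file names
            {"type": "image", "image": to_data_uri("id_back.jpg")},
            # Vietnamese: "Extract the information from both sides of the card and return it as JSON."
            {"type": "text", "text": "Trích xuất thông tin từ cả hai mặt của thẻ và trả về dưới dạng JSON."},
        ],
    }
]
# Then apply_chat_template, process_vision_info and model.generate exactly as above.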

References 📑

[1] Qwen team. Qwen2-VL. 2024.

[2] Bai, Jinze, et al. "Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond." arXiv preprint arXiv:2308.12966 (2023).

[4] Yang, An, et al. "Qwen2 technical report." arXiv preprint arXiv:2407.10671 (2024).

[5] Chen, Zhe, et al. "Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

[6] Chen, Zhe, et al. "How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites." arXiv preprint arXiv:2404.16821 (2024).

[7] Tran, Chi, and Huong Le Thanh. "LaVy: Vietnamese Multimodal Large Language Model." arXiv preprint arXiv:2404.07922 (2024).

Contact 🤝

  • For correspondence regarding this work or inquiries about API trials, please contact Nguyễn Anh Nguyên at [email protected].
  • Follow us on EraX Github