Heron-NVILA-Lite-15B
Heron-NVILA-Lite-15B is a vision language model trained for Japanese, based on the NVILA-Lite architecture.
Model Overview
- Developer: Turing Inc.
- Vision Encoder: paligemma-siglip-so400m-patch14-448
- Projector: mlp_downsample_3x3_fix (see the illustrative sketch after this list)
- LLM: Qwen2.5-14B-Instruct
- Supported Languages: Japanese, English
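Judging from its name, the mlp_downsample_3x3_fix projector appears to concatenate each 3x3 neighborhood of visual patch tokens and project the result through an MLP into the LLM embedding space, reducing the visual token count by a factor of 9. The code below is only an illustrative sketch of that idea; the class name, hidden sizes, and layer layout are assumptions, not the model's actual implementation.

```python
# Illustrative sketch of a 3x3-downsample + MLP projector (assumed design; names and sizes are not taken from the model).
import torch
import torch.nn as nn

class Downsample3x3Projector(nn.Module):  # hypothetical class name
    def __init__(self, vision_dim: int = 1152, llm_dim: int = 5120):
        super().__init__()
        # Concatenating a 3x3 neighborhood of patch tokens gives an input width of 9 * vision_dim.
        self.mlp = nn.Sequential(
            nn.Linear(9 * vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, height, width, vision_dim) grid of patch embeddings; height and width assumed divisible by 3.
        b, h, w, c = x.shape
        x = x.reshape(b, h // 3, 3, w // 3, 3, c)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (h // 3) * (w // 3), 9 * c)
        return self.mlp(x)

# Example: a 24x24 patch grid (576 tokens) becomes 64 tokens in the LLM embedding space.
projector = Downsample3x3Projector()
print(projector(torch.randn(1, 24, 24, 1152)).shape)  # torch.Size([1, 64, 5120])
```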
Setup
```bash
# I have confirmed that 4.46.0 and 4.49.0 also work. Other versions of Transformers may also work, but I have not tested them.
pip install transformers==4.45.0 accelerate opencv-python torchvision einops pillow
pip install git+https://github.com/bfshi/scaling_on_scales.git
```
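As a quick sanity check of the environment, you can print the installed versions before loading the model (a minimal sketch; adjust to whichever versions you pinned above):

```python
# Optional sanity check of the installed packages.
import torch
import transformers

print("transformers:", transformers.__version__)  # e.g. 4.45.0
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```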
Usage
```python
from transformers import AutoConfig, AutoModel
model_path = "turing-motors/Heron-NVILA-Lite-15B"
# you can instantiate the model from its config
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_config(config, trust_remote_code=True, device_map="auto")
# or directly from_pretrained
model = AutoModel.from_pretrained(model_path, trust_remote_code=True, device_map="auto")
# show chat_template
print(model.tokenizer.chat_template)
# example: generate with raw text
response = model.generate_content(["こんにちは"])  # "Hello"
print(response)
print("---" * 40)
# example: generate with text + image
from PIL import Image
import requests
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
response = model.generate_content([image, "画像を説明してください。"])  # "Please describe the image."
print(response)
print("---" * 40)
# example: generate using generation_config
from PIL import Image
import requests
from transformers import GenerationConfig
generation_config = {
    "max_new_tokens": 512,
    "temperature": 0.5,
    "do_sample": True,
}
generation_config = GenerationConfig(**generation_config)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
response = model.generate_content(
    [image, "画像を説明してください。"],
    generation_config=generation_config
)
print(response)
print("---" * 40)
# example: generate with text + image + text + image + text
from PIL import Image
import requests
url_list = [
    "https://images.unsplash.com/photo-1694831404826-3400c48c188d?q=80&w=2070&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D",
    "https://images.unsplash.com/photo-1693240876439-473af88b4ed7?q=80&w=1974&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
]
images = [
    Image.open(requests.get(url, stream=True).raw).convert("RGB") for url in url_list
]
response = model.generate_content([
    images[0],
    "これは日本の画像です",  # "This is an image of Japan"
    images[1],
    "これはオーストリアの画像です",  # "This is an image of Austria"
    "各画像の違いを説明して"])  # "Explain the differences between the images"
print(response)
print("---" * 40)
Training Summary
Stage | Training | Data Sources | Samples |
---|---|---|---|
Stage1 | Projector | Japanese image text pairs, LLaVA-Pretrain | 1.1M |
Stage2 | Projector, LLM | Filtered MOMIJI (CC-MAIN-2024-46, CC-MAIN-2024-51, CC-MAIN-2025-05) | 13M |
Stage2 | Projector, LLM | Japanese image text pairs (subset), Japanese interleaved data (subset), mmc4-core (subset), coyo-700m (subset), wikipedia_ja, llava_pretrain_ja, stair_captions | 20M |
Stage3 | Vision Encoder, Projector, LLM | llava-instruct-v1_5-en-subset-358k, llava-instruct-ja, japanese-photos-conv, ja-vg-vqa, synthdog-ja (subset), ai2d, synthdog-en, sherlock | 1.1M |
Evaluation
I used llm-jp-eval-mm for this evaluation. Scores for models other than Heron-NVILA-Lite and Sarashina2-Vision-14B were taken from the llm-jp-eval-mm leaderboard as of March 2025 and from the Asagi website. Heron-NVILA-Lite and Sarashina2-Vision-14B were evaluated with LLM-as-a-judge using "gpt-4o-2024-05-13"; a rough sketch of this judging pattern is shown after the table below. Sarashina2-Vision-14B was evaluated on its official blog using "gpt-4o-2024-08-06"; because the evaluation conditions differ, its results should be treated as reference only.
Model | LLM Size | Heron-Bench overall LLM (%) | JA-VLM-Bench-In-the-Wild LLM (/5.0) | JA-VG-VQA-500 LLM (/5.0) |
---|---|---|---|---|
Heron-NVILA-Lite-1B | 0.5B | 45.9 | 2.92 | 3.16 |
Heron-NVILA-Lite-2B | 1.5B | 52.8 | 3.52 | 3.50 |
Heron-NVILA-Lite-15B | 14B | 59.6 | 4.2 | 3.82 |
LLaVA-CALM2-SigLIP | 7B | 43.3 | 3.15 | 3.21 |
Llama-3-EvoVLM-JP-v2 | 8B | 39.3 | 2.92 | 2.96 |
VILA-jp | 13B | 57.2 | 3.69 | 3.62 |
Asagi-14B | 13B | 55.8 | 3.44 | 3.84 |
Sarashina2-Vision-14B | 13B | 50.9 | 4.1 | 3.43 |
Qwen2-VL 7B Instruct | 7B | 55.5 | 3.61 | 3.6 |
GPT-4o | - | 87.6 | 3.85 | 3.58 |
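For context, LLM-as-a-judge scoring generally asks the judge model to rate a candidate answer against a reference on a fixed scale. The sketch below only illustrates that pattern; it is not llm-jp-eval-mm's actual prompt or implementation.

```python
# Rough illustration of LLM-as-a-judge scoring; not llm-jp-eval-mm's actual prompt or code.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_score(question: str, reference: str, candidate: str) -> str:
    prompt = (
        "You are grading an answer to a visual question.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {candidate}\n"
        "Rate the candidate from 1 to 5 and reply with the number only."
    )
    result = client.chat.completions.create(
        model="gpt-4o-2024-05-13",
        messages=[{"role": "user", "content": prompt}],
    )
    return result.choices[0].message.content
```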
Risks and Limitations
This model is experimental and has not been thoroughly calibrated for ethical compliance or legal standards. Caution is advised for sensitive applications.
License
- Model weights are licensed under Apache License 2.0.
- Users must comply with the OpenAI terms of use due to the inclusion of GPT-4-generated synthetic data.
Acknowledgements
This model is based on results obtained from a project, JPNP20017, subsidized by the New Energy and Industrial Technology Development Organization (NEDO).
I would like to acknowledge the use of the following open-source repositories: