Heron-NVILA-Lite-15B

Heron-NVILA-Lite-15B is a vision language model trained for Japanese, based on the NVILA-Lite architecture.

Model Overview

Setup

# I have confirmed that 4.46.0 and 4.49.0 also work; other versions of Transformers may work as well, but I have not tested them.
pip install transformers==4.45.0 accelerate opencv-python torchvision einops pillow
pip install git+https://github.com/bfshi/scaling_on_scales.git
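
To sanity-check the environment before loading the model, a minimal sketch (the printed version should be one of those noted above):

import transformers
print(transformers.__version__)  # e.g. 4.45.0; 4.46.0 and 4.49.0 are also confirmed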

Usage

from transformers import AutoConfig, AutoModel

model_path = "turing-motors/Heron-NVILA-Lite-15B"

# You can build the model from its config...
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_config(config, trust_remote_code=True, device_map="auto")

# ...or load it directly with from_pretrained
model = AutoModel.from_pretrained(model_path, trust_remote_code=True, device_map="auto")

# Show the model's chat template
print(model.tokenizer.chat_template)
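
# A hedged sketch (assuming the bundled tokenizer follows the standard
# Transformers apply_chat_template API): render a prompt string manually.
messages = [{"role": "user", "content": "ใ“ใ‚“ใซใกใฏ"}]  # "Hello"
print(model.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))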

# Example: generate from raw text
response = model.generate_content(["ใ“ใ‚“ใซใกใฏ"])  # "Hello"
print(response)
print("---" * 40)

# Example: generate from text + image
from PIL import Image
import requests
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
response = model.generate_content([image, "็”ปๅƒใ‚’่ชฌๆ˜Žใ—ใฆใใ ใ•ใ„ใ€‚"])  # "Please describe the image."
print(response)
print("---" * 40)

# Example: generate with a custom generation_config
from PIL import Image
import requests
from transformers import GenerationConfig
generation_config = GenerationConfig(
    max_new_tokens=512,
    temperature=0.5,
    do_sample=True,
)
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
response = model.generate_content(
    [image, "็”ปๅƒใ‚’่ชฌๆ˜Žใ—ใฆใใ ใ•ใ„ใ€‚"],  # "Please describe the image."
    generation_config=generation_config
)
print(response)
print("---" * 40)

# Example: generate from interleaved text and images (text + image + text + image + text)
from PIL import Image
import requests
url_list = [
    "https://images.unsplash.com/photo-1694831404826-3400c48c188d?q=80&w=2070&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D",
    "https://images.unsplash.com/photo-1693240876439-473af88b4ed7?q=80&w=1974&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
]
images = [
    Image.open(requests.get(url, stream=True).raw).convert("RGB") for url in url_list
]
response = model.generate_content([
    images[0],
    "ใ“ใ‚Œใฏๆ—ฅๆœฌใฎ็”ปๅƒใงใ™",  # "This is an image of Japan."
    images[1],
    "ใ“ใ‚Œใฏใ‚ชใƒผใ‚นใƒˆใƒชใ‚ขใฎ็”ปๅƒใงใ™",  # "This is an image of Austria."
    "ๅ„็”ปๅƒใฎ้•ใ„ใ‚’่ชฌๆ˜Žใ—ใฆ"])  # "Explain the differences between the images."
print(response)
print("---" * 40)

Training Summary

| Stage | Training | Data Sources | Samples |
|---|---|---|---|
| Stage1 | Projector | Japanese image text pairs, LLaVA-Pretrain | 1.1M |
| Stage2 | Projector, LLM | Filtered MOMIJI (CC-MAIN-2024-46, CC-MAIN-2024-51, CC-MAIN-2025-05) | 13M |
| | | Japanese image text pairs (subset), Japanese interleaved data (subset), mmc4-core (subset), coyo-700m (subset), wikipedia_ja, llava_pretrain_ja, stair_captions | 20M |
| Stage3 | Vision Encoder, Projector, LLM | llava-instruct-v1_5-en-subset-358k, llava-instruct-ja, japanese-photos-conv, ja-vg-vqa, synthdog-ja (subset), ai2d, synthdog-en, sherlock | 1.1M |

Evaluation

I used llm-jp-eval-mm for this evaluation. Scores for models other than Heron-NVILA-Lite and Sarashina2-Vision-14B were taken from the llm-jp-eval-mm leaderboard (as of March 2025) and the Asagi website. Heron-NVILA-Lite and Sarashina2-Vision-14B were evaluated with LLM-as-a-judge using "gpt-4o-2024-05-13", whereas the official Sarashina2-Vision-14B blog used "gpt-4o-2024-08-06". Because the evaluation conditions differ, the Sarashina2-Vision-14B results should be treated as reference only.

| Model | LLM Size | Heron-Bench overall LLM (%) | JA-VLM-Bench-In-the-Wild LLM (/5.0) | JA-VG-VQA-500 LLM (/5.0) |
|---|---|---|---|---|
| Heron-NVILA-Lite-1B | 0.5B | 45.9 | 2.92 | 3.16 |
| Heron-NVILA-Lite-2B | 1.5B | 52.8 | 3.52 | 3.50 |
| Heron-NVILA-Lite-15B | 14B | 59.6 | 4.2 | 3.82 |
| LLaVA-CALM2-SigLIP | 7B | 43.3 | 3.15 | 3.21 |
| Llama-3-EvoVLM-JP-v2 | 8B | 39.3 | 2.92 | 2.96 |
| VILA-jp | 13B | 57.2 | 3.69 | 3.62 |
| Asagi-14B | 13B | 55.8 | 3.44 | 3.84 |
| Sarashina2-Vision-14B | 13B | 50.9 | 4.1 | 3.43 |
| Qwen2-VL 7B Instruct | 7B | 55.5 | 3.61 | 3.6 |
| GPT-4o | - | 87.6 | 3.85 | 3.58 |

Risks and Limitations

This model is experimental and has not been thoroughly calibrated for ethical compliance or legal standards. Caution is advised for sensitive applications.

License

Acknowledgements

This model is based on results obtained from a project, JPNP20017, subsidized by the New Energy and Industrial Technology Development Organization (NEDO).

I would like to acknowledge the use of the following open-source repositories:

- https://github.com/NVlabs/VILA
- https://github.com/bfshi/scaling_on_scales
