Gemma 3 MM model card

Terms of Use: Terms

Model Summary

Gemma-3-MM is an open multimodal instruction model that extends the capabilities of the original Gemma-3 models to include speech processing.

These models leverage the language and vision research used in the original Gemma-3 models and incorporate additional speech processing capabilities through a Speech Adapter.

The models can process text, image, and audio inputs, generate text outputs, and come with a 128K-token context length (32K for the 1B model).

Evaluation

Model evaluation metrics and results.

Here is the script to evaluate the model.

ASR

Benchmark Task BLEU ↑ CER ↓ WER ↓ Result
Covost2 ASR (English) 86.09 4.12 7.83 Link
Fleurs ASR (English) 89.61 2.28 5.23 Link
LibriSpeech-Clean ASR (English) 94.28 0.98 2.91 Link
LibriSpeech-Other ASR (English) 87.60 3.10 6.55 Link
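
As a reference for how the CER/WER numbers above are computed, here is a minimal sketch (an illustrative assumption, not necessarily what the linked evaluation script does) using the jiwer library:

import jiwer

# Toy reference/hypothesis pair; in practice these come from the benchmark
# transcripts and the model's ASR outputs.
reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over the lazy dog"

print("WER:", jiwer.wer(reference, hypothesis))  # word error rate, lower is better
print("CER:", jiwer.cer(reference, hypothesis))  # character error rate, lower is better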

AST

Benchmark Task BLEU ↑ Result
Covost2 AST (0-shot, English-Korean) 31.55 Link
Fleurs AST (0-shot, English-Korean) 11.05 Link
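
The BLEU scores above compare model translations against reference translations; a minimal sketch (illustrative only, not the project's scoring code) using sacrebleu:

import sacrebleu

# Toy data; in practice hypotheses are model outputs and references are the
# benchmark's Korean translations.
hypotheses = ["고양이가 매트 위에 앉아 있다."]
references = [["고양이가 매트 위에 앉아 있었다."]]  # one reference stream, parallel to hypotheses

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(bleu.score)  # higher is better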

(Experimental) ASR: Korean Branch

Scores are lower because a Korean text normalizer is not applied (see the sketch after the table).

Benchmark Task BLEU ↑ CER ↓ WER ↓ Result
Zeroth ASR (Korean) 94.91 1.31 2.50 Link
Fleurs ASR (Korean) 62.83 9.08 23.0 Link
Covost2 ASR (Korean) 43.66 22.5 41.4 Link
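
As a rough illustration of why normalization matters (an assumption for this card, not the project's scoring code), a simple normalizer that strips punctuation and collapses whitespace before computing CER can close part of this gap:

import re
import jiwer

def normalize(text: str) -> str:
    # Toy normalizer: drop punctuation and collapse whitespace.
    text = re.sub(r"[^\w\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()

ref, hyp = "안녕하세요, 반갑습니다!", "안녕하세요 반갑습니다"
print(jiwer.cer(ref, hyp))                        # penalized for punctuation differences
print(jiwer.cer(normalize(ref), normalize(hyp)))  # 0.0 after normalization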

Model Details

Developed by: junnei

Model type: Multimodal (Text, Vision, Speech) Language Model

Language(s): Multilingual

License: Gemma

Base model: google/gemma-3-4b-it

Inspiration: Phi-4-multimodal-instruct

Training Details

  • The model was trained by adding a 596M-parameter Speech LoRA adapter to the base Gemma-3-4b-it model (see the LoRA sketch after this list).

  • Due to limited computational resources, the model was trained for only a limited number of datasets and epochs on ASR (Automatic Speech Recognition) and AST (Automatic Speech Translation) tasks, using a single A100 GPU.

  • The training data was limited to English and Korean audio clips of less than 30 seconds in duration.
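
As a rough sketch of what attaching a LoRA adapter looks like (the rank, scaling, and target modules below are illustrative assumptions, not the actual training configuration of this checkpoint):

from transformers import AutoModel
from peft import LoraConfig, get_peft_model

# Load the base model and wrap it with an assumed LoRA configuration.
base_model = AutoModel.from_pretrained("google/gemma-3-4b-it")
lora_config = LoraConfig(
    r=16,                                  # assumed rank
    lora_alpha=32,                         # assumed scaling factor
    target_modules=["q_proj", "v_proj"],   # assumed attention projections
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # adapter parameters are a small fraction of the base model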

Datasets

ASR / AST

Limitations

Note that this model is just a Proof of Concept (PoC) for experimental purposes and is not intended for production use. To improve the model's performance and reliability, the following areas need further development:

  • More computational resources are needed for extended training.

  • For now, the model supports only Vision-Language tasks and Audio-Language tasks (ASR/AST).

  • Due to the lack of computing resources, this model primarily recognizes audio files of less than 30 seconds in duration, so accuracy may drop significantly for longer audio inputs (see the chunking sketch after this list).

  • If possible, we will train the model for Speech-Vision tasks and more Audio-Language tasks.
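
One simple mitigation for the 30-second limit (an illustrative workaround, not part of the released code) is to split long recordings into shorter chunks and transcribe each chunk separately:

import soundfile as sf

MAX_SECONDS = 30

audio, sr = sf.read("long_recording.wav")  # hypothetical long input file
chunk_len = MAX_SECONDS * sr
chunks = [audio[i:i + chunk_len] for i in range(0, len(audio), chunk_len)]
# Feed each chunk to the processor/model as in the Usage section below
# and join the partial transcripts.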

Usage

Below are some code snippets showing how to quickly get started with running the model.

First, upgrade your Transformers library; audio input for chat_template is now supported.

$ pip install -U transformers

Then, copy the snippet from the section that is relevant for your use case.

Running the model with chat_template

from transformers import AutoProcessor, AutoModel
import torch

model_id = "junnei/gemma-3-4b-it-speech"
revision = "main"  # or "korean" for the Korean branch

# Load the model and processor (trust_remote_code is required for the custom speech adapter)
model = AutoModel.from_pretrained(
    model_id, device_map="auto", revision=revision, trust_remote_code=True
).eval()

processor = AutoProcessor.from_pretrained(
    model_id, revision=revision, trust_remote_code=True
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "https://huggingface.co/microsoft/Phi-4-multimodal-instruct/resolve/main/examples/what_is_shown_in_this_image.wav"},
            {"type": "text", "text": "Transcribe this audio clip into text."}
        ]
    }
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device)

with torch.inference_mode():
    generate_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
    response = processor.batch_decode(
        generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]
print(response)

# Expected output: What is shown in this image?

Running the model with raw data

from io import BytesIO
from urllib.request import urlopen
import soundfile
from PIL import Image

# Reuses the `model` and `processor` loaded in the previous snippet.

# Get audio data from a URL
url = "https://huggingface.co/microsoft/Phi-4-multimodal-instruct/resolve/main/examples/what_is_shown_in_this_image.wav"
audio, sr = soundfile.read(BytesIO(urlopen(url).read()))
audio_token = '<start_of_audio>'


messages = [
    {'role': 'user', 'content': audio_token + 'Translate this audio into Korean.'},
]

prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)


inputs = processor(text=prompt, audio=[audio], add_special_tokens=False, return_tensors="pt").to(model.device)

with torch.inference_mode():
    generate_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
    response = processor.batch_decode(
        generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]
print(response)

Finetune the model

Here is the finetuning script: Link

You must change output_dir and upload_dir, and adapt the script to your datasets.

python finetune_speech.py
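
As a rough sketch of preparing an ASR dataset for finetuning (the dataset and column handling below are assumptions for illustration; adapt them to whatever finetune_speech.py expects):

from datasets import Audio, load_dataset

# Fleurs Korean split as an example corpus; resample to 16 kHz and keep clips under 30 s.
ds = load_dataset("google/fleurs", "ko_kr", split="train")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))
ds = ds.filter(lambda ex: len(ex["audio"]["array"]) < 30 * 16_000)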

Citation

@article{gemma3mm_2025,
    title={Gemma-3-MM: Multimodal Language Models with Speech Capabilities},
    author={Seongjun Jang},
    year={2025}
}