MedGemma-4B-IT GGUF (Multimodal)
This repository provides GGUF-formatted model files for google/medgemma-4b-it, designed for use with llama.cpp. MedGemma is a multimodal model based on Gemma-3, fine-tuned for the medical domain.
These GGUF files allow you to run the MedGemma model locally on your CPU, or offload layers to a GPU if supported by your llama.cpp build (e.g., Metal on macOS, CUDA on Linux/Windows).
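If your build has GPU support, layer offload is controlled with llama.cpp's -ngl (--n-gpu-layers) flag. A minimal sketch, assuming the model file has been downloaded to ~/models (the layer count and prompt are illustrative):

```bash
# Text-only run with all layers offloaded to the GPU; requires a llama.cpp build
# compiled with Metal or CUDA support. Omit -ngl (or set it to 0) for CPU-only inference.
llama-cli -m ~/models/medgemma-4b-it-Q8_0.gguf -ngl 99 \
  -p "List common contraindications of ibuprofen."
```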
For multimodal (vision) capabilities, you MUST use both a language model GGUF file AND the provided mmproj (multimodal projector) GGUF file.
Original Model: google/medgemma-4b-it
Files Provided
Below are the GGUF files available in this repository. It is recommended to use the F16 version of the mmproj file with any of the language model quantizations.
Language Model GGUFs:
medgemma-4b-it-F16.gguf
- Quantization: F16 (16-bit floating point)
- Size: ~7.77 GB (verify against your actual file size)
- Use: Highest precision, best quality, largest file size.
medgemma-4b-it-Q8_0.gguf
- Quantization: Q8_0 (8-bit)
- Size: ~4.13 GB (verify against your actual file size)
- Use: Excellent balance between model quality and file size/performance.
Multimodal Projector GGUF (Required for Image Input):
mmproj-medgemma-4b-it-Q8_0.gguf
- Quantization: Q8_0
- Size: ~591 MB
- Use: Essential for image understanding. It should be used alongside any of the language model GGUF files listed above.
mmproj-medgemma-4b-it-F16.gguf
- Quantization: F16 (recommended precision for the projector)
- Size: ~851 MB
- Use: Essential for image understanding. It should be used alongside any of the language model GGUF files listed above.
How to use
Download the language model and mmproj GGUF files.
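For example, with huggingface-cli (a sketch only; the repository id shown here is a placeholder for this repository's actual id on the Hub):

```bash
# <this-repo-id> is a placeholder -- replace it with this repository's Hub id.
huggingface-cli download <this-repo-id> \
  medgemma-4b-it-F16.gguf mmproj-medgemma-4b-it-F16.gguf \
  --local-dir ~/models
```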
Install llama.cpp (https://github.com/ggml-org/llama.cpp)
Run the server via: llama-server -m ~/models/medgemma-4b-it-F16.gguf --mmproj ~/models/mmproj-medgemma-4b-it-F16.gguf -c 2048 --port 8080
Then use the model. For an example visual-chat front end, see https://github.com/kelkalot/medgemma-visual-chat
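Alternatively, you can query the running server directly over its OpenAI-compatible HTTP API. A minimal sketch, assuming a recent llama.cpp build whose server accepts image input via image_url content (the prompt and base64 string are placeholders):

```bash
# Send a text prompt plus an image to the local llama-server started above.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe the findings in this chest X-ray."},
          {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,<BASE64_IMAGE>"}}
        ]
      }
    ],
    "max_tokens": 256
  }'
```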