|
--- |
|
license: gemma |
|
pipeline_tag: image-text-to-text |
|
tags: |
|
- vision |
|
- gemma |
|
- llama.cpp |
|
--- |
|
|
|
# <span style="color: #7FFF7F;">Gemma-3 4B Instruct GGUF Models</span> |
|
|
|
|
|
## How to Use Gemma 3 Vision with llama.cpp |
|
|
|
To use the experimental Gemma 3 Vision support in `llama.cpp`, follow these steps:
|
|
|
1. **Clone the latest llama.cpp repository**:
|
```bash |
|
git clone https://github.com/ggml-org/llama.cpp.git |
|
cd llama.cpp |
|
``` |
|
|
|
|
|
2. **Build llama.cpp**:
|
|
|
Build llama.cpp as usual, following the official instructions: https://github.com/ggml-org/llama.cpp#building-the-project
|
|
|
Once llama.cpp is built, copy `./llama.cpp/build/bin/llama-gemma3-cli` to a folder of your choice, as in the sketch below.
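
As a rough reference, a CPU-only CMake build plus the copy step might look like this (a minimal sketch; add the usual CMake flags, e.g. for CUDA or Metal, if you want GPU acceleration, and adjust the destination folder as needed):

```bash
# From inside the llama.cpp directory: configure and build (CPU-only shown)
cmake -B build
cmake --build build --config Release -j

# Copy the Gemma 3 CLI binary to a working folder (~/gemma3 is just an example)
mkdir -p ~/gemma3
cp ./build/bin/llama-gemma3-cli ~/gemma3/
```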
|
|
|
3. **Download the Gemma 3 gguf file**: |
|
|
|
https://huggingface.co/Mungert/gemma-3-4b-it-gguf/tree/main |
|
|
|
Choose a gguf file without `mmproj` in the name.
|
|
|
Example gguf file: https://huggingface.co/Mungert/gemma-3-4b-it-gguf/resolve/main/google_gemma-3-4b-it-q4_k_l.gguf
|
|
|
Copy this file to your chosen folder. |
|
|
|
4. **Download the Gemma 3 mmproj file** |
|
|
|
https://huggingface.co/Mungert/gemma-3-4b-it-gguf/tree/main |
|
|
|
Choose a file with `mmproj` in the name.
|
|
|
Example mmproj file: https://huggingface.co/Mungert/gemma-3-4b-it-gguf/resolve/main/google_gemma-3-4b-it-mmproj-bf16.gguf
|
|
|
Copy this file to your chosen folder. |
|
|
|
5. **Copy the example image**: Copy images to the same folder as the gguf files, or adjust the paths accordingly.
|
|
|
In the example below, the gguf files, the image, and `llama-gemma3-cli` are all in the same folder.
|
|
|
Example image: https://huggingface.co/Mungert/gemma-3-4b-it-gguf/resolve/main/car-1.jpg
|
|
|
Copy this file to your chosen folder. |
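
If you prefer the command line, steps 3 to 5 can be done in one go with `wget` (or `curl -L -O`), using the same URLs listed above:

```bash
# Download the model, the mmproj file and the example image into your chosen folder
cd ~/gemma3   # example folder from the build step above
wget https://huggingface.co/Mungert/gemma-3-4b-it-gguf/resolve/main/google_gemma-3-4b-it-q4_k_l.gguf
wget https://huggingface.co/Mungert/gemma-3-4b-it-gguf/resolve/main/google_gemma-3-4b-it-mmproj-bf16.gguf
wget https://huggingface.co/Mungert/gemma-3-4b-it-gguf/resolve/main/car-1.jpg
```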
|
|
|
6. **Run the CLI Tool**: |
|
|
|
From your chosen folder:
|
|
|
```bash |
|
llama-gemma3-cli -m google_gemma-3-4b-it-q4_k_l.gguf --mmproj google_gemma-3-4b-it-mmproj-bf16.gguf |
|
``` |
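
The CLI should also accept the usual llama.cpp options in addition to `-m` and `--mmproj`. The flags below (thread count and GPU layer offload) are an assumption based on the common llama.cpp argument set, so confirm them with `llama-gemma3-cli --help` on your build:

```bash
# Optional tuning (assumed common llama.cpp flags -- verify with --help):
#   -t    number of CPU threads to use
#   -ngl  number of layers to offload to the GPU (only if built with GPU support)
llama-gemma3-cli \
  -m google_gemma-3-4b-it-q4_k_l.gguf \
  --mmproj google_gemma-3-4b-it-mmproj-bf16.gguf \
  -t 8 -ngl 99
```

An interactive session then looks like this: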
|
|
|
``` |
|
Running in chat mode, available commands: |
|
/image <path> load an image |
|
/clear clear the chat history |
|
/quit or /exit exit the program |
|
|
|
> /image car-1.jpg |
|
Encoding image car-1.jpg |
|
Image encoded in 46305 ms |
|
Image decoded in 19302 ms |
|
|
|
> what is the image of |
|
Here's a breakdown of what's in the image: |
|
|
|
**Subject:** The primary subject is a black Porsche Panamera Turbo driving on a highway. |
|
|
|
**Details:** |
|
|
|
* **Car:** It's a sleek, modern Porsche Panamera Turbo, identifiable by its distinctive rear design, the "PORSCHE" lettering, and the "Panamera Turbo" badge. The license plate reads "CVC-911". |
|
* **Setting:** The car is on a multi-lane highway, with a blurred background of trees, a distant building, and a cloudy sky. The lighting suggests it's either dusk or dawn. |
|
* **Motion:** The image captures the car in motion, with a slight motion blur to convey speed. |
|
|
|
**Overall Impression:** The image conveys a sense of speed, luxury, and power. It's a well-composed shot that highlights the car's design and performance. |
|
|
|
Do you want me to describe any specific aspect of the image in more detail, or perhaps analyze its composition? |
|
``` |
|
|
|
# <span id="testllm" style="color: #7F7FFF;">If you find these models useful</span>
|
|
|
Please click like ❤️. I'd also really appreciate it if you could test my Network Monitor Assistant at [Network Monitor Assistant](https://freenetworkmonitor.click).
|
Click the **chat icon** (bottom right of the main and dashboard pages). Choose an LLM and toggle between the LLM types: TurboLLM -> FreeLLM -> TestLLM.
|
|
|
### What I'm Testing |
|
I'm experimenting with **function calling** against my network monitoring service, using small open-source models. The question I keep coming back to is: how small can a model be and still function?
|
**TestLLM** – Runs **Phi-4-mini-instruct** (phi-4-mini-q4_0.gguf) with llama.cpp on 6 threads of a CPU VM. It takes about 15 s to load, inference is quite slow, and it only processes one user prompt at a time (still working on scaling!). If you're curious, I'd be happy to share how it works.
|
|
|
### The Other Available AI Assistants
|
**TurboLLM** – Uses **gpt-4o-mini**. Fast! Note: tokens are limited since OpenAI models are pricey, but you can [Login](https://freenetworkmonitor.click) or [Download](https://freenetworkmonitor.click/download) the Free Network Monitor agent to get more tokens. Alternatively, use the TestLLM.
|
**HugLLM** – Runs **open-source Hugging Face models**. Fast, but limited to small models (≈8B), so quality is lower. You get 2x more tokens (subject to Hugging Face API availability).
|
|
|
|
|
## **Choosing the Right Model Format** |
|
|
|
Selecting the correct model format depends on your **hardware capabilities** and **memory constraints**. |
|
|
|
### **BF16 (Brain Float 16) – Use if BF16 acceleration is available**
|
- A 16-bit floating-point format designed for **faster computation** while retaining good precision. |
|
- Provides **similar dynamic range** as FP32 but with **lower memory usage**. |
|
- Recommended if your hardware supports **BF16 acceleration** (check your device's specs).
|
- Ideal for **high-performance inference** with **reduced memory footprint** compared to FP32. |
|
|
|
**Use BF16 if:**

✔ Your hardware has native **BF16 support** (e.g., newer GPUs, TPUs).

✔ You want **higher precision** while saving memory.

✔ You plan to **requantize** the model into another format.

**Avoid BF16 if:**

❌ Your hardware does **not** support BF16 (it may fall back to FP32 and run slower).

❌ You need compatibility with older devices that lack BF16 optimization.
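
As a quick heuristic on Linux, you can check whether your CPU advertises BF16 instructions (for GPUs and TPUs, check the vendor's spec sheet instead); this sketch only inspects CPU flags:

```bash
# Look for BF16-capable CPU flags (e.g. avx512_bf16 or amx_bf16) on Linux
grep -o -m1 -E 'avx512_bf16|amx_bf16' /proc/cpuinfo || echo "no CPU BF16 flag found"
```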
|
|
|
--- |
|
|
|
### **F16 (Float 16) – More widely supported than BF16**
|
- A 16-bit floating-point format offering **high precision** but a narrower range of values than BF16.
|
- Works on most devices with **FP16 acceleration support** (including many GPUs and some CPUs). |
|
- Slightly lower numerical precision than BF16 but generally sufficient for inference. |
|
|
|
**Use F16 if:**

✔ Your hardware supports **FP16** but **not BF16**.

✔ You need a **balance between speed, memory usage, and accuracy**.

✔ You are running on a **GPU** or another device optimized for FP16 computations.

**Avoid F16 if:**

❌ Your device lacks **native FP16 support** (it may run slower than expected).

❌ You have memory limitations.
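
For NVIDIA GPUs, a quick way to gauge FP16 suitability is the compute capability (the `compute_cap` query field requires a reasonably recent driver; as a rough rule, 7.0 or newer has fast FP16 paths):

```bash
# List GPU name and compute capability (NVIDIA only; needs a recent driver)
nvidia-smi --query-gpu=name,compute_cap --format=csv
```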
|
|
|
--- |
|
|
|
### **Quantized Models (Q4_K, Q6_K, Q8, etc.) – For CPU & Low-VRAM Inference**
|
Quantization reduces model size and memory usage while maintaining as much accuracy as possible. |
|
- **Lower-bit models (Q4_K)** β **Best for minimal memory usage**, may have lower precision. |
|
- **Higher-bit models (Q6_K, Q8_0)** β **Better accuracy**, requires more memory. |
|
|
|
**Use Quantized Models if:**

✔ You are running inference on a **CPU** and need an optimized model.

✔ Your device has **low VRAM** and cannot load full-precision models.

✔ You want to reduce **memory footprint** while keeping reasonable accuracy.

**Avoid Quantized Models if:**

❌ You need **maximum accuracy** (full-precision models are better for this).

❌ Your hardware has enough VRAM for higher-precision formats (BF16/F16).
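
As a rule of thumb, the GGUF file size is close to the minimum RAM/VRAM the weights need (plus some overhead for the KV cache and activations), so comparing file sizes with available memory is a quick sanity check:

```bash
# Compare quantized file sizes against available system memory (Linux)
ls -lh google_gemma-3-4b-it-*.gguf
free -h
```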
|
|
|
--- |
|
|
|
### **Summary Table: Model Format Selection** |
|
|
|
| Model Format | Precision | Memory Usage | Device Requirements | Best Use Case | |
|
|--------------|------------|---------------|----------------------|---------------| |
|
| **BF16** | Highest | High | BF16-supported GPU/CPUs | High-speed inference with reduced memory | |
|
| **F16** | High | High | FP16-supported devices | GPU inference when BF16 isn't available |
|
| **Q4_K** | Low | Very Low | CPU or Low-VRAM devices | Best for memory-constrained environments | |
|
| **Q6_K** | Medium Low | Low | CPU with more memory | Better accuracy while still being quantized | |
|
| **Q8** | Medium | Moderate | CPU or GPU with enough VRAM | Best accuracy among quantized models | |
|
|
|
|
|
## **Included Files & Details** |
|
|
|
### `google_gemma-3-4b-it-bf16.gguf` |
|
- Model weights preserved in **BF16**. |
|
- Use this if you want to **requantize** the model into a different format. |
|
- Best if your device supports **BF16 acceleration**. |
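
If you do want to requantize, the `llama-quantize` tool built alongside llama.cpp can convert the BF16 file into any supported quantization type (older builds name the binary `quantize`; paths and the output filename below are just examples):

```bash
# Requantize the BF16 weights to Q4_K_M (run from the llama.cpp directory,
# or use the full path to the llama-quantize binary and the gguf files)
./build/bin/llama-quantize \
  google_gemma-3-4b-it-bf16.gguf \
  gemma-3-4b-it-requant-q4_k_m.gguf \
  Q4_K_M
```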
|
|
|
### `google_gemma-3-4b-it-f16.gguf` |
|
- Model weights stored in **F16**. |
|
- Use if your device supports **FP16**, especially if BF16 is not available. |
|
|
|
### `google_gemma-3-4b-it-bf16-q8.gguf` |
|
- **Output & embeddings** remain in **BF16**. |
|
- All other layers quantized to **Q8_0**. |
|
- Use if your device supports **BF16** and you want a quantized version. |
|
|
|
### `google_gemma-3-4b-it-f16-q8.gguf` |
|
- **Output & embeddings** remain in **F16**. |
|
- All other layers quantized to **Q8_0**. |
|
|
|
### `google_gemma-3-4b-it-q4_k_l.gguf` |
|
- **Output & embeddings** quantized to **Q8_0**. |
|
- All other layers quantized to **Q4_K**. |
|
- Good for **CPU inference** with limited memory. |
|
|
|
### `google_gemma-3-4b-it-q4_k_m.gguf` |
|
- Similar to Q4_K. |
|
- Another option for **low-memory CPU inference**. |
|
|
|
### `google_gemma-3-4b-it-q4_k_s.gguf` |
|
- Smallest **Q4_K** variant, using less memory at the cost of accuracy. |
|
- Best for **very low-memory setups**. |
|
|
|
### `google_gemma-3-4b-it-q6_k_l.gguf` |
|
- **Output & embeddings** quantized to **Q8_0**. |
|
- All other layers quantized to **Q6_K**.
|
|
|
### `google_gemma-3-4b-it-q6_k_m.gguf` |
|
- A mid-range **Q6_K** quantized model for balanced performance.
|
- Suitable for **CPU-based inference** with **moderate memory**. |
|
|
|
### `google_gemma-3-4b-it-q8.gguf` |
|
- Fully **Q8** quantized model for better accuracy. |
|
- Requires **more memory** but offers higher precision. |
|
|
|
|
|
# Gemma 3 model card |
|
|
|
**Model Page**: [Gemma](https://ai.google.dev/gemma/docs/core) |
|
|
|
**Resources and Technical Documentation**: |
|
|
|
* [Gemma 3 Technical Report][g3-tech-report] |
|
* [Responsible Generative AI Toolkit][rai-toolkit] |
|
* [Gemma on Kaggle][kaggle-gemma] |
|
* [Gemma on Vertex Model Garden][vertex-mg-gemma3] |
|
|
|
**Terms of Use**: [Terms][terms] |
|
|
|
**Authors**: Google DeepMind |
|
|
|
## Model Information |
|
|
|
Summary description and brief definition of inputs and outputs. |
|
|
|
### Description |
|
|
|
Gemma is a family of lightweight, state-of-the-art open models from Google, |
|
built from the same research and technology used to create the Gemini models. |
|
Gemma 3 models are multimodal, handling text and image input and generating text |
|
output, with open weights for both pre-trained variants and instruction-tuned |
|
variants. Gemma 3 has a large, 128K context window, multilingual support in over |
|
140 languages, and is available in more sizes than previous versions. Gemma 3 |
|
models are well-suited for a variety of text generation and image understanding |
|
tasks, including question answering, summarization, and reasoning. Their |
|
relatively small size makes it possible to deploy them in environments with |
|
limited resources such as laptops, desktops or your own cloud infrastructure, |
|
democratizing access to state of the art AI models and helping foster innovation |
|
for everyone. |
|
|
|
### Inputs and outputs |
|
|
|
- **Input:** |
|
- Text string, such as a question, a prompt, or a document to be summarized |
|
- Images, normalized to 896 x 896 resolution and encoded to 256 tokens |
|
each |
|
- Total input context of 128K tokens for the 4B, 12B, and 27B sizes, and |
|
32K tokens for the 1B size |
|
|
|
- **Output:** |
|
- Generated text in response to the input, such as an answer to a |
|
question, analysis of image content, or a summary of a document |
|
- Total output context of 8192 tokens |
|
|
|
|
|
## Credits |
|
|
|
Thanks to [Bartowski](https://huggingface.co/bartowski) for the imatrix upload, and for the guidance on quantization that enabled me to produce these gguf files.